diff --git "a/data/cvpr2024_papers_with_details.csv" "b/data/cvpr2024_papers_with_details.csv" new file mode 100644--- /dev/null +++ "b/data/cvpr2024_papers_with_details.csv" @@ -0,0 +1,44044 @@ +Title,Authors,Link,arXiv_link,other_link,pdf_path,arXiv_title,summary,primary_category,categories +No title found,No authors listed, ,,https://pine.libguides.com/c.php?g=997445&p=7219661,,,,,nan +CapsFusion: Rethinking Image-Text Data at Scale,Qiying Yu · Quan Sun · Xiaosong Zhang · Yufeng Cui · Yufeng Cui · Fan Zhang · Yue Cao · Xinlong Wang · Jingjing Liu, ,https://arxiv.org/abs/2310.20550,,,CapsFusion: Rethinking Image-Text Data at Scale,"Large multimodal models demonstrate remarkable generalist ability to perform +diverse multimodal tasks in a zero-shot manner. Large-scale web-based +image-text pairs contribute fundamentally to this success, but suffer from +excessive noise. Recent studies use alternative captions synthesized by +captioning models and have achieved notable benchmark performance. However, our +experiments reveal significant Scalability Deficiency and World Knowledge Loss +issues in models trained with synthetic captions, which have been largely +obscured by their initial benchmark success. Upon closer examination, we +identify the root cause as the overly-simplified language structure and lack of +knowledge details in existing synthetic captions. To provide higher-quality and +more scalable multimodal pretraining data, we propose CapsFusion, an advanced +framework that leverages large language models to consolidate and refine +information from both web-based image-text pairs and synthetic captions. +Extensive experiments show that CapsFusion captions exhibit remarkable +all-round superiority over existing captions in terms of model performance +(e.g., 18.8 and 18.3 improvements in CIDEr score on COCO and NoCaps), sample +efficiency (requiring 11-16 times less computation than baselines), world +knowledge depth, and scalability. These effectiveness, efficiency and +scalability advantages position CapsFusion as a promising candidate for future +scaling of LMM training.",cs.CV,nan +Semantic-Aware Multi-Label Adversarial Attacks,Hassan Mahmood · Ehsan Elhamifar, ,https://arxiv.org/abs/2401.16001,,2401.16001.pdf,LESSON: Multi-Label Adversarial False Data Injection Attack for Deep Learning Locational Detection,"Deep learning methods can not only detect false data injection attacks (FDIA) +but also locate attacks of FDIA. Although adversarial false data injection +attacks (AFDIA) based on deep learning vulnerabilities have been studied in the +field of single-label FDIA detection, the adversarial attack and defense +against multi-label FDIA locational detection are still not involved. To bridge +this gap, this paper first explores the multi-label adversarial example attacks +against multi-label FDIA locational detectors and proposes a general +multi-label adversarial attack framework, namely muLti-labEl adverSarial falSe +data injectiON attack (LESSON). The proposed LESSON attack framework includes +three key designs, namely Perturbing State Variables, Tailored Loss Function +Design, and Change of Variables, which can help find suitable multi-label +adversarial perturbations within the physical constraints to circumvent both +Bad Data Detection (BDD) and Neural Attack Location (NAL). 
Four typical LESSON +attacks based on the proposed framework and two dimensions of attack objectives +are examined, and the experimental results demonstrate the effectiveness of the +proposed attack framework, posing serious and pressing security concerns in +smart grids.",cs.CR,['cs.CR'] +Towards Better Vision-Inspired Vision-Language Models,Yun-Hao Cao · Kaixiang Ji · Ziyuan Huang · Chuanyang Zheng · Jiajia Liu · Jian Wang · Jingdong Chen · Ming Yang, ,,https://www.youtube.com/watch?v=d91e0EwAIZc,,,,,nan +HINTED: Hard Instance Enhanced Detector with Mixed-Density Feature Fusion for Sparsely-Supervised 3D Object Detection,Qiming Xia · Wei Ye · Hai Wu · Shijia Zhao · Leyuan Xing · Xun Huang · Jinhao Deng · Xin Li · Chenglu Wen · Cheng Wang,https://github.com/xmuqimingxia/HINTED,https://arxiv.org/abs/2308.04556,,2308.04556.pdf,FocalFormer3D : Focusing on Hard Instance for 3D Object Detection,"False negatives (FN) in 3D object detection, {\em e.g.}, missing predictions +of pedestrians, vehicles, or other obstacles, can lead to potentially dangerous +situations in autonomous driving. While being fatal, this issue is understudied +in many current 3D detection methods. In this work, we propose Hard Instance +Probing (HIP), a general pipeline that identifies \textit{FN} in a multi-stage +manner and guides the models to focus on excavating difficult instances. For 3D +object detection, we instantiate this method as FocalFormer3D, a simple yet +effective detector that excels at excavating difficult objects and improving +prediction recall. FocalFormer3D features a multi-stage query generation to +discover hard objects and a box-level transformer decoder to efficiently +distinguish objects from massive object candidates. Experimental results on the +nuScenes and Waymo datasets validate the superior performance of FocalFormer3D. +The advantage leads to strong performance on both detection and tracking, in +both LiDAR and multi-modal settings. Notably, FocalFormer3D achieves a 70.5 mAP +and 73.9 NDS on nuScenes detection benchmark, while the nuScenes tracking +benchmark shows 72.1 AMOTA, both ranking 1st place on the nuScenes LiDAR +leaderboard. Our code is available at +\url{https://github.com/NVlabs/FocalFormer3D}.",cs.CV,['cs.CV'] +"DiG-IN: Diffusion Guidance for Investigating Networks - Uncovering Classifier Differences, Neuron Visualisations, and Visual Counterfactual Explanations",Maximilian Augustin · Yannic Neuhaus · Matthias Hein, ,https://arxiv.org/abs/2311.17833,,2311.17833.pdf,"DiG-IN: Diffusion Guidance for Investigating Networks -- Uncovering Classifier Differences, Neuron Visualisations, and Visual Counterfactual Explanations","While deep learning has led to huge progress in complex image classification +tasks like ImageNet, unexpected failure modes, e.g. via spurious features, call +into question how reliably these classifiers work in the wild. Furthermore, for +safety-critical tasks the black-box nature of their decisions is problematic, +and explanations or at least methods which make decisions plausible are needed +urgently. In this paper, we address these problems by generating images that +optimize a classifier-derived objective using a framework for guided image +generation. We analyze the decisions of image classifiers by visual +counterfactual explanations (VCEs), detection of systematic mistakes by +analyzing images where classifiers maximally disagree, and visualization of +neurons and spurious features. In this way, we validate existing observations, +e.g. 
the shape bias of adversarially robust models, as well as novel failure +modes, e.g. systematic errors of zero-shot CLIP classifiers. Moreover, our VCEs +outperform previous work while being more versatile.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +PRDP: Proximal Reward Difference Prediction for Large-Scale Reward Finetuning of Diffusion Models,Fei Deng · Qifei Wang · Wei Wei · Tingbo Hou · Matthias Grundmann, ,https://arxiv.org/abs/2402.08714,,2402.08714.pdf,PRDP: Proximal Reward Difference Prediction for Large-Scale Reward Finetuning of Diffusion Models,"Reward finetuning has emerged as a promising approach to aligning foundation +models with downstream objectives. Remarkable success has been achieved in the +language domain by using reinforcement learning (RL) to maximize rewards that +reflect human preference. However, in the vision domain, existing RL-based +reward finetuning methods are limited by their instability in large-scale +training, rendering them incapable of generalizing to complex, unseen prompts. +In this paper, we propose Proximal Reward Difference Prediction (PRDP), +enabling stable black-box reward finetuning for diffusion models for the first +time on large-scale prompt datasets with over 100K prompts. Our key innovation +is the Reward Difference Prediction (RDP) objective that has the same optimal +solution as the RL objective while enjoying better training stability. +Specifically, the RDP objective is a supervised regression objective that tasks +the diffusion model with predicting the reward difference of generated image +pairs from their denoising trajectories. We theoretically prove that the +diffusion model that obtains perfect reward difference prediction is exactly +the maximizer of the RL objective. We further develop an online algorithm with +proximal updates to stably optimize the RDP objective. In experiments, we +demonstrate that PRDP can match the reward maximization ability of +well-established RL-based methods in small-scale training. Furthermore, through +large-scale training on text prompts from the Human Preference Dataset v2 and +the Pick-a-Pic v1 dataset, PRDP achieves superior generation quality on a +diverse set of complex, unseen prompts whereas RL-based methods completely +fail.",cs.LG,"['cs.LG', 'cs.AI']" +SI-MIL: Taming Deep MIL for Self-Interpretability in Gigapixel Histopathology,Saarthak Kapse · Pushpak Pati · Srijan Das · Jingwei Zhang · Chao Chen · Maria Vakalopoulou · Joel Saltz · Dimitris Samaras · Rajarsi Gupta · Prateek Prasanna,https://github.com/bmi-imaginelab/SI-MIL,https://arxiv.org/abs/2312.15010,,2312.15010.pdf,SI-MIL: Taming Deep MIL for Self-Interpretability in Gigapixel Histopathology,"Introducing interpretability and reasoning into Multiple Instance Learning +(MIL) methods for Whole Slide Image (WSI) analysis is challenging, given the +complexity of gigapixel slides. Traditionally, MIL interpretability is limited +to identifying salient regions deemed pertinent for downstream tasks, offering +little insight to the end-user (pathologist) regarding the rationale behind +these selections. To address this, we propose Self-Interpretable MIL (SI-MIL), +a method intrinsically designed for interpretability from the very outset. +SI-MIL employs a deep MIL framework to guide an interpretable branch grounded +on handcrafted pathological features, facilitating linear predictions. Beyond +identifying salient regions, SI-MIL uniquely provides feature-level +interpretations rooted in pathological insights for WSIs. 
Notably, SI-MIL, with +its linear prediction constraints, challenges the prevalent myth of an +inevitable trade-off between model interpretability and performance, +demonstrating competitive results compared to state-of-the-art methods on +WSI-level prediction tasks across three cancer types. In addition, we +thoroughly benchmark the local and global-interpretability of SI-MIL in terms +of statistical analysis, a domain expert study, and desiderata of +interpretability, namely, user-friendliness and faithfulness.",cs.CV,['cs.CV'] +Diffusion Models Without Attention,Jing Nathan Yan · Jiatao Gu · Alexander Rush, ,,https://www.semanticscholar.org/paper/Diffusion-Models-Without-Attention-Yan-Gu/31245344a6eb6cd897a71928dc4b174ab75e4070,,,,,nan +DIMAT: Decentralized Iterative Merging-And-Training for Deep Learning Models,Nastaran Saadati · Minh Pham · Nasla Saleem · Joshua R. Waite · Aditya Balu · Zhanhong Jiang · Chinmay Hegde · Soumik Sarkar, ,https://arxiv.org/abs/2404.08079,,2404.08079.pdf,DIMAT: Decentralized Iterative Merging-And-Training for Deep Learning Models,"Recent advances in decentralized deep learning algorithms have demonstrated +cutting-edge performance on various tasks with large pre-trained models. +However, a pivotal prerequisite for achieving this level of competitiveness is +the significant communication and computation overheads when updating these +models, which prohibits the applications of them to real-world scenarios. To +address this issue, drawing inspiration from advanced model merging techniques +without requiring additional training, we introduce the Decentralized Iterative +Merging-And-Training (DIMAT) paradigm--a novel decentralized deep learning +framework. Within DIMAT, each agent is trained on their local data and +periodically merged with their neighboring agents using advanced model merging +techniques like activation matching until convergence is achieved. DIMAT +provably converges with the best available rate for nonconvex functions with +various first-order methods, while yielding tighter error bounds compared to +the popular existing approaches. We conduct a comprehensive empirical analysis +to validate DIMAT's superiority over baselines across diverse computer vision +tasks sourced from multiple datasets. Empirical results validate our +theoretical claims by showing that DIMAT attains faster and higher initial gain +in accuracy with independent and identically distributed (IID) and non-IID +data, incurring lower communication overhead. This DIMAT paradigm presents a +new opportunity for the future decentralized learning, enhancing its +adaptability to real-world with sparse and light-weight communication and +computation.",cs.LG,"['cs.LG', 'cs.CV', 'math.OC']" +Mask4Align: Aligned Entity Prompting with Color Masks for Multi-Entity Localization Problem,Haoquan Zhang · Ronggang Huang · Yi Xie · Huaidong Zhang, ,https://arxiv.org/abs/2310.05364,,2310.05364.pdf,Universal Multi-modal Entity Alignment via Iteratively Fusing Modality Similarity Paths,"The objective of Entity Alignment (EA) is to identify equivalent entity pairs +from multiple Knowledge Graphs (KGs) and create a more comprehensive and +unified KG. The majority of EA methods have primarily focused on the structural +modality of KGs, lacking exploration of multi-modal information. A few +multi-modal EA methods have made good attempts in this field. 
Still, they have +two shortcomings: (1) inconsistent and inefficient modality modeling that +designs complex and distinct models for each modality; (2) ineffective modality +fusion due to the heterogeneous nature of modalities in EA. To tackle these +challenges, we propose PathFusion, consisting of two main components: (1) MSP, +a unified modeling approach that simplifies the alignment process by +constructing paths connecting entities and modality nodes to represent multiple +modalities; (2) IRF, an iterative fusion method that effectively combines +information from different modalities using the path as an information carrier. +Experimental results on real-world datasets demonstrate the superiority of +PathFusion over state-of-the-art methods, with 22.4%-28.9% absolute improvement +on Hits@1, and 0.194-0.245 absolute improvement on MRR.",cs.CL,"['cs.CL', 'cs.AI']" +Hearing Anything Anywhere,Mason Wang · Ryosuke Sawata · Samuel Clarke · Ruohan Gao · Shangzhe Wu · Jiajun Wu, ,,https://zenodo.org/records/11195833,,,,,nan +OmniSDF: Scene Reconstruction using Omnidirectional Signed Distance Functions and Adaptive Binoctrees,Hakyeong Kim · Andreas Meuleman · Hyeonjoong Jang · James Tompkin · Min H. Kim,https://vclab.kaist.ac.kr/cvpr2024p2/index.html,https://arxiv.org/abs/2404.00678,,2404.00678.pdf,OmniSDF: Scene Reconstruction using Omnidirectional Signed Distance Functions and Adaptive Binoctrees,"We present a method to reconstruct indoor and outdoor static scene geometry +and appearance from an omnidirectional video moving in a small circular sweep. +This setting is challenging because of the small baseline and large depth +ranges, making it difficult to find ray crossings. To better constrain the +optimization, we estimate geometry as a signed distance field within a +spherical binoctree data structure and use a complementary efficient tree +traversal strategy based on a breadth-first search for sampling. Unlike regular +grids or trees, the shape of this structure well-matches the camera setting, +creating a better memory-quality trade-off. From an initial depth estimate, the +binoctree is adaptively subdivided throughout the optimization; previous +methods use a fixed depth that leaves the scene undersampled. In comparison +with three neural optimization methods and two non-neural methods, ours shows +decreased geometry error on average, especially in a detailed scene, while +significantly reducing the required number of voxels to represent such details.",cs.CV,"['cs.CV', 'cs.GR']" +Understanding and Improving Source-free Domain Adaptation from a Theoretical Perspective,Yu Mitsuzumi · Akisato Kimura · Hisashi Kashima, ,,https://akisatok.tech/news/a-paper-accepted-to-cvpr2024,,,,,nan +BANF: Band-limited Neural Fields for Levels of Detail Reconstruction,Ahan Shabanov · Shrisudhan Govindarajan · Cody Reading · Leili Goli · Daniel Rebain · Kwang Moo Yi · Andrea Tagliasacchi, ,https://arxiv.org/abs/2404.13024,,2404.13024.pdf,BANF: Band-limited Neural Fields for Levels of Detail Reconstruction,"Largely due to their implicit nature, neural fields lack a direct mechanism +for filtering, as Fourier analysis from discrete signal processing is not +directly applicable to these representations. Effective filtering of neural +fields is critical to enable level-of-detail processing in downstream +applications, and support operations that involve sampling the field on regular +grids (e.g. marching cubes). 
Existing methods that attempt to decompose neural +fields in the frequency domain either resort to heuristics or require extensive +modifications to the neural field architecture. We show that via a simple +modification, one can obtain neural fields that are low-pass filtered, and in +turn show how this can be exploited to obtain a frequency decomposition of the +entire signal. We demonstrate the validity of our technique by investigating +level-of-detail reconstruction, and showing how coarser representations can be +computed effectively.",cs.CV,"['cs.CV', 'eess.IV']" +PeLK: Parameter-efficient Large Kernel ConvNets with Peripheral Convolution,Honghao Chen · Xiangxiang Chu · Renyongjian · Xin Zhao · Kaiqi Huang, ,https://arxiv.org/abs/2403.07589,,2403.07589.pdf,PeLK: Parameter-efficient Large Kernel ConvNets with Peripheral Convolution,"Recently, some large kernel convnets strike back with appealing performance +and efficiency. However, given the square complexity of convolution, scaling up +kernels can bring about an enormous amount of parameters and the proliferated +parameters can induce severe optimization problem. Due to these issues, current +CNNs compromise to scale up to 51x51 in the form of stripe convolution (i.e., +51x5 + 5x51) and start to saturate as the kernel size continues growing. In +this paper, we delve into addressing these vital issues and explore whether we +can continue scaling up kernels for more performance gains. Inspired by human +vision, we propose a human-like peripheral convolution that efficiently reduces +over 90% parameter count of dense grid convolution through parameter sharing, +and manage to scale up kernel size to extremely large. Our peripheral +convolution behaves highly similar to human, reducing the complexity of +convolution from O(K^2) to O(logK) without backfiring performance. Built on +this, we propose Parameter-efficient Large Kernel Network (PeLK). Our PeLK +outperforms modern vision Transformers and ConvNet architectures like Swin, +ConvNeXt, RepLKNet and SLaK on various vision tasks including ImageNet +classification, semantic segmentation on ADE20K and object detection on MS +COCO. For the first time, we successfully scale up the kernel size of CNNs to +an unprecedented 101x101 and demonstrate consistent improvements.",cs.CV,['cs.CV'] +Quilt-LLaVA: Visual Instruction Tuning by Extracting Localized Narratives from Open-Source Histopathology Videos,Mehmet Saygin Seyfioglu · Wisdom Ikezogwo · Fatemeh Ghezloo · Ranjay Krishna · Linda Shapiro, ,https://arxiv.org/abs/2312.04746,,2312.04746.pdf,Quilt-LLaVA: Visual Instruction Tuning by Extracting Localized Narratives from Open-Source Histopathology Videos,"Diagnosis in histopathology requires a global whole slide images (WSIs) +analysis, requiring pathologists to compound evidence from different WSI +patches. The gigapixel scale of WSIs poses a challenge for histopathology +multi-modal models. Training multi-model models for histopathology requires +instruction tuning datasets, which currently contain information for individual +image patches, without a spatial grounding of the concepts within each patch +and without a wider view of the WSI. Therefore, they lack sufficient diagnostic +capacity for histopathology. To bridge this gap, we introduce Quilt-Instruct, a +large-scale dataset of 107,131 histopathology-specific instruction +question/answer pairs, grounded within diagnostically relevant image patches +that make up the WSI. 
Our dataset is collected by leveraging educational +histopathology videos from YouTube, which provides spatial localization of +narrations by automatically extracting the narrators' cursor positions. +Quilt-Instruct supports contextual reasoning by extracting diagnosis and +supporting facts from the entire WSI. Using Quilt-Instruct, we train +Quilt-LLaVA, which can reason beyond the given single image patch, enabling +diagnostic reasoning across patches. To evaluate Quilt-LLaVA, we propose a +comprehensive evaluation dataset created from 985 images and 1283 +human-generated question-answers. We also thoroughly evaluate Quilt-LLaVA using +public histopathology datasets, where Quilt-LLaVA significantly outperforms +SOTA by over 10% on relative GPT-4 score and 4% and 9% on open and closed set +VQA. Our code, data, and model are publicly accessible at +quilt-llava.github.io.",cs.CV,"['cs.CV', 'cs.AI', 'cs.CL']" +From Correspondences to Pose: Non-minimal Certifiably Optimal Relative Pose without Disambiguation,Javier Tirado-Garín · Javier Civera,https://github.com/javrtg/C2P,https://arxiv.org/abs/2312.05995,,2312.05995.pdf,From Correspondences to Pose: Non-minimal Certifiably Optimal Relative Pose without Disambiguation,"Estimating the relative camera pose from $n \geq 5$ correspondences between +two calibrated views is a fundamental task in computer vision. This process +typically involves two stages: 1) estimating the essential matrix between the +views, and 2) disambiguating among the four candidate relative poses that +satisfy the epipolar geometry. In this paper, we demonstrate a novel approach +that, for the first time, bypasses the second stage. Specifically, we show that +it is possible to directly estimate the correct relative camera pose from +correspondences without needing a post-processing step to enforce the +cheirality constraint on the correspondences. Building on recent advances in +certifiable non-minimal optimization, we frame the relative pose estimation as +a Quadratically Constrained Quadratic Program (QCQP). By applying the +appropriate constraints, we ensure the estimation of a camera pose that +corresponds to a valid 3D geometry and that is globally optimal when certified. +We validate our method through exhaustive synthetic and real-world experiments, +confirming the efficacy, efficiency and accuracy of the proposed approach. Code +is available at https://github.com/javrtg/C2P.",cs.CV,['cs.CV'] +Diffusion-based Blind Text Image Super-Resolution,Yuzhe Zhang · jiawei zhang · Hao Li · Zhouxia Wang · Luwei Hou · Dongqing Zou · Liheng Bian, ,https://arxiv.org/abs/2312.08886,,2312.08886.pdf,Diffusion-based Blind Text Image Super-Resolution,"Recovering degraded low-resolution text images is challenging, especially for +Chinese text images with complex strokes and severe degradation in real-world +scenarios. Ensuring both text fidelity and style realness is crucial for +high-quality text image super-resolution. Recently, diffusion models have +achieved great success in natural image synthesis and restoration due to their +powerful data distribution modeling abilities and data generation capabilities. +In this work, we propose an Image Diffusion Model (IDM) to restore text images +with realistic styles. For diffusion models, they are not only suitable for +modeling realistic image distribution but also appropriate for learning text +distribution. 
Since text prior is important to guarantee the correctness of the +restored text structure according to existing arts, we also propose a Text +Diffusion Model (TDM) for text recognition which can guide IDM to generate text +images with correct structures. We further propose a Mixture of Multi-modality +module (MoM) to make these two diffusion models cooperate with each other in +all the diffusion steps. Extensive experiments on synthetic and real-world +datasets demonstrate that our Diffusion-based Blind Text Image Super-Resolution +(DiffTSR) can restore text images with more accurate text structures as well as +more realistic appearances simultaneously.",cs.CV,['cs.CV'] +Language-driven Grasp Detection,An Dinh Vuong · Minh Nhat VU · Baoru Huang · Nghia Nguyen · Hieu Le · Thieu Vo · Thieu Vo · Anh Nguyen,https://airvlab.github.io/grasp-anything/,https://ar5iv.labs.arxiv.org/html/2309.09818,,2309.09818.pdf,Grasp-Anything: Large-scale Grasp Dataset from Foundation Models,"Foundation models such as ChatGPT have made significant strides in robotic +tasks due to their universal representation of real-world domains. In this +paper, we leverage foundation models to tackle grasp detection, a persistent +challenge in robotics with broad industrial applications. Despite numerous +grasp datasets, their object diversity remains limited compared to real-world +figures. Fortunately, foundation models possess an extensive repository of +real-world knowledge, including objects we encounter in our daily lives. As a +consequence, a promising solution to the limited representation in previous +grasp datasets is to harness the universal knowledge embedded in these +foundation models. We present Grasp-Anything, a new large-scale grasp dataset +synthesized from foundation models to implement this solution. Grasp-Anything +excels in diversity and magnitude, boasting 1M samples with text descriptions +and more than 3M objects, surpassing prior datasets. Empirically, we show that +Grasp-Anything successfully facilitates zero-shot grasp detection on +vision-based tasks and real-world robotic experiments. Our dataset and code are +available at https://grasp-anything-2023.github.io.",cs.RO,"['cs.RO', 'cs.CV']" +Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs,Shengbang Tong · Zhuang Liu · Zhuang Liu · Yuexiang Zhai · Yi Ma · Yann LeCun · Saining Xie, ,http://export.arxiv.org/abs/2401.06209,,2401.06209.pdf,Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs,"Is vision good enough for language? Recent advancements in multimodal models +primarily stem from the powerful reasoning abilities of large language models +(LLMs). However, the visual component typically depends only on the +instance-level contrastive language-image pre-training (CLIP). Our research +reveals that the visual capabilities in recent multimodal LLMs (MLLMs) still +exhibit systematic shortcomings. To understand the roots of these errors, we +explore the gap between the visual embedding space of CLIP and vision-only +self-supervised learning. We identify ''CLIP-blind pairs'' - images that CLIP +perceives as similar despite their clear visual differences. With these pairs, +we construct the Multimodal Visual Patterns (MMVP) benchmark. MMVP exposes +areas where state-of-the-art systems, including GPT-4V, struggle with +straightforward questions across nine basic visual patterns, often providing +incorrect answers and hallucinated explanations. 
We further evaluate various +CLIP-based vision-and-language models and found a notable correlation between +visual patterns that challenge CLIP models and those problematic for multimodal +LLMs. As an initial effort to address these issues, we propose a Mixture of +Features (MoF) approach, demonstrating that integrating vision self-supervised +learning features with MLLMs can significantly enhance their visual grounding +capabilities. Together, our research suggests visual representation learning +remains an open challenge, and accurate visual grounding is crucial for future +successful multimodal systems.",cs.CV,['cs.CV'] +Evaluating Transferability in Retrieval Tasks: An Approach Using MMD and Kernel Methods,Mengyu Dai · Amir Hossein Raffiee · Aashish Jain · Joshua Correa, ,,https://ieeexplore.ieee.org/document/10452779,,,,,nan +Alpha Invariance: On Inverse Scaling Between Distance and Volume Density in Neural Radiance Fields,Joshua Ahn · Haochen Wang · Raymond A. Yeh · Greg Shakhnarovich,https://pals.ttic.edu/p/alpha-invariance,https://arxiv.org/abs/2404.02155,,2404.02155.pdf,Alpha Invariance: On Inverse Scaling Between Distance and Volume Density in Neural Radiance Fields,"Scale-ambiguity in 3D scene dimensions leads to magnitude-ambiguity of +volumetric densities in neural radiance fields, i.e., the densities double when +scene size is halved, and vice versa. We call this property alpha invariance. +For NeRFs to better maintain alpha invariance, we recommend 1) parameterizing +both distance and volume densities in log space, and 2) a +discretization-agnostic initialization strategy to guarantee high ray +transmittance. We revisit a few popular radiance field models and find that +these systems use various heuristics to deal with issues arising from scene +scaling. We test their behaviors and show our recipe to be more robust.",cs.CV,['cs.CV'] +Prompt-Driven Referring Image Segmentation with Instance Contrasting,Chao Shang · Zichen Song · Heqian Qiu · Lanxiao Wang · Fanman Meng · Hongliang Li, ,https://arxiv.org/abs/2310.19721,,2310.19721.pdf,Promise:Prompt-driven 3D Medical Image Segmentation Using Pretrained Image Foundation Models,"To address prevalent issues in medical imaging, such as data acquisition +challenges and label availability, transfer learning from natural to medical +image domains serves as a viable strategy to produce reliable segmentation +results. However, several existing barriers between domains need to be broken +down, including addressing contrast discrepancies, managing anatomical +variability, and adapting 2D pretrained models for 3D segmentation tasks. In +this paper, we propose ProMISe,a prompt-driven 3D medical image segmentation +model using only a single point prompt to leverage knowledge from a pretrained +2D image foundation model. In particular, we use the pretrained vision +transformer from the Segment Anything Model (SAM) and integrate lightweight +adapters to extract depth-related (3D) spatial context without updating the +pretrained weights. For robust results, a hybrid network with complementary +encoders is designed, and a boundary-aware loss is proposed to achieve precise +boundaries. We evaluate our model on two public datasets for colon and pancreas +tumor segmentations, respectively. Compared to the state-of-the-art +segmentation methods with and without prompt engineering, our proposed method +achieves superior performance. 
The code is publicly available at +https://github.com/MedICL-VU/ProMISe.",eess.IV,"['eess.IV', 'cs.CV']" +DreamVideo: Composing Your Dream Videos with Customized Subject and Motion,Yujie Wei · Shiwei Zhang · Zhiwu Qing · Hangjie Yuan · Zhiheng Liu · Yu Liu · Yingya Zhang · Jingren Zhou · Hongming Shan,https://dreamvideo-t2v.github.io/,https://arxiv.org/abs/2312.04433,,2312.04433.pdf,DreamVideo: Composing Your Dream Videos with Customized Subject and Motion,"Customized generation using diffusion models has made impressive progress in +image generation, but remains unsatisfactory in the challenging video +generation task, as it requires the controllability of both subjects and +motions. To that end, we present DreamVideo, a novel approach to generating +personalized videos from a few static images of the desired subject and a few +videos of target motion. DreamVideo decouples this task into two stages, +subject learning and motion learning, by leveraging a pre-trained video +diffusion model. The subject learning aims to accurately capture the fine +appearance of the subject from provided images, which is achieved by combining +textual inversion and fine-tuning of our carefully designed identity adapter. +In motion learning, we architect a motion adapter and fine-tune it on the given +videos to effectively model the target motion pattern. Combining these two +lightweight and efficient adapters allows for flexible customization of any +subject with any motion. Extensive experimental results demonstrate the +superior performance of our DreamVideo over the state-of-the-art methods for +customized video generation. Our project page is at +https://dreamvideo-t2v.github.io.",cs.CV,['cs.CV'] +Multi-Attribute Interactions Matter for 3D Visual Grounding,Can Xu · Yuehui Han · Rui Xu · Le Hui · Jin Xie · Jian Yang, ,https://arxiv.org/abs/2404.19696,,2404.19696.pdf,Naturally Supervised 3D Visual Grounding with Language-Regularized Concept Learners,"3D visual grounding is a challenging task that often requires direct and +dense supervision, notably the semantic label for each object in the scene. In +this paper, we instead study the naturally supervised setting that learns from +only 3D scene and QA pairs, where prior works underperform. We propose the +Language-Regularized Concept Learner (LARC), which uses constraints from +language as regularization to significantly improve the accuracy of +neuro-symbolic concept learners in the naturally supervised setting. Our +approach is based on two core insights: the first is that language constraints +(e.g., a word's relation to another) can serve as effective regularization for +structured representations in neuro-symbolic models; the second is that we can +query large language models to distill such constraints from language +properties. We show that LARC improves performance of prior works in naturally +supervised 3D visual grounding, and demonstrates a wide range of 3D visual +reasoning capabilities-from zero-shot composition, to data efficiency and +transferability. 
Our method represents a promising step towards regularizing +structured visual reasoning frameworks with language-based priors, for learning +in settings without dense supervision.",cs.CV,"['cs.CV', 'cs.AI', 'cs.CL', 'cs.LG']" +Intraoperative 2D/3D Image Registration via Differentiable X-ray Rendering,Vivek Gopalakrishnan · Neel Dey · Polina Golland, ,https://arxiv.org/abs/2312.06358,,2312.06358.pdf,Intraoperative 2D/3D Image Registration via Differentiable X-ray Rendering,"Surgical decisions are informed by aligning rapid portable 2D intraoperative +images (e.g., X-rays) to a high-fidelity 3D preoperative reference scan (e.g., +CT). 2D/3D image registration often fails in practice: conventional +optimization methods are prohibitively slow and susceptible to local minima, +while neural networks trained on small datasets fail on new patients or require +impractical landmark supervision. We present DiffPose, a self-supervised +approach that leverages patient-specific simulation and differentiable +physics-based rendering to achieve accurate 2D/3D registration without relying +on manually labeled data. Preoperatively, a CNN is trained to regress the pose +of a randomly oriented synthetic X-ray rendered from the preoperative CT. The +CNN then initializes rapid intraoperative test-time optimization that uses the +differentiable X-ray renderer to refine the solution. Our work further proposes +several geometrically principled methods for sampling camera poses from +$\mathbf{SE}(3)$, for sparse differentiable rendering, and for driving +registration in the tangent space $\mathfrak{se}(3)$ with geodesic and +multiscale locality-sensitive losses. DiffPose achieves sub-millimeter accuracy +across surgical datasets at intraoperative speeds, improving upon existing +unsupervised methods by an order of magnitude and even outperforming supervised +baselines. Our code is available at https://github.com/eigenvivek/DiffPose.",cs.CV,['cs.CV'] +MedM2G: Unifying Medical Multi-Modal Generation via Cross-Guided Diffusion with Visual Invariant,Chenlu Zhan · Gaoang Wang · Yu LIN · Hongwei Wang · Jian Wu, ,https://arxiv.org/abs/2403.04290,,2403.04290.pdf,MedM2G: Unifying Medical Multi-Modal Generation via Cross-Guided Diffusion with Visual Invariant,"Medical generative models, acknowledged for their high-quality sample +generation ability, have accelerated the fast growth of medical applications. +However, recent works concentrate on separate medical generation models for +distinct medical tasks and are restricted to inadequate medical multi-modal +knowledge, constraining medical comprehensive diagnosis. In this paper, we +propose MedM2G, a Medical Multi-Modal Generative framework, with the key +innovation to align, extract, and generate medical multi-modal within a unified +model. Extending beyond single or two medical modalities, we efficiently align +medical multi-modal through the central alignment approach in the unified +space. Significantly, our framework extracts valuable clinical knowledge by +preserving the medical visual invariant of each imaging modal, thereby +enhancing specific medical information for multi-modal generation. By +conditioning the adaptive cross-guided parameters into the multi-flow diffusion +framework, our model promotes flexible interactions among medical multi-modal +for generation. MedM2G is the first medical generative model that unifies +medical generation tasks of text-to-image, image-to-text, and unified +generation of medical modalities (CT, MRI, X-ray). 
It performs 5 medical +generation tasks across 10 datasets, consistently outperforming various +state-of-the-art works.",eess.IV,"['eess.IV', 'cs.CV', 'cs.LG']" +SeD: Semantic-Aware Discriminator for Image Super-Resolution,Bingchen Li · Xin Li · Hanxin Zhu · YEYING JIN · Ruoyu Feng · Zhizheng Zhang · Zhibo Chen, ,https://arxiv.org/abs/2402.19387,,2402.19387.pdf,SeD: Semantic-Aware Discriminator for Image Super-Resolution,"Generative Adversarial Networks (GANs) have been widely used to recover vivid +textures in image super-resolution (SR) tasks. In particular, one discriminator +is utilized to enable the SR network to learn the distribution of real-world +high-quality images in an adversarial training manner. However, the +distribution learning is overly coarse-grained, which is susceptible to virtual +textures and causes counter-intuitive generation results. To mitigate this, we +propose the simple and effective Semantic-aware Discriminator (denoted as SeD), +which encourages the SR network to learn the fine-grained distributions by +introducing the semantics of images as a condition. Concretely, we aim to +excavate the semantics of images from a well-trained semantic extractor. Under +different semantics, the discriminator is able to distinguish the real-fake +images individually and adaptively, which guides the SR network to learn the +more fine-grained semantic-aware textures. To obtain accurate and abundant +semantics, we take full advantage of recently popular pretrained vision models +(PVMs) with extensive datasets, and then incorporate its semantic features into +the discriminator through a well-designed spatial cross-attention module. In +this way, our proposed semantic-aware discriminator empowered the SR network to +produce more photo-realistic and pleasing images. Extensive experiments on two +typical tasks, i.e., SR and Real SR have demonstrated the effectiveness of our +proposed methods.",eess.IV,"['eess.IV', 'cs.CV']" +Taming Self-Training for Open-Vocabulary Object Detection,Shiyu Zhao · Samuel Schulter · Long Zhao · Zhixing Zhang · Vijay Kumar BG · Yumin Suh · Manmohan Chandraker · Dimitris N. Metaxas, ,https://arxiv.org/abs/2308.06412,,2308.06412.pdf,Taming Self-Training for Open-Vocabulary Object Detection,"Recent studies have shown promising performance in open-vocabulary object +detection (OVD) by utilizing pseudo labels (PLs) from pretrained vision and +language models (VLMs). However, teacher-student self-training, a powerful and +widely used paradigm to leverage PLs, is rarely explored for OVD. This work +identifies two challenges of using self-training in OVD: noisy PLs from VLMs +and frequent distribution changes of PLs. To address these challenges, we +propose SAS-Det that tames self-training for OVD from two key perspectives. +First, we present a split-and-fusion (SAF) head that splits a standard +detection into an open-branch and a closed-branch. This design can reduce noisy +supervision from pseudo boxes. Moreover, the two branches learn complementary +knowledge from different training data, significantly enhancing performance +when fused together. Second, in our view, unlike in closed-set tasks, the PL +distributions in OVD are solely determined by the teacher model. We introduce a +periodic update strategy to decrease the number of updates to the teacher, +thereby decreasing the frequency of changes in PL distributions, which +stabilizes the training process. Extensive experiments demonstrate SAS-Det is +both efficient and effective. 
SAS-Det outperforms recent models of the same +scale by a clear margin and achieves 37.4 AP50 and 29.1 APr on novel categories +of the COCO and LVIS benchmarks, respectively. Code is available at +\url{https://github.com/xiaofeng94/SAS-Det}.",cs.CV,['cs.CV'] +Edit One for All: Interactive Batch Image Editing,Thao Nguyen · Utkarsh Ojha · Yuheng Li · Haotian Liu · Yong Jae Lee,https://thaoshibe.github.io/edit-one-for-all,https://arxiv.org/abs/2401.10219,,2401.10219.pdf,Edit One for All: Interactive Batch Image Editing,"In recent years, image editing has advanced remarkably. With increased human +control, it is now possible to edit an image in a plethora of ways; from +specifying in text what we want to change, to straight up dragging the contents +of the image in an interactive point-based manner. However, most of the focus +has remained on editing single images at a time. Whether and how we can +simultaneously edit large batches of images has remained understudied. With the +goal of minimizing human supervision in the editing process, this paper +presents a novel method for interactive batch image editing using StyleGAN as +the medium. Given an edit specified by users in an example image (e.g., make +the face frontal), our method can automatically transfer that edit to other +test images, so that regardless of their initial state (pose), they all arrive +at the same final state (e.g., all facing front). Extensive experiments +demonstrate that edits performed using our method have similar visual quality +to existing single-image-editing methods, while having more visual consistency +and saving significant time and human effort.",cs.CV,['cs.CV'] +Carve3D: Improving Multi-view Reconstruction Consistency for Diffusion Models with RL Finetuning,Desai Xie · Jiahao Li · Hao Tan · Xin Sun · Zhixin Shu · Yi Zhou · Sai Bi · Soren Pirk · Soeren Pirk · ARIE KAUFMAN,https://desaixie.github.io/carve-3d/,https://arxiv.org/abs/2312.13980v1,,2312.13980v1.pdf,Carve3D: Improving Multi-view Reconstruction Consistency for Diffusion Models with RL Finetuning,"Recent advancements in the text-to-3D task leverage finetuned text-to-image +diffusion models to generate multi-view images, followed by NeRF +reconstruction. Yet, existing supervised finetuned (SFT) diffusion models still +suffer from multi-view inconsistency and the resulting NeRF artifacts. Although +training longer with SFT improves consistency, it also causes distribution +shift, which reduces diversity and realistic details. We argue that the SFT of +multi-view diffusion models resembles the instruction finetuning stage of the +LLM alignment pipeline and can benefit from RL finetuning (RLFT) methods. +Essentially, RLFT methods optimize models beyond their SFT data distribution by +using their own outputs, effectively mitigating distribution shift. To this +end, we introduce Carve3D, a RLFT method coupled with the Multi-view +Reconstruction Consistency (MRC) metric, to improve the consistency of +multi-view diffusion models. To compute MRC on a set of multi-view images, we +compare them with their corresponding renderings of the reconstructed NeRF at +the same viewpoints. We validate the robustness of MRC with extensive +experiments conducted under controlled inconsistency levels. We enhance the +base RLFT algorithm to stabilize the training process, reduce distribution +shift, and identify scaling laws. 
Through qualitative and quantitative +experiments, along with a user study, we demonstrate Carve3D's improved +multi-view consistency, the resulting superior NeRF reconstruction quality, and +minimal distribution shift compared to longer SFT. Project webpage: +https://desaixie.github.io/carve-3d.",cs.CV,"['cs.CV', 'cs.LG']" +Density-Guided Semi-Supervised 3D Semantic Segmentation with Dual-Space Hardness Sampling,Jianan Li · Qiulei Dong, ,https://arxiv.org/abs/2306.08045,,2306.08045.pdf,Efficient 3D Semantic Segmentation with Superpoint Transformer,"We introduce a novel superpoint-based transformer architecture for efficient +semantic segmentation of large-scale 3D scenes. Our method incorporates a fast +algorithm to partition point clouds into a hierarchical superpoint structure, +which makes our preprocessing 7 times faster than existing superpoint-based +approaches. Additionally, we leverage a self-attention mechanism to capture the +relationships between superpoints at multiple scales, leading to +state-of-the-art performance on three challenging benchmark datasets: S3DIS +(76.0% mIoU 6-fold validation), KITTI-360 (63.5% on Val), and DALES (79.6%). +With only 212k parameters, our approach is up to 200 times more compact than +other state-of-the-art models while maintaining similar performance. +Furthermore, our model can be trained on a single GPU in 3 hours for a fold of +the S3DIS dataset, which is 7x to 70x fewer GPU-hours than the best-performing +methods. Our code and models are accessible at +github.com/drprojects/superpoint_transformer.",cs.CV,['cs.CV'] +Unifying Automatic and Interactive Matting with Pretrained ViTs,Zixuan Ye · Wenze Liu · He Guo · Yujia Liang · Chaoyi Hong · Hao Lu · Zhiguo Cao, ,,https://dl.acm.org/doi/10.1016/j.inffus.2023.102091,,,,,nan +S-DyRF: Reference-Based Stylized Radiance Fields for Dynamic Scenes,Xingyi Li · Zhiguo Cao · Yizheng Wu · Kewei Wang · Ke Xian · Zhe Wang · Guosheng Lin, ,https://arxiv.org/abs/2403.06205,,2403.06205.pdf,S-DyRF: Reference-Based Stylized Radiance Fields for Dynamic Scenes,"Current 3D stylization methods often assume static scenes, which violates the +dynamic nature of our real world. To address this limitation, we present +S-DyRF, a reference-based spatio-temporal stylization method for dynamic neural +radiance fields. However, stylizing dynamic 3D scenes is inherently challenging +due to the limited availability of stylized reference images along the temporal +axis. Our key insight lies in introducing additional temporal cues besides the +provided reference. To this end, we generate temporal pseudo-references from +the given stylized reference. These pseudo-references facilitate the +propagation of style information from the reference to the entire dynamic 3D +scene. For coarse style transfer, we enforce novel views and times to mimic the +style details present in pseudo-references at the feature level. To preserve +high-frequency details, we create a collection of stylized temporal pseudo-rays +from temporal pseudo-references. These pseudo-rays serve as detailed and +explicit stylization guidance for achieving fine style transfer. 
Experiments on +both synthetic and real-world datasets demonstrate that our method yields +plausible stylized results of space-time view synthesis on dynamic 3D scenes.",cs.CV,['cs.CV'] +Compressed 3D Gaussian Splatting for Accelerated Novel View Synthesis,Simon Niedermayr · Josef Stumpfegger · rüdiger westermann,https://keksboter.github.io/c3dgs/,https://arxiv.org/abs/2401.02436,,2401.02436.pdf,Compressed 3D Gaussian Splatting for Accelerated Novel View Synthesis,"Recently, high-fidelity scene reconstruction with an optimized 3D Gaussian +splat representation has been introduced for novel view synthesis from sparse +image sets. Making such representations suitable for applications like network +streaming and rendering on low-power devices requires significantly reduced +memory consumption as well as improved rendering efficiency. We propose a +compressed 3D Gaussian splat representation that utilizes sensitivity-aware +vector clustering with quantization-aware training to compress directional +colors and Gaussian parameters. The learned codebooks have low bitrates and +achieve a compression rate of up to $31\times$ on real-world scenes with only +minimal degradation of visual quality. We demonstrate that the compressed splat +representation can be efficiently rendered with hardware rasterization on +lightweight GPUs at up to $4\times$ higher framerates than reported via an +optimized GPU compute pipeline. Extensive experiments across multiple datasets +demonstrate the robustness and rendering speed of the proposed approach.",cs.CV,"['cs.CV', 'cs.GR']" +ChAda-ViT : Channel Adaptive Attention for Joint Representation Learning of Heterogeneous Microscopy Images,Nicolas Bourriez · Ihab Bendidi · Cohen Ethan · Gabriel Watkinson · Maxime Sanchez · Guillaume Bollot · Auguste Genovesio, ,https://arxiv.org/abs/2311.15264,,2311.15264.pdf,ChAda-ViT : Channel Adaptive Attention for Joint Representation Learning of Heterogeneous Microscopy Images,"Unlike color photography images, which are consistently encoded into RGB +channels, biological images encompass various modalities, where the type of +microscopy and the meaning of each channel varies with each experiment. +Importantly, the number of channels can range from one to a dozen and their +correlation is often comparatively much lower than RGB, as each of them brings +specific information content. This aspect is largely overlooked by methods +designed out of the bioimage field, and current solutions mostly focus on +intra-channel spatial attention, often ignoring the relationship between +channels, yet crucial in most biological applications. Importantly, the +variable channel type and count prevent the projection of several experiments +to a unified representation for large scale pre-training. In this study, we +propose ChAda-ViT, a novel Channel Adaptive Vision Transformer architecture +employing an Inter-Channel Attention mechanism on images with an arbitrary +number, order and type of channels. We also introduce IDRCell100k, a bioimage +dataset with a rich set of 79 experiments covering 7 microscope modalities, +with a multitude of channel types, and counts varying from 1 to 10 per +experiment. Our architecture, trained in a self-supervised manner, outperforms +existing approaches in several biologically relevant downstream tasks. 
+Additionally, it can be used to bridge the gap for the first time between +assays with different microscopes, channel numbers or types by embedding +various image and experimental modalities into a unified biological image +representation. The latter should facilitate interdisciplinary studies and pave +the way for better adoption of deep learning in biological image-based +analyses. Code and Data available at https://github.com/nicoboou/chadavit.",cs.CV,"['cs.CV', 'cs.LG']" +Generating Enhanced Negatives for Training Language-Based Object Detectors,Shiyu Zhao · Long Zhao · Vijay Kumar BG · Yumin Suh · Dimitris N. Metaxas · Manmohan Chandraker · Samuel Schulter, ,https://arxiv.org/abs/2401.00094,,2401.00094.pdf,Generating Enhanced Negatives for Training Language-Based Object Detectors,"The recent progress in language-based open-vocabulary object detection can be +largely attributed to finding better ways of leveraging large-scale data with +free-form text annotations. Training such models with a discriminative +objective function has proven successful, but requires good positive and +negative samples. However, the free-form nature and the open vocabulary of +object descriptions make the space of negatives extremely large. Prior works +randomly sample negatives or use rule-based techniques to build them. In +contrast, we propose to leverage the vast knowledge built into modern +generative models to automatically build negatives that are more relevant to +the original data. Specifically, we use large-language-models to generate +negative text descriptions, and text-to-image diffusion models to also generate +corresponding negative images. Our experimental analysis confirms the relevance +of the generated negative data, and its use in language-based detectors +improves performance on two complex benchmarks. Code is available at +\url{https://github.com/xiaofeng94/Gen-Enhanced-Negs}.",cs.CV,['cs.CV'] +Named Entity Driven Zero-Shot Image Manipulation,Zhida Feng · Li Chen · Jing Tian · Jiaxiang Liu · Shikun Feng,https://github.com/feng-zhida/StyleEntity,https://arxiv.org/abs/2307.13497,,2307.13497.pdf,Zshot: An Open-source Framework for Zero-Shot Named Entity Recognition and Relation Extraction,"The Zero-Shot Learning (ZSL) task pertains to the identification of entities +or relations in texts that were not seen during training. ZSL has emerged as a +critical research area due to the scarcity of labeled data in specific domains, +and its applications have grown significantly in recent years. With the advent +of large pretrained language models, several novel methods have been proposed, +resulting in substantial improvements in ZSL performance. There is a growing +demand, both in the research community and industry, for a comprehensive ZSL +framework that facilitates the development and accessibility of the latest +methods and pretrained models.In this study, we propose a novel ZSL framework +called Zshot that aims to address the aforementioned challenges. Our primary +objective is to provide a platform that allows researchers to compare different +state-of-the-art ZSL methods with standard benchmark datasets. Additionally, we +have designed our framework to support the industry with readily available APIs +for production under the standard SpaCy NLP pipeline. 
Our API is extendible and +evaluable, moreover, we include numerous enhancements such as boosting the +accuracy with pipeline ensembling and visualization utilities available as a +SpaCy extension.",cs.CL,"['cs.CL', 'cs.AI', 'cs.LG']" +Learned Scanpaths Aid Blind Panoramic Video Quality Assessment,Kanglong FAN · Wen Wen · Mu Li · YIFAN PENG · Kede Ma,https://github.com/kalofan/AutoScanpathQA,https://arxiv.org/abs/2404.00252,,2404.00252.pdf,Learned Scanpaths Aid Blind Panoramic Video Quality Assessment,"Panoramic videos have the advantage of providing an immersive and interactive +viewing experience. Nevertheless, their spherical nature gives rise to various +and uncertain user viewing behaviors, which poses significant challenges for +panoramic video quality assessment (PVQA). In this work, we propose an +end-to-end optimized, blind PVQA method with explicit modeling of user viewing +patterns through visual scanpaths. Our method consists of two modules: a +scanpath generator and a quality assessor. The scanpath generator is initially +trained to predict future scanpaths by minimizing their expected code length +and then jointly optimized with the quality assessor for quality prediction. +Our blind PVQA method enables direct quality assessment of panoramic images by +treating them as videos composed of identical frames. Experiments on three +public panoramic image and video quality datasets, encompassing both synthetic +and authentic distortions, validate the superiority of our blind PVQA model +over existing methods.",eess.IV,"['eess.IV', 'cs.CV']" +Molecular Data Programming: Towards Molecule Pseudo-labeling with Systematic Weak Supervision,Xin Juan · Kaixiong Zhou · Ninghao Liu · Tianlong Chen · Xin Wang, ,https://arxiv.org/abs/2309.05203,,2309.05203.pdf,From Artificially Real to Real: Leveraging Pseudo Data from Large Language Models for Low-Resource Molecule Discovery,"Molecule discovery serves as a cornerstone in numerous scientific domains, +fueling the development of new materials and innovative drug designs. Recent +developments of in-silico molecule discovery have highlighted the promising +results of cross-modal techniques, which bridge molecular structures with their +descriptive annotations. However, these cross-modal methods frequently +encounter the issue of data scarcity, hampering their performance and +application. In this paper, we address the low-resource challenge by utilizing +artificially-real data generated by Large Language Models (LLMs). We first +introduce a retrieval-based prompting strategy to construct high-quality pseudo +data, then explore the optimal method to effectively leverage this pseudo data. +Experiments show that using pseudo data for domain adaptation outperforms all +existing methods, while also requiring a smaller model scale, reduced data size +and lower training cost, highlighting its efficiency. Furthermore, our method +shows a sustained improvement as the volume of pseudo data increases, revealing +the great potential of pseudo data in advancing low-resource cross-modal +molecule discovery. 
Our code and data are available at +https://github.com/SCIR-HI/ArtificiallyR2R.",cs.CL,['cs.CL'] +ManipLLM: Embodied Multimodal Large Language Model for Object-Centric Robotic Manipulation,Xiaoqi Li · Mingxu Zhang · Yiran Geng · Haoran Geng · Haoran Geng · Yuxing Long · Yan Shen · Renrui Zhang · Jiaming Liu · Hao Dong,https://sites.google.com/view/manipllm,https://arxiv.org/abs/2312.16217,,2312.16217.pdf,ManipLLM: Embodied Multimodal Large Language Model for Object-Centric Robotic Manipulation,"Robot manipulation relies on accurately predicting contact points and +end-effector directions to ensure successful operation. However, learning-based +robot manipulation, trained on a limited category within a simulator, often +struggles to achieve generalizability, especially when confronted with +extensive categories. Therefore, we introduce an innovative approach for robot +manipulation that leverages the robust reasoning capabilities of Multimodal +Large Language Models (MLLMs) to enhance the stability and generalization of +manipulation. By fine-tuning the injected adapters, we preserve the inherent +common sense and reasoning ability of the MLLMs while equipping them with the +ability for manipulation. The fundamental insight lies in the introduced +fine-tuning paradigm, encompassing object category understanding, affordance +prior reasoning, and object-centric pose prediction to stimulate the reasoning +ability of MLLM in manipulation. During inference, our approach utilizes an RGB +image and text prompt to predict the end effector's pose in chain of thoughts. +After the initial contact is established, an active impedance adaptation policy +is introduced to plan the upcoming waypoints in a closed-loop manner. Moreover, +in real world, we design a test-time adaptation (TTA) strategy for manipulation +to enable the model better adapt to the current real-world scene configuration. +Experiments in simulator and real-world show the promising performance of +ManipLLM. More details and demonstrations can be found at +https://sites.google.com/view/manipllm.",cs.CV,"['cs.CV', 'cs.RO']" +Consistent Prompting for Rehearsal-Free Continual Learning,Zhanxin Gao · Jun Cen · Xiaobin Chang,https://github.com/Zhanxin-Gao/CPrompt,https://arxiv.org/abs/2403.08568,,2403.08568.pdf,Consistent Prompting for Rehearsal-Free Continual Learning,"Continual learning empowers models to adapt autonomously to the ever-changing +environment or data streams without forgetting old knowledge. Prompt-based +approaches are built on frozen pre-trained models to learn the task-specific +prompts and classifiers efficiently. Existing prompt-based methods are +inconsistent between training and testing, limiting their effectiveness. Two +types of inconsistency are revealed. Test predictions are made from all +classifiers while training only focuses on the current task classifier without +holistic alignment, leading to Classifier inconsistency. Prompt inconsistency +indicates that the prompt selected during testing may not correspond to the one +associated with this task during training. In this paper, we propose a novel +prompt-based method, Consistent Prompting (CPrompt), for more aligned training +and testing. Specifically, all existing classifiers are exposed to prompt +training, resulting in classifier consistency learning. In addition, prompt +consistency learning is proposed to enhance prediction robustness and boost +prompt selection accuracy. 
Our Consistent Prompting surpasses its prompt-based +counterparts and achieves state-of-the-art performance on multiple continual +learning benchmarks. Detailed analysis shows that improvements come from more +consistent training and testing.",cs.CV,"['cs.CV', 'cs.LG']" +Pixel-level Semantic Correspondence through Layout-aware Representation Learning and Multi-scale Matching Integration,Yixuan Sun · Zhangyue Yin · Haibo Wang · Yan Wang · Xipeng Qiu · Weifeng Ge · Wenqiang Zhang, ,https://ar5iv.labs.arxiv.org/html/2401.11739,,2401.11739.pdf,EmerDiff: Emerging Pixel-level Semantic Knowledge in Diffusion Models,"Diffusion models have recently received increasing research attention for +their remarkable transfer abilities in semantic segmentation tasks. However, +generating fine-grained segmentation masks with diffusion models often requires +additional training on annotated datasets, leaving it unclear to what extent +pre-trained diffusion models alone understand the semantic relations of their +generated images. To address this question, we leverage the semantic knowledge +extracted from Stable Diffusion (SD) and aim to develop an image segmentor +capable of generating fine-grained segmentation maps without any additional +training. The primary difficulty stems from the fact that semantically +meaningful feature maps typically exist only in the spatially lower-dimensional +layers, which poses a challenge in directly extracting pixel-level semantic +relations from these feature maps. To overcome this issue, our framework +identifies semantic correspondences between image pixels and spatial locations +of low-dimensional feature maps by exploiting SD's generation process and +utilizes them for constructing image-resolution segmentation maps. In extensive +experiments, the produced segmentation maps are demonstrated to be well +delineated and capture detailed parts of the images, indicating the existence +of highly accurate pixel-level semantic knowledge in diffusion models.",cs.CV,"['cs.CV', 'cs.LG']" +Model Adaptation for Time Constrained Embodied Control,Jaehyun Song · Minjong Yoo · Honguk Woo, ,,https://ieeexplore.ieee.org/document/10510652,,,,,nan +360Loc: A Dataset and Benchmark for Omnidirectional Visual Localization with Cross-device Queries,Huajian Huang · Changkun Liu · Yipeng Zhu · Hui Cheng · Tristan Braud · Sai-Kit Yeung, ,https://arxiv.org/abs/2311.17389,,2311.17389.pdf,360Loc: A Dataset and Benchmark for Omnidirectional Visual Localization with Cross-device Queries,"Portable 360$^\circ$ cameras are becoming a cheap and efficient tool to +establish large visual databases. By capturing omnidirectional views of a +scene, these cameras could expedite building environment models that are +essential for visual localization. However, such an advantage is often +overlooked due to the lack of valuable datasets. This paper introduces a new +benchmark dataset, 360Loc, composed of 360$^\circ$ images with ground truth +poses for visual localization. We present a practical implementation of +360$^\circ$ mapping combining 360$^\circ$ images with lidar data to generate +the ground truth 6DoF poses. 360Loc is the first dataset and benchmark that +explores the challenge of cross-device visual positioning, involving +360$^\circ$ reference frames, and query frames from pinhole, ultra-wide FoV +fisheye, and 360$^\circ$ cameras. 
We propose a virtual camera approach to +generate lower-FoV query frames from 360$^\circ$ images, which ensures a fair +comparison of performance among different query types in visual localization +tasks. We also extend this virtual camera approach to feature matching-based +and pose regression-based methods to alleviate the performance loss caused by +the cross-device domain gap, and evaluate its effectiveness against +state-of-the-art baselines. We demonstrate that omnidirectional visual +localization is more robust in challenging large-scale scenes with symmetries +and repetitive structures. These results provide new insights into 360-camera +mapping and omnidirectional visual localization with cross-device queries.",cs.CV,['cs.CV'] +Layout-Agnostic Scene Text Image Synthesis with Diffusion Models,Qilong Zhangli · Jindong Jiang · Di Liu · Licheng Yu · Xiaoliang Dai · Ankit Ramchandani · Guan Pang · Dimitris N. Metaxas · Praveen Krishnan, ,https://arxiv.org/abs/2312.04884,,2312.04884.pdf,UDiffText: A Unified Framework for High-quality Text Synthesis in Arbitrary Images via Character-aware Diffusion Models,"Text-to-Image (T2I) generation methods based on diffusion model have garnered +significant attention in the last few years. Although these image synthesis +methods produce visually appealing results, they frequently exhibit spelling +errors when rendering text within the generated images. Such errors manifest as +missing, incorrect or extraneous characters, thereby severely constraining the +performance of text image generation based on diffusion models. To address the +aforementioned issue, this paper proposes a novel approach for text image +generation, utilizing a pre-trained diffusion model (i.e., Stable Diffusion +[27]). Our approach involves the design and training of a light-weight +character-level text encoder, which replaces the original CLIP encoder and +provides more robust text embeddings as conditional guidance. Then, we +fine-tune the diffusion model using a large-scale dataset, incorporating local +attention control under the supervision of character-level segmentation maps. +Finally, by employing an inference stage refinement process, we achieve a +notably high sequence accuracy when synthesizing text in arbitrarily given +images. Both qualitative and quantitative results demonstrate the superiority +of our method to the state of the art. Furthermore, we showcase several +potential applications of the proposed UDiffText, including text-centric image +synthesis, scene text editing, etc. Code and model will be available at +https://github.com/ZYM-PKU/UDiffText .",cs.CV,['cs.CV'] +Amodal Completion via Progressive Mixed Context Diffusion,Katherine Xu · Lingzhi Zhang · Jianbo Shi,https://k8xu.github.io/amodal,https://arxiv.org/abs/2312.15540,,2312.15540.pdf,Amodal Completion via Progressive Mixed Context Diffusion,"Our brain can effortlessly recognize objects even when partially hidden from +view. Seeing the visible of the hidden is called amodal completion; however, +this task remains a challenge for generative AI despite rapid progress. We +propose to sidestep many of the difficulties of existing approaches, which +typically involve a two-step process of predicting amodal masks and then +generating pixels. Our method involves thinking outside the box, literally! We +go outside the object bounding box to use its context to guide a pre-trained +diffusion inpainting model, and then progressively grow the occluded object and +trim the extra background. 
We overcome two technical challenges: 1) how to be +free of unwanted co-occurrence bias, which tends to regenerate similar +occluders, and 2) how to judge if an amodal completion has succeeded. Our +amodal completion method exhibits improved photorealistic completion results +compared to existing approaches in numerous successful completion cases. And +the best part? It doesn't require any special training or fine-tuning of +models.",cs.CV,['cs.CV'] +Make Pixels Dance: High-Dynamic Video Generation,Yan Zeng · Guoqiang Wei · Jiani Zheng · Jiaxin Zou · Yang Wei · Yuchen Zhang · Yuchen Zhang · Hang Li, ,https://arxiv.org/abs/2311.10982,,2311.10982.pdf,Make Pixels Dance: High-Dynamic Video Generation,"Creating high-dynamic videos such as motion-rich actions and sophisticated +visual effects poses a significant challenge in the field of artificial +intelligence. Unfortunately, current state-of-the-art video generation methods, +primarily focusing on text-to-video generation, tend to produce video clips +with minimal motions despite maintaining high fidelity. We argue that relying +solely on text instructions is insufficient and suboptimal for video +generation. In this paper, we introduce PixelDance, a novel approach based on +diffusion models that incorporates image instructions for both the first and +last frames in conjunction with text instructions for video generation. +Comprehensive experimental results demonstrate that PixelDance trained with +public data exhibits significantly better proficiency in synthesizing videos +with complex scenes and intricate motions, setting a new standard for video +generation.",cs.CV,['cs.CV'] +MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World,Yining Hong · Zishuo Zheng · Peihao Chen · Yian Wang · Junyan Li · Chuang Gan, ,https://arxiv.org/abs/2401.08577,,2401.08577.pdf,MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World,"Human beings possess the capability to multiply a melange of multisensory +cues while actively exploring and interacting with the 3D world. Current +multi-modal large language models, however, passively absorb sensory data as +inputs, lacking the capacity to actively interact with the objects in the 3D +environment and dynamically collect their multisensory information. To usher in +the study of this area, we propose MultiPLY, a multisensory embodied large +language model that could incorporate multisensory interactive data, including +visual, audio, tactile, and thermal information into large language models, +thereby establishing the correlation among words, actions, and percepts. To +this end, we first collect Multisensory Universe, a large-scale multisensory +interaction dataset comprising 500k data by deploying an LLM-powered embodied +agent to engage with the 3D environment. To perform instruction tuning with +pre-trained LLM on such generated data, we first encode the 3D scene as +abstracted object-centric representations and then introduce action tokens +denoting that the embodied agent takes certain actions within the environment, +as well as state tokens that represent the multisensory state observations of +the agent at each time step. In the inference time, MultiPLY could generate +action tokens, instructing the agent to take the action in the environment and +obtain the next multisensory state observation. The observation is then +appended back to the LLM via state tokens to generate subsequent text or action +tokens. 
We demonstrate that MultiPLY outperforms baselines by a large margin +through a diverse set of embodied tasks involving object retrieval, tool use, +multisensory captioning, and task decomposition.",cs.CV,"['cs.CV', 'cs.AI', 'cs.CL', 'cs.LG', 'cs.RO']" +Referring Expression Counting,Siyang Dai · Jun Liu · Ngai-Man Cheung, ,https://arxiv.org/abs/2405.15658,,2405.15658.pdf,HDC: Hierarchical Semantic Decoding with Counting Assistance for Generalized Referring Expression Segmentation,"The newly proposed Generalized Referring Expression Segmentation (GRES) +amplifies the formulation of classic RES by involving multiple/non-target +scenarios. Recent approaches focus on optimizing the last modality-fused +feature which is directly utilized for segmentation and object-existence +identification. However, the attempt to integrate all-grained information into +a single joint representation is impractical in GRES due to the increased +complexity of the spatial relationships among instances and deceptive text +descriptions. Furthermore, the subsequent binary target justification across +all referent scenarios fails to specify their inherent differences, leading to +ambiguity in object understanding. To address the weakness, we propose a +$\textbf{H}$ierarchical Semantic $\textbf{D}$ecoding with $\textbf{C}$ounting +Assistance framework (HDC). It hierarchically transfers complementary modality +information across granularities, and then aggregates each well-aligned +semantic correspondence for multi-level decoding. Moreover, with complete +semantic context modeling, we endow HDC with explicit counting capability to +facilitate comprehensive object perception in multiple/single/non-target +settings. Experimental results on gRefCOCO, Ref-ZOM, R-RefCOCO, and RefCOCO +benchmarks demonstrate the effectiveness and rationality of HDC which +outperforms the state-of-the-art GRES methods by a remarkable margin. Code will +be available $\href{https://github.com/RobertLuo1/HDC}{here}$.",cs.CV,"['cs.CV', 'cs.AI']" +UnionFormer: Unified-Learning Transformer with Multi-View Representation for Image Manipulation Detection and Localization,Shuaibo Li · Wei Ma · Jianwei Guo · Shibiao Xu · Benchong Li · Xiaopeng Zhang, ,,https://ieeexplore.ieee.org/abstract/document/10155416,,,,,nan +Improving Physics-Augmented Continuum Neural Radiance Field-Based Geometry-Agnostic System Identification with Lagrangian Particle Optimization,Takuhiro Kaneko, ,,https://adversarr.github.io/ps/Papers/2024/03/14/pac-nerf-physics-augmented-continuum-neural-radiance-fields-for-geometry-agnostic-system-identification/,,,,,nan +SonicVisionLM: Playing Sound with Vision Language Models,Zhifeng Xie · Shengye Yu · Qile He · Mengtian Li, ,https://arxiv.org/abs/2401.04394,,2401.04394.pdf,SonicVisionLM: Playing Sound with Vision Language Models,"There has been a growing interest in the task of generating sound for silent +videos, primarily because of its practicality in streamlining video +post-production. However, existing methods for video-sound generation attempt +to directly create sound from visual representations, which can be challenging +due to the difficulty of aligning visual representations with audio +representations. In this paper, we present SonicVisionLM, a novel framework +aimed at generating a wide range of sound effects by leveraging vision-language +models(VLMs). Instead of generating audio directly from video, we use the +capabilities of powerful VLMs. 
When provided with a silent video, our approach +first identifies events within the video using a VLM to suggest possible sounds +that match the video content. This shift in approach transforms the challenging +task of aligning image and audio into more well-studied sub-problems of +aligning image-to-text and text-to-audio through the popular diffusion models. +To improve the quality of audio recommendations with LLMs, we have collected an +extensive dataset that maps text descriptions to specific sound effects and +developed a time-controlled audio adapter. Our approach surpasses current +state-of-the-art methods for converting video to audio, enhancing +synchronization with the visuals, and improving alignment between audio and +video components. Project page: +https://yusiissy.github.io/SonicVisionLM.github.io/",cs.MM,"['cs.MM', 'cs.SD', 'eess.AS']" +A Dual-Augmentor Framework for Domain Generalization in 3D Human Pose Estimation,Qucheng Peng · Ce Zheng · Chen Chen, ,https://arxiv.org/abs/2403.11310,,2403.11310.pdf,A Dual-Augmentor Framework for Domain Generalization in 3D Human Pose Estimation,"3D human pose data collected in controlled laboratory settings present +challenges for pose estimators that generalize across diverse scenarios. To +address this, domain generalization is employed. Current methodologies in +domain generalization for 3D human pose estimation typically utilize +adversarial training to generate synthetic poses for training. Nonetheless, +these approaches exhibit several limitations. First, the lack of prior +information about the target domain complicates the application of suitable +augmentation through a single pose augmentor, affecting generalization on +target domains. Moreover, adversarial training's discriminator tends to enforce +similarity between source and synthesized poses, impeding the exploration of +out-of-source distributions. Furthermore, the pose estimator's optimization is +not exposed to domain shifts, limiting its overall generalization ability. + To address these limitations, we propose a novel framework featuring two pose +augmentors: the weak and the strong augmentors. Our framework employs +differential strategies for generation and discrimination processes, +facilitating the preservation of knowledge related to source poses and the +exploration of out-of-source distributions without prior information about +target poses. Besides, we leverage meta-optimization to simulate domain shifts +in the optimization process of the pose estimator, thereby improving its +generalization ability. Our proposed approach significantly outperforms +existing methods, as demonstrated through comprehensive experiments on various +benchmark datasets.Our code will be released at +\url{https://github.com/davidpengucf/DAF-DG}.",cs.CV,['cs.CV'] +ProMotion: Prototypes As Motion Learners,Yawen Lu · Dongfang Liu · Qifan Wang · Cheng Han · Yiming Cui · Yiming Cui · Zhiwen Cao · Xueling Zhang · Yingjie Victor Chen · Heng Fan, ,https://ar5iv.labs.arxiv.org/html/2304.11523,,2304.11523.pdf,TransFlow: Transformer as Flow Learner,"Optical flow is an indispensable building block for various important +computer vision tasks, including motion estimation, object tracking, and +disparity measurement. In this work, we propose TransFlow, a pure transformer +architecture for optical flow estimation. Compared to dominant CNN-based +methods, TransFlow demonstrates three advantages. 
First, it provides more +accurate correlation and trustworthy matching in flow estimation by utilizing +spatial self-attention and cross-attention mechanisms between adjacent frames +to effectively capture global dependencies; Second, it recovers more +compromised information (e.g., occlusion and motion blur) in flow estimation +through long-range temporal association in dynamic scenes; Third, it enables a +concise self-learning paradigm and effectively eliminate the complex and +laborious multi-stage pre-training procedures. We achieve the state-of-the-art +results on the Sintel, KITTI-15, as well as several downstream tasks, including +video object detection, interpolation and stabilization. For its efficacy, we +hope TransFlow could serve as a flexible baseline for optical flow estimation.",cs.CV,['cs.CV'] +Event-assisted Low-Light Video Object Segmentation,Li Hebei · Jin Wang · Jiahui Yuan · Yue Li · Wenming Weng · Yansong Peng · Yueyi Zhang · Zhiwei Xiong · Xiaoyan Sun, ,https://arxiv.org/abs/2404.01945,,2404.01945.pdf,Event-assisted Low-Light Video Object Segmentation,"In the realm of video object segmentation (VOS), the challenge of operating +under low-light conditions persists, resulting in notably degraded image +quality and compromised accuracy when comparing query and memory frames for +similarity computation. Event cameras, characterized by their high dynamic +range and ability to capture motion information of objects, offer promise in +enhancing object visibility and aiding VOS methods under such low-light +conditions. This paper introduces a pioneering framework tailored for low-light +VOS, leveraging event camera data to elevate segmentation accuracy. Our +approach hinges on two pivotal components: the Adaptive Cross-Modal Fusion +(ACMF) module, aimed at extracting pertinent features while fusing image and +event modalities to mitigate noise interference, and the Event-Guided Memory +Matching (EGMM) module, designed to rectify the issue of inaccurate matching +prevalent in low-light settings. Additionally, we present the creation of a +synthetic LLE-DAVIS dataset and the curation of a real-world LLE-VOS dataset, +encompassing frames and events. Experimental evaluations corroborate the +efficacy of our method across both datasets, affirming its effectiveness in +low-light scenarios.",cs.CV,['cs.CV'] +Towards Backward-Compatible Continual Learning of Image Compression,Zhihao Duan · Ming Lu · Justin Yang · Jiangpeng He · Zhan Ma · Fengqing Zhu, ,https://arxiv.org/abs/2402.18862,,2402.18862.pdf,Towards Backward-Compatible Continual Learning of Image Compression,"This paper explores the possibility of extending the capability of +pre-trained neural image compressors (e.g., adapting to new data or target +bitrates) without breaking backward compatibility, the ability to decode +bitstreams encoded by the original model. We refer to this problem as continual +learning of image compression. Our initial findings show that baseline +solutions, such as end-to-end fine-tuning, do not preserve the desired backward +compatibility. To tackle this, we propose a knowledge replay training strategy +that effectively addresses this issue. We also design a new model architecture +that enables more effective continual learning than existing baselines. +Experiments are conducted for two scenarios: data-incremental learning and +rate-incremental learning. 
The main conclusion of this paper is that neural +image compressors can be fine-tuned to achieve better performance (compared to +their pre-trained version) on new data and rates without compromising backward +compatibility. Our code is available at +https://gitlab.com/viper-purdue/continual-compression",eess.IV,['eess.IV'] +Guess The Unseen: Dynamic 3D Scene Reconstruction from Partial 2D Glimpses,Inhee Lee · Byungjun Kim · Hanbyul Joo, ,http://export.arxiv.org/abs/2404.14410,,2404.14410.pdf,Guess The Unseen: Dynamic 3D Scene Reconstruction from Partial 2D Glimpses,"In this paper, we present a method to reconstruct the world and multiple +dynamic humans in 3D from a monocular video input. As a key idea, we represent +both the world and multiple humans via the recently emerging 3D Gaussian +Splatting (3D-GS) representation, enabling to conveniently and efficiently +compose and render them together. In particular, we address the scenarios with +severely limited and sparse observations in 3D human reconstruction, a common +challenge encountered in the real world. To tackle this challenge, we introduce +a novel approach to optimize the 3D-GS representation in a canonical space by +fusing the sparse cues in the common space, where we leverage a pre-trained 2D +diffusion model to synthesize unseen views while keeping the consistency with +the observed 2D appearances. We demonstrate our method can reconstruct +high-quality animatable 3D humans in various challenging examples, in the +presence of occlusion, image crops, few-shot, and extremely sparse +observations. After reconstruction, our method is capable of not only rendering +the scene in any novel views at arbitrary time instances, but also editing the +3D scene by removing individual humans or applying different motions for each +human. Through various experiments, we demonstrate the quality and efficiency +of our methods over alternative existing approaches.",cs.CV,['cs.CV'] +EMAGE: Towards Unified Holistic Co-Speech Gesture Generation via Expressive Masked Audio Gesture Modeling,Haiyang Liu · Zihao Zhu · Giorgio Becherini · YICHEN PENG · Mingyang Su · YOU ZHOU · Xuefei Zhe · Naoya Iwamoto · Bo Zheng · Michael J. Black,https://pantomatrix.github.io/EMAGE/,https://arxiv.org/abs/2401.00374,,2401.00374.pdf,EMAGE: Towards Unified Holistic Co-Speech Gesture Generation via Expressive Masked Audio Gesture Modeling,"We propose EMAGE, a framework to generate full-body human gestures from audio +and masked gestures, encompassing facial, local body, hands, and global +movements. To achieve this, we first introduce BEAT2 (BEAT-SMPLX-FLAME), a new +mesh-level holistic co-speech dataset. BEAT2 combines a MoShed SMPL-X body with +FLAME head parameters and further refines the modeling of head, neck, and +finger movements, offering a community-standardized, high-quality 3D motion +captured dataset. EMAGE leverages masked body gesture priors during training to +boost inference performance. It involves a Masked Audio Gesture Transformer, +facilitating joint training on audio-to-gesture generation and masked gesture +reconstruction to effectively encode audio and body gesture hints. Encoded body +hints from masked gestures are then separately employed to generate facial and +body movements. Moreover, EMAGE adaptively merges speech features from the +audio's rhythm and content and utilizes four compositional VQ-VAEs to enhance +the results' fidelity and diversity. 
Experiments demonstrate that EMAGE +generates holistic gestures with state-of-the-art performance and is flexible +in accepting predefined spatial-temporal gesture inputs, generating complete, +audio-synchronized results. Our code and dataset are available +https://pantomatrix.github.io/EMAGE/",cs.CV,['cs.CV'] +A Semi-supervised Nighttime Dehazing Baseline with Spatial-Frequency Aware and Realistic Brightness Constraint,Xiaofeng Cong · Jie Gui · Jing Zhang · Junming Hou · Hao Shen,https://github.com/Xiaofeng-life/SFSNiD/,https://arxiv.org/abs/2403.18548,,2403.18548.pdf,A Semi-supervised Nighttime Dehazing Baseline with Spatial-Frequency Aware and Realistic Brightness Constraint,"Existing research based on deep learning has extensively explored the problem +of daytime image dehazing. However, few studies have considered the +characteristics of nighttime hazy scenes. There are two distinctions between +nighttime and daytime haze. First, there may be multiple active colored light +sources with lower illumination intensity in nighttime scenes, which may cause +haze, glow and noise with localized, coupled and frequency inconsistent +characteristics. Second, due to the domain discrepancy between simulated and +real-world data, unrealistic brightness may occur when applying a dehazing +model trained on simulated data to real-world data. To address the above two +issues, we propose a semi-supervised model for real-world nighttime dehazing. +First, the spatial attention and frequency spectrum filtering are implemented +as a spatial-frequency domain information interaction module to handle the +first issue. Second, a pseudo-label-based retraining strategy and a local +window-based brightness loss for semi-supervised training process is designed +to suppress haze and glow while achieving realistic brightness. Experiments on +public benchmarks validate the effectiveness of the proposed method and its +superiority over state-of-the-art methods. The source code and Supplementary +Materials are placed in the https://github.com/Xiaofeng-life/SFSNiD.",cs.CV,['cs.CV'] +How to Configure Good In-Context Sequence for Visual Question Answering,Li Li · Jiawei Peng · huiyi chen · Chongyang Gao · Xu Yang, ,https://arxiv.org/abs/2312.01571,,2312.01571.pdf,How to Configure Good In-Context Sequence for Visual Question Answering,"Inspired by the success of Large Language Models in dealing with new tasks +via In-Context Learning (ICL) in NLP, researchers have also developed Large +Vision-Language Models (LVLMs) with ICL capabilities. However, when +implementing ICL using these LVLMs, researchers usually resort to the simplest +way like random sampling to configure the in-context sequence, thus leading to +sub-optimal results. To enhance the ICL performance, in this study, we use +Visual Question Answering (VQA) as case study to explore diverse in-context +configurations to find the powerful ones. Additionally, through observing the +changes of the LVLM outputs by altering the in-context sequence, we gain +insights into the inner properties of LVLMs, improving our understanding of +them. Specifically, to explore in-context configurations, we design diverse +retrieval methods and employ different strategies to manipulate the retrieved +demonstrations. Through exhaustive experiments on three VQA datasets: VQAv2, +VizWiz, and OK-VQA, we uncover three important inner properties of the applied +LVLM and demonstrate which strategies can consistently improve the ICL VQA +performance. 
Our code is provided in: +https://github.com/GaryJiajia/OFv2_ICL_VQA.",cs.CV,"['cs.CV', 'cs.AI']" +Do You Remember? Dense Video Captioning with Cross-Modal Memory Retrieval,Minkuk Kim · Hyeon Bae Kim · Jinyoung Moon · Jinwoo Choi · Seong Tae Kim, ,https://arxiv.org/abs/2404.07610,,2404.07610.pdf,Do You Remember? Dense Video Captioning with Cross-Modal Memory Retrieval,"There has been significant attention to the research on dense video +captioning, which aims to automatically localize and caption all events within +untrimmed video. Several studies introduce methods by designing dense video +captioning as a multitasking problem of event localization and event captioning +to consider inter-task relations. However, addressing both tasks using only +visual input is challenging due to the lack of semantic content. In this study, +we address this by proposing a novel framework inspired by the cognitive +information processing of humans. Our model utilizes external memory to +incorporate prior knowledge. The memory retrieval method is proposed with +cross-modal video-to-text matching. To effectively incorporate retrieved text +features, the versatile encoder and the decoder with visual and textual +cross-attention modules are designed. Comparative experiments have been +conducted to show the effectiveness of the proposed method on ActivityNet +Captions and YouCook2 datasets. Experimental results show promising performance +of our model without extensive pretraining from a large video dataset.",cs.CV,['cs.CV'] +Towards Text-guided 3D Scene Composition,Qihang Zhang · Chaoyang Wang · Aliaksandr Siarohin · Peiye Zhuang · Yinghao Xu · Ceyuan Yang · Dahua Lin · Bolei Zhou · Sergey Tulyakov · Hsin-Ying Lee, ,https://arxiv.org/abs/2312.08885,,2312.08885.pdf,SceneWiz3D: Towards Text-guided 3D Scene Composition,"We are witnessing significant breakthroughs in the technology for generating +3D objects from text. Existing approaches either leverage large text-to-image +models to optimize a 3D representation or train 3D generators on object-centric +datasets. Generating entire scenes, however, remains very challenging as a +scene contains multiple 3D objects, diverse and scattered. In this work, we +introduce SceneWiz3D, a novel approach to synthesize high-fidelity 3D scenes +from text. We marry the locality of objects with globality of scenes by +introducing a hybrid 3D representation: explicit for objects and implicit for +scenes. Remarkably, an object, being represented explicitly, can be either +generated from text using conventional text-to-3D approaches, or provided by +users. To configure the layout of the scene and automatically place objects, we +apply the Particle Swarm Optimization technique during the optimization +process. Furthermore, it is difficult for certain parts of the scene (e.g., +corners, occlusion) to receive multi-view supervision, leading to inferior +geometry. We incorporate an RGBD panorama diffusion model to mitigate it, +resulting in high-quality geometry. 
Extensive evaluation supports that our +approach achieves superior quality over previous approaches, enabling the +generation of detailed and view-consistent 3D scenes.",cs.CV,['cs.CV'] +Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs,Kanchana Ranasinghe · Satya Narayan Shukla · Omid Poursaeed · Michael Ryoo · Tsung-Yu Lin, ,https://arxiv.org/abs/2404.07449,,2404.07449.pdf,Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs,"Integration of Large Language Models (LLMs) into visual domain tasks, +resulting in visual-LLMs (V-LLMs), has enabled exceptional performance in +vision-language tasks, particularly for visual question answering (VQA). +However, existing V-LLMs (e.g. BLIP-2, LLaVA) demonstrate weak spatial +reasoning and localization awareness. Despite generating highly descriptive and +elaborate textual answers, these models fail at simple tasks like +distinguishing a left vs right location. In this work, we explore how +image-space coordinate based instruction fine-tuning objectives could inject +spatial awareness into V-LLMs. We discover optimal coordinate representations, +data-efficient instruction fine-tuning objectives, and pseudo-data generation +strategies that lead to improved spatial awareness in V-LLMs. Additionally, our +resulting model improves VQA across image and video domains, reduces undesired +hallucination, and generates better contextual object descriptions. Experiments +across 5 vision-language tasks involving 14 different datasets establish the +clear performance improvements achieved by our proposed framework.",cs.CV,['cs.CV'] +Finding Lottery Tickets in Vision Models via Data-driven Spectral Foresight Pruning,Leonardo Iurada · Marco Ciccone · Tatiana Tommasi,https://iurada.github.io/PX,https://arxiv.org/abs/2405.00906,,2405.00906.pdf,LOTUS: Improving Transformer Efficiency with Sparsity Pruning and Data Lottery Tickets,"Vision transformers have revolutionized computer vision, but their +computational demands present challenges for training and deployment. This +paper introduces LOTUS (LOttery Transformers with Ultra Sparsity), a novel +method that leverages data lottery ticket selection and sparsity pruning to +accelerate vision transformer training while maintaining accuracy. Our approach +focuses on identifying and utilizing the most informative data subsets and +eliminating redundant model parameters to optimize the training process. +Through extensive experiments, we demonstrate the effectiveness of LOTUS in +achieving rapid convergence and high accuracy with significantly reduced +computational requirements. This work highlights the potential of combining +data selection and sparsity techniques for efficient vision transformer +training, opening doors for further research and development in this area.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +Fully Geometric Panoramic Localization,Junho Kim · Jiwon Jeong · Young Min Kim,https://82magnolia.github.io/fgpl/,https://arxiv.org/abs/2403.19904,,2403.19904.pdf,Fully Geometric Panoramic Localization,"We introduce a lightweight and accurate localization method that only +utilizes the geometry of 2D-3D lines. Given a pre-captured 3D map, our approach +localizes a panorama image, taking advantage of the holistic 360 view. The +system mitigates potential privacy breaches or domain discrepancies by avoiding +trained or hand-crafted visual descriptors. 
However, as lines alone can be +ambiguous, we express distinctive yet compact spatial contexts from +relationships between lines, namely the dominant directions of parallel lines +and the intersection between non-parallel lines. The resulting representations +are efficient in processing time and memory compared to conventional visual +descriptor-based methods. Given the groups of dominant line directions and +their intersections, we accelerate the search process to test thousands of pose +candidates in less than a millisecond without sacrificing accuracy. We +empirically show that the proposed 2D-3D matching can localize panoramas for +challenging scenes with similar structures, dramatic domain shifts or +illumination changes. Our fully geometric approach does not involve extensive +parameter tuning or neural network training, making it a practical algorithm +that can be readily deployed in the real world. Project page including the code +is available through this link: https://82magnolia.github.io/fgpl/.",cs.CV,['cs.CV'] +VS: Reconstructing Clothed 3D Human from Single Image via Vertex Shift,Leyuan Liu · Yuhan Li · Yunqi Gao · Changxin Gao · Yuanyuan Liu · Jingying Chen,https://github.com/naivate/VS.git,https://arxiv.org/abs/2309.13524,,2309.13524.pdf,Global-correlated 3D-decoupling Transformer for Clothed Avatar Reconstruction,"Reconstructing 3D clothed human avatars from single images is a challenging +task, especially when encountering complex poses and loose clothing. Current +methods exhibit limitations in performance, largely attributable to their +dependence on insufficient 2D image features and inconsistent query methods. +Owing to this, we present the Global-correlated 3D-decoupling Transformer for +clothed Avatar reconstruction (GTA), a novel transformer-based architecture +that reconstructs clothed human avatars from monocular images. Our approach +leverages transformer architectures by utilizing a Vision Transformer model as +an encoder for capturing global-correlated image features. Subsequently, our +innovative 3D-decoupling decoder employs cross-attention to decouple tri-plane +features, using learnable embeddings as queries for cross-plane generation. To +effectively enhance feature fusion with the tri-plane 3D feature and human body +prior, we propose a hybrid prior fusion strategy combining spatial and +prior-enhanced queries, leveraging the benefits of spatial localization and +human body prior knowledge. Comprehensive experiments on CAPE and THuman2.0 +datasets illustrate that our method outperforms state-of-the-art approaches in +both geometry and texture reconstruction, exhibiting high robustness to +challenging poses and loose clothing, and producing higher-resolution textures. +Codes will be available at https://github.com/River-Zhang/GTA.",cs.CV,"['cs.CV', 'cs.AI']" +Towards Real-World HDR Video Reconstruction: A Large-Scale Benchmark Dataset and A Two-Stage Alignment Network,Yong Shu · Liquan Shen · Xiangyu Hu · Mengyao Li · Zihao Zhou,https://github.com/yungsyu99/Real-HDRV,https://arxiv.org/abs/2405.00244,,2405.00244.pdf,Towards Real-World HDR Video Reconstruction: A Large-Scale Benchmark Dataset and A Two-Stage Alignment Network,"As an important and practical way to obtain high dynamic range (HDR) video, +HDR video reconstruction from sequences with alternating exposures is still +less explored, mainly due to the lack of large-scale real-world datasets. +Existing methods are mostly trained on synthetic datasets, which perform poorly +in real scenes. 
In this work, to facilitate the development of real-world HDR +video reconstruction, we present Real-HDRV, a large-scale real-world benchmark +dataset for HDR video reconstruction, featuring various scenes, diverse motion +patterns, and high-quality labels. Specifically, our dataset contains 500 +LDRs-HDRs video pairs, comprising about 28,000 LDR frames and 4,000 HDR labels, +covering daytime, nighttime, indoor, and outdoor scenes. To our best knowledge, +our dataset is the largest real-world HDR video reconstruction dataset. +Correspondingly, we propose an end-to-end network for HDR video reconstruction, +where a novel two-stage strategy is designed to perform alignment sequentially. +Specifically, the first stage performs global alignment with the adaptively +estimated global offsets, reducing the difficulty of subsequent alignment. The +second stage implicitly performs local alignment in a coarse-to-fine manner at +the feature level using the adaptive separable convolution. Extensive +experiments demonstrate that: (1) models trained on our dataset can achieve +better performance on real scenes than those trained on synthetic datasets; (2) +our method outperforms previous state-of-the-art methods. Our dataset is +available at https://github.com/yungsyu99/Real-HDRV.",cs.CV,['cs.CV'] +Language Embedded 3D Gaussians for Open-Vocabulary Scene Understanding,Jin-Chuan Shi · Miao Wang · Haobin Duan · Shaohua Guan,https://buaavrcg.github.io/LEGaussians/,https://arxiv.org/abs/2311.18482,,2311.18482.pdf,Language Embedded 3D Gaussians for Open-Vocabulary Scene Understanding,"Open-vocabulary querying in 3D space is challenging but essential for scene +understanding tasks such as object localization and segmentation. +Language-embedded scene representations have made progress by incorporating +language features into 3D spaces. However, their efficacy heavily depends on +neural networks that are resource-intensive in training and rendering. Although +recent 3D Gaussians offer efficient and high-quality novel view synthesis, +directly embedding language features in them leads to prohibitive memory usage +and decreased performance. In this work, we introduce Language Embedded 3D +Gaussians, a novel scene representation for open-vocabulary query tasks. +Instead of embedding high-dimensional raw semantic features on 3D Gaussians, we +propose a dedicated quantization scheme that drastically alleviates the memory +requirement, and a novel embedding procedure that achieves smoother yet high +accuracy query, countering the multi-view feature inconsistencies and the +high-frequency inductive bias in point-based representations. Our comprehensive +experiments show that our representation achieves the best visual quality and +language querying accuracy across current language-embedded representations, +while maintaining real-time rendering frame rates on a single desktop GPU.",cs.CV,"['cs.CV', 'cs.GR']" +GaussianAvatars: Photorealistic Head Avatars with Rigged 3D Gaussians,Shenhan Qian · Tobias Kirschstein · Liam Schoneveld · Davide Davoli · Simon Giebenhain · Matthias Nießner,https://shenhanqian.github.io/gaussian-avatars,https://arxiv.org/abs/2312.02069,,2312.02069.pdf,GaussianAvatars: Photorealistic Head Avatars with Rigged 3D Gaussians,"We introduce GaussianAvatars, a new method to create photorealistic head +avatars that are fully controllable in terms of expression, pose, and +viewpoint. 
The core idea is a dynamic 3D representation based on 3D Gaussian +splats that are rigged to a parametric morphable face model. This combination +facilitates photorealistic rendering while allowing for precise animation +control via the underlying parametric model, e.g., through expression transfer +from a driving sequence or by manually changing the morphable model parameters. +We parameterize each splat by a local coordinate frame of a triangle and +optimize for explicit displacement offset to obtain a more accurate geometric +representation. During avatar reconstruction, we jointly optimize for the +morphable model parameters and Gaussian splat parameters in an end-to-end +fashion. We demonstrate the animation capabilities of our photorealistic avatar +in several challenging scenarios. For instance, we show reenactments from a +driving video, where our method outperforms existing works by a significant +margin.",cs.CV,['cs.CV'] +Garment Recovery with Shape and Deformation Priors,Ren Li · Corentin Dumery · Benoît Guillard · Pascal Fua, ,https://arxiv.org/abs/2311.10356,,2311.10356.pdf,Garment Recovery with Shape and Deformation Priors,"While modeling people wearing tight-fitting clothing has made great strides +in recent years, loose-fitting clothing remains a challenge. We propose a +method that delivers realistic garment models from real-world images, +regardless of garment shape or deformation. To this end, we introduce a fitting +approach that utilizes shape and deformation priors learned from synthetic data +to accurately capture garment shapes and deformations, including large ones. +Not only does our approach recover the garment geometry accurately, it also +yields models that can be directly used by downstream applications such as +animation and simulation.",cs.CV,['cs.CV'] +Neighbor Relations Matter in Video Scene Detection,Jiawei Tan · Hongxing Wang · Jiaxin Li · Zhilong Ou · Zhangbin Qian, ,,https://www.semanticscholar.org/paper/Characters-Link-Shots:-Character-Attention-Network-Tan-Wang/031a0952b156f36ea9da7113ade868754100e4b7,,,,,nan +The Audio-Visual Conversational Graph: From an Egocentric-Exocentric Perspective,Wenqi Jia · Miao Liu · Hao Jiang · Ishwarya Ananthabhotla · James Rehg · Vamsi Krishna Ithapu · Ruohan Gao, ,https://arxiv.org/abs/2312.12870,,2312.12870.pdf,The Audio-Visual Conversational Graph: From an Egocentric-Exocentric Perspective,"In recent years, the thriving development of research related to egocentric +videos has provided a unique perspective for the study of conversational +interactions, where both visual and audio signals play a crucial role. While +most prior work focus on learning about behaviors that directly involve the +camera wearer, we introduce the Ego-Exocentric Conversational Graph Prediction +problem, marking the first attempt to infer exocentric conversational +interactions from egocentric videos. We propose a unified multi-modal framework +-- Audio-Visual Conversational Attention (AV-CONV), for the joint prediction of +conversation behaviors -- speaking and listening -- for both the camera wearer +as well as all other social partners present in the egocentric video. +Specifically, we adopt the self-attention mechanism to model the +representations across-time, across-subjects, and across-modalities. To +validate our method, we conduct experiments on a challenging egocentric video +dataset that includes multi-speaker and multi-conversation scenarios. 
Our +results demonstrate the superior performance of our method compared to a series +of baselines. We also present detailed ablation studies to assess the +contribution of each component in our model. Check our project page at +https://vjwq.github.io/AV-CONV/.",cs.CV,['cs.CV'] +Dense Vision Transformer Compression with Few Samples,Hanxiao Zhang · Yifan Zhou · Guo-Hua Wang, ,https://arxiv.org/abs/2403.18708,,2403.18708.pdf,Dense Vision Transformer Compression with Few Samples,"Few-shot model compression aims to compress a large model into a more compact +one with only a tiny training set (even without labels). Block-level pruning +has recently emerged as a leading technique in achieving high accuracy and low +latency in few-shot CNN compression. But, few-shot compression for Vision +Transformers (ViT) remains largely unexplored, which presents a new challenge. +In particular, the issue of sparse compression exists in traditional CNN +few-shot methods, which can only produce very few compressed models of +different model sizes. This paper proposes a novel framework for few-shot ViT +compression named DC-ViT. Instead of dropping the entire block, DC-ViT +selectively eliminates the attention module while retaining and reusing +portions of the MLP module. DC-ViT enables dense compression, which outputs +numerous compressed models that densely populate the range of model complexity. +DC-ViT outperforms state-of-the-art few-shot compression methods by a +significant margin of 10 percentage points, along with lower latency in the +compression of ViT and its variants.",cs.CV,['cs.CV'] +Structure-from-Motion from Pixel-wise Correspondences,Philipp Lindenberger · Paul-Edouard Sarlin · Marc Pollefeys, ,http://export.arxiv.org/abs/2306.13643,,2306.13643.pdf,LightGlue: Local Feature Matching at Light Speed,"We introduce LightGlue, a deep neural network that learns to match local +features across images. We revisit multiple design decisions of SuperGlue, the +state of the art in sparse matching, and derive simple but effective +improvements. Cumulatively, they make LightGlue more efficient - in terms of +both memory and computation, more accurate, and much easier to train. One key +property is that LightGlue is adaptive to the difficulty of the problem: the +inference is much faster on image pairs that are intuitively easy to match, for +example because of a larger visual overlap or limited appearance change. This +opens up exciting prospects for deploying deep matchers in latency-sensitive +applications like 3D reconstruction. The code and trained models are publicly +available at https://github.com/cvg/LightGlue.",cs.CV,['cs.CV'] +KITRO: Refining Human Mesh by 2D Clues and Kinematic-tree Rotation,Fengyuan Yang · Kerui Gu · Angela Yao,https://github.com/MartaYang/KITRO,https://arxiv.org/abs/2405.19833,,2405.19833.pdf,KITRO: Refining Human Mesh by 2D Clues and Kinematic-tree Rotation,"2D keypoints are commonly used as an additional cue to refine estimated 3D +human meshes. Current methods optimize the pose and shape parameters with a +reprojection loss on the provided 2D keypoints. Such an approach, while simple +and intuitive, has limited effectiveness because the optimal solution is hard +to find in ambiguous parameter space and may sacrifice depth. Additionally, +divergent gradients from distal joints complicate and deviate the refinement of +proximal joints in the kinematic chain. 
To address these, we introduce +Kinematic-Tree Rotation (KITRO), a novel mesh refinement strategy that +explicitly models depth and human kinematic-tree structure. KITRO treats +refinement from a bone-wise perspective. Unlike previous methods which perform +gradient-based optimizations, our method calculates bone directions in closed +form. By accounting for the 2D pose, bone length, and parent joint's depth, the +calculation results in two possible directions for each child joint. We then +use a decision tree to trace binary choices for all bones along the human +skeleton's kinematic-tree to select the most probable hypothesis. Our +experiments across various datasets and baseline models demonstrate that KITRO +significantly improves 3D joint estimation accuracy and achieves an ideal 2D +fit simultaneously. Our code available at: https://github.com/MartaYang/KITRO.",cs.CV,['cs.CV'] +Orthogonal Adaptation for Modular Customization of Diffusion Models,Ryan Po · Guandao Yang · Kfir Aberman · Gordon Wetzstein, ,https://arxiv.org/abs/2312.02432,,2312.02432.pdf,Orthogonal Adaptation for Modular Customization of Diffusion Models,"Customization techniques for text-to-image models have paved the way for a +wide range of previously unattainable applications, enabling the generation of +specific concepts across diverse contexts and styles. While existing methods +facilitate high-fidelity customization for individual concepts or a limited, +pre-defined set of them, they fall short of achieving scalability, where a +single model can seamlessly render countless concepts. In this paper, we +address a new problem called Modular Customization, with the goal of +efficiently merging customized models that were fine-tuned independently for +individual concepts. This allows the merged model to jointly synthesize +concepts in one image without compromising fidelity or incurring any additional +computational costs. + To address this problem, we introduce Orthogonal Adaptation, a method +designed to encourage the customized models, which do not have access to each +other during fine-tuning, to have orthogonal residual weights. This ensures +that during inference time, the customized models can be summed with minimal +interference. + Our proposed method is both simple and versatile, applicable to nearly all +optimizable weights in the model architecture. Through an extensive set of +quantitative and qualitative evaluations, our method consistently outperforms +relevant baselines in terms of efficiency and identity preservation, +demonstrating a significant leap toward scalable customization of diffusion +models.",cs.CV,['cs.CV'] +Open-World Human-Object Interaction Detection via Multi-modal Prompts,Jie Yang · Bingliang Li · Ailing Zeng · Ailing Zeng · Lei Zhang · Ruimao Zhang, ,,https://openreview.net/forum?id=qrv4wcmmxe,,,,,nan +Adaptive Multi-Modal Cross-Entropy Loss for Stereo Matching,Peng Xu · Zhiyu Xiang · Chengyu Qiao · Jingyun Fu · Tianyu Pu, ,https://arxiv.org/abs/2306.15612,,2306.15612.pdf,Adaptive Multi-Modal Cross-Entropy Loss for Stereo Matching,"Despite the great success of deep learning in stereo matching, recovering +accurate disparity maps is still challenging. Currently, L1 and cross-entropy +are the two most widely used losses for stereo network training. Compared with +the former, the latter usually performs better thanks to its probability +modeling and direct supervision to the cost volume. 
However, how to accurately +model the stereo ground-truth for cross-entropy loss remains largely +under-explored. Existing works simply assume that the ground-truth +distributions are uni-modal, which ignores the fact that most of the edge +pixels can be multi-modal. In this paper, a novel adaptive multi-modal +cross-entropy loss (ADL) is proposed to guide the networks to learn different +distribution patterns for each pixel. Moreover, we optimize the disparity +estimator to further alleviate the bleeding or misalignment artifacts in +inference. Extensive experimental results show that our method is generic and +can help classic stereo networks regain state-of-the-art performance. In +particular, GANet with our method ranks $1^{st}$ on both the KITTI 2015 and +2012 benchmarks among the published methods. Meanwhile, excellent +synthetic-to-realistic generalization performance can be achieved by simply +replacing the traditional loss with ours.",cs.CV,['cs.CV'] +EvalCrafter: Benchmarking and Evaluating Large Video Generation Models,Yaofang Liu · Xiaodong Cun · Xuebo Liu · Xintao Wang · Yong Zhang · Haoxin Chen · Yang Liu · Tieyong Zeng · Raymond Chan · Ying Shan, ,https://arxiv.org/abs/2310.11440,,2310.11440.pdf,EvalCrafter: Benchmarking and Evaluating Large Video Generation Models,"The vision and language generative models have been overgrown in recent +years. For video generation, various open-sourced models and public-available +services have been developed to generate high-quality videos. However, these +methods often use a few metrics, e.g., FVD or IS, to evaluate the performance. +We argue that it is hard to judge the large conditional generative models from +the simple metrics since these models are often trained on very large datasets +with multi-aspect abilities. Thus, we propose a novel framework and pipeline +for exhaustively evaluating the performance of the generated videos. Our +approach involves generating a diverse and comprehensive list of 700 prompts +for text-to-video generation, which is based on an analysis of real-world user +data and generated with the assistance of a large language model. Then, we +evaluate the state-of-the-art video generative models on our carefully designed +benchmark, in terms of visual qualities, content qualities, motion qualities, +and text-video alignment with 17 well-selected objective metrics. To obtain the +final leaderboard of the models, we further fit a series of coefficients to +align the objective metrics to the users' opinions. Based on the proposed human +alignment method, our final score shows a higher correlation than simply +averaging the metrics, showing the effectiveness of the proposed evaluation +method.",cs.CV,['cs.CV'] +HandDiff: 3D Hand Pose Estimation with Diffusion on Image-Point Cloud,WENCAN CHENG · WENCAN CHENG · Hao Tang · Luc Van Gool · Jong Hwan Ko, ,https://arxiv.org/abs/2404.03159,,2404.03159.pdf,HandDiff: 3D Hand Pose Estimation with Diffusion on Image-Point Cloud,"Extracting keypoint locations from input hand frames, known as 3D hand pose +estimation, is a critical task in various human-computer interaction +applications. Essentially, the 3D hand pose estimation can be regarded as a 3D +point subset generative problem conditioned on input frames. Thanks to the +recent significant progress on diffusion-based generative models, hand pose +estimation can also benefit from the diffusion model to estimate keypoint +locations with high quality. 
However, directly deploying the existing diffusion +models to solve hand pose estimation is non-trivial, since they cannot achieve +the complex permutation mapping and precise localization. Based on this +motivation, this paper proposes HandDiff, a diffusion-based hand pose +estimation model that iteratively denoises accurate hand pose conditioned on +hand-shaped image-point clouds. In order to recover keypoint permutation and +accurate location, we further introduce joint-wise condition and local detail +condition. Experimental results demonstrate that the proposed HandDiff +significantly outperforms the existing approaches on four challenging hand pose +benchmark datasets. Codes and pre-trained models are publicly available at +https://github.com/cwc1260/HandDiff.",cs.CV,['cs.CV'] +Tuning Stable Rank Shrinkage: Aiming at the Overlooked Structural Risk in Fine-tuning,Sicong Shen · Yang Zhou · Bingzheng Wei · Eric Chang · Yan Xu, ,https://arxiv.org/abs/2312.03732,,2312.03732.pdf,A Rank Stabilization Scaling Factor for Fine-Tuning with LoRA,"As large language models (LLMs) have become increasingly compute and memory +intensive, parameter-efficient fine-tuning (PEFT) methods are now a common +strategy to fine-tune LLMs. A popular PEFT method is Low-Rank Adapters (LoRA), +which adds trainable low-rank ""adapters"" to selected layers. Each adapter +consists of a low-rank matrix product, multiplicatively scaled by a +rank-dependent factor. This scaling factor, which divides adapters by a factor +of the rank, results in slowed learning and stunted performance for LoRA with +higher-rank adapters. Consequently, the use of LoRA in practice has generally +been limited to very low ranks. In this work, we study the impact of the +scaling factor on the learning process and prove that LoRA adapters should be +divided by a factor of the square root of the rank. Modifying LoRA with the +appropriate scaling factor, which we call the rank-stabilized LoRA (rsLoRA) +method, easily provides for a fine-tuning compute/performance trade-off, where +larger ranks can be used to trade off increased computational resources during +training for better fine-tuning performance, with no change in inference +computing cost.",cs.CL,"['cs.CL', 'cs.LG', 'I.2.7']" +En3D: An Enhanced Generative Model for Sculpting 3D Humans from 2D Synthetic Data,Yifang Men · Biwen Lei · Yuan Yao · Miaomiao Cui · Zhouhui Lian · Xuansong Xie,https://menyifang.github.io/projects/En3D/index.html,https://arxiv.org/abs/2401.01173,,2401.01173.pdf,En3D: An Enhanced Generative Model for Sculpting 3D Humans from 2D Synthetic Data,"We present En3D, an enhanced generative scheme for sculpting high-quality 3D +human avatars. Unlike previous works that rely on scarce 3D datasets or limited +2D collections with imbalanced viewing angles and imprecise pose priors, our +approach aims to develop a zero-shot 3D generative scheme capable of producing +visually realistic, geometrically accurate and content-wise diverse 3D humans +without relying on pre-existing 3D or 2D assets. To address this challenge, we +introduce a meticulously crafted workflow that implements accurate physical +modeling to learn the enhanced 3D generative model from synthetic 2D data. +During inference, we integrate optimization modules to bridge the gap between +realistic appearances and coarse 3D shapes. 
Specifically, En3D comprises three +modules: a 3D generator that accurately models generalizable 3D humans with +realistic appearance from synthesized balanced, diverse, and structured human +images; a geometry sculptor that enhances shape quality using multi-view normal +constraints for intricate human anatomy; and a texturing module that +disentangles explicit texture maps with fidelity and editability, leveraging +semantical UV partitioning and a differentiable rasterizer. Experimental +results show that our approach significantly outperforms prior works in terms +of image quality, geometry accuracy and content diversity. We also showcase the +applicability of our generated avatars for animation and editing, as well as +the scalability of our approach for content-style free adaptation.",cs.CV,['cs.CV'] +Differentiable Point-based Inverse Rendering,Hoon-Gyu Chung · Seokjun Choi · Seung-Hwan Baek,https://hg-chung.github.io/DPIR/,https://arxiv.org/abs/2312.02480,,2312.02480.pdf,Differentiable Point-based Inverse Rendering,"We present differentiable point-based inverse rendering, DPIR, an +analysis-by-synthesis method that processes images captured under diverse +illuminations to estimate shape and spatially-varying BRDF. To this end, we +adopt point-based rendering, eliminating the need for multiple samplings per +ray, typical of volumetric rendering, thus significantly enhancing the speed of +inverse rendering. To realize this idea, we devise a hybrid point-volumetric +representation for geometry and a regularized basis-BRDF representation for +reflectance. The hybrid geometric representation enables fast rendering through +point-based splatting while retaining the geometric details and stability +inherent to SDF-based representations. The regularized basis-BRDF mitigates the +ill-posedness of inverse rendering stemming from limited light-view angular +samples. We also propose an efficient shadow detection method using point-based +shadow map rendering. Our extensive evaluations demonstrate that DPIR +outperforms prior works in terms of reconstruction accuracy, computational +efficiency, and memory footprint. Furthermore, our explicit point-based +representation and rendering enables intuitive geometry and reflectance +editing.",cs.CV,['cs.CV'] +ICP-Flow: LiDAR Scene Flow Estimation with ICP,Yancong Lin · Holger Caesar,https://github.com/yanconglin/ICP-Flow,https://arxiv.org/abs/2402.17351,,2402.17351.pdf,ICP-Flow: LiDAR Scene Flow Estimation with ICP,"Scene flow characterizes the 3D motion between two LiDAR scans captured by an +autonomous vehicle at nearby timesteps. Prevalent methods consider scene flow +as point-wise unconstrained flow vectors that can be learned by either +large-scale training beforehand or time-consuming optimization at inference. +However, these methods do not take into account that objects in autonomous +driving often move rigidly. We incorporate this rigid-motion assumption into +our design, where the goal is to associate objects over scans and then estimate +the locally rigid transformations. We propose ICP-Flow, a learning-free flow +estimator. The core of our design is the conventional Iterative Closest Point +(ICP) algorithm, which aligns the objects over time and outputs the +corresponding rigid transformations. Crucially, to aid ICP, we propose a +histogram-based initialization that discovers the most likely translation, thus +providing a good starting point for ICP. The complete scene flow is then +recovered from the rigid transformations. 
We outperform state-of-the-art +baselines, including supervised models, on the Waymo dataset and perform +competitively on Argoverse-v2 and nuScenes. Further, we train a feedforward +neural network, supervised by the pseudo labels from our model, and achieve top +performance among all models capable of real-time inference. We validate the +advantage of our model on scene flow estimation with longer temporal gaps, up +to 0.4 seconds where other models fail to deliver meaningful results.",cs.CV,['cs.CV'] +Rolling Shutter Correction with Intermediate Distortion Flow Estimation,Mingdeng Cao · Sidi Yang · Yujiu Yang · Yinqiang Zheng,https://github.com/ljzycmd/DFRSC,https://arxiv.org/abs/2404.06350,,2404.06350.pdf,Rolling Shutter Correction with Intermediate Distortion Flow Estimation,"This paper proposes to correct the rolling shutter (RS) distorted images by +estimating the distortion flow from the global shutter (GS) to RS directly. +Existing methods usually perform correction using the undistortion flow from +the RS to GS. They initially predict the flow from consecutive RS frames, +subsequently rescaling it as the displacement fields from the RS frame to the +underlying GS image using time-dependent scaling factors. Following this, +RS-aware forward warping is employed to convert the RS image into its GS +counterpart. Nevertheless, this strategy is prone to two shortcomings. First, +the undistortion flow estimation is rendered inaccurate by merely linear +scaling the flow, due to the complex non-linear motion nature. Second, RS-aware +forward warping often results in unavoidable artifacts. To address these +limitations, we introduce a new framework that directly estimates the +distortion flow and rectifies the RS image with the backward warping operation. +More specifically, we first propose a global correlation-based flow attention +mechanism to estimate the initial distortion flow and GS feature jointly, which +are then refined by the following coarse-to-fine decoder layers. Additionally, +a multi-distortion flow prediction strategy is integrated to mitigate the issue +of inaccurate flow estimation further. Experimental results validate the +effectiveness of the proposed method, which outperforms state-of-the-art +approaches on various benchmarks while maintaining high efficiency. The project +is available at \url{https://github.com/ljzycmd/DFRSC}.",cs.CV,['cs.CV'] +Programmable Motion Generation for Open-set Motion Control Tasks,Hanchao Liu · Xiaohang Zhan · Shaoli Huang · Tai-Jiang Mu · Ying Shan, ,https://arxiv.org/abs/2405.19283,,2405.19283.pdf,Programmable Motion Generation for Open-Set Motion Control Tasks,"Character animation in real-world scenarios necessitates a variety of +constraints, such as trajectories, key-frames, interactions, etc. Existing +methodologies typically treat single or a finite set of these constraint(s) as +separate control tasks. They are often specialized, and the tasks they address +are rarely extendable or customizable. We categorize these as solutions to the +close-set motion control problem. In response to the complexity of practical +motion control, we propose and attempt to solve the open-set motion control +problem. This problem is characterized by an open and fully customizable set of +motion control tasks. To address this, we introduce a new paradigm, +programmable motion generation. In this paradigm, any given motion control task +is broken down into a combination of atomic constraints. 
These constraints are +then programmed into an error function that quantifies the degree to which a +motion sequence adheres to them. We utilize a pre-trained motion generation +model and optimize its latent code to minimize the error function of the +generated motion. Consequently, the generated motion not only inherits the +prior of the generative model but also satisfies the required constraints. +Experiments show that we can generate high-quality motions when addressing a +wide range of unseen tasks. These tasks encompass motion control by motion +dynamics, geometric constraints, physical laws, interactions with scenes, +objects or the character own body parts, etc. All of these are achieved in a +unified approach, without the need for ad-hoc paired training data collection +or specialized network designs. During the programming of novel tasks, we +observed the emergence of new skills beyond those of the prior model. With the +assistance of large language models, we also achieved automatic programming. We +hope that this work will pave the way for the motion control of general AI +agents.",cs.CV,['cs.CV'] +Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation,Bingxin Ke · Anton Obukhov · Shengyu Huang · Nando Metzger · Rodrigo Caye Daudt · Konrad Schindler, ,https://arxiv.org/abs/2312.02145,,2312.02145.pdf,Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation,"Monocular depth estimation is a fundamental computer vision task. Recovering +3D depth from a single image is geometrically ill-posed and requires scene +understanding, so it is not surprising that the rise of deep learning has led +to a breakthrough. The impressive progress of monocular depth estimators has +mirrored the growth in model capacity, from relatively modest CNNs to large +Transformer architectures. Still, monocular depth estimators tend to struggle +when presented with images with unfamiliar content and layout, since their +knowledge of the visual world is restricted by the data seen during training, +and challenged by zero-shot generalization to new domains. This motivates us to +explore whether the extensive priors captured in recent generative diffusion +models can enable better, more generalizable depth estimation. We introduce +Marigold, a method for affine-invariant monocular depth estimation that is +derived from Stable Diffusion and retains its rich prior knowledge. The +estimator can be fine-tuned in a couple of days on a single GPU using only +synthetic training data. It delivers state-of-the-art performance across a wide +range of datasets, including over 20% performance gains in specific cases. +Project page: https://marigoldmonodepth.github.io.",cs.CV,['cs.CV'] +I'M HOI: Inertia-aware Monocular Capture of 3D Human-Object Interactions,Chengfeng Zhao · Juze Zhang · Jiashen Du · Ziwei Shan · Junye Wang · Jingyi Yu · Jingya Wang · Lan Xu, ,https://arxiv.org/abs/2312.08869,,2312.08869.pdf,I'M HOI: Inertia-aware Monocular Capture of 3D Human-Object Interactions,"We are living in a world surrounded by diverse and ""smart"" devices with rich +modalities of sensing ability. Conveniently capturing the interactions between +us humans and these objects remains far-reaching. In this paper, we present +I'm-HOI, a monocular scheme to faithfully capture the 3D motions of both the +human and object in a novel setting: using a minimal amount of RGB camera and +object-mounted Inertial Measurement Unit (IMU). It combines general motion +inference and category-aware refinement. 
For the former, we introduce a +holistic human-object tracking method to fuse the IMU signals and the RGB +stream and progressively recover the human motions and subsequently the +companion object motions. For the latter, we tailor a category-aware motion +diffusion model, which is conditioned on both the raw IMU observations and the +results from the previous stage under over-parameterization representation. It +significantly refines the initial results and generates vivid body, hand, and +object motions. Moreover, we contribute a large dataset with ground truth human +and object motions, dense RGB inputs, and rich object-mounted IMU measurements. +Extensive experiments demonstrate the effectiveness of I'm-HOI under a hybrid +capture setting. Our dataset and code will be released to the community.",cs.CV,['cs.CV'] +From a Bird’s Eye View to See: Joint Camera and Subject Registration without the Camera Calibration,Zekun Qian · Ruize Han · Wei Feng · Song Wang,https://github.com/zekunqian/bevsee,,https://allainews.com/item/from-a-birds-eye-view-to-see-joint-camera-and-subject-registration-without-the-camera-calibration-2024-04-30/,,,,,nan +LMDrive: Closed-Loop End-to-End Driving with Large Language Models,Hao Shao · Yuxuan Hu · Letian Wang · Guanglu Song · Steven L. Waslander · Yu Liu · Hongsheng Li, ,https://arxiv.org/abs/2312.07488,,2312.07488.pdf,LMDrive: Closed-Loop End-to-End Driving with Large Language Models,"Despite significant recent progress in the field of autonomous driving, +modern methods still struggle and can incur serious accidents when encountering +long-tail unforeseen events and challenging urban scenarios. On the one hand, +large language models (LLM) have shown impressive reasoning capabilities that +approach ""Artificial General Intelligence"". On the other hand, previous +autonomous driving methods tend to rely on limited-format inputs (e.g. sensor +data and navigation waypoints), restricting the vehicle's ability to understand +language information and interact with humans. To this end, this paper +introduces LMDrive, a novel language-guided, end-to-end, closed-loop autonomous +driving framework. LMDrive uniquely processes and integrates multi-modal sensor +data with natural language instructions, enabling interaction with humans and +navigation software in realistic instructional settings. To facilitate further +research in language-based closed-loop autonomous driving, we also publicly +release the corresponding dataset which includes approximately 64K +instruction-following data clips, and the LangAuto benchmark that tests the +system's ability to handle complex instructions and challenging driving +scenarios. Extensive closed-loop experiments are conducted to demonstrate +LMDrive's effectiveness. To the best of our knowledge, we're the very first +work to leverage LLMs for closed-loop end-to-end autonomous driving. Codes, +models, and datasets can be found at https://github.com/opendilab/LMDrive",cs.CV,"['cs.CV', 'cs.AI', 'cs.RO']" +TUMTraf V2X Cooperative Perception Dataset,Walter Zimmer · Gerhard Arya Wardana · Suren Sritharan · Xingcheng Zhou · Rui Song · Alois Knoll,https://tum-traffic-dataset.github.io/tumtraf-v2x,https://arxiv.org/abs/2403.01316,,2403.01316.pdf,TUMTraf V2X Cooperative Perception Dataset,"Cooperative perception offers several benefits for enhancing the capabilities +of autonomous vehicles and improving road safety. Using roadside sensors in +addition to onboard sensors increases reliability and extends the sensor range. 
+External sensors offer higher situational awareness for automated vehicles and +prevent occlusions. We propose CoopDet3D, a cooperative multi-modal fusion +model, and TUMTraf-V2X, a perception dataset, for the cooperative 3D object +detection and tracking task. Our dataset contains 2,000 labeled point clouds +and 5,000 labeled images from five roadside and four onboard sensors. It +includes 30k 3D boxes with track IDs and precise GPS and IMU data. We labeled +eight categories and covered occlusion scenarios with challenging driving +maneuvers, like traffic violations, near-miss events, overtaking, and U-turns. +Through multiple experiments, we show that our CoopDet3D camera-LiDAR fusion +model achieves an increase of +14.36 3D mAP compared to a vehicle camera-LiDAR +fusion model. Finally, we make our dataset, model, labeling tool, and dev-kit +publicly available on our website: +https://tum-traffic-dataset.github.io/tumtraf-v2x.",cs.CV,['cs.CV'] +Unleashing Unlabeled Data: A Paradigm for Cross-View Geo-Localization,Guopeng Li · Ming Qian · Gui-Song Xia, ,https://arxiv.org/abs/2403.14198v1,,2403.14198v1.pdf,Unleashing Unlabeled Data: A Paradigm for Cross-View Geo-Localization,"This paper investigates the effective utilization of unlabeled data for +large-area cross-view geo-localization (CVGL), encompassing both unsupervised +and semi-supervised settings. Common approaches to CVGL rely on +ground-satellite image pairs and employ label-driven supervised training. +However, the cost of collecting precise cross-view image pairs hinders the +deployment of CVGL in real-life scenarios. Without the pairs, CVGL will be more +challenging to handle the significant imaging and spatial gaps between ground +and satellite images. To this end, we propose an unsupervised framework +including a cross-view projection to guide the model for retrieving initial +pseudo-labels and a fast re-ranking mechanism to refine the pseudo-labels by +leveraging the fact that ``the perfectly paired ground-satellite image is +located in a unique and identical scene"". The framework exhibits competitive +performance compared with supervised works on three open-source benchmarks. Our +code and models will be released on https://github.com/liguopeng0923/UCVGL.",cs.CV,['cs.CV'] +Deep Imbalanced Regression via Hierarchical Classification Adjustment,Haipeng Xiong · Angela Yao, ,https://arxiv.org/abs/2310.17154,,2310.17154.pdf,Deep Imbalanced Regression via Hierarchical Classification Adjustment,"Regression tasks in computer vision, such as age estimation or counting, are +often formulated into classification by quantizing the target space into +classes. Yet real-world data is often imbalanced -- the majority of training +samples lie in a head range of target values, while a minority of samples span +a usually larger tail range. By selecting the class quantization, one can +adjust imbalanced regression targets into balanced classification outputs, +though there are trade-offs in balancing classification accuracy and +quantization error. To improve regression performance over the entire range of +data, we propose to construct hierarchical classifiers for solving imbalanced +regression tasks. The fine-grained classifiers limit the quantization error +while being modulated by the coarse predictions to ensure high accuracy. +Standard hierarchical classification approaches, however, when applied to the +regression problem, fail to ensure that predicted ranges remain consistent +across the hierarchy. 
As such, we propose a range-preserving distillation +process that can effectively learn a single classifier from the set of +hierarchical classifiers. Our novel hierarchical classification adjustment +(HCA) for imbalanced regression shows superior results on three diverse tasks: +age estimation, crowd counting and depth estimation. We will release the source +code upon acceptance.",cs.CV,['cs.CV'] +Ensemble Diversity Facilitates Adversarial Transferability,Bowen Tang · Zheng Wang · Yi Bin · Qi Dou · Yang Yang · Heng Tao Shen, ,https://arxiv.org/abs/2403.16405,,2403.16405.pdf,Ensemble Adversarial Defense via Integration of Multiple Dispersed Low Curvature Models,"The integration of an ensemble of deep learning models has been extensively +explored to enhance defense against adversarial attacks. The diversity among +sub-models increases the attack cost required to deceive the majority of the +ensemble, thereby improving the adversarial robustness. While existing +approaches mainly center on increasing diversity in feature representations or +dispersion of first-order gradients with respect to input, the limited +correlation between these diversity metrics and adversarial robustness +constrains the performance of ensemble adversarial defense. In this work, we +aim to enhance ensemble diversity by reducing attack transferability. We +identify second-order gradients, which depict the loss curvature, as a key +factor in adversarial robustness. Computing the Hessian matrix involved in +second-order gradients is computationally expensive. To address this, we +approximate the Hessian-vector product using differential approximation. Given +that low curvature provides better robustness, our ensemble model was designed +to consider the influence of curvature among different sub-models. We introduce +a novel regularizer to train multiple more-diverse low-curvature network +models. Extensive experiments across various datasets demonstrate that our +ensemble model exhibits superior robustness against a range of attacks, +underscoring the effectiveness of our approach.",cs.LG,"['cs.LG', 'cs.CR', 'cs.CV']" +DreamSalon: A Staged Diffusion Framework for Preserving Identity-Context in Editable Face Generation,Haonan Lin, ,https://arxiv.org/abs/2403.19235,,2403.19235.pdf,DreamSalon: A Staged Diffusion Framework for Preserving Identity-Context in Editable Face Generation,"While large-scale pre-trained text-to-image models can synthesize diverse and +high-quality human-centered images, novel challenges arise with a nuanced task +of ""identity fine editing"": precisely modifying specific features of a subject +while maintaining its inherent identity and context. Existing personalization +methods either require time-consuming optimization or learning additional +encoders, adept in ""identity re-contextualization"". However, they often +struggle with detailed and sensitive tasks like human face editing. To address +these challenges, we introduce DreamSalon, a noise-guided, staged-editing +framework, uniquely focusing on detailed image manipulations and +identity-context preservation. By discerning editing and boosting stages via +the frequency and gradient of predicted noises, DreamSalon first performs +detailed manipulations on specific features in the editing stage, guided by +high-frequency information, and then employs stochastic denoising in the +boosting stage to improve image quality. 
For more precise editing, DreamSalon +semantically mixes source and target textual prompts, guided by differences in +their embedding covariances, to direct the model's focus on specific +manipulation areas. Our experiments demonstrate DreamSalon's ability to +efficiently and faithfully edit fine details on human faces, outperforming +existing methods both qualitatively and quantitatively.",cs.CV,['cs.CV'] +RILA: Reflective and Imaginative Language Agent for Zero-Shot Semantic Audio-Visual Navigation,Zeyuan Yang · LIU JIAGENG · Peihao Chen · Anoop Cherian · Tim Marks · Jonathan Le Roux · Chuang Gan, ,,https://github.com/zchoi/Awesome-Embodied-Agent-with-LLMs,,,,,nan +FFF: Fixing Flawed Foundations in contrastive pre-training results in very strong Vision-Language models,Adrian Bulat · Yassine Ouali · Georgios Tzimiropoulos, ,https://arxiv.org/abs/2405.10286,,2405.10286.pdf,FFF: Fixing Flawed Foundations in contrastive pre-training results in very strong Vision-Language models,"Despite noise and caption quality having been acknowledged as important +factors impacting vision-language contrastive pre-training, in this paper, we +show that the full potential of improving the training process by addressing +such issues is yet to be realized. Specifically, we firstly study and analyze +two issues affecting training: incorrect assignment of negative pairs, and low +caption quality and diversity. Then, we devise effective solutions for +addressing both problems, which essentially require training with multiple true +positive pairs. Finally, we propose training with sigmoid loss to address such +a requirement. We show very large gains over the current state-of-the-art for +both image recognition ($\sim +6\%$ on average over 11 datasets) and image +retrieval ($\sim +19\%$ on Flickr30k and $\sim +15\%$ on MSCOCO).",cs.CV,"['cs.CV', 'cs.AI']" +Expandable Subspace Ensemble for Pre-Trained Model-Based Class-Incremental Learning,Da-Wei Zhou · Hai-Long Sun · Han-Jia Ye · De-Chuan Zhan,https://github.com/sun-hailong/CVPR24-Ease,https://arxiv.org/abs/2403.12030v1,,2403.12030v1.pdf,Expandable Subspace Ensemble for Pre-Trained Model-Based Class-Incremental Learning,"Class-Incremental Learning (CIL) requires a learning system to continually +learn new classes without forgetting. Despite the strong performance of +Pre-Trained Models (PTMs) in CIL, a critical issue persists: learning new +classes often results in the overwriting of old ones. Excessive modification of +the network causes forgetting, while minimal adjustments lead to an inadequate +fit for new classes. As a result, it is desired to figure out a way of +efficient model updating without harming former knowledge. In this paper, we +propose ExpAndable Subspace Ensemble (EASE) for PTM-based CIL. To enable model +updating without conflict, we train a distinct lightweight adapter module for +each new task, aiming to create task-specific subspaces. These adapters span a +high-dimensional feature space, enabling joint decision-making across multiple +subspaces. As data evolves, the expanding subspaces render the old class +classifiers incompatible with new-stage spaces. Correspondingly, we design a +semantic-guided prototype complement strategy that synthesizes old classes' new +features without using any old class instance. Extensive experiments on seven +benchmark datasets verify EASE's state-of-the-art performance. 
Code is +available at: https://github.com/sun-hailong/CVPR24-Ease",cs.CV,"['cs.CV', 'cs.LG']" +Generating Handwritten Mathematical Expressions From Symbol Graphs: An End-to-End Pipeline,Yu chen · Fei Gao · Yanguang Zhang · Maoying Qiao · Nannan Wang,https://github.com/AiArt-HDU/HMEG,,https://link.springer.com/chapter/10.1007/978-3-031-41676-7_9,,,,,nan +AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation,Jeongsoo Choi · Se Jin Park · Minsu Kim · Yong Man Ro, ,https://arxiv.org/html/2312.02512v2,,2312.02512v2.pdf,AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation,"This paper proposes a novel direct Audio-Visual Speech to Audio-Visual Speech +Translation (AV2AV) framework, where the input and output of the system are +multimodal (i.e., audio and visual speech). With the proposed AV2AV, two key +advantages can be brought: 1) We can perform real-like conversations with +individuals worldwide in a virtual meeting by utilizing our own primary +languages. In contrast to Speech-to-Speech Translation (A2A), which solely +translates between audio modalities, the proposed AV2AV directly translates +between audio-visual speech. This capability enhances the dialogue experience +by presenting synchronized lip movements along with the translated speech. 2) +We can improve the robustness of the spoken language translation system. By +employing the complementary information of audio-visual speech, the system can +effectively translate spoken language even in the presence of acoustic noise, +showcasing robust performance. To mitigate the problem of the absence of a +parallel AV2AV translation dataset, we propose to train our spoken language +translation system with the audio-only dataset of A2A. This is done by learning +unified audio-visual speech representations through self-supervised learning in +advance to train the translation system. Moreover, we propose an AV-Renderer +that can generate raw audio and video in parallel. It is designed with +zero-shot speaker modeling, thus the speaker in source audio-visual speech can +be maintained at the target translated audio-visual speech. The effectiveness +of AV2AV is evaluated with extensive experiments in a many-to-many language +translation setting. Demo page is available on +https://choijeongsoo.github.io/av2av.",cs.CV,"['cs.CV', 'cs.AI', 'cs.MM', 'eess.AS']" +DSGG: Dense Relation Transformer for an End-to-end Scene Graph Generation,Zeeshan Hayder · Xuming He,https://zeeshanhayder.github.io/DSGG,https://arxiv.org/abs/2403.14886,,2403.14886.pdf,DSGG: Dense Relation Transformer for an End-to-end Scene Graph Generation,"Scene graph generation aims to capture detailed spatial and semantic +relationships between objects in an image, which is challenging due to +incomplete labelling, long-tailed relationship categories, and relational +semantic overlap. Existing Transformer-based methods either employ distinct +queries for objects and predicates or utilize holistic queries for relation +triplets and hence often suffer from limited capacity in learning low-frequency +relationships. In this paper, we present a new Transformer-based method, called +DSGG, that views scene graph detection as a direct graph prediction problem +based on a unique set of graph-aware queries. 
In particular, each graph-aware +query encodes a compact representation of both the node and all of its +relations in the graph, acquired through the utilization of a relaxed sub-graph +matching during the training process. Moreover, to address the problem of +relational semantic overlap, we utilize a strategy for relation distillation, +aiming to efficiently learn multiple instances of semantic relationships. +Extensive experiments on the VG and the PSG datasets show that our model +achieves state-of-the-art results, showing a significant improvement of 3.5\% +and 6.7\% in mR@50 and mR@100 for the scene-graph generation task and achieves +an even more substantial improvement of 8.5\% and 10.3\% in mR@50 and mR@100 +for the panoptic scene graph generation task. Code is available at +\url{https://github.com/zeeshanhayder/DSGG}.",cs.CV,['cs.CV'] +Learn from View Correlation: An Anchor Enhancement Strategy for Multi-view Clustering,Suyuan Liu · KE LIANG · Zhibin Dong · Siwei Wang · Xihong Yang · sihang zhou · En Zhu · Xinwang Liu, ,https://arxiv.org/abs/2309.00024,,2309.00024.pdf,Efficient Multi-View Graph Clustering with Local and Global Structure Preservation,"Anchor-based multi-view graph clustering (AMVGC) has received abundant +attention owing to its high efficiency and the capability to capture +complementary structural information across multiple views. Intuitively, a +high-quality anchor graph plays an essential role in the success of AMVGC. +However, the existing AMVGC methods only consider single-structure information, +i.e., local or global structure, which provides insufficient information for +the learning task. To be specific, the over-scattered global structure leads to +learned anchors failing to depict the cluster partition well. In contrast, the +local structure with an improper similarity measure results in potentially +inaccurate anchor assignment, ultimately leading to sub-optimal clustering +performance. To tackle the issue, we propose a novel anchor-based multi-view +graph clustering framework termed Efficient Multi-View Graph Clustering with +Local and Global Structure Preservation (EMVGC-LG). Specifically, a unified +framework with a theoretical guarantee is designed to capture local and global +information. Besides, EMVGC-LG jointly optimizes anchor construction and graph +learning to enhance the clustering quality. In addition, EMVGC-LG inherits the +linear complexity of existing AMVGC methods respecting the sample number, which +is time-economical and scales well with the data size. Extensive experiments +demonstrate the effectiveness and efficiency of our proposed method.",cs.LG,['cs.LG'] +SportsSloMo: A New Benchmark and Baselines for Human-centric Video Frame Interpolation,Jiaben Chen · Huaizu Jiang, ,https://arxiv.org/abs/2308.16876v2,,2308.16876v2.pdf,SportsSloMo: A New Benchmark and Baselines for Human-centric Video Frame Interpolation,"Human-centric video frame interpolation has great potential for improving +people's entertainment experiences and finding commercial applications in the +sports analysis industry, e.g., synthesizing slow-motion videos. Although there +are multiple benchmark datasets available in the community, none of them is +dedicated for human-centric scenarios. To bridge this gap, we introduce +SportsSloMo, a benchmark consisting of more than 130K video clips and 1M video +frames of high-resolution ($\geq$720p) slow-motion sports videos crawled from +YouTube. 
We re-train several state-of-the-art methods on our benchmark, and the +results show a decrease in their accuracy compared to other datasets. It +highlights the difficulty of our benchmark and suggests that it poses +significant challenges even for the best-performing methods, as human bodies +are highly deformable and occlusions are frequent in sports videos. To improve +the accuracy, we introduce two loss terms considering the human-aware priors, +where we add auxiliary supervision to panoptic segmentation and human keypoints +detection, respectively. The loss terms are model agnostic and can be easily +plugged into any video frame interpolation approaches. Experimental results +validate the effectiveness of our proposed loss terms, leading to consistent +performance improvement over 5 existing models, which establish strong baseline +models on our benchmark. The dataset and code can be found at: +https://neu-vi.github.io/SportsSlomo/.",cs.CV,['cs.CV'] +G-NeRF: Geometry-enhanced Novel View Synthesis from Single-View Images,Zixiong Huang · Qi Chen · Libo Sun · Yifan Yang · Naizhou Wang · Qi Wu · Mingkui Tan, ,https://arxiv.org/abs/2404.07474,,2404.07474.pdf,G-NeRF: Geometry-enhanced Novel View Synthesis from Single-View Images,"Novel view synthesis aims to generate new view images of a given view image +collection. Recent attempts address this problem relying on 3D geometry priors +(e.g., shapes, sizes, and positions) learned from multi-view images. However, +such methods encounter the following limitations: 1) they require a set of +multi-view images as training data for a specific scene (e.g., face, car or +chair), which is often unavailable in many real-world scenarios; 2) they fail +to extract the geometry priors from single-view images due to the lack of +multi-view supervision. In this paper, we propose a Geometry-enhanced NeRF +(G-NeRF), which seeks to enhance the geometry priors by a geometry-guided +multi-view synthesis approach, followed by a depth-aware training. In the +synthesis process, inspired that existing 3D GAN models can unconditionally +synthesize high-fidelity multi-view images, we seek to adopt off-the-shelf 3D +GAN models, such as EG3D, as a free source to provide geometry priors through +synthesizing multi-view data. Simultaneously, to further improve the geometry +quality of the synthetic data, we introduce a truncation method to effectively +sample latent codes within 3D GAN models. To tackle the absence of multi-view +supervision for single-view images, we design the depth-aware training +approach, incorporating a depth-aware discriminator to guide geometry priors +through depth maps. Experiments demonstrate the effectiveness of our method in +terms of both qualitative and quantitative results.",cs.CV,['cs.CV'] +MaskPLAN: Masked Generative Layout Planning from Partial Input,Hang Zhang · Anton Savov · Benjamin Dillenburger, ,https://arxiv.org/abs/2312.05039,,2312.05039.pdf,SmartMask: Context Aware High-Fidelity Mask Generation for Fine-grained Object Insertion and Layout Control,"The field of generative image inpainting and object insertion has made +significant progress with the recent advent of latent diffusion models. +Utilizing a precise object mask can greatly enhance these applications. +However, due to the challenges users encounter in creating high-fidelity masks, +there is a tendency for these methods to rely on more coarse masks (e.g., +bounding box) for these applications. 
This results in limited control and +compromised background content preservation. To overcome these limitations, we +introduce SmartMask, which allows any novice user to create detailed masks for +precise object insertion. Combined with a ControlNet-Inpaint model, our +experiments demonstrate that SmartMask achieves superior object insertion +quality, preserving the background content more effectively than previous +methods. Notably, unlike prior works the proposed approach can also be used +even without user-mask guidance, which allows it to perform mask-free object +insertion at diverse positions and scales. Furthermore, we find that when used +iteratively with a novel instruction-tuning based planning model, SmartMask can +be used to design detailed layouts from scratch. As compared with user-scribble +based layout design, we observe that SmartMask allows for better quality +outputs with layout-to-image generation methods. Project page is available at +https://smartmask-gen.github.io",cs.CV,"['cs.CV', 'cs.AI', 'cs.HC', 'cs.LG', 'cs.MM']" +OneLLM: One Framework to Align All Modalities with Language,Jiaming Han · Kaixiong Gong · Yiyuan Zhang · Jiaqi Wang · Kaipeng Zhang · Dahua Lin · Yu Qiao · Peng Gao · Xiangyu Yue, ,https://arxiv.org/abs/2312.03700,,2312.03700.pdf,OneLLM: One Framework to Align All Modalities with Language,"Multimodal large language models (MLLMs) have gained significant attention +due to their strong multimodal understanding capability. However, existing +works rely heavily on modality-specific encoders, which usually differ in +architecture and are limited to common modalities. In this paper, we present +OneLLM, an MLLM that aligns eight modalities to language using a unified +framework. We achieve this through a unified multimodal encoder and a +progressive multimodal alignment pipeline. In detail, we first train an image +projection module to connect a vision encoder with LLM. Then, we build a +universal projection module (UPM) by mixing multiple image projection modules +and dynamic routing. Finally, we progressively align more modalities to LLM +with the UPM. To fully leverage the potential of OneLLM in following +instructions, we also curated a comprehensive multimodal instruction dataset, +including 2M items from image, audio, video, point cloud, depth/normal map, IMU +and fMRI brain activity. OneLLM is evaluated on 25 diverse benchmarks, +encompassing tasks such as multimodal captioning, question answering and +reasoning, where it delivers excellent performance. Code, data, model and +online demo are available at https://github.com/csuhan/OneLLM",cs.CV,"['cs.CV', 'cs.AI', 'cs.CL', 'cs.LG', 'cs.MM']" +Open-World Semantic Segmentation Including Class Similarity,Matteo Sodano · Federico Magistri · Lucas Nunes · Jens Behley · Cyrill Stachniss, ,https://arxiv.org/abs/2403.07532,,2403.07532.pdf,Open-World Semantic Segmentation Including Class Similarity,"Interpreting camera data is key for autonomously acting systems, such as +autonomous vehicles. Vision systems that operate in real-world environments +must be able to understand their surroundings and need the ability to deal with +novel situations. This paper tackles open-world semantic segmentation, i.e., +the variant of interpreting image data in which objects occur that have not +been seen during training. We propose a novel approach that performs accurate +closed-world semantic segmentation and, at the same time, can identify new +categories without requiring any additional training data. 
Our approach +additionally provides a similarity measure for every newly discovered class in +an image to a known category, which can be useful information in downstream +tasks such as planning or mapping. Through extensive experiments, we show that +our model achieves state-of-the-art results on classes known from training data +as well as for anomaly segmentation and can distinguish between different +unknown classes.",cs.CV,['cs.CV'] +MovieChat: From Dense Token to Sparse Memory for Long Video Understanding,Enxin Song · Wenhao Chai · Guanhong Wang · Haoyang Zhou · Feiyang Wu · Yucheng Zhang · Tian Ye · Haozhe Chi · Xun Guo · Yanting Zhang · Yan Lu · Jenq-Neng Hwang · Gaoang Wang, ,https://arxiv.org/abs/2307.16449,,2307.16449.pdf,MovieChat: From Dense Token to Sparse Memory for Long Video Understanding,"Recently, integrating video foundation models and large language models to +build a video understanding system can overcome the limitations of specific +pre-defined vision tasks. Yet, existing systems can only handle videos with +very few frames. For long videos, the computation complexity, memory cost, and +long-term temporal connection impose additional challenges. Taking advantage of +the Atkinson-Shiffrin memory model, with tokens in Transformers being employed +as the carriers of memory in combination with our specially designed memory +mechanism, we propose the MovieChat to overcome these challenges. MovieChat +achieves state-of-the-art performance in long video understanding, along with +the released MovieChat-1K benchmark with 1K long video and 14K manual +annotations for validation of the effectiveness of our method.",cs.CV,['cs.CV'] +Q-Instruct: Improving Low-level Visual Abilities for Multi-modality Foundation Models,Haoning Wu · Zicheng Zhang · Erli Zhang · Chaofeng Chen · Liang Liao · Annan Wang · Kaixin Xu · Chunyi Li · Jingwen Hou · Guangtao Zhai · Xue Geng · Wenxiu Sun · Qiong Yan · Weisi Lin,https://q-future.github.io/Q-Instruct,https://arxiv.org/abs/2311.06783,,2311.06783.pdf,Q-Instruct: Improving Low-level Visual Abilities for Multi-modality Foundation Models,"Multi-modality foundation models, as represented by GPT-4V, have brought a +new paradigm for low-level visual perception and understanding tasks, that can +respond to a broad range of natural human instructions in a model. While +existing foundation models have shown exciting potentials on low-level visual +tasks, their related abilities are still preliminary and need to be improved. +In order to enhance these models, we conduct a large-scale subjective +experiment collecting a vast number of real human feedbacks on low-level +vision. Each feedback follows a pathway that starts with a detailed description +on the low-level visual appearance (*e.g. clarity, color, brightness* of an +image, and ends with an overall conclusion, with an average length of 45 words. +The constructed **Q-Pathway** dataset includes 58K detailed human feedbacks on +18,973 images with diverse low-level appearance. Moreover, to enable foundation +models to robustly respond to diverse types of questions, we design a +GPT-participated conversion to process these feedbacks into diverse-format 200K +instruction-response pairs. Experimental results indicate that the +**Q-Instruct** consistently elevates low-level perception and understanding +abilities across several foundational models. 
We anticipate that our datasets +can pave the way for a future that general intelligence can perceive, +understand low-level visual appearance and evaluate visual quality like a +human. Our dataset, model zoo, and demo is published at: +https://q-future.github.io/Q-Instruct.",cs.CV,"['cs.CV', 'cs.MM']" +WaveFace: Authentic Face Restoration with Efficient Frequency Recovery,Yunqi Miao · Jiankang Deng · Jungong Han,https://yoqim.github.io/waveface_page/,https://arxiv.org/abs/2403.12760,,2403.12760.pdf,WaveFace: Authentic Face Restoration with Efficient Frequency Recovery,"Although diffusion models are rising as a powerful solution for blind face +restoration, they are criticized for two problems: 1) slow training and +inference speed, and 2) failure in preserving identity and recovering +fine-grained facial details. In this work, we propose WaveFace to solve the +problems in the frequency domain, where low- and high-frequency components +decomposed by wavelet transformation are considered individually to maximize +authenticity as well as efficiency. The diffusion model is applied to recover +the low-frequency component only, which presents general information of the +original image but 1/16 in size. To preserve the original identity, the +generation is conditioned on the low-frequency component of low-quality images +at each denoising step. Meanwhile, high-frequency components at multiple +decomposition levels are handled by a unified network, which recovers complex +facial details in a single step. Evaluations on four benchmark datasets show +that: 1) WaveFace outperforms state-of-the-art methods in authenticity, +especially in terms of identity preservation, and 2) authentic images are +restored with the efficiency 10x faster than existing diffusion model-based BFR +methods.",cs.CV,['cs.CV'] +MAP: MAsk-Pruning for Source-Free Model Intellectual Property Protection,Boyang Peng · Sanqing Qu · Yong Wu · Tianpei Zou · Lianghua He · Alois Knoll · Guang Chen · Changjun Jiang,https://github.com/ispc-lab/MAP,https://arxiv.org/abs/2403.04149,,2403.04149.pdf,MAP: MAsk-Pruning for Source-Free Model Intellectual Property Protection,"Deep learning has achieved remarkable progress in various applications, +heightening the importance of safeguarding the intellectual property (IP) of +well-trained models. It entails not only authorizing usage but also ensuring +the deployment of models in authorized data domains, i.e., making models +exclusive to certain target domains. Previous methods necessitate concurrent +access to source training data and target unauthorized data when performing IP +protection, making them risky and inefficient for decentralized private data. +In this paper, we target a practical setting where only a well-trained source +model is available and investigate how we can realize IP protection. To achieve +this, we propose a novel MAsk Pruning (MAP) framework. MAP stems from an +intuitive hypothesis, i.e., there are target-related parameters in a +well-trained model, locating and pruning them is the key to IP protection. +Technically, MAP freezes the source model and learns a target-specific binary +mask to prevent unauthorized data usage while minimizing performance +degradation on authorized data. Moreover, we introduce a new metric aimed at +achieving a better balance between source and target performance degradation. 
+To verify the effectiveness and versatility, we have evaluated MAP in a variety +of scenarios, including vanilla source-available, practical source-free, and +challenging data-free. Extensive experiments indicate that MAP yields new +state-of-the-art performance.",cs.CV,['cs.CV'] +Unsegment Anything by Simulating Deformation,Jiahao Lu · Xingyi Yang · Xinchao Wang, ,https://arxiv.org/abs/2404.02585,,2404.02585.pdf,Unsegment Anything by Simulating Deformation,"Foundation segmentation models, while powerful, pose a significant risk: they +enable users to effortlessly extract any objects from any digital content with +a single click, potentially leading to copyright infringement or malicious +misuse. To mitigate this risk, we introduce a new task ""Anything Unsegmentable"" +to grant any image ""the right to be unsegmented"". The ambitious pursuit of the +task is to achieve highly transferable adversarial attacks against all +prompt-based segmentation models, regardless of model parameterizations and +prompts. We highlight the non-transferable and heterogeneous nature of +prompt-specific adversarial noises. Our approach focuses on disrupting image +encoder features to achieve prompt-agnostic attacks. Intriguingly, targeted +feature attacks exhibit better transferability compared to untargeted ones, +suggesting the optimal update direction aligns with the image manifold. Based +on the observations, we design a novel attack named Unsegment Anything by +Simulating Deformation (UAD). Our attack optimizes a differentiable deformation +function to create a target deformed image, which alters structural information +while preserving achievable feature distance by adversarial example. Extensive +experiments verify the effectiveness of our approach, compromising a variety of +promptable segmentation models with different architectures and prompt +interfaces. We release the code at +https://github.com/jiahaolu97/anything-unsegmentable.",cs.CV,['cs.CV'] +"Low-power, Continuous Remote Behavioral Localization with Event Cameras",Friedhelm Hamann · Suman Ghosh · Ignacio Juarez Martinez · Tom Hart · Alex Kacelnik · Guillermo Gallego,https://tub-rip.github.io/eventpenguins/,https://arxiv.org/abs/2312.03799,,2312.03799.pdf,"Low-power, Continuous Remote Behavioral Localization with Event Cameras","Researchers in natural science need reliable methods for quantifying animal +behavior. Recently, numerous computer vision methods emerged to automate the +process. However, observing wild species at remote locations remains a +challenging task due to difficult lighting conditions and constraints on power +supply and data storage. Event cameras offer unique advantages for +battery-dependent remote monitoring due to their low power consumption and high +dynamic range capabilities. We use this novel sensor to quantify a behavior in +Chinstrap penguins called ecstatic display. We formulate the problem as a +temporal action detection task, determining the start and end times of the +behavior. For this purpose, we recorded a colony of breeding penguins in +Antarctica for several weeks and labeled event data on 16 nests. The developed +method consists of a generator of candidate time intervals (proposals) and a +classifier of the actions within them. The experiments show that the event +cameras' natural response to motion is effective for continuous behavior +monitoring and detection, reaching a mean average precision (mAP) of 58% (which +increases to 63% in good weather conditions). 
The results also demonstrate the +robustness against various lighting conditions contained in the challenging +dataset. The low-power capabilities of the event camera allow it to record +significantly longer than with a conventional camera. This work pioneers the +use of event cameras for remote wildlife observation, opening new +interdisciplinary opportunities. https://tub-rip.github.io/eventpenguins/",cs.CV,"['cs.CV', 'cs.AI']" +Text-to-3D using Gaussian Splatting,Zilong Chen · Feng Wang · Yikai Wang · Huaping Liu,https://gsgen3d.github.io/,https://arxiv.org/abs/2309.16585,,2309.16585.pdf,Text-to-3D using Gaussian Splatting,"Automatic text-to-3D generation that combines Score Distillation Sampling +(SDS) with the optimization of volume rendering has achieved remarkable +progress in synthesizing realistic 3D objects. Yet most existing text-to-3D +methods by SDS and volume rendering suffer from inaccurate geometry, e.g., the +Janus issue, since it is hard to explicitly integrate 3D priors into implicit +3D representations. Besides, it is usually time-consuming for them to generate +elaborate 3D models with rich colors. In response, this paper proposes GSGEN, a +novel method that adopts Gaussian Splatting, a recent state-of-the-art +representation, to text-to-3D generation. GSGEN aims at generating high-quality +3D objects and addressing existing shortcomings by exploiting the explicit +nature of Gaussian Splatting that enables the incorporation of 3D prior. +Specifically, our method adopts a progressive optimization strategy, which +includes a geometry optimization stage and an appearance refinement stage. In +geometry optimization, a coarse representation is established under 3D point +cloud diffusion prior along with the ordinary 2D SDS optimization, ensuring a +sensible and 3D-consistent rough shape. Subsequently, the obtained Gaussians +undergo an iterative appearance refinement to enrich texture details. In this +stage, we increase the number of Gaussians by compactness-based densification +to enhance continuity and improve fidelity. With these designs, our approach +can generate 3D assets with delicate details and accurate geometry. Extensive +evaluations demonstrate the effectiveness of our method, especially for +capturing high-frequency components. Our code is available at +https://github.com/gsgen3d/gsgen",cs.CV,['cs.CV'] +UniPT: Universal Parallel Tuning for Transfer Learning with Efficient Parameter and Memory,Haiwen Diao · Bo Wan · Ying Zhang · Xu Jia · Huchuan Lu · Long Chen,https://github.com/Paranioar/UniPT,https://arxiv.org/abs/2308.14316v2,,2308.14316v2.pdf,UniPT: Universal Parallel Tuning for Transfer Learning with Efficient Parameter and Memory,"Parameter-efficient transfer learning (PETL), i.e., fine-tuning a small +portion of parameters, is an effective strategy for adapting pre-trained models +to downstream domains. To further reduce the memory demand, recent PETL works +focus on the more valuable memory-efficient characteristic. In this paper, we +argue that the scalability, adaptability, and generalizability of +state-of-the-art methods are hindered by structural dependency and pertinency +on specific pre-trained backbones. To this end, we propose a new +memory-efficient PETL strategy, Universal Parallel Tuning (UniPT), to mitigate +these weaknesses. 
Specifically, we facilitate the transfer process via a +lightweight and learnable parallel network, which consists of: 1) A parallel +interaction module that decouples the sequential connections and processes the +intermediate activations detachedly from the pre-trained network. 2) A +confidence aggregation module that learns optimal strategies adaptively for +integrating cross-layer features. We evaluate UniPT with different backbones +(e.g., T5, VSE$\infty$, CLIP4Clip, Clip-ViL, and MDETR) on various +vision-and-language and pure NLP tasks. Extensive ablations on 18 datasets have +validated that UniPT can not only dramatically reduce memory consumption and +outperform the best competitor, but also achieve competitive performance over +other plain PETL methods with lower training memory overhead. Our code is +publicly available at: https://github.com/Paranioar/UniPT.",cs.CV,"['cs.CV', 'cs.MM']" +Single-View Refractive Index Tomography with Neural Fields,Brandon Zhao · Aviad Levis · Liam Connor · Pratul P. Srinivasan · Katherine Bouman, ,https://arxiv.org/abs/2309.04437,,2309.04437.pdf,Single View Refractive Index Tomography with Neural Fields,"Refractive Index Tomography is the inverse problem of reconstructing the +continuously-varying 3D refractive index in a scene using 2D projected image +measurements. Although a purely refractive field is not directly visible, it +bends light rays as they travel through space, thus providing a signal for +reconstruction. The effects of such fields appear in many scientific computer +vision settings, ranging from refraction due to transparent cells in microscopy +to the lensing of distant galaxies caused by dark matter in astrophysics. +Reconstructing these fields is particularly difficult due to the complex +nonlinear effects of the refractive field on observed images. Furthermore, +while standard 3D reconstruction and tomography settings typically have access +to observations of the scene from many viewpoints, many refractive index +tomography problem settings only have access to images observed from a single +viewpoint. We introduce a method that leverages prior knowledge of light +sources scattered throughout the refractive medium to help disambiguate the +single-view refractive index tomography problem. We differentiably trace curved +rays through a neural field representation of the refractive field, and +optimize its parameters to best reproduce the observed image. We demonstrate +the efficacy of our approach by reconstructing simulated refractive fields, +analyze the effects of light source distribution on the recovered field, and +test our method on a simulated dark matter mapping problem where we +successfully recover the 3D refractive field caused by a realistic dark matter +distribution.",cs.CV,"['cs.CV', 'astro-ph.CO']" +MMSum: A Dataset for Multimodal Summarization and Thumbnail Generation of Videos,Jielin Qiu · Jiacheng Zhu · William Han · Aditesh Kumar · Karthik Mittal · Claire Jin · Zhengyuan Yang · Linjie Li · Jianfeng Wang · DING ZHAO · Bo Li · Lijuan Wang, ,https://arxiv.org/abs/2306.04216,,2306.04216.pdf,MMSum: A Dataset for Multimodal Summarization and Thumbnail Generation of Videos,"Multimodal summarization with multimodal output (MSMO) has emerged as a +promising research direction. Nonetheless, numerous limitations exist within +existing public MSMO datasets, including insufficient maintenance, data +inaccessibility, limited size, and the absence of proper categorization, which +pose significant challenges. 
To address these challenges and provide a +comprehensive dataset for this new direction, we have meticulously curated the +\textbf{MMSum} dataset. Our new dataset features (1) Human-validated summaries +for both video and textual content, providing superior human instruction and +labels for multimodal learning. (2) Comprehensively and meticulously arranged +categorization, spanning 17 principal categories and 170 subcategories to +encapsulate a diverse array of real-world scenarios. (3) Benchmark tests +performed on the proposed dataset to assess various tasks and methods, +including \textit{video summarization}, \textit{text summarization}, and +\textit{multimodal summarization}. To champion accessibility and collaboration, +we will release the \textbf{MMSum} dataset and the data collection tool as +fully open-source resources, fostering transparency and accelerating future +developments. Our project website can be found +at~\url{https://mmsum-dataset.github.io/}",cs.CV,"['cs.CV', 'cs.MM']" +Generalized Large-Scale Data Condensation via Various Backbone and Statistical Matching,Shitong Shao · Zeyuan Yin · Muxin Zhou · Xindong Zhang · Zhiqiang Shen, ,https://arxiv.org/abs/2311.17950,,2311.17950.pdf,Generalized Large-Scale Data Condensation via Various Backbone and Statistical Matching,"The lightweight ""local-match-global"" matching introduced by SRe2L +successfully creates a distilled dataset with comprehensive information on the +full 224x224 ImageNet-1k. However, this one-sided approach is limited to a +particular backbone, layer, and statistics, which limits the improvement of the +generalization of a distilled dataset. We suggest that sufficient and various +""local-match-global"" matching are more precise and effective than a single one +and has the ability to create a distilled dataset with richer information and +better generalization. We call this perspective ""generalized matching"" and +propose Generalized Various Backbone and Statistical Matching (G-VBSM) in this +work, which aims to create a synthetic dataset with densities, ensuring +consistency with the complete dataset across various backbones, layers, and +statistics. As experimentally demonstrated, G-VBSM is the first algorithm to +obtain strong performance across both small-scale and large-scale datasets. +Specifically, G-VBSM achieves a performance of 38.7% on CIFAR-100 with +128-width ConvNet, 47.6% on Tiny-ImageNet with ResNet18, and 31.4% on the full +224x224 ImageNet-1k with ResNet18, under images per class (IPC) 10, 50, and 10, +respectively. These results surpass all SOTA methods by margins of 3.9%, 6.5%, +and 10.1%, respectively.",cs.CV,"['cs.CV', 'cs.AI']" +Drag Your Noise: Interactive Point-based Editing via Diffusion Semantic Propagation,Haofeng Liu · Chenshu Xu · Yifei Yang · Lihua Zeng · Shengfeng He,https://github.com/haofengl/DragNoise,https://arxiv.org/abs/2404.01050,,2404.01050.pdf,Drag Your Noise: Interactive Point-based Editing via Diffusion Semantic Propagation,"Point-based interactive editing serves as an essential tool to complement the +controllability of existing generative models. A concurrent work, +DragDiffusion, updates the diffusion latent map in response to user inputs, +causing global latent map alterations. This results in imprecise preservation +of the original content and unsuccessful editing due to gradient vanishing. In +contrast, we present DragNoise, offering robust and accelerated editing without +retracing the latent map. 
The core rationale of DragNoise lies in utilizing the +predicted noise output of each U-Net as a semantic editor. This approach is +grounded in two critical observations: firstly, the bottleneck features of +U-Net inherently possess semantically rich features ideal for interactive +editing; secondly, high-level semantics, established early in the denoising +process, show minimal variation in subsequent stages. Leveraging these +insights, DragNoise edits diffusion semantics in a single denoising step and +efficiently propagates these changes, ensuring stability and efficiency in +diffusion editing. Comparative experiments reveal that DragNoise achieves +superior control and semantic retention, reducing the optimization time by over +50% compared to DragDiffusion. Our codes are available at +https://github.com/haofengl/DragNoise.",cs.CV,"['cs.CV', 'cs.GR', 'cs.HC', 'cs.LG']" +CMA: A Chromaticity Map Adapter for Robust Detection of Screen-Recapture Document Images,Changsheng Chen · Liangwei Lin · Yongqi Chen · Bin Li · Jishen Zeng · Jiwu Huang,https://github.com/chenlewis/Chromaticity-Map-Adapter-for-DPAD,https://arxiv.org/abs/2404.06663,,2404.06663.pdf,Multi-modal Document Presentation Attack Detection With Forensics Trace Disentanglement,"Document Presentation Attack Detection (DPAD) is an important measure in +protecting the authenticity of a document image. However, recent DPAD methods +demand additional resources, such as manual effort in collecting additional +data or knowing the parameters of acquisition devices. This work proposes a +DPAD method based on multi-modal disentangled traces (MMDT) without the above +drawbacks. We first disentangle the recaptured traces by a self-supervised +disentanglement and synthesis network to enhance the generalization capacity in +document images with different contents and layouts. Then, unlike the existing +DPAD approaches that rely only on data in the RGB domain, we propose to +explicitly employ the disentangled recaptured traces as new modalities in the +transformer backbone through adaptive multi-modal adapters to fuse RGB/trace +features efficiently. Visualization of the disentangled traces confirms the +effectiveness of the proposed method in different document contents. Extensive +experiments on three benchmark datasets demonstrate the superiority of our MMDT +method on representing forensic traces of recapturing distortion.",cs.CV,['cs.CV'] +Navigating Beyond Dropout: An Intriguing Solution towards Generalizable Image Super-Resolution,Hongjun Wang · Jiyuan Chen · Yinqiang Zheng · Tieyong Zeng, ,https://arxiv.org/abs/2402.18929,,2402.18929.pdf,Navigating Beyond Dropout: An Intriguing Solution Towards Generalizable Image Super Resolution,"Deep learning has led to a dramatic leap on Single Image Super-Resolution +(SISR) performances in recent years. %Despite the substantial advancement% +While most existing work assumes a simple and fixed degradation model (e.g., +bicubic downsampling), the research of Blind SR seeks to improve model +generalization ability with unknown degradation. Recently, Kong et al pioneer +the investigation of a more suitable training strategy for Blind SR using +Dropout. Although such method indeed brings substantial generalization +improvements via mitigating overfitting, we argue that Dropout simultaneously +introduces undesirable side-effect that compromises model's capacity to +faithfully reconstruct fine details. 
We show both the theoretical and +experimental analyses in our paper, and furthermore, we present another easy +yet effective training strategy that enhances the generalization ability of the +model by simply modulating its first and second-order features statistics. +Experimental results have shown that our method could serve as a model-agnostic +regularization and outperforms Dropout on seven benchmark datasets including +both synthetic and real-world scenarios.",cs.CV,"['cs.CV', 'cs.AI']" +AUEditNet: Dual-Branch Facial Action Unit Intensity Manipulation with Implicit Disentanglement,Shiwei Jin · Zhen Wang · Lei Wang · Peng Liu · Ning Bi · Truong Nguyen, ,https://arxiv.org/abs/2404.05063,,2404.05063.pdf,AUEditNet: Dual-Branch Facial Action Unit Intensity Manipulation with Implicit Disentanglement,"Facial action unit (AU) intensity plays a pivotal role in quantifying +fine-grained expression behaviors, which is an effective condition for facial +expression manipulation. However, publicly available datasets containing +intensity annotations for multiple AUs remain severely limited, often featuring +a restricted number of subjects. This limitation places challenges to the AU +intensity manipulation in images due to disentanglement issues, leading +researchers to resort to other large datasets with pretrained AU intensity +estimators for pseudo labels. In addressing this constraint and fully +leveraging manual annotations of AU intensities for precise manipulation, we +introduce AUEditNet. Our proposed model achieves impressive intensity +manipulation across 12 AUs, trained effectively with only 18 subjects. +Utilizing a dual-branch architecture, our approach achieves comprehensive +disentanglement of facial attributes and identity without necessitating +additional loss functions or implementing with large batch sizes. This approach +offers a potential solution to achieve desired facial attribute editing despite +the dataset's limited subject count. Our experiments demonstrate AUEditNet's +superior accuracy in editing AU intensities, affirming its capability in +disentangling facial attributes and identity within a limited subject pool. +AUEditNet allows conditioning by either intensity values or target images, +eliminating the need for constructing AU combinations for specific facial +expression synthesis. Moreover, AU intensity estimation, as a downstream task, +validates the consistency between real and edited images, confirming the +effectiveness of our proposed AU intensity manipulation method.",cs.CV,['cs.CV'] +Degree-of-Freedom Matters: Inferring Dynamics from Point Trajectories,Yan Zhang · Sergey Prokudin · Marko Mihajlovic · Qianli Ma · Siyu Tang, ,,https://www.nature.com/articles/s44172-024-00179-3,,,,,nan +Structure-Guided Adversarial Training of Diffusion Models,Ling Yang · Haotian Qian · Zhilong Zhang · Jingwei Liu · Bin CUI, ,https://arxiv.org/abs/2402.17563v1,,2402.17563v1.pdf,Structure-Guided Adversarial Training of Diffusion Models,"Diffusion models have demonstrated exceptional efficacy in various generative +applications. While existing models focus on minimizing a weighted sum of +denoising score matching losses for data distribution modeling, their training +primarily emphasizes instance-level optimization, overlooking valuable +structural information within each mini-batch, indicative of pair-wise +relationships among samples. To address this limitation, we introduce +Structure-guided Adversarial training of Diffusion Models (SADM). 
In this +pioneering approach, we compel the model to learn manifold structures between +samples in each training batch. To ensure the model captures authentic manifold +structures in the data distribution, we advocate adversarial training of the +diffusion generator against a novel structure discriminator in a minimax game, +distinguishing real manifold structures from the generated ones. SADM +substantially improves existing diffusion transformers (DiT) and outperforms +existing methods in image generation and cross-domain fine-tuning tasks across +12 datasets, establishing a new state-of-the-art FID of 1.58 and 2.11 on +ImageNet for class-conditional image generation at resolutions of 256x256 and +512x512, respectively.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +Naturally Supervised 3D Visual Grounding with Language-Regularized Concept Learners,Chun Feng · Joy Hsu · Weiyu Liu · Jiajun Wu,https://chunfeng3364.github.io/projects/larc_website/project_page.html,https://arxiv.org/abs/2404.19696,,,Naturally Supervised 3D Visual Grounding with Language-Regularized Concept Learners,"3D visual grounding is a challenging task that often requires direct and +dense supervision, notably the semantic label for each object in the scene. In +this paper, we instead study the naturally supervised setting that learns from +only 3D scene and QA pairs, where prior works underperform. We propose the +Language-Regularized Concept Learner (LARC), which uses constraints from +language as regularization to significantly improve the accuracy of +neuro-symbolic concept learners in the naturally supervised setting. Our +approach is based on two core insights: the first is that language constraints +(e.g., a word's relation to another) can serve as effective regularization for +structured representations in neuro-symbolic models; the second is that we can +query large language models to distill such constraints from language +properties. We show that LARC improves performance of prior works in naturally +supervised 3D visual grounding, and demonstrates a wide range of 3D visual +reasoning capabilities-from zero-shot composition, to data efficiency and +transferability. Our method represents a promising step towards regularizing +structured visual reasoning frameworks with language-based priors, for learning +in settings without dense supervision.",cs.CV,"['cs.CV', 'cs.AI', 'cs.CL', 'cs.LG']" +SuperPrimitive: Scene Reconstruction at a Primitive Level,Kirill Mazur · Gwangbin Bae · Andrew J. Davison, ,https://arxiv.org/abs/2312.05889,,2312.05889.pdf,SuperPrimitive: Scene Reconstruction at a Primitive Level,"Joint camera pose and dense geometry estimation from a set of images or a +monocular video remains a challenging problem due to its computational +complexity and inherent visual ambiguities. Most dense incremental +reconstruction systems operate directly on image pixels and solve for their 3D +positions using multi-view geometry cues. Such pixel-level approaches suffer +from ambiguities or violations of multi-view consistency (e.g. caused by +textureless or specular surfaces). + We address this issue with a new image representation which we call a +SuperPrimitive. SuperPrimitives are obtained by splitting images into +semantically correlated local regions and enhancing them with estimated surface +normal directions, both of which are predicted by state-of-the-art single image +neural networks. 
This provides a local geometry estimate per SuperPrimitive, +while their relative positions are adjusted based on multi-view observations. + We demonstrate the versatility of our new representation by addressing three +3D reconstruction tasks: depth completion, few-view structure from motion, and +monocular dense visual odometry.",cs.CV,['cs.CV'] +Leveraging Cross-Modal Neighbor Representation for Improved CLIP Classification,Chao Yi · Lu Ren · De-Chuan Zhan · Han-Jia Ye, ,https://arxiv.org/abs/2404.17753,,2404.17753.pdf,Leveraging Cross-Modal Neighbor Representation for Improved CLIP Classification,"CLIP showcases exceptional cross-modal matching capabilities due to its +training on image-text contrastive learning tasks. However, without specific +optimization for unimodal scenarios, its performance in single-modality feature +extraction might be suboptimal. Despite this, some studies have directly used +CLIP's image encoder for tasks like few-shot classification, introducing a +misalignment between its pre-training objectives and feature extraction +methods. This inconsistency can diminish the quality of the image's feature +representation, adversely affecting CLIP's effectiveness in target tasks. In +this paper, we view text features as precise neighbors of image features in +CLIP's space and present a novel CrOss-moDal nEighbor Representation(CODER) +based on the distance structure between images and their neighbor texts. This +feature extraction method aligns better with CLIP's pre-training objectives, +thereby fully leveraging CLIP's robust cross-modal capabilities. The key to +construct a high-quality CODER lies in how to create a vast amount of +high-quality and diverse texts to match with images. We introduce the Auto Text +Generator(ATG) to automatically generate the required texts in a data-free and +training-free manner. We apply CODER to CLIP's zero-shot and few-shot image +classification tasks. Experiment results across various datasets and models +confirm CODER's effectiveness. Code is available +at:https://github.com/YCaigogogo/CVPR24-CODER.",cs.CV,"['cs.CV', 'cs.AI']" +MicroDiffusion: Implicit Representation-Guided Diffusion for 3D Reconstruction from Limited 2D Microscopy Projections,mude hui · Zihao Wei · Hongru Zhu · Fei Xia · Yuyin Zhou,https://github.com/UCSC-VLAA/MicroDiffusion,https://arxiv.org/abs/2403.10815,,2403.10815.pdf,MicroDiffusion: Implicit Representation-Guided Diffusion for 3D Reconstruction from Limited 2D Microscopy Projections,"Volumetric optical microscopy using non-diffracting beams enables rapid +imaging of 3D volumes by projecting them axially to 2D images but lacks crucial +depth information. Addressing this, we introduce MicroDiffusion, a pioneering +tool facilitating high-quality, depth-resolved 3D volume reconstruction from +limited 2D projections. While existing Implicit Neural Representation (INR) +models often yield incomplete outputs and Denoising Diffusion Probabilistic +Models (DDPM) excel at capturing details, our method integrates INR's +structural coherence with DDPM's fine-detail enhancement capabilities. We +pretrain an INR model to transform 2D axially-projected images into a +preliminary 3D volume. This pretrained INR acts as a global prior guiding +DDPM's generative process through a linear interpolation between INR outputs +and noise inputs. This strategy enriches the diffusion process with structured +3D information, enhancing detail and reducing noise in localized 2D images. 
By +conditioning the diffusion model on the closest 2D projection, MicroDiffusion +substantially enhances fidelity in resulting 3D reconstructions, surpassing INR +and standard DDPM outputs with unparalleled image quality and structural +fidelity. Our code and dataset are available at +https://github.com/UCSC-VLAA/MicroDiffusion.",eess.IV,"['eess.IV', 'cs.CV']" +Portrait4D: Learning One-Shot 4D Head Avatar Synthesis using Synthetic Data,Yu Deng · Duomin Wang · Xiaohang Ren · Xingyu Chen · Baoyuan Wang,https://github.com/YuDeng/Portrait-4D,https://arxiv.org/abs/2311.18729,,2311.18729.pdf,Portrait4D: Learning One-Shot 4D Head Avatar Synthesis using Synthetic Data,"Existing one-shot 4D head synthesis methods usually learn from monocular +videos with the aid of 3DMM reconstruction, yet the latter is evenly +challenging which restricts them from reasonable 4D head synthesis. We present +a method to learn one-shot 4D head synthesis via large-scale synthetic data. +The key is to first learn a part-wise 4D generative model from monocular images +via adversarial learning, to synthesize multi-view images of diverse identities +and full motions as training data; then leverage a transformer-based animatable +triplane reconstructor to learn 4D head reconstruction using the synthetic +data. A novel learning strategy is enforced to enhance the generalizability to +real images by disentangling the learning process of 3D reconstruction and +reenactment. Experiments demonstrate our superiority over the prior art.",cs.CV,['cs.CV'] +Doodle Your 3D: From Abstract Freehand Sketches to Precise 3D Shapes,Hmrishav Bandyopadhyay · Subhadeep Koley · Ayan Das · Ayan Kumar Bhunia · Aneeshan Sain · Pinaki Nath Chowdhury · Tao Xiang · Yi-Zhe Song,https://hmrishavbandy.github.io/doodle23d/,https://arxiv.org/abs/2312.04043,,2312.04043.pdf,Doodle Your 3D: From Abstract Freehand Sketches to Precise 3D Shapes,"In this paper, we democratise 3D content creation, enabling precise +generation of 3D shapes from abstract sketches while overcoming limitations +tied to drawing skills. We introduce a novel part-level modelling and alignment +framework that facilitates abstraction modelling and cross-modal +correspondence. Leveraging the same part-level decoder, our approach seamlessly +extends to sketch modelling by establishing correspondence between CLIPasso +edgemaps and projected 3D part regions, eliminating the need for a dataset +pairing human sketches and 3D shapes. Additionally, our method introduces a +seamless in-position editing process as a byproduct of cross-modal part-aligned +modelling. Operating in a low-dimensional implicit space, our approach +significantly reduces computational demands and processing time.",cs.CV,"['cs.CV', 'cs.AI']" +Blur-aware Spatio-temporal Sparse Transformer for Video Deblurring,Huicong Zhang · Haozhe Xie · Hongxun Yao,https://vilab.hit.edu.cn/projects/bsstnet,,https://github.com/huicongzhang/BSSTNet,,,,,nan +Cam4DOcc: Benchmark for Camera-Only 4D Occupancy Forecasting in Autonomous Driving Applications,Junyi Ma · Xieyuanli Chen · Jiawei Huang · Jingyi Xu · Zhen Luo · Jintao Xu · Weihao Gu · Rui Ai · Hesheng Wang,https://github.com/haomo-ai/Cam4DOcc,https://arxiv.org/abs/2311.17663,,2311.17663.pdf,Cam4DOcc: Benchmark for Camera-Only 4D Occupancy Forecasting in Autonomous Driving Applications,"Understanding how the surrounding environment changes is crucial for +performing downstream tasks safely and reliably in autonomous driving +applications. 
Recent occupancy estimation techniques using only camera images +as input can provide dense occupancy representations of large-scale scenes +based on the current observation. However, they are mostly limited to +representing the current 3D space and do not consider the future state of +surrounding objects along the time axis. To extend camera-only occupancy +estimation into spatiotemporal prediction, we propose Cam4DOcc, a new benchmark +for camera-only 4D occupancy forecasting, evaluating the surrounding scene +changes in a near future. We build our benchmark based on multiple publicly +available datasets, including nuScenes, nuScenes-Occupancy, and Lyft-Level5, +which provides sequential occupancy states of general movable and static +objects, as well as their 3D backward centripetal flow. To establish this +benchmark for future research with comprehensive comparisons, we introduce four +baseline types from diverse camera-based perception and prediction +implementations, including a static-world occupancy model, voxelization of +point cloud prediction, 2D-3D instance-based prediction, and our proposed novel +end-to-end 4D occupancy forecasting network. Furthermore, the standardized +evaluation protocol for preset multiple tasks is also provided to compare the +performance of all the proposed baselines on present and future occupancy +estimation with respect to objects of interest in autonomous driving scenarios. +The dataset and our implementation of all four baselines in the proposed +Cam4DOcc benchmark will be released here: https://github.com/haomo-ai/Cam4DOcc.",cs.CV,['cs.CV'] +DEADiff: An Efficient Stylization Diffusion Model with Disentangled Representations,Tianhao Qi · Shancheng Fang · Yanze Wu · Hongtao Xie · Jiawei Liu · Lang chen · Qian HE · Yongdong Zhang,https://tianhao-qi.github.io/DEADiff/,https://arxiv.org/abs/2403.06951,,2403.06951.pdf,DEADiff: An Efficient Stylization Diffusion Model with Disentangled Representations,"The diffusion-based text-to-image model harbors immense potential in +transferring reference style. However, current encoder-based approaches +significantly impair the text controllability of text-to-image models while +transferring styles. In this paper, we introduce DEADiff to address this issue +using the following two strategies: 1) a mechanism to decouple the style and +semantics of reference images. The decoupled feature representations are first +extracted by Q-Formers which are instructed by different text descriptions. +Then they are injected into mutually exclusive subsets of cross-attention +layers for better disentanglement. 2) A non-reconstructive learning method. The +Q-Formers are trained using paired images rather than the identical target, in +which the reference image and the ground-truth image are with the same style or +semantics. We show that DEADiff attains the best visual stylization results and +optimal balance between the text controllability inherent in the text-to-image +model and style similarity to the reference image, as demonstrated both +quantitatively and qualitatively. 
Our project page is +https://tianhao-qi.github.io/DEADiff/.",cs.CV,['cs.CV'] +What Sketch Explainability Really Means for Downstream Tasks ?,Hmrishav Bandyopadhyay · Pinaki Nath Chowdhury · Ayan Kumar Bhunia · Aneeshan Sain · Tao Xiang · Yi-Zhe Song, ,https://arxiv.org/abs/2403.09480,,2403.09480.pdf,What Sketch Explainability Really Means for Downstream Tasks,"In this paper, we explore the unique modality of sketch for explainability, +emphasising the profound impact of human strokes compared to conventional +pixel-oriented studies. Beyond explanations of network behavior, we discern the +genuine implications of explainability across diverse downstream sketch-related +tasks. We propose a lightweight and portable explainability solution -- a +seamless plugin that integrates effortlessly with any pre-trained model, +eliminating the need for re-training. Demonstrating its adaptability, we +present four applications: highly studied retrieval and generation, and +completely novel assisted drawing and sketch adversarial attacks. The +centrepiece to our solution is a stroke-level attribution map that takes +different forms when linked with downstream tasks. By addressing the inherent +non-differentiability of rasterisation, we enable explanations at both coarse +stroke level (SLA) and partial stroke level (P-SLA), each with its advantages +for specific downstream tasks.",cs.CV,"['cs.CV', 'cs.AI']" +OHTA: One-shot Hand Avatar via Data-driven Implicit Priors,Xiaozheng Zheng · Chao Wen · Zhuo Su · Zeran Xu · Zhaohu Li · Yang Zhao · Zhou Xue,https://zxz267.github.io/OHTA/,https://arxiv.org/abs/2402.18969,,2402.18969.pdf,OHTA: One-shot Hand Avatar via Data-driven Implicit Priors,"In this paper, we delve into the creation of one-shot hand avatars, attaining +high-fidelity and drivable hand representations swiftly from a single image. +With the burgeoning domains of the digital human, the need for quick and +personalized hand avatar creation has become increasingly critical. Existing +techniques typically require extensive input data and may prove cumbersome or +even impractical in certain scenarios. To enhance accessibility, we present a +novel method OHTA (One-shot Hand avaTAr) that enables the creation of detailed +hand avatars from merely one image. OHTA tackles the inherent difficulties of +this data-limited problem by learning and utilizing data-driven hand priors. +Specifically, we design a hand prior model initially employed for 1) learning +various hand priors with available data and subsequently for 2) the inversion +and fitting of the target identity with prior knowledge. OHTA demonstrates the +capability to create high-fidelity hand avatars with consistent animatable +quality, solely relying on a single image. Furthermore, we illustrate the +versatility of OHTA through diverse applications, encompassing text-to-avatar +conversion, hand editing, and identity latent space manipulation.",cs.CV,['cs.CV'] +FedUV: Uniformity and Variance for Heterogeneous Federated Learning,Ha Min Son · Moon-Hyun Kim · Tai-Myoung Chung · Chao Huang · Xin Liu,https://github.com/sonhamin/FedUV,https://arxiv.org/abs/2402.18372,,2402.18372.pdf,FedUV: Uniformity and Variance for Heterogeneous Federated Learning,"Federated learning is a promising framework to train neural networks with +widely distributed data. However, performance degrades heavily with +heterogeneously distributed data. 
Recent work has shown this is due to the +final layer of the network being most prone to local bias, some finding success +freezing the final layer as an orthogonal classifier. We investigate the +training dynamics of the classifier by applying SVD to the weights motivated by +the observation that freezing weights results in constant singular values. We +find that there are differences when training in IID and non-IID settings. +Based on this finding, we introduce two regularization terms for local training +to continuously emulate IID settings: (1) variance in the dimension-wise +probability distribution of the classifier and (2) hyperspherical uniformity of +representations of the encoder. These regularizations promote local models to +act as if it were in an IID setting regardless of the local data distribution, +thus offsetting proneness to bias while being flexible to the data. On +extensive experiments in both label-shift and feature-shift settings, we verify +that our method achieves highest performance by a large margin especially in +highly non-IID cases in addition to being scalable to larger models and +datasets.",cs.LG,"['cs.LG', 'cs.AI', 'cs.DC']" +WinSyn: A High Resolution Testbed for Synthetic Data,Tom Kelly · John Femiani · Peter Wonka, ,https://arxiv.org/abs/2310.08471,,2310.08471.pdf,WinSyn: A High Resolution Testbed for Synthetic Data,"We present WinSyn, a unique dataset and testbed for creating high-quality +synthetic data with procedural modeling techniques. The dataset contains +high-resolution photographs of windows, selected from locations around the +world, with 89,318 individual window crops showcasing diverse geometric and +material characteristics. We evaluate a procedural model by training semantic +segmentation networks on both synthetic and real images and then comparing +their performances on a shared test set of real images. Specifically, we +measure the difference in mean Intersection over Union (mIoU) and determine the +effective number of real images to match synthetic data's training performance. +We design a baseline procedural model as a benchmark and provide 21,290 +synthetically generated images. By tuning the procedural model, key factors are +identified which significantly influence the model's fidelity in replicating +real-world scenarios. Importantly, we highlight the challenge of procedural +modeling using current techniques, especially in their ability to replicate the +spatial semantics of real-world scenarios. This insight is critical because of +the potential of procedural models to bridge to hidden scene aspects such as +depth, reflectivity, material properties, and lighting conditions.",cs.CV,"['cs.CV', 'cs.GR']" +Rethinking Inductive Biases for Surface Normal Estimation,Gwangbin Bae · Andrew J. Davison, ,https://arxiv.org/abs/2403.00712,,2403.00712.pdf,Rethinking Inductive Biases for Surface Normal Estimation,"Despite the growing demand for accurate surface normal estimation models, +existing methods use general-purpose dense prediction models, adopting the same +inductive biases as other tasks. In this paper, we discuss the inductive biases +needed for surface normal estimation and propose to (1) utilize the per-pixel +ray direction and (2) encode the relationship between neighboring surface +normals by learning their relative rotation. The proposed method can generate +crisp - yet, piecewise smooth - predictions for challenging in-the-wild images +of arbitrary resolution and aspect ratio. 
Compared to a recent ViT-based +state-of-the-art model, our method shows a stronger generalization ability, +despite being trained on an orders of magnitude smaller dataset. The code is +available at https://github.com/baegwangbin/DSINE.",cs.CV,['cs.CV'] +MAPLM: A Real-World Large-Scale Vision-Language Benchmark for Map and Traffic Scene Understanding,Xu Cao · Tong Zhou · Yunsheng Ma · Wenqian Ye · Can Cui · Kun Tang · Zhipeng Cao · Kaizhao Liang · Ziran Wang · James Rehg · chao zheng, ,,https://ysma.me/,,,,,nan +In2SET: Intra-Inter Similarity Exploiting Transformer for Dual-Camera Compressive Hyperspectral Imaging,Xin Wang · Lizhi Wang · Xiangtian Ma · Maoqing Zhang · Lin Zhu · Hua Huang,https://github.com/2JONAS/In2SET,https://arxiv.org/abs/2312.13319,,2312.13319.pdf,In2SET: Intra-Inter Similarity Exploiting Transformer for Dual-Camera Compressive Hyperspectral Imaging,"Dual-Camera Compressed Hyperspectral Imaging (DCCHI) offers the capability to +reconstruct 3D Hyperspectral Image (HSI) by fusing compressive and Panchromatic +(PAN) image, which has shown great potential for snapshot hyperspectral imaging +in practice. In this paper, we introduce a novel DCCHI reconstruction network, +the Intra-Inter Similarity Exploiting Transformer (In2SET). Our key insight is +to make full use of the PAN image to assist the reconstruction. To this end, we +propose using the intra-similarity within the PAN image as a proxy for +approximating the intra-similarity in the original HSI, thereby offering an +enhanced content prior for more accurate HSI reconstruction. Furthermore, we +aim to align the features from the underlying HSI with those of the PAN image, +maintaining semantic consistency and introducing new contextual information for +the reconstruction process. By integrating In2SET into a PAN-guided unrolling +framework, our method substantially enhances the spatial-spectral fidelity and +detail of the reconstructed images, providing a more comprehensive and accurate +depiction of the scene. Extensive experiments conducted on both real and +simulated datasets demonstrate that our approach consistently outperforms +existing state-of-the-art methods in terms of reconstruction quality and +computational complexity. Code will be released.",eess.IV,"['eess.IV', 'cs.CV']" +Describing Differences in Image Sets with Natural Language,Lisa Dunlap · Yuhui Zhang · Xiaohan Wang · Ruiqi Zhong · Trevor Darrell · Jacob Steinhardt · Joseph Gonzalez · Serena Yeung,https://understanding-visual-datasets.github.io/VisDiff-website/,https://arxiv.org/abs/2312.02974,,2312.02974.pdf,Describing Differences in Image Sets with Natural Language,"How do two sets of images differ? Discerning set-level differences is crucial +for understanding model behaviors and analyzing datasets, yet manually sifting +through thousands of images is impractical. To aid in this discovery process, +we explore the task of automatically describing the differences between two +$\textbf{sets}$ of images, which we term Set Difference Captioning. This task +takes in image sets $D_A$ and $D_B$, and outputs a description that is more +often true on $D_A$ than $D_B$. We outline a two-stage approach that first +proposes candidate difference descriptions from image sets and then re-ranks +the candidates by checking how well they can differentiate the two sets. We +introduce VisDiff, which first captions the images and prompts a language model +to propose candidate descriptions, then re-ranks these descriptions using CLIP. 
+To evaluate VisDiff, we collect VisDiffBench, a dataset with 187 paired image +sets with ground truth difference descriptions. We apply VisDiff to various +domains, such as comparing datasets (e.g., ImageNet vs. ImageNetV2), comparing +classification models (e.g., zero-shot CLIP vs. supervised ResNet), summarizing +model failure modes (supervised ResNet), characterizing differences between +generative models (e.g., StableDiffusionV1 and V2), and discovering what makes +images memorable. Using VisDiff, we are able to find interesting and previously +unknown differences in datasets and models, demonstrating its utility in +revealing nuanced insights.",cs.CV,"['cs.CV', 'cs.CL', 'cs.CY', 'cs.LG']" +SketchINR: A First Look into Sketches as Implicit Neural Representations,Hmrishav Bandyopadhyay · Ayan Kumar Bhunia · Pinaki Nath Chowdhury · Aneeshan Sain · Tao Xiang · Timothy Hospedales · Yi-Zhe Song,https://hmrishavbandy.github.io/sketchinr,https://arxiv.org/abs/2403.09344,,2403.09344.pdf,SketchINR: A First Look into Sketches as Implicit Neural Representations,"We propose SketchINR, to advance the representation of vector sketches with +implicit neural models. A variable length vector sketch is compressed into a +latent space of fixed dimension that implicitly encodes the underlying shape as +a function of time and strokes. The learned function predicts the $xy$ point +coordinates in a sketch at each time and stroke. Despite its simplicity, +SketchINR outperforms existing representations at multiple tasks: (i) Encoding +an entire sketch dataset into a fixed size latent vector, SketchINR gives +$60\times$ and $10\times$ data compression over raster and vector sketches, +respectively. (ii) SketchINR's auto-decoder provides a much higher-fidelity +representation than other learned vector sketch representations, and is +uniquely able to scale to complex vector sketches such as FS-COCO. (iii) +SketchINR supports parallelisation that can decode/render $\sim$$100\times$ +faster than other learned vector representations such as SketchRNN. (iv) +SketchINR, for the first time, emulates the human ability to reproduce a sketch +with varying abstraction in terms of number and complexity of strokes. As a +first look at implicit sketches, SketchINR's compact high-fidelity +representation will support future work in modelling long and complex sketches.",cs.CV,"['cs.CV', 'cs.AI']" +Commonsense Prototype for Outdoor Unsupervised 3D Object Detection,Hai Wu · Shijia Zhao · Xun Huang · Chenglu Wen · Xin Li · Cheng Wang,https://github.com/hailanyi/CPD,https://arxiv.org/abs/2404.16493,,2404.16493.pdf,Commonsense Prototype for Outdoor Unsupervised 3D Object Detection,"The prevalent approaches of unsupervised 3D object detection follow +cluster-based pseudo-label generation and iterative self-training processes. +However, the challenge arises due to the sparsity of LiDAR scans, which leads +to pseudo-labels with erroneous size and position, resulting in subpar +detection performance. To tackle this problem, this paper introduces a +Commonsense Prototype-based Detector, termed CPD, for unsupervised 3D object +detection. CPD first constructs Commonsense Prototype (CProto) characterized by +high-quality bounding box and dense points, based on commonsense intuition. +Subsequently, CPD refines the low-quality pseudo-labels by leveraging the size +prior from CProto. Furthermore, CPD enhances the detection accuracy of sparsely +scanned objects by the geometric knowledge from CProto. 
CPD outperforms +state-of-the-art unsupervised 3D detectors on Waymo Open Dataset (WOD), +PandaSet, and KITTI datasets by a large margin. Besides, by training CPD on WOD +and testing on KITTI, CPD attains 90.85% and 81.01% 3D Average Precision on +easy and moderate car classes, respectively. These achievements position CPD in +close proximity to fully supervised detectors, highlighting the significance of +our method. The code will be available at https://github.com/hailanyi/CPD.",cs.CV,['cs.CV'] +Global and Hierarchical Geometry Consistency Priors for Few-shot NeRFs in Indoor Scenes,Xiaotian Sun · Qingshan Xu · Xinjie Yang · Yu Zang · Cheng Wang, ,https://arxiv.org/html/2404.00992v1,,2404.00992v1.pdf,SGCNeRF: Few-Shot Neural Rendering via Sparse Geometric Consistency Guidance,"Neural Radiance Field (NeRF) technology has made significant strides in +creating novel viewpoints. However, its effectiveness is hampered when working +with sparsely available views, often leading to performance dips due to +overfitting. FreeNeRF attempts to overcome this limitation by integrating +implicit geometry regularization, which incrementally improves both geometry +and textures. Nonetheless, an initial low positional encoding bandwidth results +in the exclusion of high-frequency elements. The quest for a holistic approach +that simultaneously addresses overfitting and the preservation of +high-frequency details remains ongoing. This study introduces a novel feature +matching based sparse geometry regularization module. This module excels in +pinpointing high-frequency keypoints, thereby safeguarding the integrity of +fine details. Through progressive refinement of geometry and textures across +NeRF iterations, we unveil an effective few-shot neural rendering architecture, +designated as SGCNeRF, for enhanced novel view synthesis. Our experiments +demonstrate that SGCNeRF not only achieves superior geometry-consistent +outcomes but also surpasses FreeNeRF, with improvements of 0.7 dB and 0.6 dB in +PSNR on the LLFF and DTU datasets, respectively.",cs.CV,['cs.CV'] +Segment Every Out-of-Distribution Object,Wenjie Zhao · Jia Li · Xin Dong · Yu Xiang · Yunhui Guo, ,https://arxiv.org/abs/2311.16516,,2311.16516.pdf,Segment Every Out-of-Distribution Object,"Semantic segmentation models, while effective for in-distribution categories, +face challenges in real-world deployment due to encountering +out-of-distribution (OoD) objects. Detecting these OoD objects is crucial for +safety-critical applications. Existing methods rely on anomaly scores, but +choosing a suitable threshold for generating masks presents difficulties and +can lead to fragmentation and inaccuracy. This paper introduces a method to +convert anomaly \textbf{S}core \textbf{T}o segmentation \textbf{M}ask, called +S2M, a simple and effective framework for OoD detection in semantic +segmentation. Unlike assigning anomaly scores to pixels, S2M directly segments +the entire OoD object. By transforming anomaly scores into prompts for a +promptable segmentation model, S2M eliminates the need for threshold selection. 
+Extensive experiments demonstrate that S2M outperforms the state-of-the-art by +approximately 20% in IoU and 40% in mean F1 score, on average, across various +benchmarks including Fishyscapes, Segment-Me-If-You-Can, and RoadAnomaly +datasets.",cs.CV,['cs.CV'] +Learning to Segment Referred Objects from Narrated Egocentric Videos,Yuhan Shen · Huiyu Wang · Xitong Yang · Matt Feiszli · Ehsan Elhamifar · Lorenzo Torresani · Effrosyni Mavroudi, ,https://arxiv.org/abs/2404.05206,,2404.05206.pdf,SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos,"We propose a novel self-supervised embedding to learn how actions sound from +narrated in-the-wild egocentric videos. Whereas existing methods rely on +curated data with known audio-visual correspondence, our multimodal +contrastive-consensus coding (MC3) embedding reinforces the associations +between audio, language, and vision when all modality pairs agree, while +diminishing those associations when any one pair does not. We show our approach +can successfully discover how the long tail of human actions sound from +egocentric video, outperforming an array of recent multimodal embedding +techniques on two datasets (Ego4D and EPIC-Sounds) and multiple cross-modal +tasks.",cs.CV,"['cs.CV', 'cs.MM', 'cs.SD', 'eess.AS']" +Low-Resource Vision Challenges for Foundation Models,Yunhua Zhang · Hazel Doughty · Cees G. M. Snoek, ,https://arxiv.org/abs/2401.04716,,2401.04716.pdf,Low-Resource Vision Challenges for Foundation Models,"Low-resource settings are well-established in natural language processing, +where many languages lack sufficient data for deep learning at scale. However, +low-resource problems are under-explored in computer vision. In this paper, we +address this gap and explore the challenges of low-resource image tasks with +vision foundation models. We first collect a benchmark of genuinely +low-resource image data, covering historic maps, circuit diagrams, and +mechanical drawings. These low-resource settings all share three challenges: +data scarcity, fine-grained differences, and the distribution shift from +natural images to the specialized domain of interest. While existing foundation +models have shown impressive generalizability, we find they cannot transfer +well to our low-resource tasks. To begin to tackle the challenges of +low-resource vision, we introduce one simple baseline per challenge. +Specifically, we i) enlarge the data space by generative models, ii) adopt the +best sub-kernels to encode local regions for fine-grained difference discovery +and iii) learn attention for specialized domains. Experiments on our three +low-resource tasks demonstrate our proposals already provide a better baseline +than transfer learning, data augmentation, and fine-grained methods. This +highlights the unique characteristics and challenges of low-resource vision for +foundation models that warrant further investigation. Project page: +https://xiaobai1217.github.io/Low-Resource-Vision/.",cs.CV,['cs.CV'] +SceneTex: High-Quality Texture Synthesis for Indoor Scenes via Diffusion Priors,Dave Zhenyu Chen · Haoxuan Li · Hsin-Ying Lee · Sergey Tulyakov · Matthias Nießner,https://daveredrum.github.io/SceneTex/,https://arxiv.org/abs/2311.17261,,2311.17261.pdf,SceneTex: High-Quality Texture Synthesis for Indoor Scenes via Diffusion Priors,"We propose SceneTex, a novel method for effectively generating high-quality +and style-consistent textures for indoor scenes using depth-to-image diffusion +priors. 
Unlike previous methods that either iteratively warp 2D views onto a +mesh surface or distillate diffusion latent features without accurate geometric +and style cues, SceneTex formulates the texture synthesis task as an +optimization problem in the RGB space where style and geometry consistency are +properly reflected. At its core, SceneTex proposes a multiresolution texture +field to implicitly encode the mesh appearance. We optimize the target texture +via a score-distillation-based objective function in respective RGB renderings. +To further secure the style consistency across views, we introduce a +cross-attention decoder to predict the RGB values by cross-attending to the +pre-sampled reference locations in each instance. SceneTex enables various and +accurate texture synthesis for 3D-FRONT scenes, demonstrating significant +improvements in visual quality and prompt fidelity over the prior texture +generation methods.",cs.CV,['cs.CV'] +TCP: Textual-based Class-aware Prompt tuning for Visual-Language Model,Hantao Yao · Rui Zhang · Changsheng Xu, ,https://arxiv.org/abs/2311.18231,,2311.18231.pdf,TCP:Textual-based Class-aware Prompt tuning for Visual-Language Model,"Prompt tuning represents a valuable technique for adapting pre-trained +visual-language models (VLM) to various downstream tasks. Recent advancements +in CoOp-based methods propose a set of learnable domain-shared or +image-conditional textual tokens to facilitate the generation of task-specific +textual classifiers. However, those textual tokens have a limited +generalization ability regarding unseen domains, as they cannot dynamically +adjust to the distribution of testing classes. To tackle this issue, we present +a novel Textual-based Class-aware Prompt tuning(TCP) that explicitly +incorporates prior knowledge about classes to enhance their discriminability. +The critical concept of TCP involves leveraging Textual Knowledge Embedding +(TKE) to map the high generalizability of class-level textual knowledge into +class-aware textual tokens. By seamlessly integrating these class-aware prompts +into the Text Encoder, a dynamic class-aware classifier is generated to enhance +discriminability for unseen domains. During inference, TKE dynamically +generates class-aware prompts related to the unseen classes. Comprehensive +evaluations demonstrate that TKE serves as a plug-and-play module effortlessly +combinable with existing methods. Furthermore, TCP consistently achieves +superior performance while demanding less training time. +Code:https://github.com/htyao89/Textual-based_Class-aware_prompt_tuning/",cs.CV,['cs.CV'] +URHand: Universal Relightable Hands,Zhaoxi Chen · Gyeongsik Moon · Kaiwen Guo · Chen Cao · Stanislav Pidhorskyi · Tomas Simon · Rohan Joshi · Yuan Dong · Yichen Xu · Bernardo Pires · He Wen · Lucas Evans · Bo Peng · Julia Buffalini · Autumn Trimble · Kevyn McPhail · Melissa Schoeller · Shoou-I Yu · Javier Romero · Michael Zollhoefer · Yaser Sheikh · Ziwei Liu · Shunsuke Saito,https://frozenburning.github.io/projects/urhand/,http://export.arxiv.org/abs/2401.05334,,2401.05334.pdf,URHand: Universal Relightable Hands,"Existing photorealistic relightable hand models require extensive +identity-specific observations in different views, poses, and illuminations, +and face challenges in generalizing to natural illuminations and novel +identities. To bridge this gap, we present URHand, the first universal +relightable hand model that generalizes across viewpoints, poses, +illuminations, and identities. 
Our model allows few-shot personalization using +images captured with a mobile phone, and is ready to be photorealistically +rendered under novel illuminations. To simplify the personalization process +while retaining photorealism, we build a powerful universal relightable prior +based on neural relighting from multi-view images of hands captured in a light +stage with hundreds of identities. The key challenge is scaling the +cross-identity training while maintaining personalized fidelity and sharp +details without compromising generalization under natural illuminations. To +this end, we propose a spatially varying linear lighting model as the neural +renderer that takes physics-inspired shading as input feature. By removing +non-linear activations and bias, our specifically designed lighting model +explicitly keeps the linearity of light transport. This enables single-stage +training from light-stage data while generalizing to real-time rendering under +arbitrary continuous illuminations across diverse identities. In addition, we +introduce the joint learning of a physically based model and our neural +relighting model, which further improves fidelity and generalization. Extensive +experiments show that our approach achieves superior performance over existing +methods in terms of both quality and generalizability. We also demonstrate +quick personalization of URHand from a short phone scan of an unseen identity.",cs.CV,"['cs.CV', 'cs.GR']" +EventEgo3D: 3D Human Motion Capture from Egocentric Event Streams,Christen Millerdurai · Hiroyasu Akada · Jian Wang · Diogo Luvizon · Christian Theobalt · Vladislav Golyanik,https://4dqv.mpi-inf.mpg.de/EventEgo3D/,https://arxiv.org/abs/2404.08640,,2404.08640.pdf,EventEgo3D: 3D Human Motion Capture from Egocentric Event Streams,"Monocular egocentric 3D human motion capture is a challenging and actively +researched problem. Existing methods use synchronously operating visual sensors +(e.g. RGB cameras) and often fail under low lighting and fast motions, which +can be restricting in many applications involving head-mounted devices. In +response to the existing limitations, this paper 1) introduces a new problem, +i.e., 3D human motion capture from an egocentric monocular event camera with a +fisheye lens, and 2) proposes the first approach to it called EventEgo3D +(EE3D). Event streams have high temporal resolution and provide reliable cues +for 3D human motion capture under high-speed human motions and rapidly changing +illumination. The proposed EE3D framework is specifically tailored for learning +with event streams in the LNES representation, enabling high 3D reconstruction +accuracy. We also design a prototype of a mobile head-mounted device with an +event camera and record a real dataset with event observations and the +ground-truth 3D human poses (in addition to the synthetic dataset). 
Our EE3D +demonstrates robustness and superior 3D accuracy compared to existing solutions +across various challenging experiments while supporting real-time 3D pose +update rates of 140Hz.",cs.CV,['cs.CV'] +WateRF: Robust Watermarks in Radiance Fields for Protection of Copyrights,Youngdong Jang · Dong In Lee · MinHyuk Jang · Jong Wook Kim · Feng Yang · Sangpil Kim,https://kuai-lab.github.io/cvpr2024waterf/,https://arxiv.org/abs/2405.02066,,2405.02066.pdf,WateRF: Robust Watermarks in Radiance Fields for Protection of Copyrights,"The advances in the Neural Radiance Fields (NeRF) research offer extensive +applications in diverse domains, but protecting their copyrights has not yet +been researched in depth. Recently, NeRF watermarking has been considered one +of the pivotal solutions for safely deploying NeRF-based 3D representations. +However, existing methods are designed to apply only to implicit or explicit +NeRF representations. In this work, we introduce an innovative watermarking +method that can be employed in both representations of NeRF. This is achieved +by fine-tuning NeRF to embed binary messages in the rendering process. In +detail, we propose utilizing the discrete wavelet transform in the NeRF space +for watermarking. Furthermore, we adopt a deferred back-propagation technique +and introduce a combination with the patch-wise loss to improve rendering +quality and bit accuracy with minimum trade-offs. We evaluate our method in +three different aspects: capacity, invisibility, and robustness of the embedded +watermarks in the 2D-rendered images. Our method achieves state-of-the-art +performance with faster training speed over the compared state-of-the-art +methods.",cs.CV,"['cs.CV', 'eess.IV']" +ADA-Track: End-to-End Multi-Camera 3D Multi-Object Tracking with Alternating Detection and Association,Shuxiao Ding · Lukas Schneider · Marius Cordts · Jürgen Gall,https://github.com/dsx0511/ADA-Track,https://arxiv.org/abs/2405.08909,,2405.08909.pdf,ADA-Track: End-to-End Multi-Camera 3D Multi-Object Tracking with Alternating Detection and Association,"Many query-based approaches for 3D Multi-Object Tracking (MOT) adopt the +tracking-by-attention paradigm, utilizing track queries for identity-consistent +detection and object queries for identity-agnostic track spawning. +Tracking-by-attention, however, entangles detection and tracking queries in one +embedding for both the detection and tracking task, which is sub-optimal. Other +approaches resemble the tracking-by-detection paradigm, detecting objects using +decoupled track and detection queries followed by a subsequent association. +These methods, however, do not leverage synergies between the detection and +association task. Combining the strengths of both paradigms, we introduce +ADA-Track, a novel end-to-end framework for 3D MOT from multi-view cameras. We +introduce a learnable data association module based on edge-augmented +cross-attention, leveraging appearance and geometric features. Furthermore, we +integrate this association module into the decoder layer of a DETR-based 3D +detector, enabling simultaneous DETR-like query-to-image cross-attention for +detection and query-to-query cross-attention for data association. By stacking +these decoder layers, queries are refined for the detection and association +task alternately, effectively harnessing the task dependencies. We evaluate our +method on the nuScenes dataset and demonstrate the advantage of our approach +compared to the two previous paradigms. 
Code is available at +https://github.com/dsx0511/ADA-Track.",cs.CV,['cs.CV'] +Scale Decoupled Distillation,Shicai Wei · Chunbo Luo · Yang Luo, ,https://arxiv.org/abs/2403.13512,,2403.13512.pdf,Scale Decoupled Distillation,"Logit knowledge distillation attracts increasing attention due to its +practicality in recent studies. However, it often suffers inferior performance +compared to the feature knowledge distillation. In this paper, we argue that +existing logit-based methods may be sub-optimal since they only leverage the +global logit output that couples multiple semantic knowledge. This may transfer +ambiguous knowledge to the student and mislead its learning. To this end, we +propose a simple but effective method, i.e., Scale Decoupled Distillation +(SDD), for logit knowledge distillation. SDD decouples the global logit output +into multiple local logit outputs and establishes distillation pipelines for +them. This helps the student to mine and inherit fine-grained and unambiguous +logit knowledge. Moreover, the decoupled knowledge can be further divided into +consistent and complementary logit knowledge that transfers the semantic +information and sample ambiguity, respectively. By increasing the weight of +complementary parts, SDD can guide the student to focus more on ambiguous +samples, improving its discrimination ability. Extensive experiments on several +benchmark datasets demonstrate the effectiveness of SDD for wide +teacher-student pairs, especially in the fine-grained classification task. Code +is available at: https://github.com/shicaiwei123/SDD-CVPR2024",cs.CV,"['cs.CV', 'cs.AI']" +SIRA: Scalable Inter-frame Relation and Association for Radar Perception,Ryoma Yataka · Pu (Perry) Wang · Petros Boufounos · Ryuhei Takahashi, ,,https://www.semanticscholar.org/paper/Radar-Perception-with-Scalable-Connective-Temporal-Yataka-Wang/78d83560c7e2aee39d8153bafc815482dcbd163e,,,,,nan +Know Your Neighbors: Improving Single-View Reconstruction via Spatial Vision-Language Reasoning,Rui Li · Tobias Fischer · Mattia Segu · Marc Pollefeys · Luc Van Gool · Federico Tombari, ,https://arxiv.org/abs/2404.03658,,2404.03658.pdf,Know Your Neighbors: Improving Single-View Reconstruction via Spatial Vision-Language Reasoning,"Recovering the 3D scene geometry from a single view is a fundamental yet +ill-posed problem in computer vision. While classical depth estimation methods +infer only a 2.5D scene representation limited to the image plane, recent +approaches based on radiance fields reconstruct a full 3D representation. +However, these methods still struggle with occluded regions since inferring +geometry without visual observation requires (i) semantic knowledge of the +surroundings, and (ii) reasoning about spatial context. We propose KYN, a novel +method for single-view scene reconstruction that reasons about semantic and +spatial context to predict each point's density. We introduce a vision-language +modulation module to enrich point features with fine-grained semantic +information. We aggregate point representations across the scene through a +language-guided spatial attention mechanism to yield per-point density +predictions aware of the 3D semantic context. We show that KYN improves 3D +shape recovery compared to predicting density for each 3D point in isolation. +We achieve state-of-the-art results in scene and object reconstruction on +KITTI-360, and show improved zero-shot generalization compared to prior work. 
+Project page: https://ruili3.github.io/kyn.",cs.CV,['cs.CV'] +Rich Human Feedback for Text-to-Image Generation,Youwei Liang · Junfeng He · Gang Li · Peizhao Li · Arseniy Klimovskiy · Nicholas Carolan · Jiao Sun · Jordi Pont-Tuset · Sarah Young · Feng Yang · Junjie Ke · Krishnamurthy Dvijotham · Katherine Collins · Yiwen Luo · Yang Li · Kai Kohlhoff · Deepak Ramachandran · Vidhya Navalpakkam, ,https://arxiv.org/abs/2312.10240,,2312.10240.pdf,Rich Human Feedback for Text-to-Image Generation,"Recent Text-to-Image (T2I) generation models such as Stable Diffusion and +Imagen have made significant progress in generating high-resolution images +based on text descriptions. However, many generated images still suffer from +issues such as artifacts/implausibility, misalignment with text descriptions, +and low aesthetic quality. Inspired by the success of Reinforcement Learning +with Human Feedback (RLHF) for large language models, prior works collected +human-provided scores as feedback on generated images and trained a reward +model to improve the T2I generation. In this paper, we enrich the feedback +signal by (i) marking image regions that are implausible or misaligned with the +text, and (ii) annotating which words in the text prompt are misrepresented or +missing on the image. We collect such rich human feedback on 18K generated +images (RichHF-18K) and train a multimodal transformer to predict the rich +feedback automatically. We show that the predicted rich human feedback can be +leveraged to improve image generation, for example, by selecting high-quality +training data to finetune and improve the generative models, or by creating +masks with predicted heatmaps to inpaint the problematic regions. Notably, the +improvements generalize to models (Muse) beyond those used to generate the +images on which human feedback data were collected (Stable Diffusion variants). +The RichHF-18K data set will be released in our GitHub repository: +https://github.com/google-research/google-research/tree/master/richhf_18k.",cs.CV,['cs.CV'] +"AvatarGPT: All-in-One Framework for Motion Understanding, Planning, Generation and Beyond",Zixiang Zhou · Yu Wan · Baoyuan Wang,https://github.com/zixiangzhou916/AvatarGPT,,https://www.semanticscholar.org/paper/AvatarGPT:-All-in-One-Framework-for-Motion-and-Zhou-Wan/b4e6f30ab07666dc7d485b24f072f2533609545c/figure/4,,,,,nan +SAFDNet: A Simple and Effective Network for Fully Sparse 3D Object Detection,Gang Zhang · Chen Junnan · Guohuan Gao · Jianmin Li · Si Liu · Xiaolin Hu, ,https://arxiv.org/abs/2403.05817,,2403.05817.pdf,SAFDNet: A Simple and Effective Network for Fully Sparse 3D Object Detection,"LiDAR-based 3D object detection plays an essential role in autonomous +driving. Existing high-performing 3D object detectors usually build dense +feature maps in the backbone network and prediction head. However, the +computational costs introduced by the dense feature maps grow quadratically as +the perception range increases, making these models hard to scale up to +long-range detection. Some recent works have attempted to construct fully +sparse detectors to solve this issue; nevertheless, the resulting models either +rely on a complex multi-stage pipeline or exhibit inferior performance. In this +work, we propose SAFDNet, a straightforward yet highly effective architecture, +tailored for fully sparse 3D object detection. In SAFDNet, an adaptive feature +diffusion strategy is designed to address the center feature missing problem. 
+We conducted extensive experiments on Waymo Open, nuScenes, and Argoverse2 +datasets. SAFDNet performed slightly better than the previous SOTA on the first +two datasets but much better on the last dataset, which features long-range +detection, verifying the efficacy of SAFDNet in scenarios where long-range +detection is required. Notably, on Argoverse2, SAFDNet surpassed the previous +best hybrid detector HEDNet by 2.6% mAP while being 2.1x faster, and yielded +2.1% mAP gains over the previous best sparse detector FSDv2 while being 1.3x +faster. The code will be available at https://github.com/zhanggang001/HEDNet.",cs.CV,['cs.CV'] +Neural Clustering based Visual Representation Learning,Guikun Chen · Xia Li · Yi Yang · Wenguan Wang,https://github.com/guikunchen/FEC,https://arxiv.org/abs/2403.17409,,2403.17409.pdf,Neural Clustering based Visual Representation Learning,"We investigate a fundamental aspect of machine vision: the measurement of +features, by revisiting clustering, one of the most classic approaches in +machine learning and data analysis. Existing visual feature extractors, +including ConvNets, ViTs, and MLPs, represent an image as rectangular regions. +Though prevalent, such a grid-style paradigm is built upon engineering practice +and lacks explicit modeling of data distribution. In this work, we propose +feature extraction with clustering (FEC), a conceptually elegant yet +surprisingly ad-hoc interpretable neural clustering framework, which views +feature extraction as a process of selecting representatives from data and thus +automatically captures the underlying data distribution. Given an image, FEC +alternates between grouping pixels into individual clusters to abstract +representatives and updating the deep features of pixels with current +representatives. Such an iterative working mechanism is implemented in the form +of several neural layers and the final representatives can be used for +downstream tasks. The cluster assignments across layers, which can be viewed +and inspected by humans, make the forward process of FEC fully transparent and +empower it with promising ad-hoc interpretability. Extensive experiments on +various visual recognition models and tasks verify the effectiveness, +generality, and interpretability of FEC. We expect this work will provoke a +rethink of the current de facto grid-style paradigm.",cs.CV,['cs.CV'] +Neural Redshift: Random Networks are not Random Functions,Damien Teney · Armand Nicolicioiu · Valentin Hartmann · Ehsan Abbasnejad, ,https://arxiv.org/abs/2403.02241,,2403.02241.pdf,Neural Redshift: Random Networks are not Random Functions,"Our understanding of the generalization capabilities of neural networks (NNs) +is still incomplete. Prevailing explanations are based on implicit biases of +gradient descent (GD) but they cannot account for the capabilities of models +from gradient-free methods nor the simplicity bias recently observed in +untrained networks. This paper seeks other sources of generalization in NNs. + Findings. To understand the inductive biases provided by architectures +independently from GD, we examine untrained, random-weight networks. Even +simple MLPs show strong inductive biases: uniform sampling in weight space +yields a very biased distribution of functions in terms of complexity. But +unlike common wisdom, NNs do not have an inherent ""simplicity bias"". This +property depends on components such as ReLUs, residual connections, and layer +normalizations. 
Alternative architectures can be built with a bias for any +level of complexity. Transformers also inherit all these properties from their +building blocks. + Implications. We provide a fresh explanation for the success of deep learning +independent from gradient-based training. It points at promising avenues for +controlling the solutions implemented by trained models.",cs.LG,"['cs.LG', 'cs.AI', 'cs.CV']" +Segment Any Event Streams via Weighted Adaptation of Pivotal Tokens,Zhiwen Chen · Zhiyu Zhu · Yifan Zhang · Junhui Hou · Guangming Shi · Jinjian Wu,https://github.com/happychenpipi/EventSAM/,https://arxiv.org/abs/2312.16222,,2312.16222.pdf,Segment Any Events via Weighted Adaptation of Pivotal Tokens,"In this paper, we delve into the nuanced challenge of tailoring the Segment +Anything Models (SAMs) for integration with event data, with the overarching +objective of attaining robust and universal object segmentation within the +event-centric domain. One pivotal issue at the heart of this endeavor is the +precise alignment and calibration of embeddings derived from event-centric data +such that they harmoniously coincide with those originating from RGB imagery. +Capitalizing on the vast repositories of datasets with paired events and RGB +images, our proposition is to harness and extrapolate the profound knowledge +encapsulated within the pre-trained SAM framework. As a cornerstone to +achieving this, we introduce a multi-scale feature distillation methodology. +This methodology rigorously optimizes the alignment of token embeddings +originating from event data with their RGB image counterparts, thereby +preserving and enhancing the robustness of the overall architecture. +Considering the distinct significance that token embeddings from intermediate +layers hold for higher-level embeddings, our strategy is centered on accurately +calibrating the pivotal token embeddings. This targeted calibration is aimed at +effectively managing the discrepancies in high-level embeddings originating +from both the event and image domains. Extensive experiments on different +datasets demonstrate the effectiveness of the proposed distillation method. +Code in http://github.com/happychenpipi/EventSAM.",cs.CV,['cs.CV'] +Continual Forgetting for Pre-trained Vision Models,Hongbo Zhao · Bolin Ni · Junsong Fan · Yuxi Wang · Yuntao Chen · Gaofeng Meng · Zhaoxiang Zhang,https://github.com/bjzhb666/GS-LoRA,https://arxiv.org/abs/2403.11530,,2403.11530.pdf,Continual Forgetting for Pre-trained Vision Models,"For privacy and security concerns, the need to erase unwanted information +from pre-trained vision models is becoming evident nowadays. In real-world +scenarios, erasure requests originate at any time from both users and model +owners. These requests usually form a sequence. Therefore, under such a +setting, selective information is expected to be continuously removed from a +pre-trained model while maintaining the rest. We define this problem as +continual forgetting and identify two key challenges. (i) For unwanted +knowledge, efficient and effective deleting is crucial. (ii) For remaining +knowledge, the impact brought by the forgetting procedure should be minimal. To +address them, we propose Group Sparse LoRA (GS-LoRA). Specifically, towards +(i), we use LoRA modules to fine-tune the FFN layers in Transformer blocks for +each forgetting task independently, and towards (ii), a simple group sparse +regularization is adopted, enabling automatic selection of specific LoRA groups +and zeroing out the others. 
GS-LoRA is effective, parameter-efficient, +data-efficient, and easy to implement. We conduct extensive experiments on face +recognition, object detection and image classification and demonstrate that +GS-LoRA manages to forget specific classes with minimal impact on other +classes. Codes will be released on \url{https://github.com/bjzhb666/GS-LoRA}.",cs.CV,['cs.CV'] +Seeing Unseen: Discover Novel Biomedical Concepts via Geometry-Constrained Probabilistic Modeling,Jianan Fan · Dongnan Liu · Hang Chang · Heng Huang · Mei Chen · Weidong Cai, ,https://arxiv.org/abs/2403.01053,,2403.01053.pdf,Seeing Unseen: Discover Novel Biomedical Concepts via Geometry-Constrained Probabilistic Modeling,"Machine learning holds tremendous promise for transforming the fundamental +practice of scientific discovery by virtue of its data-driven nature. With the +ever-increasing stream of research data collection, it would be appealing to +autonomously explore patterns and insights from observational data for +discovering novel classes of phenotypes and concepts. However, in the +biomedical domain, there are several challenges inherently presented in the +cumulated data which hamper the progress of novel class discovery. The +non-i.i.d. data distribution accompanied by the severe imbalance among +different groups of classes essentially leads to ambiguous and biased semantic +representations. In this work, we present a geometry-constrained probabilistic +modeling treatment to resolve the identified issues. First, we propose to +parameterize the approximated posterior of instance embedding as a marginal von +MisesFisher distribution to account for the interference of distributional +latent bias. Then, we incorporate a suite of critical geometric properties to +impose proper constraints on the layout of constructed embedding space, which +in turn minimizes the uncontrollable risk for unknown class learning and +structuring. Furthermore, a spectral graph-theoretic method is devised to +estimate the number of potential novel classes. It inherits two intriguing +merits compared to existent approaches, namely high computational efficiency +and flexibility for taxonomy-adaptive estimation. Extensive experiments across +various biomedical scenarios substantiate the effectiveness and general +applicability of our method.",cs.LG,"['cs.LG', 'cs.AI', 'cs.CV']" +Exploring Region-Word Alignment in Built-in Detector for Open-Vocabulary Object Detection,Heng Zhang · Qiuyu Zhao · Linyu Zheng · Hao Zeng · Zhiwei Ge · Tianhao Li · Sulong Xu, ,https://arxiv.org/abs/2310.16667,,2310.16667.pdf,CoDet: Co-Occurrence Guided Region-Word Alignment for Open-Vocabulary Object Detection,"Deriving reliable region-word alignment from image-text pairs is critical to +learn object-level vision-language representations for open-vocabulary object +detection. Existing methods typically rely on pre-trained or self-trained +vision-language models for alignment, which are prone to limitations in +localization accuracy or generalization capabilities. In this paper, we propose +CoDet, a novel approach that overcomes the reliance on pre-aligned +vision-language space by reformulating region-word alignment as a co-occurring +object discovery problem. Intuitively, by grouping images that mention a shared +concept in their captions, objects corresponding to the shared concept shall +exhibit high co-occurrence among the group. CoDet then leverages visual +similarities to discover the co-occurring objects and align them with the +shared concept. 
Extensive experiments demonstrate that CoDet has superior +performances and compelling scalability in open-vocabulary detection, e.g., by +scaling up the visual backbone, CoDet achieves 37.0 $\text{AP}^m_{novel}$ and +44.7 $\text{AP}^m_{all}$ on OV-LVIS, surpassing the previous SoTA by 4.2 +$\text{AP}^m_{novel}$ and 9.8 $\text{AP}^m_{all}$. Code is available at +https://github.com/CVMI-Lab/CoDet.",cs.CV,['cs.CV'] +Blind Image Quality Assessment Based on Geometric Order Learning,Nyeong-Ho Shin · Seon-Ho Lee · Chang-Su Kim, ,https://arxiv.org/abs/2404.14949,,2404.14949.pdf,Multi-Modal Prompt Learning on Blind Image Quality Assessment,"Image Quality Assessment (IQA) models benefit significantly from semantic +information, which allows them to treat different types of objects distinctly. +Currently, leveraging semantic information to enhance IQA is a crucial research +direction. Traditional methods, hindered by a lack of sufficiently annotated +data, have employed the CLIP image-text pretraining model as their backbone to +gain semantic awareness. However, the generalist nature of these pre-trained +Vision-Language (VL) models often renders them suboptimal for IQA-specific +tasks. Recent approaches have attempted to address this mismatch using prompt +technology, but these solutions have shortcomings. Existing prompt-based VL +models overly focus on incremental semantic information from text, neglecting +the rich insights available from visual data analysis. This imbalance limits +their performance improvements in IQA tasks. This paper introduces an +innovative multi-modal prompt-based methodology for IQA. Our approach employs +carefully crafted prompts that synergistically mine incremental semantic +information from both visual and linguistic data. Specifically, in the visual +branch, we introduce a multi-layer prompt structure to enhance the VL model's +adaptability. In the text branch, we deploy a dual-prompt scheme that steers +the model to recognize and differentiate between scene category and distortion +type, thereby refining the model's capacity to assess image quality. Our +experimental findings underscore the effectiveness of our method over existing +Blind Image Quality Assessment (BIQA) approaches. Notably, it demonstrates +competitive performance across various datasets. Our method achieves Spearman +Rank Correlation Coefficient (SRCC) values of 0.961(surpassing 0.946 in CSIQ) +and 0.941 (exceeding 0.930 in KADID), illustrating its robustness and accuracy +in diverse contexts.",cs.CV,['cs.CV'] +MemoNav: Working Memory Model for Visual Navigation,Hongxin Li · Zeyu Wang · Xu Yang · yuran Yang · Shuqi Mei · Zhaoxiang Zhang,https://github.com/ZJULiHongxin/MemoNav,https://arxiv.org/abs/2402.19161v1,,2402.19161v1.pdf,MemoNav: Working Memory Model for Visual Navigation,"Image-goal navigation is a challenging task that requires an agent to +navigate to a goal indicated by an image in unfamiliar environments. Existing +methods utilizing diverse scene memories suffer from inefficient exploration +since they use all historical observations for decision-making without +considering the goal-relevant fraction. To address this limitation, we present +MemoNav, a novel memory model for image-goal navigation, which utilizes a +working memory-inspired pipeline to improve navigation performance. +Specifically, we employ three types of navigation memory. The node features on +a map are stored in the short-term memory (STM), as these features are +dynamically updated. 
A forgetting module then retains the informative STM +fraction to increase efficiency. We also introduce long-term memory (LTM) to +learn global scene representations by progressively aggregating STM features. +Subsequently, a graph attention module encodes the retained STM and the LTM to +generate working memory (WM) which contains the scene features essential for +efficient navigation. The synergy among these three memory types boosts +navigation performance by enabling the agent to learn and leverage +goal-relevant scene features within a topological map. Our evaluation on +multi-goal tasks demonstrates that MemoNav significantly outperforms previous +methods across all difficulty levels in both Gibson and Matterport3D scenes. +Qualitative results further illustrate that MemoNav plans more efficient +routes.",cs.CV,"['cs.CV', 'cs.AI', 'cs.RO']" +Generate Subgoal Images before Act: Unlocking the Chain-of-Thought Reasoning in Diffusion Model for Robot Manipulation with Multimodal Prompts,Fei Ni · Jianye Hao · Shiguang Wu · Longxin Kou · Jiashun Liu · YAN ZHENG · Bin Wang · Yuzheng Zhuang, ,,https://pub.towardsai.net/ai-robotics-breakthroughs-and-trends-at-cvpr-2024-d4a83b5f9564,,,,,nan +RCBEVDet: Radar-camera Fusion in Bird’s Eye View for 3D Object Detection,Zhiwei Lin · Zhe Liu · Zhongyu Xia · Xinhao Wang · Yongtao Wang · Shengxiang Qi · Yang Dong · Nan Dong · Le Zhang · Ce Zhu, ,https://arxiv.org/abs/2403.16440,,2403.16440.pdf,RCBEVDet: Radar-camera Fusion in Bird's Eye View for 3D Object Detection,"Three-dimensional object detection is one of the key tasks in autonomous +driving. To reduce costs in practice, low-cost multi-view cameras for 3D object +detection are proposed to replace the expansive LiDAR sensors. However, relying +solely on cameras is difficult to achieve highly accurate and robust 3D object +detection. An effective solution to this issue is combining multi-view cameras +with the economical millimeter-wave radar sensor to achieve more reliable +multi-modal 3D object detection. In this paper, we introduce RCBEVDet, a +radar-camera fusion 3D object detection method in the bird's eye view (BEV). +Specifically, we first design RadarBEVNet for radar BEV feature extraction. +RadarBEVNet consists of a dual-stream radar backbone and a Radar Cross-Section +(RCS) aware BEV encoder. In the dual-stream radar backbone, a point-based +encoder and a transformer-based encoder are proposed to extract radar features, +with an injection and extraction module to facilitate communication between the +two encoders. The RCS-aware BEV encoder takes RCS as the object size prior to +scattering the point feature in BEV. Besides, we present the Cross-Attention +Multi-layer Fusion module to automatically align the multi-modal BEV feature +from radar and camera with the deformable attention mechanism, and then fuse +the feature with channel and spatial fusion layers. Experimental results show +that RCBEVDet achieves new state-of-the-art radar-camera fusion results on +nuScenes and view-of-delft (VoD) 3D object detection benchmarks. Furthermore, +RCBEVDet achieves better 3D detection results than all real-time camera-only +and radar-camera 3D object detectors with a faster inference speed at 21~28 +FPS. 
The source code will be released at https://github.com/VDIGPKU/RCBEVDet.",cs.CV,['cs.CV'] +Instance-based Max-margin for Practical Few-shot Recognition,Minghao Fu · Ke Zhu,https://github.com/heekhero/IbM2,https://arxiv.org/abs/2312.07856,,2312.07856.pdf,DTL: Disentangled Transfer Learning for Visual Recognition,"When pre-trained models become rapidly larger, the cost of fine-tuning on +downstream tasks steadily increases, too. To economically fine-tune these +models, parameter-efficient transfer learning (PETL) is proposed, which only +tunes a tiny subset of trainable parameters to efficiently learn quality +representations. However, current PETL methods are facing the dilemma that +during training the GPU memory footprint is not effectively reduced as +trainable parameters. PETL will likely fail, too, if the full fine-tuning +encounters the out-of-GPU-memory issue. This phenomenon happens because +trainable parameters from these methods are generally entangled with the +backbone, such that a lot of intermediate states have to be stored in GPU +memory for gradient propagation. To alleviate this problem, we introduce +Disentangled Transfer Learning (DTL), which disentangles the trainable +parameters from the backbone using a lightweight Compact Side Network (CSN). By +progressively extracting task-specific information with a few low-rank linear +mappings and appropriately adding the information back to the backbone, CSN +effectively realizes knowledge transfer in various downstream tasks. We +conducted extensive experiments to validate the effectiveness of our method. +The proposed method not only reduces a large amount of GPU memory usage and +trainable parameters, but also outperforms existing PETL methods by a +significant margin in accuracy, achieving new state-of-the-art on several +standard benchmarks. The code is available at https://github.com/heekhero/DTL.",cs.CV,"['cs.CV', 'cs.AI']" +TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding,Shuhuai Ren · Linli Yao · Shicheng Li · Xu Sun · Lu Hou,https://github.com/RenShuhuai-Andy/TimeChat,https://arxiv.org/abs/2312.02051,,2312.02051.pdf,TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding,"This work proposes TimeChat, a time-sensitive multimodal large language model +specifically designed for long video understanding. Our model incorporates two +key architectural contributions: (1) a timestamp-aware frame encoder that binds +visual content with the timestamp of each frame, and (2) a sliding video +Q-Former that produces a video token sequence of varying lengths to accommodate +videos of various durations. Additionally, we construct an instruction-tuning +dataset, encompassing 6 tasks and a total of 125K instances, to further enhance +TimeChat's instruction-following performance. Experiment results across various +video understanding tasks, such as dense captioning, temporal grounding, and +highlight detection, demonstrate TimeChat's strong zero-shot temporal +localization and reasoning capabilities. 
For example, it achieves +9.2 F1 score +and +2.8 CIDEr on YouCook2, +5.8 HIT@1 on QVHighlights, and +27.5 R@1 (IoU=0.5) +on Charades-STA, compared to state-of-the-art video large language models, +holding the potential to serve as a versatile video assistant for long-form +video comprehension tasks and satisfy realistic user requirements.",cs.CV,"['cs.CV', 'cs.AI', 'cs.CL']" +G-HOP: Generative Hand-Object Prior for Interaction Reconstruction and Grasp Synthesis,Yufei Ye · Abhinav Gupta · Kris Kitani · Shubham Tulsiani, ,https://arxiv.org/abs/2404.12383,,2404.12383.pdf,G-HOP: Generative Hand-Object Prior for Interaction Reconstruction and Grasp Synthesis,"We propose G-HOP, a denoising diffusion based generative prior for +hand-object interactions that allows modeling both the 3D object and a human +hand, conditioned on the object category. To learn a 3D spatial diffusion model +that can capture this joint distribution, we represent the human hand via a +skeletal distance field to obtain a representation aligned with the (latent) +signed distance field for the object. We show that this hand-object prior can +then serve as generic guidance to facilitate other tasks like reconstruction +from interaction clip and human grasp synthesis. We believe that our model, +trained by aggregating seven diverse real-world interaction datasets spanning +across 155 categories, represents a first approach that allows jointly +generating both hand and object. Our empirical evaluations demonstrate the +benefit of this joint prior in video-based reconstruction and human grasp +synthesis, outperforming current task-specific baselines. + Project website: https://judyye.github.io/ghop-www",cs.CV,['cs.CV'] +VTimeLLM: Empower LLM to Grasp Video Moments,Bin Huang · Xin Wang · Hong Chen · Zihan Song · Wenwu Zhu, ,https://arxiv.org/abs/2311.18445v1,,2311.18445v1.pdf,VTimeLLM: Empower LLM to Grasp Video Moments,"Large language models (LLMs) have shown remarkable text understanding +capabilities, which have been extended as Video LLMs to handle video data for +comprehending visual details. However, existing Video LLMs can only provide a +coarse description of the entire video, failing to capture the precise start +and end time boundary of specific events. In this paper, we solve this issue +via proposing VTimeLLM, a novel Video LLM designed for fine-grained video +moment understanding and reasoning with respect to time boundary. Specifically, +our VTimeLLM adopts a boundary-aware three-stage training strategy, which +respectively utilizes image-text pairs for feature alignment, multiple-event +videos to increase temporal-boundary awareness, and high-quality +video-instruction tuning to further improve temporal understanding ability as +well as align with human intents. Extensive experiments demonstrate that in +fine-grained time-related comprehension tasks for videos such as Temporal Video +Grounding and Dense Video Captioning, VTimeLLM significantly outperforms +existing Video LLMs. 
Besides, benefits from the fine-grained temporal +understanding of the videos further enable VTimeLLM to beat existing Video LLMs +in video dialogue benchmark, showing its superior cross-modal understanding and +reasoning abilities.",cs.CV,['cs.CV'] +DITTO: Dual and Integrated Latent Topologies for Implicit 3D Reconstruction,Jaehyeok Shim · Kyungdon Joo, ,https://arxiv.org/abs/2403.05005,,2403.05005.pdf,DITTO: Dual and Integrated Latent Topologies for Implicit 3D Reconstruction,"We propose a novel concept of dual and integrated latent topologies (DITTO in +short) for implicit 3D reconstruction from noisy and sparse point clouds. Most +existing methods predominantly focus on single latent type, such as point or +grid latents. In contrast, the proposed DITTO leverages both point and grid +latents (i.e., dual latent) to enhance their strengths, the stability of grid +latents and the detail-rich capability of point latents. Concretely, DITTO +consists of dual latent encoder and integrated implicit decoder. In the dual +latent encoder, a dual latent layer, which is the key module block composing +the encoder, refines both latents in parallel, maintaining their distinct +shapes and enabling recursive interaction. Notably, a newly proposed dynamic +sparse point transformer within the dual latent layer effectively refines point +latents. Then, the integrated implicit decoder systematically combines these +refined latents, achieving high-fidelity 3D reconstruction and surpassing +previous state-of-the-art methods on object- and scene-level datasets, +especially in thin and detailed structures.",cs.CV,['cs.CV'] +Mind The Edge: Refining Depth Edges in Sparsely-Supervised Monocular Depth Estimation,Lior Talker · Aviad Cohen · Erez Yosef · Alexandra Dana · Michael Dinerstein,https://github.com/liortalker/MindTheEdge,,https://www.youtube.com/watch?v=WPmbAnJk3rE,,,,,nan +StyleCineGAN: Landscape Cinemagraph Generation using a Pre-trained StyleGAN,Jongwoo Choi · Kwanggyoon Seo · Amirsaman Ashtari · Junyong Noh,https://jeolpyeoni.github.io/stylecinegan_project/,https://arxiv.org/abs/2403.14186,,2403.14186.pdf,StyleCineGAN: Landscape Cinemagraph Generation using a Pre-trained StyleGAN,"We propose a method that can generate cinemagraphs automatically from a still +landscape image using a pre-trained StyleGAN. Inspired by the success of recent +unconditional video generation, we leverage a powerful pre-trained image +generator to synthesize high-quality cinemagraphs. Unlike previous approaches +that mainly utilize the latent space of a pre-trained StyleGAN, our approach +utilizes its deep feature space for both GAN inversion and cinemagraph +generation. Specifically, we propose multi-scale deep feature warping (MSDFW), +which warps the intermediate features of a pre-trained StyleGAN at different +resolutions. By using MSDFW, the generated cinemagraphs are of high resolution +and exhibit plausible looping animation. 
We demonstrate the superiority of our +method through user studies and quantitative comparisons with state-of-the-art +cinemagraph generation methods and a video generation method that uses a +pre-trained StyleGAN.",cs.CV,"['cs.CV', 'cs.AI', 'cs.GR']" +Decoupled Pseudo-labeling in Semi-Supervised Monocular 3D Object Detection,Jiacheng Zhang · Jiaming Li · Xiangru Lin · Wei Zhang · Xiao Tan · Junyu Han · Errui Ding · Jingdong Wang · Guanbin Li, ,https://arxiv.org/abs/2403.17387,,2403.17387.pdf,Decoupled Pseudo-labeling for Semi-Supervised Monocular 3D Object Detection,"We delve into pseudo-labeling for semi-supervised monocular 3D object +detection (SSM3OD) and discover two primary issues: a misalignment between the +prediction quality of 3D and 2D attributes and the tendency of depth +supervision derived from pseudo-labels to be noisy, leading to significant +optimization conflicts with other reliable forms of supervision. We introduce a +novel decoupled pseudo-labeling (DPL) approach for SSM3OD. Our approach +features a Decoupled Pseudo-label Generation (DPG) module, designed to +efficiently generate pseudo-labels by separately processing 2D and 3D +attributes. This module incorporates a unique homography-based method for +identifying dependable pseudo-labels in BEV space, specifically for 3D +attributes. Additionally, we present a DepthGradient Projection (DGP) module to +mitigate optimization conflicts caused by noisy depth supervision of +pseudo-labels, effectively decoupling the depth gradient and removing +conflicting gradients. This dual decoupling strategy-at both the pseudo-label +generation and gradient levels-significantly improves the utilization of +pseudo-labels in SSM3OD. Our comprehensive experiments on the KITTI benchmark +demonstrate the superiority of our method over existing approaches.",cs.CV,['cs.CV'] +T-VSL: Text-Guided Visual Sound Source Localization in Mixtures,Tanvir Mahmud · Yapeng Tian · Diana Marculescu, ,https://arxiv.org/abs/2404.01751v1,,2404.01751v1.pdf,T-VSL: Text-Guided Visual Sound Source Localization in Mixtures,"Visual sound source localization poses a significant challenge in identifying +the semantic region of each sounding source within a video. Existing +self-supervised and weakly supervised source localization methods struggle to +accurately distinguish the semantic regions of each sounding object, +particularly in multi-source mixtures. These methods often rely on audio-visual +correspondence as guidance, which can lead to substantial performance drops in +complex multi-source localization scenarios. The lack of access to individual +source sounds in multi-source mixtures during training exacerbates the +difficulty of learning effective audio-visual correspondence for localization. +To address this limitation, in this paper, we propose incorporating the text +modality as an intermediate feature guide using tri-modal joint embedding +models (e.g., AudioCLIP) to disentangle the semantic audio-visual source +correspondence in multi-source mixtures. Our framework, dubbed T-VSL, begins by +predicting the class of sounding entities in mixtures. Subsequently, the +textual representation of each sounding source is employed as guidance to +disentangle fine-grained audio-visual source correspondence from multi-source +mixtures, leveraging the tri-modal AudioCLIP embedding. This approach enables +our framework to handle a flexible number of sources and exhibits promising +zero-shot transferability to unseen classes during test time. 
Extensive +experiments conducted on the MUSIC, VGGSound, and VGGSound-Instruments datasets +demonstrate significant performance improvements over state-of-the-art methods.",cs.CV,"['cs.CV', 'cs.SD', 'eess.AS']" +USE: Universal Segment Embeddings for Open-Vocabulary Image Segmentation,Xiaoqi Wang · Wenbin He · Xiwei Xuan · Clint Sebastian · Jorge Piazentin Ono · Xin Li · Sima Behpour · Thang Doan · Liang Gou · Shen · Liu Ren, ,http://export.arxiv.org/abs/2307.00764,,2307.00764.pdf,Hierarchical Open-vocabulary Universal Image Segmentation,"Open-vocabulary image segmentation aims to partition an image into semantic +regions according to arbitrary text descriptions. However, complex visual +scenes can be naturally decomposed into simpler parts and abstracted at +multiple levels of granularity, introducing inherent segmentation ambiguity. +Unlike existing methods that typically sidestep this ambiguity and treat it as +an external factor, our approach actively incorporates a hierarchical +representation encompassing different semantic-levels into the learning +process. We propose a decoupled text-image fusion mechanism and representation +learning modules for both ""things"" and ""stuff"". Additionally, we systematically +examine the differences that exist in the textual and visual features between +these types of categories. Our resulting model, named HIPIE, tackles +HIerarchical, oPen-vocabulary, and unIvErsal segmentation tasks within a +unified framework. Benchmarked on over 40 datasets, e.g., ADE20K, COCO, +Pascal-VOC Part, RefCOCO/RefCOCOg, ODinW and SeginW, HIPIE achieves the +state-of-the-art results at various levels of image comprehension, including +semantic-level (e.g., semantic segmentation), instance-level (e.g., +panoptic/referring segmentation and object detection), as well as part-level +(e.g., part/subpart segmentation) tasks. Our code is released at +https://github.com/berkeley-hipie/HIPIE.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +DVMNet: Computing Relative Pose for Unseen Objects Beyond Hypotheses,Chen Zhao · Tong Zhang · Zheng Dang · Mathieu Salzmann, ,https://arxiv.org/abs/2403.13683,,2403.13683.pdf,DVMNet: Computing Relative Pose for Unseen Objects Beyond Hypotheses,"Determining the relative pose of an object between two images is pivotal to +the success of generalizable object pose estimation. Existing approaches +typically approximate the continuous pose representation with a large number of +discrete pose hypotheses, which incurs a computationally expensive process of +scoring each hypothesis at test time. By contrast, we present a Deep Voxel +Matching Network (DVMNet) that eliminates the need for pose hypotheses and +computes the relative object pose in a single pass. To this end, we map the two +input RGB images, reference and query, to their respective voxelized 3D +representations. We then pass the resulting voxels through a pose estimation +module, where the voxels are aligned and the pose is computed in an end-to-end +fashion by solving a least-squares problem. To enhance robustness, we introduce +a weighted closest voxel algorithm capable of mitigating the impact of noisy +voxels. We conduct extensive experiments on the CO3D, LINEMOD, and Objaverse +datasets, demonstrating that our method delivers more accurate relative pose +estimates for novel objects at a lower computational cost compared to +state-of-the-art methods. 
Our code is released at: +https://github.com/sailor-z/DVMNet/.",cs.CV,"['cs.CV', 'cs.RO']" +pixelSplat: 3D Gaussian Splats from Image Pairs for Scalable Generalizable 3D Reconstruction,David Charatan · Sizhe Lester Li · Andrea Tagliasacchi · Vincent Sitzmann, ,https://arxiv.org/abs/2312.12337,,2312.12337.pdf,pixelSplat: 3D Gaussian Splats from Image Pairs for Scalable Generalizable 3D Reconstruction,"We introduce pixelSplat, a feed-forward model that learns to reconstruct 3D +radiance fields parameterized by 3D Gaussian primitives from pairs of images. +Our model features real-time and memory-efficient rendering for scalable +training as well as fast 3D reconstruction at inference time. To overcome local +minima inherent to sparse and locally supported representations, we predict a +dense probability distribution over 3D and sample Gaussian means from that +probability distribution. We make this sampling operation differentiable via a +reparameterization trick, allowing us to back-propagate gradients through the +Gaussian splatting representation. We benchmark our method on wide-baseline +novel view synthesis on the real-world RealEstate10k and ACID datasets, where +we outperform state-of-the-art light field transformers and accelerate +rendering by 2.5 orders of magnitude while reconstructing an interpretable and +editable 3D radiance field.",cs.CV,"['cs.CV', 'cs.LG']" +MicroCinema: A Divide-and-Conquer Approach for Text-to-Video Generation,Yanhui Wang · Jianmin Bao · Wenming Weng · Ruoyu Feng · Dacheng Yin · Tao Yang · Jingxu Zhang · Qi Dai · Zhiyuan Zhao · Chunyu Wang · Kai Qiu · Yuhui Yuan · Xiaoyan Sun · Chong Luo · Baining Guo, ,https://arxiv.org/abs/2311.18829,,2311.18829.pdf,MicroCinema: A Divide-and-Conquer Approach for Text-to-Video Generation,"We present MicroCinema, a straightforward yet effective framework for +high-quality and coherent text-to-video generation. Unlike existing approaches +that align text prompts with video directly, MicroCinema introduces a +Divide-and-Conquer strategy which divides the text-to-video into a two-stage +process: text-to-image generation and image\&text-to-video generation. This +strategy offers two significant advantages. a) It allows us to take full +advantage of the recent advances in text-to-image models, such as Stable +Diffusion, Midjourney, and DALLE, to generate photorealistic and highly +detailed images. b) Leveraging the generated image, the model can allocate less +focus to fine-grained appearance details, prioritizing the efficient learning +of motion dynamics. To implement this strategy effectively, we introduce two +core designs. First, we propose the Appearance Injection Network, enhancing the +preservation of the appearance of the given image. Second, we introduce the +Appearance Noise Prior, a novel mechanism aimed at maintaining the capabilities +of pre-trained 2D diffusion models. These design elements empower MicroCinema +to generate high-quality videos with precise motion, guided by the provided +text prompts. Extensive experiments demonstrate the superiority of the proposed +framework. Concretely, MicroCinema achieves SOTA zero-shot FVD of 342.86 on +UCF-101 and 377.40 on MSR-VTT. 
See +https://wangyanhui666.github.io/MicroCinema.github.io/ for video samples.",cs.CV,['cs.CV'] +Domain Prompt Learning with Quaternion Networks,Qinglong Cao · Zhengqin Xu · Yuntian Chen · Chao Ma · Xiaokang Yang, ,https://arxiv.org/abs/2312.08878,,2312.08878.pdf,Domain Prompt Learning with Quaternion Networks,"Prompt learning has emerged as an effective and data-efficient technique in +large Vision-Language Models (VLMs). However, when adapting VLMs to specialized +domains such as remote sensing and medical imaging, domain prompt learning +remains underexplored. While large-scale domain-specific foundation models can +help tackle this challenge, their concentration on a single vision level makes +it challenging to prompt both vision and language modalities. To overcome this, +we propose to leverage domain-specific knowledge from domain-specific +foundation models to transfer the robust recognition ability of VLMs from +generalized to specialized domains, using quaternion networks. Specifically, +the proposed method involves using domain-specific vision features from +domain-specific foundation models to guide the transformation of generalized +contextual embeddings from the language branch into a specialized space within +the quaternion networks. Moreover, we present a hierarchical approach that +generates vision prompt features by analyzing intermodal relationships between +hierarchical language prompt features and domain-specific vision features. In +this way, quaternion networks can effectively mine the intermodal relationships +in the specific domain, facilitating domain-specific vision-language +contrastive learning. Extensive experiments on domain-specific datasets show +that our proposed method achieves new state-of-the-art results in prompt +learning.",cs.CV,"['cs.CV', 'cs.LG', 'stat.AP']" +MS-MANO: Enabling Hand Pose Tracking with Biomechanical Constraints,Pengfei Xie · Wenqiang Xu · Tutian Tang · Zhenjun Yu · Cewu Lu, ,https://arxiv.org/abs/2404.10227,,2404.10227.pdf,MS-MANO: Enabling Hand Pose Tracking with Biomechanical Constraints,"This work proposes a novel learning framework for visual hand dynamics +analysis that takes into account the physiological aspects of hand motion. The +existing models, which are simplified joint-actuated systems, often produce +unnatural motions. To address this, we integrate a musculoskeletal system with +a learnable parametric hand model, MANO, to create a new model, MS-MANO. This +model emulates the dynamics of muscles and tendons to drive the skeletal +system, imposing physiologically realistic constraints on the resulting torque +trajectories. We further propose a simulation-in-the-loop pose refinement +framework, BioPR, that refines the initial estimated pose through a multi-layer +perceptron (MLP) network. Our evaluation of the accuracy of MS-MANO and the +efficacy of the BioPR is conducted in two separate parts. The accuracy of +MS-MANO is compared with MyoSuite, while the efficacy of BioPR is benchmarked +against two large-scale public datasets and two recent state-of-the-art +methods. 
The results demonstrate that our approach consistently improves the +baseline methods both quantitatively and qualitatively.",cs.CV,"['cs.CV', 'cs.RO']" +JDEC: JPEG Decoding via Enhanced Continuous Cosine Coefficients,Woo Kyoung Han · Sunghoon Im · Jaedeok Kim · Kyong Hwan Jin,https://wookyounghan.github.io/JDEC/,https://arxiv.org/abs/2404.05558,,2404.05558.pdf,JDEC: JPEG Decoding via Enhanced Continuous Cosine Coefficients,"We propose a practical approach to JPEG image decoding, utilizing a local +implicit neural representation with continuous cosine formulation. The JPEG +algorithm significantly quantizes discrete cosine transform (DCT) spectra to +achieve a high compression rate, inevitably resulting in quality degradation +while encoding an image. We have designed a continuous cosine spectrum +estimator to address the quality degradation issue that restores the distorted +spectrum. By leveraging local DCT formulations, our network has the privilege +to exploit dequantization and upsampling simultaneously. Our proposed model +enables decoding compressed images directly across different quality factors +using a single pre-trained model without relying on a conventional JPEG +decoder. As a result, our proposed network achieves state-of-the-art +performance in flexible color image JPEG artifact removal tasks. Our source +code is available at https://github.com/WooKyoungHan/JDEC.",eess.IV,"['eess.IV', 'cs.CV']" +Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation,Sihan liu · Yiwei Ma · Xiaoqing Zhang · Haowei Wang · Jiayi Ji · Xiaoshuai Sun · Rongrong Ji, ,https://arxiv.org/abs/2312.12470,,2312.12470.pdf,Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation,"Referring Remote Sensing Image Segmentation (RRSIS) is a new challenge that +combines computer vision and natural language processing, delineating specific +regions in aerial images as described by textual queries. Traditional Referring +Image Segmentation (RIS) approaches have been impeded by the complex spatial +scales and orientations found in aerial imagery, leading to suboptimal +segmentation results. To address these challenges, we introduce the Rotated +Multi-Scale Interaction Network (RMSIN), an innovative approach designed for +the unique demands of RRSIS. RMSIN incorporates an Intra-scale Interaction +Module (IIM) to effectively address the fine-grained detail required at +multiple scales and a Cross-scale Interaction Module (CIM) for integrating +these details coherently across the network. Furthermore, RMSIN employs an +Adaptive Rotated Convolution (ARC) to account for the diverse orientations of +objects, a novel contribution that significantly enhances segmentation +accuracy. To assess the efficacy of RMSIN, we have curated an expansive dataset +comprising 17,402 image-caption-mask triplets, which is unparalleled in terms +of scale and variety. This dataset not only presents the model with a wide +range of spatial and rotational scenarios but also establishes a stringent +benchmark for the RRSIS task, ensuring a rigorous evaluation of performance. +Our experimental evaluations demonstrate the exceptional performance of RMSIN, +surpassing existing state-of-the-art models by a significant margin. 
All +datasets and code are made available at https://github.com/Lsan2401/RMSIN.",cs.CV,['cs.CV'] +"IDGuard: Robust, General, Identity-centric POI Proactive Defense Against Face Editing Abuse",Yunshu Dai · Jianwei Fei · Fangjun Huang, ,https://arxiv.org/abs/2311.01357,,2311.01357.pdf,Robust Identity Perceptual Watermark Against Deepfake Face Swapping,"Notwithstanding offering convenience and entertainment to society, Deepfake +face swapping has caused critical privacy issues with the rapid development of +deep generative models. Due to imperceptible artifacts in high-quality +synthetic images, passive detection models against face swapping in recent +years usually suffer performance damping regarding the generalizability issue. +Therefore, several studies have been attempted to proactively protect the +original images against malicious manipulations by inserting invisible signals +in advance. However, the existing proactive defense approaches demonstrate +unsatisfactory results with respect to visual quality, detection accuracy, and +source tracing ability. In this study, to fulfill the research gap, we propose +the first robust identity perceptual watermarking framework that concurrently +performs detection and source tracing against Deepfake face swapping +proactively. We assign identity semantics regarding the image contents to the +watermarks and devise an unpredictable and nonreversible chaotic encryption +system to ensure watermark confidentiality. The watermarks are encoded and +recovered by jointly training an encoder-decoder framework along with +adversarial image manipulations. Falsification and source tracing are +accomplished by justifying the consistency between the content-matched identity +perceptual watermark and the recovered robust watermark from the image. +Extensive experiments demonstrate state-of-the-art detection performance on +Deepfake face swapping under both cross-dataset and cross-manipulation +settings.",cs.CV,['cs.CV'] +Tri-Perspective View Decomposition for Geometry-Aware Depth Completion,Zhiqiang Yan · Yuankai Lin · Kun Wang · Yupeng Zheng · Yufei Wang · Zhenyu Zhang · Jun Li · Jian Yang, ,https://arxiv.org/abs/2403.15008,,2403.15008.pdf,Tri-Perspective View Decomposition for Geometry-Aware Depth Completion,"Depth completion is a vital task for autonomous driving, as it involves +reconstructing the precise 3D geometry of a scene from sparse and noisy depth +measurements. However, most existing methods either rely only on 2D depth +representations or directly incorporate raw 3D point clouds for compensation, +which are still insufficient to capture the fine-grained 3D geometry of the +scene. To address this challenge, we introduce Tri-Perspective view +Decomposition (TPVD), a novel framework that can explicitly model 3D geometry. +In particular, (1) TPVD ingeniously decomposes the original point cloud into +three 2D views, one of which corresponds to the sparse depth input. (2) We +design TPV Fusion to update the 2D TPV features through recurrent 2D-3D-2D +aggregation, where a Distance-Aware Spherical Convolution (DASC) is applied. +(3) By adaptively choosing TPV affinitive neighbors, the newly proposed +Geometric Spatial Propagation Network (GSPN) further improves the geometric +consistency. As a result, our TPVD outperforms existing methods on KITTI, +NYUv2, and SUN RGBD. Furthermore, we build a novel depth completion dataset +named TOFDC, which is acquired by the time-of-flight (TOF) sensor and the color +camera on smartphones. 
Project page: +https://yanzq95.github.io/projectpage/TOFDC/index.html",cs.CV,['cs.CV'] +Towards Understanding Cross and Self-Attention in Stable Diffusion for Text-Guided Image Editing,Bingyan Liu · Chengyu Wang · Tingfeng Cao · Kui Jia · Jun Huang, ,https://arxiv.org/abs/2403.03431,,2403.03431.pdf,Towards Understanding Cross and Self-Attention in Stable Diffusion for Text-Guided Image Editing,"Deep Text-to-Image Synthesis (TIS) models such as Stable Diffusion have +recently gained significant popularity for creative Text-to-image generation. +Yet, for domain-specific scenarios, tuning-free Text-guided Image Editing (TIE) +is of greater importance for application developers, which modify objects or +object properties in images by manipulating feature components in attention +layers during the generation process. However, little is known about what +semantic meanings these attention layers have learned and which parts of the +attention maps contribute to the success of image editing. In this paper, we +conduct an in-depth probing analysis and demonstrate that cross-attention maps +in Stable Diffusion often contain object attribution information that can +result in editing failures. In contrast, self-attention maps play a crucial +role in preserving the geometric and shape details of the source image during +the transformation to the target image. Our analysis offers valuable insights +into understanding cross and self-attention maps in diffusion models. Moreover, +based on our findings, we simplify popular image editing methods and propose a +more straightforward yet more stable and efficient tuning-free procedure that +only modifies self-attention maps of the specified attention layers during the +denoising process. Experimental results show that our simplified method +consistently surpasses the performance of popular approaches on multiple +datasets.",cs.CV,['cs.CV'] +RoHM: Robust Human Motion Reconstruction via Diffusion,Siwei Zhang · Bharat Lal Bhatnagar · Yuanlu Xu · Alexander Winkler · Petr Kadlecek · Siyu Tang · Federica Bogo,https://sanweiliti.github.io/ROHM/ROHM.html,https://arxiv.org/abs/2401.08570,,2401.08570.pdf,RoHM: Robust Human Motion Reconstruction via Diffusion,"We propose RoHM, an approach for robust 3D human motion reconstruction from +monocular RGB(-D) videos in the presence of noise and occlusions. Most previous +approaches either train neural networks to directly regress motion in 3D or +learn data-driven motion priors and combine them with optimization at test +time. The former do not recover globally coherent motion and fail under +occlusions; the latter are time-consuming, prone to local minima, and require +manual tuning. To overcome these shortcomings, we exploit the iterative, +denoising nature of diffusion models. RoHM is a novel diffusion-based motion +model that, conditioned on noisy and occluded input data, reconstructs +complete, plausible motions in consistent global coordinates. Given the +complexity of the problem -- requiring one to address different tasks +(denoising and infilling) in different solution spaces (local and global +motion) -- we decompose it into two sub-tasks and learn two models, one for +global trajectory and one for local motion. To capture the correlations between +the two, we then introduce a novel conditioning module, combining it with an +iterative inference scheme. We apply RoHM to a variety of tasks -- from motion +reconstruction and denoising to spatial and temporal infilling. 
Extensive +experiments on three popular datasets show that our method outperforms +state-of-the-art approaches qualitatively and quantitatively, while being +faster at test time. The code is available at +https://sanweiliti.github.io/ROHM/ROHM.html.",cs.CV,['cs.CV'] +Abductive Ego-View Accident Video Understanding for Safe Driving Perception,Jianwu Fang · Lei-lei Li · Junfei Zhou · Junbin Xiao · Hongkai Yu · Chen Lv · Jianru Xue · Tat-seng Chua,www.lotvsmmau.net,https://arxiv.org/abs/2403.00436,,2403.00436.pdf,Abductive Ego-View Accident Video Understanding for Safe Driving Perception,"We present MM-AU, a novel dataset for Multi-Modal Accident video +Understanding. MM-AU contains 11,727 in-the-wild ego-view accident videos, each +with temporally aligned text descriptions. We annotate over 2.23 million object +boxes and 58,650 pairs of video-based accident reasons, covering 58 accident +categories. MM-AU supports various accident understanding tasks, particularly +multimodal video diffusion to understand accident cause-effect chains for safe +driving. With MM-AU, we present an Abductive accident Video understanding +framework for Safe Driving perception (AdVersa-SD). AdVersa-SD performs video +diffusion via an Object-Centric Video Diffusion (OAVD) method which is driven +by an abductive CLIP model. This model involves a contrastive interaction loss +to learn the pair co-occurrence of normal, near-accident, accident frames with +the corresponding text descriptions, such as accident reasons, prevention +advice, and accident categories. OAVD enforces the causal region learning while +fixing the content of the original frame background in video generation, to +find the dominant cause-effect chain for certain accidents. Extensive +experiments verify the abductive ability of AdVersa-SD and the superiority of +OAVD against the state-of-the-art diffusion models. Additionally, we provide +careful benchmark evaluations for object detection and accident reason +answering since AdVersa-SD relies on precise object and accident reason +information.",cs.CV,"['cs.CV', 'cs.AI']" +Towards Language-Driven Video Inpainting via Multimodal Large Language Models,Jianzong Wu · Xiangtai Li · Chenyang Si · Shangchen Zhou · Jingkang Yang · Jiangning Zhang · Yining Li · Kai Chen · Yunhai Tong · Ziwei Liu · Chen Change Loy,https://jianzongwu.github.io/projects/rovi/,https://arxiv.org/abs/2401.10226,,2401.10226.pdf,Towards Language-Driven Video Inpainting via Multimodal Large Language Models,"We introduce a new task -- language-driven video inpainting, which uses +natural language instructions to guide the inpainting process. This approach +overcomes the limitations of traditional video inpainting methods that depend +on manually labeled binary masks, a process often tedious and labor-intensive. +We present the Remove Objects from Videos by Instructions (ROVI) dataset, +containing 5,650 videos and 9,091 inpainting results, to support training and +evaluation for this task. We also propose a novel diffusion-based +language-driven video inpainting framework, the first end-to-end baseline for +this task, integrating Multimodal Large Language Models to understand and +execute complex language-based inpainting requests effectively. Our +comprehensive results showcase the dataset's versatility and the model's +effectiveness in various language-instructed inpainting scenarios. 
We will make +datasets, code, and models publicly available.",cs.CV,['cs.CV'] +Self-Supervised Facial Representation Learning with Facial Region Awareness,Zheng Gao · Ioannis Patras, ,https://arxiv.org/abs/2403.02138,,2403.02138.pdf,Self-Supervised Facial Representation Learning with Facial Region Awareness,"Self-supervised pre-training has been proved to be effective in learning +transferable representations that benefit various visual tasks. This paper asks +this question: can self-supervised pre-training learn general facial +representations for various facial analysis tasks? Recent efforts toward this +goal are limited to treating each face image as a whole, i.e., learning +consistent facial representations at the image-level, which overlooks the +consistency of local facial representations (i.e., facial regions like eyes, +nose, etc). In this work, we make a first attempt to propose a novel +self-supervised facial representation learning framework to learn consistent +global and local facial representations, Facial Region Awareness (FRA). +Specifically, we explicitly enforce the consistency of facial regions by +matching the local facial representations across views, which are extracted +with learned heatmaps highlighting the facial regions. Inspired by the mask +prediction in supervised semantic segmentation, we obtain the heatmaps via +cosine similarity between the per-pixel projection of feature maps and facial +mask embeddings computed from learnable positional embeddings, which leverage +the attention mechanism to globally look up the facial image for facial +regions. To learn such heatmaps, we formulate the learning of facial mask +embeddings as a deep clustering problem by assigning the pixel features from +the feature maps to them. The transfer learning results on facial +classification and regression tasks show that our FRA outperforms previous +pre-trained models and more importantly, using ResNet as the unified backbone +for various tasks, our FRA achieves comparable or even better performance +compared with SOTA methods in facial analysis tasks.",cs.CV,['cs.CV'] +Visual Anagrams: Synthesizing Multi-View Optical Illusions with Diffusion Models,Daniel Geng · Inbum Park · Andrew Owens, ,https://arxiv.org/abs/2311.17919,,2311.17919.pdf,Visual Anagrams: Generating Multi-View Optical Illusions with Diffusion Models,"We address the problem of synthesizing multi-view optical illusions: images +that change appearance upon a transformation, such as a flip or rotation. We +propose a simple, zero-shot method for obtaining these illusions from +off-the-shelf text-to-image diffusion models. During the reverse diffusion +process, we estimate the noise from different views of a noisy image, and then +combine these noise estimates together and denoise the image. A theoretical +analysis suggests that this method works precisely for views that can be +written as orthogonal transformations, of which permutations are a subset. This +leads to the idea of a visual anagram--an image that changes appearance under +some rearrangement of pixels. This includes rotations and flips, but also more +exotic pixel permutations such as a jigsaw rearrangement. Our approach also +naturally extends to illusions with more than two views. We provide both +qualitative and quantitative results demonstrating the effectiveness and +flexibility of our method. 
Please see our project webpage for additional +visualizations and results: https://dangeng.github.io/visual_anagrams/",cs.CV,['cs.CV'] +AIDE: An Automatic Data Engine for Object Detection in Autonomous Driving,Mingfu Liang · Jong-Chyi Su · Samuel Schulter · Sparsh Garg · Shiyu Zhao · Ying Wu · Manmohan Chandraker, ,https://arxiv.org/abs/2403.17373,,2403.17373.pdf,AIDE: An Automatic Data Engine for Object Detection in Autonomous Driving,"Autonomous vehicle (AV) systems rely on robust perception models as a +cornerstone of safety assurance. However, objects encountered on the road +exhibit a long-tailed distribution, with rare or unseen categories posing +challenges to a deployed perception model. This necessitates an expensive +process of continuously curating and annotating data with significant human +effort. We propose to leverage recent advances in vision-language and large +language models to design an Automatic Data Engine (AIDE) that automatically +identifies issues, efficiently curates data, improves the model through +auto-labeling, and verifies the model through generation of diverse scenarios. +This process operates iteratively, allowing for continuous self-improvement of +the model. We further establish a benchmark for open-world detection on AV +datasets to comprehensively evaluate various learning paradigms, demonstrating +our method's superior performance at a reduced cost.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +Hierarchical Intra-modal Correlation Learning for Label-free 3D Semantic Segmentation,Xin Kang · Lei Chu · Jiahao Li · Xuejin Chen · Yan Lu, ,https://arxiv.org/abs/2309.10649,,2309.10649.pdf,Cross-modal and Cross-domain Knowledge Transfer for Label-free 3D Segmentation,"Current state-of-the-art point cloud-based perception methods usually rely on +large-scale labeled data, which requires expensive manual annotations. A +natural option is to explore the unsupervised methodology for 3D perception +tasks. However, such methods often face substantial performance-drop +difficulties. Fortunately, we found that there exist amounts of image-based +datasets and an alternative can be proposed, i.e., transferring the knowledge +in the 2D images to 3D point clouds. Specifically, we propose a novel approach +for the challenging cross-modal and cross-domain adaptation task by fully +exploring the relationship between images and point clouds and designing +effective feature alignment strategies. Without any 3D labels, our method +achieves state-of-the-art performance for 3D point cloud semantic segmentation +on SemanticKITTI by using the knowledge of KITTI360 and GTA5, compared to +existing unsupervised and weakly-supervised baselines.",cs.CV,['cs.CV'] +Rethinking Multi-domain Generalization with A General Learning Objective,Zhaorui Tan · Xi Yang · Kaizhu Huang, ,https://arxiv.org/abs/2402.18853,,2402.18853.pdf,Rethinking Multi-domain Generalization with A General Learning Objective,"Multi-domain generalization (mDG) is universally aimed to minimize the +discrepancy between training and testing distributions to enhance +marginal-to-label distribution mapping. However, existing mDG literature lacks +a general learning objective paradigm and often imposes constraints on static +target marginal distributions. In this paper, we propose to leverage a +$Y$-mapping to relax the constraint. We rethink the learning objective for mDG +and design a new \textbf{general learning objective} to interpret and analyze +most existing mDG wisdom. 
This general objective is bifurcated into two +synergistic aims: learning domain-independent conditional features and +maximizing a posterior. Explorations also extend to two effective +regularization terms that incorporate prior information and suppress invalid +causality, alleviating the issues that come with relaxed constraints. We +theoretically contribute an upper bound for the domain alignment of +domain-independent conditional features, disclosing that many previous mDG +endeavors actually \textbf{optimize partially the objective} and thus lead to +limited performance. As such, our study distills a general learning objective +into four practical components, providing a general, robust, and flexible +mechanism to handle complex domain shifts. Extensive empirical results indicate +that the proposed objective with $Y$-mapping leads to substantially better mDG +performance in various downstream tasks, including regression, segmentation, +and classification.",cs.LG,"['cs.LG', 'cs.AI', 'cs.CV']" +VideoMAC: Video Masked Autoencoders Meet ConvNets,Gensheng Pei · Tao Chen · Xiruo Jiang · 刘华峰 Liu · Zeren Sun · Yazhou Yao, ,https://arxiv.org/abs/2402.19082,,2402.19082.pdf,VideoMAC: Video Masked Autoencoders Meet ConvNets,"Recently, the advancement of self-supervised learning techniques, like masked +autoencoders (MAE), has greatly influenced visual representation learning for +images and videos. Nevertheless, it is worth noting that the predominant +approaches in existing masked image / video modeling rely excessively on +resource-intensive vision transformers (ViTs) as the feature encoder. In this +paper, we propose a new approach termed as \textbf{VideoMAC}, which combines +video masked autoencoders with resource-friendly ConvNets. Specifically, +VideoMAC employs symmetric masking on randomly sampled pairs of video frames. +To prevent the issue of mask pattern dissipation, we utilize ConvNets which are +implemented with sparse convolutional operators as encoders. Simultaneously, we +present a simple yet effective masked video modeling (MVM) approach, a dual +encoder architecture comprising an online encoder and an exponential moving +average target encoder, aimed to facilitate inter-frame reconstruction +consistency in videos. Additionally, we demonstrate that VideoMAC, empowering +classical (ResNet) / modern (ConvNeXt) convolutional encoders to harness the +benefits of MVM, outperforms ViT-based approaches on downstream tasks, +including video object segmentation (+\textbf{5.2\%} / \textbf{6.4\%} +$\mathcal{J}\&\mathcal{F}$), body part propagation (+\textbf{6.3\%} / +\textbf{3.1\%} mIoU), and human pose tracking (+\textbf{10.2\%} / +\textbf{11.1\%} PCK@0.1).",cs.CV,['cs.CV'] +EditGuard: Versatile Image Watermarking for Tamper Localization and Copyright Protection,Xuanyu Zhang · Runyi Li · Jiwen Yu · Youmin Xu · Weiqi Li · Jian Zhang,https://xuanyuzhang21.github.io/project/editguard/,https://arxiv.org/abs/2312.08883,,2312.08883.pdf,EditGuard: Versatile Image Watermarking for Tamper Localization and Copyright Protection,"In the era where AI-generated content (AIGC) models can produce stunning and +lifelike images, the lingering shadow of unauthorized reproductions and +malicious tampering poses imminent threats to copyright integrity and +information security. Current image watermarking methods, while widely accepted +for safeguarding visual content, can only protect copyright and ensure +traceability. 
They fall short in localizing increasingly realistic image +tampering, potentially leading to trust crises, privacy violations, and legal +disputes. To solve this challenge, we propose an innovative proactive forensics +framework EditGuard, to unify copyright protection and tamper-agnostic +localization, especially for AIGC-based editing methods. It can offer a +meticulous embedding of imperceptible watermarks and precise decoding of +tampered areas and copyright information. Leveraging our observed fragility and +locality of image-into-image steganography, the realization of EditGuard can be +converted into a united image-bit steganography issue, thus completely +decoupling the training process from the tampering types. Extensive experiments +demonstrate that our EditGuard balances the tamper localization accuracy, +copyright recovery precision, and generalizability to various AIGC-based +tampering methods, especially for image forgery that is difficult for the naked +eye to detect. The project page is available at +https://xuanyuzhang21.github.io/project/editguard/.",cs.CV,['cs.CV'] +Re-thinking Data Availability Attacks Against Deep Neural Networks,Bin Fang · Bo Li · Shuang Wu · Shouhong Ding · Ran Yi · Lizhuang Ma, ,https://arxiv.org/abs/2401.09740,,2401.09740.pdf,Hijacking Attacks against Neural Networks by Analyzing Training Data,"Backdoors and adversarial examples are the two primary threats currently +faced by deep neural networks (DNNs). Both attacks attempt to hijack the model +behaviors with unintended outputs by introducing (small) perturbations to the +inputs. Backdoor attacks, despite the high success rates, often require a +strong assumption, which is not always easy to achieve in reality. Adversarial +example attacks, which put relatively weaker assumptions on attackers, often +demand high computational resources, yet do not always yield satisfactory +success rates when attacking mainstream black-box models in the real world. +These limitations motivate the following research question: can model hijacking +be achieved more simply, with a higher attack success rate and more reasonable +assumptions? In this paper, we propose CleanSheet, a new model hijacking attack +that obtains the high performance of backdoor attacks without requiring the +adversary to tamper with the model training process. CleanSheet exploits +vulnerabilities in DNNs stemming from the training data. Specifically, our key +idea is to treat part of the clean training data of the target model as +""poisoned data,"" and capture the characteristics of these data that are more +sensitive to the model (typically called robust features) to construct +""triggers."" These triggers can be added to any input example to mislead the +target model, similar to backdoor attacks. We validate the effectiveness of +CleanSheet through extensive experiments on 5 datasets, 79 normally trained +models, 68 pruned models, and 39 defensive models. Results show that CleanSheet +exhibits performance comparable to state-of-the-art backdoor attacks, achieving +an average attack success rate (ASR) of 97.5% on CIFAR-100 and 92.4% on GTSRB, +respectively. 
Furthermore, CleanSheet consistently maintains a high ASR, when +confronted with various mainstream backdoor defenses.",cs.CR,['cs.CR'] +Multi-View Attentive Contextualization for Multi-View 3D Object Detection,Xianpeng Liu · Ce Zheng · Ming Qian · Nan Xue · Chen Chen · Zhebin Zhang · Chen Li · Tianfu Wu,https://xianpeng919.github.io/mvacon/,https://arxiv.org/abs/2405.12200,,2405.12200.pdf,Multi-View Attentive Contextualization for Multi-View 3D Object Detection,"We present Multi-View Attentive Contextualization (MvACon), a simple yet +effective method for improving 2D-to-3D feature lifting in query-based +multi-view 3D (MV3D) object detection. Despite remarkable progress witnessed in +the field of query-based MV3D object detection, prior art often suffers from +either the lack of exploiting high-resolution 2D features in dense +attention-based lifting, due to high computational costs, or from +insufficiently dense grounding of 3D queries to multi-scale 2D features in +sparse attention-based lifting. Our proposed MvACon hits the two birds with one +stone using a representationally dense yet computationally sparse attentive +feature contextualization scheme that is agnostic to specific 2D-to-3D feature +lifting approaches. In experiments, the proposed MvACon is thoroughly tested on +the nuScenes benchmark, using both the BEVFormer and its recent 3D deformable +attention (DFA3D) variant, as well as the PETR, showing consistent detection +performance improvement, especially in enhancing performance in location, +orientation, and velocity prediction. It is also tested on the Waymo-mini +benchmark using BEVFormer with similar improvement. We qualitatively and +quantitatively show that global cluster-based contexts effectively encode dense +scene-level contexts for MV3D object detection. The promising results of our +proposed MvACon reinforces the adage in computer vision -- ``(contextualized) +feature matters"".",cs.CV,['cs.CV'] +RealNet: A Feature Selection Network with Realistic Synthetic Anomaly for Anomaly Detection,Ximiao Zhang · Min Xu · Xiuzhuang Zhou,https://github.com/cnulab/RealNet,https://arxiv.org/abs/2403.05897,,2403.05897.pdf,RealNet: A Feature Selection Network with Realistic Synthetic Anomaly for Anomaly Detection,"Self-supervised feature reconstruction methods have shown promising advances +in industrial image anomaly detection and localization. Despite this progress, +these methods still face challenges in synthesizing realistic and diverse +anomaly samples, as well as addressing the feature redundancy and pre-training +bias of pre-trained feature. In this work, we introduce RealNet, a feature +reconstruction network with realistic synthetic anomaly and adaptive feature +selection. It is incorporated with three key innovations: First, we propose +Strength-controllable Diffusion Anomaly Synthesis (SDAS), a diffusion +process-based synthesis strategy capable of generating samples with varying +anomaly strengths that mimic the distribution of real anomalous samples. +Second, we develop Anomaly-aware Features Selection (AFS), a method for +selecting representative and discriminative pre-trained feature subsets to +improve anomaly detection performance while controlling computational costs. +Third, we introduce Reconstruction Residuals Selection (RRS), a strategy that +adaptively selects discriminative residuals for comprehensive identification of +anomalous regions across multiple levels of granularity. 
We assess RealNet on +four benchmark datasets, and our results demonstrate significant improvements +in both Image AUROC and Pixel AUROC compared to the current state-of-the-art +methods. The code, data, and models are available at +https://github.com/cnulab/RealNet.",cs.CV,['cs.CV'] +CyberDemo: Augmenting Simulated Human Demonstration for Real-World Dexterous Manipulation,Jun Wang · Yuzhe Qin · Kaiming Kuang · Yigit Korkmaz · Akhilan Gurumoorthy · Hao Su · Xiaolong Wang, ,https://arxiv.org/abs/2402.14795,,2402.14795.pdf,CyberDemo: Augmenting Simulated Human Demonstration for Real-World Dexterous Manipulation,"We introduce CyberDemo, a novel approach to robotic imitation learning that +leverages simulated human demonstrations for real-world tasks. By incorporating +extensive data augmentation in a simulated environment, CyberDemo outperforms +traditional in-domain real-world demonstrations when transferred to the real +world, handling diverse physical and visual conditions. Regardless of its +affordability and convenience in data collection, CyberDemo outperforms +baseline methods in terms of success rates across various tasks and exhibits +generalizability with previously unseen objects. For example, it can rotate +novel tetra-valve and penta-valve, despite human demonstrations only involving +tri-valves. Our research demonstrates the significant potential of simulated +human demonstrations for real-world dexterous manipulation tasks. More details +can be found at https://cyber-demo.github.io",cs.RO,"['cs.RO', 'cs.CV']" +InterHandGen: Two-Hand Interaction Generation via Cascaded Reverse Diffusion,Jihyun Lee · Shunsuke Saito · Giljoo Nam · Minhyuk Sung · Tae-Kyun Kim,https://jyunlee.github.io/projects/interhandgen/,https://arxiv.org/abs/2403.17422,,2403.17422.pdf,InterHandGen: Two-Hand Interaction Generation via Cascaded Reverse Diffusion,"We present InterHandGen, a novel framework that learns the generative prior +of two-hand interaction. Sampling from our model yields plausible and diverse +two-hand shapes in close interaction with or without an object. Our prior can +be incorporated into any optimization or learning methods to reduce ambiguity +in an ill-posed setup. Our key observation is that directly modeling the joint +distribution of multiple instances imposes high learning complexity due to its +combinatorial nature. Thus, we propose to decompose the modeling of joint +distribution into the modeling of factored unconditional and conditional single +instance distribution. In particular, we introduce a diffusion model that +learns the single-hand distribution unconditional and conditional to another +hand via conditioning dropout. For sampling, we combine anti-penetration and +classifier-free guidance to enable plausible generation. Furthermore, we +establish the rigorous evaluation protocol of two-hand synthesis, where our +method significantly outperforms baseline generative models in terms of +plausibility and diversity. 
We also demonstrate that our diffusion prior can +boost the performance of two-hand reconstruction from monocular in-the-wild +images, achieving new state-of-the-art accuracy.",cs.CV,['cs.CV'] +DiffAgent: Fast and Accurate Text-to-Image API Selection with Large Language Model,Lirui Zhao · Yue Yang · Kaipeng Zhang · Wenqi Shao · Yuxin Zhang · Yu Qiao · Ping Luo · Rongrong Ji, ,https://arxiv.org/abs/2404.01342,,2404.01342.pdf,DiffAgent: Fast and Accurate Text-to-Image API Selection with Large Language Model,"Text-to-image (T2I) generative models have attracted significant attention +and found extensive applications within and beyond academic research. For +example, the Civitai community, a platform for T2I innovation, currently hosts +an impressive array of 74,492 distinct models. However, this diversity presents +a formidable challenge in selecting the most appropriate model and parameters, +a process that typically requires numerous trials. Drawing inspiration from the +tool usage research of large language models (LLMs), we introduce DiffAgent, an +LLM agent designed to screen the accurate selection in seconds via API calls. +DiffAgent leverages a novel two-stage training framework, SFTA, enabling it to +accurately align T2I API responses with user input in accordance with human +preferences. To train and evaluate DiffAgent's capabilities, we present +DABench, a comprehensive dataset encompassing an extensive range of T2I APIs +from the community. Our evaluations reveal that DiffAgent not only excels in +identifying the appropriate T2I API but also underscores the effectiveness of +the SFTA training framework. Codes are available at +https://github.com/OpenGVLab/DiffAgent.",cs.CL,"['cs.CL', 'cs.AI']" +Habitat Synthetic Scenes Dataset (HSSD-200): An Analysis of 3D Scene Scale and Realism Tradeoffs for ObjectGoal Navigation,Mukul Khanna · Yongsen Mao · Hanxiao Jiang · Sanjay Haresh · Brennan Shacklett · Dhruv Batra · Alexander William Clegg · Eric Undersander · Angel Xuan Chang · Manolis Savva, ,https://arxiv.org/abs/2306.11290,,2306.11290.pdf,Habitat Synthetic Scenes Dataset (HSSD-200): An Analysis of 3D Scene Scale and Realism Tradeoffs for ObjectGoal Navigation,"We contribute the Habitat Synthetic Scene Dataset, a dataset of 211 +high-quality 3D scenes, and use it to test navigation agent generalization to +realistic 3D environments. Our dataset represents real interiors and contains a +diverse set of 18,656 models of real-world objects. We investigate the impact +of synthetic 3D scene dataset scale and realism on the task of training +embodied agents to find and navigate to objects (ObjectGoal navigation). By +comparing to synthetic 3D scene datasets from prior work, we find that scale +helps in generalization, but the benefits quickly saturate, making visual +fidelity and correlation to real-world scenes more important. Our experiments +show that agents trained on our smaller-scale dataset can match or outperform +agents trained on much larger datasets. 
Surprisingly, we observe that agents +trained on just 122 scenes from our dataset outperform agents trained on 10,000 +scenes from the ProcTHOR-10K dataset in terms of zero-shot generalization in +real-world scanned environments.",cs.CV,['cs.CV'] +Objects as volumes: A stochastic geometry view of opaque solids,Bailey Miller · Hanyu Chen · Alice Lai · Ioannis Gkioulekas,https://imaging.cs.cmu.edu/volumetric_opaque_solids/,https://arxiv.org/abs/2312.15406,,2312.15406.pdf,Objects as volumes: A stochastic geometry view of opaque solids,"We develop a theory for the representation of opaque solids as volumes. +Starting from a stochastic representation of opaque solids as random indicator +functions, we prove the conditions under which such solids can be modeled using +exponential volumetric transport. We also derive expressions for the volumetric +attenuation coefficient as a functional of the probability distributions of the +underlying indicator functions. We generalize our theory to account for +isotropic and anisotropic scattering at different parts of the solid, and for +representations of opaque solids as stochastic implicit surfaces. We derive our +volumetric representation from first principles, which ensures that it +satisfies physical constraints such as reciprocity and reversibility. We use +our theory to explain, compare, and correct previous volumetric +representations, as well as propose meaningful extensions that lead to improved +performance in 3D reconstruction tasks.",cs.CV,"['cs.CV', 'cs.GR']" +MeaCap: Memory-Augmented Zero-shot Image Captioning,Zequn Zeng · Yan Xie · Hao Zhang · Chiyu Chen · Zhengjue Wang · Bo Chen,https://github.com/joeyz0z/MeaCap,https://arxiv.org/abs/2403.03715,,2403.03715.pdf,MeaCap: Memory-Augmented Zero-shot Image Captioning,"Zero-shot image captioning (IC) without well-paired image-text data can be +divided into two categories, training-free and text-only-training. Generally, +these two types of methods realize zero-shot IC by integrating pretrained +vision-language models like CLIP for image-text similarity evaluation and a +pre-trained language model (LM) for caption generation. The main difference +between them is whether using a textual corpus to train the LM. Though +achieving attractive performance w.r.t. some metrics, existing methods often +exhibit some common drawbacks. Training-free methods tend to produce +hallucinations, while text-only-training often lose generalization capability. +To move forward, in this paper, we propose a novel Memory-Augmented zero-shot +image Captioning framework (MeaCap). Specifically, equipped with a textual +memory, we introduce a retrieve-then-filter module to get key concepts that are +highly related to the image. By deploying our proposed memory-augmented +visual-related fusion score in a keywords-to-sentence LM, MeaCap can generate +concept-centered captions that keep high consistency with the image with fewer +hallucinations and more world-knowledge. The framework of MeaCap achieves the +state-of-the-art performance on a series of zero-shot IC settings. 
Our code is +available at https://github.com/joeyz0z/MeaCap.",cs.CV,['cs.CV'] +Weakly Supervised Monocular 3D Detection with a Single-View Image,Xueying Jiang · Sheng Jin · Lewei Lu · Xiaoqin Zhang · Shijian Lu, ,https://arxiv.org/abs/2402.19144,,2402.19144.pdf,Weakly Supervised Monocular 3D Detection with a Single-View Image,"Monocular 3D detection (M3D) aims for precise 3D object localization from a +single-view image which usually involves labor-intensive annotation of 3D +detection boxes. Weakly supervised M3D has recently been studied to obviate the +3D annotation process by leveraging many existing 2D annotations, but it often +requires extra training data such as LiDAR point clouds or multi-view images +which greatly degrades its applicability and usability in various applications. +We propose SKD-WM3D, a weakly supervised monocular 3D detection framework that +exploits depth information to achieve M3D with a single-view image exclusively +without any 3D annotations or other training data. One key design in SKD-WM3D +is a self-knowledge distillation framework, which transforms image features +into 3D-like representations by fusing depth information and effectively +mitigates the inherent depth ambiguity in monocular scenarios with little +computational overhead in inference. In addition, we design an +uncertainty-aware distillation loss and a gradient-targeted transfer modulation +strategy which facilitate knowledge acquisition and knowledge transfer, +respectively. Extensive experiments show that SKD-WM3D surpasses the +state-of-the-art clearly and is even on par with many fully supervised methods.",cs.CV,['cs.CV'] +SemCity: Semantic Scene Generation with Triplane Diffusion,Jumin Lee · Sebin Lee · Changho Jo · Woobin Im · Ju-hyeong Seon · Sung-Eui Yoon, ,https://arxiv.org/abs/2403.07773,,2403.07773.pdf,SemCity: Semantic Scene Generation with Triplane Diffusion,"We present ""SemCity,"" a 3D diffusion model for semantic scene generation in +real-world outdoor environments. Most 3D diffusion models focus on generating a +single object, synthetic indoor scenes, or synthetic outdoor scenes, while the +generation of real-world outdoor scenes is rarely addressed. In this paper, we +concentrate on generating a real-outdoor scene through learning a diffusion +model on a real-world outdoor dataset. In contrast to synthetic data, +real-outdoor datasets often contain more empty spaces due to sensor +limitations, causing challenges in learning real-outdoor distributions. To +address this issue, we exploit a triplane representation as a proxy form of +scene distributions to be learned by our diffusion model. Furthermore, we +propose a triplane manipulation that integrates seamlessly with our triplane +diffusion model. The manipulation improves our diffusion model's applicability +in a variety of downstream tasks related to outdoor scene generation such as +scene inpainting, scene outpainting, and semantic scene completion refinements. +In experimental results, we demonstrate that our triplane diffusion model shows +meaningful generation results compared with existing work in a real-outdoor +dataset, SemanticKITTI. We also show our triplane manipulation facilitates +seamlessly adding, removing, or modifying objects within a scene. Further, it +also enables the expansion of scenes toward a city-level scale. Finally, we +evaluate our method on semantic scene completion refinements where our +diffusion model enhances predictions of semantic scene completion networks by +learning scene distribution. 
Our code is available at +https://github.com/zoomin-lee/SemCity.",cs.CV,['cs.CV'] +SD2Event: Self-supervised Learning of Dynamic Detectors and Contextual Descriptors for Event Cameras,Yuan Gao · Yuqing Zhu · Xinjun Li · Yimin Du · Tianzhu Zhang, ,https://arxiv.org/abs/2401.01042,,2401.01042.pdf,Relating Events and Frames Based on Self-Supervised Learning and Uncorrelated Conditioning for Unsupervised Domain Adaptation,"Event-based cameras provide accurate and high temporal resolution +measurements for performing computer vision tasks in challenging scenarios, +such as high-dynamic range environments and fast-motion maneuvers. Despite +their advantages, utilizing deep learning for event-based vision encounters a +significant obstacle due to the scarcity of annotated data caused by the +relatively recent emergence of event-based cameras. To overcome this +limitation, leveraging the knowledge available from annotated data obtained +with conventional frame-based cameras presents an effective solution based on +unsupervised domain adaptation. We propose a new algorithm tailored for +adapting a deep neural network trained on annotated frame-based data to +generalize well on event-based unannotated data. Our approach incorporates +uncorrelated conditioning and self-supervised learning in an adversarial +learning scheme to close the gap between the two source and target domains. By +applying self-supervised learning, the algorithm learns to align the +representations of event-based data with those from frame-based camera data, +thereby facilitating knowledge transfer.Furthermore, the inclusion of +uncorrelated conditioning ensures that the adapted model effectively +distinguishes between event-based and conventional data, enhancing its ability +to classify event-based images accurately.Through empirical experimentation and +evaluation, we demonstrate that our algorithm surpasses existing approaches +designed for the same purpose using two benchmarks. The superior performance of +our solution is attributed to its ability to effectively utilize annotated data +from frame-based cameras and transfer the acquired knowledge to the event-based +vision domain.",cs.CV,['cs.CV'] +Temporally Consistent Unbalanced Optimal Transport for Unsupervised Action Segmentation,Ming Xu · Stephen Gould, ,https://arxiv.org/abs/2404.01518,,2404.01518.pdf,Temporally Consistent Unbalanced Optimal Transport for Unsupervised Action Segmentation,"We propose a novel approach to the action segmentation task for long, +untrimmed videos, based on solving an optimal transport problem. By encoding a +temporal consistency prior into a Gromov-Wasserstein problem, we are able to +decode a temporally consistent segmentation from a noisy affinity/matching cost +matrix between video frames and action classes. Unlike previous approaches, our +method does not require knowing the action order for a video to attain temporal +consistency. Furthermore, our resulting (fused) Gromov-Wasserstein problem can +be efficiently solved on GPUs using a few iterations of projected mirror +descent. We demonstrate the effectiveness of our method in an unsupervised +learning setting, where our method is used to generate pseudo-labels for +self-training. 
We evaluate our segmentation approach and unsupervised learning +pipeline on the Breakfast, 50-Salads, YouTube Instructions and Desktop Assembly +datasets, yielding state-of-the-art results for the unsupervised video action +segmentation task.",cs.CV,"['cs.CV', 'cs.LG', 'eess.IV']" +Unsigned Orthogonal Distance Fields: An Accurate Neural Implicit Representation for Diverse 3D Shapes,YuJie Lu · Long Wan · Nayu Ding · Yulong Wang · Shuhan Shen · Shen Cai · Lin Gao,http://www.cscvlab.com/research/UODFs/index.html,https://arxiv.org/abs/2403.01414,,2403.01414.pdf,Unsigned Orthogonal Distance Fields: An Accurate Neural Implicit Representation for Diverse 3D Shapes,"Neural implicit representation of geometric shapes has witnessed considerable +advancements in recent years. However, common distance field based implicit +representations, specifically signed distance field (SDF) for watertight shapes +or unsigned distance field (UDF) for arbitrary shapes, routinely suffer from +degradation of reconstruction accuracy when converting to explicit surface +points and meshes. In this paper, we introduce a novel neural implicit +representation based on unsigned orthogonal distance fields (UODFs). In UODFs, +the minimal unsigned distance from any spatial point to the shape surface is +defined solely in one orthogonal direction, contrasting with the +multi-directional determination made by SDF and UDF. Consequently, every point +in the 3D UODFs can directly access its closest surface points along three +orthogonal directions. This distinctive feature leverages the accurate +reconstruction of surface points without interpolation errors. We verify the +effectiveness of UODFs through a range of reconstruction examples, extending +from simple watertight or non-watertight shapes to complex shapes that include +hollows, internal or assembling structures.",cs.CV,['cs.CV'] +Not All Prompts Are Secure: A Switchable Backdoor Attack Against Pre-trained Vision Transfomers,Sheng Yang · Jiawang Bai · Kuofeng Gao · Yong Yang · Yiming Li · Shu-Tao Xia,https://github.com/20000yshust/SWARM,https://arxiv.org/abs/2405.10612,,2405.10612.pdf,Not All Prompts Are Secure: A Switchable Backdoor Attack Against Pre-trained Vision Transformers,"Given the power of vision transformers, a new learning paradigm, pre-training +and then prompting, makes it more efficient and effective to address downstream +visual recognition tasks. In this paper, we identify a novel security threat +towards such a paradigm from the perspective of backdoor attacks. Specifically, +an extra prompt token, called the switch token in this work, can turn the +backdoor mode on, i.e., converting a benign model into a backdoored one. Once +under the backdoor mode, a specific trigger can force the model to predict a +target class. It poses a severe risk to the users of cloud API, since the +malicious behavior can not be activated and detected under the benign mode, +thus making the attack very stealthy. To attack a pre-trained model, our +proposed attack, named SWARM, learns a trigger and prompt tokens including a +switch token. They are optimized with the clean loss which encourages the model +always behaves normally even the trigger presents, and the backdoor loss that +ensures the backdoor can be activated by the trigger when the switch is on. +Besides, we utilize the cross-mode feature distillation to reduce the effect of +the switch token on clean samples. 
The experiments on diverse visual +recognition tasks confirm the success of our switchable backdoor attack, i.e., +achieving 95%+ attack success rate, and also being hard to be detected and +removed. Our code is available at https://github.com/20000yshust/SWARM.",cs.CV,"['cs.CV', 'cs.CR', 'cs.LG']" +Unmixing before Fusion: A Generalized Paradigm for Multi-Source-based Hyperspectral Image Synthesis,Yang Yu · Erting Pan · Xinya Wang · Yuheng Wu · Xiaoguang Mei · Jiayi Ma,https://hsi-synthesis.github.io/,,https://ieeexplore.ieee.org/document/10414148,,,,,nan +Neural Directional Encoding for Efficient and Accurate View-Dependent Appearance Modeling,Liwen Wu · Sai Bi · Zexiang Xu · Fujun Luan · Kai Zhang · Iliyan Georgiev · Kalyan Sunkavalli · Ravi Ramamoorthi,https://lwwu2.github.io/nde/,https://arxiv.org/abs/2405.14847,,2405.14847.pdf,Neural Directional Encoding for Efficient and Accurate View-Dependent Appearance Modeling,"Novel-view synthesis of specular objects like shiny metals or glossy paints +remains a significant challenge. Not only the glossy appearance but also global +illumination effects, including reflections of other objects in the +environment, are critical components to faithfully reproduce a scene. In this +paper, we present Neural Directional Encoding (NDE), a view-dependent +appearance encoding of neural radiance fields (NeRF) for rendering specular +objects. NDE transfers the concept of feature-grid-based spatial encoding to +the angular domain, significantly improving the ability to model high-frequency +angular signals. In contrast to previous methods that use encoding functions +with only angular input, we additionally cone-trace spatial features to obtain +a spatially varying directional encoding, which addresses the challenging +interreflection effects. Extensive experiments on both synthetic and real +datasets show that a NeRF model with NDE (1) outperforms the state of the art +on view synthesis of specular objects, and (2) works with small networks to +allow fast (real-time) inference. The project webpage and source code are +available at: \url{https://lwwu2.github.io/nde/}.",cs.CV,['cs.CV'] +ViVid-1-to-3: Novel View Synthesis with Video Diffusion Models,Jeong-gi Kwak · Erqun Dong · Yuhe Jin · Hanseok Ko · Shweta Mahajan · Kwang Moo Yi,https://ubc-vision.github.io/vivid123/,https://arxiv.org/abs/2312.01305,,2312.01305.pdf,ViVid-1-to-3: Novel View Synthesis with Video Diffusion Models,"Generating novel views of an object from a single image is a challenging +task. It requires an understanding of the underlying 3D structure of the object +from an image and rendering high-quality, spatially consistent new views. While +recent methods for view synthesis based on diffusion have shown great progress, +achieving consistency among various view estimates and at the same time abiding +by the desired camera pose remains a critical problem yet to be solved. In this +work, we demonstrate a strikingly simple method, where we utilize a pre-trained +video diffusion model to solve this problem. Our key idea is that synthesizing +a novel view could be reformulated as synthesizing a video of a camera going +around the object of interest -- a scanning video -- which then allows us to +leverage the powerful priors that a video diffusion model would have learned. +Thus, to perform novel-view synthesis, we create a smooth camera trajectory to +the target view that we wish to render, and denoise using both a +view-conditioned diffusion model and a video diffusion model. 
By doing so, we +obtain a highly consistent novel view synthesis, outperforming the state of the +art.",cs.CV,"['cs.CV', 'cs.AI', 'cs.GR']" +Self-Adaptive Reality-Guided Diffusion for Artifact-Free Super-Resolution,Qingping Zheng · Ling Zheng · Yuanfan Guo · Ying Li · Songcen Xu · Jiankang Deng · Hang Xu, ,https://arxiv.org/abs/2403.16643v1,,2403.16643v1.pdf,Self-Adaptive Reality-Guided Diffusion for Artifact-Free Super-Resolution,"Artifact-free super-resolution (SR) aims to translate low-resolution images +into their high-resolution counterparts with a strict integrity of the original +content, eliminating any distortions or synthetic details. While traditional +diffusion-based SR techniques have demonstrated remarkable abilities to enhance +image detail, they are prone to artifact introduction during iterative +procedures. Such artifacts, ranging from trivial noise to unauthentic textures, +deviate from the true structure of the source image, thus challenging the +integrity of the super-resolution process. In this work, we propose +Self-Adaptive Reality-Guided Diffusion (SARGD), a training-free method that +delves into the latent space to effectively identify and mitigate the +propagation of artifacts. Our SARGD begins by using an artifact detector to +identify implausible pixels, creating a binary mask that highlights artifacts. +Following this, the Reality Guidance Refinement (RGR) process refines artifacts +by integrating this mask with realistic latent representations, improving +alignment with the original image. Nonetheless, initial realistic-latent +representations from lower-quality images result in over-smoothing in the final +output. To address this, we introduce a Self-Adaptive Guidance (SAG) mechanism. +It dynamically computes a reality score, enhancing the sharpness of the +realistic latent. These alternating mechanisms collectively achieve +artifact-free super-resolution. Extensive experiments demonstrate the +superiority of our method, delivering detailed artifact-free high-resolution +images while reducing sampling steps by 2X. We release our code at +https://github.com/ProAirVerse/Self-Adaptive-Guidance-Diffusion.git.",eess.IV,"['eess.IV', 'cs.CV']" +SC-GS: Sparse-Controlled Gaussian Splatting for Editable Dynamic Scenes,Yihua Huang · Yangtian Sun · Ziyi Yang · Xiaoyang Lyu · Yan-Pei Cao · Xiaojuan Qi,https://yihua7.github.io/SC-GS-web/,https://arxiv.org/abs/2312.14937,,2312.14937.pdf,SC-GS: Sparse-Controlled Gaussian Splatting for Editable Dynamic Scenes,"Novel view synthesis for dynamic scenes is still a challenging problem in +computer vision and graphics. Recently, Gaussian splatting has emerged as a +robust technique to represent static scenes and enable high-quality and +real-time novel view synthesis. Building upon this technique, we propose a new +representation that explicitly decomposes the motion and appearance of dynamic +scenes into sparse control points and dense Gaussians, respectively. Our key +idea is to use sparse control points, significantly fewer in number than the +Gaussians, to learn compact 6 DoF transformation bases, which can be locally +interpolated through learned interpolation weights to yield the motion field of +3D Gaussians. We employ a deformation MLP to predict time-varying 6 DoF +transformations for each control point, which reduces learning complexities, +enhances learning abilities, and facilitates obtaining temporal and spatial +coherent motion patterns. 
Then, we jointly learn the 3D Gaussians, the +canonical space locations of control points, and the deformation MLP to +reconstruct the appearance, geometry, and dynamics of 3D scenes. During +learning, the location and number of control points are adaptively adjusted to +accommodate varying motion complexities in different regions, and an ARAP loss +following the principle of as rigid as possible is developed to enforce spatial +continuity and local rigidity of learned motions. Finally, thanks to the +explicit sparse motion representation and its decomposition from appearance, +our method can enable user-controlled motion editing while retaining +high-fidelity appearances. Extensive experiments demonstrate that our approach +outperforms existing approaches on novel view synthesis with a high rendering +speed and enables novel appearance-preserved motion editing applications. +Project page: https://yihua7.github.io/SC-GS-web/",cs.CV,"['cs.CV', 'cs.GR']" +PHYSCENE: Physically Interactable 3D Scene Synthesis for Embodied AI,Yandan Yang · Baoxiong Jia · Peiyuan Zhi · Siyuan Huang, ,https://arxiv.org/abs/2404.09465,,2404.09465.pdf,PhyScene: Physically Interactable 3D Scene Synthesis for Embodied AI,"With recent developments in Embodied Artificial Intelligence (EAI) research, +there has been a growing demand for high-quality, large-scale interactive scene +generation. While prior methods in scene synthesis have prioritized the +naturalness and realism of the generated scenes, the physical plausibility and +interactivity of scenes have been largely left unexplored. To address this +disparity, we introduce PhyScene, a novel method dedicated to generating +interactive 3D scenes characterized by realistic layouts, articulated objects, +and rich physical interactivity tailored for embodied agents. Based on a +conditional diffusion model for capturing scene layouts, we devise novel +physics- and interactivity-based guidance mechanisms that integrate constraints +from object collision, room layout, and object reachability. Through extensive +experiments, we demonstrate that PhyScene effectively leverages these guidance +functions for physically interactable scene synthesis, outperforming existing +state-of-the-art scene synthesis methods by a large margin. Our findings +suggest that the scenes generated by PhyScene hold considerable potential for +facilitating diverse skill acquisition among agents within interactive +environments, thereby catalyzing further advancements in embodied AI research. +Project website: http://physcene.github.io.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG', 'cs.RO']" +PlatoNeRF: 3D Reconstruction in Plato’s Cave via Single-View Two-Bounce Lidar,Tzofi Klinghoffer · Xiaoyu Xiang · Siddharth Somasundaram · Yuchen Fan · Christian Richardt · Ramesh Raskar · Rakesh Ranjan,https://platonerf.github.io/,https://arxiv.org/abs/2312.14239,,2312.14239.pdf,PlatoNeRF: 3D Reconstruction in Plato's Cave via Single-View Two-Bounce Lidar,"3D reconstruction from a single-view is challenging because of the ambiguity +from monocular cues and lack of information about occluded regions. Neural +radiance fields (NeRF), while popular for view synthesis and 3D reconstruction, +are typically reliant on multi-view images. Existing methods for single-view 3D +reconstruction with NeRF rely on either data priors to hallucinate views of +occluded regions, which may not be physically accurate, or shadows observed by +RGB cameras, which are difficult to detect in ambient light and low albedo +backgrounds. 
We propose using time-of-flight data captured by a single-photon +avalanche diode to overcome these limitations. Our method models two-bounce +optical paths with NeRF, using lidar transient data for supervision. By +leveraging the advantages of both NeRF and two-bounce light measured by lidar, +we demonstrate that we can reconstruct visible and occluded geometry without +data priors or reliance on controlled ambient lighting or scene albedo. In +addition, we demonstrate improved generalization under practical constraints on +sensor spatial- and temporal-resolution. We believe our method is a promising +direction as single-photon lidars become ubiquitous on consumer devices, such +as phones, tablets, and headsets.",cs.CV,"['cs.CV', 'eess.IV']" +Learning to Rank Patches for Unbiased Image Redundancy Reduction,Yang Luo · Zhineng Chen · Peng Zhou · Zuxuan Wu · Xieping Gao · Yu-Gang Jiang, ,https://arxiv.org/abs/2404.00680,,2404.00680.pdf,Learning to Rank Patches for Unbiased Image Redundancy Reduction,"Images suffer from heavy spatial redundancy because pixels in neighboring +regions are spatially correlated. Existing approaches strive to overcome this +limitation by reducing less meaningful image regions. However, current leading +methods rely on supervisory signals. They may compel models to preserve content +that aligns with labeled categories and discard content belonging to unlabeled +categories. This categorical inductive bias makes these methods less effective +in real-world scenarios. To address this issue, we propose a self-supervised +framework for image redundancy reduction called Learning to Rank Patches +(LTRP). We observe that image reconstruction of masked image modeling models is +sensitive to the removal of visible patches when the masking ratio is high +(e.g., 90\%). Building upon it, we implement LTRP via two steps: inferring the +semantic density score of each patch by quantifying variation between +reconstructions with and without this patch, and learning to rank the patches +with the pseudo score. The entire process is self-supervised, thus getting out +of the dilemma of categorical inductive bias. We design extensive experiments +on different datasets and tasks. The results demonstrate that LTRP outperforms +both supervised and other self-supervised methods due to the fair assessment of +image content.",cs.CV,['cs.CV'] +Domain-Rectifying Adapter for Cross-Domain Few-Shot Segmentation,Jiapeng Su · Qi Fan · Wenjie Pei · Guangming Lu · Fanglin Chen, ,https://arxiv.org/abs/2404.10322v1,,2404.10322v1.pdf,Domain-Rectifying Adapter for Cross-Domain Few-Shot Segmentation,"Few-shot semantic segmentation (FSS) has achieved great success on segmenting +objects of novel classes, supported by only a few annotated samples. However, +existing FSS methods often underperform in the presence of domain shifts, +especially when encountering new domain styles that are unseen during training. +It is suboptimal to directly adapt or generalize the entire model to new +domains in the few-shot scenario. Instead, our key idea is to adapt a small +adapter for rectifying diverse target domain styles to the source domain. +Consequently, the rectified target domain features can fittingly benefit from +the well-optimized source domain segmentation model, which is intently trained +on sufficient source domain data. Training domain-rectifying adapter requires +sufficiently diverse target domains. 
We thus propose a novel local-global style +perturbation method to simulate diverse potential target domains by +perturbating the feature channel statistics of the individual images and +collective statistics of the entire source domain, respectively. Additionally, +we propose a cyclic domain alignment module to facilitate the adapter +effectively rectifying domains using a reverse domain rectification +supervision. The adapter is trained to rectify the image features from diverse +synthesized target domains to align with the source domain. During testing on +target domains, we start by rectifying the image features and then conduct +few-shot segmentation on the domain-rectified features. Extensive experiments +demonstrate the effectiveness of our method, achieving promising results on +cross-domain few-shot semantic segmentation tasks. Our code is available at +https://github.com/Matt-Su/DR-Adapter.",cs.CV,['cs.CV'] +GigaPose: Fast and Robust Novel Object Pose Estimation via One Correspondence,Van Nguyen Nguyen · Thibault Groueix · Mathieu Salzmann · Vincent Lepetit, ,https://arxiv.org/abs/2311.14155,,2311.14155.pdf,GigaPose: Fast and Robust Novel Object Pose Estimation via One Correspondence,"We present GigaPose, a fast, robust, and accurate method for CAD-based novel +object pose estimation in RGB images. GigaPose first leverages discriminative +""templates"", rendered images of the CAD models, to recover the out-of-plane +rotation and then uses patch correspondences to estimate the four remaining +parameters. Our approach samples templates in only a two-degrees-of-freedom +space instead of the usual three and matches the input image to the templates +using fast nearest-neighbor search in feature space, results in a speedup +factor of 35x compared to the state of the art. Moreover, GigaPose is +significantly more robust to segmentation errors. Our extensive evaluation on +the seven core datasets of the BOP challenge demonstrates that it achieves +state-of-the-art accuracy and can be seamlessly integrated with existing +refinement methods. Additionally, we show the potential of GigaPose with 3D +models predicted by recent work on 3D reconstruction from a single image, +relaxing the need for CAD models and making 6D pose object estimation much more +convenient. Our source code and trained models are publicly available at +https://github.com/nv-nguyen/gigaPose",cs.CV,['cs.CV'] +Fine-grained Prototypical Voting with Heterogeneous Mixup for Semi-supervised 2D-3D Cross-modal Retrieval,Fan Zhang · Xian-Sheng Hua · Chong Chen · Xiao Luo, ,,https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4774118,,,,,nan +SmartRefine: A Scenario-Adaptive Refinement Framework for Efficient Motion Prediction,Yang Zhou · Hao Shao · Letian Wang · Steven L. Waslander · Hongsheng Li · Yu Liu, ,https://arxiv.org/abs/2403.11492,,2403.11492.pdf,SmartRefine: A Scenario-Adaptive Refinement Framework for Efficient Motion Prediction,"Predicting the future motion of surrounding agents is essential for +autonomous vehicles (AVs) to operate safely in dynamic, human-robot-mixed +environments. Context information, such as road maps and surrounding agents' +states, provides crucial geometric and semantic information for motion behavior +prediction. To this end, recent works explore two-stage prediction frameworks +where coarse trajectories are first proposed, and then used to select critical +context information for trajectory refinement. 
However, they either incur a +large amount of computation or bring limited improvement, if not both. In this +paper, we introduce a novel scenario-adaptive refinement strategy, named +SmartRefine, to refine prediction with minimal additional computation. +Specifically, SmartRefine can comprehensively adapt refinement configurations +based on each scenario's properties, and smartly chooses the number of +refinement iterations by introducing a quality score to measure the prediction +quality and remaining refinement potential of each scenario. SmartRefine is +designed as a generic and flexible approach that can be seamlessly integrated +into most state-of-the-art motion prediction models. Experiments on Argoverse +(1 & 2) show that our method consistently improves the prediction accuracy of +multiple state-of-the-art prediction models. Specifically, by adding +SmartRefine to QCNet, we outperform all published ensemble-free works on the +Argoverse 2 leaderboard (single agent track) at submission. Comprehensive +studies are also conducted to ablate design choices and explore the mechanism +behind multi-iteration refinement. Codes are available at +https://github.com/opendilab/SmartRefine/",cs.CV,"['cs.CV', 'cs.AI', 'cs.RO']" +GenTron: Diffusion Transformers for Image and Video Generation,Shoufa Chen · Mengmeng Xu · Jiawei Ren · Yuren Cong · Sen He · Yanping Xie · Animesh Sinha · Ping Luo · Tao Xiang · Juan-Manuel Pérez-Rúa, ,https://arxiv.org/abs/2312.04557,,2312.04557.pdf,GenTron: Diffusion Transformers for Image and Video Generation,"In this study, we explore Transformer-based diffusion models for image and +video generation. Despite the dominance of Transformer architectures in various +fields due to their flexibility and scalability, the visual generative domain +primarily utilizes CNN-based U-Net architectures, particularly in +diffusion-based models. We introduce GenTron, a family of Generative models +employing Transformer-based diffusion, to address this gap. Our initial step +was to adapt Diffusion Transformers (DiTs) from class to text conditioning, a +process involving thorough empirical exploration of the conditioning mechanism. +We then scale GenTron from approximately 900M to over 3B parameters, observing +significant improvements in visual quality. Furthermore, we extend GenTron to +text-to-video generation, incorporating novel motion-free guidance to enhance +video quality. In human evaluations against SDXL, GenTron achieves a 51.1% win +rate in visual quality (with a 19.8% draw rate), and a 42.3% win rate in text +alignment (with a 42.9% draw rate). GenTron also excels in the T2I-CompBench, +underscoring its strengths in compositional generation. We believe this work +will provide meaningful insights and serve as a valuable reference for future +research.",cs.CV,['cs.CV'] +Auto-Train-Once: Controller Network Guided Automatic Network Pruning from Scratch,Xidong Wu · Shangqian Gao · Zeyu Zhang · Zhenzhen Li · Runxue Bao · Yanfu Zhang · Xiaoqian Wang · Heng Huang, ,https://arxiv.org/abs/2403.14729,,2403.14729.pdf,Auto-Train-Once: Controller Network Guided Automatic Network Pruning from Scratch,"Current techniques for deep neural network (DNN) pruning often involve +intricate multi-step processes that require domain-specific expertise, making +their widespread adoption challenging. To address the limitation, the +Only-Train-Once (OTO) and OTOv2 are proposed to eliminate the need for +additional fine-tuning steps by directly training and compressing a general DNN +from scratch. 
Nevertheless, the static design of optimizers (in OTO) can lead +to convergence issues of local optima. In this paper, we proposed the +Auto-Train-Once (ATO), an innovative network pruning algorithm designed to +automatically reduce the computational and storage costs of DNNs. During the +model training phase, our approach not only trains the target model but also +leverages a controller network as an architecture generator to guide the +learning of target model weights. Furthermore, we developed a novel stochastic +gradient algorithm that enhances the coordination between model training and +controller network training, thereby improving pruning performance. We provide +a comprehensive convergence analysis as well as extensive experiments, and the +results show that our approach achieves state-of-the-art performance across +various model architectures (including ResNet18, ResNet34, ResNet50, ResNet56, +and MobileNetv2) on standard benchmark datasets (CIFAR-10, CIFAR-100, and +ImageNet).",cs.CV,"['cs.CV', 'cs.LG']" +NOPE: Novel Object Pose Estimation from a Single Image,Van Nguyen Nguyen · Thibault Groueix · Georgy Ponimatkin · Yinlin Hu · Renaud Marlet · Mathieu Salzmann · Vincent Lepetit, ,https://arxiv.org/abs/2311.14155,,,GigaPose: Fast and Robust Novel Object Pose Estimation via One Correspondence,"We present GigaPose, a fast, robust, and accurate method for CAD-based novel +object pose estimation in RGB images. GigaPose first leverages discriminative +""templates"", rendered images of the CAD models, to recover the out-of-plane +rotation and then uses patch correspondences to estimate the four remaining +parameters. Our approach samples templates in only a two-degrees-of-freedom +space instead of the usual three and matches the input image to the templates +using fast nearest-neighbor search in feature space, results in a speedup +factor of 35x compared to the state of the art. Moreover, GigaPose is +significantly more robust to segmentation errors. Our extensive evaluation on +the seven core datasets of the BOP challenge demonstrates that it achieves +state-of-the-art accuracy and can be seamlessly integrated with existing +refinement methods. Additionally, we show the potential of GigaPose with 3D +models predicted by recent work on 3D reconstruction from a single image, +relaxing the need for CAD models and making 6D pose object estimation much more +convenient. Our source code and trained models are publicly available at +https://github.com/nv-nguyen/gigaPose",cs.CV,['cs.CV'] +Generalizable Whole Slide Image Classification with Fine-Grained Visual-Semantic Interaction,Hao Li · Ying Chen · Yifei Chen · Rongshan Yu · Wenxian Yang · Liansheng Wang · Bowen Ding · Yuchen Han, ,https://arxiv.org/abs/2402.19326,,2402.19326.pdf,Generalizable Whole Slide Image Classification with Fine-Grained Visual-Semantic Interaction,"Whole Slide Image (WSI) classification is often formulated as a Multiple +Instance Learning (MIL) problem. Recently, Vision-Language Models (VLMs) have +demonstrated remarkable performance in WSI classification. However, existing +methods leverage coarse-grained pathogenetic descriptions for visual +representation supervision, which are insufficient to capture the complex +visual appearance of pathogenetic images, hindering the generalizability of +models on diverse downstream tasks. Additionally, processing high-resolution +WSIs can be computationally expensive. 
In this paper, we propose a novel +""Fine-grained Visual-Semantic Interaction"" (FiVE) framework for WSI +classification. It is designed to enhance the model's generalizability by +leveraging the interaction between localized visual patterns and fine-grained +pathological semantics. Specifically, with meticulously designed queries, we +start by utilizing a large language model to extract fine-grained pathological +descriptions from various non-standardized raw reports. The output descriptions +are then reconstructed into fine-grained labels used for training. By +introducing a Task-specific Fine-grained Semantics (TFS) module, we enable +prompts to capture crucial visual information in WSIs, which enhances +representation learning and augments generalization capabilities significantly. +Furthermore, given that pathological visual patterns are redundantly +distributed across tissue slices, we sample a subset of visual instances during +training. Our method demonstrates robust generalizability and strong +transferability, dominantly outperforming the counterparts on the TCGA Lung +Cancer dataset with at least 9.19% higher accuracy in few-shot experiments. The +code is available at: https://github.com/ls1rius/WSI_FiVE.",cs.CV,['cs.CV'] +In-distribution Public Data Synthesis with Diffusion Models for Differentially Private Image Classification,Jinseong Park · Yujin Choi · Jaewook Lee,https://jinseongp.github.io/2024/05/28/cvpr2024.html,,https://jinseongp.github.io/2024/05/28/cvpr2024.html,,,,,nan +CapHuman: Capture Your Moments in Parallel Universes,Chao Liang · Fan Ma · Linchao Zhu · Yingying Deng · Yi Yang,https://caphuman.github.io/,https://arxiv.org/abs/2402.00627,,2402.00627.pdf,CapHuman: Capture Your Moments in Parallel Universes,"We concentrate on a novel human-centric image synthesis task, that is, given +only one reference facial photograph, it is expected to generate specific +individual images with diverse head positions, poses, facial expressions, and +illuminations in different contexts. To accomplish this goal, we argue that our +generative model should be capable of the following favorable characteristics: +(1) a strong visual and semantic understanding of our world and human society +for basic object and human image generation. (2) generalizable identity +preservation ability. (3) flexible and fine-grained head control. Recently, +large pre-trained text-to-image diffusion models have shown remarkable results, +serving as a powerful generative foundation. As a basis, we aim to unleash the +above two capabilities of the pre-trained model. In this work, we present a new +framework named CapHuman. We embrace the ""encode then learn to align"" paradigm, +which enables generalizable identity preservation for new individuals without +cumbersome tuning at inference. CapHuman encodes identity features and then +learns to align them into the latent space. Moreover, we introduce the 3D +facial prior to equip our model with control over the human head in a flexible +and 3D-consistent manner. Extensive qualitative and quantitative analyses +demonstrate our CapHuman can produce well-identity-preserved, photo-realistic, +and high-fidelity portraits with content-rich representations and various head +renditions, superior to established baselines. 
Code and checkpoint will be +released at https://github.com/VamosC/CapHuman.",cs.CV,"['cs.CV', 'cs.AI']" +CuVLER: Enhanced Unsupervised Object Discoveries through Exhaustive Self-Supervised Transformers,Shahaf Arica · Or Rubin · Sapir Gershov · Shlomi Laufer,https://github.com/shahaf-arica/cuvler,https://arxiv.org/abs/2403.07700,,2403.07700.pdf,CuVLER: Enhanced Unsupervised Object Discoveries through Exhaustive Self-Supervised Transformers,"In this paper, we introduce VoteCut, an innovative method for unsupervised +object discovery that leverages feature representations from multiple +self-supervised models. VoteCut employs normalized-cut based graph +partitioning, clustering and a pixel voting approach. Additionally, We present +CuVLER (Cut-Vote-and-LEaRn), a zero-shot model, trained using pseudo-labels, +generated by VoteCut, and a novel soft target loss to refine segmentation +accuracy. Through rigorous evaluations across multiple datasets and several +unsupervised setups, our methods demonstrate significant improvements in +comparison to previous state-of-the-art models. Our ablation studies further +highlight the contributions of each component, revealing the robustness and +efficacy of our approach. Collectively, VoteCut and CuVLER pave the way for +future advancements in image segmentation.",cs.CV,['cs.CV'] +LEDITS++: Limitless Image Editing using Text-to-Image Models,Manuel Brack · Felix Friedrich · Katharina Kornmeier · Linoy Tsaban · Patrick Schramowski · Kristian Kersting · Apolinário Passos, ,https://arxiv.org/abs/2311.16711,,2311.16711.pdf,LEDITS++: Limitless Image Editing using Text-to-Image Models,"Text-to-image diffusion models have recently received increasing interest for +their astonishing ability to produce high-fidelity images from solely text +inputs. Subsequent research efforts aim to exploit and apply their capabilities +to real image editing. However, existing image-to-image methods are often +inefficient, imprecise, and of limited versatility. They either require +time-consuming fine-tuning, deviate unnecessarily strongly from the input +image, and/or lack support for multiple, simultaneous edits. To address these +issues, we introduce LEDITS++, an efficient yet versatile and precise textual +image manipulation technique. LEDITS++'s novel inversion approach requires no +tuning nor optimization and produces high-fidelity results with a few diffusion +steps. Second, our methodology supports multiple simultaneous edits and is +architecture-agnostic. Third, we use a novel implicit masking technique that +limits changes to relevant image regions. We propose the novel TEdBench++ +benchmark as part of our exhaustive evaluation. Our results demonstrate the +capabilities of LEDITS++ and its improvements over previous methods. The +project page is available at https://leditsplusplus-project.static.hf.space .",cs.CV,"['cs.CV', 'cs.AI', 'cs.HC', 'cs.LG']" +Are Conventional SNNs Really Efficient? A Perspective from Network Quantization,Guobin Shen · Dongcheng Zhao · Tenglong Li · Jindong Li · Yi Zeng, ,https://arxiv.org/abs/2311.10802,,2311.10802.pdf,Is Conventional SNN Really Efficient? A Perspective from Network Quantization,"Spiking Neural Networks (SNNs) have been widely praised for their high energy +efficiency and immense potential. However, comprehensive research that +critically contrasts and correlates SNNs with quantized Artificial Neural +Networks (ANNs) remains scant, often leading to skewed comparisons lacking +fairness towards ANNs. 
This paper introduces a unified perspective, +illustrating that the time steps in SNNs and quantized bit-widths of activation +values present analogous representations. Building on this, we present a more +pragmatic and rational approach to estimating the energy consumption of SNNs. +Diverging from the conventional Synaptic Operations (SynOps), we champion the +""Bit Budget"" concept. This notion permits an intricate discourse on +strategically allocating computational and storage resources between weights, +activation values, and temporal steps under stringent hardware constraints. +Guided by the Bit Budget paradigm, we discern that pivoting efforts towards +spike patterns and weight quantization, rather than temporal attributes, +elicits profound implications for model performance. Utilizing the Bit Budget +for holistic design consideration of SNNs elevates model performance across +diverse data types, encompassing static imagery and neuromorphic datasets. Our +revelations bridge the theoretical chasm between SNNs and quantized ANNs and +illuminate a pragmatic trajectory for future endeavors in energy-efficient +neural computations.",cs.NE,['cs.NE'] +Task-conditioned adaptation of visual features in multi-task policy learning,Pierre Marza · Laetitia Matignon · Olivier Simonin · Christian Wolf,https://pierremarza.github.io/projects/task_conditioned_adaptation/,https://arxiv.org/abs/2402.07739v1,,2402.07739v1.pdf,Task-conditioned adaptation of visual features in multi-task policy learning,"Successfully addressing a wide variety of tasks is a core ability of +autonomous agents, which requires flexibly adapting the underlying +decision-making strategies and, as we argue in this work, also adapting the +underlying perception modules. An analogical argument would be the human visual +system, which uses top-down signals to focus attention determined by the +current task. Similarly, in this work, we adapt pre-trained large vision models +conditioned on specific downstream tasks in the context of multi-task policy +learning. We introduce task-conditioned adapters that do not require finetuning +any pre-trained weights, combined with a single policy trained with behavior +cloning and capable of addressing multiple tasks. We condition the policy and +visual adapters on task embeddings, which can be selected at inference if the +task is known, or alternatively inferred from a set of example demonstrations. +To this end, we propose a new optimization-based estimator. We evaluate the +method on a wide variety of tasks of the CortexBench benchmark and show that, +compared to existing work, it can be addressed with a single policy. In +particular, we demonstrate that adapting visual features is a key design choice +and that the method generalizes to unseen tasks given visual demonstrations.",cs.CV,"['cs.CV', 'cs.LG', 'cs.RO']" +Open-Vocabulary Video Anomaly Detection,Peng Wu · Xuerong Zhou · Guansong Pang · Yujia Sun · Jing Liu · Peng Wang · Yanning Zhang, ,https://arxiv.org/abs/2311.07042,,2311.07042.pdf,Open-Vocabulary Video Anomaly Detection,"Video anomaly detection (VAD) with weak supervision has achieved remarkable +performance in utilizing video-level labels to discriminate whether a video +frame is normal or abnormal. However, current approaches are inherently limited +to a closed-set setting and may struggle in open-world applications where there +can be anomaly categories in the test data unseen during training. 
A few recent +studies attempt to tackle a more realistic setting, open-set VAD, which aims to +detect unseen anomalies given seen anomalies and normal videos. However, such a +setting focuses on predicting frame anomaly scores, having no ability to +recognize the specific categories of anomalies, despite the fact that this +ability is essential for building more informed video surveillance systems. +This paper takes a step further and explores open-vocabulary video anomaly +detection (OVVAD), in which we aim to leverage pre-trained large models to +detect and categorize seen and unseen anomalies. To this end, we propose a +model that decouples OVVAD into two mutually complementary tasks -- +class-agnostic detection and class-specific classification -- and jointly +optimizes both tasks. Particularly, we devise a semantic knowledge injection +module to introduce semantic knowledge from large language models for the +detection task, and design a novel anomaly synthesis module to generate pseudo +unseen anomaly videos with the help of large vision generation models for the +classification task. These semantic knowledge and synthesis anomalies +substantially extend our model's capability in detecting and categorizing a +variety of seen and unseen anomalies. Extensive experiments on three +widely-used benchmarks demonstrate our model achieves state-of-the-art +performance on OVVAD task.",cs.CV,['cs.CV'] +Hierarchical Histogram Threshold Segmentation – Auto-terminating High-detail Oversegmentation,Thomas Chang · Simon Seibt · Bartosz von Rymon Lipinski,https://changtvs.github.io/hierarchical-histogram-threshold-segmentation/,,https://www.nature.com/articles/s41598-023-36066-8,,,,,nan +ManiFPT: Defining and Analyzing Fingerprints of Generative Models,Hae Jin Song · Mahyar Khayatkhoei · Wael AbdAlmageed, ,https://arxiv.org/abs/2402.10401,,2402.10401.pdf,ManiFPT: Defining and Analyzing Fingerprints of Generative Models,"Recent works have shown that generative models leave traces of their +underlying generative process on the generated samples, broadly referred to as +fingerprints of a generative model, and have studied their utility in detecting +synthetic images from real ones. However, the extend to which these +fingerprints can distinguish between various types of synthetic image and help +identify the underlying generative process remain under-explored. In +particular, the very definition of a fingerprint remains unclear, to our +knowledge. To that end, in this work, we formalize the definition of artifact +and fingerprint in generative models, propose an algorithm for computing them +in practice, and finally study its effectiveness in distinguishing a large +array of different generative models. We find that using our proposed +definition can significantly improve the performance on the task of identifying +the underlying generative process from samples (model attribution) compared to +existing methods. 
Additionally, we study the structure of the fingerprints, and +observe that it is very predictive of the effect of different design choices on +the generative process.",cs.LG,"['cs.LG', 'cs.CV']" +Beyond Text: Frozen Large Language Models in Visual Signal Comprehension,Lei Zhu · Fangyun Wei · Yanye Lu, ,https://arxiv.org/abs/2403.07874,,2403.07874.pdf,Beyond Text: Frozen Large Language Models in Visual Signal Comprehension,"In this work, we investigate the potential of a large language model (LLM) to +directly comprehend visual signals without the necessity of fine-tuning on +multi-modal datasets. The foundational concept of our method views an image as +a linguistic entity, and translates it to a set of discrete words derived from +the LLM's vocabulary. To achieve this, we present the Vision-to-Language +Tokenizer, abbreviated as V2T Tokenizer, which transforms an image into a +``foreign language'' with the combined aid of an encoder-decoder, the LLM +vocabulary, and a CLIP model. With this innovative image encoding, the LLM +gains the ability not only for visual comprehension but also for image +denoising and restoration in an auto-regressive fashion-crucially, without any +fine-tuning. We undertake rigorous experiments to validate our method, +encompassing understanding tasks like image recognition, image captioning, and +visual question answering, as well as image denoising tasks like inpainting, +outpainting, deblurring, and shift restoration. Code and models are available +at https://github.com/zh460045050/V2L-Tokenizer.",cs.CV,['cs.CV'] +Neural Point Cloud Diffusion for Disentangled 3D Shape and Appearance Generation,Philipp Schröppel · Christopher Wewer · Jan Lenssen · Eddy Ilg · Thomas Brox,https://neural-point-cloud-diffusion.github.io/,https://arxiv.org/abs/2312.14124,,2312.14124.pdf,Neural Point Cloud Diffusion for Disentangled 3D Shape and Appearance Generation,"Controllable generation of 3D assets is important for many practical +applications like content creation in movies, games and engineering, as well as +in AR/VR. Recently, diffusion models have shown remarkable results in +generation quality of 3D objects. However, none of the existing models enable +disentangled generation to control the shape and appearance separately. For the +first time, we present a suitable representation for 3D diffusion models to +enable such disentanglement by introducing a hybrid point cloud and neural +radiance field approach. We model a diffusion process over point positions +jointly with a high-dimensional feature space for a local density and radiance +decoder. While the point positions represent the coarse shape of the object, +the point features allow modeling the geometry and appearance details. This +disentanglement enables us to sample both independently and therefore to +control both separately. Our approach sets a new state of the art in generation +compared to previous disentanglement-capable methods by reduced FID scores of +30-90% and is on-par with other non disentanglement-capable state-of-the art +methods.",cs.CV,['cs.CV'] +SportsHHI: A Dataset for Human-Human Interaction Detection in Sports Videos,Tao Wu · Runyu He · Gangshan Wu · Limin Wang,https://github.com/MCG-NJU/SportsHHI,https://arxiv.org/abs/2404.04565,,2404.04565.pdf,SportsHHI: A Dataset for Human-Human Interaction Detection in Sports Videos,"Video-based visual relation detection tasks, such as video scene graph +generation, play important roles in fine-grained video understanding. 
However, +current video visual relation detection datasets have two main limitations that +hinder the progress of research in this area. First, they do not explore +complex human-human interactions in multi-person scenarios. Second, the +relation types of existing datasets have relatively low-level semantics and can +be often recognized by appearance or simple prior information, without the need +for detailed spatio-temporal context reasoning. Nevertheless, comprehending +high-level interactions between humans is crucial for understanding complex +multi-person videos, such as sports and surveillance videos. To address this +issue, we propose a new video visual relation detection task: video human-human +interaction detection, and build a dataset named SportsHHI for it. SportsHHI +contains 34 high-level interaction classes from basketball and volleyball +sports. 118,075 human bounding boxes and 50,649 interaction instances are +annotated on 11,398 keyframes. To benchmark this, we propose a two-stage +baseline method and conduct extensive experiments to reveal the key factors for +a successful human-human interaction detector. We hope that SportsHHI can +stimulate research on human interaction understanding in videos and promote the +development of spatio-temporal context modeling techniques in video visual +relation detection.",cs.CV,['cs.CV'] +Time-Efficient Light-Field Acquisition Using Coded Aperture and Events,Shuji Habuchi · Keita Takahashi · Chihiro Tsutake · Toshiaki Fujii · Hajime Nagahara,https://www.fujii.nuee.nagoya-u.ac.jp/Research/EventLF/,https://arxiv.org/abs/2403.07244,,2403.07244.pdf,Time-Efficient Light-Field Acquisition Using Coded Aperture and Events,"We propose a computational imaging method for time-efficient light-field +acquisition that combines a coded aperture with an event-based camera. +Different from the conventional coded-aperture imaging method, our method +applies a sequence of coding patterns during a single exposure for an image +frame. The parallax information, which is related to the differences in coding +patterns, is recorded as events. The image frame and events, all of which are +measured in a single exposure, are jointly used to computationally reconstruct +a light field. We also designed an algorithm pipeline for our method that is +end-to-end trainable on the basis of deep optics and compatible with real +camera hardware. We experimentally showed that our method can achieve more +accurate reconstruction than several other imaging methods with a single +exposure. We also developed a hardware prototype with the potential to complete +the measurement on the camera within 22 msec and demonstrated that light fields +from real 3-D scenes can be obtained with convincing visual quality. Our +software and supplementary video are available from our project website.",cs.CV,"['cs.CV', 'eess.IV']" +Rapid Motor Adaptation for Robotic Manipulator Arms,Yichao Liang · Kevin Ellis · João F. Henriques, ,https://arxiv.org/abs/2312.04670v1,,2312.04670v1.pdf,Rapid Motor Adaptation for Robotic Manipulator Arms,"Developing generalizable manipulation skills is a core challenge in embodied +AI. This includes generalization across diverse task configurations, +encompassing variations in object shape, density, friction coefficient, and +external disturbances such as forces applied to the robot. Rapid Motor +Adaptation (RMA) offers a promising solution to this challenge. 
It posits that +essential hidden variables influencing an agent's task performance, such as +object mass and shape, can be effectively inferred from the agent's action and +proprioceptive history. Drawing inspiration from RMA in locomotion and in-hand +rotation, we use depth perception to develop agents tailored for rapid motor +adaptation in a variety of manipulation tasks. We evaluated our agents on four +challenging tasks from the Maniskill2 benchmark, namely pick-and-place +operations with hundreds of objects from the YCB and EGAD datasets, peg +insertion with precise position and orientation, and operating a variety of +faucets and handles, with customized environment variations. Empirical results +demonstrate that our agents surpass state-of-the-art methods like automatic +domain randomization and vision-based policies, obtaining better generalization +performance and sample efficiency.",cs.RO,"['cs.RO', 'cs.AI', 'cs.CV', 'cs.LG']" +Rethinking Prior Information Generation with CLIP for Few-Shot Segmentation,Jin Wang · Bingfeng Zhang · Jian Pang · Honglong Chen · Weifeng Liu, ,https://arxiv.org/abs/2405.08458,,2405.08458.pdf,Rethinking Prior Information Generation with CLIP for Few-Shot Segmentation,"Few-shot segmentation remains challenging due to the limitations of its +labeling information for unseen classes. Most previous approaches rely on +extracting high-level feature maps from the frozen visual encoder to compute +the pixel-wise similarity as a key prior guidance for the decoder. However, +such a prior representation suffers from coarse granularity and poor +generalization to new classes since these high-level feature maps have obvious +category bias. In this work, we propose to replace the visual prior +representation with the visual-text alignment capacity to capture more reliable +guidance and enhance the model generalization. Specifically, we design two +kinds of training-free prior information generation strategy that attempts to +utilize the semantic alignment capability of the Contrastive Language-Image +Pre-training model (CLIP) to locate the target class. Besides, to acquire more +accurate prior guidance, we build a high-order relationship of attention maps +and utilize it to refine the initial prior information. Experiments on both the +PASCAL-5{i} and COCO-20{i} datasets show that our method obtains a clearly +substantial improvement and reaches the new state-of-the-art performance.",cs.CV,['cs.CV'] +A Unified Framework for Microscopy Defocus Deblur with Multi-Pyramid Transformer and Contrastive Learning,Yuelin Zhang · Pengyu Zheng · Wanquan Yan · Chengyu Fang · Shing Shin Cheng, ,https://arxiv.org/abs/2403.02611,,2403.02611.pdf,A Unified Framework for Microscopy Defocus Deblur with Multi-Pyramid Transformer and Contrastive Learning,"Defocus blur is a persistent problem in microscope imaging that poses harm to +pathology interpretation and medical intervention in cell microscopy and +microscope surgery. To address this problem, a unified framework including the +multi-pyramid transformer (MPT) and extended frequency contrastive +regularization (EFCR) is proposed to tackle two outstanding challenges in +microscopy deblur: longer attention span and data deficiency. The MPT employs +an explicit pyramid structure at each network stage that integrates the +cross-scale window attention (CSWA), the intra-scale channel attention (ISCA), +and the feature-enhancing feed-forward network (FEFN) to capture long-range +cross-scale spatial interaction and global channel context. 
The EFCR addresses +the data deficiency problem by exploring latent deblur signals from different +frequency bands. It also enables deblur knowledge transfer to learn +cross-domain information from extra data, improving deblur performance for +labeled and unlabeled data. Extensive experiments and downstream task +validation show the framework achieves state-of-the-art performance across +multiple datasets. Project page: https://github.com/PieceZhang/MPT-CataBlur.",cs.CV,"['cs.CV', 'cs.AI']" +Rotation-Agnostic Image Representation Learning for Digital Pathology,Saghir Alfasly · Abubakr Shafique · Peyman Nejat · Jibran Khan · Areej Alsaafin · Ghazal Alabtah · Hamid Tizhoosh,https://kimialabmayo.github.io/PathDino-Page/,https://arxiv.org/abs/2311.08359,,2311.08359.pdf,Rotation-Agnostic Image Representation Learning for Digital Pathology,"This paper addresses complex challenges in histopathological image analysis +through three key contributions. Firstly, it introduces a fast patch selection +method, FPS, for whole-slide image (WSI) analysis, significantly reducing +computational cost while maintaining accuracy. Secondly, it presents PathDino, +a lightweight histopathology feature extractor with a minimal configuration of +five Transformer blocks and only 9 million parameters, markedly fewer than +alternatives. Thirdly, it introduces a rotation-agnostic representation +learning paradigm using self-supervised learning, effectively mitigating +overfitting. We also show that our compact model outperforms existing +state-of-the-art histopathology-specific vision transformers on 12 diverse +datasets, including both internal datasets spanning four sites (breast, liver, +skin, and colorectal) and seven public datasets (PANDA, CAMELYON16, BRACS, +DigestPath, Kather, PanNuke, and WSSS4LUAD). Notably, even with a training +dataset of 6 million histopathology patches from The Cancer Genome Atlas +(TCGA), our approach demonstrates an average 8.5% improvement in patch-level +majority vote performance. These contributions provide a robust framework for +enhancing image analysis in digital pathology, rigorously validated through +extensive evaluation. Project Page: +https://kimialabmayo.github.io/PathDino-Page/",cs.CV,['cs.CV'] +Weakly Misalignment-free Adaptive Feature Alignment for UAVs-based Multimodal Object Detection,Chen Chen · Jiahao Qi · Xingyue Liu · Kangcheng Bin · Ruigang Fu · Xikun Hu · Ping Zhong, ,https://arxiv.org/abs/2405.16873,,2405.16873.pdf,ContrastAlign: Toward Robust BEV Feature Alignment via Contrastive Learning for Multi-Modal 3D Object Detection,"In the field of 3D object detection tasks, fusing heterogeneous features from +LiDAR and camera sensors into a unified Bird's Eye View (BEV) representation is +a widely adopted paradigm. However, existing methods are often compromised by +imprecise sensor calibration, resulting in feature misalignment in LiDAR-camera +BEV fusion. Moreover, such inaccuracies result in errors in depth estimation +for the camera branch, ultimately causing misalignment between LiDAR and camera +BEV features. In this work, we propose a novel ContrastAlign approach that +utilizes contrastive learning to enhance the alignment of heterogeneous +modalities, thereby improving the robustness of the fusion process. +Specifically, our approach includes the L-Instance module, which directly +outputs LiDAR instance features within LiDAR BEV features. 
Then, we introduce +the C-Instance module, which predicts camera instance features through RoI +(Region of Interest) pooling on the camera BEV features. We propose the +InstanceFusion module, which utilizes contrastive learning to generate similar +instance features across heterogeneous modalities. We then use graph matching +to calculate the similarity between the neighboring camera instance features +and the similarity instance features to complete the alignment of instance +features. Our method achieves state-of-the-art performance, with an mAP of +70.3%, surpassing BEVFusion by 1.8% on the nuScenes validation set. +Importantly, our method outperforms BEVFusion by 7.3% under conditions with +misalignment noise.",cs.CV,['cs.CV'] +Learning with Structural Labels for Learning with Noisy Labels,Noo-ri Kim · Jin-Seop Lee · Jee-Hyong Lee, ,https://arxiv.org/abs/2401.04390,,2401.04390.pdf,Learning with Noisy Labels: Interconnection of Two Expectation-Maximizations,"Labor-intensive labeling becomes a bottleneck in developing computer vision +algorithms based on deep learning. For this reason, dealing with imperfect +labels has increasingly gained attention and has become an active field of +study. We address learning with noisy labels (LNL) problem, which is formalized +as a task of finding a structured manifold in the midst of noisy data. In this +framework, we provide a proper objective function and an optimization algorithm +based on two expectation-maximization (EM) cycles. The separate networks +associated with the two EM cycles collaborate to optimize the objective +function, where one model is for distinguishing clean labels from corrupted +ones while the other is for refurbishing the corrupted labels. This approach +results in a non-collapsing LNL-flywheel model in the end. Experiments show +that our algorithm achieves state-of-the-art performance in multiple standard +benchmarks with substantial margins under various types of label noise.",cs.CV,['cs.CV'] +Customize your NeRF: Adaptive Source Driven 3D Scene Editing via Local-Global Iterative Training,Runze He · Shaofei Huang · Xuecheng Nie · Tianrui Hui · Luoqi Liu · Jiao Dai · Jizhong Han · Guanbin Li · Si Liu,https://customnerf.github.io/,https://arxiv.org/abs/2312.01663,,2312.01663.pdf,Customize your NeRF: Adaptive Source Driven 3D Scene Editing via Local-Global Iterative Training,"In this paper, we target the adaptive source driven 3D scene editing task by +proposing a CustomNeRF model that unifies a text description or a reference +image as the editing prompt. However, obtaining desired editing results +conformed with the editing prompt is nontrivial since there exist two +significant challenges, including accurate editing of only foreground regions +and multi-view consistency given a single-view reference image. To tackle the +first challenge, we propose a Local-Global Iterative Editing (LGIE) training +scheme that alternates between foreground region editing and full-image +editing, aimed at foreground-only manipulation while preserving the background. +For the second challenge, we also design a class-guided regularization that +exploits class priors within the generation model to alleviate the +inconsistency problem among different views in image-driven editing. 
Extensive +experiments show that our CustomNeRF produces precise editing results under +various real scenes for both text- and image-driven settings.",cs.CV,"['cs.CV', 'cs.AI']" +Training-free Pretrained Model Merging,Zhengqi Xu · Ke Yuan · Huiqiong Wang · Yong Wang · Mingli Song · Jie Song,https://github.com/zju-vipa/training_free_model_merging,https://arxiv.org/abs/2403.01753,,2403.01753.pdf,Training-Free Pretrained Model Merging,"Recently, model merging techniques have surfaced as a solution to combine +multiple single-talent models into a single multi-talent model. However, +previous endeavors in this field have either necessitated additional training +or fine-tuning processes, or require that the models possess the same +pre-trained initialization. In this work, we identify a common drawback in +prior works w.r.t. the inconsistency of unit similarity in the weight space and +the activation space. To address this inconsistency, we propose an innovative +model merging framework, coined as merging under dual-space constraints +(MuDSC). Specifically, instead of solely maximizing the objective of a single +space, we advocate for the exploration of permutation matrices situated in a +region with a unified high similarity in the dual space, achieved through the +linear combination of activation and weight similarity matrices. In order to +enhance usability, we have also incorporated adaptations for group structure, +including Multi-Head Attention and Group Normalization. Comprehensive +experimental comparisons demonstrate that MuDSC can significantly boost the +performance of merged models with various task combinations and architectures. +Furthermore, the visualization of the merged model within the multi-task loss +landscape reveals that MuDSC enables the merged model to reside in the +overlapping segment, featuring a unified lower loss for each task. Our code is +publicly available at https://github.com/zju-vipa/training_free_model_merging.",cs.CV,['cs.CV'] +SNED: Superposition Network Architecture Search for Efficient Video Diffusion Model,Zhengang Li · Yan Kang · Yuchen Liu · Difan Liu · Tobias Hinz · Feng Liu · Yanzhi Wang, ,https://ar5iv.labs.arxiv.org/html/2211.11018,,2211.11018.pdf,MagicVideo: Efficient Video Generation With Latent Diffusion Models,"We present an efficient text-to-video generation framework based on latent +diffusion models, termed MagicVideo. MagicVideo can generate smooth video clips +that are concordant with the given text descriptions. Due to a novel and +efficient 3D U-Net design and modeling video distributions in a low-dimensional +space, MagicVideo can synthesize video clips with 256x256 spatial resolution on +a single GPU card, which takes around 64x fewer computations than the Video +Diffusion Models (VDM) in terms of FLOPs. In specific, unlike existing works +that directly train video models in the RGB space, we use a pre-trained VAE to +map video clips into a low-dimensional latent space and learn the distribution +of videos' latent codes via a diffusion model. Besides, we introduce two new +designs to adapt the U-Net denoiser trained on image tasks to video data: a +frame-wise lightweight adaptor for the image-to-video distribution adjustment +and a directed temporal attention module to capture temporal dependencies +across frames. Thus, we can exploit the informative weights of convolution +operators from a text-to-image model for accelerating video training. 
To +ameliorate the pixel dithering in the generated videos, we also propose a novel +VideoVAE auto-encoder for better RGB reconstruction. We conduct extensive +experiments and demonstrate that MagicVideo can generate high-quality video +clips with either realistic or imaginary content. Refer to +\url{https://magicvideo.github.io/#} for more examples.",cs.CV,['cs.CV'] +GenFlow: Generalizable Recurrent Flow for 6D Pose Refinement of Novel Objects,Sungphill Moon · Hyeontae Son · Dongcheol Hur · Sangwook Kim, ,https://arxiv.org/abs/2403.11510,,2403.11510.pdf,GenFlow: Generalizable Recurrent Flow for 6D Pose Refinement of Novel Objects,"Despite the progress of learning-based methods for 6D object pose estimation, +the trade-off between accuracy and scalability for novel objects still exists. +Specifically, previous methods for novel objects do not make good use of the +target object's 3D shape information since they focus on generalization by +processing the shape indirectly, making them less effective. We present +GenFlow, an approach that enables both accuracy and generalization to novel +objects with the guidance of the target object's shape. Our method predicts +optical flow between the rendered image and the observed image and refines the +6D pose iteratively. It boosts the performance by a constraint of the 3D shape +and the generalizable geometric knowledge learned from an end-to-end +differentiable system. We further improve our model by designing a cascade +network architecture to exploit the multi-scale correlations and coarse-to-fine +refinement. GenFlow ranked first on the unseen object pose estimation +benchmarks in both the RGB and RGB-D cases. It also achieves performance +competitive with existing state-of-the-art methods for the seen object pose +estimation without any fine-tuning.",cs.CV,['cs.CV'] +Day-Night Cross-domain Vehicle Re-identification,Hongchao Li · Jingong Chen · AIHUA ZHENG · Yong Wu · YongLong Luo, ,,https://www.mdpi.com/2079-9292/13/10/1823,,,,,nan +Making Visual Sense of Oracle Bones for You and Me,Runqi Qiao · LAN YANG · Kaiyue Pang · Honggang Zhang, ,https://arxiv.org/abs/2311.15421,,2311.15421.pdf,Wired Perspectives: Multi-View Wire Art Embraces Generative AI,"Creating multi-view wire art (MVWA), a static 3D sculpture with diverse +interpretations from different viewpoints, is a complex task even for skilled +artists. In response, we present DreamWire, an AI system enabling everyone to +craft MVWA easily. Users express their vision through text prompts or +scribbles, freeing them from intricate 3D wire organisation. Our approach +synergises 3D B\'ezier curves, Prim's algorithm, and knowledge distillation +from diffusion models or their variants (e.g., ControlNet). This blend enables +the system to represent 3D wire art, ensuring spatial continuity and overcoming +data scarcity. 
Extensive evaluation and analysis are conducted to shed insight +on the inner workings of the proposed system, including the trade-off between +connectivity and visual aesthetics.",cs.CV,"['cs.CV', 'cs.AI']" +EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI,Tai Wang · Xiaohan Mao · Chenming Zhu · Runsen Xu · Ruiyuan Lyu · Peisen Li · Xiao Chen · Wenwei Zhang · Kai Chen · Tianfan Xue · Xihui Liu · Cewu Lu · Dahua Lin · Jiangmiao Pang, ,https://arxiv.org/abs/2312.16170v1,,2312.16170v1.pdf,EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI,"In the realm of computer vision and robotics, embodied agents are expected to +explore their environment and carry out human instructions. This necessitates +the ability to fully understand 3D scenes given their first-person observations +and contextualize them into language for interaction. However, traditional +research focuses more on scene-level input and output setups from a global +view. To address the gap, we introduce EmbodiedScan, a multi-modal, ego-centric +3D perception dataset and benchmark for holistic 3D scene understanding. It +encompasses over 5k scans encapsulating 1M ego-centric RGB-D views, 1M language +prompts, 160k 3D-oriented boxes spanning over 760 categories, some of which +partially align with LVIS, and dense semantic occupancy with 80 common +categories. Building upon this database, we introduce a baseline framework +named Embodied Perceptron. It is capable of processing an arbitrary number of +multi-modal inputs and demonstrates remarkable 3D perception capabilities, both +within the two series of benchmarks we set up, i.e., fundamental 3D perception +tasks and language-grounded tasks, and in the wild. Codes, datasets, and +benchmarks will be available at https://github.com/OpenRobotLab/EmbodiedScan.",cs.CV,"['cs.CV', 'cs.AI', 'cs.RO']" +HouseCat6D - A Large-Scale Multi-Modal Category Level 6D Object Perception Dataset with Household Objects in Realistic Scenarios,HyunJun Jung · Shun-Cheng Wu · Patrick Ruhkamp · Guangyao Zhai · Hannah Schieber · Giulia Rizzoli · Pengyuan Wang · Hongcheng Zhao · Lorenzo Garattoni · Sven Meier · Daniel Roth · Nassir Navab · Benjamin Busam,https://sites.google.com/view/housecat6d,https://ar5iv.labs.arxiv.org/html/2308.10627,,2308.10627.pdf,Polarimetric Information for Multi-Modal 6D Pose Estimation of Photometrically Challenging Objects with Limited Data,"6D pose estimation pipelines that rely on RGB-only or RGB-D data show +limitations for photometrically challenging objects with e.g. textureless +surfaces, reflections or transparency. A supervised learning-based method +utilising complementary polarisation information as input modality is proposed +to overcome such limitations. This supervised approach is then extended to a +self-supervised paradigm by leveraging physical characteristics of polarised +light, thus eliminating the need for annotated real data. 
The methods achieve +significant advancements in pose estimation by leveraging geometric information +from polarised light and incorporating shape priors and invertible physical +constraints.",cs.CV,['cs.CV'] +SCINeRF: Neural Radiance Fields from a Snapshot Compressive Image,Yunhao Li · Xiaodong Wang · Ping Wang · Xin Yuan · Peidong Liu, ,https://arxiv.org/abs/2403.20018,,2403.20018.pdf,SCINeRF: Neural Radiance Fields from a Snapshot Compressive Image,"In this paper, we explore the potential of Snapshot Compressive Imaging (SCI) +technique for recovering the underlying 3D scene representation from a single +temporal compressed image. SCI is a cost-effective method that enables the +recording of high-dimensional data, such as hyperspectral or temporal +information, into a single image using low-cost 2D imaging sensors. To achieve +this, a series of specially designed 2D masks are usually employed, which not +only reduces storage requirements but also offers potential privacy protection. +Inspired by this, to take one step further, our approach builds upon the +powerful 3D scene representation capabilities of neural radiance fields (NeRF). +Specifically, we formulate the physical imaging process of SCI as part of the +training of NeRF, allowing us to exploit its impressive performance in +capturing complex scene structures. To assess the effectiveness of our method, +we conduct extensive evaluations using both synthetic data and real data +captured by our SCI system. Extensive experimental results demonstrate that our +proposed approach surpasses the state-of-the-art methods in terms of image +reconstruction and novel view image synthesis. Moreover, our method also +exhibits the ability to restore high frame-rate multi-view consistent images by +leveraging SCI and the rendering capabilities of NeRF. The code is available at +https://github.com/WU-CVGL/SCINeRF.",eess.IV,"['eess.IV', 'cs.CV']" +Source-Free Domain Adaptation with Frozen Multimodal Foundation Model,Song Tang · Wenxin Su · Mao Ye · Xiatian Zhu,https://www.taulab.cc/proj/sfda/cvpr24/difo/index.html,https://arxiv.org/abs/2311.16510,,2311.16510.pdf,Source-Free Domain Adaptation with Frozen Multimodal Foundation Model,"Source-Free Domain Adaptation (SFDA) aims to adapt a source model for a +target domain, with only access to unlabeled target training data and the +source model pre-trained on a supervised source domain. Relying on pseudo +labeling and/or auxiliary supervision, conventional methods are inevitably +error-prone. To mitigate this limitation, in this work we for the first time +explore the potentials of off-the-shelf vision-language (ViL) multimodal models +(e.g.,CLIP) with rich whilst heterogeneous knowledge. We find that directly +applying the ViL model to the target domain in a zero-shot fashion is +unsatisfactory, as it is not specialized for this particular task but largely +generic. To make it task specific, we propose a novel Distilling multimodal +Foundation model(DIFO)approach. Specifically, DIFO alternates between two steps +during adaptation: (i) Customizing the ViL model by maximizing the mutual +information with the target model in a prompt learning manner, (ii) Distilling +the knowledge of this customized ViL model to the target model. For more +fine-grained and reliable distillation, we further introduce two effective +regularization terms, namely most-likely category encouragement and predictive +consistency. Extensive experiments show that DIFO significantly outperforms the +state-of-the-art alternatives. 
Code is here",cs.CV,['cs.CV'] +InstructVideo: Instructing Video Diffusion Models with Human Feedback,Hangjie Yuan · Shiwei Zhang · Xiang Wang · Yujie Wei · Tao Feng · Yining Pan · Yingya Zhang · Ziwei Liu · Samuel Albanie · Dong Ni, ,https://arxiv.org/abs/2312.12490,,2312.12490.pdf,InstructVideo: Instructing Video Diffusion Models with Human Feedback,"Diffusion models have emerged as the de facto paradigm for video generation. +However, their reliance on web-scale data of varied quality often yields +results that are visually unappealing and misaligned with the textual prompts. +To tackle this problem, we propose InstructVideo to instruct text-to-video +diffusion models with human feedback by reward fine-tuning. InstructVideo has +two key ingredients: 1) To ameliorate the cost of reward fine-tuning induced by +generating through the full DDIM sampling chain, we recast reward fine-tuning +as editing. By leveraging the diffusion process to corrupt a sampled video, +InstructVideo requires only partial inference of the DDIM sampling chain, +reducing fine-tuning cost while improving fine-tuning efficiency. 2) To +mitigate the absence of a dedicated video reward model for human preferences, +we repurpose established image reward models, e.g., HPSv2. To this end, we +propose Segmental Video Reward, a mechanism to provide reward signals based on +segmental sparse sampling, and Temporally Attenuated Reward, a method that +mitigates temporal modeling degradation during fine-tuning. Extensive +experiments, both qualitative and quantitative, validate the practicality and +efficacy of using image reward models in InstructVideo, significantly enhancing +the visual quality of generated videos without compromising generalization +capabilities. Code and models will be made publicly available.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG', 'cs.MM']" +Stable Neighbor Denoising for Source-free Domain Adaptive Segmentation.,Dong Zhao · Shuang Wang · Qi Zang · Licheng Jiao · Nicu Sebe · Zhun Zhong, ,,,,,,,nan +FREE: Faster and Better Data-Free Meta-Learning,Yongxian Wei · Zixuan Hu · Zhenyi Wang · Li Shen · Chun Yuan · Dacheng Tao, ,https://arxiv.org/abs/2405.00984,,2405.00984.pdf,FREE: Faster and Better Data-Free Meta-Learning,"Data-Free Meta-Learning (DFML) aims to extract knowledge from a collection of +pre-trained models without requiring the original data, presenting practical +benefits in contexts constrained by data privacy concerns. Current DFML methods +primarily focus on the data recovery from these pre-trained models. However, +they suffer from slow recovery speed and overlook gaps inherent in +heterogeneous pre-trained models. In response to these challenges, we introduce +the Faster and Better Data-Free Meta-Learning (FREE) framework, which contains: +(i) a meta-generator for rapidly recovering training tasks from pre-trained +models; and (ii) a meta-learner for generalizing to new unseen tasks. +Specifically, within the module Faster Inversion via Meta-Generator, each +pre-trained model is perceived as a distinct task. The meta-generator can +rapidly adapt to a specific task in just five steps, significantly accelerating +the data recovery. Furthermore, we propose Better Generalization via +Meta-Learner and introduce an implicit gradient alignment algorithm to optimize +the meta-learner. This is achieved as aligned gradient directions alleviate +potential conflicts among tasks from heterogeneous pre-trained models. 
+Empirical experiments on multiple benchmarks affirm the superiority of our +approach, marking a notable speed-up (20$\times$) and performance enhancement +(1.42\% $\sim$ 4.78\%) in comparison to the state-of-the-art.",cs.LG,"['cs.LG', 'cs.CV']" +HiKER-SGG: Hierarchical Knowledge Enhanced Robust Scene Graph Generation,Ce Zhang · Simon Stepputtis · Joseph Campbell · Katia Sycara · Yaqi Xie,https://zhangce01.github.io/HiKER-SGG/,https://arxiv.org/abs/2403.12033,,2403.12033.pdf,HiKER-SGG: Hierarchical Knowledge Enhanced Robust Scene Graph Generation,"Being able to understand visual scenes is a precursor for many downstream +tasks, including autonomous driving, robotics, and other vision-based +approaches. A common approach enabling the ability to reason over visual data +is Scene Graph Generation (SGG); however, many existing approaches assume +undisturbed vision, i.e., the absence of real-world corruptions such as fog, +snow, smoke, as well as non-uniform perturbations like sun glare or water +drops. In this work, we propose a novel SGG benchmark containing procedurally +generated weather corruptions and other transformations over the Visual Genome +dataset. Further, we introduce a corresponding approach, Hierarchical Knowledge +Enhanced Robust Scene Graph Generation (HiKER-SGG), providing a strong baseline +for scene graph generation under such challenging setting. At its core, +HiKER-SGG utilizes a hierarchical knowledge graph in order to refine its +predictions from coarse initial estimates to detailed predictions. In our +extensive experiments, we show that HiKER-SGG does not only demonstrate +superior performance on corrupted images in a zero-shot manner, but also +outperforms current state-of-the-art methods on uncorrupted SGG tasks. Code is +available at https://github.com/zhangce01/HiKER-SGG.",cs.CV,['cs.CV'] +Enhancing Visual Document Understanding with Contrastive Learning in Large Visual-Language Models,Xin Li · Yunfei Wu · Xinghua Jiang · ZhiHao Guo · Mingming Gong · Haoyu Cao · Yinsong Liu · Deqiang Jiang · Xing Sun, ,https://arxiv.org/abs/2402.19014,,2402.19014.pdf,Enhancing Visual Document Understanding with Contrastive Learning in Large Visual-Language Models,"Recently, the advent of Large Visual-Language Models (LVLMs) has received +increasing attention across various domains, particularly in the field of +visual document understanding (VDU). Different from conventional +vision-language tasks, VDU is specifically concerned with text-rich scenarios +containing abundant document elements. Nevertheless, the importance of +fine-grained features remains largely unexplored within the community of LVLMs, +leading to suboptimal performance in text-rich scenarios. In this paper, we +abbreviate it as the fine-grained feature collapse issue. With the aim of +filling this gap, we propose a contrastive learning framework, termed Document +Object COntrastive learning (DoCo), specifically tailored for the downstream +tasks of VDU. DoCo leverages an auxiliary multimodal encoder to obtain the +features of document objects and align them to the visual features generated by +the vision encoder of LVLM, which enhances visual representation in text-rich +scenarios. It can represent that the contrastive learning between the visual +holistic representations and the multimodal fine-grained features of document +objects can assist the vision encoder in acquiring more effective visual cues, +thereby enhancing the comprehension of text-rich documents in LVLMs. 
We also +demonstrate that the proposed DoCo serves as a plug-and-play pre-training +method, which can be employed in the pre-training of various LVLMs without +inducing any increase in computational complexity during the inference process. +Extensive experimental results on multiple benchmarks of VDU reveal that LVLMs +equipped with our proposed DoCo can achieve superior performance and mitigate +the gap between VDU and generic vision-language tasks.",cs.CV,['cs.CV'] +PIE-NeRF: Physics-based Interactive Elastodynamics with NeRF,Yutao Feng · Yintong Shang · Xuan Li · Tianjia Shao · Chenfanfu Jiang · Yin Yang,https://fytalon.github.io/pienerf/,https://arxiv.org/abs/2311.13099,,2311.13099.pdf,PIE-NeRF: Physics-based Interactive Elastodynamics with NeRF,"We show that physics-based simulations can be seamlessly integrated with NeRF +to generate high-quality elastodynamics of real-world objects. Unlike existing +methods, we discretize nonlinear hyperelasticity in a meshless way, obviating +the necessity for intermediate auxiliary shape proxies like a tetrahedral mesh +or voxel grid. A quadratic generalized moving least square (Q-GMLS) is employed +to capture nonlinear dynamics and large deformation on the implicit model. Such +meshless integration enables versatile simulations of complex and codimensional +shapes. We adaptively place the least-square kernels according to the NeRF +density field to significantly reduce the complexity of the nonlinear +simulation. As a result, physically realistic animations can be conveniently +synthesized using our method for a wide range of hyperelastic materials at an +interactive rate. For more information, please visit our project page at +https://fytalon.github.io/pienerf/.",cs.CV,"['cs.CV', 'cs.AI', 'cs.GR', 'cs.LG']" +DreamPropeller: Supercharge Text-to-3D Generation with Parallel Sampling,Linqi Zhou · Andy Shih · Chenlin Meng · Stefano Ermon, ,https://arxiv.org/abs/2311.17082,,2311.17082.pdf,DreamPropeller: Supercharge Text-to-3D Generation with Parallel Sampling,"Recent methods such as Score Distillation Sampling (SDS) and Variational +Score Distillation (VSD) using 2D diffusion models for text-to-3D generation +have demonstrated impressive generation quality. However, the long generation +time of such algorithms significantly degrades the user experience. To tackle +this problem, we propose DreamPropeller, a drop-in acceleration algorithm that +can be wrapped around any existing text-to-3D generation pipeline based on +score distillation. Our framework generalizes Picard iterations, a classical +algorithm for parallel sampling an ODE path, and can account for non-ODE paths +such as momentum-based gradient updates and changes in dimensions during the +optimization process as in many cases of 3D generation. We show that our +algorithm trades parallel compute for wallclock time and empirically achieves +up to 4.7x speedup with a negligible drop in generation quality for all tested +frameworks.",cs.CV,"['cs.CV', 'stat.ML']" +RepViT: Revisiting Mobile CNN From ViT Perspective,Ao Wang · Hui Chen · Zijia Lin · Jungong Han · Guiguang Ding,https://github.com/THU-MIG/RepViT,https://arxiv.org/abs/2307.09283,,2307.09283.pdf,RepViT: Revisiting Mobile CNN From ViT Perspective,"Recently, lightweight Vision Transformers (ViTs) demonstrate superior +performance and lower latency, compared with lightweight Convolutional Neural +Networks (CNNs), on resource-constrained mobile devices. 
Researchers have +discovered many structural connections between lightweight ViTs and lightweight +CNNs. However, the notable architectural disparities in the block structure, +macro, and micro designs between them have not been adequately examined. In +this study, we revisit the efficient design of lightweight CNNs from ViT +perspective and emphasize their promising prospect for mobile devices. +Specifically, we incrementally enhance the mobile-friendliness of a standard +lightweight CNN, \ie, MobileNetV3, by integrating the efficient architectural +designs of lightweight ViTs. This ends up with a new family of pure lightweight +CNNs, namely RepViT. Extensive experiments show that RepViT outperforms +existing state-of-the-art lightweight ViTs and exhibits favorable latency in +various vision tasks. Notably, on ImageNet, RepViT achieves over 80\% top-1 +accuracy with 1.0 ms latency on an iPhone 12, which is the first time for a +lightweight model, to the best of our knowledge. Besides, when RepViT meets +SAM, our RepViT-SAM can achieve nearly 10$\times$ faster inference than the +advanced MobileSAM. Codes and models are available at +\url{https://github.com/THU-MIG/RepViT}.",cs.CV,['cs.CV'] +Neural Video Compression with Feature Modulation,Jiahao Li · Bin Li · Yan Lu, ,https://arxiv.org/abs/2402.17414v1,,2402.17414v1.pdf,Neural Video Compression with Feature Modulation,"The emerging conditional coding-based neural video codec (NVC) shows +superiority over commonly-used residual coding-based codec and the latest NVC +already claims to outperform the best traditional codec. However, there still +exist critical problems blocking the practicality of NVC. In this paper, we +propose a powerful conditional coding-based NVC that solves two critical +problems via feature modulation. The first is how to support a wide quality +range in a single model. Previous NVC with this capability only supports about +3.8 dB PSNR range on average. To tackle this limitation, we modulate the latent +feature of the current frame via the learnable quantization scaler. During the +training, we specially design the uniform quantization parameter sampling +mechanism to improve the harmonization of encoding and quantization. This +results in a better learning of the quantization scaler and helps our NVC +support about 11.4 dB PSNR range. The second is how to make NVC still work +under a long prediction chain. We expose that the previous SOTA NVC has an +obvious quality degradation problem when using a large intra-period setting. To +this end, we propose modulating the temporal feature with a periodically +refreshing mechanism to boost the quality. %Besides solving the above two +problems, we also design a single model that can support both RGB and YUV +colorspaces. Notably, under single intra-frame setting, our codec can achieve +29.7\% bitrate saving over previous SOTA NVC with 16\% MACs reduction. Our +codec serves as a notable landmark in the journey of NVC evolution. The codes +are at https://github.com/microsoft/DCVC.",cs.CV,"['cs.CV', 'eess.IV']" +Meta-Point Learning and Refining for Category-Agnostic Pose Estimation,Junjie Chen · Jiebin Yan · Yuming Fang · Li Niu, ,https://arxiv.org/abs/2403.13647,,2403.13647.pdf,Meta-Point Learning and Refining for Category-Agnostic Pose Estimation,"Category-agnostic pose estimation (CAPE) aims to predict keypoints for +arbitrary classes given a few support images annotated with keypoints. 
Existing +methods only rely on the features extracted at support keypoints to predict or +refine the keypoints on query image, but a few support feature vectors are +local and inadequate for CAPE. Considering that human can quickly perceive +potential keypoints of arbitrary objects, we propose a novel framework for CAPE +based on such potential keypoints (named as meta-points). Specifically, we +maintain learnable embeddings to capture inherent information of various +keypoints, which interact with image feature maps to produce meta-points +without any support. The produced meta-points could serve as meaningful +potential keypoints for CAPE. Due to the inevitable gap between inherency and +annotation, we finally utilize the identities and details offered by support +keypoints to assign and refine meta-points to desired keypoints in query image. +In addition, we propose a progressive deformable point decoder and a slacked +regression loss for better prediction and supervision. Our novel framework not +only reveals the inherency of keypoints but also outperforms existing methods +of CAPE. Comprehensive experiments and in-depth studies on large-scale MP-100 +dataset demonstrate the effectiveness of our framework.",cs.CV,['cs.CV'] +Real-World Efficient Blind Motion Deblurring via Blur Pixel Discretization,Insoo Kim · Jae Seok Choi · Geonseok Seo · Kinam Kwon · Jinwoo Shin · Hyong-Euk Lee, ,https://arxiv.org/abs/2404.12168,,2404.12168.pdf,Real-World Efficient Blind Motion Deblurring via Blur Pixel Discretization,"As recent advances in mobile camera technology have enabled the capability to +capture high-resolution images, such as 4K images, the demand for an efficient +deblurring model handling large motion has increased. In this paper, we +discover that the image residual errors, i.e., blur-sharp pixel differences, +can be grouped into some categories according to their motion blur type and how +complex their neighboring pixels are. Inspired by this, we decompose the +deblurring (regression) task into blur pixel discretization (pixel-level blur +classification) and discrete-to-continuous conversion (regression with blur +class map) tasks. Specifically, we generate the discretized image residual +errors by identifying the blur pixels and then transform them to a continuous +form, which is computationally more efficient than naively solving the original +regression problem with continuous values. Here, we found that the +discretization result, i.e., blur segmentation map, remarkably exhibits visual +similarity with the image residual errors. As a result, our efficient model +shows comparable performance to state-of-the-art methods in realistic +benchmarks, while our method is up to 10 times computationally more efficient.",cs.CV,"['cs.CV', 'cs.AI']" +Video Super-Resolution Transformer with Masked Inter&Intra-Frame Attention,Xingyu Zhou · Leheng Zhang · Xiaorui Zhao · Keze Wang · Leida Li · Shuhang Gu, ,https://arxiv.org/abs/2401.06312,,2401.06312.pdf,Video Super-Resolution Transformer with Masked Inter&Intra-Frame Attention,"Recently, Vision Transformer has achieved great success in recovering missing +details in low-resolution sequences, i.e., the video super-resolution (VSR) +task. Despite its superiority in VSR accuracy, the heavy computational burden +as well as the large memory footprint hinder the deployment of +Transformer-based VSR models on constrained devices. 
In this paper, we address +the above issue by proposing a novel feature-level masked processing framework: +VSR with Masked Intra and inter frame Attention (MIA-VSR). The core of MIA-VSR +is leveraging feature-level temporal continuity between adjacent frames to +reduce redundant computations and make more rational use of previously enhanced +SR features. Concretely, we propose an intra-frame and inter-frame attention +block which takes the respective roles of past features and input features into +consideration and only exploits previously enhanced features to provide +supplementary information. In addition, an adaptive block-wise mask prediction +module is developed to skip unimportant computations according to feature +similarity between adjacent frames. We conduct detailed ablation studies to +validate our contributions and compare the proposed method with recent +state-of-the-art VSR approaches. The experimental results demonstrate that +MIA-VSR improves the memory and computation efficiency over state-of-the-art +methods, without trading off PSNR accuracy. The code is available at +https://github.com/LabShuHangGU/MIA-VSR.",cs.CV,['cs.CV'] +Any-Shift Prompting for Generalization over Distributions,Zehao Xiao · Jiayi Shen · Mohammad Mahdi Derakhshani · Shengcai Liao · Cees G. M. Snoek, ,https://arxiv.org/abs/2402.10099,,2402.10099.pdf,Any-Shift Prompting for Generalization over Distributions,"Image-language models with prompt learning have shown remarkable advances in +numerous downstream vision tasks. Nevertheless, conventional prompt learning +methods overfit their training distribution and lose the generalization ability +on test distributions. To improve generalization across various distribution +shifts, we propose any-shift prompting: a general probabilistic inference +framework that considers the relationship between training and test +distributions during prompt learning. We explicitly connect training and test +distributions in the latent space by constructing training and test prompts in +a hierarchical architecture. Within this framework, the test prompt exploits +the distribution relationships to guide the generalization of the CLIP +image-language model from training to any test distribution. To effectively +encode the distribution information and their relationships, we further +introduce a transformer inference network with a pseudo-shift training +mechanism. The network generates the tailored test prompt with both training +and test information in a feedforward pass, avoiding extra training costs at +test time. Extensive experiments on twenty-three datasets demonstrate the +effectiveness of any-shift prompting on the generalization over various +distribution shifts.",cs.CV,['cs.CV'] +Mosaic-SDF for 3D Generative Models,Lior Yariv · Omri Puny · Oran Gafni · Yaron Lipman,https://lioryariv.github.io/msdf/,https://arxiv.org/abs/2312.09222,,2312.09222.pdf,Mosaic-SDF for 3D Generative Models,"Current diffusion or flow-based generative models for 3D shapes divide to +two: distilling pre-trained 2D image diffusion models, and training directly on +3D shapes. When training a diffusion or flow models on 3D shapes a crucial +design choice is the shape representation. 
An effective shape representation +needs to adhere three design principles: it should allow an efficient +conversion of large 3D datasets to the representation form; it should provide a +good tradeoff of approximation power versus number of parameters; and it should +have a simple tensorial form that is compatible with existing powerful neural +architectures. While standard 3D shape representations such as volumetric grids +and point clouds do not adhere to all these principles simultaneously, we +advocate in this paper a new representation that does. We introduce Mosaic-SDF +(M-SDF): a simple 3D shape representation that approximates the Signed Distance +Function (SDF) of a given shape by using a set of local grids spread near the +shape's boundary. The M-SDF representation is fast to compute for each shape +individually making it readily parallelizable; it is parameter efficient as it +only covers the space around the shape's boundary; and it has a simple matrix +form, compatible with Transformer-based architectures. We demonstrate the +efficacy of the M-SDF representation by using it to train a 3D generative flow +model including class-conditioned generation with the 3D Warehouse dataset, and +text-to-3D generation using a dataset of about 600k caption-shape pairs.",cs.CV,"['cs.CV', 'cs.GR']" +Fourier-basis functions to bridge augmentation gap: Rethinking frequency augmentation in image classification,Mei Vaish · Shunxin Wang · Nicola Strisciuglio,https://github.com/nis-research/afa-augment,https://arxiv.org/abs/2403.01944,,2403.01944.pdf,Fourier-basis Functions to Bridge Augmentation Gap: Rethinking Frequency Augmentation in Image Classification,"Computer vision models normally witness degraded performance when deployed in +real-world scenarios, due to unexpected changes in inputs that were not +accounted for during training. Data augmentation is commonly used to address +this issue, as it aims to increase data variety and reduce the distribution gap +between training and test data. However, common visual augmentations might not +guarantee extensive robustness of computer vision models. In this paper, we +propose Auxiliary Fourier-basis Augmentation (AFA), a complementary technique +targeting augmentation in the frequency domain and filling the augmentation gap +left by visual augmentations. We demonstrate the utility of augmentation via +Fourier-basis additive noise in a straightforward and efficient adversarial +setting. Our results show that AFA benefits the robustness of models against +common corruptions, OOD generalization, and consistency of performance of +models against increasing perturbations, with negligible deficit to the +standard performance of models. It can be seamlessly integrated with other +augmentation techniques to further boost performance. Code and models can be +found at: https://github.com/nis-research/afa-augment",cs.CV,"['cs.CV', 'cs.LG']" +CDMAD: Class-Distribution-Mismatch-Aware Debiasing for Class-Imbalanced Semi-Supervised Learning,Hyuck Lee · Heeyoung Kim, ,https://arxiv.org/abs/2403.10391,,2403.10391.pdf,CDMAD: Class-Distribution-Mismatch-Aware Debiasing for Class-Imbalanced Semi-Supervised Learning,"Pseudo-label-based semi-supervised learning (SSL) algorithms trained on a +class-imbalanced set face two cascading challenges: 1) Classifiers tend to be +biased towards majority classes, and 2) Biased pseudo-labels are used for +training. 
It is difficult to appropriately re-balance the classifiers in SSL +because the class distribution of an unlabeled set is often unknown and could +be mismatched with that of a labeled set. We propose a novel class-imbalanced +SSL algorithm called class-distribution-mismatch-aware debiasing (CDMAD). For +each iteration of training, CDMAD first assesses the classifier's biased degree +towards each class by calculating the logits on an image without any patterns +(e.g., solid color image), which can be considered irrelevant to the training +set. CDMAD then refines biased pseudo-labels of the base SSL algorithm by +ensuring the classifier's neutrality. CDMAD uses these refined pseudo-labels +during the training of the base SSL algorithm to improve the quality of the +representations. In the test phase, CDMAD similarly refines biased class +predictions on test samples. CDMAD can be seen as an extension of post-hoc +logit adjustment to address a challenge of incorporating the unknown class +distribution of the unlabeled set for re-balancing the biased classifier under +class distribution mismatch. CDMAD ensures Fisher consistency for the balanced +error. Extensive experiments verify the effectiveness of CDMAD.",cs.CV,['cs.CV'] +LoS: Local Structure Guided Stereo Matching,Kunhong Li · Longguang Wang · Ye Zhang · Kaiwen Xue · Shunbo Zhou · Yulan Guo, ,https://ar5iv.labs.arxiv.org/html/2309.16992,,2309.16992.pdf,Segment Anything Model is a Good Teacher for Local Feature Learning,"Local feature detection and description play an important role in many +computer vision tasks, which are designed to detect and describe keypoints in +""any scene"" and ""any downstream task"". Data-driven local feature learning +methods need to rely on pixel-level correspondence for training, which is +challenging to acquire at scale, thus hindering further improvements in +performance. In this paper, we propose SAMFeat to introduce SAM (segment +anything model), a fundamental model trained on 11 million images, as a teacher +to guide local feature learning and thus inspire higher performance on limited +datasets. To do so, first, we construct an auxiliary task of Pixel Semantic +Relational Distillation (PSRD), which distillates feature relations with +category-agnostic semantic information learned by the SAM encoder into a local +feature learning network, to improve local feature description using semantic +discrimination. Second, we develop a technique called Weakly Supervised +Contrastive Learning Based on Semantic Grouping (WSC), which utilizes semantic +groupings derived from SAM as weakly supervised signals, to optimize the metric +space of local descriptors. Third, we design an Edge Attention Guidance (EAG) +to further improve the accuracy of local feature detection and description by +prompting the network to pay more attention to the edge region guided by SAM. +SAMFeat's performance on various tasks such as image matching on HPatches, and +long-term visual localization on Aachen Day-Night showcases its superiority +over previous local features. 
The release code is available at +https://github.com/vignywang/SAMFeat.",cs.CV,"['cs.CV', 'cs.LG']" +Rethinking the Representation in Federated Unsupervised Learning with Non-IID Data,Xinting Liao · Weiming Liu · Chaochao Chen · Pengyang Zhou · Fengyuan Yu · Huabin Zhu · Binhui Yao · Tao Wang · Xiaolin Zheng · Yanchao Tan, ,https://arxiv.org/abs/2403.16398,,2403.16398.pdf,Rethinking the Representation in Federated Unsupervised Learning with Non-IID Data,"Federated learning achieves effective performance in modeling decentralized +data. In practice, client data are not well-labeled, which makes it potential +for federated unsupervised learning (FUSL) with non-IID data. However, the +performance of existing FUSL methods suffers from insufficient representations, +i.e., (1) representation collapse entanglement among local and global models, +and (2) inconsistent representation spaces among local models. The former +indicates that representation collapse in local model will subsequently impact +the global model and other local models. The latter means that clients model +data representation with inconsistent parameters due to the deficiency of +supervision signals. In this work, we propose FedU2 which enhances generating +uniform and unified representation in FUSL with non-IID data. Specifically, +FedU2 consists of flexible uniform regularizer (FUR) and efficient unified +aggregator (EUA). FUR in each client avoids representation collapse via +dispersing samples uniformly, and EUA in server promotes unified representation +by constraining consistent client model updating. To extensively validate the +performance of FedU2, we conduct both cross-device and cross-silo evaluation +experiments on two benchmark datasets, i.e., CIFAR10 and CIFAR100.",cs.LG,"['cs.LG', 'cs.AI']" +Towards Detailed and Robust 3D Clothed Human Reconstruction with High-Frequency and Low-Frequency Information of Parametric Body Models,Yifan Yang · Dong Liu · Shuhai Zhang · Zeshuai Deng · Zixiong Huang · Mingkui Tan, ,https://arxiv.org/abs/2404.04876,,2404.04876.pdf,HiLo: Detailed and Robust 3D Clothed Human Reconstruction with High-and Low-Frequency Information of Parametric Models,"Reconstructing 3D clothed human involves creating a detailed geometry of +individuals in clothing, with applications ranging from virtual try-on, movies, +to games. To enable practical and widespread applications, recent advances +propose to generate a clothed human from an RGB image. However, they struggle +to reconstruct detailed and robust avatars simultaneously. We empirically find +that the high-frequency (HF) and low-frequency (LF) information from a +parametric model has the potential to enhance geometry details and improve +robustness to noise, respectively. Based on this, we propose HiLo, namely +clothed human reconstruction with high- and low-frequency information, which +contains two components. 1) To recover detailed geometry using HF information, +we propose a progressive HF Signed Distance Function to enhance the detailed 3D +geometry of a clothed human. We analyze that our progressive learning manner +alleviates large gradients that hinder model convergence. 2) To achieve robust +reconstruction against inaccurate estimation of the parametric model by using +LF information, we propose a spatial interaction implicit function. This +function effectively exploits the complementary spatial information from a +low-resolution voxel grid of the parametric model. 
Experimental results +demonstrate that HiLo outperforms the state-of-the-art methods by 10.43% and +9.54% in terms of Chamfer distance on the Thuman2.0 and CAPE datasets, +respectively. Additionally, HiLo demonstrates robustness to noise from the +parametric model, challenging poses, and various clothing styles.",cs.CV,['cs.CV'] +MS-DETR: Efficient DETR Training with Mixed Supervision,Chuyang Zhao · Yifan Sun · Wenhao Wang · Qiang Chen · Errui Ding · Yi Yang · Jingdong Wang,https://github.com/Atten4Vis/MS-DETR,https://arxiv.org/abs/2401.03989,,2401.03989.pdf,MS-DETR: Efficient DETR Training with Mixed Supervision,"DETR accomplishes end-to-end object detection through iteratively generating +multiple object candidates based on image features and promoting one candidate +for each ground-truth object. The traditional training procedure using +one-to-one supervision in the original DETR lacks direct supervision for the +object detection candidates. + We aim at improving the DETR training efficiency by explicitly supervising +the candidate generation procedure through mixing one-to-one supervision and +one-to-many supervision. Our approach, namely MS-DETR, is simple, and places +one-to-many supervision to the object queries of the primary decoder that is +used for inference. In comparison to existing DETR variants with one-to-many +supervision, such as Group DETR and Hybrid DETR, our approach does not need +additional decoder branches or object queries. The object queries of the +primary decoder in our approach directly benefit from one-to-many supervision +and thus are superior in object candidate prediction. Experimental results show +that our approach outperforms related DETR variants, such as DN-DETR, Hybrid +DETR, and Group DETR, and the combination with related DETR variants further +improves the performance.",cs.CV,['cs.CV'] +Dual Memory Networks: A Versatile Adaptation Approach for Vision-Language Models,Yabin Zhang · Wenjie Zhu · Hui Tang · Zhiyuan Ma · Kaiyang Zhou · Lei Zhang,https://github.com/YBZh/DMN,https://arxiv.org/abs/2403.17589,,2403.17589.pdf,Dual Memory Networks: A Versatile Adaptation Approach for Vision-Language Models,"With the emergence of pre-trained vision-language models like CLIP, how to +adapt them to various downstream classification tasks has garnered significant +attention in recent research. The adaptation strategies can be typically +categorized into three paradigms: zero-shot adaptation, few-shot adaptation, +and the recently-proposed training-free few-shot adaptation. Most existing +approaches are tailored for a specific setting and can only cater to one or two +of these paradigms. In this paper, we introduce a versatile adaptation approach +that can effectively work under all three settings. Specifically, we propose +the dual memory networks that comprise dynamic and static memory components. +The static memory caches training data knowledge, enabling training-free +few-shot adaptation, while the dynamic memory preserves historical test +features online during the testing process, allowing for the exploration of +additional data insights beyond the training set. This novel capability +enhances model performance in the few-shot setting and enables model usability +in the absence of training data. The two memory networks employ the same +flexible memory interactive strategy, which can operate in a training-free mode +and can be further enhanced by incorporating learnable projection layers. 
Our
+approach is tested across 11 datasets under the three task settings.
+Remarkably, in the zero-shot scenario, it outperforms existing methods by over
+3% and even shows superior results against methods utilizing external training
+data. Additionally, our method exhibits robust performance against natural
+distribution shifts. Codes are available at https://github.com/YBZh/DMN.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG', 'cs.MM']"
+Fully Convolutional Slice-to-Volume Reconstruction for Single-Stack MRI,Sean I. Young · Yaël Balbastre · Bruce Fischl · Polina Golland · Juan Iglesias, ,https://arxiv.org/abs/2312.03102,,2312.03102.pdf,Fully Convolutional Slice-to-Volume Reconstruction for Single-Stack MRI,"In magnetic resonance imaging (MRI), slice-to-volume reconstruction (SVR)
+refers to computational reconstruction of an unknown 3D magnetic resonance
+volume from stacks of 2D slices corrupted by motion. While promising, current
+SVR methods require multiple slice stacks for accurate 3D reconstruction,
+leading to long scans and limiting their use in time-sensitive applications
+such as fetal fMRI. Here, we propose a SVR method that overcomes the
+shortcomings of previous work and produces state-of-the-art reconstructions in
+the presence of extreme inter-slice motion. Inspired by the recent success of
+single-view depth estimation methods, we formulate SVR as a single-stack motion
+estimation task and train a fully convolutional network to predict a motion
+stack for a given slice stack, producing a 3D reconstruction as a byproduct of
+the predicted motion. Extensive experiments on the SVR of adult and fetal
+brains demonstrate that our fully convolutional method is twice as accurate as
+previous SVR methods. Our code is available at github.com/seannz/svr.",eess.IV,"['eess.IV', 'cs.CV', 'cs.LG']"
+Text-to-Image Diffusion Models are Great Sketch-Photo Matchmakers,Subhadeep Koley · Ayan Kumar Bhunia · Aneeshan Sain · Pinaki Nath Chowdhury · Tao Xiang · Yi-Zhe Song,https://subhadeepkoley.github.io/DiffusionZSSBIR/,https://arxiv.org/abs/2403.07214,,2403.07214.pdf,Text-to-Image Diffusion Models are Great Sketch-Photo Matchmakers,"This paper, for the first time, explores text-to-image diffusion models for
+Zero-Shot Sketch-based Image Retrieval (ZS-SBIR). We highlight a pivotal
+discovery: the capacity of text-to-image diffusion models to seamlessly bridge
+the gap between sketches and photos. This proficiency is underpinned by their
+robust cross-modal capabilities and shape bias, findings that are substantiated
+through our pilot studies. In order to harness pre-trained diffusion models
+effectively, we introduce a straightforward yet powerful strategy focused on
+two key aspects: selecting optimal feature layers and utilising visual and
+textual prompts. For the former, we identify which layers are most enriched
+with information and are best suited for the specific retrieval requirements
+(category-level or fine-grained). Then we employ visual and textual prompts to
+guide the model's feature extraction process, enabling it to generate more
+discriminative and contextually relevant cross-modal representations. 
Extensive +experiments on several benchmark datasets validate significant performance +improvements.",cs.CV,['cs.CV'] +Enhance Image Classification Via Inter-Class Image Mixup With Diffusion Model,Zhicai Wang · Longhui Wei · Tan Wang · Heyu Chen · Yanbin Hao · Xiang Wang · Xiangnan He · Qi Tian, ,https://arxiv.org/abs/2403.19600,,2403.19600.pdf,Enhance Image Classification via Inter-Class Image Mixup with Diffusion Model,"Text-to-image (T2I) generative models have recently emerged as a powerful +tool, enabling the creation of photo-realistic images and giving rise to a +multitude of applications. However, the effective integration of T2I models +into fundamental image classification tasks remains an open question. A +prevalent strategy to bolster image classification performance is through +augmenting the training set with synthetic images generated by T2I models. In +this study, we scrutinize the shortcomings of both current generative and +conventional data augmentation techniques. Our analysis reveals that these +methods struggle to produce images that are both faithful (in terms of +foreground objects) and diverse (in terms of background contexts) for +domain-specific concepts. To tackle this challenge, we introduce an innovative +inter-class data augmentation method known as Diff-Mix +(https://github.com/Zhicaiwww/Diff-Mix), which enriches the dataset by +performing image translations between classes. Our empirical results +demonstrate that Diff-Mix achieves a better balance between faithfulness and +diversity, leading to a marked improvement in performance across diverse image +classification scenarios, including few-shot, conventional, and long-tail +classifications for domain-specific datasets.",cs.CV,['cs.CV'] +Outdoor Scene Extrapolation with Hierarchical Generative Cellular Automata,Dongsu Zhang · Francis Williams · Žan Gojčič · Karsten Kreis · Sanja Fidler · Young Min Kim · Amlan Kar, ,,https://www.tandfonline.com/doi/full/10.1080/15481603.2023.2290352,,,,,nan +3D Neural Edge Reconstruction,Lei Li · Songyou Peng · Zehao Yu · Shaohui Liu · Rémi Pautrat · Xiaochuan Yin · Marc Pollefeys,https://neural-edge-map.github.io/,https://arxiv.org/abs/2405.19295,,2405.19295.pdf,3D Neural Edge Reconstruction,"Real-world objects and environments are predominantly composed of edge +features, including straight lines and curves. Such edges are crucial elements +for various applications, such as CAD modeling, surface meshing, lane mapping, +etc. However, existing traditional methods only prioritize lines over curves +for simplicity in geometric modeling. To this end, we introduce EMAP, a new +method for learning 3D edge representations with a focus on both lines and +curves. Our method implicitly encodes 3D edge distance and direction in +Unsigned Distance Functions (UDF) from multi-view edge maps. On top of this +neural representation, we propose an edge extraction algorithm that robustly +abstracts parametric 3D edges from the inferred edge points and their +directions. Comprehensive evaluations demonstrate that our method achieves +better 3D edge reconstruction on multiple challenging datasets. 
We further show +that our learned UDF field enhances neural surface reconstruction by capturing +more details.",cs.CV,['cs.CV'] +ProMark: Proactive Diffusion Watermarking for Causal Attribution,Vishal Asnani · John Collomosse · Tu Bui · Xiaoming Liu · Shruti Agarwal, ,https://arxiv.org/abs/2403.09914,,2403.09914.pdf,ProMark: Proactive Diffusion Watermarking for Causal Attribution,"Generative AI (GenAI) is transforming creative workflows through the +capability to synthesize and manipulate images via high-level prompts. Yet +creatives are not well supported to receive recognition or reward for the use +of their content in GenAI training. To this end, we propose ProMark, a causal +attribution technique to attribute a synthetically generated image to its +training data concepts like objects, motifs, templates, artists, or styles. The +concept information is proactively embedded into the input training images +using imperceptible watermarks, and the diffusion models (unconditional or +conditional) are trained to retain the corresponding watermarks in generated +images. We show that we can embed as many as $2^{16}$ unique watermarks into +the training data, and each training image can contain more than one watermark. +ProMark can maintain image quality whilst outperforming correlation-based +attribution. Finally, several qualitative examples are presented, providing the +confidence that the presence of the watermark conveys a causative relationship +between training data and synthetic images.",cs.CV,['cs.CV'] +Sculpt3D: Multi-View Consistent Text-to-3D Generation with Sparse 3D Prior,Chen Cheng · Xiaofeng Yang · Fan Yang · Chengzeng Feng · ZHOUJIE FU · Chuan-Sheng Foo · Guosheng Lin · Fayao Liu, ,https://arxiv.org/abs/2403.09140,,2403.09140.pdf,Sculpt3D: Multi-View Consistent Text-to-3D Generation with Sparse 3D Prior,"Recent works on text-to-3d generation show that using only 2D diffusion +supervision for 3D generation tends to produce results with inconsistent +appearances (e.g., faces on the back view) and inaccurate shapes (e.g., animals +with extra legs). Existing methods mainly address this issue by retraining +diffusion models with images rendered from 3D data to ensure multi-view +consistency while struggling to balance 2D generation quality with 3D +consistency. In this paper, we present a new framework Sculpt3D that equips the +current pipeline with explicit injection of 3D priors from retrieved reference +objects without re-training the 2D diffusion model. Specifically, we +demonstrate that high-quality and diverse 3D geometry can be guaranteed by +keypoints supervision through a sparse ray sampling approach. Moreover, to +ensure accurate appearances of different views, we further modulate the output +of the 2D diffusion model to the correct patterns of the template views without +altering the generated object's style. These two decoupled designs effectively +harness 3D information from reference objects to generate 3D objects while +preserving the generation quality of the 2D diffusion model. Extensive +experiments show our method can largely improve the multi-view consistency +while retaining fidelity and diversity. 
Our project page is available at: +https://stellarcheng.github.io/Sculpt3D/.",cs.CV,['cs.CV'] +Empowering Resampling Operation for Ultra-High-Definition Image Enhancement with Model-Aware Guidance,Yu · Jie Huang · Li · Kaiwen Zheng · Qi Zhu · Man Zhou · Feng Zhao, ,,https://github.com/YPatrickW/LMAR,,,,,nan +You'll Never Walk Alone: A Sketch and Text Duet for Fine-Grained Image Retrieval,Subhadeep Koley · Ayan Kumar Bhunia · Aneeshan Sain · Pinaki Nath Chowdhury · Tao Xiang · Yi-Zhe Song,https://subhadeepkoley.github.io/Sketch2Word/,https://arxiv.org/abs/2403.07222v2,,2403.07222v2.pdf,You'll Never Walk Alone: A Sketch and Text Duet for Fine-Grained Image Retrieval,"Two primary input modalities prevail in image retrieval: sketch and text. +While text is widely used for inter-category retrieval tasks, sketches have +been established as the sole preferred modality for fine-grained image +retrieval due to their ability to capture intricate visual details. In this +paper, we question the reliance on sketches alone for fine-grained image +retrieval by simultaneously exploring the fine-grained representation +capabilities of both sketch and text, orchestrating a duet between the two. The +end result enables precise retrievals previously unattainable, allowing users +to pose ever-finer queries and incorporate attributes like colour and +contextual cues from text. For this purpose, we introduce a novel +compositionality framework, effectively combining sketches and text using +pre-trained CLIP models, while eliminating the need for extensive fine-grained +textual descriptions. Last but not least, our system extends to novel +applications in composed image retrieval, domain attribute transfer, and +fine-grained generation, providing solutions for various real-world scenarios.",cs.CV,['cs.CV'] +Why Not Use Your Textbook? Knowledge-Enhanced Procedure Planning of Instructional Videos,Kumaranage Ravindu Nagasinghe · Honglu Zhou · Malitha Gunawardhana · Martin Renqiang Min · Daniel Harari · Muhammad Haris Khan,https://ravindu-yasas-nagasinghe.github.io/KEPP-Project_Page/,https://arxiv.org/abs/2403.02782,,2403.02782.pdf,Why Not Use Your Textbook? Knowledge-Enhanced Procedure Planning of Instructional Videos,"In this paper, we explore the capability of an agent to construct a logical +sequence of action steps, thereby assembling a strategic procedural plan. This +plan is crucial for navigating from an initial visual observation to a target +visual outcome, as depicted in real-life instructional videos. Existing works +have attained partial success by extensively leveraging various sources of +information available in the datasets, such as heavy intermediate visual +observations, procedural names, or natural language step-by-step instructions, +for features or supervision signals. However, the task remains formidable due +to the implicit causal constraints in the sequencing of steps and the +variability inherent in multiple feasible plans. To tackle these intricacies +that previous efforts have overlooked, we propose to enhance the capabilities +of the agent by infusing it with procedural knowledge. This knowledge, sourced +from training procedure plans and structured as a directed weighted graph, +equips the agent to better navigate the complexities of step sequencing and its +potential variations. 
We coin our approach KEPP, a novel Knowledge-Enhanced +Procedure Planning system, which harnesses a probabilistic procedural knowledge +graph extracted from training data, effectively acting as a comprehensive +textbook for the training domain. Experimental evaluations across three +widely-used datasets under settings of varying complexity reveal that KEPP +attains superior, state-of-the-art results while requiring only minimal +supervision.",cs.CV,['cs.CV'] +Efficient 3D Implicit Head Avatar with Mesh-anchored Hash Table Blendshapes,Ziqian Bai · Feitong Tan · Sean Fanello · Rohit Pandey · Mingsong Dou · Shichen Liu · Ping Tan · Yinda Zhang,https://augmentedperception.github.io/monoavatar-plus/,https://arxiv.org/abs/2404.01543,,2404.01543.pdf,Efficient 3D Implicit Head Avatar with Mesh-anchored Hash Table Blendshapes,"3D head avatars built with neural implicit volumetric representations have +achieved unprecedented levels of photorealism. However, the computational cost +of these methods remains a significant barrier to their widespread adoption, +particularly in real-time applications such as virtual reality and +teleconferencing. While attempts have been made to develop fast neural +rendering approaches for static scenes, these methods cannot be simply employed +to support realistic facial expressions, such as in the case of a dynamic +facial performance. To address these challenges, we propose a novel fast 3D +neural implicit head avatar model that achieves real-time rendering while +maintaining fine-grained controllability and high rendering quality. Our key +idea lies in the introduction of local hash table blendshapes, which are +learned and attached to the vertices of an underlying face parametric model. +These per-vertex hash-tables are linearly merged with weights predicted via a +CNN, resulting in expression dependent embeddings. Our novel representation +enables efficient density and color predictions using a lightweight MLP, which +is further accelerated by a hierarchical nearest neighbor search method. +Extensive experiments show that our approach runs in real-time while achieving +comparable rendering quality to state-of-the-arts and decent results on +challenging expressions.",cs.CV,"['cs.CV', 'cs.GR']" +"Towards Co-Evaluation of Cameras, HDR, and Algorithms for Industrial-Grade 6DoF Pose Estimation",Agastya Kalra · Guy Stoppi · Dmitrii Marin · Vage Taamazyan · Aarrushi Shandilya · Rishav Agarwal · Anton Boykov · Aaron Chong · Michael Stark,https://github.com/intrinsic-ai/ipd,https://arxiv.org/abs/2403.03221,,2403.03221.pdf,"FAR: Flexible, Accurate and Robust 6DoF Relative Camera Pose Estimation","Estimating relative camera poses between images has been a central problem in +computer vision. Methods that find correspondences and solve for the +fundamental matrix offer high precision in most cases. Conversely, methods +predicting pose directly using neural networks are more robust to limited +overlap and can infer absolute translation scale, but at the expense of reduced +precision. We show how to combine the best of both methods; our approach yields +results that are both precise and robust, while also accurately inferring +translation scales. At the heart of our model lies a Transformer that (1) +learns to balance between solved and learned pose estimations, and (2) provides +a prior to guide a solver. 
A comprehensive analysis supports our design choices +and demonstrates that our method adapts flexibly to various feature extractors +and correspondence estimators, showing state-of-the-art performance in 6DoF +pose estimation on Matterport3D, InteriorNet, StreetLearn, and Map-free +Relocalization.",cs.CV,['cs.CV'] +A Generative Approach for Wikipedia-Scale Visual Entity Recognition,Mathilde Caron · Ahmet Iscen · Alireza Fathi · Cordelia Schmid,https://github.com/google-research/scenic/tree/main/scenic/projects/gerald,https://arxiv.org/abs/2403.02041,,2403.02041.pdf,A Generative Approach for Wikipedia-Scale Visual Entity Recognition,"In this paper, we address web-scale visual entity recognition, specifically +the task of mapping a given query image to one of the 6 million existing +entities in Wikipedia. One way of approaching a problem of such scale is using +dual-encoder models (eg CLIP), where all the entity names and query images are +embedded into a unified space, paving the way for an approximate k-NN search. +Alternatively, it is also possible to re-purpose a captioning model to directly +generate the entity names for a given image. In contrast, we introduce a novel +Generative Entity Recognition (GER) framework, which given an input image +learns to auto-regressively decode a semantic and discriminative ``code'' +identifying the target entity. Our experiments demonstrate the efficacy of this +GER paradigm, showcasing state-of-the-art performance on the challenging OVEN +benchmark. GER surpasses strong captioning, dual-encoder, visual matching and +hierarchical classification baselines, affirming its advantage in tackling the +complexities of web-scale recognition.",cs.CV,['cs.CV'] +How to Handle Sketch-Abstraction in Sketch-Based Image Retrieval?,Subhadeep Koley · Ayan Kumar Bhunia · Aneeshan Sain · Pinaki Nath Chowdhury · Tao Xiang · Yi-Zhe Song,https://subhadeepkoley.github.io/AbstractAway/,https://arxiv.org/abs/2403.07203,,2403.07203.pdf,How to Handle Sketch-Abstraction in Sketch-Based Image Retrieval?,"In this paper, we propose a novel abstraction-aware sketch-based image +retrieval framework capable of handling sketch abstraction at varied levels. +Prior works had mainly focused on tackling sub-factors such as drawing style +and order, we instead attempt to model abstraction as a whole, and propose +feature-level and retrieval granularity-level designs so that the system builds +into its DNA the necessary means to interpret abstraction. On learning +abstraction-aware features, we for the first-time harness the rich semantic +embedding of pre-trained StyleGAN model, together with a novel +abstraction-level mapper that deciphers the level of abstraction and +dynamically selects appropriate dimensions in the feature matrix +correspondingly, to construct a feature matrix embedding that can be freely +traversed to accommodate different levels of abstraction. For granularity-level +abstraction understanding, we dictate that the retrieval model should not treat +all abstraction-levels equally and introduce a differentiable surrogate Acc.@q +loss to inject that understanding into the system. Different to the +gold-standard triplet loss, our Acc.@q loss uniquely allows a sketch to +narrow/broaden its focus in terms of how stringent the evaluation should be - +the more abstract a sketch, the less stringent (higher q). 
Extensive +experiments depict our method to outperform existing state-of-the-arts in +standard SBIR tasks along with challenging scenarios like early retrieval, +forensic sketch-photo matching, and style-invariant retrieval.",cs.CV,['cs.CV'] +A Recipe for Scaling up Text-to-Video Generation with Text-free Videos,Xiang Wang · Shiwei Zhang · Hangjie Yuan · Zhiwu Qing · Biao Gong · Yingya Zhang · Yujun Shen · Changxin Gao · Nong Sang,https://tf-t2v.github.io/,https://arxiv.org/abs/2312.15770,,2312.15770.pdf,A Recipe for Scaling up Text-to-Video Generation with Text-free Videos,"Diffusion-based text-to-video generation has witnessed impressive progress in +the past year yet still falls behind text-to-image generation. One of the key +reasons is the limited scale of publicly available data (e.g., 10M video-text +pairs in WebVid10M vs. 5B image-text pairs in LAION), considering the high cost +of video captioning. Instead, it could be far easier to collect unlabeled clips +from video platforms like YouTube. Motivated by this, we come up with a novel +text-to-video generation framework, termed TF-T2V, which can directly learn +with text-free videos. The rationale behind is to separate the process of text +decoding from that of temporal modeling. To this end, we employ a content +branch and a motion branch, which are jointly optimized with weights shared. +Following such a pipeline, we study the effect of doubling the scale of +training set (i.e., video-only WebVid10M) with some randomly collected +text-free videos and are encouraged to observe the performance improvement (FID +from 9.67 to 8.19 and FVD from 484 to 441), demonstrating the scalability of +our approach. We also find that our model could enjoy sustainable performance +gain (FID from 8.19 to 7.64 and FVD from 441 to 366) after reintroducing some +text labels for training. Finally, we validate the effectiveness and +generalizability of our ideology on both native text-to-video generation and +compositional video synthesis paradigms. Code and models will be publicly +available at https://tf-t2v.github.io/.",cs.CV,"['cs.CV', 'cs.AI']" +HOIAnimator: Text-Prompt Human-Object Animations Generation with Perceptive Diffusion Models,Wenfeng Song · Xinyu Zhang · Shuai Li · Yang Gao · Aimin Hao · Xia HOU · Chenglizhao Chen · Ning Li · Hong Qin, ,https://arxiv.org/abs/2312.06553,,2312.06553.pdf,HOI-Diff: Text-Driven Synthesis of 3D Human-Object Interactions using Diffusion Models,"We address the problem of generating realistic 3D human-object interactions +(HOIs) driven by textual prompts. To this end, we take a modular design and +decompose the complex task into simpler sub-tasks. We first develop a +dual-branch diffusion model (HOI-DM) to generate both human and object motions +conditioned on the input text, and encourage coherent motions by a +cross-attention communication module between the human and object motion +generation branches. We also develop an affordance prediction diffusion model +(APDM) to predict the contacting area between the human and object during the +interactions driven by the textual prompt. The APDM is independent of the +results by the HOI-DM and thus can correct potential errors by the latter. +Moreover, it stochastically generates the contacting points to diversify the +generated motions. Finally, we incorporate the estimated contacting points into +the classifier-guidance to achieve accurate and close contact between humans +and objects. 
To train and evaluate our approach, we annotate BEHAVE dataset +with text descriptions. Experimental results on BEHAVE and OMOMO demonstrate +that our approach produces realistic HOIs with various interactions and +different types of objects.",cs.CV,['cs.CV'] +Dancing with Still Images: Video Distillation via Static-Dynamic Disentanglement,Ziyu Wang · Yue Xu · Cewu Lu · Yonglu Li, ,https://arxiv.org/abs/2312.00362,,2312.00362.pdf,Dancing with Still Images: Video Distillation via Static-Dynamic Disentanglement,"Recently, dataset distillation has paved the way towards efficient machine +learning, especially for image datasets. However, the distillation for videos, +characterized by an exclusive temporal dimension, remains an underexplored +domain. In this work, we provide the first systematic study of video +distillation and introduce a taxonomy to categorize temporal compression. Our +investigation reveals that the temporal information is usually not well learned +during distillation, and the temporal dimension of synthetic data contributes +little. The observations motivate our unified framework of disentangling the +dynamic and static information in the videos. It first distills the videos into +still images as static memory and then compensates the dynamic and motion +information with a learnable dynamic memory block. Our method achieves +state-of-the-art on video datasets at different scales, with a notably smaller +memory storage budget. Our code is available at +https://github.com/yuz1wan/video_distillation.",cs.CV,"['cs.CV', 'cs.LG']" +Readout Guidance: Learning Control from Diffusion Features,Grace Luo · Trevor Darrell · Oliver Wang · Dan B Goldman · Aleksander Holynski,https://readout-guidance.github.io,https://arxiv.org/abs/2312.02150,,2312.02150.pdf,Readout Guidance: Learning Control from Diffusion Features,"We present Readout Guidance, a method for controlling text-to-image diffusion +models with learned signals. Readout Guidance uses readout heads, lightweight +networks trained to extract signals from the features of a pre-trained, frozen +diffusion model at every timestep. These readouts can encode single-image +properties, such as pose, depth, and edges; or higher-order properties that +relate multiple images, such as correspondence and appearance similarity. +Furthermore, by comparing the readout estimates to a user-defined target, and +back-propagating the gradient through the readout head, these estimates can be +used to guide the sampling process. Compared to prior methods for conditional +generation, Readout Guidance requires significantly fewer added parameters and +training samples, and offers a convenient and simple recipe for reproducing +different forms of conditional control under a single framework, with a single +architecture and sampling procedure. We showcase these benefits in the +applications of drag-based manipulation, identity-consistent generation, and +spatially aligned control. Project page: https://readout-guidance.github.io.",cs.CV,['cs.CV'] +BEVNeXt: Reviving Dense BEV Frameworks for 3D Object Detection,Zhenxin Li · Shiyi Lan · Jose M. Alvarez · Zuxuan Wu, ,https://arxiv.org/abs/2312.01696v1,,2312.01696v1.pdf,BEVNeXt: Reviving Dense BEV Frameworks for 3D Object Detection,"Recently, the rise of query-based Transformer decoders is reshaping +camera-based 3D object detection. These query-based decoders are surpassing the +traditional dense BEV (Bird's Eye View)-based methods. 
However, we argue that +dense BEV frameworks remain important due to their outstanding abilities in +depth estimation and object localization, depicting 3D scenes accurately and +comprehensively. This paper aims to address the drawbacks of the existing dense +BEV-based 3D object detectors by introducing our proposed enhanced components, +including a CRF-modulated depth estimation module enforcing object-level +consistencies, a long-term temporal aggregation module with extended receptive +fields, and a two-stage object decoder combining perspective techniques with +CRF-modulated depth embedding. These enhancements lead to a ""modernized"" dense +BEV framework dubbed BEVNeXt. On the nuScenes benchmark, BEVNeXt outperforms +both BEV-based and query-based frameworks under various settings, achieving a +state-of-the-art result of 64.2 NDS on the nuScenes test set.",cs.CV,['cs.CV'] +It's All About Your Sketch: Democratising Sketch Control in Diffusion Models,Subhadeep Koley · Ayan Kumar Bhunia · Deeptanshu Sekhri · Aneeshan Sain · Pinaki Nath Chowdhury · Tao Xiang · Yi-Zhe Song,https://subhadeepkoley.github.io/StableSketching/,https://arxiv.org/abs/2403.07234,,2403.07234.pdf,It's All About Your Sketch: Democratising Sketch Control in Diffusion Models,"This paper unravels the potential of sketches for diffusion models, +addressing the deceptive promise of direct sketch control in generative AI. We +importantly democratise the process, enabling amateur sketches to generate +precise images, living up to the commitment of ""what you sketch is what you +get"". A pilot study underscores the necessity, revealing that deformities in +existing models stem from spatial-conditioning. To rectify this, we propose an +abstraction-aware framework, utilising a sketch adapter, adaptive time-step +sampling, and discriminative guidance from a pre-trained fine-grained +sketch-based image retrieval model, working synergistically to reinforce +fine-grained sketch-photo association. Our approach operates seamlessly during +inference without the need for textual prompts; a simple, rough sketch akin to +what you and I can create suffices! We welcome everyone to examine results +presented in the paper and its supplementary. Contributions include +democratising sketch control, introducing an abstraction-aware framework, and +leveraging discriminative guidance, validated through extensive experiments.",cs.CV,['cs.CV'] +COTR: Compact Occupancy TRansformer for Vision-based 3D Occupancy Prediction,Qihang Ma · Xin Tan · Yanyun Qu · Lizhuang Ma · Zhizhong Zhang · Yuan Xie,https://github.com/NotACracker/COTR,https://arxiv.org/abs/2312.01919,,2312.01919.pdf,COTR: Compact Occupancy TRansformer for Vision-based 3D Occupancy Prediction,"The autonomous driving community has shown significant interest in 3D +occupancy prediction, driven by its exceptional geometric perception and +general object recognition capabilities. To achieve this, current works try to +construct a Tri-Perspective View (TPV) or Occupancy (OCC) representation +extending from the Bird-Eye-View perception. However, compressed views like TPV +representation lose 3D geometry information while raw and sparse OCC +representation requires heavy but redundant computational costs. To address the +above limitations, we propose Compact Occupancy TRansformer (COTR), with a +geometry-aware occupancy encoder and a semantic-aware group decoder to +reconstruct a compact 3D OCC representation. 
The occupancy encoder first +generates a compact geometrical OCC feature through efficient explicit-implicit +view transformation. Then, the occupancy decoder further enhances the semantic +discriminability of the compact OCC representation by a coarse-to-fine semantic +grouping strategy. Empirical experiments show that there are evident +performance gains across multiple baselines, e.g., COTR outperforms baselines +with a relative improvement of 8%-15%, demonstrating the superiority of our +method.",cs.CV,['cs.CV'] +Global and Local Prompts Cooperation via Optimal Transport for Federated Learning,Hongxia Li · Wei Huang · Jingya Wang · Ye Shi,https://github.com/HongxiaLee/FedOTP,https://arxiv.org/abs/2403.00041,,2403.00041.pdf,Global and Local Prompts Cooperation via Optimal Transport for Federated Learning,"Prompt learning in pretrained visual-language models has shown remarkable +flexibility across various downstream tasks. Leveraging its inherent +lightweight nature, recent research attempted to integrate the powerful +pretrained models into federated learning frameworks to simultaneously reduce +communication costs and promote local training on insufficient data. Despite +these efforts, current federated prompt learning methods lack specialized +designs to systematically address severe data heterogeneities, e.g., data +distribution with both label and feature shifts involved. To address this +challenge, we present Federated Prompts Cooperation via Optimal Transport +(FedOTP), which introduces efficient collaborative prompt learning strategies +to capture diverse category traits on a per-client basis. Specifically, for +each client, we learn a global prompt to extract consensus knowledge among +clients, and a local prompt to capture client-specific category +characteristics. Unbalanced Optimal Transport is then employed to align local +visual features with these prompts, striking a balance between global consensus +and local personalization. By relaxing one of the equality constraints, FedOTP +enables prompts to focus solely on the core regions of image patches. Extensive +experiments on datasets with various types of heterogeneities have demonstrated +that our FedOTP outperforms the state-of-the-art methods.",cs.LG,"['cs.LG', 'cs.AI', 'cs.DC']" +Rethinking the Evaluation Protocol of Domain Generalization,Han Yu · Xingxuan Zhang · Renzhe Xu · Jiashuo Liu · Yue He · Peng Cui, ,https://arxiv.org/abs/2307.11108,,2307.11108.pdf,Flatness-Aware Minimization for Domain Generalization,"Domain generalization (DG) seeks to learn robust models that generalize well +under unknown distribution shifts. As a critical aspect of DG, optimizer +selection has not been explored in depth. Currently, most DG methods follow the +widely used benchmark, DomainBed, and utilize Adam as the default optimizer for +all datasets. However, we reveal that Adam is not necessarily the optimal +choice for the majority of current DG methods and datasets. Based on the +perspective of loss landscape flatness, we propose a novel approach, +Flatness-Aware Minimization for Domain Generalization (FAD), which can +efficiently optimize both zeroth-order and first-order flatness simultaneously +for DG. We provide theoretical analyses of the FAD's out-of-distribution (OOD) +generalization error and convergence. Our experimental results demonstrate the +superiority of FAD on various DG datasets. 
Additionally, we confirm that FAD is +capable of discovering flatter optima in comparison to other zeroth-order and +first-order flatness-aware optimization methods.",cs.CV,"['cs.CV', 'cs.LG']" +"The More You See in 2D, the More You Perceive in 3D",Xinyang Han · Zelin Gao · Angjoo Kanazawa · Shubham Goel · Yossi Gandelsman, ,https://arxiv.org/abs/2404.03652,,2404.03652.pdf,"The More You See in 2D, the More You Perceive in 3D","Humans can infer 3D structure from 2D images of an object based on past +experience and improve their 3D understanding as they see more images. Inspired +by this behavior, we introduce SAP3D, a system for 3D reconstruction and novel +view synthesis from an arbitrary number of unposed images. Given a few unposed +images of an object, we adapt a pre-trained view-conditioned diffusion model +together with the camera poses of the images via test-time fine-tuning. The +adapted diffusion model and the obtained camera poses are then utilized as +instance-specific priors for 3D reconstruction and novel view synthesis. We +show that as the number of input images increases, the performance of our +approach improves, bridging the gap between optimization-based prior-less 3D +reconstruction methods and single-image-to-3D diffusion-based methods. We +demonstrate our system on real images as well as standard synthetic benchmarks. +Our ablation studies confirm that this adaption behavior is key for more +accurate 3D understanding.",cs.CV,['cs.CV'] +"Selective, Interpretable and Motion Consistent Privacy Attribute Obfuscation for Action Recognition",Filip Ilic · He Zhao · Thomas Pock · Richard P. Wildes,https://f-ilic.github.io/SelectivePrivacyPreservation,https://arxiv.org/abs/2403.12710,,2403.12710.pdf,"Selective, Interpretable, and Motion Consistent Privacy Attribute Obfuscation for Action Recognition","Concerns for the privacy of individuals captured in public imagery have led +to privacy-preserving action recognition. Existing approaches often suffer from +issues arising through obfuscation being applied globally and a lack of +interpretability. Global obfuscation hides privacy sensitive regions, but also +contextual regions important for action recognition. Lack of interpretability +erodes trust in these new technologies. We highlight the limitations of current +paradigms and propose a solution: Human selected privacy templates that yield +interpretability by design, an obfuscation scheme that selectively hides +attributes and also induces temporal consistency, which is important in action +recognition. Our approach is architecture agnostic and directly modifies input +imagery, while existing approaches generally require architecture training. Our +approach offers more flexibility, as no retraining is required, and outperforms +alternatives on three widely used datasets.",cs.CV,"['cs.CV', 'cs.LG']" +OpticalDR: A Deep Optical Imaging Model for Privacy-Protective Depression Recognition,Yuchen Pan · Junjun Jiang · Kui Jiang · Zhihao Wu · Keyuan Yu · Xianming Liu, ,https://arxiv.org/abs/2402.18786,,2402.18786.pdf,OpticalDR: A Deep Optical Imaging Model for Privacy-Protective Depression Recognition,"Depression Recognition (DR) poses a considerable challenge, especially in the +context of the growing concerns surrounding privacy. Traditional automatic +diagnosis of DR technology necessitates the use of facial images, undoubtedly +expose the patient identity features and poses privacy risks. 
In order to +mitigate the potential risks associated with the inappropriate disclosure of +patient facial images, we design a new imaging system to erase the identity +information of captured facial images while retain disease-relevant features. +It is irreversible for identity information recovery while preserving essential +disease-related characteristics necessary for accurate DR. More specifically, +we try to record a de-identified facial image (erasing the identifiable +features as much as possible) by a learnable lens, which is optimized in +conjunction with the following DR task as well as a range of face analysis +related auxiliary tasks in an end-to-end manner. These aforementioned +strategies form our final Optical deep Depression Recognition network +(OpticalDR). Experiments on CelebA, AVEC 2013, and AVEC 2014 datasets +demonstrate that our OpticalDR has achieved state-of-the-art privacy protection +performance with an average AUC of 0.51 on popular facial recognition models, +and competitive results for DR with MAE/RMSE of 7.53/8.48 on AVEC 2013 and +7.89/8.82 on AVEC 2014, respectively.",cs.CV,['cs.CV'] +NIFTY: Neural Object Interaction Fields for Guided Human Motion Synthesis,Nilesh Kulkarni · Davis Rempe · Kyle Genova · Abhijit Kundu · Justin Johnson · David Fouhey · Leonidas Guibas, ,https://arxiv.org/abs/2307.07511,,2307.07511.pdf,NIFTY: Neural Object Interaction Fields for Guided Human Motion Synthesis,"We address the problem of generating realistic 3D motions of humans +interacting with objects in a scene. Our key idea is to create a neural +interaction field attached to a specific object, which outputs the distance to +the valid interaction manifold given a human pose as input. This interaction +field guides the sampling of an object-conditioned human motion diffusion +model, so as to encourage plausible contacts and affordance semantics. To +support interactions with scarcely available data, we propose an automated +synthetic data pipeline. For this, we seed a pre-trained motion model, which +has priors for the basics of human movement, with interaction-specific anchor +poses extracted from limited motion capture data. Using our guided diffusion +model trained on generated synthetic data, we synthesize realistic motions for +sitting and lifting with several objects, outperforming alternative approaches +in terms of motion quality and successful action completion. We call our +framework NIFTY: Neural Interaction Fields for Trajectory sYnthesis.",cs.CV,['cs.CV'] +On The Vulnerability of Efficient Vision Transformers to Adversarial Computation Attacks,Navaneet K L · Soroush Abbasi Koohpayegani · Essam Sleiman · Hamed Pirsiavash, ,https://arxiv.org/html/2208.09602v2,,2208.09602v2.pdf,Exploring Adversarial Robustness of Vision Transformers in the Spectral Perspective,"The Vision Transformer has emerged as a powerful tool for image +classification tasks, surpassing the performance of convolutional neural +networks (CNNs). Recently, many researchers have attempted to understand the +robustness of Transformers against adversarial attacks. However, previous +researches have focused solely on perturbations in the spatial domain. This +paper proposes an additional perspective that explores the adversarial +robustness of Transformers against frequency-selective perturbations in the +spectral domain. To facilitate comparison between these two domains, an attack +framework is formulated as a flexible tool for implementing attacks on images +in the spatial and spectral domains. 
The experiments reveal that Transformers +rely more on phase and low frequency information, which can render them more +vulnerable to frequency-selective attacks than CNNs. This work offers new +insights into the properties and adversarial robustness of Transformers.",cs.CV,['cs.CV'] +Retrieval-Augmented Layout Transformer for Content-Aware Layout Generation,Daichi Horita · Naoto Inoue · Kotaro Kikuchi · Kota Yamaguchi · Kiyoharu Aizawa,https://udonda.github.io/RALF/,https://arxiv.org/abs/2311.13602,,2311.13602.pdf,Retrieval-Augmented Layout Transformer for Content-Aware Layout Generation,"Content-aware graphic layout generation aims to automatically arrange visual +elements along with a given content, such as an e-commerce product image. In +this paper, we argue that the current layout generation approaches suffer from +the limited training data for the high-dimensional layout structure. We show +that a simple retrieval augmentation can significantly improve the generation +quality. Our model, which is named Retrieval-Augmented Layout Transformer +(RALF), retrieves nearest neighbor layout examples based on an input image and +feeds these results into an autoregressive generator. Our model can apply +retrieval augmentation to various controllable generation tasks and yield +high-quality layouts within a unified architecture. Our extensive experiments +show that RALF successfully generates content-aware layouts in both constrained +and unconstrained settings and significantly outperforms the baselines.",cs.CV,['cs.CV'] +Adaptive Random Feature Regularization on Fine-tuning Deep Neural Networks,Shin'ya Yamaguchi · Sekitoshi Kanai · Kazuki Adachi · Daiki Chijiwa,https://github.com/yshinya6/adarand,https://arxiv.org/abs/2403.10097,,2403.10097.pdf,Adaptive Random Feature Regularization on Fine-tuning Deep Neural Networks,"While fine-tuning is a de facto standard method for training deep neural +networks, it still suffers from overfitting when using small target datasets. +Previous methods improve fine-tuning performance by maintaining knowledge of +the source datasets or introducing regularization terms such as contrastive +loss. However, these methods require auxiliary source information (e.g., source +labels or datasets) or heavy additional computations. In this paper, we propose +a simple method called adaptive random feature regularization (AdaRand). +AdaRand helps the feature extractors of training models to adaptively change +the distribution of feature vectors for downstream classification tasks without +auxiliary source information and with reasonable computation costs. To this +end, AdaRand minimizes the gap between feature vectors and random reference +vectors that are sampled from class conditional Gaussian distributions. +Furthermore, AdaRand dynamically updates the conditional distribution to follow +the currently updated feature extractors and balance the distance between +classes in feature spaces. 
Our experiments show that AdaRand outperforms the +other fine-tuning regularization, which requires auxiliary source information +and heavy computation costs.",cs.LG,"['cs.LG', 'cs.AI', 'cs.CV']" +UFC-Net: Unrolling Fixed-point Continuous Network for Deep Compressive Sensing,Xiaoyang Wang · Hongping Gan, ,,https://link.springer.com/article/10.1007/s11263-023-01814-w,,,,,nan +Error Detection in Egocentric Procedural Task Videos,Shih-Po Lee · Zijia Lu · Zekun Zhang · Minh Hoai · Ehsan Elhamifar, ,https://arxiv.org/abs/2404.01933,,2404.01933.pdf,PREGO: online mistake detection in PRocedural EGOcentric videos,"Promptly identifying procedural errors from egocentric videos in an online +setting is highly challenging and valuable for detecting mistakes as soon as +they happen. This capability has a wide range of applications across various +fields, such as manufacturing and healthcare. The nature of procedural mistakes +is open-set since novel types of failures might occur, which calls for +one-class classifiers trained on correctly executed procedures. However, no +technique can currently detect open-set procedural mistakes online. We propose +PREGO, the first online one-class classification model for mistake detection in +PRocedural EGOcentric videos. PREGO is based on an online action recognition +component to model the current action, and a symbolic reasoning module to +predict the next actions. Mistake detection is performed by comparing the +recognized current action with the expected future one. We evaluate PREGO on +two procedural egocentric video datasets, Assembly101 and Epic-tent, which we +adapt for online benchmarking of procedural mistake detection to establish +suitable benchmarks, thus defining the Assembly101-O and Epic-tent-O datasets, +respectively.",cs.CV,['cs.CV'] +Low-Rank Knowledge Decomposition for Medical Foundation Models,Yuhang Zhou · Haolin li · Siyuan Du · Jiangchao Yao · Ya Zhang · Yanfeng Wang, ,https://arxiv.org/abs/2404.17184,,2404.17184.pdf,Low-Rank Knowledge Decomposition for Medical Foundation Models,"The popularity of large-scale pre-training has promoted the development of +medical foundation models. However, some studies have shown that although +foundation models exhibit strong general feature extraction capabilities, their +performance on specific tasks is still inferior to task-specific methods. In +this paper, we explore a new perspective called ``Knowledge Decomposition'' to +improve the performance on specific medical tasks, which deconstruct the +foundation model into multiple lightweight expert models, each dedicated to a +particular task, with the goal of improving specialization while concurrently +mitigating resource expenditure. To accomplish the above objective, we design a +novel framework named Low-Rank Knowledge Decomposition (LoRKD), which +explicitly separates graidents by incorporating low-rank expert modules and the +efficient knowledge separation convolution. 
Extensive experimental results +demonstrate that the decomposed models perform well in terms of performance and +transferability, even surpassing the original foundation models.",cs.CV,['cs.CV'] +GS-IR: 3D Gaussian Splatting for Inverse Rendering,Zhihao Liang · Qi Zhang · Ying Feng · Ying Shan · Kui Jia, ,https://arxiv.org/abs/2311.16473,,2311.16473.pdf,GS-IR: 3D Gaussian Splatting for Inverse Rendering,"We propose GS-IR, a novel inverse rendering approach based on 3D Gaussian +Splatting (GS) that leverages forward mapping volume rendering to achieve +photorealistic novel view synthesis and relighting results. Unlike previous +works that use implicit neural representations and volume rendering (e.g. +NeRF), which suffer from low expressive power and high computational +complexity, we extend GS, a top-performance representation for novel view +synthesis, to estimate scene geometry, surface material, and environment +illumination from multi-view images captured under unknown lighting conditions. +There are two main problems when introducing GS to inverse rendering: 1) GS +does not support producing plausible normal natively; 2) forward mapping (e.g. +rasterization and splatting) cannot trace the occlusion like backward mapping +(e.g. ray tracing). To address these challenges, our GS-IR proposes an +efficient optimization scheme that incorporates a depth-derivation-based +regularization for normal estimation and a baking-based occlusion to model +indirect lighting. The flexible and expressive GS representation allows us to +achieve fast and compact geometry reconstruction, photorealistic novel view +synthesis, and effective physically-based rendering. We demonstrate the +superiority of our method over baseline methods through qualitative and +quantitative evaluations on various challenging scenes.",cs.CV,['cs.CV'] +Unraveling Instance Associations: A Closer Look for Audio-Visual Segmentation,Yuanhong Chen · Yuyuan Liu · Hu Wang · Fengbei Liu · Chong Wang · Helen Frazer · Gustavo Carneiro, ,https://arxiv.org/abs/2310.18709,,2310.18709.pdf,Audio-Visual Instance Segmentation,"In this paper, we propose a new multi-modal task, namely audio-visual +instance segmentation (AVIS), in which the goal is to identify, segment, and +track individual sounding object instances in audible videos, simultaneously. +To our knowledge, it is the first time that instance segmentation has been +extended into the audio-visual domain. To better facilitate this research, we +construct the first audio-visual instance segmentation benchmark (AVISeg). +Specifically, AVISeg consists of 1,258 videos with an average duration of 62.6 +seconds from YouTube and public audio-visual datasets, where 117 videos have +been annotated by using an interactive semi-automatic labeling tool based on +the Segment Anything Model (SAM). In addition, we present a simple baseline +model for the AVIS task. Our new model introduces an audio branch and a +cross-modal fusion module to Mask2Former to locate all sounding objects. +Finally, we evaluate the proposed method using two backbones on AVISeg. 
We +believe that AVIS will inspire the community towards a more comprehensive +multi-modal understanding.",cs.CV,"['cs.CV', 'cs.LG', 'cs.MM', 'cs.SD', 'eess.AS']" +Towards Generalizable Multi-Object Tracking,Zheng Qin · Le Wang · Sanping Zhou · Panpan Fu · Gang Hua · Wei Tang, ,http://export.arxiv.org/abs/2311.10382,,2311.10382.pdf,Single-Shot and Multi-Shot Feature Learning for Multi-Object Tracking,"Multi-Object Tracking (MOT) remains a vital component of intelligent video +analysis, which aims to locate targets and maintain a consistent identity for +each target throughout a video sequence. Existing works usually learn a +discriminative feature representation, such as motion and appearance, to +associate the detections across frames, which are easily affected by mutual +occlusion and background clutter in practice. In this paper, we propose a +simple yet effective two-stage feature learning paradigm to jointly learn +single-shot and multi-shot features for different targets, so as to achieve +robust data association in the tracking process. For the detections without +being associated, we design a novel single-shot feature learning module to +extract discriminative features of each detection, which can efficiently +associate targets between adjacent frames. For the tracklets being lost several +frames, we design a novel multi-shot feature learning module to extract +discriminative features of each tracklet, which can accurately refind these +lost targets after a long period. Once equipped with a simple data association +logic, the resulting VisualTracker can perform robust MOT based on the +single-shot and multi-shot feature representations. Extensive experimental +results demonstrate that our method has achieved significant improvements on +MOT17 and MOT20 datasets while reaching state-of-the-art performance on +DanceTrack dataset.",cs.CV,['cs.CV'] +Authentic Hand Avatar from a Phone Scan via Universal Hand Model,Gyeongsik Moon · Weipeng Xu · Rohan Joshi · Chenglei Wu · Takaaki Shiratori, ,https://arxiv.org/abs/2405.07933,,2405.07933.pdf,Authentic Hand Avatar from a Phone Scan via Universal Hand Model,"The authentic 3D hand avatar with every identifiable information, such as +hand shapes and textures, is necessary for immersive experiences in AR/VR. In +this paper, we present a universal hand model (UHM), which 1) can universally +represent high-fidelity 3D hand meshes of arbitrary identities (IDs) and 2) can +be adapted to each person with a short phone scan for the authentic hand +avatar. For effective universal hand modeling, we perform tracking and modeling +at the same time, while previous 3D hand models perform them separately. The +conventional separate pipeline suffers from the accumulated errors from the +tracking stage, which cannot be recovered in the modeling stage. On the other +hand, ours does not suffer from the accumulated errors while having a much more +concise overall pipeline. We additionally introduce a novel image matching loss +function to address a skin sliding during the tracking and modeling, while +existing works have not focused on it much. Finally, using learned priors from +our UHM, we effectively adapt our UHM to each person's short phone scan for the +authentic hand avatar.",cs.CV,['cs.CV'] +WANDR: Intention-guided Human Motion Generation,Markos Diomataris · Nikos Athanasiou · Omid Taheri · Xi Wang · Otmar Hilliges · Michael J. 
Black,https://wandr.is.tue.mpg.de/,https://arxiv.org/abs/2404.15383,,2404.15383.pdf,WANDR: Intention-guided Human Motion Generation,"Synthesizing natural human motions that enable a 3D human avatar to walk and +reach for arbitrary goals in 3D space remains an unsolved problem with many +applications. Existing methods (data-driven or using reinforcement learning) +are limited in terms of generalization and motion naturalness. A primary +obstacle is the scarcity of training data that combines locomotion with goal +reaching. To address this, we introduce WANDR, a data-driven model that takes +an avatar's initial pose and a goal's 3D position and generates natural human +motions that place the end effector (wrist) on the goal location. To solve +this, we introduce novel intention features that drive rich goal-oriented +movement. Intention guides the agent to the goal, and interactively adapts the +generation to novel situations without needing to define sub-goals or the +entire motion path. Crucially, intention allows training on datasets that have +goal-oriented motions as well as those that do not. WANDR is a conditional +Variational Auto-Encoder (c-VAE), which we train using the AMASS and CIRCLE +datasets. We evaluate our method extensively and demonstrate its ability to +generate natural and long-term motions that reach 3D goals and generalize to +unseen goal locations. Our models and code are available for research purposes +at wandr.is.tue.mpg.de.",cs.CV,"['cs.CV', 'cs.AI']" +SynSP: Synergy of Smoothness and Precision in Pose Sequences Refinement,Tao Wang · Lei Jin · Zheng Wang · Jianshu Li · Liang Li · Fang Zhao · Yu Cheng · Li Yuan · Li ZHOU · Junliang Xing · Jian Zhao, ,https://arxiv.org/abs/2311.09543,,2311.09543.pdf,Temporal-Aware Refinement for Video-based Human Pose and Shape Recovery,"Though significant progress in human pose and shape recovery from monocular +RGB images has been made in recent years, obtaining 3D human motion with high +accuracy and temporal consistency from videos remains challenging. Existing +video-based methods tend to reconstruct human motion from global image +features, which lack detailed representation capability and limit the +reconstruction accuracy. In this paper, we propose a Temporal-Aware Refining +Network (TAR), to synchronously explore temporal-aware global and local image +features for accurate pose and shape recovery. First, a global transformer +encoder is introduced to obtain temporal global features from static feature +sequences. Second, a bidirectional ConvGRU network takes the sequence of +high-resolution feature maps as input, and outputs temporal local feature maps +that maintain high resolution and capture the local motion of the human body. +Finally, a recurrent refinement module iteratively updates estimated SMPL +parameters by leveraging both global and local temporal information to achieve +accurate and smooth results. Extensive experiments demonstrate that our TAR +obtains more accurate results than previous state-of-the-art methods on popular +benchmarks, i.e., 3DPW, MPI-INF-3DHP, and Human3.6M.",cs.CV,['cs.CV'] +vid-TLDR: Training Free Token merging for Light-weight Video Transformer,Joonmyung Choi · Sanghyeok Lee · Jaewon Chu · Minhyuk Choi · Hyunwoo J. 
Kim,https://github.com/mlvlab/vid-TLDR,https://arxiv.org/abs/2403.13347,,2403.13347.pdf,vid-TLDR: Training Free Token merging for Light-weight Video Transformer,"Video Transformers have become the prevalent solution for various video +downstream tasks with superior expressive power and flexibility. However, these +video transformers suffer from heavy computational costs induced by the massive +number of tokens across the entire video frames, which has been the major +barrier to training the model. Further, the patches irrelevant to the main +contents, e.g., backgrounds, degrade the generalization performance of models. +To tackle these issues, we propose training free token merging for lightweight +video Transformer (vid-TLDR) that aims to enhance the efficiency of video +Transformers by merging the background tokens without additional training. For +vid-TLDR, we introduce a novel approach to capture the salient regions in +videos only with the attention map. Further, we introduce the saliency-aware +token merging strategy by dropping the background tokens and sharpening the +object scores. Our experiments show that vid-TLDR significantly mitigates the +computational complexity of video Transformers while achieving competitive +performance compared to the base model without vid-TLDR. Code is available at +https://github.com/mlvlab/vid-TLDR.",cs.CV,['cs.CV'] +Boosting Image Restoration via Priors from Pre-trained Models,Xiaogang Xu · Shu Kong · Tao Hu · Zhe Liu · Hujun Bao, ,https://arxiv.org/abs/2403.06793,,2403.06793.pdf,Boosting Image Restoration via Priors from Pre-trained Models,"Pre-trained models with large-scale training data, such as CLIP and Stable +Diffusion, have demonstrated remarkable performance in various high-level +computer vision tasks such as image understanding and generation from language +descriptions. Yet, their potential for low-level tasks such as image +restoration remains relatively unexplored. In this paper, we explore such +models to enhance image restoration. As off-the-shelf features (OSF) from +pre-trained models do not directly serve image restoration, we propose to learn +an additional lightweight module called Pre-Train-Guided Refinement Module +(PTG-RM) to refine restoration results of a target restoration network with +OSF. PTG-RM consists of two components, Pre-Train-Guided Spatial-Varying +Enhancement (PTG-SVE), and Pre-Train-Guided Channel-Spatial Attention +(PTG-CSA). PTG-SVE enables optimal short- and long-range neural operations, +while PTG-CSA enhances spatial-channel attention for restoration-related +learning. Extensive experiments demonstrate that PTG-RM, with its compact size +($<$1M parameters), effectively enhances restoration performance of various +models across different tasks, including low-light enhancement, deraining, +deblurring, and denoising.",cs.CV,['cs.CV'] +HiFi4G: High-Fidelity Human Performance Rendering via Compact Gaussian Splatting,Yuheng Jiang · Zhehao Shen · Penghao Wang · Zhuo Su · Yu Hong · Yingliang Zhang · Jingyi Yu · Lan Xu,https://nowheretrix.github.io/HiFi4G/,https://arxiv.org/abs/2312.03461,,2312.03461.pdf,HiFi4G: High-Fidelity Human Performance Rendering via Compact Gaussian Splatting,"We have recently seen tremendous progress in photo-real human modeling and +rendering. Yet, efficiently rendering realistic human performance and +integrating it into the rasterization pipeline remains challenging. 
In this +paper, we present HiFi4G, an explicit and compact Gaussian-based approach for +high-fidelity human performance rendering from dense footage. Our core +intuition is to marry the 3D Gaussian representation with non-rigid tracking, +achieving a compact and compression-friendly representation. We first propose a +dual-graph mechanism to obtain motion priors, with a coarse deformation graph +for effective initialization and a fine-grained Gaussian graph to enforce +subsequent constraints. Then, we utilize a 4D Gaussian optimization scheme with +adaptive spatial-temporal regularizers to effectively balance the non-rigid +prior and Gaussian updating. We also present a companion compression scheme +with residual compensation for immersive experiences on various platforms. It +achieves a substantial compression rate of approximately 25 times, with less +than 2MB of storage per frame. Extensive experiments demonstrate the +effectiveness of our approach, which significantly outperforms existing +approaches in terms of optimization speed, rendering quality, and storage +overhead.",cs.CV,['cs.CV'] +Preserving Fairness Generalization in Deepfake Detection,Li Lin · Li Lin · Xinan He · Yan Ju · Xin Wang · Feng Ding · Shu Hu, ,https://arxiv.org/abs/2402.17229v1,,2402.17229v1.pdf,Preserving Fairness Generalization in Deepfake Detection,"Although effective deepfake detection models have been developed in recent +years, recent studies have revealed that these models can result in unfair +performance disparities among demographic groups, such as race and gender. This +can lead to particular groups facing unfair targeting or exclusion from +detection, potentially allowing misclassified deepfakes to manipulate public +opinion and undermine trust in the model. The existing method for addressing +this problem is providing a fair loss function. It shows good fairness +performance for intra-domain evaluation but does not maintain fairness for +cross-domain testing. This highlights the significance of fairness +generalization in the fight against deepfakes. In this work, we propose the +first method to address the fairness generalization problem in deepfake +detection by simultaneously considering features, loss, and optimization +aspects. Our method employs disentanglement learning to extract demographic and +domain-agnostic forgery features, fusing them to encourage fair learning across +a flattened loss landscape. Extensive experiments on prominent deepfake +datasets demonstrate our method's effectiveness, surpassing state-of-the-art +approaches in preserving fairness during cross-domain deepfake detection. The +code is available at https://github.com/Purdue-M2/Fairness-Generalization",cs.CV,"['cs.CV', 'cs.CY', 'cs.LG']" +CoSeR: Bridging Image and Language for Cognitive Super-Resolution,Haoze Sun · Wenbo Li · Jianzhuang Liu · Haoyu Chen · Renjing Pei · Xueyi Zou · Youliang Yan · Yujiu Yang, ,https://arxiv.org/abs/2311.16512,,2311.16512.pdf,CoSeR: Bridging Image and Language for Cognitive Super-Resolution,"Existing super-resolution (SR) models primarily focus on restoring local +texture details, often neglecting the global semantic information within the +scene. This oversight can lead to the omission of crucial semantic details or +the introduction of inaccurate textures during the recovery process. In our +work, we introduce the Cognitive Super-Resolution (CoSeR) framework, empowering +SR models with the capacity to comprehend low-resolution images. 
We achieve +this by marrying image appearance and language understanding to generate a +cognitive embedding, which not only activates prior information from large +text-to-image diffusion models but also facilitates the generation of +high-quality reference images to optimize the SR process. To further improve +image fidelity, we propose a novel condition injection scheme called +""All-in-Attention"", consolidating all conditional information into a single +module. Consequently, our method successfully restores semantically correct and +photorealistic details, demonstrating state-of-the-art performance across +multiple benchmarks. Code: https://github.com/VINHYU/CoSeR",cs.CV,"['cs.CV', 'cs.AI']" +Task-Customized Mixture of Adapters for General Image Fusion,Pengfei Zhu · Yang Sun · Bing Cao · Qinghua Hu, ,https://arxiv.org/abs/2403.12494,,2403.12494.pdf,Task-Customized Mixture of Adapters for General Image Fusion,"General image fusion aims at integrating important information from +multi-source images. However, due to the significant cross-task gap, the +respective fusion mechanism varies considerably in practice, resulting in +limited performance across subtasks. To handle this problem, we propose a novel +task-customized mixture of adapters (TC-MoA) for general image fusion, +adaptively prompting various fusion tasks in a unified model. We borrow the +insight from the mixture of experts (MoE), taking the experts as efficient +tuning adapters to prompt a pre-trained foundation model. These adapters are +shared across different tasks and constrained by mutual information +regularization, ensuring compatibility with different tasks while +complementarity for multi-source images. The task-specific routing networks +customize these adapters to extract task-specific information from different +sources with dynamic dominant intensity, performing adaptive visual feature +prompt fusion. Notably, our TC-MoA controls the dominant intensity bias for +different fusion tasks, successfully unifying multiple fusion tasks in a single +model. Extensive experiments show that TC-MoA outperforms the competing +approaches in learning commonalities while retaining compatibility for general +image fusion (multi-modal, multi-exposure, and multi-focus), and also +demonstrating striking controllability on more generalization experiments. The +code is available at https://github.com/YangSun22/TC-MoA .",cs.CV,['cs.CV'] +Decoupling Static and Hierarchical Motion Perception for Referring Video Segmentation,Shuting He · Henghui Ding, ,https://arxiv.org/abs/2404.03645,,2404.03645.pdf,Decoupling Static and Hierarchical Motion Perception for Referring Video Segmentation,"Referring video segmentation relies on natural language expressions to +identify and segment objects, often emphasizing motion clues. Previous works +treat a sentence as a whole and directly perform identification at the +video-level, mixing up static image-level cues with temporal motion cues. +However, image-level features cannot well comprehend motion cues in sentences, +and static cues are not crucial for temporal perception. In fact, static cues +can sometimes interfere with temporal perception by overshadowing motion cues. +In this work, we propose to decouple video-level referring expression +understanding into static and motion perception, with a specific emphasis on +enhancing temporal comprehension. 
Firstly, we introduce an +expression-decoupling module to make static cues and motion cues perform their +distinct role, alleviating the issue of sentence embeddings overlooking motion +cues. Secondly, we propose a hierarchical motion perception module to capture +temporal information effectively across varying timescales. Furthermore, we +employ contrastive learning to distinguish the motions of visually similar +objects. These contributions yield state-of-the-art performance across five +datasets, including a remarkable $\textbf{9.2%}$ $\mathcal{J\&F}$ improvement +on the challenging $\textbf{MeViS}$ dataset. Code is available at +https://github.com/heshuting555/DsHmp.",cs.CV,['cs.CV'] +MTMMC: A Large-Scale Real-World Multi-Modal Camera Tracking Benchmark,Sanghyun Woo · Kwanyong Park · Inkyu Shin · Myungchul Kim · In So Kweon,https://sites.google.com/view/mtmmc,https://arxiv.org/abs/2403.20225,,2403.20225.pdf,MTMMC: A Large-Scale Real-World Multi-Modal Camera Tracking Benchmark,"Multi-target multi-camera tracking is a crucial task that involves +identifying and tracking individuals over time using video streams from +multiple cameras. This task has practical applications in various fields, such +as visual surveillance, crowd behavior analysis, and anomaly detection. +However, due to the difficulty and cost of collecting and labeling data, +existing datasets for this task are either synthetically generated or +artificially constructed within a controlled camera network setting, which +limits their ability to model real-world dynamics and generalize to diverse +camera configurations. To address this issue, we present MTMMC, a real-world, +large-scale dataset that includes long video sequences captured by 16 +multi-modal cameras in two different environments - campus and factory - across +various time, weather, and season conditions. This dataset provides a +challenging test-bed for studying multi-camera tracking under diverse +real-world complexities and includes an additional input modality of spatially +aligned and temporally synchronized RGB and thermal cameras, which enhances the +accuracy of multi-camera tracking. MTMMC is a super-set of existing datasets, +benefiting independent fields such as person detection, re-identification, and +multiple object tracking. We provide baselines and new learning setups on this +dataset and set the reference scores for future studies. The datasets, models, +and test server will be made publicly available.",cs.CV,['cs.CV'] +Pseudo Label Refinery for Unsupervised Domain Adaptation on Cross-dataset 3D Object Detection,Zhanwei Zhang · Minghao Chen · Shuai Xiao · Liang Peng · Hengjia Li · Binbin Lin · Ping Li · Wenxiao Wang · Boxi Wu · Deng Cai, ,https://arxiv.org/abs/2404.19384,,2404.19384.pdf,Pseudo Label Refinery for Unsupervised Domain Adaptation on Cross-dataset 3D Object Detection,"Recent self-training techniques have shown notable improvements in +unsupervised domain adaptation for 3D object detection (3D UDA). These +techniques typically select pseudo labels, i.e., 3D boxes, to supervise models +for the target domain. However, this selection process inevitably introduces +unreliable 3D boxes, in which 3D points cannot be definitively assigned as +foreground or background. Previous techniques mitigate this by reweighting +these boxes as pseudo labels, but these boxes can still poison the training +process. To resolve this problem, in this paper, we propose a novel pseudo +label refinery framework. 
Specifically, in the selection process, to improve +the reliability of pseudo boxes, we propose a complementary augmentation +strategy. This strategy involves either removing all points within an +unreliable box or replacing it with a high-confidence box. Moreover, the point +numbers of instances in high-beam datasets are considerably higher than those +in low-beam datasets, also degrading the quality of pseudo labels during the +training process. We alleviate this issue by generating additional proposals +and aligning RoI features across different domains. Experimental results +demonstrate that our method effectively enhances the quality of pseudo labels +and consistently surpasses the state-of-the-art methods on six autonomous +driving benchmarks. Code will be available at +https://github.com/Zhanwei-Z/PERE.",cs.CV,"['cs.CV', 'cs.AI']" +Unbiased Faster R-CNN for Single-source Domain Generalized Object Detection,Yajing Liu · Shijun Zhou · Xiyao Liu · chunhui Hao · Baojie Fan · Jiandong Tian, ,https://arxiv.org/abs/2405.15225,,2405.15225.pdf,Unbiased Faster R-CNN for Single-source Domain Generalized Object Detection,"Single-source domain generalization (SDG) for object detection is a +challenging yet essential task as the distribution bias of the unseen domain +degrades the algorithm performance significantly. However, existing methods +attempt to extract domain-invariant features, neglecting that the biased data +leads the network to learn biased features that are non-causal and poorly +generalizable. To this end, we propose an Unbiased Faster R-CNN (UFR) for +generalizable feature learning. Specifically, we formulate SDG in object +detection from a causal perspective and construct a Structural Causal Model +(SCM) to analyze the data bias and feature bias in the task, which are caused +by scene confounders and object attribute confounders. Based on the SCM, we +design a Global-Local Transformation module for data augmentation, which +effectively simulates domain diversity and mitigates the data bias. +Additionally, we introduce a Causal Attention Learning module that incorporates +a designed attention invariance loss to learn image-level features that are +robust to scene confounders. Moreover, we develop a Causal Prototype Learning +module with an explicit instance constraint and an implicit prototype +constraint, which further alleviates the negative impact of object attribute +confounders. Experimental results on five scenes demonstrate the prominent +generalization ability of our method, with an improvement of 3.9% mAP on the +Night-Clear scene.",cs.CV,['cs.CV'] +DiPrompT: Disentangled Prompt Tuning for Multiple Latent Domain Generalization in Federated Learning,Sikai Bai · Jie ZHANG · Song Guo · Shuaicheng Li · Jingcai Guo · Jun Hou · Tao Han · Xiaocheng Lu, ,https://arxiv.org/abs/2403.08506,,2403.08506.pdf,DiPrompT: Disentangled Prompt Tuning for Multiple Latent Domain Generalization in Federated Learning,"Federated learning (FL) has emerged as a powerful paradigm for learning from +decentralized data, and federated domain generalization further considers the +test dataset (target domain) is absent from the decentralized training data +(source domains). However, most existing FL methods assume that domain labels +are provided during training, and their evaluation imposes explicit constraints +on the number of domains, which must strictly match the number of clients. 
+Because of the underutilization of numerous edge devices and additional +cross-client domain annotations in the real world, such restrictions may be +impractical and involve potential privacy leaks. In this paper, we propose an +efficient and novel approach, called Disentangled Prompt Tuning (DiPrompT), a +method that tackles the above restrictions by learning adaptive prompts for +domain generalization in a distributed manner. Specifically, we first design +two types of prompts, i.e., global prompt to capture general knowledge across +all clients and domain prompts to capture domain-specific knowledge. They +eliminate the restriction on the one-to-one mapping between source domains and +local clients. Furthermore, a dynamic query metric is introduced to +automatically search the suitable domain label for each sample, which includes +two-substep text-image alignments based on prompt tuning without +labor-intensive annotation. Extensive experiments on multiple datasets +demonstrate that our DiPrompT achieves superior domain generalization +performance over state-of-the-art FL methods when domain labels are not +provided, and even outperforms many centralized learning methods using domain +labels.",cs.LG,"['cs.LG', 'cs.AI', 'cs.CV']" +HIVE: Harnessing Human Feedback for Instructional Visual Editing,Shu Zhang · Xinyi Yang · Yihao Feng · Can Qin · Chia-Chih Chen · Ning Yu · Zeyuan Chen · Huan Wang · Silvio Savarese · Stefano Ermon · Caiming Xiong · Ran Xu, ,,https://www.semanticscholar.org/paper/HQ-Edit:-A-High-Quality-Dataset-for-Image-Editing-Hui-Yang/09609bd28855fd9b27f043b4dbf509615229bd08,,,,,nan +LightIt: Illumination Modeling and Control for Diffusion Models,Peter Kocsis · Kalyan Sunkavalli · Julien Philip · Matthias Nießner · Yannick Hold-Geoffroy,https://peter-kocsis.github.io/LightIt/,https://arxiv.org/abs/2403.10615,,2403.10615.pdf,LightIt: Illumination Modeling and Control for Diffusion Models,"We introduce LightIt, a method for explicit illumination control for image +generation. Recent generative methods lack lighting control, which is crucial +to numerous artistic aspects of image generation such as setting the overall +mood or cinematic appearance. To overcome these limitations, we propose to +condition the generation on shading and normal maps. We model the lighting with +single bounce shading, which includes cast shadows. We first train a shading +estimation module to generate a dataset of real-world images and shading pairs. +Then, we train a control network using the estimated shading and normals as +input. Our method demonstrates high-quality image generation and lighting +control in numerous scenes. Additionally, we use our generated dataset to train +an identity-preserving relighting model, conditioned on an image and a target +shading. 
Our method is the first that enables the generation of images with +controllable, consistent lighting and performs on par with specialized +relighting state-of-the-art methods.",cs.CV,"['cs.CV', 'cs.GR', 'cs.LG', 'I.4.8; I.2.10']" +Byzantine-robust Decentralized Federated Learning via Dual-domain Clustering and Trust Bootstrapping,Peng Sun · Xinyang Liu · Zhibo Wang · Bo Liu, ,,https://dl.acm.org/doi/abs/10.1145/3637494.3638729,,,,,nan +Generative 3D Part Assembly via Part-Whole-Hierarchy Message Passing,Bi'an Du · Xiang Gao · Wei Hu · Renjie Liao, ,https://arxiv.org/abs/2402.17464,,2402.17464.pdf,Generative 3D Part Assembly via Part-Whole-Hierarchy Message Passing,"Generative 3D part assembly involves understanding part relationships and +predicting their 6-DoF poses for assembling a realistic 3D shape. Prior work +often focus on the geometry of individual parts, neglecting part-whole +hierarchies of objects. Leveraging two key observations: 1) super-part poses +provide strong hints about part poses, and 2) predicting super-part poses is +easier due to fewer superparts, we propose a part-whole-hierarchy message +passing network for efficient 3D part assembly. We first introduce super-parts +by grouping geometrically similar parts without any semantic labels. Then we +employ a part-whole hierarchical encoder, wherein a super-part encoder predicts +latent super-part poses based on input parts. Subsequently, we transform the +point cloud using the latent poses, feeding it to the part encoder for +aggregating super-part information and reasoning about part relationships to +predict all part poses. In training, only ground-truth part poses are required. +During inference, the predicted latent poses of super-parts enhance +interpretability. Experimental results on the PartNet dataset show that our +method achieves state-of-the-art performance in part and connectivity accuracy +and enables an interpretable hierarchical part assembly. Code is available at +https://github.com/pkudba/3DHPA.",cs.CV,['cs.CV'] +FreeMan: Towards benchmarking 3D human pose estimation under Real-World Conditions,Jiong WANG · Fengyu Yang · Bingliang Li · Wenbo Gou · Danqi Yan · Ailing Zeng · Ailing Zeng · Yijun Gao · Junle Wang · Yanqing Jing · Ruimao Zhang,https://wangjiongw.github.io/freeman/,https://arxiv.org/abs/2309.05073,,2309.05073.pdf,FreeMan: Towards Benchmarking 3D Human Pose Estimation under Real-World Conditions,"Estimating the 3D structure of the human body from natural scenes is a +fundamental aspect of visual perception. 3D human pose estimation is a vital +step in advancing fields like AIGC and human-robot interaction, serving as a +crucial technique for understanding and interacting with human actions in +real-world settings. However, the current datasets, often collected under +single laboratory conditions using complex motion capture equipment and +unvarying backgrounds, are insufficient. The absence of datasets on variable +conditions is stalling the progress of this crucial task. To facilitate the +development of 3D pose estimation, we present FreeMan, the first large-scale, +multi-view dataset collected under the real-world conditions. FreeMan was +captured by synchronizing 8 smartphones across diverse scenarios. It comprises +11M frames from 8000 sequences, viewed from different perspectives. These +sequences cover 40 subjects across 10 different scenarios, each with varying +lighting conditions. 
We have also established an semi-automated pipeline +containing error detection to reduce the workload of manual check and ensure +precise annotation. We provide comprehensive evaluation baselines for a range +of tasks, underlining the significant challenges posed by FreeMan. Further +evaluations of standard indoor/outdoor human sensing datasets reveal that +FreeMan offers robust representation transferability in real and complex +scenes. Code and data are available at https://wangjiongw.github.io/freeman.",cs.CV,['cs.CV'] +Generative Multimodal Models are In-Context Learners,Quan Sun · Yufeng Cui · Yufeng Cui · Xiaosong Zhang · Fan Zhang · Qiying Yu · Yueze Wang · Yongming Rao · Jingjing Liu · Tiejun Huang · Xinlong Wang, ,https://arxiv.org/abs/2312.13286,,2312.13286.pdf,Generative Multimodal Models are In-Context Learners,"The human ability to easily solve multimodal tasks in context (i.e., with +only a few demonstrations or simple instructions), is what current multimodal +systems have largely struggled to imitate. In this work, we demonstrate that +the task-agnostic in-context learning capabilities of large multimodal models +can be significantly enhanced by effective scaling-up. We introduce Emu2, a +generative multimodal model with 37 billion parameters, trained on large-scale +multimodal sequences with a unified autoregressive objective. Emu2 exhibits +strong multimodal in-context learning abilities, even emerging to solve tasks +that require on-the-fly reasoning, such as visual prompting and object-grounded +generation. The model sets a new record on multiple multimodal understanding +tasks in few-shot settings. When instruction-tuned to follow specific +instructions, Emu2 further achieves new state-of-the-art on challenging tasks +such as question answering benchmarks for large multimodal models and +open-ended subject-driven generation. These achievements demonstrate that Emu2 +can serve as a base model and general-purpose interface for a wide range of +multimodal tasks. Code and models are publicly available to facilitate future +research.",cs.CV,['cs.CV'] +SVDTree: Semantic Voxel Diffusion for Single Image Tree Reconstruction,Yuan Li · Zhihao Liu · Bedrich Benes · Xiaopeng Zhang · Jianwei Guo,https://github.com/RyuZhihao123/SVDTree,https://arxiv.org/abs/2402.12712,,2402.12712.pdf,MVDiffusion++: A Dense High-resolution Multi-view Diffusion Model for Single or Sparse-view 3D Object Reconstruction,"This paper presents a neural architecture MVDiffusion++ for 3D object +reconstruction that synthesizes dense and high-resolution views of an object +given one or a few images without camera poses. MVDiffusion++ achieves superior +flexibility and scalability with two surprisingly simple ideas: 1) A +``pose-free architecture'' where standard self-attention among 2D latent +features learns 3D consistency across an arbitrary number of conditional and +generation views without explicitly using camera pose information; and 2) A +``view dropout strategy'' that discards a substantial number of output views +during training, which reduces the training-time memory footprint and enables +dense and high-resolution view synthesis at test time. We use the Objaverse for +training and the Google Scanned Objects for evaluation with standard novel view +synthesis and 3D reconstruction metrics, where MVDiffusion++ significantly +outperforms the current state of the arts. We also demonstrate a text-to-3D +application example by combining MVDiffusion++ with a text-to-image generative +model. 
The project page is at https://mvdiffusion-plusplus.github.io.",cs.CV,['cs.CV'] +HomoFormer: Homogenized Transformer for Image Shadow Removal,Jie Xiao · Xueyang Fu · Yurui Zhu · Dong Li · Jie Huang · Kai Zhu · Zheng-Jun Zha, ,https://arxiv.org/abs/2404.18433,,2404.18433.pdf,ShadowMaskFormer: Mask Augmented Patch Embeddings for Shadow Removal,"Transformer recently emerged as the de facto model for computer vision tasks +and has also been successfully applied to shadow removal. However, these +existing methods heavily rely on intricate modifications to the attention +mechanisms within the transformer blocks while using a generic patch embedding. +As a result, it often leads to complex architectural designs requiring +additional computation resources. In this work, we aim to explore the efficacy +of incorporating shadow information within the early processing stage. +Accordingly, we propose a transformer-based framework with a novel patch +embedding that is tailored for shadow removal, dubbed ShadowMaskFormer. +Specifically, we present a simple and effective mask-augmented patch embedding +to integrate shadow information and promote the model's emphasis on acquiring +knowledge for shadow regions. Extensive experiments conducted on the ISTD, +ISTD+, and SRD benchmark datasets demonstrate the efficacy of our method +against state-of-the-art approaches while using fewer model parameters.",cs.CV,['cs.CV'] +Novel Class Discovery for Ultra-Fine-Grained Visual Categorization,Qi Jia · Yaqi Cai · Qi Jia · Binglin Qiu · Weimin Wang · Nan Pu,https://github.com/SSDUT-Caiyq/UFG-NCD,https://arxiv.org/abs/2405.06283,,2405.06283.pdf,Novel Class Discovery for Ultra-Fine-Grained Visual Categorization,"Ultra-fine-grained visual categorization (Ultra-FGVC) aims at distinguishing +highly similar sub-categories within fine-grained objects, such as different +soybean cultivars. Compared to traditional fine-grained visual categorization, +Ultra-FGVC encounters more hurdles due to the small inter-class and large +intra-class variation. Given these challenges, relying on human annotation for +Ultra-FGVC is impractical. To this end, our work introduces a novel task termed +Ultra-Fine-Grained Novel Class Discovery (UFG-NCD), which leverages partially +annotated data to identify new categories of unlabeled images for Ultra-FGVC. +To tackle this problem, we devise a Region-Aligned Proxy Learning (RAPL) +framework, which comprises a Channel-wise Region Alignment (CRA) module and a +Semi-Supervised Proxy Learning (SemiPL) strategy. The CRA module is designed to +extract and utilize discriminative features from local regions, facilitating +knowledge transfer from labeled to unlabeled classes. Furthermore, SemiPL +strengthens representation learning and knowledge transfer with proxy-guided +supervised learning and proxy-guided contrastive learning. Such techniques +leverage class distribution information in the embedding space, improving the +mining of subtle differences between labeled and unlabeled ultra-fine-grained +classes. Extensive experiments demonstrate that RAPL significantly outperforms +baselines across various datasets, indicating its effectiveness in handling the +challenges of UFG-NCD. 
Code is available at +https://github.com/SSDUT-Caiyq/UFG-NCD.",cs.CV,['cs.CV'] +RELI11D: A Comprehensive Multimodal Human Motion Dataset and Method,Ming Yan · Yan Zhang · Shuqiang Cai · Shuqi Fan · Xincheng Lin · Yudi Dai · Siqi Shen · Chenglu Wen · Lan Xu · Yuexin Ma · Cheng Wang,http://www.lidarhumanmotion.net/reli11d/,https://arxiv.org/abs/2403.19501,,2403.19501.pdf,RELI11D: A Comprehensive Multimodal Human Motion Dataset and Method,"Comprehensive capturing of human motions requires both accurate captures of +complex poses and precise localization of the human within scenes. Most of the +HPE datasets and methods primarily rely on RGB, LiDAR, or IMU data. However, +solely using these modalities or a combination of them may not be adequate for +HPE, particularly for complex and fast movements. For holistic human motion +understanding, we present RELI11D, a high-quality multimodal human motion +dataset involves LiDAR, IMU system, RGB camera, and Event camera. It records +the motions of 10 actors performing 5 sports in 7 scenes, including 3.32 hours +of synchronized LiDAR point clouds, IMU measurement data, RGB videos and Event +steams. Through extensive experiments, we demonstrate that the RELI11D presents +considerable challenges and opportunities as it contains many rapid and complex +motions that require precise location. To address the challenge of integrating +different modalities, we propose LEIR, a multimodal baseline that effectively +utilizes LiDAR Point Cloud, Event stream, and RGB through our cross-attention +fusion strategy. We show that LEIR exhibits promising results for rapid motions +and daily motions and that utilizing the characteristics of multiple modalities +can indeed improve HPE performance. Both the dataset and source code will be +released publicly to the research community, fostering collaboration and +enabling further exploration in this field.",cs.CV,['cs.CV'] +Lodge: A Coarse to Fine Diffusion Network for Long Dance Generation guided by the Characteristic Dance Primitives,Ronghui Li · Yuxiang Zhang · Yachao Zhang · Hongwen Zhang · Jie Guo · Yan Zhang · Yebin Liu · Xiu Li,https://li-ronghui.github.io/lodge,https://arxiv.org/abs/2403.10518,,2403.10518.pdf,Lodge: A Coarse to Fine Diffusion Network for Long Dance Generation Guided by the Characteristic Dance Primitives,"We propose Lodge, a network capable of generating extremely long dance +sequences conditioned on given music. We design Lodge as a two-stage coarse to +fine diffusion architecture, and propose the characteristic dance primitives +that possess significant expressiveness as intermediate representations between +two diffusion models. The first stage is global diffusion, which focuses on +comprehending the coarse-level music-dance correlation and production +characteristic dance primitives. In contrast, the second-stage is the local +diffusion, which parallelly generates detailed motion sequences under the +guidance of the dance primitives and choreographic rules. In addition, we +propose a Foot Refine Block to optimize the contact between the feet and the +ground, enhancing the physical realism of the motion. Our approach can +parallelly generate dance sequences of extremely long length, striking a +balance between global choreographic patterns and local motion quality and +expressiveness. 
Extensive experiments validate the efficacy of our method.",cs.CV,"['cs.CV', 'cs.GR', 'cs.SD', 'eess.AS']" +ExtraNeRF: Visibility-Aware View Extrapolation of Neural Radiance Fields with Diffusion Models,Meng-Li Shih · Wei-Chiu Ma · Lorenzo Boyice · Aleksander Holynski · Forrester Cole · Brian Curless · Janne Kontkanen, ,https://arxiv.org/abs/2401.00979,,2401.00979.pdf,3D Visibility-aware Generalizable Neural Radiance Fields for Interacting Hands,"Neural radiance fields (NeRFs) are promising 3D representations for scenes, +objects, and humans. However, most existing methods require multi-view inputs +and per-scene training, which limits their real-life applications. Moreover, +current methods focus on single-subject cases, leaving scenes of interacting +hands that involve severe inter-hand occlusions and challenging view variations +remain unsolved. To tackle these issues, this paper proposes a generalizable +visibility-aware NeRF (VA-NeRF) framework for interacting hands. Specifically, +given an image of interacting hands as input, our VA-NeRF first obtains a +mesh-based representation of hands and extracts their corresponding geometric +and textural features. Subsequently, a feature fusion module that exploits the +visibility of query points and mesh vertices is introduced to adaptively merge +features of both hands, enabling the recovery of features in unseen areas. +Additionally, our VA-NeRF is optimized together with a novel discriminator +within an adversarial learning paradigm. In contrast to conventional +discriminators that predict a single real/fake label for the synthesized image, +the proposed discriminator generates a pixel-wise visibility map, providing +fine-grained supervision for unseen areas and encouraging the VA-NeRF to +improve the visual quality of synthesized images. Experiments on the +Interhand2.6M dataset demonstrate that our proposed VA-NeRF outperforms +conventional NeRFs significantly. Project Page: +\url{https://github.com/XuanHuang0/VANeRF}.",cs.CV,['cs.CV'] +"Representing Part-Whole Hierarchies in Foundation Models by Learning Localizability, Composability, and Decomposability from Anatomy via Self-Supervision",Mohammad Reza Hosseinzadeh Taher · Michael Gotway · Jianming Liang, ,https://arxiv.org/abs/2404.15672,,2404.15672.pdf,"Representing Part-Whole Hierarchies in Foundation Models by Learning Localizability, Composability, and Decomposability from Anatomy via Self-Supervision","Humans effortlessly interpret images by parsing them into part-whole +hierarchies; deep learning excels in learning multi-level feature spaces, but +they often lack explicit coding of part-whole relations, a prominent property +of medical imaging. To overcome this limitation, we introduce Adam-v2, a new +self-supervised learning framework extending Adam [79] by explicitly +incorporating part-whole hierarchies into its learning objectives through three +key branches: (1) Localizability, acquiring discriminative representations to +distinguish different anatomical patterns; (2) Composability, learning each +anatomical structure in a parts-to-whole manner; and (3) Decomposability, +comprehending each anatomical structure in a whole-to-parts manner. +Experimental results across 10 tasks, compared to 11 baselines in zero-shot, +few-shot transfer, and full fine-tuning settings, showcase Adam-v2's superior +performance over large-scale medical models and existing SSL methods across +diverse downstream tasks. 
The higher generality and robustness of Adam-v2's +representations originate from its explicit construction of hierarchies for +distinct anatomical structures from unlabeled medical images. Adam-v2 preserves +a semantic balance of anatomical diversity and harmony in its embedding, +yielding representations that are both generic and semantically meaningful, yet +overlooked in existing SSL methods. All code and pretrained models are +available at https://github.com/JLiangLab/Eden.",cs.CV,['cs.CV'] +Robust Synthetic-to-Real Transfer for Stereo Matching,Jiawei Zhang · Jiahe Li · Lei Huang · Xiaohan Yu · Lin Gu · Jin Zheng · Xiao Bai, ,https://arxiv.org/abs/2403.07705,,2403.07705.pdf,Robust Synthetic-to-Real Transfer for Stereo Matching,"With advancements in domain generalized stereo matching networks, models +pre-trained on synthetic data demonstrate strong robustness to unseen domains. +However, few studies have investigated the robustness after fine-tuning them in +real-world scenarios, during which the domain generalization ability can be +seriously degraded. In this paper, we explore fine-tuning stereo matching +networks without compromising their robustness to unseen domains. Our +motivation stems from comparing Ground Truth (GT) versus Pseudo Label (PL) for +fine-tuning: GT degrades, but PL preserves the domain generalization ability. +Empirically, we find the difference between GT and PL implies valuable +information that can regularize networks during fine-tuning. We also propose a +framework to utilize this difference for fine-tuning, consisting of a frozen +Teacher, an exponential moving average (EMA) Teacher, and a Student network. +The core idea is to utilize the EMA Teacher to measure what the Student has +learned and dynamically improve GT and PL for fine-tuning. We integrate our +framework with state-of-the-art networks and evaluate its effectiveness on +several real-world datasets. Extensive experiments show that our method +effectively preserves the domain generalization ability during fine-tuning.",cs.CV,['cs.CV'] +Towards Robust Learning to Optimize with Theoretical Guarantees,Qingyu Song · Wei Lin · Juncheng Wang · Hong Xu,https://github.com/NetX-lab/GoMathL2O-Official,,https://henryhxu.github.io/papers.html,,,,,nan +UFOGen: You Forward Once Large Scale Text-to-Image Generation via Diffusion GANs,Yanwu Xu · Yang Zhao · Zhisheng Xiao · Tingbo Hou, ,https://arxiv.org/abs/2311.09257,,2311.09257.pdf,UFOGen: You Forward Once Large Scale Text-to-Image Generation via Diffusion GANs,"Text-to-image diffusion models have demonstrated remarkable capabilities in +transforming textual prompts into coherent images, yet the computational cost +of their inference remains a persistent challenge. To address this issue, we +present UFOGen, a novel generative model designed for ultra-fast, one-step +text-to-image synthesis. In contrast to conventional approaches that focus on +improving samplers or employing distillation techniques for diffusion models, +UFOGen adopts a hybrid methodology, integrating diffusion models with a GAN +objective. Leveraging a newly introduced diffusion-GAN objective and +initialization with pre-trained diffusion models, UFOGen excels in efficiently +generating high-quality images conditioned on textual descriptions in a single +step. Beyond traditional text-to-image generation, UFOGen showcases versatility +in applications. 
Notably, UFOGen stands among the pioneering models enabling +one-step text-to-image generation and diverse downstream tasks, presenting a +significant advancement in the landscape of efficient generative models.",cs.CV,['cs.CV'] +Instance-aware Contrastive Learning for Occluded Human Mesh Reconstruction,Mi-Gyeong Gwon · Gi-Mun Um · Won-Sik Cheong · Wonjun Kim,https://github.com/DCVL-3D/InstanceHMR_release,https://arxiv.org/abs/2307.16377,,2307.16377.pdf,JOTR: 3D Joint Contrastive Learning with Transformers for Occluded Human Mesh Recovery,"In this study, we focus on the problem of 3D human mesh recovery from a +single image under obscured conditions. Most state-of-the-art methods aim to +improve 2D alignment technologies, such as spatial averaging and 2D joint +sampling. However, they tend to neglect the crucial aspect of 3D alignment by +improving 3D representations. Furthermore, recent methods struggle to separate +the target human from occlusion or background in crowded scenes as they +optimize the 3D space of target human with 3D joint coordinates as local +supervision. To address these issues, a desirable method would involve a +framework for fusing 2D and 3D features and a strategy for optimizing the 3D +space globally. Therefore, this paper presents 3D JOint contrastive learning +with TRansformers (JOTR) framework for handling occluded 3D human mesh +recovery. Our method includes an encoder-decoder transformer architecture to +fuse 2D and 3D representations for achieving 2D$\&$3D aligned results in a +coarse-to-fine manner and a novel 3D joint contrastive learning approach for +adding explicitly global supervision for the 3D feature space. The contrastive +learning approach includes two contrastive losses: joint-to-joint contrast for +enhancing the similarity of semantically similar voxels (i.e., human joints), +and joint-to-non-joint contrast for ensuring discrimination from others (e.g., +occlusions and background). Qualitative and quantitative analyses demonstrate +that our method outperforms state-of-the-art competitors on both +occlusion-specific and standard benchmarks, significantly improving the +reconstruction of occluded humans.",cs.CV,['cs.CV'] +CCEdit: Creative and Controllable Video Editing via Diffusion Models,Ruoyu Feng · Wenming Weng · Yanhui Wang · Yuhui Yuan · Jianmin Bao · Chong Luo · Zhibo Chen · Baining Guo, ,https://arxiv.org/abs/2309.16496,,2309.16496.pdf,CCEdit: Creative and Controllable Video Editing via Diffusion Models,"In this paper, we present CCEdit, a versatile generative video editing +framework based on diffusion models. Our approach employs a novel trident +network structure that separates structure and appearance control, ensuring +precise and creative editing capabilities. Utilizing the foundational +ControlNet architecture, we maintain the structural integrity of the video +during editing. The incorporation of an additional appearance branch enables +users to exert fine-grained control over the edited key frame. These two side +branches seamlessly integrate into the main branch, which is constructed upon +existing text-to-image (T2I) generation models, through learnable temporal +layers. The versatility of our framework is demonstrated through a diverse +range of choices in both structure representations and personalized T2I models, +as well as the option to provide the edited key frame. To facilitate +comprehensive evaluation, we introduce the BalanceCC benchmark dataset, +comprising 100 videos and 4 target prompts for each video. 
Our extensive user +studies compare CCEdit with eight state-of-the-art video editing methods. The +outcomes demonstrate CCEdit's substantial superiority over all other methods.",cs.CV,['cs.CV'] +CN-RMA: Combined Network with Ray Marching Aggregation for 3D Indoor Object Detection from Multi-view Images,Guanlin Shen · Jingwei Huang · Zhihua Hu · Bin Wang,https://github.com/SerCharles/CN-RMA,https://arxiv.org/abs/2403.04198,,2403.04198.pdf,CN-RMA: Combined Network with Ray Marching Aggregation for 3D Indoors Object Detection from Multi-view Images,"This paper introduces CN-RMA, a novel approach for 3D indoor object detection +from multi-view images. We observe the key challenge as the ambiguity of image +and 3D correspondence without explicit geometry to provide occlusion +information. To address this issue, CN-RMA leverages the synergy of 3D +reconstruction networks and 3D object detection networks, where the +reconstruction network provides a rough Truncated Signed Distance Function +(TSDF) and guides image features to vote to 3D space correctly in an end-to-end +manner. Specifically, we associate weights to sampled points of each ray +through ray marching, representing the contribution of a pixel in an image to +corresponding 3D locations. Such weights are determined by the predicted signed +distances so that image features vote only to regions near the reconstructed +surface. Our method achieves state-of-the-art performance in 3D object +detection from multi-view images, as measured by mAP@0.25 and mAP@0.5 on the +ScanNet and ARKitScenes datasets. The code and models are released at +https://github.com/SerCharles/CN-RMA.",cs.CV,['cs.CV'] +One-Class Face Anti-spoofing via Spoof Cue Map-Guided Feature Learning,Pei-Kai Huang · Cheng-Hsuan Chiang · Tzu-Hsien Chen · Jun-Xiong Chong · Tyng-Luh Liu · Chiou-Ting Hsu, ,,https://link.springer.com/article/10.1007/s11042-023-17739-y,,,,,nan +PanoRecon: Real-Time Panoptic 3D Reconstruction from Monocular Video,Dong Wu · Zike Yan · Hongbin Zha, ,,,,,,,nan +Grounding Everything: Emerging Localization Properties in Vision-Language Transformers,Walid Bousselham · Felix Petersen · Vittorio Ferrari · Hilde Kuehne, ,https://arxiv.org/abs/2312.00878,,2312.00878.pdf,Grounding Everything: Emerging Localization Properties in Vision-Language Transformers,"Vision-language foundation models have shown remarkable performance in +various zero-shot settings such as image retrieval, classification, or +captioning. But so far, those models seem to fall behind when it comes to +zero-shot localization of referential expressions and objects in images. As a +result, they need to be fine-tuned for this task. In this paper, we show that +pretrained vision-language (VL) models allow for zero-shot open-vocabulary +object localization without any fine-tuning. To leverage those capabilities, we +propose a Grounding Everything Module (GEM) that generalizes the idea of +value-value attention introduced by CLIPSurgery to a self-self attention path. +We show that the concept of self-self attention corresponds to clustering, thus +enforcing groups of tokens arising from the same object to be similar while +preserving the alignment with the language space. To further guide the group +formation, we propose a set of regularizations that allows the model to finally +generalize across datasets and backbones. We evaluate the proposed GEM +framework on various benchmark tasks and datasets for semantic segmentation. 
It +shows that GEM not only outperforms other training-free open-vocabulary +localization methods, but also achieves state-of-the-art results on the +recently proposed OpenImagesV7 large-scale segmentation benchmark.",cs.CV,"['cs.CV', 'cs.AI']" +Brain Decodes Deep Nets,Huzheng Yang · James Gee · Jianbo Shi,https://huzeyann.github.io/brain-decodes-deep-nets,https://arxiv.org/abs/2312.01280,,2312.01280.pdf,Brain Decodes Deep Nets,"We developed a tool for visualizing and analyzing large pre-trained vision +models by mapping them onto the brain, thus exposing their hidden inside. Our +innovation arises from a surprising usage of brain encoding: predicting brain +fMRI measurements in response to images. We report two findings. First, +explicit mapping between the brain and deep-network features across dimensions +of space, layers, scales, and channels is crucial. This mapping method, +FactorTopy, is plug-and-play for any deep-network; with it, one can paint a +picture of the network onto the brain (literally!). Second, our visualization +shows how different training methods matter: they lead to remarkable +differences in hierarchical organization and scaling behavior, growing with +more data or network capacity. It also provides insight into fine-tuning: how +pre-trained models change when adapting to small datasets. We found brain-like +hierarchically organized network suffer less from catastrophic forgetting after +fine-tuned.",cs.CV,['cs.CV'] +Revisiting Spatial-Frequency Information Integration from a Hierarchical Perspective for Panchromatic and Multi-Spectral Image Fusion,Jiangtong Tan · Jie Huang · Naishan Zheng · Man Zhou · Keyu Yan · Danfeng Hong · Feng Zhao, ,,https://ieeexplore.ieee.org/document/10443302,,,,,nan +Prompting Vision Foundation Models for Pathology Image Analysis,CHONG YIN · Siqi Liu · Kaiyang Zhou · Vincent Wong · Pong C. Yuen, ,https://arxiv.org/abs/2403.16497,,2403.16497.pdf,PathoTune: Adapting Visual Foundation Model to Pathological Specialists,"As natural image understanding moves towards the pretrain-finetune era, +research in pathology imaging is concurrently evolving. Despite the predominant +focus on pretraining pathological foundation models, how to adapt foundation +models to downstream tasks is little explored. For downstream adaptation, we +propose the existence of two domain gaps, i.e., the Foundation-Task Gap and the +Task-Instance Gap. To mitigate these gaps, we introduce PathoTune, a framework +designed to efficiently adapt pathological or even visual foundation models to +pathology-specific tasks via multi-modal prompt tuning. The proposed framework +leverages Task-specific Visual Prompts and Task-specific Textual Prompts to +identify task-relevant features, along with Instance-specific Visual Prompts +for encoding single pathological image features. Results across multiple +datasets at both patch-level and WSI-level demonstrate its superior performance +over single-modality prompt tuning approaches. Significantly, PathoTune +facilitates the direct adaptation of natural visual foundation models to +pathological tasks, drastically outperforming pathological foundation models +with simple linear probing. 
The code will be available upon acceptance.",cs.CV,"['cs.CV', 'cs.LG']" +DNGaussian: Optimizing Sparse-View 3D Gaussian Radiance Fields with Global-Local Depth Normalization,Jiahe Li · Jiawei Zhang · Xiao Bai · Jin Zheng · Xin Ning · Jun Zhou · Lin Gu,https://fictionarry.github.io/DNGaussian/,https://arxiv.org/abs/2403.06912,,2403.06912.pdf,DNGaussian: Optimizing Sparse-View 3D Gaussian Radiance Fields with Global-Local Depth Normalization,"Radiance fields have demonstrated impressive performance in synthesizing +novel views from sparse input views, yet prevailing methods suffer from high +training costs and slow inference speed. This paper introduces DNGaussian, a +depth-regularized framework based on 3D Gaussian radiance fields, offering +real-time and high-quality few-shot novel view synthesis at low costs. Our +motivation stems from the highly efficient representation and surprising +quality of the recent 3D Gaussian Splatting, despite it will encounter a +geometry degradation when input views decrease. In the Gaussian radiance +fields, we find this degradation in scene geometry primarily lined to the +positioning of Gaussian primitives and can be mitigated by depth constraint. +Consequently, we propose a Hard and Soft Depth Regularization to restore +accurate scene geometry under coarse monocular depth supervision while +maintaining a fine-grained color appearance. To further refine detailed +geometry reshaping, we introduce Global-Local Depth Normalization, enhancing +the focus on small local depth changes. Extensive experiments on LLFF, DTU, and +Blender datasets demonstrate that DNGaussian outperforms state-of-the-art +methods, achieving comparable or better results with significantly reduced +memory cost, a $25 \times$ reduction in training time, and over $3000 \times$ +faster rendering speed.",cs.CV,['cs.CV'] +Hierarchical Diffusion Policy for Kinematics-Aware Multi-Task Robotic Manipulation,Xiao Ma · Sumit Patidar · Iain Haughton · Stephen James,https://yusufma03.github.io/projects/hdp/,https://arxiv.org/abs/2403.03890v1,,2403.03890v1.pdf,Hierarchical Diffusion Policy for Kinematics-Aware Multi-Task Robotic Manipulation,"This paper introduces Hierarchical Diffusion Policy (HDP), a hierarchical +agent for multi-task robotic manipulation. HDP factorises a manipulation policy +into a hierarchical structure: a high-level task-planning agent which predicts +a distant next-best end-effector pose (NBP), and a low-level goal-conditioned +diffusion policy which generates optimal motion trajectories. The factorised +policy representation allows HDP to tackle both long-horizon task planning +while generating fine-grained low-level actions. To generate context-aware +motion trajectories while satisfying robot kinematics constraints, we present a +novel kinematics-aware goal-conditioned control agent, Robot Kinematics +Diffuser (RK-Diffuser). Specifically, RK-Diffuser learns to generate both the +end-effector pose and joint position trajectories, and distill the accurate but +kinematics-unaware end-effector pose diffuser to the kinematics-aware but less +accurate joint position diffuser via differentiable kinematics. 
Empirically, we +show that HDP achieves a significantly higher success rate than the +state-of-the-art methods in both simulation and real-world.",cs.RO,"['cs.RO', 'cs.AI', 'cs.CV', 'cs.LG']" +Align before Adapt: Leveraging Entity-to-Region Alignments for Generalizable Video Action Recognition,Yifei Chen · Dapeng Chen · Ruijin Liu · Sai Zhou · Wenyuan Xue · Wei Peng, ,https://arxiv.org/abs/2311.15619,,2311.15619.pdf,Align before Adapt: Leveraging Entity-to-Region Alignments for Generalizable Video Action Recognition,"Large-scale visual-language pre-trained models have achieved significant +success in various video tasks. However, most existing methods follow an ""adapt +then align"" paradigm, which adapts pre-trained image encoders to model +video-level representations and utilizes one-hot or text embedding of the +action labels for supervision. This paradigm overlooks the challenge of mapping +from static images to complicated activity concepts. In this paper, we propose +a novel ""Align before Adapt"" (ALT) paradigm. Prior to adapting to video +representation learning, we exploit the entity-to-region alignments for each +frame. The alignments are fulfilled by matching the region-aware image +embeddings to an offline-constructed text corpus. With the aligned entities, we +feed their text embeddings to a transformer-based video adapter as the queries, +which can help extract the semantics of the most important entities from a +video to a vector. This paradigm reuses the visual-language alignment of VLP +during adaptation and tries to explain an action by the underlying entities. +This helps understand actions by bridging the gap with complex activity +semantics, particularly when facing unfamiliar or unseen categories. ALT +demonstrates competitive performance while maintaining remarkably low +computational costs. In fully supervised experiments, it achieves 88.1% top-1 +accuracy on Kinetics-400 with only 4947 GFLOPs. Moreover, ALT outperforms the +previous state-of-the-art methods in both zero-shot and few-shot experiments, +emphasizing its superior generalizability across various learning scenarios.",cs.CV,"['cs.CV', 'cs.AI']" +Discriminative Probing and Tuning for Text-to-Image Generation,Leigang Qu · Wenjie Wang · Yongqi Li · Hanwang Zhang · Liqiang Nie · Tat-seng Chua,https://dpt-t2i.github.io/,https://arxiv.org/abs/2403.04321,,2403.04321.pdf,Discriminative Probing and Tuning for Text-to-Image Generation,"Despite advancements in text-to-image generation (T2I), prior methods often +face text-image misalignment problems such as relation confusion in generated +images. Existing solutions involve cross-attention manipulation for better +compositional understanding or integrating large language models for improved +layout planning. However, the inherent alignment capabilities of T2I models are +still inadequate. By reviewing the link between generative and discriminative +modeling, we posit that T2I models' discriminative abilities may reflect their +text-image alignment proficiency during generation. In this light, we advocate +bolstering the discriminative abilities of T2I models to achieve more precise +text-to-image alignment for generation. We present a discriminative adapter +built on T2I models to probe their discriminative abilities on two +representative tasks and leverage discriminative fine-tuning to improve their +text-image alignment. 
As a bonus of the discriminative adapter, a +self-correction mechanism can leverage discriminative gradients to better align +generated images to text prompts during inference. Comprehensive evaluations +across three benchmark datasets, including both in-distribution and +out-of-distribution scenarios, demonstrate our method's superior generation +performance. Meanwhile, it achieves state-of-the-art discriminative performance +on the two discriminative tasks compared to other generative models.",cs.CV,"['cs.CV', 'cs.AI', 'cs.CL', 'cs.MM']" +CLIP-KD: An Empirical Study of CLIP Model Distillation,Chuanguang Yang · Zhulin An · Libo Huang · Junyu Bi · XinQiang Yu · Han Yang · boyu diao · Yongjun Xu, ,https://arxiv.org/abs/2307.12732,,2307.12732.pdf,CLIP-KD: An Empirical Study of CLIP Model Distillation,"Contrastive Language-Image Pre-training (CLIP) has become a promising +language-supervised visual pre-training framework. This paper aims to distill +small CLIP models supervised by a large teacher CLIP model. We propose several +distillation strategies, including relation, feature, gradient and contrastive +paradigms, to examine the effectiveness of CLIP-Knowledge Distillation (KD). We +show that a simple feature mimicry with Mean Squared Error loss works +surprisingly well. Moreover, interactive contrastive learning across teacher +and student encoders is also effective in performance improvement. We explain +that the success of CLIP-KD can be attributed to maximizing the feature +similarity between teacher and student. The unified method is applied to +distill several student models trained on CC3M+12M. CLIP-KD improves student +CLIP models consistently over zero-shot ImageNet classification and cross-modal +retrieval benchmarks. When using ViT-L/14 pretrained on Laion-400M as the +teacher, CLIP-KD achieves 57.5\% and 55.4\% zero-shot top-1 ImageNet accuracy +over ViT-B/16 and ResNet-50, surpassing the original CLIP without KD by 20.5\% +and 20.1\% margins, respectively. Our code is released on +https://github.com/winycg/CLIP-KD.",cs.CV,['cs.CV'] +FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation,Shuai Yang · Yifan Zhou · Ziwei Liu · Chen Change Loy,https://www.mmlab-ntu.com/project/fresco/,https://arxiv.org/abs/2403.12962,,2403.12962.pdf,FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation,"The remarkable efficacy of text-to-image diffusion models has motivated +extensive exploration of their potential application in video domains. +Zero-shot methods seek to extend image diffusion models to videos without +necessitating model training. Recent methods mainly focus on incorporating +inter-frame correspondence into attention mechanisms. However, the soft +constraint imposed on determining where to attend to valid features can +sometimes be insufficient, resulting in temporal inconsistency. In this paper, +we introduce FRESCO, intra-frame correspondence alongside inter-frame +correspondence to establish a more robust spatial-temporal constraint. This +enhancement ensures a more consistent transformation of semantically similar +content across frames. Beyond mere attention guidance, our approach involves an +explicit update of features to achieve high spatial-temporal consistency with +the input video, significantly improving the visual coherence of the resulting +translated videos. 
Extensive experiments demonstrate the effectiveness of our +proposed framework in producing high-quality, coherent videos, marking a +notable improvement over existing zero-shot methods.",cs.CV,['cs.CV'] +XFibrosis: Explicit Vessel-Fiber Modeling for Fibrosis Staging from Liver Pathology Images,CHONG YIN · Siqi Liu · Fei Lyu · Jiahao Lu · Sune Darkner · Vincent Wong · Pong C. Yuen, ,,https://www.youtube.com/watch?v=_Yiu5g71ZHo,,,,,nan +LocLLM: Exploiting Generalizable Human Keypoint Localization via Large Language Model,Dongkai Wang · shiyu xuan · Shiliang Zhang, ,https://arxiv.org/abs/2310.00582,,2310.00582.pdf,Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs,"Multi-modal Large Language Models (MLLMs) have shown remarkable capabilities +in various multi-modal tasks. Nevertheless, their performance in fine-grained +image understanding tasks is still limited. To address this issue, this paper +proposes a new framework to enhance the fine-grained image understanding +abilities of MLLMs. Specifically, we present a new method for constructing the +instruction tuning dataset at a low cost by leveraging annotations in existing +datasets. A self-consistent bootstrapping method is also introduced to extend +existing dense object annotations into high-quality +referring-expression-bounding-box pairs. These methods enable the generation of +high-quality instruction data which includes a wide range of fundamental +abilities essential for fine-grained image perception. Moreover, we argue that +the visual encoder should be tuned during instruction tuning to mitigate the +gap between full image perception and fine-grained image perception. +Experimental results demonstrate the superior performance of our method. For +instance, our model exhibits a 5.2% accuracy improvement over Qwen-VL on GQA +and surpasses the accuracy of Kosmos-2 by 24.7% on RefCOCO_val. We have also +attained the top rank on the leaderboard of MMBench. This promising performance +is achieved by training on only publicly available data, making it easily +reproducible. The models, datasets, and codes are publicly available at +https://github.com/SY-Xuan/Pink.",cs.CV,"['cs.CV', 'cs.AI']" +Producing and Leveraging Online Map Uncertainty in Trajectory Prediction,Xunjiang Gu · Guanyu Song · Igor Gilitschenski · Marco Pavone · Boris Ivanovic,https://github.com/alfredgu001324/MapUncertaintyPrediction,https://arxiv.org/abs/2403.16439v1,,2403.16439v1.pdf,Producing and Leveraging Online Map Uncertainty in Trajectory Prediction,"High-definition (HD) maps have played an integral role in the development of +modern autonomous vehicle (AV) stacks, albeit with high associated labeling and +maintenance costs. As a result, many recent works have proposed methods for +estimating HD maps online from sensor data, enabling AVs to operate outside of +previously-mapped regions. However, current online map estimation approaches +are developed in isolation of their downstream tasks, complicating their +integration in AV stacks. In particular, they do not produce uncertainty or +confidence estimates. In this work, we extend multiple state-of-the-art online +map estimation methods to additionally estimate uncertainty and show how this +enables more tightly integrating online mapping with trajectory forecasting. 
In +doing so, we find that incorporating uncertainty yields up to 50% faster +training convergence and up to 15% better prediction performance on the +real-world nuScenes driving dataset.",cs.RO,"['cs.RO', 'cs.CV', 'cs.LG']" +GenZI: Zero-Shot 3D Human-Scene Interaction Generation,Lei Li · Angela Dai,https://craigleili.github.io/projects/genzi/,https://arxiv.org/abs/2311.17737,,2311.17737.pdf,GenZI: Zero-Shot 3D Human-Scene Interaction Generation,"Can we synthesize 3D humans interacting with scenes without learning from any +3D human-scene interaction data? We propose GenZI, the first zero-shot approach +to generating 3D human-scene interactions. Key to GenZI is our distillation of +interaction priors from large vision-language models (VLMs), which have learned +a rich semantic space of 2D human-scene compositions. Given a natural language +description and a coarse point location of the desired interaction in a 3D +scene, we first leverage VLMs to imagine plausible 2D human interactions +inpainted into multiple rendered views of the scene. We then formulate a robust +iterative optimization to synthesize the pose and shape of a 3D human model in +the scene, guided by consistency with the 2D interaction hypotheses. In +contrast to existing learning-based approaches, GenZI circumvents the +conventional need for captured 3D interaction data, and allows for flexible +control of the 3D interaction synthesis with easy-to-use text prompts. +Extensive experiments show that our zero-shot approach has high flexibility and +generality, making it applicable to diverse scene types, including both indoor +and outdoor environments.",cs.CV,"['cs.CV', 'cs.GR']" +LPSNet: End-to-End Human Pose and Shape Estimation with Lensless Imaging,Haoyang Ge · Qiao Feng · Hailong Jia · Xiongzheng Li · Xiangjun Yin · You Zhou · Jingyu Yang · Kun Li,https://cic.tju.edu.cn/faculty/likun/projects/LPSNet/index.html,https://arxiv.org/abs/2404.01941,,2404.01941.pdf,LPSNet: End-to-End Human Pose and Shape Estimation with Lensless Imaging,"Human pose and shape (HPS) estimation with lensless imaging is not only +beneficial to privacy protection but also can be used in covert surveillance +scenarios due to the small size and simple structure of this device. However, +this task presents significant challenges due to the inherent ambiguity of the +captured measurements and lacks effective methods for directly estimating human +pose and shape from lensless data. In this paper, we propose the first +end-to-end framework to recover 3D human poses and shapes from lensless +measurements to our knowledge. We specifically design a multi-scale lensless +feature decoder to decode the lensless measurements through the optically +encoded mask for efficient feature extraction. We also propose a double-head +auxiliary supervision mechanism to improve the estimation accuracy of human +limb ends. Besides, we establish a lensless imaging system and verify the +effectiveness of our method on various datasets acquired by our lensless +imaging system.",cs.CV,['cs.CV'] +Can I Trust Your Answer? Visually Grounded Video Question Answering,Junbin Xiao · Angela Yao · Yicong Li · Tat-seng Chua, ,https://arxiv.org/abs/2309.01327,,2309.01327.pdf,Can I Trust Your Answer? Visually Grounded Video Question Answering,"We study visually grounded VideoQA in response to the emerging trends of +utilizing pretraining techniques for video-language understanding. 
+Specifically, by forcing vision-language models (VLMs) to answer questions and +simultaneously provide visual evidence, we seek to ascertain the extent to +which the predictions of such techniques are genuinely anchored in relevant +video content, versus spurious correlations from language or irrelevant visual +context. Towards this, we construct NExT-GQA -- an extension of NExT-QA with +10.5$K$ temporal grounding (or location) labels tied to the original QA pairs. +With NExT-GQA, we scrutinize a series of state-of-the-art VLMs. Through +post-hoc attention analysis, we find that these models are extremely weak in +substantiating the answers despite their strong QA performance. This exposes +the limitation of current VLMs in making reliable predictions. As a remedy, we +further explore and propose a grounded-QA method via Gaussian mask optimization +and cross-modal learning. Experiments with different backbones demonstrate that +this grounding mechanism improves both grounding and QA. With these efforts, we +aim to push towards trustworthy VLMs in VQA systems. Our dataset and code are +available at https://github.com/doc-doc/NExT-GQA.",cs.CV,"['cs.CV', 'cs.AI', 'cs.MM']" +RNb-NeuS: Reflectance and Normal-based Multi-View 3D Reconstruction,Baptiste Brument · Robin Bruneau · Yvain Queau · Jean Mélou · Francois Lauze · Jean-Denis Durou · Lilian Calvet,https://robinbruneau.github.io/publications/rnb_neus.html,https://arxiv.org/abs/2312.01215,,2312.01215.pdf,RNb-NeuS: Reflectance and Normal-based Multi-View 3D Reconstruction,"This paper introduces a versatile paradigm for integrating multi-view +reflectance (optional) and normal maps acquired through photometric stereo. Our +approach employs a pixel-wise joint re-parameterization of reflectance and +normal, considering them as a vector of radiances rendered under simulated, +varying illumination. This re-parameterization enables the seamless integration +of reflectance and normal maps as input data in neural volume rendering-based +3D reconstruction while preserving a single optimization objective. In +contrast, recent multi-view photometric stereo (MVPS) methods depend on +multiple, potentially conflicting objectives. Despite its apparent simplicity, +our proposed approach outperforms state-of-the-art approaches in MVPS +benchmarks across F-score, Chamfer distance, and mean angular error metrics. +Notably, it significantly improves the detailed 3D reconstruction of areas with +high curvature or low visibility.",cs.CV,['cs.CV'] +Multimodal Sense-Informed Prediction of 3D Human Motions,Zhenyu Lou · Qiongjie Cui · Haofan Wang · Xu Tang · Hong Zhou, ,https://arxiv.org/abs/2405.02911,,2405.02911.pdf,Multimodal Sense-Informed Prediction of 3D Human Motions,"Predicting future human pose is a fundamental application for machine +intelligence, which drives robots to plan their behavior and paths ahead of +time to seamlessly accomplish human-robot collaboration in real-world 3D +scenarios. Despite encouraging results, existing approaches rarely consider the +effects of the external scene on the motion sequence, leading to pronounced +artifacts and physical implausibilities in the predictions. To address this +limitation, this work introduces a novel multi-modal sense-informed motion +prediction approach, which conditions high-fidelity generation on two modal +information: external 3D scene, and internal human gaze, and is able to +recognize their salience for future human activity. 
Furthermore, the gaze +information is regarded as the human intention, and combined with both motion +and scene features, we construct a ternary intention-aware attention to +supervise the generation to match where the human wants to reach. Meanwhile, we +introduce semantic coherence-aware attention to explicitly distinguish the +salient point clouds and the underlying ones, to ensure a reasonable +interaction of the generated sequence with the 3D scene. On two real-world +benchmarks, the proposed method achieves state-of-the-art performance both in +3D human pose and trajectory prediction.",cs.CV,['cs.CV'] +SOK-Bench: A Situated Video Reasoning Benchmark with Aligned Open-World Knowledge,Andong Wang · Bo Wu · Sunli Chen · Zhenfang Chen · Haotian Guan · Wei-Ning Lee · Li Erran Li · Chuang Gan, ,https://arxiv.org/abs/2405.09713,,2405.09713.pdf,SOK-Bench: A Situated Video Reasoning Benchmark with Aligned Open-World Knowledge,"Learning commonsense reasoning from visual contexts and scenes in real-world +is a crucial step toward advanced artificial intelligence. However, existing +video reasoning benchmarks are still inadequate since they were mainly designed +for factual or situated reasoning and rarely involve broader knowledge in the +real world. Our work aims to delve deeper into reasoning evaluations, +specifically within dynamic, open-world, and structured context knowledge. We +propose a new benchmark (SOK-Bench), consisting of 44K questions and 10K +situations with instance-level annotations depicted in the videos. The +reasoning process is required to understand and apply situated knowledge and +general knowledge for problem-solving. To create such a dataset, we propose an +automatic and scalable generation method to generate question-answer pairs, +knowledge graphs, and rationales by instructing the combinations of LLMs and +MLLMs. Concretely, we first extract observable situated entities, relations, +and processes from videos for situated knowledge and then extend to open-world +knowledge beyond the visible content. The task generation is facilitated +through multiple dialogues as iterations and subsequently corrected and refined +by our designed self-promptings and demonstrations. With a corpus of both +explicit situated facts and implicit commonsense, we generate associated +question-answer pairs and reasoning processes, finally followed by manual +reviews for quality assurance. We evaluated recent mainstream large +vision-language models on the benchmark and found several insightful +conclusions. For more information, please refer to our benchmark at +www.bobbywu.com/SOKBench.",cs.CV,"['cs.CV', 'cs.AI', 'cs.CL']" +SURE: SUrvey REcipes for building reliable and robust deep networks,Yuting Li · Yingyi Chen · Xuanlong Yu · Dexiong Chen · Xi Shen,https://yutingli0606.github.io/SURE/,https://arxiv.org/abs/2403.00543,,2403.00543.pdf,SURE: SUrvey REcipes for building reliable and robust deep networks,"In this paper, we revisit techniques for uncertainty estimation within deep +neural networks and consolidate a suite of techniques to enhance their +reliability. Our investigation reveals that an integrated application of +diverse techniques--spanning model regularization, classifier and +optimization--substantially improves the accuracy of uncertainty predictions in +image classification tasks. The synergistic effect of these techniques +culminates in our novel SURE approach. 
We rigorously evaluate SURE against the +benchmark of failure prediction, a critical testbed for uncertainty estimation +efficacy. Our results showcase a consistently better performance than models +that individually deploy each technique, across various datasets and model +architectures. When applied to real-world challenges, such as data corruption, +label noise, and long-tailed class distribution, SURE exhibits remarkable +robustness, delivering results that are superior or on par with current +state-of-the-art specialized methods. Particularly on Animal-10N and Food-101N +for learning with noisy labels, SURE achieves state-of-the-art performance +without any task-specific adjustments. This work not only sets a new benchmark +for robust uncertainty estimation but also paves the way for its application in +diverse, real-world scenarios where reliability is paramount. Our code is +available at \url{https://yutingli0606.github.io/SURE/}.",cs.CV,['cs.CV'] +"ShapeMatcher: Self-Supervised Joint Shape Canonicalization, Segmentation, Retrieval and Deformation",Yan Di · Chenyangguang Zhang · Chaowei Wang · Ruida Zhang · Guangyao Zhai · Yanyan Li · Bowen Fu · Xiangyang Ji · Shan Gao, ,https://arxiv.org/abs/2311.11106,,2311.11106.pdf,"ShapeMatcher: Self-Supervised Joint Shape Canonicalization, Segmentation, Retrieval and Deformation","In this paper, we present ShapeMatcher, a unified self-supervised learning +framework for joint shape canonicalization, segmentation, retrieval and +deformation. Given a partially-observed object in an arbitrary pose, we first +canonicalize the object by extracting point-wise affine-invariant features, +disentangling inherent structure of the object with its pose and size. These +learned features are then leveraged to predict semantically consistent part +segmentation and corresponding part centers. Next, our lightweight retrieval +module aggregates the features within each part as its retrieval token and +compare all the tokens with source shapes from a pre-established database to +identify the most geometrically similar shape. Finally, we deform the retrieved +shape in the deformation module to tightly fit the input object by harnessing +part center guided neural cage deformation. The key insight of ShapeMaker is +the simultaneous training of the four highly-associated processes: +canonicalization, segmentation, retrieval, and deformation, leveraging +cross-task consistency losses for mutual supervision. Extensive experiments on +synthetic datasets PartNet, ComplementMe, and real-world dataset Scan2CAD +demonstrate that ShapeMaker surpasses competitors by a large margin.",cs.CV,['cs.CV'] +DiffusionAvatars: Deferred Diffusion for High-fidelity 3D Head Avatars,Tobias Kirschstein · Simon Giebenhain · Matthias Nießner,https://tobias-kirschstein.github.io/diffusion-avatars/,https://arxiv.org/abs/2311.18635,,2311.18635.pdf,DiffusionAvatars: Deferred Diffusion for High-fidelity 3D Head Avatars,"DiffusionAvatars synthesizes a high-fidelity 3D head avatar of a person, +offering intuitive control over both pose and expression. We propose a +diffusion-based neural renderer that leverages generic 2D priors to produce +compelling images of faces. For coarse guidance of the expression and head +pose, we render a neural parametric head model (NPHM) from the target +viewpoint, which acts as a proxy geometry of the person. 
Additionally, to +enhance the modeling of intricate facial expressions, we condition +DiffusionAvatars directly on the expression codes obtained from NPHM via +cross-attention. Finally, to synthesize consistent surface details across +different viewpoints and expressions, we rig learnable spatial features to the +head's surface via TriPlane lookup in NPHM's canonical space. We train +DiffusionAvatars on RGB videos and corresponding fitted NPHM meshes of a person +and test the obtained avatars in both self-reenactment and animation scenarios. +Our experiments demonstrate that DiffusionAvatars generates temporally +consistent and visually appealing videos for novel poses and expressions of a +person, outperforming existing approaches.",cs.CV,['cs.CV'] +PREGO: online mistake detection in PRocedural EGOcentric videos,Alessandro Flaborea · Guido M. D'Amely di Melendugno · Leonardo Plini · Luca Scofano · Edoardo De Matteis · Antonino Furnari · Giovanni Maria Farinella · Fabio Galasso,https://github.com/aleflabo/PREGO,https://arxiv.org/abs/2404.01933,,,PREGO: online mistake detection in PRocedural EGOcentric videos,"Promptly identifying procedural errors from egocentric videos in an online +setting is highly challenging and valuable for detecting mistakes as soon as +they happen. This capability has a wide range of applications across various +fields, such as manufacturing and healthcare. The nature of procedural mistakes +is open-set since novel types of failures might occur, which calls for +one-class classifiers trained on correctly executed procedures. However, no +technique can currently detect open-set procedural mistakes online. We propose +PREGO, the first online one-class classification model for mistake detection in +PRocedural EGOcentric videos. PREGO is based on an online action recognition +component to model the current action, and a symbolic reasoning module to +predict the next actions. Mistake detection is performed by comparing the +recognized current action with the expected future one. We evaluate PREGO on +two procedural egocentric video datasets, Assembly101 and Epic-tent, which we +adapt for online benchmarking of procedural mistake detection to establish +suitable benchmarks, thus defining the Assembly101-O and Epic-tent-O datasets, +respectively.",cs.CV,['cs.CV'] +TEA: Test-time Energy Adaptation,Yige Yuan · Bingbing Xu · Liang Hou · Fei Sun · Huawei Shen · Xueqi Cheng, ,https://arxiv.org/abs/2311.14402,,2311.14402.pdf,TEA: Test-time Energy Adaptation,"Test-time adaptation (TTA) aims to improve model generalizability when test +data diverges from training distribution, offering the distinct advantage of +not requiring access to training data and processes, especially valuable in the +context of large pre-trained models. However, current TTA methods fail to +address the fundamental issue: covariate shift, i.e., the decreased +generalizability can be attributed to the model's reliance on the marginal +distribution of the training data, which may impair model calibration and +introduce confirmation bias. To address this, we propose a novel energy-based +perspective, enhancing the model's perception of target data distributions +without requiring access to training data or processes.
Building on this +perspective, we introduce $\textbf{T}$est-time $\textbf{E}$nergy +$\textbf{A}$daptation ($\textbf{TEA}$), which transforms the trained classifier +into an energy-based model and aligns the model's distribution with the test +data's, enhancing its ability to perceive test distributions and thus improving +overall generalizability. Extensive experiments across multiple tasks, +benchmarks and architectures demonstrate TEA's superior generalization +performance against state-of-the-art methods. Further in-depth analyses reveal +that TEA can equip the model with a comprehensive perception of test +distribution, ultimately paving the way toward improved generalization and +calibration.",cs.LG,['cs.LG'] +A&B BNN: Add&Bit-Operation-Only Hardware-Friendly Binary Neural Network,Ruichen Ma · Guanchao Qiao · Yian Liu · Liwei Meng · Ning Ning · Yang Liu · Shaogang Hu,https://github.com/Ruichen0424/AB-BNN,https://arxiv.org/abs/2403.03739,,2403.03739.pdf,A&B BNN: Add&Bit-Operation-Only Hardware-Friendly Binary Neural Network,"Binary neural networks utilize 1-bit quantized weights and activations to +reduce both the model's storage demands and computational burden. However, +advanced binary architectures still incorporate millions of inefficient and +nonhardware-friendly full-precision multiplication operations. A&B BNN is +proposed to directly remove part of the multiplication operations in a +traditional BNN and replace the rest with an equal number of bit operations, +introducing the mask layer and the quantized RPReLU structure based on the +normalizer-free network architecture. The mask layer can be removed during +inference by leveraging the intrinsic characteristics of BNN with +straightforward mathematical transformations to avoid the associated +multiplication operations. The quantized RPReLU structure enables more +efficient bit operations by constraining its slope to be integer powers of 2. +Experimental results achieved 92.30%, 69.35%, and 66.89% on the CIFAR-10, +CIFAR-100, and ImageNet datasets, respectively, which are competitive with the +state-of-the-art. Ablation studies have verified the efficacy of the quantized +RPReLU structure, leading to a 1.14% enhancement on the ImageNet compared to +using a fixed slope RLeakyReLU. The proposed add&bit-operation-only BNN offers +an innovative approach for hardware-friendly network architecture.",cs.LG,"['cs.LG', 'cs.AI']" +"Towards Surveillance Video-and-Language Understanding: New Dataset, Baselines, and Challenges",Tongtong Yuan · Xuange Zhang · Kun Liu · Bo Liu · Chen Chen · Jian Jin · Zhenzhen Jiao, ,https://arxiv.org/abs/2309.13925,,2309.13925.pdf,"Towards Surveillance Video-and-Language Understanding: New Dataset, Baselines, and Challenges","Surveillance videos are an essential component of daily life with various +critical applications, particularly in public security. However, current +surveillance video tasks mainly focus on classifying and localizing anomalous +events. Existing methods are limited to detecting and classifying the +predefined events with unsatisfactory semantic understanding, although they +have obtained considerable performance. To address this issue, we propose a new +research direction of surveillance video-and-language understanding, and +construct the first multimodal surveillance video dataset. We manually annotate +the real-world surveillance dataset UCF-Crime with fine-grained event content +and timing. 
Our newly annotated dataset, UCA (UCF-Crime Annotation), contains +23,542 sentences, with an average length of 20 words, and its annotated videos +are as long as 110.7 hours. Furthermore, we benchmark SOTA models for four +multimodal tasks on this newly created dataset, which serve as new baselines +for surveillance video-and-language understanding. Through our experiments, we +find that mainstream models used in previously publicly available datasets +perform poorly on surveillance video, which demonstrates the new challenges in +surveillance video-and-language understanding. To validate the effectiveness of +our UCA, we conducted experiments on multimodal anomaly detection. The results +demonstrate that our multimodal surveillance learning can improve the +performance of conventional anomaly detection tasks. All the experiments +highlight the necessity of constructing this dataset to advance surveillance +AI. The link to our dataset is provided at: +https://xuange923.github.io/Surveillance-Video-Understanding.",cs.CV,"['cs.CV', 'cs.AI']" +Validating Privacy-Preserving Face Recognition under a Minimum Assumption,Hui Zhang · Xingbo Dong · YenLungLai · Ying Zhou · Xiaoyan ZHANG · Xingguo Lv · Zhe Jin · Xuejun Li, ,https://arxiv.org/abs/2403.12457,,2403.12457.pdf,Privacy-Preserving Face Recognition Using Trainable Feature Subtraction,"The widespread adoption of face recognition has led to increasing privacy +concerns, as unauthorized access to face images can expose sensitive personal +information. This paper explores face image protection against viewing and +recovery attacks. Inspired by image compression, we propose creating a visually +uninformative face image through feature subtraction between an original face +and its model-produced regeneration. Recognizable identity features within the +image are encouraged by co-training a recognition model on its high-dimensional +feature representation. To enhance privacy, the high-dimensional representation +is crafted through random channel shuffling, resulting in randomized +recognizable images devoid of attacker-leverageable texture details. We distill +our methodologies into a novel privacy-preserving face recognition method, +MinusFace. Experiments demonstrate its high recognition accuracy and effective +privacy protection. Its code is available at https://github.com/Tencent/TFace.",cs.CV,['cs.CV'] +One-Shot Open Affordance Learning with Foundation Models,Gen Li · Deqing Sun · Laura Sevilla-Lara · Varun Jampani, ,https://arxiv.org/abs/2311.17776v1,,2311.17776v1.pdf,One-Shot Open Affordance Learning with Foundation Models,"We introduce One-shot Open Affordance Learning (OOAL), where a model is +trained with just one example per base object category, but is expected to +identify novel objects and affordances. While vision-language models excel at +recognizing novel objects and scenes, they often struggle to understand finer +levels of granularity such as affordances. To handle this issue, we conduct a +comprehensive analysis of existing foundation models, to explore their inherent +understanding of affordances and assess the potential for data-limited +affordance learning. We then propose a vision-language framework with simple +and effective designs that boost the alignment between visual features and +affordance text embeddings. 
Experiments on two affordance segmentation +benchmarks show that the proposed method outperforms state-of-the-art models +with less than 1% of the full training data, and exhibits reasonable +generalization capability on unseen objects and affordances.",cs.CV,['cs.CV'] +Automatic Controllable Colorization via Imagination,Xiaoyan Cong · Yue Wu · Qifeng Chen · Chenyang Lei, ,https://arxiv.org/abs/2404.05661,,2404.05661.pdf,Automatic Controllable Colorization via Imagination,"We propose a framework for automatic colorization that allows for iterative +editing and modifications. The core of our framework lies in an imagination +module: by understanding the content within a grayscale image, we utilize a +pre-trained image generation model to generate multiple images that contain the +same content. These images serve as references for coloring, mimicking the +process of human experts. As the synthesized images can be imperfect or +different from the original grayscale image, we propose a Reference Refinement +Module to select the optimal reference composition. Unlike most previous +end-to-end automatic colorization algorithms, our framework allows for +iterative and localized modifications of the colorization results because we +explicitly model the coloring samples. Extensive experiments demonstrate the +superiority of our framework over existing automatic colorization algorithms in +editability and flexibility. Project page: +https://xy-cong.github.io/imagine-colorization.",cs.CV,['cs.CV'] +GAFusion: Adaptive Fusing LiDAR and Camera with Multiple Guidance for 3D Object Detection,Xiaotian Li · Baojie Fan · Jiandong Tian · Huijie Fan, ,https://arxiv.org/abs/2309.11804,,2309.11804.pdf,FGFusion: Fine-Grained Lidar-Camera Fusion for 3D Object Detection,"Lidars and cameras are critical sensors that provide complementary +information for 3D detection in autonomous driving. While most prevalent +methods progressively downscale the 3D point clouds and camera images and then +fuse the high-level features, the downscaled features inevitably lose low-level +detailed information. In this paper, we propose Fine-Grained Lidar-Camera +Fusion (FGFusion) that make full use of multi-scale features of image and point +cloud and fuse them in a fine-grained way. First, we design a dual pathway +hierarchy structure to extract both high-level semantic and low-level detailed +features of the image. Second, an auxiliary network is introduced to guide +point cloud features to better learn the fine-grained spatial information. +Finally, we propose multi-scale fusion (MSF) to fuse the last N feature maps of +image and point cloud. Extensive experiments on two popular autonomous driving +benchmarks, i.e. KITTI and Waymo, demonstrate the effectiveness of our method.",cs.CV,['cs.CV'] +Open Vocabulary Semantic Scene Sketch Understanding,Ahmed Bourouis · Judith Fan · Yulia Gryaditskaya,https://ahmedbourouis.github.io/Scene_Sketch_Segmentation/,https://arxiv.org/abs/2312.12463,,2312.12463.pdf,Open Vocabulary Semantic Scene Sketch Understanding,"We study the underexplored but fundamental vision problem of machine +understanding of abstract freehand scene sketches. We introduce a sketch +encoder that results in semantically-aware feature space, which we evaluate by +testing its performance on a semantic sketch segmentation task. To train our +model we rely only on the availability of bitmap sketches with their brief +captions and do not require any pixel-level annotations. 
To obtain +generalization to a large set of sketches and categories, we build on a vision +transformer encoder pretrained with the CLIP model. We freeze the text encoder +and perform visual-prompt tuning of the visual encoder branch while introducing +a set of critical modifications. Firstly, we augment the classical key-query +(k-q) self-attention blocks with value-value (v-v) self-attention blocks. +Central to our model is a two-level hierarchical network design that enables +efficient semantic disentanglement: The first level ensures holistic scene +sketch encoding, and the second level focuses on individual categories. We, +then, in the second level of the hierarchy, introduce a cross-attention between +textual and visual branches. Our method outperforms zero-shot CLIP pixel +accuracy of segmentation results by 37 points, reaching an accuracy of $85.5\%$ +on the FS-COCO sketch dataset. Finally, we conduct a user study that allows us +to identify further improvements needed over our method to reconcile machine +and human understanding of scene sketches.",cs.CV,['cs.CV'] +View From Above: Orthogonal viewpoint aware Cross-view Localization,Shan Wang · Chuong Nguyen · Jiawei Liu · Yanhao Zhang · Sundaram Muthu · Fahira Afzal Maken · Kaihao Zhang · Hongdong Li, ,https://arxiv.org/abs/2308.08110,,2308.08110.pdf,View Consistent Purification for Accurate Cross-View Localization,"This paper proposes a fine-grained self-localization method for outdoor +robotics that utilizes a flexible number of onboard cameras and readily +accessible satellite images. The proposed method addresses limitations in +existing cross-view localization methods that struggle to handle noise sources +such as moving objects and seasonal variations. It is the first sparse +visual-only method that enhances perception in dynamic environments by +detecting view-consistent key points and their corresponding deep features from +ground and satellite views, while removing off-the-ground objects and +establishing homography transformation between the two views. Moreover, the +proposed method incorporates a spatial embedding approach that leverages camera +intrinsic and extrinsic information to reduce the ambiguity of purely visual +matching, leading to improved feature matching and overall pose estimation +accuracy. The method exhibits strong generalization and is robust to +environmental changes, requiring only geo-poses as ground truth. Extensive +experiments on the KITTI and Ford Multi-AV Seasonal datasets demonstrate that +our proposed method outperforms existing state-of-the-art methods, achieving +median spatial accuracy errors below $0.5$ meters along the lateral and +longitudinal directions, and a median orientation accuracy error below 2 +degrees.",cs.CV,['cs.CV'] +OCAI: Improving Optical Flow Estimation by Occlusion and Consistency Aware Interpolation,Jisoo Jeong · Hong Cai · Risheek Garrepalli · Jamie Lin · Munawar Hayat · Fatih Porikli, ,https://arxiv.org/abs/2403.18092,,2403.18092.pdf,OCAI: Improving Optical Flow Estimation by Occlusion and Consistency Aware Interpolation,"The scarcity of ground-truth labels poses one major challenge in developing +optical flow estimation models that are both generalizable and robust. While +current methods rely on data augmentation, they have yet to fully exploit the +rich information available in labeled video sequences. We propose OCAI, a +method that supports robust frame interpolation by generating intermediate +video frames alongside optical flows in between. 
Utilizing a forward warping +approach, OCAI employs occlusion awareness to resolve ambiguities in pixel +values and fills in missing values by leveraging the forward-backward +consistency of optical flows. Additionally, we introduce a teacher-student +style semi-supervised learning method on top of the interpolated frames. Using +a pair of unlabeled frames and the teacher model's predicted optical flow, we +generate interpolated frames and flows to train a student model. The teacher's +weights are maintained using Exponential Moving Averaging of the student. Our +evaluations demonstrate perceptually superior interpolation quality and +enhanced optical flow accuracy on established benchmarks such as Sintel and +KITTI.",cs.CV,['cs.CV'] +GraphDreamer: Compositional 3D Scene Synthesis from Scene Graphs,Gege Gao · Weiyang Liu · Anpei Chen · Andreas Geiger · Bernhard Schölkopf, ,https://arxiv.org/abs/2312.00093,,2312.00093.pdf,GraphDreamer: Compositional 3D Scene Synthesis from Scene Graphs,"As pretrained text-to-image diffusion models become increasingly powerful, +recent efforts have been made to distill knowledge from these text-to-image +pretrained models for optimizing a text-guided 3D model. Most of the existing +methods generate a holistic 3D model from a plain text input. This can be +problematic when the text describes a complex scene with multiple objects, +because the vectorized text embeddings are inherently unable to capture a +complex description with multiple entities and relationships. Holistic 3D +modeling of the entire scene further prevents accurate grounding of text +entities and concepts. To address this limitation, we propose GraphDreamer, a +novel framework to generate compositional 3D scenes from scene graphs, where +objects are represented as nodes and their interactions as edges. By exploiting +node and edge information in scene graphs, our method makes better use of the +pretrained text-to-image diffusion model and is able to fully disentangle +different objects without image-level supervision. To facilitate modeling of +object-wise relationships, we use signed distance fields as representation and +impose a constraint to avoid inter-penetration of objects. To avoid manual +scene graph creation, we design a text prompt for ChatGPT to generate scene +graphs based on text inputs. We conduct both qualitative and quantitative +experiments to validate the effectiveness of GraphDreamer in generating +high-fidelity compositional 3D scenes with disentangled object entities.",cs.CV,"['cs.CV', 'cs.GR', 'cs.LG']" +Visual-Augmented Dynamic Semantic Prototype for Generative Zero-Shot Learning,Wenjin Hou · Shiming Chen · Shuhuang Chen · Ziming Hong · Yan Wang · Xuetao Feng · Salman Khan · Fahad Shahbaz Khan · Xinge You, ,https://arxiv.org/abs/2404.14808v1,,2404.14808v1.pdf,Visual-Augmented Dynamic Semantic Prototype for Generative Zero-Shot Learning,"Generative Zero-shot learning (ZSL) learns a generator to synthesize visual +samples for unseen classes, which is an effective way to advance ZSL. However, +existing generative methods rely on the conditions of Gaussian noise and the +predefined semantic prototype, which limit the generator only optimized on +specific seen classes rather than characterizing each visual instance, +resulting in poor generalizations (\textit{e.g.}, overfitting to seen classes). 
+To address this issue, we propose a novel Visual-Augmented Dynamic Semantic +prototype method (termed VADS) to boost the generator to learn accurate +semantic-visual mapping by fully exploiting the visual-augmented knowledge into +semantic conditions. In detail, VADS consists of two modules: (1) Visual-aware +Domain Knowledge Learning module (VDKL) learns the local bias and global prior +of the visual features (referred to as domain visual knowledge), which replace +pure Gaussian noise to provide richer prior noise information; (2) +Vision-Oriented Semantic Updation module (VOSU) updates the semantic prototype +according to the visual representations of the samples. Ultimately, we +concatenate their output as a dynamic semantic prototype, which serves as the +condition of the generator. Extensive experiments demonstrate that our VADS +achieves superior CZSL and GZSL performances on three prominent datasets and +outperforms other state-of-the-art methods with averaging increases by 6.4\%, +5.9\% and 4.2\% on SUN, CUB and AWA2, respectively.",cs.CV,['cs.CV'] +EfficientDreamer: High-Fidelity and Robust 3D Creation via Orthogonal-view Diffusion Priors,Zhipeng Hu · Minda Zhao · Chaoyi Zhao · Xinyue Liang · Lincheng Li · Zeng Zhao · Changjie Fan · Xiaowei Zhou · Xin Yu, ,https://arxiv.org/abs/2308.13223,,2308.13223.pdf,EfficientDreamer: High-Fidelity and Robust 3D Creation via Orthogonal-view Diffusion Prior,"While image diffusion models have made significant progress in text-driven 3D +content creation, they often fail to accurately capture the intended meaning of +text prompts, especially for view information. This limitation leads to the +Janus problem, where multi-faced 3D models are generated under the guidance of +such diffusion models. In this paper, we propose a robust high-quality 3D +content generation pipeline by exploiting orthogonal-view image guidance. +First, we introduce a novel 2D diffusion model that generates an image +consisting of four orthogonal-view sub-images based on the given text prompt. +Then, the 3D content is created using this diffusion model. Notably, the +generated orthogonal-view image provides strong geometric structure priors and +thus improves 3D consistency. As a result, it effectively resolves the Janus +problem and significantly enhances the quality of 3D content creation. +Additionally, we present a 3D synthesis fusion network that can further improve +the details of the generated 3D contents. Both quantitative and qualitative +evaluations demonstrate that our method surpasses previous text-to-3D +techniques. Project page: https://efficientdreamer.github.io.",cs.CV,['cs.CV'] +Texture-Preserving Diffusion Models for High-Fidelity Virtual Try-On,Xu Yang · Changxing Ding · Zhibin Hong · Junhao Huang · Jin Tao · Xiangmin Xu, ,https://arxiv.org/abs/2404.01089,,2404.01089.pdf,Texture-Preserving Diffusion Models for High-Fidelity Virtual Try-On,"Image-based virtual try-on is an increasingly important task for online +shopping. It aims to synthesize images of a specific person wearing a specified +garment. Diffusion model-based approaches have recently become popular, as they +are excellent at image synthesis tasks. However, these approaches usually +employ additional image encoders and rely on the cross-attention mechanism for +texture transfer from the garment to the person image, which affects the +try-on's efficiency and fidelity. 
To address these issues, we propose an +Texture-Preserving Diffusion (TPD) model for virtual try-on, which enhances the +fidelity of the results and introduces no additional image encoders. +Accordingly, we make contributions from two aspects. First, we propose to +concatenate the masked person and reference garment images along the spatial +dimension and utilize the resulting image as the input for the diffusion +model's denoising UNet. This enables the original self-attention layers +contained in the diffusion model to achieve efficient and accurate texture +transfer. Second, we propose a novel diffusion-based method that predicts a +precise inpainting mask based on the person and reference garment images, +further enhancing the reliability of the try-on results. In addition, we +integrate mask prediction and image synthesis into a single compact model. The +experimental results show that our approach can be applied to various try-on +tasks, e.g., garment-to-person and person-to-person try-ons, and significantly +outperforms state-of-the-art methods on popular VITON, VITON-HD databases.",cs.CV,"['cs.CV', 'cs.AI']" +ZePT: Zero-Shot Pan-Tumor Segmentation via Query-Disentangling and Self-Prompting,Yankai Jiang · Zhongzhen Huang · Rongzhao Zhang · Xiaofan Zhang · Shaoting Zhang,https://github.com/Yankai96/ZePT,https://arxiv.org/abs/2312.04964,,2312.04964.pdf,ZePT: Zero-Shot Pan-Tumor Segmentation via Query-Disentangling and Self-Prompting,"The long-tailed distribution problem in medical image analysis reflects a +high prevalence of common conditions and a low prevalence of rare ones, which +poses a significant challenge in developing a unified model capable of +identifying rare or novel tumor categories not encountered during training. In +this paper, we propose a new zero-shot pan-tumor segmentation framework (ZePT) +based on query-disentangling and self-prompting to segment unseen tumor +categories beyond the training set. ZePT disentangles the object queries into +two subsets and trains them in two stages. Initially, it learns a set of +fundamental queries for organ segmentation through an object-aware feature +grouping strategy, which gathers organ-level visual features. Subsequently, it +refines the other set of advanced queries that focus on the auto-generated +visual prompts for unseen tumor segmentation. Moreover, we introduce +query-knowledge alignment at the feature level to enhance each query's +discriminative representation and generalizability. Extensive experiments on +various tumor segmentation tasks demonstrate the performance superiority of +ZePT, which surpasses the previous counterparts and evidence the promising +ability for zero-shot tumor segmentation in real-world settings.",cs.CV,['cs.CV'] +Embodied Multi-Modal Agent trained by an LLM from a Parallel TextWorld,Yijun Yang · Tianyi Zhou · kanxue Li · Dapeng Tao · Lusong Li · Li Shen · Xiaodong He · Jing Jiang · Yuhui Shi,https://github.com/stevenyangyj/Emma-Alfworld,https://arxiv.org/abs/2311.16714v1,,2311.16714v1.pdf,Embodied Multi-Modal Agent trained by an LLM from a Parallel TextWorld,"While large language models (LLMs) excel in a simulated world of texts, they +struggle to interact with the more realistic world without perceptions of other +modalities such as visual or audio signals. 
Although vision-language models +(VLMs) integrate LLM modules (1) aligned with static image features, and (2) +may possess prior knowledge of world dynamics (as demonstrated in the text +world), they have not been trained in an embodied visual world and thus cannot +align with its dynamics. On the other hand, training an embodied agent in a +noisy visual world without expert guidance is often challenging and +inefficient. In this paper, we train a VLM agent living in a visual world using +an LLM agent excelling in a parallel text world (but inapplicable to the visual +world). Specifically, we distill LLM's reflection outcomes (improved actions by +analyzing mistakes) in a text world's tasks to finetune the VLM on the same +tasks of the visual world, resulting in an Embodied Multi-Modal Agent (EMMA) +quickly adapting to the visual world dynamics. Such cross-modality imitation +learning between the two parallel worlds enables EMMA to generalize to a broad +scope of new tasks without any further guidance from the LLM expert. Extensive +evaluations on the ALFWorld benchmark highlight EMMA's superior performance to +SOTA VLM-based agents across diverse tasks, e.g., 20%-70% improvement in the +success rate.",cs.CV,['cs.CV'] +Mip-Splatting: Alias-free 3D Gaussian Splatting,Zehao Yu · Anpei Chen · Binbin Huang · Torsten Sattler · Andreas Geiger, ,https://arxiv.org/abs/2311.16493,,2311.16493.pdf,Mip-Splatting: Alias-free 3D Gaussian Splatting,"Recently, 3D Gaussian Splatting has demonstrated impressive novel view +synthesis results, reaching high fidelity and efficiency. However, strong +artifacts can be observed when changing the sampling rate, \eg, by changing +focal length or camera distance. We find that the source for this phenomenon +can be attributed to the lack of 3D frequency constraints and the usage of a 2D +dilation filter. To address this problem, we introduce a 3D smoothing filter +which constrains the size of the 3D Gaussian primitives based on the maximal +sampling frequency induced by the input views, eliminating high-frequency +artifacts when zooming in. Moreover, replacing 2D dilation with a 2D Mip +filter, which simulates a 2D box filter, effectively mitigates aliasing and +dilation issues. Our evaluation, including scenarios such a training on +single-scale images and testing on multiple scales, validates the effectiveness +of our approach.",cs.CV,['cs.CV'] +Codebook Transfer with Part-of-Speech for Vector-Quantized Image Modeling,Baoquan Zhang · Huaibin Wang · Luo Chuyao · Xutao Li · Guotao liang · Yunming Ye · joeq · Yao He,https://youtu.be/N6M0jcMP9lo,https://arxiv.org/abs/2403.10071,,2403.10071.pdf,Codebook Transfer with Part-of-Speech for Vector-Quantized Image Modeling,"Vector-Quantized Image Modeling (VQIM) is a fundamental research problem in +image synthesis, which aims to represent an image with a discrete token +sequence. Existing studies effectively address this problem by learning a +discrete codebook from scratch and in a code-independent manner to quantize +continuous representations into discrete tokens. However, learning a codebook +from scratch and in a code-independent manner is highly challenging, which may +be a key reason causing codebook collapse, i.e., some code vectors can rarely +be optimized without regard to the relationship between codes and good codebook +priors such that die off finally. 
In this paper, inspired by pretrained +language models, we find that these language models have actually pretrained a +superior codebook via a large number of text corpus, but such information is +rarely exploited in VQIM. To this end, we propose a novel codebook transfer +framework with part-of-speech, called VQCT, which aims to transfer a +well-trained codebook from pretrained language models to VQIM for robust +codebook learning. Specifically, we first introduce a pretrained codebook from +language models and part-of-speech knowledge as priors. Then, we construct a +vision-related codebook with these priors for achieving codebook transfer. +Finally, a novel codebook transfer network is designed to exploit abundant +semantic relationships between codes contained in pretrained codebooks for +robust VQIM codebook learning. Experimental results on four datasets show that +our VQCT method achieves superior VQIM performance over previous +state-of-the-art methods.",cs.CV,['cs.CV'] +Multi-Level Neural Scene Graphs for Dynamic Urban Environments,Tobias Fischer · Lorenzo Porzi · Samuel Rota Bulò · Marc Pollefeys · Peter Kontschieder, ,https://arxiv.org/abs/2404.00168,,2404.00168.pdf,Multi-Level Neural Scene Graphs for Dynamic Urban Environments,"We estimate the radiance field of large-scale dynamic areas from multiple +vehicle captures under varying environmental conditions. Previous works in this +domain are either restricted to static environments, do not scale to more than +a single short video, or struggle to separately represent dynamic object +instances. To this end, we present a novel, decomposable radiance field +approach for dynamic urban environments. We propose a multi-level neural scene +graph representation that scales to thousands of images from dozens of +sequences with hundreds of fast-moving objects. To enable efficient training +and rendering of our representation, we develop a fast composite ray sampling +and rendering scheme. To test our approach in urban driving scenarios, we +introduce a new, novel view synthesis benchmark. We show that our approach +outperforms prior art by a significant margin on both established and our +proposed benchmark while being faster in training and rendering.",cs.CV,['cs.CV'] +Rethinking Multi-view Representation Learning via Distilled Disentangling,Guanzhou Ke · Bo Wang · Xiao-Li Wang · Shengfeng He, ,https://arxiv.org/abs/2403.10897,,2403.10897.pdf,Rethinking Multi-view Representation Learning via Distilled Disentangling,"Multi-view representation learning aims to derive robust representations that +are both view-consistent and view-specific from diverse data sources. This +paper presents an in-depth analysis of existing approaches in this domain, +highlighting a commonly overlooked aspect: the redundancy between +view-consistent and view-specific representations. To this end, we propose an +innovative framework for multi-view representation learning, which incorporates +a technique we term 'distilled disentangling'. Our method introduces the +concept of masked cross-view prediction, enabling the extraction of compact, +high-quality view-consistent representations from various sources without +incurring extra computational overhead. Additionally, we develop a distilled +disentangling module that efficiently filters out consistency-related +information from multi-view representations, resulting in purer view-specific +representations. 
This approach significantly reduces redundancy between +view-consistent and view-specific representations, enhancing the overall +efficiency of the learning process. Our empirical evaluations reveal that +higher mask ratios substantially improve the quality of view-consistent +representations. Moreover, we find that reducing the dimensionality of +view-consistent representations relative to that of view-specific +representations further refines the quality of the combined representations. +Our code is accessible at: https://github.com/Guanzhou-Ke/MRDD.",cs.CV,"['cs.CV', 'cs.MM']" +Neural Refinement for Absolute Pose Regression with Feature Synthesis,Shuai Chen · Yash Bhalgat · Xinghui Li · Jia-Wang Bian · Kejie Li · Zirui Wang · Victor Adrian Prisacariu, ,https://arxiv.org/html/2402.14371v2,,2402.14371v2.pdf,HR-APR: APR-agnostic Framework with Uncertainty Estimation and Hierarchical Refinement for Camera Relocalisation,"Absolute Pose Regressors (APRs) directly estimate camera poses from monocular +images, but their accuracy is unstable for different queries. Uncertainty-aware +APRs provide uncertainty information on the estimated pose, alleviating the +impact of these unreliable predictions. However, existing uncertainty modelling +techniques are often coupled with a specific APR architecture, resulting in +suboptimal performance compared to state-of-the-art (SOTA) APR methods. This +work introduces a novel APR-agnostic framework, HR-APR, that formulates +uncertainty estimation as cosine similarity estimation between the query and +database features. It does not rely on or affect APR network architecture, +which is flexible and computationally efficient. In addition, we take advantage +of the uncertainty for pose refinement to enhance the performance of APR. The +extensive experiments demonstrate the effectiveness of our framework, reducing +27.4\% and 15.2\% of computational overhead on the 7Scenes and Cambridge +Landmarks datasets while maintaining the SOTA accuracy in single-image APRs.",cs.CV,"['cs.CV', 'cs.RO']" +Multi-modal In-Context Learning Makes an Ego-evolving Scene Text Recognizer,Zhen Zhao · Jingqun Tang · Chunhui Lin · Binghong Wu · Can Huang · Hao Liu · Xin Tan · Zhizhong Zhang · Yuan Xie,https://github.com/bytedance/E2STR,https://arxiv.org/abs/2311.13120,,2311.13120.pdf,Multi-modal In-Context Learning Makes an Ego-evolving Scene Text Recognizer,"Scene text recognition (STR) in the wild frequently encounters challenges +when coping with domain variations, font diversity, shape deformations, etc. A +straightforward solution is performing model fine-tuning tailored to a specific +scenario, but it is computationally intensive and requires multiple model +copies for various scenarios. Recent studies indicate that large language +models (LLMs) can learn from a few demonstration examples in a training-free +manner, termed ""In-Context Learning"" (ICL). Nevertheless, applying LLMs as a +text recognizer is unacceptably resource-consuming. Moreover, our pilot +experiments on LLMs show that ICL fails in STR, mainly attributed to the +insufficient incorporation of contextual information from diverse samples in +the training stage. To this end, we introduce E$^2$STR, a STR model trained +with context-rich scene text sequences, where the sequences are generated via +our proposed in-context training strategy. E$^2$STR demonstrates that a +regular-sized model is sufficient to achieve effective ICL capabilities in STR. 
+Extensive experiments show that E$^2$STR exhibits remarkable training-free +adaptation in various scenarios and outperforms even the fine-tuned +state-of-the-art approaches on public benchmarks. The code is released at +https://github.com/bytedance/E2STR .",cs.CV,['cs.CV'] +Cross Initialization for Face Personalization of Text-to-Image Models,Lianyu Pang · Jian Yin · Haoran Xie · Qiping Wang · Qing Li · Xudong Mao, ,https://arxiv.org/abs/2312.15905,,2312.15905.pdf,Cross Initialization for Personalized Text-to-Image Generation,"Recently, there has been a surge in face personalization techniques, +benefiting from the advanced capabilities of pretrained text-to-image diffusion +models. Among these, a notable method is Textual Inversion, which generates +personalized images by inverting given images into textual embeddings. However, +methods based on Textual Inversion still struggle with balancing the trade-off +between reconstruction quality and editability. In this study, we examine this +issue through the lens of initialization. Upon closely examining traditional +initialization methods, we identified a significant disparity between the +initial and learned embeddings in terms of both scale and orientation. The +scale of the learned embedding can be up to 100 times greater than that of the +initial embedding. Such a significant change in the embedding could increase +the risk of overfitting, thereby compromising the editability. Driven by this +observation, we introduce a novel initialization method, termed Cross +Initialization, that significantly narrows the gap between the initial and +learned embeddings. This method not only improves both reconstruction and +editability but also reduces the optimization steps from 5000 to 320. +Furthermore, we apply a regularization term to keep the learned embedding close +to the initial embedding. We show that when combined with Cross Initialization, +this regularization term can effectively improve editability. We provide +comprehensive empirical evidence to demonstrate the superior performance of our +method compared to the baseline methods. Notably, in our experiments, Cross +Initialization is the only method that successfully edits an individual's +facial expression. Additionally, a fast version of our method allows for +capturing an input image in roughly 26 seconds, while surpassing the baseline +methods in terms of both reconstruction and editability. Code will be made +publicly available.",cs.CV,['cs.CV'] +Zero-Painter: Training-Free Layout Control for Text-to-Image Synthesis,Marianna Ohanyan · Hayk Manukyan · Zhangyang Wang · Shant Navasardyan · Humphrey Shi, ,https://arxiv.org/abs/2311.12342,,2311.12342.pdf,LoCo: Locally Constrained Training-Free Layout-to-Image Synthesis,"Recent text-to-image diffusion models have reached an unprecedented level in +generating high-quality images. However, their exclusive reliance on textual +prompts often falls short in precise control of image compositions. In this +paper, we propose LoCo, a training-free approach for layout-to-image Synthesis +that excels in producing high-quality images aligned with both textual prompts +and layout instructions. Specifically, we introduce a Localized Attention +Constraint (LAC), leveraging semantic affinity between pixels in self-attention +maps to create precise representations of desired objects and effectively +ensure the accurate placement of objects in designated regions. 
We further +propose a Padding Token Constraint (PTC) to leverage the semantic information +embedded in previously neglected padding tokens, improving the consistency +between object appearance and layout instructions. LoCo seamlessly integrates +into existing text-to-image and layout-to-image models, enhancing their +performance in spatial control and addressing semantic failures observed in +prior methods. Extensive experiments showcase the superiority of our approach, +surpassing existing state-of-the-art training-free layout-to-image methods both +qualitatively and quantitatively across multiple benchmarks.",cs.CV,['cs.CV'] +NB-GTR: Narrow-Band Guided Turbulence Removal,Yifei Xia · Chu Zhou · Chengxuan Zhu · Minggui Teng · Chao Xu · Boxin Shi, ,,https://freebutuselesssoul.github.io/publications/cvpr2024b,,,,,nan +SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models,Yuzhou Huang · Liangbin Xie · Xintao Wang · Ziyang Yuan · Xiaodong Cun · Yixiao Ge · Jiantao Zhou · Chao Dong · Rui Huang · Ruimao Zhang · Ying Shan, ,https://arxiv.org/abs/2312.06739,,2312.06739.pdf,SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models,"Current instruction-based editing methods, such as InstructPix2Pix, often +fail to produce satisfactory results in complex scenarios due to their +dependence on the simple CLIP text encoder in diffusion models. To rectify +this, this paper introduces SmartEdit, a novel approach to instruction-based +image editing that leverages Multimodal Large Language Models (MLLMs) to +enhance their understanding and reasoning capabilities. However, direct +integration of these elements still faces challenges in situations requiring +complex reasoning. To mitigate this, we propose a Bidirectional Interaction +Module that enables comprehensive bidirectional information interactions +between the input image and the MLLM output. During training, we initially +incorporate perception data to boost the perception and understanding +capabilities of diffusion models. Subsequently, we demonstrate that a small +amount of complex instruction editing data can effectively stimulate +SmartEdit's editing capabilities for more complex instructions. We further +construct a new evaluation dataset, Reason-Edit, specifically tailored for +complex instruction-based image editing. Both quantitative and qualitative +results on this evaluation dataset indicate that our SmartEdit surpasses +previous methods, paving the way for the practical application of complex +instruction-based image editing.",cs.CV,['cs.CV'] +FADES: Fair Disentanglement with Sensitive Relevance,Taeuk Jang · Xiaoqian Wang, ,https://arxiv.org/abs/2405.07011,,2405.07011.pdf,Fair Graph Representation Learning via Sensitive Attribute Disentanglement,"Group fairness for Graph Neural Networks (GNNs), which emphasizes algorithmic +decisions neither favoring nor harming certain groups defined by sensitive +attributes (e.g., race and gender), has gained considerable attention. In +particular, the objective of group fairness is to ensure that the decisions +made by GNNs are independent of the sensitive attribute. To achieve this +objective, most existing approaches involve eliminating sensitive attribute +information in node representations or algorithmic decisions. However, such +ways may also eliminate task-related information due to its inherent +correlation with the sensitive attribute, leading to a sacrifice in utility. 
In +this work, we focus on improving the fairness of GNNs while preserving +task-related information and propose a fair GNN framework named FairSAD. +Instead of eliminating sensitive attribute information, FairSAD enhances the +fairness of GNNs via Sensitive Attribute Disentanglement (SAD), which separates +the sensitive attribute-related information into an independent component to +mitigate its impact. Additionally, FairSAD utilizes a channel masking mechanism +to adaptively identify the sensitive attribute-related component and +subsequently decorrelates it. Overall, FairSAD minimizes the impact of the +sensitive attribute on GNN outcomes rather than eliminating sensitive +attributes, thereby preserving task-related information associated with the +sensitive attribute. Furthermore, experiments conducted on several real-world +datasets demonstrate that FairSAD outperforms other state-of-the-art methods by +a significant margin in terms of both fairness and utility performance. Our +source code is available at https://github.com/ZzoomD/FairSAD.",cs.LG,"['cs.LG', 'cs.CY']" +VRP-SAM: SAM with Visual Reference Prompt,Yanpeng Sun · Jiahui Chen · Shan Zhang · Xinyu Zhang · Qiang Chen · gang zhang · Errui Ding · Jingdong Wang · Zechao Li, ,https://arxiv.org/abs/2402.17726,,2402.17726.pdf,VRP-SAM: SAM with Visual Reference Prompt,"In this paper, we propose a novel Visual Reference Prompt (VRP) encoder that +empowers the Segment Anything Model (SAM) to utilize annotated reference images +as prompts for segmentation, creating the VRP-SAM model. In essence, VRP-SAM +can utilize annotated reference images to comprehend specific objects and +perform segmentation of specific objects in target image. It is note that the +VRP encoder can support a variety of annotation formats for reference images, +including \textbf{point}, \textbf{box}, \textbf{scribble}, and \textbf{mask}. +VRP-SAM achieves a breakthrough within the SAM framework by extending its +versatility and applicability while preserving SAM's inherent strengths, thus +enhancing user-friendliness. To enhance the generalization ability of VRP-SAM, +the VRP encoder adopts a meta-learning strategy. To validate the effectiveness +of VRP-SAM, we conducted extensive empirical studies on the Pascal and COCO +datasets. Remarkably, VRP-SAM achieved state-of-the-art performance in visual +reference segmentation with minimal learnable parameters. Furthermore, VRP-SAM +demonstrates strong generalization capabilities, allowing it to perform +segmentation of unseen objects and enabling cross-domain segmentation. The +source code and models will be available at +\url{https://github.com/syp2ysy/VRP-SAM}",cs.CV,['cs.CV'] +Localization Is All You Evaluate: Data Leakage in Online Mapping Datasets and How to Fix It,Adam Lilja · Junsheng Fu · Erik Stenborg · Lars Hammarstrand,https://github.com/LiljaAdam/geographical-splits,https://arxiv.org/abs/2312.06420,,2312.06420.pdf,Localization Is All You Evaluate: Data Leakage in Online Mapping Datasets and How to Fix It,"The task of online mapping is to predict a local map using current sensor +observations, e.g. from lidar and camera, without relying on a pre-built map. +State-of-the-art methods are based on supervised learning and are trained +predominantly using two datasets: nuScenes and Argoverse 2. However, these +datasets revisit the same geographic locations across training, validation, and +test sets. 
Specifically, over $80$% of nuScenes and $40$% of Argoverse 2 +validation and test samples are less than $5$ m from a training sample. At test +time, the methods are thus evaluated more on how well they localize within a +memorized implicit map built from the training data than on extrapolating to +unseen locations. Naturally, this data leakage causes inflated performance +numbers and we propose geographically disjoint data splits to reveal the true +performance in unseen environments. Experimental results show that methods +perform considerably worse, some dropping more than $45$ mAP, when trained and +evaluated on proper data splits. Additionally, a reassessment of prior design +choices reveals diverging conclusions from those based on the original split. +Notably, the impact of lifting methods and the support from auxiliary tasks +(e.g., depth supervision) on performance appears less substantial or follows a +different trajectory than previously perceived. Splits can be found at +https://github.com/LiljaAdam/geographical-splits",cs.CV,['cs.CV'] +Gated Fields: Learning Scene Reconstruction from Gated Videos,Andrea Ramazzina · Stefanie Walz · Pragyan Dahal · Mario Bijelic · Felix Heide, ,https://arxiv.org/abs/2405.19819,,2405.19819.pdf,Gated Fields: Learning Scene Reconstruction from Gated Videos,"Reconstructing outdoor 3D scenes from temporal observations is a challenge +that recent work on neural fields has offered a new avenue for. However, +existing methods that recover scene properties, such as geometry, appearance, +or radiance, solely from RGB captures often fail when handling poorly-lit or +texture-deficient regions. Similarly, recovering scenes with scanning LiDAR +sensors is also difficult due to their low angular sampling rate which makes +recovering expansive real-world scenes difficult. Tackling these gaps, we +introduce Gated Fields - a neural scene reconstruction method that utilizes +active gated video sequences. To this end, we propose a neural rendering +approach that seamlessly incorporates time-gated capture and illumination. Our +method exploits the intrinsic depth cues in the gated videos, achieving precise +and dense geometry reconstruction irrespective of ambient illumination +conditions. We validate the method across day and night scenarios and find that +Gated Fields compares favorably to RGB and LiDAR reconstruction methods. Our +code and datasets are available at https://light.princeton.edu/gatedfields/.",cs.CV,['cs.CV'] +VINECS: Video-based Neural Character Skinning,Zhouyingcheng Liao · Vladislav Golyanik · Marc Habermann · Christian Theobalt, ,https://arxiv.org/abs/2307.00842,,2307.00842.pdf,VINECS: Video-based Neural Character Skinning,"Rigging and skinning clothed human avatars is a challenging task and +traditionally requires a lot of manual work and expertise. Recent methods +addressing it either generalize across different characters or focus on +capturing the dynamics of a single character observed under different pose +configurations. However, the former methods typically predict solely static +skinning weights, which perform poorly for highly articulated poses, and the +latter ones either require dense 3D character scans in different poses or +cannot generate an explicit mesh with vertex correspondence over time. To +address these challenges, we propose a fully automated approach for creating a +fully rigged character with pose-dependent skinning weights, which can be +solely learned from multi-view video. 
Therefore, we first acquire a rigged +template, which is then statically skinned. Next, a coordinate-based MLP learns +a skinning weights field parameterized over the position in a canonical pose +space and the respective pose. Moreover, we introduce our pose- and +view-dependent appearance field allowing us to differentiably render and +supervise the posed mesh using multi-view imagery. We show that our approach +outperforms state-of-the-art while not relying on dense 4D scans.",cs.CV,['cs.CV'] +LayoutLLM: Layout Instruction Tuning with Large Language Models for Document Understanding,Chuwei Luo · Yufan Shen · Zhaoqing Zhu · Qi Zheng · Zhi Yu · Cong Yao,https://github.com/AlibabaResearch/AdvancedLiterateMachinery/tree/main/DocumentUnderstanding/LayoutLLM,https://arxiv.org/abs/2404.05225,,2404.05225.pdf,LayoutLLM: Layout Instruction Tuning with Large Language Models for Document Understanding,"Recently, leveraging large language models (LLMs) or multimodal large +language models (MLLMs) for document understanding has been proven very +promising. However, previous works that employ LLMs/MLLMs for document +understanding have not fully explored and utilized the document layout +information, which is vital for precise document understanding. In this paper, +we propose LayoutLLM, an LLM/MLLM based method for document understanding. The +core of LayoutLLM is a layout instruction tuning strategy, which is specially +designed to enhance the comprehension and utilization of document layouts. The +proposed layout instruction tuning strategy consists of two components: +Layout-aware Pre-training and Layout-aware Supervised Fine-tuning. To capture +the characteristics of document layout in Layout-aware Pre-training, three +groups of pre-training tasks, corresponding to document-level, region-level and +segment-level information, are introduced. Furthermore, a novel module called +layout chain-of-thought (LayoutCoT) is devised to enable LayoutLLM to focus on +regions relevant to the question and generate accurate answers. LayoutCoT is +effective for boosting the performance of document understanding. Meanwhile, it +brings a certain degree of interpretability, which could facilitate manual +inspection and correction. Experiments on standard benchmarks show that the +proposed LayoutLLM significantly outperforms existing methods that adopt +open-source 7B LLMs/MLLMs for document understanding. The training data of the +LayoutLLM is publicly available at +https://github.com/AlibabaResearch/AdvancedLiterateMachinery/tree/main/DocumentUnderstanding/LayoutLLM",cs.CV,"['cs.CV', 'cs.CL']" +DanceCamera3D: 3D Camera Movement Synthesis with Music and Dance,Zixuan Wang · Jia Jia · Shikun Sun · Haozhe Wu · Rong Han · Zhenyu Li · Di Tang · Jiaqing Zhou · Jiebo Luo, ,https://arxiv.org/abs/2403.13667,,2403.13667.pdf,DanceCamera3D: 3D Camera Movement Synthesis with Music and Dance,"Choreographers determine what the dances look like, while cameramen determine +the final presentation of dances. Recently, various methods and datasets have +showcased the feasibility of dance synthesis. However, camera movement +synthesis with music and dance remains an unsolved challenging problem due to +the scarcity of paired data. Thus, we present DCM, a new multi-modal 3D +dataset, which for the first time combines camera movement with dance motion +and music audio. This dataset encompasses 108 dance sequences (3.2 hours) of +paired dance-camera-music data from the anime community, covering 4 music +genres. 
With this dataset, we uncover that dance camera movement is +multifaceted and human-centric, and possesses multiple influencing factors, +making dance camera synthesis a more challenging task compared to camera or +dance synthesis alone. To overcome these difficulties, we propose +DanceCamera3D, a transformer-based diffusion model that incorporates a novel +body attention loss and a condition separation strategy. For evaluation, we +devise new metrics measuring camera movement quality, diversity, and dancer +fidelity. Utilizing these metrics, we conduct extensive experiments on our DCM +dataset, providing both quantitative and qualitative evidence showcasing the +effectiveness of our DanceCamera3D model. Code and video demos are available at +https://github.com/Carmenw1203/DanceCamera3D-Official.",cs.CV,"['cs.CV', 'cs.MM']" +Active Prompt Learning in Vision Language Models,Jihwan Bang · Sumyeong Ahn · Jae-Gil Lee, ,https://arxiv.org/abs/2311.11178,,2311.11178.pdf,Active Prompt Learning in Vision Language Models,"Pre-trained Vision Language Models (VLMs) have demonstrated notable progress +in various zero-shot tasks, such as classification and retrieval. Despite their +performance, because improving performance on new tasks requires task-specific +knowledge, their adaptation is essential. While labels are needed for the +adaptation, acquiring them is typically expensive. To overcome this challenge, +active learning, a method of achieving a high performance by obtaining labels +for a small number of samples from experts, has been studied. Active learning +primarily focuses on selecting unlabeled samples for labeling and leveraging +them to train models. In this study, we pose the question, ""how can the +pre-trained VLMs be adapted under the active learning framework?"" In response +to this inquiry, we observe that (1) simply applying a conventional active +learning framework to pre-trained VLMs even may degrade performance compared to +random selection because of the class imbalance in labeling candidates, and (2) +the knowledge of VLMs can provide hints for achieving the balance before +labeling. Based on these observations, we devise a novel active learning +framework for VLMs, denoted as PCB. To assess the effectiveness of our +approach, we conduct experiments on seven different real-world datasets, and +the results demonstrate that PCB surpasses conventional active learning and +random sampling methods. Code will be available in +https://github.com/kaist-dmlab/pcb .",cs.CV,['cs.CV'] +One-Prompt to Segment All Medical Images,Wu · Min Xu, ,https://arxiv.org/html/2305.10300v3,,2305.10300v3.pdf,One-Prompt to Segment All Medical Images,"Large foundation models, known for their strong zero-shot generalization, +have excelled in visual and language applications. However, applying them to +medical image segmentation, a domain with diverse imaging types and target +labels, remains an open challenge. Current approaches, such as adapting +interactive segmentation models like Segment Anything Model (SAM), require user +prompts for each sample during inference. Alternatively, transfer learning +methods like few/one-shot models demand labeled samples, leading to high costs. +This paper introduces a new paradigm toward the universal medical image +segmentation, termed 'One-Prompt Segmentation.' One-Prompt Segmentation +combines the strengths of one-shot and interactive methods. 
In the inference +stage, with just \textbf{one prompted sample}, it can adeptly handle the unseen +task in a single forward pass. We train One-Prompt Model on 64 open-source +medical datasets, accompanied by the collection of over 3,000 clinician-labeled +prompts. Tested on 14 previously unseen tasks, the One-Prompt Model showcases +superior zero-shot segmentation capabilities, outperforming a wide range of +related methods. The code and annotated data will be publicly released.",eess.IV,"['eess.IV', 'cs.CV']" +Reconstructing Hands in 3D with Transformers,Georgios Pavlakos · Dandan Shan · Ilija Radosavovic · Angjoo Kanazawa · David Fouhey · Jitendra Malik, ,https://arxiv.org/abs/2312.05251,,2312.05251.pdf,Reconstructing Hands in 3D with Transformers,"We present an approach that can reconstruct hands in 3D from monocular input. +Our approach for Hand Mesh Recovery, HaMeR, follows a fully transformer-based +architecture and can analyze hands with significantly increased accuracy and +robustness compared to previous work. The key to HaMeR's success lies in +scaling up both the data used for training and the capacity of the deep network +for hand reconstruction. For training data, we combine multiple datasets that +contain 2D or 3D hand annotations. For the deep model, we use a large scale +Vision Transformer architecture. Our final model consistently outperforms the +previous baselines on popular 3D hand pose benchmarks. To further evaluate the +effect of our design in non-controlled settings, we annotate existing +in-the-wild datasets with 2D hand keypoint annotations. On this newly collected +dataset of annotations, HInt, we demonstrate significant improvements over +existing baselines. We make our code, data and models available on the project +website: https://geopavlakos.github.io/hamer/.",cs.CV,['cs.CV'] +Can Biases in ImageNet Models Explain Generalization?,Paul Gavrikov · Janis Keuper,https://github.com/paulgavrikov/biases_vs_generalization,https://arxiv.org/abs/2404.01509,,2404.01509.pdf,Can Biases in ImageNet Models Explain Generalization?,"The robust generalization of models to rare, in-distribution (ID) samples +drawn from the long tail of the training distribution and to +out-of-training-distribution (OOD) samples is one of the major challenges of +current deep learning methods. For image classification, this manifests in the +existence of adversarial attacks, the performance drops on distorted images, +and a lack of generalization to concepts such as sketches. The current +understanding of generalization in neural networks is very limited, but some +biases that differentiate models from human vision have been identified and +might be causing these limitations. Consequently, several attempts with varying +success have been made to reduce these biases during training to improve +generalization. We take a step back and sanity-check these attempts. Fixing the +architecture to the well-established ResNet-50, we perform a large-scale study +on 48 ImageNet models obtained via different training methods to understand how +and if these biases - including shape bias, spectral biases, and critical bands +- interact with generalization. Our extensive study results reveal that +contrary to previous findings, these biases are insufficient to accurately +predict the generalization of a model holistically. 
We provide access to all +checkpoints and evaluation code at +https://github.com/paulgavrikov/biases_vs_generalization",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG', 'stat.ML']" +"Synthesize Step-by-Step: Tools, Templates and LLMs as Data Generators for Reasoning-Based Chart VQA",Zhuowan Li · Bhavan Jasani · Peng Tang · Shabnam Ghadar, ,https://arxiv.org/abs/2403.16385,,2403.16385.pdf,"Synthesize Step-by-Step: Tools, Templates and LLMs as Data Generators for Reasoning-Based Chart VQA","Understanding data visualizations like charts and plots requires reasoning +about both visual elements and numerics. Although strong in extractive +questions, current chart visual question answering (chart VQA) models suffer on +complex reasoning questions. In this work, we address the lack of reasoning +ability by data augmentation. We leverage Large Language Models (LLMs), which +have shown to have strong reasoning ability, as an automatic data annotator +that generates question-answer annotations for chart images. The key innovation +in our method lies in the Synthesize Step-by-Step strategy: our LLM-based data +generator learns to decompose the complex question into step-by-step +sub-questions (rationales), which are then used to derive the final answer +using external tools, i.e. Python. This step-wise generation procedure is +trained on synthetic data generated using a template-based QA generation +pipeline. Experimental results highlight the significance of the proposed +step-by-step generation. By training with the LLM-augmented data (LAMENDA), we +significantly enhance the chart VQA models, achieving the state-of-the-art +accuracy on the ChartQA and PlotQA datasets. In particular, our approach +improves the accuracy of the previous state-of-the-art approach from 38% to 54% +on the human-written questions in the ChartQA dataset, which needs strong +reasoning. We hope our work underscores the potential of synthetic data and +encourages further exploration of data augmentation using LLMs for +reasoning-heavy tasks.",cs.CV,"['cs.CV', 'cs.CL']" +TeTriRF: Temporal Tri-Plane Radiance Fields for Efficient Free-Viewpoint Video,Minye Wu · Zehao Wang · Georgios Kouros · Tinne Tuytelaars, ,https://arxiv.org/abs/2312.06713,,2312.06713.pdf,TeTriRF: Temporal Tri-Plane Radiance Fields for Efficient Free-Viewpoint Video,"Neural Radiance Fields (NeRF) revolutionize the realm of visual media by +providing photorealistic Free-Viewpoint Video (FVV) experiences, offering +viewers unparalleled immersion and interactivity. However, the technology's +significant storage requirements and the computational complexity involved in +generation and rendering currently limit its broader application. To close this +gap, this paper presents Temporal Tri-Plane Radiance Fields (TeTriRF), a novel +technology that significantly reduces the storage size for Free-Viewpoint Video +(FVV) while maintaining low-cost generation and rendering. TeTriRF introduces a +hybrid representation with tri-planes and voxel grids to support scaling up to +long-duration sequences and scenes with complex motions or rapid changes. We +propose a group training scheme tailored to achieving high training efficiency +and yielding temporally consistent, low-entropy scene representations. +Leveraging these properties of the representations, we introduce a compression +pipeline with off-the-shelf video codecs, achieving an order of magnitude less +storage size compared to the state-of-the-art. 
Our experiments demonstrate that +TeTriRF can achieve competitive quality with a higher compression rate.",cs.CV,['cs.CV'] +Infinigen Indoors: Photorealistic Indoor Scenes using Procedural Generation,Alexander Raistrick · Lingjie Mei · Karhan Kayan · David Yan · Yiming Zuo · Beining Han · Hongyu Wen · Meenal Parakh · Stamatis Alexandropoulos · Lahav Lipson · Zeyu Ma · Jia Deng, ,https://arxiv.org/abs/2306.09310,,2306.09310.pdf,Infinite Photorealistic Worlds using Procedural Generation,"We introduce Infinigen, a procedural generator of photorealistic 3D scenes of +the natural world. Infinigen is entirely procedural: every asset, from shape to +texture, is generated from scratch via randomized mathematical rules, using no +external source and allowing infinite variation and composition. Infinigen +offers broad coverage of objects and scenes in the natural world including +plants, animals, terrains, and natural phenomena such as fire, cloud, rain, and +snow. Infinigen can be used to generate unlimited, diverse training data for a +wide range of computer vision tasks including object detection, semantic +segmentation, optical flow, and 3D reconstruction. We expect Infinigen to be a +useful resource for computer vision research and beyond. Please visit +https://infinigen.org for videos, code and pre-generated data.",cs.CV,['cs.CV'] +TetraSphere: A Neural Descriptor for O(3)-Invariant Point Cloud Analysis,Pavlo Melnyk · Andreas Robinson · Michael Felsberg · Mårten Wadenbäck,https://github.com/pavlo-melnyk/tetrasphere,,https://www.youtube.com/watch?v=MRJr0V7eMj8,,,,,nan +Auto MC-Reward: Automated Dense Reward Design with Large Language Models for Minecraft,Hao Li · Xue Yang · Zhaokai Wang · Xizhou Zhu · Jie Zhou · Yu Qiao · Xiaogang Wang · Hongsheng Li · Lewei Lu · Jifeng Dai,https://yangxue0827.github.io/auto_mc-reward.html,https://arxiv.org/abs/2312.09238,,2312.09238.pdf,Auto MC-Reward: Automated Dense Reward Design with Large Language Models for Minecraft,"Many reinforcement learning environments (e.g., Minecraft) provide only +sparse rewards that indicate task completion or failure with binary values. The +challenge in exploration efficiency in such environments makes it difficult for +reinforcement-learning-based agents to learn complex tasks. To address this, +this paper introduces an advanced learning system, named Auto MC-Reward, that +leverages Large Language Models (LLMs) to automatically design dense reward +functions, thereby enhancing the learning efficiency. Auto MC-Reward consists +of three important components: Reward Designer, Reward Critic, and Trajectory +Analyzer. Given the environment information and task descriptions, the Reward +Designer first design the reward function by coding an executable Python +function with predefined observation inputs. Then, our Reward Critic will be +responsible for verifying the code, checking whether the code is +self-consistent and free of syntax and semantic errors. Further, the Trajectory +Analyzer summarizes possible failure causes and provides refinement suggestions +according to collected trajectories. In the next round, Reward Designer will +further refine and iterate the dense reward function based on feedback. 
+Experiments demonstrate a significant improvement in the success rate and +learning efficiency of our agents in complex tasks in Minecraft, such as +obtaining diamond with the efficient ability to avoid lava, and efficiently +explore trees and animals that are sparse in the plains biome.",cs.AI,"['cs.AI', 'cs.CL', 'cs.CV', 'cs.LG']" +Practical Measurements of Translucent Materials with Inter-Pixel Translucency Prior,Zhenyu Chen · Jie Guo · Shuichang Lai · Ruoyu Fu · mengxun kong · Chen Wang · Hongyu Sun · Zhebin Zhang · Chen Li · Yanwen Guo, ,,https://github.com/ZhenyuChen1999/IPTNet,,,,,nan +Dual-View Visual Contextualization for Web Navigation,Jihyung Kil · Chan Hee Song · Boyuan Zheng · Xiang Deng · Yu Su · Wei-Lun Chao, ,https://arxiv.org/abs/2402.04476,,2402.04476.pdf,Dual-View Visual Contextualization for Web Navigation,"Automatic web navigation aims to build a web agent that can follow language +instructions to execute complex and diverse tasks on real-world websites. +Existing work primarily takes HTML documents as input, which define the +contents and action spaces (i.e., actionable elements and operations) of +webpages. Nevertheless, HTML documents may not provide a clear task-related +context for each element, making it hard to select the right (sequence of) +actions. In this paper, we propose to contextualize HTML elements through their +""dual views"" in webpage screenshots: each HTML element has its corresponding +bounding box and visual content in the screenshot. We build upon the insight -- +web developers tend to arrange task-related elements nearby on webpages to +enhance user experiences -- and propose to contextualize each element with its +neighbor elements, using both textual and visual features. The resulting +representations of HTML elements are more informative for the agent to take +action. We validate our method on the recently released Mind2Web dataset, which +features diverse navigation domains and tasks on real-world websites. Our +method consistently outperforms the baseline in all the scenarios, including +cross-task, cross-website, and cross-domain ones.",cs.CV,"['cs.CV', 'cs.AI', 'cs.CL']" +CAT: Exploiting Inter-Class Dynamics for Domain Adaptive Object Detection,Mikhail Kennerley · Jian-Gang Wang · Bharadwaj Veeravalli · Robby T. Tan,https://www.mikhailkennerley.com/cat,https://arxiv.org/abs/2403.19278v1,,2403.19278v1.pdf,CAT: Exploiting Inter-Class Dynamics for Domain Adaptive Object Detection,"Domain adaptive object detection aims to adapt detection models to domains +where annotated data is unavailable. Existing methods have been proposed to +address the domain gap using the semi-supervised student-teacher framework. +However, a fundamental issue arises from the class imbalance in the labelled +training set, which can result in inaccurate pseudo-labels. The relationship +between classes, especially where one class is a majority and the other +minority, has a large impact on class bias. We propose Class-Aware Teacher +(CAT) to address the class bias issue in the domain adaptation setting. In our +work, we approximate the class relationships with our Inter-Class Relation +module (ICRm) and exploit it to reduce the bias within the model. In this way, +we are able to apply augmentations to highly related classes, both inter- and +intra-domain, to boost the performance of minority classes while having minimal +impact on majority classes. We further reduce the bias by implementing a +class-relation weight to our classification loss. 
Experiments conducted on +various datasets and ablation studies show that our method is able to address +the class bias in the domain adaptation setting. On the Cityscapes to Foggy +Cityscapes dataset, we attained a 52.5 mAP, a substantial improvement over the +51.2 mAP achieved by the state-of-the-art method.",cs.CV,['cs.CV'] +HOISDF: Constraining 3D Hand Object Pose Estimation with Global Signed Distance Fields,Haozhe Qi · Chen Zhao · Mathieu Salzmann · Alexander Mathis, ,https://arxiv.org/abs/2402.17062,,2402.17062.pdf,HOISDF: Constraining 3D Hand-Object Pose Estimation with Global Signed Distance Fields,"Human hands are highly articulated and versatile at handling objects. Jointly +estimating the 3D poses of a hand and the object it manipulates from a +monocular camera is challenging due to frequent occlusions. Thus, existing +methods often rely on intermediate 3D shape representations to increase +performance. These representations are typically explicit, such as 3D point +clouds or meshes, and thus provide information in the direct surroundings of +the intermediate hand pose estimate. To address this, we introduce HOISDF, a +Signed Distance Field (SDF) guided hand-object pose estimation network, which +jointly exploits hand and object SDFs to provide a global, implicit +representation over the complete reconstruction volume. Specifically, the role +of the SDFs is threefold: equip the visual encoder with implicit shape +information, help to encode hand-object interactions, and guide the hand and +object pose regression via SDF-based sampling and by augmenting the feature +representations. We show that HOISDF achieves state-of-the-art results on +hand-object pose estimation benchmarks (DexYCB and HO3Dv2). Code is available +at https://github.com/amathislab/HOISDF",cs.CV,['cs.CV'] +Learning Object State Changes in Videos: An Open-World Perspective,Zihui Xue · Kumar Ashutosh · Kristen Grauman,https://vision.cs.utexas.edu/projects/VidOSC/,https://arxiv.org/abs/2312.11782,,2312.11782.pdf,Learning Object State Changes in Videos: An Open-World Perspective,"Object State Changes (OSCs) are pivotal for video understanding. While humans +can effortlessly generalize OSC understanding from familiar to unknown objects, +current approaches are confined to a closed vocabulary. Addressing this gap, we +introduce a novel open-world formulation for the video OSC problem. The goal is +to temporally localize the three stages of an OSC -- the object's initial +state, its transitioning state, and its end state -- whether or not the object +has been observed during training. Towards this end, we develop VidOSC, a +holistic learning approach that: (1) leverages text and vision-language models +for supervisory signals to obviate manually labeling OSC training data, and (2) +abstracts fine-grained shared state representations from objects to enhance +generalization. Furthermore, we present HowToChange, the first open-world +benchmark for video OSC localization, which offers an order of magnitude +increase in the label space and annotation volume compared to the best existing +benchmark. Experimental results demonstrate the efficacy of our approach, in +both traditional closed-world and open-world scenarios.",cs.CV,['cs.CV'] +Depth Prompting for Sensor-Agnostic Depth Estimation,Jin-Hwi Park · Chanhwi Jeong · Junoh Lee · Hae-Gon Jeon, ,https://arxiv.org/abs/2405.11867,,2405.11867.pdf,Depth Prompting for Sensor-Agnostic Depth Estimation,"Dense depth maps have been used as a key element of visual perception tasks. 
+There have been tremendous efforts to enhance the depth quality, ranging from +optimization-based to learning-based methods. Despite the remarkable progress +for a long time, their applicability in the real world is limited due to +systematic measurement biases such as density, sensing pattern, and scan range. +It is well-known that the biases make it difficult for these methods to achieve +their generalization. We observe that learning a joint representation for input +modalities (e.g., images and depth), which most recent methods adopt, is +sensitive to the biases. In this work, we disentangle those modalities to +mitigate the biases with prompt engineering. For this, we design a novel depth +prompt module to allow the desirable feature representation according to new +depth distributions from either sensor types or scene configurations. Our depth +prompt can be embedded into foundation models for monocular depth estimation. +Through this embedding process, our method helps the pretrained model to be +free from restraint of depth scan range and to provide absolute scale depth +maps. We demonstrate the effectiveness of our method through extensive +evaluations. Source code is publicly available at +https://github.com/JinhwiPark/DepthPrompting .",cs.CV,"['cs.CV', 'cs.LG', 'cs.RO']" +PromptCoT: Align Prompt Distribution via Adapted Chain-of-Thought,Junyi Yao · Yijiang Liu · Zhen Dong · Mingfei Guo · Helan Hu · Kurt Keutzer · Li Du · Daquan Zhou · Shanghang Zhang, ,https://arxiv.org/abs/2307.13339,,2307.13339.pdf,Analyzing Chain-of-Thought Prompting in Large Language Models via Gradient-based Feature Attributions,"Chain-of-thought (CoT) prompting has been shown to empirically improve the +accuracy of large language models (LLMs) on various question answering tasks. +While understanding why CoT prompting is effective is crucial to ensuring that +this phenomenon is a consequence of desired model behavior, little work has +addressed this; nonetheless, such an understanding is a critical prerequisite +for responsible model deployment. We address this question by leveraging +gradient-based feature attribution methods which produce saliency scores that +capture the influence of input tokens on model output. Specifically, we probe +several open-source LLMs to investigate whether CoT prompting affects the +relative importances they assign to particular input tokens. Our results +indicate that while CoT prompting does not increase the magnitude of saliency +scores attributed to semantically relevant tokens in the prompt compared to +standard few-shot prompting, it increases the robustness of saliency scores to +question perturbations and variations in model output.",cs.CL,"['cs.CL', 'cs.AI']" +Exploiting Inter-sample and Inter-feature Relations in Dataset Distillation,Wenxiao Deng · Wenbin Li · Tianyu Ding · Lei Wang · Hongguang Zhang · Kuihua Huang · Jing Huo · Yang Gao, ,https://arxiv.org/abs/2404.00563,,2404.00563.pdf,Exploiting Inter-sample and Inter-feature Relations in Dataset Distillation,"Dataset distillation has emerged as a promising approach in deep learning, +enabling efficient training with small synthetic datasets derived from larger +real ones. Particularly, distribution matching-based distillation methods +attract attention thanks to its effectiveness and low computational cost. 
+However, these methods face two primary limitations: the dispersed feature +distribution within the same class in synthetic datasets, reducing class +discrimination, and an exclusive focus on mean feature consistency, lacking +precision and comprehensiveness. To address these challenges, we introduce two +novel constraints: a class centralization constraint and a covariance matching +constraint. The class centralization constraint aims to enhance class +discrimination by more closely clustering samples within classes. The +covariance matching constraint seeks to achieve more accurate feature +distribution matching between real and synthetic datasets through local feature +covariance matrices, particularly beneficial when sample sizes are much smaller +than the number of features. Experiments demonstrate notable improvements with +these constraints, yielding performance boosts of up to 6.6% on CIFAR10, 2.9% +on SVHN, 2.5% on CIFAR100, and 2.5% on TinyImageNet, compared to the +state-of-the-art relevant methods. In addition, our method maintains robust +performance in cross-architecture settings, with a maximum performance drop of +1.7% on four architectures. Code is available at +https://github.com/VincenDen/IID.",cs.CV,['cs.CV'] +MeshPose: Unifying DensePose and 3D Body Mesh reconstruction,Eric-Tuan Le · Antonios Kakolyris · Petros Koutras · Himmy Tam · Efstratios Skordos · George Papandreou · Riza Alp Guler · Iasonas Kokkinos, ,https://arxiv.org/abs/2308.10305,,2308.10305.pdf,Co-Evolution of Pose and Mesh for 3D Human Body Estimation from Video,"Despite significant progress in single image-based 3D human mesh recovery, +accurately and smoothly recovering 3D human motion from a video remains +challenging. Existing video-based methods generally recover human mesh by +estimating the complex pose and shape parameters from coupled image features, +whose high complexity and low representation ability often result in +inconsistent pose motion and limited shape patterns. To alleviate this issue, +we introduce 3D pose as the intermediary and propose a Pose and Mesh +Co-Evolution network (PMCE) that decouples this task into two parts: 1) +video-based 3D human pose estimation and 2) mesh vertices regression from the +estimated 3D pose and temporal image feature. Specifically, we propose a +two-stream encoder that estimates mid-frame 3D pose and extracts a temporal +image feature from the input image sequence. In addition, we design a +co-evolution decoder that performs pose and mesh interactions with the +image-guided Adaptive Layer Normalization (AdaLN) to make pose and mesh fit the +human body shape. Extensive experiments demonstrate that the proposed PMCE +outperforms previous state-of-the-art methods in terms of both per-frame +accuracy and temporal consistency on three benchmark datasets: 3DPW, Human3.6M, +and MPI-INF-3DHP. Our code is available at https://github.com/kasvii/PMCE.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +HybridNeRF: Efficient Neural Rendering via Adaptive Volumetric Surfaces,Haithem Turki · Vasu Agrawal · Samuel Rota Bulò · Lorenzo Porzi · Peter Kontschieder · Deva Ramanan · Michael Zollhoefer · Christian Richardt,https://haithemturki.com/hybrid-nerf/,https://arxiv.org/abs/2312.03160,,2312.03160.pdf,HybridNeRF: Efficient Neural Rendering via Adaptive Volumetric Surfaces,"Neural radiance fields provide state-of-the-art view synthesis quality but +tend to be slow to render. 
One reason is that they make use of volume +rendering, thus requiring many samples (and model queries) per ray at render +time. Although this representation is flexible and easy to optimize, most +real-world objects can be modeled more efficiently with surfaces instead of +volumes, requiring far fewer samples per ray. This observation has spurred +considerable progress in surface representations such as signed distance +functions, but these may struggle to model semi-opaque and thin structures. We +propose a method, HybridNeRF, that leverages the strengths of both +representations by rendering most objects as surfaces while modeling the +(typically) small fraction of challenging regions volumetrically. We evaluate +HybridNeRF against the challenging Eyeful Tower dataset along with other +commonly used view synthesis datasets. When comparing to state-of-the-art +baselines, including recent rasterization-based approaches, we improve error +rates by 15-30% while achieving real-time framerates (at least 36 FPS) for +virtual-reality resolutions (2Kx2K).",cs.CV,"['cs.CV', 'cs.GR', 'cs.LG']" +LTA-PCS: Learnable Task-Agnostic Point Cloud Sampling,Jiaheng Liu · Jianhao Li · Kaisiyuan Wang · Hongcheng Guo · Jian Yang · Junran Peng · Ke Xu · Xianglong Liu · Jinyang Guo, ,https://arxiv.org/abs/2404.00857,,2404.00857.pdf,Meta Episodic learning with Dynamic Task Sampling for CLIP-based Point Cloud Classification,"Point cloud classification refers to the process of assigning semantic labels +or categories to individual points within a point cloud data structure. Recent +works have explored the extension of pre-trained CLIP to 3D recognition. In +this direction, CLIP-based point cloud models like PointCLIP, CLIP2Point have +become state-of-the-art methods in the few-shot setup. Although these methods +show promising performance for some classes like airplanes, desks, guitars, +etc, the performance for some classes like the cup, flower pot, sink, +nightstand, etc is still far from satisfactory. This is due to the fact that +the adapter of CLIP-based models is trained using randomly sampled N-way K-shot +data in the standard supervised learning setup. In this paper, we propose a +novel meta-episodic learning framework for CLIP-based point cloud +classification, addressing the challenges of limited training examples and +sampling unknown classes. Additionally, we introduce dynamic task sampling +within the episode based on performance memory. This sampling strategy +effectively addresses the challenge of sampling unknown classes, ensuring that +the model learns from a diverse range of classes and promotes the exploration +of underrepresented categories. By dynamically updating the performance memory, +we adaptively prioritize the sampling of classes based on their performance, +enhancing the model's ability to handle challenging and real-world scenarios. +Experiments show an average performance gain of 3-6\% on ModelNet40 and +ScanobjectNN datasets in a few-shot setup.",cs.CV,['cs.CV'] +OMG: Towards Open-vocabulary Motion Generation via Mixture of Controllers,Han Liang · Jiacheng Bao · Ruichi Zhang · Sihan Ren · Yuecheng Xu · Sibei Yang · Xin Chen · Jingyi Yu · Lan Xu, ,https://arxiv.org/abs/2312.08985v3,,2312.08985v3.pdf,OMG: Towards Open-vocabulary Motion Generation via Mixture of Controllers,"We have recently seen tremendous progress in realistic text-to-motion +generation. Yet, the existing methods often fail or produce implausible motions +with unseen text inputs, which limits the applications. 
In this paper, we +present OMG, a novel framework, which enables compelling motion generation from +zero-shot open-vocabulary text prompts. Our key idea is to carefully tailor the +pretrain-then-finetune paradigm into the text-to-motion generation. At the +pre-training stage, our model improves the generation ability by learning the +rich out-of-domain inherent motion traits. To this end, we scale up a large +unconditional diffusion model up to 1B parameters, so as to utilize the massive +unlabeled motion data up to over 20M motion instances. At the subsequent +fine-tuning stage, we introduce motion ControlNet, which incorporates text +prompts as conditioning information, through a trainable copy of the +pre-trained model and the proposed novel Mixture-of-Controllers (MoC) block. +MoC block adaptively recognizes various ranges of the sub-motions with a +cross-attention mechanism and processes them separately with the +text-token-specific experts. Such a design effectively aligns the CLIP token +embeddings of text prompts to various ranges of compact and expressive motion +features. Extensive experiments demonstrate that our OMG achieves significant +improvements over the state-of-the-art methods on zero-shot text-to-motion +generation. Project page: https://tr3e.github.io/omg-page.",cs.CV,['cs.CV'] +FedSOL: Stabilized Orthogonal Learning with Proximal Restrictions in Federated Learning,Gihun Lee · Minchan Jeong · SangMook Kim · Jaehoon Oh · Se-Young Yun, ,https://arxiv.org/abs/2308.12532v6,,2308.12532v6.pdf,FedSOL: Stabilized Orthogonal Learning with Proximal Restrictions in Federated Learning,"Federated Learning (FL) aggregates locally trained models from individual +clients to construct a global model. While FL enables learning a model with +data privacy, it often suffers from significant performance degradation when +clients have heterogeneous data distributions. This data heterogeneity causes +the model to forget the global knowledge acquired from previously sampled +clients after being trained on local datasets. Although the introduction of +proximal objectives in local updates helps to preserve global knowledge, it can +also hinder local learning by interfering with local objectives. To address +this problem, we propose a novel method, Federated Stabilized Orthogonal +Learning (FedSOL), which adopts an orthogonal learning strategy to balance the +two conflicting objectives. FedSOL is designed to identify gradients of local +objectives that are inherently orthogonal to directions affecting the proximal +objective. Specifically, FedSOL targets parameter regions where learning on the +local objective is minimally influenced by proximal weight perturbations. Our +experiments demonstrate that FedSOL consistently achieves state-of-the-art +performance across various scenarios.",cs.LG,"['cs.LG', 'cs.AI', 'cs.CV']" +NC-SDF: Enhancing Indoor Scene Reconstruction Using Neural SDFs with View-Dependent Normal Compensation,Ziyi Chen · Xiaolong Wu · Yu Zhang, ,https://arxiv.org/abs/2405.00340,,2405.00340.pdf,NC-SDF: Enhancing Indoor Scene Reconstruction Using Neural SDFs with View-Dependent Normal Compensation,"State-of-the-art neural implicit surface representations have achieved +impressive results in indoor scene reconstruction by incorporating monocular +geometric priors as additional supervision. However, we have observed that +multi-view inconsistency between such priors poses a challenge for high-quality +reconstructions. 
In response, we present NC-SDF, a neural signed distance field +(SDF) 3D reconstruction framework with view-dependent normal compensation (NC). +Specifically, we integrate view-dependent biases in monocular normal priors +into the neural implicit representation of the scene. By adaptively learning +and correcting the biases, our NC-SDF effectively mitigates the adverse impact +of inconsistent supervision, enhancing both the global consistency and local +details in the reconstructions. To further refine the details, we introduce an +informative pixel sampling strategy to pay more attention to intricate geometry +with higher information content. Additionally, we design a hybrid geometry +modeling approach to improve the neural implicit representation. Experiments on +synthetic and real-world datasets demonstrate that NC-SDF outperforms existing +approaches in terms of reconstruction quality.",cs.CV,['cs.CV'] +GLID: Pre-training a Generalist Encoder-Decoder Vision Model,Jihao Liu · Jinliang Zheng · Yu Liu · Hongsheng Li,https://arxiv.org/abs/2404.07603,https://arxiv.org/abs/2404.07603,,2404.07603.pdf,GLID: Pre-training a Generalist Encoder-Decoder Vision Model,"This paper proposes a GeneraLIst encoder-Decoder (GLID) pre-training method +for better handling various downstream computer vision tasks. While +self-supervised pre-training approaches, e.g., Masked Autoencoder, have shown +success in transfer learning, task-specific sub-architectures are still +required to be appended for different downstream tasks, which cannot enjoy the +benefits of large-scale pre-training. GLID overcomes this challenge by allowing +the pre-trained generalist encoder-decoder to be fine-tuned on various vision +tasks with minimal task-specific architecture modifications. In the GLID +training scheme, pre-training pretext task and other downstream tasks are +modeled as ""query-to-answer"" problems, including the pre-training pretext task +and other downstream tasks. We pre-train a task-agnostic encoder-decoder with +query-mask pairs. During fine-tuning, GLID maintains the pre-trained +encoder-decoder and queries, only replacing the topmost linear transformation +layer with task-specific linear heads. This minimizes the pretrain-finetune +architecture inconsistency and enables the pre-trained model to better adapt to +downstream tasks. GLID achieves competitive performance on various vision +tasks, including object detection, image segmentation, pose estimation, and +depth estimation, outperforming or matching specialist models such as +Mask2Former, DETR, ViTPose, and BinsFormer.",cs.CV,['cs.CV'] +Your Transferability Barrier is Fragile: Free-Lunch for Transferring the Non-Transferable Learning,Ziming Hong · Li Shen · Tongliang Liu, ,,https://openreview.net/forum?id=FYKVPOHCpE,,,,,nan +Steganographic Passport: An Owner and User Verifiable Credential for Deep Model IP Protection Without Retraining,Qi Cui · Ruohan Meng · Chaohui Xu · Chip Hong Chang,https://github.com/TracyCuiq/Steganographic-Passport,https://arxiv.org/abs/2404.02889,,2404.02889.pdf,Steganographic Passport: An Owner and User Verifiable Credential for Deep Model IP Protection Without Retraining,"Ensuring the legal usage of deep models is crucial to promoting trustable, +accountable, and responsible artificial intelligence innovation. Current +passport-based methods that obfuscate model functionality for license-to-use +and ownership verifications suffer from capacity and quality constraints, as +they require retraining the owner model for new users. 
They are also vulnerable +to advanced Expanded Residual Block ambiguity attacks. We propose +Steganographic Passport, which uses an invertible steganographic network to +decouple license-to-use from ownership verification by hiding the user's +identity images into the owner-side passport and recovering them from their +respective user-side passports. An irreversible and collision-resistant hash +function is used to avoid exposing the owner-side passport from the derived +user-side passports and increase the uniqueness of the model signature. To +safeguard both the passport and model's weights against advanced ambiguity +attacks, an activation-level obfuscation is proposed for the verification +branch of the owner's model. By jointly training the verification and +deployment branches, their weights become tightly coupled. The proposed method +supports agile licensing of deep models by providing a strong ownership proof +and license accountability without requiring a separate model retraining for +the admission of every new user. Experiment results show that our +Steganographic Passport outperforms other passport-based deep model protection +methods in robustness against various known attacks.",cs.CR,"['cs.CR', 'cs.CV']" +NIVeL: Neural Implicit Vector Layers for Text-to-Vector Generation,Vikas Thamizharasan · Difan Liu · Matthew Fisher · Nanxuan Zhao · Evangelos Kalogerakis · Michal Lukáč, ,https://arxiv.org/abs/2405.15217,,2405.15217.pdf,NIVeL: Neural Implicit Vector Layers for Text-to-Vector Generation,"The success of denoising diffusion models in representing rich data +distributions over 2D raster images has prompted research on extending them to +other data representations, such as vector graphics. Unfortunately due to their +variable structure and scarcity of vector training data, directly applying +diffusion models on this domain remains a challenging problem. Using +workarounds like optimization via Score Distillation Sampling (SDS) is also +fraught with difficulty, as vector representations are non trivial to directly +optimize and tend to result in implausible geometries such as redundant or +self-intersecting shapes. NIVeL addresses these challenges by reinterpreting +the problem on an alternative, intermediate domain which preserves the +desirable properties of vector graphics -- mainly sparsity of representation +and resolution-independence. This alternative domain is based on neural +implicit fields expressed in a set of decomposable, editable layers. Based on +our experiments, NIVeL produces text-to-vector graphics results of +significantly better quality than the state-of-the-art.",cs.CV,"['cs.CV', 'cs.GR']" +GlitchBench: Can large multimodal models detect video game glitches?,Mohammad Reza Taesiri · Tianjun Feng · Cor-Paul Bezemer · Anh Nguyen, ,https://arxiv.org/abs/2312.05291,,2312.05291.pdf,GlitchBench: Can large multimodal models detect video game glitches?,"Large multimodal models (LMMs) have evolved from large language models (LLMs) +to integrate multiple input modalities, such as visual inputs. This integration +augments the capacity of LLMs for tasks requiring visual comprehension and +reasoning. However, the extent and limitations of their enhanced abilities are +not fully understood, especially when it comes to real-world tasks. To address +this gap, we introduce GlitchBench, a novel benchmark derived from video game +quality assurance tasks, to test and evaluate the reasoning capabilities of +LMMs. 
Our benchmark is curated from a variety of unusual and glitched scenarios +from video games and aims to challenge both the visual and linguistic reasoning +powers of LMMs in detecting and interpreting out-of-the-ordinary events. We +evaluate multiple state-of-the-art LMMs, and we show that GlitchBench presents +a new challenge for these models. Code and data are available at: +https://glitchbench.github.io/",cs.CV,"['cs.CV', 'cs.AI', 'cs.CL']" +ViT-CoMer: Vision Transformer with Convolutional Multi-scale Feature Interaction for Dense Predictions,Chunlong Xia · Xinliang Wang · Feng Lv · Xin Hao · Yifeng Shi, ,https://arxiv.org/abs/2403.07392,,2403.07392.pdf,ViT-CoMer: Vision Transformer with Convolutional Multi-scale Feature Interaction for Dense Predictions,"Although Vision Transformer (ViT) has achieved significant success in +computer vision, it does not perform well in dense prediction tasks due to the +lack of inner-patch information interaction and the limited diversity of +feature scale. Most existing studies are devoted to designing vision-specific +transformers to solve the above problems, which introduce additional +pre-training costs. Therefore, we present a plain, pre-training-free, and +feature-enhanced ViT backbone with Convolutional Multi-scale feature +interaction, named ViT-CoMer, which facilitates bidirectional interaction +between CNN and transformer. Compared to the state-of-the-art, ViT-CoMer has +the following advantages: (1) We inject spatial pyramid multi-receptive field +convolutional features into the ViT architecture, which effectively alleviates +the problems of limited local information interaction and single-feature +representation in ViT. (2) We propose a simple and efficient CNN-Transformer +bidirectional fusion interaction module that performs multi-scale fusion across +hierarchical features, which is beneficial for handling dense prediction tasks. +(3) We evaluate the performance of ViT-CoMer across various dense prediction +tasks, different frameworks, and multiple advanced pre-training. Notably, our +ViT-CoMer-L achieves 64.3% AP on COCO val2017 without extra training data, and +62.1% mIoU on ADE20K val, both of which are comparable to state-of-the-art +methods. We hope ViT-CoMer can serve as a new backbone for dense prediction +tasks to facilitate future research. The code will be released at +https://github.com/Traffic-X/ViT-CoMer.",cs.CV,['cs.CV'] +LiDAR4D: Dynamic Neural Fields for Novel Space-time View LiDAR Synthesis,Zehan Zheng · Fan Lu · Weiyi Xue · Guang Chen · Changjun Jiang,https://dyfcalid.github.io/LiDAR4D,https://arxiv.org/abs/2404.02742,,2404.02742.pdf,LiDAR4D: Dynamic Neural Fields for Novel Space-time View LiDAR Synthesis,"Although neural radiance fields (NeRFs) have achieved triumphs in image novel +view synthesis (NVS), LiDAR NVS remains largely unexplored. Previous LiDAR NVS +methods employ a simple shift from image NVS methods while ignoring the dynamic +nature and the large-scale reconstruction problem of LiDAR point clouds. In +light of this, we propose LiDAR4D, a differentiable LiDAR-only framework for +novel space-time LiDAR view synthesis. In consideration of the sparsity and +large-scale characteristics, we design a 4D hybrid representation combined with +multi-planar and grid features to achieve effective reconstruction in a +coarse-to-fine manner. Furthermore, we introduce geometric constraints derived +from point clouds to improve temporal consistency. 
For the realistic synthesis +of LiDAR point clouds, we incorporate the global optimization of ray-drop +probability to preserve cross-region patterns. Extensive experiments on +KITTI-360 and NuScenes datasets demonstrate the superiority of our method in +accomplishing geometry-aware and time-consistent dynamic reconstruction. Codes +are available at https://github.com/ispc-lab/LiDAR4D.",cs.CV,['cs.CV'] +AETTA: Label-Free Accuracy Estimation for Test-Time Adaptation,Taeckyung Lee · Sorn Chottananurak · Taesik Gong · Sung-Ju Lee,https://nmsl.kaist.ac.kr/projects/aetta/,https://arxiv.org/abs/2404.01351,,2404.01351.pdf,AETTA: Label-Free Accuracy Estimation for Test-Time Adaptation,"Test-time adaptation (TTA) has emerged as a viable solution to adapt +pre-trained models to domain shifts using unlabeled test data. However, TTA +faces challenges of adaptation failures due to its reliance on blind adaptation +to unknown test samples in dynamic scenarios. Traditional methods for +out-of-distribution performance estimation are limited by unrealistic +assumptions in the TTA context, such as requiring labeled data or re-training +models. To address this issue, we propose AETTA, a label-free accuracy +estimation algorithm for TTA. We propose the prediction disagreement as the +accuracy estimate, calculated by comparing the target model prediction with +dropout inferences. We then improve the prediction disagreement to extend the +applicability of AETTA under adaptation failures. Our extensive evaluation with +four baselines and six TTA methods demonstrates that AETTA shows an average of +19.8%p more accurate estimation compared with the baselines. We further +demonstrate the effectiveness of accuracy estimation with a model recovery case +study, showcasing the practicality of our model recovery based on accuracy +estimation. The source code is available at https://github.com/taeckyung/AETTA.",cs.LG,"['cs.LG', 'cs.AI', 'cs.CV']" +Adversarial Distillation Based on Slack Matching and Attribution Region Alignment,Shenglin Yin · Zhen Xiao · Mingxuan Song · Jieyi Long, ,https://arxiv.org/abs/2312.08912,,2312.08912.pdf,Dataset Distillation via Adversarial Prediction Matching,"Dataset distillation is the technique of synthesizing smaller condensed +datasets from large original datasets while retaining necessary information to +persist the effect. In this paper, we approach the dataset distillation problem +from a novel perspective: we regard minimizing the prediction discrepancy on +the real data distribution between models, which are respectively trained on +the large original dataset and on the small distilled dataset, as a conduit for +condensing information from the raw data into the distilled version. An +adversarial framework is proposed to solve the problem efficiently. In contrast +to existing distillation methods involving nested optimization or long-range +gradient unrolling, our approach hinges on single-level optimization. This +ensures the memory efficiency of our method and provides a flexible tradeoff +between time and memory budgets, allowing us to distil ImageNet-1K using a +minimum of only 6.5GB of GPU memory. Under the optimal tradeoff strategy, it +requires only 2.5$\times$ less memory and 5$\times$ less runtime compared to +the state-of-the-art. 
Empirically, our method can produce synthetic datasets +just 10% the size of the original, yet achieve, on average, 94% of the test +accuracy of models trained on the full original datasets including ImageNet-1K, +significantly surpassing state-of-the-art. Additionally, extensive tests reveal +that our distilled datasets excel in cross-architecture generalization +capabilities.",cs.CV,['cs.CV'] +ASH: Animatable Gaussian Splats for Efficient and Photoreal Human Rendering,Haokai Pang · Heming Zhu · Adam Kortylewski · Christian Theobalt · Marc Habermann,https://vcai.mpi-inf.mpg.de/projects/ash/,https://arxiv.org/abs/2312.05941,,2312.05941.pdf,ASH: Animatable Gaussian Splats for Efficient and Photoreal Human Rendering,"Real-time rendering of photorealistic and controllable human avatars stands +as a cornerstone in Computer Vision and Graphics. While recent advances in +neural implicit rendering have unlocked unprecedented photorealism for digital +avatars, real-time performance has mostly been demonstrated for static scenes +only. To address this, we propose ASH, an animatable Gaussian splatting +approach for photorealistic rendering of dynamic humans in real-time. We +parameterize the clothed human as animatable 3D Gaussians, which can be +efficiently splatted into image space to generate the final rendering. However, +naively learning the Gaussian parameters in 3D space poses a severe challenge +in terms of compute. Instead, we attach the Gaussians onto a deformable +character model, and learn their parameters in 2D texture space, which allows +leveraging efficient 2D convolutional architectures that easily scale with the +required number of Gaussians. We benchmark ASH with competing methods on +pose-controllable avatars, demonstrating that our method outperforms existing +real-time methods by a large margin and shows comparable or even better results +than offline methods.",cs.CV,['cs.CV'] +Design2Cloth: 3D Cloth Generation from 2D Masks,Jiali Zheng · Rolandos Alexandros Potamias · Stefanos Zafeiriou, ,https://arxiv.org/abs/2404.02686,,2404.02686.pdf,Design2Cloth: 3D Cloth Generation from 2D Masks,"In recent years, there has been a significant shift in the field of digital +avatar research, towards modeling, animating and reconstructing clothed human +representations, as a key step towards creating realistic avatars. However, +current 3D cloth generation methods are garment specific or trained completely +on synthetic data, hence lacking fine details and realism. In this work, we +make a step towards automatic realistic garment design and propose +Design2Cloth, a high fidelity 3D generative model trained on a real world +dataset from more than 2000 subject scans. To provide vital contribution to the +fashion industry, we developed a user-friendly adversarial model capable of +generating diverse and detailed clothes simply by drawing a 2D cloth mask. +Under a series of both qualitative and quantitative experiments, we showcase +that Design2Cloth outperforms current state-of-the-art cloth generative models +by a large margin. In addition to the generative properties of our network, we +showcase that the proposed method can be used to achieve high quality +reconstructions from single in-the-wild images and 3D scans. 
Dataset, code and +pre-trained model will become publicly available.",cs.CV,['cs.CV'] +Revisiting the Domain Shift and Sample Uncertainty in Multi-source Active Domain Transfer,Wenqiao Zhang · Zheqi Lv, ,https://arxiv.org/abs/2311.12905,,2311.12905.pdf,Revisiting the Domain Shift and Sample Uncertainty in Multi-source Active Domain Transfer,"Active Domain Adaptation (ADA) aims to maximally boost model adaptation in a +new target domain by actively selecting a limited number of target data to +annotate.This setting neglects the more practical scenario where training data +are collected from multiple sources. This motivates us to target a new and +challenging setting of knowledge transfer that extends ADA from a single source +domain to multiple source domains, termed Multi-source Active Domain Adaptation +(MADA). Not surprisingly, we find that most traditional ADA methods cannot work +directly in such a setting, mainly due to the excessive domain gap introduced +by all the source domains and thus their uncertainty-aware sample selection can +easily become miscalibrated under the multi-domain shifts. Considering this, we +propose a Dynamic integrated uncertainty valuation framework(Detective) that +comprehensively consider the domain shift between multi-source domains and +target domain to detect the informative target samples. Specifically, the +leverages a dynamic Domain Adaptation(DA) model that learns how to adapt the +model's parameters to fit the union of multi-source domains. This enables an +approximate single-source domain modeling by the dynamic model. We then +comprehensively measure both domain uncertainty and predictive uncertainty in +the target domain to detect informative target samples using evidential deep +learning, thereby mitigating uncertainty miscalibration. Furthermore, we +introduce a contextual diversity-aware calculator to enhance the diversity of +the selected samples. Experiments demonstrate that our solution outperforms +existing methods by a considerable margin on three domain adaptation +benchmarks.",cs.AI,"['cs.AI', 'cs.LG']" +Attentive Illumination Decomposition Model for Multi-Illuminant White Balancing,Dongyoung Kim · Jinwoo Kim · Junsang Yu · Seon Joo Kim,https://www.dykim.me/projects/aid,https://arxiv.org/abs/2402.18277,,2402.18277.pdf,Attentive Illumination Decomposition Model for Multi-Illuminant White Balancing,"White balance (WB) algorithms in many commercial cameras assume single and +uniform illumination, leading to undesirable results when multiple lighting +sources with different chromaticities exist in the scene. Prior research on +multi-illuminant WB typically predicts illumination at the pixel level without +fully grasping the scene's actual lighting conditions, including the number and +color of light sources. This often results in unnatural outcomes lacking in +overall consistency. To handle this problem, we present a deep white balancing +model that leverages the slot attention, where each slot is in charge of +representing individual illuminants. This design enables the model to generate +chromaticities and weight maps for individual illuminants, which are then fused +to compose the final illumination map. Furthermore, we propose the +centroid-matching loss, which regulates the activation of each slot based on +the color range, thereby enhancing the model to separate illumination more +effectively. 
Our method achieves the state-of-the-art performance on both +single- and multi-illuminant WB benchmarks, and also offers additional +information such as the number of illuminants in the scene and their +chromaticity. This capability allows for illumination editing, an application +not feasible with prior methods.",cs.CV,['cs.CV'] +Vista-LLaMA: Reliable Video Teller via Equal Distance to Visual Tokens,Fan Ma · Xiaojie Jin · Heng Wang · Yuchen Xian · Jiashi Feng · Yi Yang, ,https://arxiv.org/abs/2312.08870,,2312.08870.pdf,Vista-LLaMA: Reliable Video Narrator via Equal Distance to Visual Tokens,"Recent advances in large video-language models have displayed promising +outcomes in video comprehension. Current approaches straightforwardly convert +video into language tokens and employ large language models for multi-modal +tasks. However, this method often leads to the generation of irrelevant +content, commonly known as ""hallucination"", as the length of the text increases +and the impact of the video diminishes. To address this problem, we propose +Vista-LLaMA, a novel framework that maintains the consistent distance between +all visual tokens and any language tokens, irrespective of the generated text +length. Vista-LLaMA omits relative position encoding when determining attention +weights between visual and text tokens, retaining the position encoding for +text and text tokens. This amplifies the effect of visual tokens on text +generation, especially when the relative distance is longer between visual and +text tokens. The proposed attention mechanism significantly reduces the chance +of producing irrelevant text related to the video content. Furthermore, we +present a sequential visual projector that projects the current video frame +into tokens of language space with the assistance of the previous frame. This +approach not only captures the temporal relationship within the video, but also +allows less visual tokens to encompass the entire video. Our approach +significantly outperforms various previous methods (e.g., Video-ChatGPT, +MovieChat) on four challenging open-ended video question answering benchmarks. +We reach an accuracy of 60.7 on the zero-shot NExT-QA and 60.5 on the zero-shot +MSRVTT-QA, setting a new state-of-the-art performance. This project is +available at https://jinxxian.github.io/Vista-LLaMA.",cs.CV,['cs.CV'] +Data-Efficient Unsupervised Interpolation Without Any Intermediate Frame for 4D Medical Images,JungEun Kim · Hangyul Yoon · Geondo Park · Kyungsu Kim · Eunho Yang, ,https://arxiv.org/abs/2404.01464,,2404.01464.pdf,Data-Efficient Unsupervised Interpolation Without Any Intermediate Frame for 4D Medical Images,"4D medical images, which represent 3D images with temporal information, are +crucial in clinical practice for capturing dynamic changes and monitoring +long-term disease progression. However, acquiring 4D medical images poses +challenges due to factors such as radiation exposure and imaging duration, +necessitating a balance between achieving high temporal resolution and +minimizing adverse effects. Given these circumstances, not only is data +acquisition challenging, but increasing the frame rate for each dataset also +proves difficult. To address this challenge, this paper proposes a simple yet +effective Unsupervised Volumetric Interpolation framework, UVI-Net. This +framework facilitates temporal interpolation without the need for any +intermediate frames, distinguishing it from the majority of other existing +unsupervised methods. 
Experiments on benchmark datasets demonstrate significant +improvements across diverse evaluation metrics compared to unsupervised and +supervised baselines. Remarkably, our approach achieves this superior +performance even when trained with a dataset as small as one, highlighting its +exceptional robustness and efficiency in scenarios with sparse supervision. +This positions UVI-Net as a compelling alternative for 4D medical imaging, +particularly in settings where data availability is limited. The source code is +available at https://github.com/jungeun122333/UVI-Net.",eess.IV,"['eess.IV', 'cs.AI', 'cs.CV', 'cs.LG']" +ERMVP: Communication-Efficient and Collaboration-Robust Multi-Vehicle Perception in Challenging Environments,Jingyu Zhang · Kun Yang · Yilei Wang · Hanqi Wang · Peng Sun · Liang Song, ,https://arxiv.org/abs/2307.13929v3,,2307.13929v3.pdf,Spatio-Temporal Domain Awareness for Multi-Agent Collaborative Perception,"Multi-agent collaborative perception as a potential application for +vehicle-to-everything communication could significantly improve the perception +performance of autonomous vehicles over single-agent perception. However, +several challenges remain in achieving pragmatic information sharing in this +emerging research. In this paper, we propose SCOPE, a novel collaborative +perception framework that aggregates the spatio-temporal awareness +characteristics across on-road agents in an end-to-end manner. Specifically, +SCOPE has three distinct strengths: i) it considers effective semantic cues of +the temporal context to enhance current representations of the target agent; +ii) it aggregates perceptually critical spatial information from heterogeneous +agents and overcomes localization errors via multi-scale feature interactions; +iii) it integrates multi-source representations of the target agent based on +their complementary contributions by an adaptive fusion paradigm. To thoroughly +evaluate SCOPE, we consider both real-world and simulated scenarios of +collaborative 3D object detection tasks on three datasets. Extensive +experiments demonstrate the superiority of our approach and the necessity of +the proposed components.",cs.CV,['cs.CV'] +HIG: Hierarchical Interlacement Graph Approach to Scene Graph Generation in Video Understanding,Trong-Thuan Nguyen · Pha Nguyen · Khoa Luu,https://uark-cviu.github.io/ASPIRe/,https://arxiv.org/abs/2312.03050,,2312.03050.pdf,HIG: Hierarchical Interlacement Graph Approach to Scene Graph Generation in Video Understanding,"Visual interactivity understanding within visual scenes presents a +significant challenge in computer vision. Existing methods focus on complex +interactivities while leveraging a simple relationship model. These methods, +however, struggle with a diversity of appearance, situation, position, +interaction, and relation in videos. This limitation hinders the ability to +fully comprehend the interplay within the complex visual dynamics of subjects. +In this paper, we delve into interactivities understanding within visual +content by deriving scene graph representations from dense interactivities +among humans and objects. To achieve this goal, we first present a new dataset +containing Appearance-Situation-Position-Interaction-Relation predicates, named +ASPIRe, offering an extensive collection of videos marked by a wide range of +interactivities. 
Then, we propose a new approach named Hierarchical +Interlacement Graph (HIG), which leverages a unified layer and graph within a +hierarchical structure to provide deep insights into scene changes across five +distinct tasks. Our approach demonstrates superior performance to other methods +through extensive experiments conducted in various scenarios.",cs.CV,['cs.CV'] +A Picture is Worth More Than 77 Text Tokens: Evaluating CLIP-Style Models on Dense Captions,Jack Urbanek · Florian Bordes · Pietro Astolfi · Mary Williamson · Vasu Sharma · Adriana Romero-Soriano, ,https://arxiv.org/abs/2312.08578,,2312.08578.pdf,A Picture is Worth More Than 77 Text Tokens: Evaluating CLIP-Style Models on Dense Captions,"Curation methods for massive vision-language datasets trade off between +dataset size and quality. However, even the highest quality of available +curated captions are far too short to capture the rich visual detail in an +image. To show the value of dense and highly-aligned image-text pairs, we +collect the Densely Captioned Images (DCI) dataset, containing 8012 natural +images human-annotated with mask-aligned descriptions averaging above 1000 +words each. With precise and reliable captions associated with specific parts +of an image, we can evaluate vision-language models' (VLMs) understanding of +image content with a novel task that matches each caption with its +corresponding subcrop. As current models are often limited to 77 text tokens, +we also introduce a summarized version (sDCI) in which each caption length is +limited. We show that modern techniques that make progress on standard +benchmarks do not correspond with significant improvement on our sDCI based +benchmark. Lastly, we finetune CLIP using sDCI and show significant +improvements over the baseline despite a small training set. By releasing the +first human annotated dense image captioning dataset, we hope to enable the +development of new benchmarks or fine-tuning recipes for the next generation of +VLMs to come.",cs.CV,['cs.CV'] +Scaling Diffusion Models to Real-World 3D LiDAR Scene Completion,Lucas Nunes · Rodrigo Marcuzzi · Benedikt Mersch · Jens Behley · Cyrill Stachniss,https://github.com/PRBonn/LiDiff,https://arxiv.org/html/2403.13470v1,,2403.13470v1.pdf,Scaling Diffusion Models to Real-World 3D LiDAR Scene Completion,"Computer vision techniques play a central role in the perception stack of +autonomous vehicles. Such methods are employed to perceive the vehicle +surroundings given sensor data. 3D LiDAR sensors are commonly used to collect +sparse 3D point clouds from the scene. However, compared to human perception, +such systems struggle to deduce the unseen parts of the scene given those +sparse point clouds. In this matter, the scene completion task aims at +predicting the gaps in the LiDAR measurements to achieve a more complete scene +representation. Given the promising results of recent diffusion models as +generative models for images, we propose extending them to achieve scene +completion from a single 3D LiDAR scan. Previous works used diffusion models +over range images extracted from LiDAR data, directly applying image-based +diffusion methods. Distinctly, we propose to directly operate on the points, +reformulating the noising and denoising diffusion process such that it can +efficiently work at scene scale. Together with our approach, we propose a +regularization loss to stabilize the noise predicted during the denoising +process. 
Our experimental evaluation shows that our method can complete the +scene given a single LiDAR scan as input, producing a scene with more details +compared to state-of-the-art scene completion methods. We believe that our +proposed diffusion process formulation can support further research in +diffusion models applied to scene-scale point cloud data.",cs.CV,['cs.CV'] +Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis,Willi Menapace · Aliaksandr Siarohin · Ivan Skorokhodov · Ekaterina Deyneka · Tsai-Shien Chen · Anil Kag · Yuwei Fang · Aleksei Stoliar · Elisa Ricci · Jian Ren · Sergey Tulyakov, ,https://arxiv.org/abs/2402.14797,,2402.14797.pdf,Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis,"Contemporary models for generating images show remarkable quality and +versatility. Swayed by these advantages, the research community repurposes them +to generate videos. Since video content is highly redundant, we argue that +naively bringing advances of image models to the video generation domain +reduces motion fidelity, visual quality and impairs scalability. In this work, +we build Snap Video, a video-first model that systematically addresses these +challenges. To do that, we first extend the EDM framework to take into account +spatially and temporally redundant pixels and naturally support video +generation. Second, we show that a U-Net - a workhorse behind image generation +- scales poorly when generating videos, requiring significant computational +overhead. Hence, we propose a new transformer-based architecture that trains +3.31 times faster than U-Nets (and is ~4.5 faster at inference). This allows us +to efficiently train a text-to-video model with billions of parameters for the +first time, reach state-of-the-art results on a number of benchmarks, and +generate videos with substantially higher quality, temporal consistency, and +motion complexity. The user studies showed that our model was favored by a +large margin over the most recent methods. See our website at +https://snap-research.github.io/snapvideo/.",cs.CV,"['cs.CV', 'cs.AI']" +ADFactory: An Effective Framework for Generalizing Optical Flow with NeRF,Han Ling · Quansen Sun · Yinghui Sun · Xian Xu · Xingfeng Li, ,https://arxiv.org/abs/2311.04246,,2311.04246.pdf,ADFactory: An Effective Framework for Generalizing Optical Flow with Nerf,"A significant challenge facing current optical flow methods is the difficulty +in generalizing them well to the real world. This is mainly due to the high +cost of hand-crafted datasets, and existing self-supervised methods are limited +by indirect loss and occlusions, resulting in fuzzy outcomes. To address this +challenge, we introduce a novel optical flow training framework: automatic data +factory (ADF). ADF only requires RGB images as input to effectively train the +optical flow network on the target data domain. Specifically, we use advanced +Nerf technology to reconstruct scenes from photo groups collected by a +monocular camera, and then calculate optical flow labels between camera pose +pairs based on the rendering results. To eliminate erroneous labels caused by +defects in the scene reconstructed by Nerf, we screened the generated labels +from multiple aspects, such as optical flow matching accuracy, radiation field +confidence, and depth consistency. The filtered labels can be directly used for +network supervision. 
Experimentally, the generalization ability of ADF on KITTI +surpasses existing self-supervised optical flow and monocular scene flow +algorithms. In addition, ADF achieves impressive results in real-world +zero-point generalization evaluations and surpasses most supervised methods.",cs.CV,['cs.CV'] +Real-IAD: A Real-World Multi-View Dataset for Benchmarking Versatile Industrial Anomaly Detection,Chengjie Wang · wenbing zhu · Bin-Bin Gao · Zhenye Gan · Jiangning Zhang · Zhihao Gu · Bruce Qian · Mingang Chen · Lizhuang Ma, ,https://arxiv.org/abs/2403.12580,,2403.12580.pdf,Real-IAD: A Real-World Multi-View Dataset for Benchmarking Versatile Industrial Anomaly Detection,"Industrial anomaly detection (IAD) has garnered significant attention and +experienced rapid development. However, the recent development of IAD approach +has encountered certain difficulties due to dataset limitations. On the one +hand, most of the state-of-the-art methods have achieved saturation (over 99% +in AUROC) on mainstream datasets such as MVTec, and the differences of methods +cannot be well distinguished, leading to a significant gap between public +datasets and actual application scenarios. On the other hand, the research on +various new practical anomaly detection settings is limited by the scale of the +dataset, posing a risk of overfitting in evaluation results. Therefore, we +propose a large-scale, Real-world, and multi-view Industrial Anomaly Detection +dataset, named Real-IAD, which contains 150K high-resolution images of 30 +different objects, an order of magnitude larger than existing datasets. It has +a larger range of defect area and ratio proportions, making it more challenging +than previous datasets. To make the dataset closer to real application +scenarios, we adopted a multi-view shooting method and proposed sample-level +evaluation metrics. In addition, beyond the general unsupervised anomaly +detection setting, we propose a new setting for Fully Unsupervised Industrial +Anomaly Detection (FUIAD) based on the observation that the yield rate in +industrial production is usually greater than 60%, which has more practical +application value. Finally, we report the results of popular IAD methods on the +Real-IAD dataset, providing a highly challenging benchmark to promote the +development of the IAD field.",cs.CV,['cs.CV'] +Multiview Aerial Visual RECognition (MAVREC) Dataset: Can Multi-view Improve Aerial Visual Perception?,Aritra Dutta · Srijan Das · Jacob Nielsen · RAJATSUBHRA CHAKRABORTY · Mubarak Shah, ,https://arxiv.org/abs/2312.04548,,2312.04548.pdf,Multiview Aerial Visual Recognition (MAVREC): Can Multi-view Improve Aerial Visual Perception?,"Despite the commercial abundance of UAVs, aerial data acquisition remains +challenging, and the existing Asia and North America-centric open-source UAV +datasets are small-scale or low-resolution and lack diversity in scene +contextuality. Additionally, the color content of the scenes, solar-zenith +angle, and population density of different geographies influence the data +diversity. These two factors conjointly render suboptimal aerial-visual +perception of the deep neural network (DNN) models trained primarily on the +ground-view data, including the open-world foundational models. + To pave the way for a transformative era of aerial detection, we present +Multiview Aerial Visual RECognition or MAVREC, a video dataset where we record +synchronized scenes from different perspectives -- ground camera and +drone-mounted camera. 
MAVREC consists of around 2.5 hours of industry-standard +2.7K resolution video sequences, more than 0.5 million frames, and 1.1 million +annotated bounding boxes. This makes MAVREC the largest ground and aerial-view +dataset, and the fourth largest among all drone-based datasets across all +modalities and tasks. Through our extensive benchmarking on MAVREC, we +recognize that augmenting object detectors with ground-view images from the +corresponding geographical location is a superior pre-training strategy for +aerial detection. Building on this strategy, we benchmark MAVREC with a +curriculum-based semi-supervised object detection approach that leverages +labeled (ground and aerial) and unlabeled (only aerial) images to enhance the +aerial detection. We publicly release the MAVREC dataset: +https://mavrec.github.io.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG', 'I.4.0; I.4.8; I.5.1; I.5.4; I.2.10']" +SpatialTracker: Tracking Any 2D Pixels in 3D Space,Yuxi Xiao · Qianqian Wang · Shangzhan Zhang · Nan Xue · Sida Peng · Yujun Shen · Xiaowei Zhou, ,https://arxiv.org/abs/2404.04319,,2404.04319.pdf,SpatialTracker: Tracking Any 2D Pixels in 3D Space,"Recovering dense and long-range pixel motion in videos is a challenging +problem. Part of the difficulty arises from the 3D-to-2D projection process, +leading to occlusions and discontinuities in the 2D motion domain. While 2D +motion can be intricate, we posit that the underlying 3D motion can often be +simple and low-dimensional. In this work, we propose to estimate point +trajectories in 3D space to mitigate the issues caused by image projection. Our +method, named SpatialTracker, lifts 2D pixels to 3D using monocular depth +estimators, represents the 3D content of each frame efficiently using a +triplane representation, and performs iterative updates using a transformer to +estimate 3D trajectories. Tracking in 3D allows us to leverage +as-rigid-as-possible (ARAP) constraints while simultaneously learning a +rigidity embedding that clusters pixels into different rigid parts. Extensive +evaluation shows that our approach achieves state-of-the-art tracking +performance both qualitatively and quantitatively, particularly in challenging +scenarios such as out-of-plane rotation.",cs.CV,['cs.CV'] +SVDinsTN: A Tensor Network Paradigm for Efficient Structure Search from Regularized Modeling Perspective,Yu-Bang Zheng · Xile Zhao · Junhua Zeng · Chao Li · Qibin Zhao · Heng-Chao Li · Ting-Zhu Huang,https://yubangzheng.github.io,,https://zhaoxile.github.io/index.html,,,,,nan +LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge,Gongwei Chen · Leyang Shen · Rui Shao · Xiang Deng · Liqiang Nie, ,https://arxiv.org/abs/2311.11860,,2311.11860.pdf,LION : Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge,"Multimodal Large Language Models (MLLMs) have endowed LLMs with the ability +to perceive and understand multi-modal signals. However, most of the existing +MLLMs mainly adopt vision encoders pretrained on coarsely aligned image-text +pairs, leading to insufficient extraction and reasoning of visual knowledge. To +address this issue, we devise a dual-Level vIsual knOwledge eNhanced Multimodal +Large Language Model (LION), which empowers the MLLM by injecting visual +knowledge in two levels. 1) Progressive incorporation of fine-grained +spatial-aware visual knowledge. 
We design a vision aggregator cooperated with +region-level vision-language (VL) tasks to incorporate fine-grained +spatial-aware visual knowledge into the MLLM. To alleviate the conflict between +image-level and region-level VL tasks during incorporation, we devise a +dedicated stage-wise instruction-tuning strategy with mixture-of-adapters. This +progressive incorporation scheme contributes to the mutual promotion between +these two kinds of VL tasks. 2) Soft prompting of high-level semantic visual +evidence. We facilitate the MLLM with high-level semantic visual evidence by +leveraging diverse image tags. To mitigate the potential influence caused by +imperfect predicted tags, we propose a soft prompting method by embedding a +learnable token into the tailored text instruction. Comprehensive experiments +on several multi-modal benchmarks demonstrate the superiority of our model +(e.g., improvement of 5% accuracy on VSR and 3% CIDEr on TextCaps over +InstructBLIP, 5% accuracy on RefCOCOg over Kosmos-2).",cs.CV,['cs.CV'] +LEAD: Learning Decomposition for Source-free Universal Domain Adaptation,Sanqing Qu · Tianpei Zou · Lianghua He · Florian Röhrbein · Alois Knoll · Guang Chen · Changjun Jiang,https://github.com/ispc-lab/LEAD,https://arxiv.org/abs/2403.03421,,2403.03421.pdf,LEAD: Learning Decomposition for Source-free Universal Domain Adaptation,"Universal Domain Adaptation (UniDA) targets knowledge transfer in the +presence of both covariate and label shifts. Recently, Source-free Universal +Domain Adaptation (SF-UniDA) has emerged to achieve UniDA without access to +source data, which tends to be more practical due to data protection policies. +The main challenge lies in determining whether covariate-shifted samples belong +to target-private unknown categories. Existing methods tackle this either +through hand-crafted thresholding or by developing time-consuming iterative +clustering strategies. In this paper, we propose a new idea of LEArning +Decomposition (LEAD), which decouples features into source-known and -unknown +components to identify target-private data. Technically, LEAD initially +leverages the orthogonal decomposition analysis for feature decomposition. +Then, LEAD builds instance-level decision boundaries to adaptively identify +target-private data. Extensive experiments across various UniDA scenarios have +demonstrated the effectiveness and superiority of LEAD. Notably, in the OPDA +scenario on VisDA dataset, LEAD outperforms GLC by 3.5% overall H-score and +reduces 75% time to derive pseudo-labeling decision boundaries. Besides, LEAD +is also appealing in that it is complementary to most existing methods. The +code is available at https://github.com/ispc-lab/LEAD.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations,Evonne Ng · Javier Romero · Timur Bagautdinov · Shaojie Bai · Trevor Darrell · Angjoo Kanazawa · Alexander Richard, ,https://arxiv.org/abs/2401.01885,,2401.01885.pdf,From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations,"We present a framework for generating full-bodied photorealistic avatars that +gesture according to the conversational dynamics of a dyadic interaction. Given +speech audio, we output multiple possibilities of gestural motion for an +individual, including face, body, and hands. 
The key behind our method is in +combining the benefits of sample diversity from vector quantization with the +high-frequency details obtained through diffusion to generate more dynamic, +expressive motion. We visualize the generated motion using highly +photorealistic avatars that can express crucial nuances in gestures (e.g. +sneers and smirks). To facilitate this line of research, we introduce a +first-of-its-kind multi-view conversational dataset that allows for +photorealistic reconstruction. Experiments show our model generates appropriate +and diverse gestures, outperforming both diffusion- and VQ-only methods. +Furthermore, our perceptual evaluation highlights the importance of +photorealism (vs. meshes) in accurately assessing subtle motion details in +conversational gestures. Code and dataset available online.",cs.CV,['cs.CV'] +Frequency Decoupling for Motion Magnification via Multi-Level Isomorphic Architecture,Fei Wang · Dan Guo · Kun Li · Zhun Zhong · Meng Wang, ,https://arxiv.org/abs/2403.07347,,2403.07347.pdf,Frequency Decoupling for Motion Magnification via Multi-Level Isomorphic Architecture,"Video Motion Magnification (VMM) aims to reveal subtle and imperceptible +motion information of objects in the macroscopic world. Prior methods directly +model the motion field from the Eulerian perspective by Representation Learning +that separates shape and texture or Multi-domain Learning from phase +fluctuations. Inspired by the frequency spectrum, we observe that the +low-frequency components with stable energy always possess spatial structure +and less noise, making them suitable for modeling the subtle motion field. To +this end, we present FD4MM, a new paradigm of Frequency Decoupling for Motion +Magnification with a Multi-level Isomorphic Architecture to capture multi-level +high-frequency details and a stable low-frequency structure (motion field) in +video space. Since high-frequency details and subtle motions are susceptible to +information degradation due to their inherent subtlety and unavoidable external +interference from noise, we carefully design Sparse High/Low-pass Filters to +enhance the integrity of details and motion structures, and a Sparse Frequency +Mixer to promote seamless recoupling. Besides, we innovatively design a +contrastive regularization for this task to strengthen the model's ability to +discriminate irrelevant features, reducing undesired motion magnification. +Extensive experiments on both Real-world and Synthetic Datasets show that our +FD4MM outperforms SOTA methods. Meanwhile, FD4MM reduces FLOPs by 1.63$\times$ +and boosts inference speed by 1.68$\times$ than the latest method. Our code is +available at https://github.com/Jiafei127/FD4MM.",cs.CV,['cs.CV'] +LLM-AR: When Large Language Model Meets Skeleton-Based Action Recognition,Haoxuan Qu · Yujun Cai · Jun Liu, ,https://arxiv.org/abs/2404.00532,,2404.00532.pdf,LLMs are Good Action Recognizers,"Skeleton-based action recognition has attracted lots of research attention. +Recently, to build an accurate skeleton-based action recognizer, a variety of +works have been proposed. Among them, some works use large model architectures +as backbones of their recognizers to boost the skeleton data representation +capability, while some other works pre-train their recognizers on external data +to enrich the knowledge. 
In this work, we observe that large language models +which have been extensively used in various natural language processing tasks +generally hold both large model architectures and rich implicit knowledge. +Motivated by this, we propose a novel LLM-AR framework, in which we investigate +treating the Large Language Model as an Action Recognizer. In our framework, we +propose a linguistic projection process to project each input action signal +(i.e., each skeleton sequence) into its ``sentence format'' (i.e., an ``action +sentence''). Moreover, we also incorporate our framework with several designs +to further facilitate this linguistic projection process. Extensive experiments +demonstrate the efficacy of our proposed framework.",cs.CV,['cs.CV'] +Dynamic Adapter Meets Prompt Tuning: Parameter-Efficient Transfer Learning for Point Cloud Analysis,Xin Zhou · Dingkang Liang · Wei Xu · Xingkui Zhu · Yihan Xu · Zhikang Zou · Xiang Bai, ,https://arxiv.org/abs/2403.01439,,2403.01439.pdf,Dynamic Adapter Meets Prompt Tuning: Parameter-Efficient Transfer Learning for Point Cloud Analysis,"Point cloud analysis has achieved outstanding performance by transferring +point cloud pre-trained models. However, existing methods for model adaptation +usually update all model parameters, i.e., full fine-tuning paradigm, which is +inefficient as it relies on high computational costs (e.g., training GPU +memory) and massive storage space. In this paper, we aim to study +parameter-efficient transfer learning for point cloud analysis with an ideal +trade-off between task performance and parameter efficiency. To achieve this +goal, we freeze the parameters of the default pre-trained models and then +propose the Dynamic Adapter, which generates a dynamic scale for each token, +considering the token significance to the downstream task. We further +seamlessly integrate Dynamic Adapter with Prompt Tuning (DAPT) by constructing +Internal Prompts, capturing the instance-specific features for interaction. +Extensive experiments conducted on five challenging datasets demonstrate that +the proposed DAPT achieves superior performance compared to the full +fine-tuning counterparts while significantly reducing the trainable parameters +and training GPU memory by 95% and 35%, respectively. Code is available at +https://github.com/LMD0311/DAPT.",cs.CV,['cs.CV'] +Link-Context Learning for Multimodal LLMs,Yan Tai · Weichen Fan · Zhao Zhang · Ziwei Liu, ,https://arxiv.org/abs/2308.07891,,2308.07891.pdf,Link-Context Learning for Multimodal LLMs,"The ability to learn from context with novel concepts, and deliver +appropriate responses are essential in human conversations. Despite current +Multimodal Large Language Models (MLLMs) and Large Language Models (LLMs) being +trained on mega-scale datasets, recognizing unseen images or understanding +novel concepts in a training-free manner remains a challenge. In-Context +Learning (ICL) explores training-free few-shot learning, where models are +encouraged to ``learn to learn"" from limited tasks and generalize to unseen +tasks. In this work, we propose link-context learning (LCL), which emphasizes +""reasoning from cause and effect"" to augment the learning capabilities of +MLLMs. LCL goes beyond traditional ICL by explicitly strengthening the causal +relationship between the support set and the query set. 
By providing +demonstrations with causal links, LCL guides the model to discern not only the +analogy but also the underlying causal associations between data points, which +empowers MLLMs to recognize unseen images and understand novel concepts more +effectively. To facilitate the evaluation of this novel approach, we introduce +the ISEKAI dataset, comprising exclusively of unseen generated image-label +pairs designed for link-context learning. Extensive experiments show that our +LCL-MLLM exhibits strong link-context learning capabilities to novel concepts +over vanilla MLLMs. Code and data will be released at +https://github.com/isekai-portal/Link-Context-Learning.",cs.CV,"['cs.CV', 'cs.CL']" +Perturbing Attention Gives You More Bang for the Buck: Subtle Imaging Perturbations That Efficiently Fool Customized Diffusion Models,Jingyao Xu · Yuetong Lu · Yandong Li · Siyang Lu · Dongdong Wang · Xiang Wei, ,https://arxiv.org/abs/2404.15081,,2404.15081.pdf,Perturbing Attention Gives You More Bang for the Buck: Subtle Imaging Perturbations That Efficiently Fool Customized Diffusion Models,"Diffusion models (DMs) embark a new era of generative modeling and offer more +opportunities for efficient generating high-quality and realistic data samples. +However, their widespread use has also brought forth new challenges in model +security, which motivates the creation of more effective adversarial attackers +on DMs to understand its vulnerability. We propose CAAT, a simple but generic +and efficient approach that does not require costly training to effectively +fool latent diffusion models (LDMs). The approach is based on the observation +that cross-attention layers exhibits higher sensitivity to gradient change, +allowing for leveraging subtle perturbations on published images to +significantly corrupt the generated images. We show that a subtle perturbation +on an image can significantly impact the cross-attention layers, thus changing +the mapping between text and image during the fine-tuning of customized +diffusion models. Extensive experiments demonstrate that CAAT is compatible +with diverse diffusion models and outperforms baseline attack methods in a more +effective (more noise) and efficient (twice as fast as Anti-DreamBooth and +Mist) manner.",cs.CV,"['cs.CV', 'cs.CR', 'cs.LG']" +Robust Depth Enhancement via Polarization Prompt Fusion Tuning,Kei IKEMURA · Yiming Huang · Felix Heide · Zhaoxiang Zhang · Qifeng Chen · Chenyang Lei,https://lastbasket.github.io/PPFT/,https://arxiv.org/abs/2404.04318,,2404.04318.pdf,Robust Depth Enhancement via Polarization Prompt Fusion Tuning,"Existing depth sensors are imperfect and may provide inaccurate depth values +in challenging scenarios, such as in the presence of transparent or reflective +objects. In this work, we present a general framework that leverages +polarization imaging to improve inaccurate depth measurements from various +depth sensors. Previous polarization-based depth enhancement methods focus on +utilizing pure physics-based formulas for a single sensor. In contrast, our +method first adopts a learning-based strategy where a neural network is trained +to estimate a dense and complete depth map from polarization data and a sensor +depth map from different sensors. To further improve the performance, we +propose a Polarization Prompt Fusion Tuning (PPFT) strategy to effectively +utilize RGB-based models pre-trained on large-scale datasets, as the size of +the polarization dataset is limited to train a strong model from scratch. 
We +conducted extensive experiments on a public dataset, and the results +demonstrate that the proposed method performs favorably compared to existing +depth enhancement baselines. Code and demos are available at +https://lastbasket.github.io/PPFT/.",cs.CV,"['cs.CV', 'cs.AI']" +Shadows Don’t Lie and Lines Can't Bend! Generative Models don't know Projective Geometry...for now,Ayush Sarkar · Hanlin Mai · Amitabh Mahapatra · David Forsyth · Svetlana Lazebnik · Anand Bhattad,https://projective-geometry.github.io,https://arxiv.org/abs/2311.17138,,2311.17138.pdf,Shadows Don't Lie and Lines Can't Bend! Generative Models don't know Projective Geometry...for now,"Generative models can produce impressively realistic images. This paper +demonstrates that generated images have geometric features different from those +of real images. We build a set of collections of generated images, prequalified +to fool simple, signal-based classifiers into believing they are real. We then +show that prequalified generated images can be identified reliably by +classifiers that only look at geometric properties. We use three such +classifiers. All three classifiers are denied access to image pixels, and look +only at derived geometric features. The first classifier looks at the +perspective field of the image, the second looks at lines detected in the +image, and the third looks at relations between detected objects and shadows. +Our procedure detects generated images more reliably than SOTA local signal +based detectors, for images from a number of distinct generators. Saliency maps +suggest that the classifiers can identify geometric problems reliably. We +conclude that current generators cannot reliably reproduce geometric properties +of real images.",cs.CV,"['cs.CV', 'cs.AI', 'cs.GR', 'cs.LG']" +GeneAvatar: Generic Expression-Aware Volumetric Head Avatar Editing from a Single Image,Chong Bao · Yinda Zhang · Yuan Li · Xiyu Zhang · Bangbang Yang · Hujun Bao · Marc Pollefeys · Guofeng Zhang · Zhaopeng Cui,https://zju3dv.github.io/geneavatar/,https://arxiv.org/abs/2404.02152,,2404.02152.pdf,GeneAvatar: Generic Expression-Aware Volumetric Head Avatar Editing from a Single Image,"Recently, we have witnessed the explosive growth of various volumetric +representations in modeling animatable head avatars. However, due to the +diversity of frameworks, there is no practical method to support high-level +applications like 3D head avatar editing across different representations. In +this paper, we propose a generic avatar editing approach that can be +universally applied to various 3DMM driving volumetric head avatars. To achieve +this goal, we design a novel expression-aware modification generative model, +which enables lift 2D editing from a single image to a consistent 3D +modification field. To ensure the effectiveness of the generative modification +process, we develop several techniques, including an expression-dependent +modification distillation scheme to draw knowledge from the large-scale head +avatar model and 2D facial texture editing tools, implicit latent space +guidance to enhance model convergence, and a segmentation-based loss reweight +strategy for fine-grained texture inversion. Extensive experiments demonstrate +that our method delivers high-quality and consistent results across multiple +expression and viewpoints. 
Project page: https://zju3dv.github.io/geneavatar/",cs.CV,['cs.CV'] +MarkovGen: Structured Prediction for Efficient Text-to-Image Generation,Sadeep Jayasumana · Daniel Glasner · Srikumar Ramalingam · Andreas Veit · Ayan Chakrabarti · Sanjiv Kumar, ,https://arxiv.org/abs/2308.10997,,2308.10997.pdf,MarkovGen: Structured Prediction for Efficient Text-to-Image Generation,"Modern text-to-image generation models produce high-quality images that are +both photorealistic and faithful to the text prompts. However, this quality +comes at significant computational cost: nearly all of these models are +iterative and require running sampling multiple times with large models. This +iterative process is needed to ensure that different regions of the image are +not only aligned with the text prompt, but also compatible with each other. In +this work, we propose a light-weight approach to achieving this compatibility +between different regions of an image, using a Markov Random Field (MRF) model. +We demonstrate the effectiveness of this method on top of the latent +token-based Muse text-to-image model. The MRF richly encodes the compatibility +among image tokens at different spatial locations to improve quality and +significantly reduce the required number of Muse sampling steps. Inference with +the MRF is significantly cheaper, and its parameters can be quickly learned +through back-propagation by modeling MRF inference as a differentiable +neural-network layer. Our full model, MarkovGen, uses this proposed MRF model +to both speed up Muse by 1.5X and produce higher quality images by decreasing +undesirable image artifacts.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +LAENeRF: Local Appearance Editing for Neural Radiance Fields,Lukas Radl · Michael Steiner · Andreas Kurz · Markus Steinberger,https://r4dl.github.io/LAENeRF/,https://arxiv.org/abs/2312.09913,,2312.09913.pdf,LAENeRF: Local Appearance Editing for Neural Radiance Fields,"Due to the omnipresence of Neural Radiance Fields (NeRFs), the interest +towards editable implicit 3D representations has surged over the last years. +However, editing implicit or hybrid representations as used for NeRFs is +difficult due to the entanglement of appearance and geometry encoded in the +model parameters. Despite these challenges, recent research has shown first +promising steps towards photorealistic and non-photorealistic appearance edits. +The main open issues of related work include limited interactivity, a lack of +support for local edits and large memory requirements, rendering them less +useful in practice. We address these limitations with LAENeRF, a unified +framework for photorealistic and non-photorealistic appearance editing of +NeRFs. To tackle local editing, we leverage a voxel grid as starting point for +region selection. We learn a mapping from expected ray terminations to final +output color, which can optionally be supervised by a style loss, resulting in +a framework which can perform photorealistic and non-photorealistic appearance +editing of selected regions. Relying on a single point per ray for our mapping, +we limit memory requirements and enable fast optimization. To guarantee +interactivity, we compose the output color using a set of learned, modifiable +base colors, composed with additive layer mixing. Compared to concurrent work, +LAENeRF enables recoloring and stylization while keeping processing time low. 
+Furthermore, we demonstrate that our approach surpasses baseline methods both +quantitatively and qualitatively.",cs.CV,['cs.CV'] +EgoGen: An Egocentric Synthetic Data Generator,Gen Li · Kaifeng Zhao · Siwei Zhang · Xiaozhong Lyu · Mihai Dusmanu · Yan Zhang · Marc Pollefeys · Siyu Tang,https://ego-gen.github.io,https://arxiv.org/abs/2401.08739,,2401.08739.pdf,EgoGen: An Egocentric Synthetic Data Generator,"Understanding the world in first-person view is fundamental in Augmented +Reality (AR). This immersive perspective brings dramatic visual changes and +unique challenges compared to third-person views. Synthetic data has empowered +third-person-view vision models, but its application to embodied egocentric +perception tasks remains largely unexplored. A critical challenge lies in +simulating natural human movements and behaviors that effectively steer the +embodied cameras to capture a faithful egocentric representation of the 3D +world. To address this challenge, we introduce EgoGen, a new synthetic data +generator that can produce accurate and rich ground-truth training data for +egocentric perception tasks. At the heart of EgoGen is a novel human motion +synthesis model that directly leverages egocentric visual inputs of a virtual +human to sense the 3D environment. Combined with collision-avoiding motion +primitives and a two-stage reinforcement learning approach, our motion +synthesis model offers a closed-loop solution where the embodied perception and +movement of the virtual human are seamlessly coupled. Compared to previous +works, our model eliminates the need for a pre-defined global path, and is +directly applicable to dynamic environments. Combined with our easy-to-use and +scalable data generation pipeline, we demonstrate EgoGen's efficacy in three +tasks: mapping and localization for head-mounted cameras, egocentric camera +tracking, and human mesh recovery from egocentric views. EgoGen will be fully +open-sourced, offering a practical solution for creating realistic egocentric +training data and aiming to serve as a useful tool for egocentric computer +vision research. Refer to our project page: https://ego-gen.github.io/.",cs.CV,"['cs.CV', 'cs.AI']" +D3T: Distinctive Dual-Domain Teacher Zigzagging Across RGB-Thermal Gap for Domain-Adaptive Object Detection,Dinh Phat Do · Taehoon Kim · JAEMIN NA · Jiwon Kim · Keonho LEE · Kyunghwan Cho · Wonjun Hwang,https://github.com/EdwardDo69/D3T,https://arxiv.org/abs/2403.09359,,2403.09359.pdf,D3T: Distinctive Dual-Domain Teacher Zigzagging Across RGB-Thermal Gap for Domain-Adaptive Object Detection,"Domain adaptation for object detection typically entails transferring +knowledge from one visible domain to another visible domain. However, there are +limited studies on adapting from the visible to the thermal domain, because the +domain gap between the visible and thermal domains is much larger than +expected, and traditional domain adaptation can not successfully facilitate +learning in this situation. To overcome this challenge, we propose a +Distinctive Dual-Domain Teacher (D3T) framework that employs distinct training +paradigms for each domain. Specifically, we segregate the source and target +training sets for building dual-teachers and successively deploy exponential +moving average to the student model to individual teachers of each domain. The +framework further incorporates a zigzag learning method between dual teachers, +facilitating a gradual transition from the visible to thermal domains during +training. 
We validate the superiority of our method through newly designed +experimental protocols with well-known thermal datasets, i.e., FLIR and KAIST. +Source code is available at https://github.com/EdwardDo69/D3T .",cs.CV,"['cs.CV', 'cs.AI']" +Bayesian Diffusion Models for 3D Shape Reconstruction,Haiyang Xu · Yu lei · Zeyuan Chen · Xiang Zhang · Yue Zhao · Yilin Wang · Zhuowen Tu, ,https://arxiv.org/abs/2403.06973,,2403.06973.pdf,Bayesian Diffusion Models for 3D Shape Reconstruction,"We present Bayesian Diffusion Models (BDM), a prediction algorithm that +performs effective Bayesian inference by tightly coupling the top-down (prior) +information with the bottom-up (data-driven) procedure via joint diffusion +processes. We show the effectiveness of BDM on the 3D shape reconstruction +task. Compared to prototypical deep learning data-driven approaches trained on +paired (supervised) data-labels (e.g. image-point clouds) datasets, our BDM +brings in rich prior information from standalone labels (e.g. point clouds) to +improve the bottom-up 3D reconstruction. As opposed to the standard Bayesian +frameworks where explicit prior and likelihood are required for the inference, +BDM performs seamless information fusion via coupled diffusion processes with +learned gradient computation networks. The specialty of our BDM lies in its +capability to engage the active and effective information exchange and fusion +of the top-down and bottom-up processes where each itself is a diffusion +process. We demonstrate state-of-the-art results on both synthetic and +real-world benchmarks for 3D shape reconstruction.",cs.CV,"['cs.CV', 'cs.LG']" +Domain Separation Graph Neural Networks for Saliency Object Ranking,Zijian Wu · Jun Lu · Jing Han · Lianfa Bai · Yi Zhang · Zhuang Zhao · Siyang Song, ,,https://www.nature.com/articles/s41598-024-61105-3,,,,,nan +DL3DV-10K: A Large-Scale Scene Dataset for Deep Learning-based 3D Vision,Lu Ling · Yichen Sheng · Zhi Tu · Wentian Zhao · Cheng Xin · Kun Wan · Lantao Yu · Qianyu Guo · Zixun Yu · Yawen Lu · Xuanmao Li · Xingpeng Sun · Rohan Ashok · Aniruddha Mukherjee · Hao Kang · Xiangrui Kong · Gang Hua · Tianyi Zhang · Bedrich Benes · Aniket Bera, ,https://arxiv.org/abs/2312.16256,,2312.16256.pdf,DL3DV-10K: A Large-Scale Scene Dataset for Deep Learning-based 3D Vision,"We have witnessed significant progress in deep learning-based 3D vision, +ranging from neural radiance field (NeRF) based 3D representation learning to +applications in novel view synthesis (NVS). However, existing scene-level +datasets for deep learning-based 3D vision, limited to either synthetic +environments or a narrow selection of real-world scenes, are quite +insufficient. This insufficiency not only hinders a comprehensive benchmark of +existing methods but also caps what could be explored in deep learning-based 3D +analysis. To address this critical gap, we present DL3DV-10K, a large-scale +scene dataset, featuring 51.2 million frames from 10,510 videos captured from +65 types of point-of-interest (POI) locations, covering both bounded and +unbounded scenes, with different levels of reflection, transparency, and +lighting. We conducted a comprehensive benchmark of recent NVS methods on +DL3DV-10K, which revealed valuable insights for future research in NVS. 
In +addition, we have obtained encouraging results in a pilot study to learn +generalizable NeRF from DL3DV-10K, which manifests the necessity of a +large-scale scene-level dataset to forge a path toward a foundation model for +learning 3D representation. Our DL3DV-10K dataset, benchmark results, and +models will be publicly accessible at https://dl3dv-10k.github.io/DL3DV-10K/.",cs.CV,"['cs.CV', 'cs.AI']" +Fitting Flats to Flats,Gabriel Dogadov · Ugo Finnendahl · Marc Alexa, ,,https://github.com/gdogadov,,,,,nan +MultiPly: Reconstruction of Multiple People from Monocular Video in the Wild,Zeren Jiang · Chen Guo · Manuel Kaufmann · Tianjian Jiang · Julien Valentin · Otmar Hilliges · Jie Song, ,,https://dl.acm.org/doi/10.1145/3581783.3611978,,,,,nan +Driving into the Future: Multiview Visual Forecasting and Planning with World Model for Autonomous Driving,Yuqi Wang · Jiawei He · Lue Fan · Hongxin Li · Yuntao Chen · Zhaoxiang Zhang,https://drive-wm.github.io,https://arxiv.org/abs/2311.17918,,2311.17918.pdf,Driving into the Future: Multiview Visual Forecasting and Planning with World Model for Autonomous Driving,"In autonomous driving, predicting future events in advance and evaluating the +foreseeable risks empowers autonomous vehicles to better plan their actions, +enhancing safety and efficiency on the road. To this end, we propose Drive-WM, +the first driving world model compatible with existing end-to-end planning +models. Through a joint spatial-temporal modeling facilitated by view +factorization, our model generates high-fidelity multiview videos in driving +scenes. Building on its powerful generation ability, we showcase the potential +of applying the world model for safe driving planning for the first time. +Particularly, our Drive-WM enables driving into multiple futures based on +distinct driving maneuvers, and determines the optimal trajectory according to +the image-based rewards. Evaluation on real-world driving datasets verifies +that our method could generate high-quality, consistent, and controllable +multiview videos, opening up possibilities for real-world simulations and safe +planning.",cs.CV,['cs.CV'] +DriveWorld: 4D Pre-trained Scene Understanding via World Models for Autonomous Driving,Chen Min · Dawei Zhao · Liang Xiao · Jian Zhao · Xinli Xu · Zheng Zhu · Lei Jin · Jianshu Li · Yulan Guo · Junliang Xing · Liping Jing · Yiming Nie · Bin Dai, ,,https://paperswithcode.com/paper/driveworld-4d-pre-trained-scene-understanding,,,,,nan +ZONE: Zero-Shot Instruction-Guided Local Editing,Shanglin Li · Bohan Zeng · Yutang Feng · Sicheng Gao · Xuhui Liu · Jiaming Liu · Li Lin · Xu Tang · Yao Hu · Jianzhuang Liu · Baochang Zhang, ,https://arxiv.org/abs/2312.16794,,2312.16794.pdf,ZONE: Zero-Shot Instruction-Guided Local Editing,"Recent advances in vision-language models like Stable Diffusion have shown +remarkable power in creative image synthesis and editing.However, most existing +text-to-image editing methods encounter two obstacles: First, the text prompt +needs to be carefully crafted to achieve good results, which is not intuitive +or user-friendly. Second, they are insensitive to local edits and can +irreversibly affect non-edited regions, leaving obvious editing traces. To +tackle these problems, we propose a Zero-shot instructiON-guided local image +Editing approach, termed ZONE. We first convert the editing intent from the +user-provided instruction (e.g., ""make his tie blue"") into specific image +editing regions through InstructPix2Pix. 
We then propose a Region-IoU scheme +for precise image layer extraction from an off-the-shelf segment model. We +further develop an edge smoother based on FFT for seamless blending between the +layer and the image.Our method allows for arbitrary manipulation of a specific +region with a single instruction while preserving the rest. Extensive +experiments demonstrate that our ZONE achieves remarkable local editing results +and user-friendliness, outperforming state-of-the-art methods. Code is +available at https://github.com/lsl001006/ZONE.",cs.CV,['cs.CV'] +DiffMorpher: Unleashing the Capability of Diffusion Models for Image Morphing,Kaiwen Zhang · Yifan Zhou · Xudong XU · Bo Dai · Xingang Pan,https://kevin-thu.github.io/DiffMorpher_page,https://arxiv.org/abs/2312.07409,,2312.07409.pdf,DiffMorpher: Unleashing the Capability of Diffusion Models for Image Morphing,"Diffusion models have achieved remarkable image generation quality surpassing +previous generative models. However, a notable limitation of diffusion models, +in comparison to GANs, is their difficulty in smoothly interpolating between +two image samples, due to their highly unstructured latent space. Such a smooth +interpolation is intriguing as it naturally serves as a solution for the image +morphing task with many applications. In this work, we present DiffMorpher, the +first approach enabling smooth and natural image interpolation using diffusion +models. Our key idea is to capture the semantics of the two images by fitting +two LoRAs to them respectively, and interpolate between both the LoRA +parameters and the latent noises to ensure a smooth semantic transition, where +correspondence automatically emerges without the need for annotation. In +addition, we propose an attention interpolation and injection technique and a +new sampling schedule to further enhance the smoothness between consecutive +images. Extensive experiments demonstrate that DiffMorpher achieves starkly +better image morphing effects than previous methods across a variety of object +categories, bridging a critical functional gap that distinguished diffusion +models from GANs.",cs.CV,['cs.CV'] +InstructDiffusion: A Generalist Modeling Interface for Vision Tasks,Zigang Geng · Binxin Yang · Tiankai Hang · Chen Li · Shuyang Gu · Ting Zhang · Jianmin Bao · Zheng Zhang · Houqiang Li · Han Hu · Dong Chen · Baining Guo, ,https://arxiv.org/abs/2309.03895,,2309.03895.pdf,InstructDiffusion: A Generalist Modeling Interface for Vision Tasks,"We present InstructDiffusion, a unifying and generic framework for aligning +computer vision tasks with human instructions. Unlike existing approaches that +integrate prior knowledge and pre-define the output space (e.g., categories and +coordinates) for each vision task, we cast diverse vision tasks into a +human-intuitive image-manipulating process whose output space is a flexible and +interactive pixel space. Concretely, the model is built upon the diffusion +process and is trained to predict pixels according to user instructions, such +as encircling the man's left shoulder in red or applying a blue mask to the +left car. InstructDiffusion could handle a variety of vision tasks, including +understanding tasks (such as segmentation and keypoint detection) and +generative tasks (such as editing and enhancement). It even exhibits the +ability to handle unseen tasks and outperforms prior methods on novel datasets. 
+This represents a significant step towards a generalist modeling interface for +vision tasks, advancing artificial general intelligence in the field of +computer vision.",cs.CV,['cs.CV'] +Loose Inertial Poser: Motion Capture with IMU-attached Loose-Wear Jacket,Chengxu Zuo · Yiming Wang · Lishuang Zhan · Shihui Guo · Xinyu Yi · Feng Xu · Yipeng Qin, ,https://arxiv.org/abs/2308.16682,,2308.16682.pdf,DiffusionPoser: Real-time Human Motion Reconstruction From Arbitrary Sparse Sensors Using Autoregressive Diffusion,"Motion capture from a limited number of body-worn sensors, such as inertial +measurement units (IMUs) and pressure insoles, has important applications in +health, human performance, and entertainment. Recent work has focused on +accurately reconstructing whole-body motion from a specific sensor +configuration using six IMUs. While a common goal across applications is to use +the minimal number of sensors to achieve required accuracy, the optimal +arrangement of the sensors might differ from application to application. We +propose a single diffusion model, DiffusionPoser, which reconstructs human +motion in real-time from an arbitrary combination of sensors, including IMUs +placed at specified locations, and, pressure insoles. Unlike existing methods, +our model grants users the flexibility to determine the number and arrangement +of sensors tailored to the specific activity of interest, without the need for +retraining. A novel autoregressive inferencing scheme ensures real-time motion +reconstruction that closely aligns with measured sensor signals. The generative +nature of DiffusionPoser ensures realistic behavior, even for +degrees-of-freedom not directly measured. Qualitative results can be found on +our website: https://diffusionposer.github.io/.",cs.CV,['cs.CV'] +Morphological Prototyping for Unsupervised Slide Representation Learning in Computational Pathology,Andrew Song · Richard J. Chen · Tong Ding · Drew F. K. Williamson · Guillaume Jaume · Faisal Mahmood, ,https://arxiv.org/abs/2405.11643,,2405.11643.pdf,Morphological Prototyping for Unsupervised Slide Representation Learning in Computational Pathology,"Representation learning of pathology whole-slide images (WSIs) has been has +primarily relied on weak supervision with Multiple Instance Learning (MIL). +However, the slide representations resulting from this approach are highly +tailored to specific clinical tasks, which limits their expressivity and +generalization, particularly in scenarios with limited data. Instead, we +hypothesize that morphological redundancy in tissue can be leveraged to build a +task-agnostic slide representation in an unsupervised fashion. To this end, we +introduce PANTHER, a prototype-based approach rooted in the Gaussian mixture +model that summarizes the set of WSI patches into a much smaller set of +morphological prototypes. Specifically, each patch is assumed to have been +generated from a mixture distribution, where each mixture component represents +a morphological exemplar. Utilizing the estimated mixture parameters, we then +construct a compact slide representation that can be readily used for a wide +range of downstream tasks. 
By performing an extensive evaluation of PANTHER on +subtyping and survival tasks using 13 datasets, we show that 1) PANTHER +outperforms or is on par with supervised MIL baselines and 2) the analysis of +morphological prototypes brings new qualitative and quantitative insights into +model interpretability.",cs.CV,"['cs.CV', 'cs.LG', 'stat.AP']" +FairRAG: Fair Human Generation via Fair Retrieval Augmentation,Robik Shrestha · Yang Zou · Qiuyu Chen · Zhiheng Li · Yusheng Xie · Siqi Deng, ,https://arxiv.org/abs/2403.19964,,2403.19964.pdf,FairRAG: Fair Human Generation via Fair Retrieval Augmentation,"Existing text-to-image generative models reflect or even amplify societal +biases ingrained in their training data. This is especially concerning for +human image generation where models are biased against certain demographic +groups. Existing attempts to rectify this issue are hindered by the inherent +limitations of the pre-trained models and fail to substantially improve +demographic diversity. In this work, we introduce Fair Retrieval Augmented +Generation (FairRAG), a novel framework that conditions pre-trained generative +models on reference images retrieved from an external image database to improve +fairness in human generation. FairRAG enables conditioning through a +lightweight linear module that projects reference images into the textual +space. To enhance fairness, FairRAG applies simple-yet-effective debiasing +strategies, providing images from diverse demographic groups during the +generative process. Extensive experiments demonstrate that FairRAG outperforms +existing methods in terms of demographic diversity, image-text alignment, and +image fidelity while incurring minimal computational overhead during inference.",cs.CV,"['cs.CV', 'cs.CY', 'cs.LG']" +Modeling Dense Multimodal Interactions Between Biological Pathways and Histology for Survival Prediction,Guillaume Jaume · Anurag Vaidya · Richard J. Chen · Drew F. K. Williamson · Paul Pu Liang · Faisal Mahmood, ,https://arxiv.org/abs/2404.08027,,2404.08027.pdf,SurvMamba: State Space Model with Multi-grained Multi-modal Interaction for Survival Prediction,"Multi-modal learning that combines pathological images with genomic data has +significantly enhanced the accuracy of survival prediction. Nevertheless, +existing methods have not fully utilized the inherent hierarchical structure +within both whole slide images (WSIs) and transcriptomic data, from which +better intra-modal representations and inter-modal integration could be +derived. Moreover, many existing studies attempt to improve multi-modal +representations through attention mechanisms, which inevitably lead to high +complexity when processing high-dimensional WSIs and transcriptomic data. +Recently, a structured state space model named Mamba emerged as a promising +approach for its superior performance in modeling long sequences with low +complexity. In this study, we propose Mamba with multi-grained multi-modal +interaction (SurvMamba) for survival prediction. SurvMamba is implemented with +a Hierarchical Interaction Mamba (HIM) module that facilitates efficient +intra-modal interactions at different granularities, thereby capturing more +detailed local features as well as rich global representations. In addition, an +Interaction Fusion Mamba (IFM) module is used for cascaded inter-modal +interactive fusion, yielding more comprehensive features for survival +prediction. 
Comprehensive evaluations on five TCGA datasets demonstrate that +SurvMamba outperforms other existing methods in terms of performance and +computational cost.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG', 'q-bio.QM']" +End-to-End Spatio-Temporal Action Localisation with Video Transformers,Alexey Gritsenko · Xuehan Xiong · Josip Djolonga · Mostafa Dehghani · Chen Sun · Mario Lučić · Cordelia Schmid · Anurag Arnab, ,,https://openreview.net/forum?id=Va4t6R8cGG,,,,,nan +MCNet: Rethinking the Core Ingredients for Accurate and Efficient Homography Estimation,Haokai Zhu · Si-Yuan Cao · Jianxin Hu · Sitong Zuo · Beinan Yu · Jiacheng Ying · Junwei Li · Hui-Liang Shen,https://github.com/zjuzhk/MCNet,,https://www.youtube.com/watch?v=mcRa7BsZrOE,,,,,nan +SODA: Bottleneck Diffusion Models for Representation Learning,Drew Hudson · Daniel Zoran · Mateusz Malinowski · Andrew Lampinen · Andrew Jaegle · James McClelland · Loic Matthey · Felix Hill · Alexander Lerchner, ,https://arxiv.org/abs/2311.17901,,2311.17901.pdf,SODA: Bottleneck Diffusion Models for Representation Learning,"We introduce SODA, a self-supervised diffusion model, designed for +representation learning. The model incorporates an image encoder, which +distills a source view into a compact representation, that, in turn, guides the +generation of related novel views. We show that by imposing a tight bottleneck +between the encoder and a denoising decoder, and leveraging novel view +synthesis as a self-supervised objective, we can turn diffusion models into +strong representation learners, capable of capturing visual semantics in an +unsupervised manner. To the best of our knowledge, SODA is the first diffusion +model to succeed at ImageNet linear-probe classification, and, at the same +time, it accomplishes reconstruction, editing and synthesis tasks across a wide +range of datasets. Further investigation reveals the disentangled nature of its +emergent latent space, that serves as an effective interface to control and +manipulate the model's produced images. All in all, we aim to shed light on the +exciting and promising potential of diffusion models, not only for image +generation, but also for learning rich and robust representations.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +EasyDrag: Efficient Point-based Manipulation on Diffusion Models,Xingzhong Hou · Boxiao Liu · Yi Zhang · Jihao Liu · Yu Liu · Haihang You, ,,https://github.com/Yujun-Shi/DragDiffusion,,,,,nan +Segment and Caption Anything,Xiaoke Huang · Jianfeng Wang · Yansong Tang · Zheng Zhang · Han Hu · Jiwen Lu · Lijuan Wang · Zicheng Liu,https://xk-huang.github.io/segment-caption-anything/,https://arxiv.org/abs/2312.00869,,2312.00869.pdf,Segment and Caption Anything,"We propose a method to efficiently equip the Segment Anything Model (SAM) +with the ability to generate regional captions. SAM presents strong +generalizability to segment anything while is short for semantic understanding. +By introducing a lightweight query-based feature mixer, we align the +region-specific features with the embedding space of language models for later +caption generation. As the number of trainable parameters is small (typically +in the order of tens of millions), it costs less computation, less memory +usage, and less communication bandwidth, resulting in both fast and scalable +training. To address the scarcity problem of regional caption data, we propose +to first pre-train our model on objection detection and segmentation tasks. 
We +call this step weak supervision pretraining since the pre-training data only +contains category names instead of full-sentence descriptions. The weak +supervision pretraining allows us to leverage many publicly available object +detection and segmentation datasets. We conduct extensive experiments to +demonstrate the superiority of our method and validate each design choice. This +work serves as a stepping stone towards scaling up regional captioning data and +sheds light on exploring efficient ways to augment SAM with regional semantics. +The project page, along with the associated code, can be accessed via +https://xk-huang.github.io/segment-caption-anything/.",cs.CV,['cs.CV'] +6D-Diff: A Keypoint Diffusion Framework for 6D Object Pose Estimation,Li Xu · Haoxuan Qu · Yujun Cai · Jun Liu, ,https://arxiv.org/abs/2401.00029,,2401.00029.pdf,6D-Diff: A Keypoint Diffusion Framework for 6D Object Pose Estimation,"Estimating the 6D object pose from a single RGB image often involves noise +and indeterminacy due to challenges such as occlusions and cluttered +backgrounds. Meanwhile, diffusion models have shown appealing performance in +generating high-quality images from random noise with high indeterminacy +through step-by-step denoising. Inspired by their denoising capability, we +propose a novel diffusion-based framework (6D-Diff) to handle the noise and +indeterminacy in object pose estimation for better performance. In our +framework, to establish accurate 2D-3D correspondence, we formulate 2D +keypoints detection as a reverse diffusion (denoising) process. To facilitate +such a denoising process, we design a Mixture-of-Cauchy-based forward diffusion +process and condition the reverse process on the object features. Extensive +experiments on the LM-O and YCB-V datasets demonstrate the effectiveness of our +framework.",cs.CV,['cs.CV'] +UnScene3D: Unsupervised 3D Instance Segmentation for Indoor Scenes,David Rozenberszki · Or Litany · Angela Dai,https://rozdavid.github.io/unscene3d,https://ar5iv.labs.arxiv.org/html/2312.11557,,2312.11557.pdf,SAI3D: Segment Any Instance in 3D Scenes,"Advancements in 3D instance segmentation have traditionally been tethered to +the availability of annotated datasets, limiting their application to a narrow +spectrum of object categories. Recent efforts have sought to harness +vision-language models like CLIP for open-set semantic reasoning, yet these +methods struggle to distinguish between objects of the same categories and rely +on specific prompts that are not universally applicable. In this paper, we +introduce SAI3D, a novel zero-shot 3D instance segmentation approach that +synergistically leverages geometric priors and semantic cues derived from +Segment Anything Model (SAM). Our method partitions a 3D scene into geometric +primitives, which are then progressively merged into 3D instance segmentations +that are consistent with the multi-view SAM masks. Moreover, we design a +hierarchical region-growing algorithm with a dynamic thresholding mechanism, +which largely improves the robustness of finegrained 3D scene parsing.Empirical +evaluations on ScanNet, Matterport3D and the more challenging ScanNet++ +datasets demonstrate the superiority of our approach. Notably, SAI3D +outperforms existing open-vocabulary baselines and even surpasses +fully-supervised methods in class-agnostic segmentation on ScanNet++. 
Our +project page is at https://yd-yin.github.io/SAI3D.",cs.CV,['cs.CV'] +Exploring Regional Clues in CLIP for Zero-Shot Semantic Segmentation,Yi Zhang · Meng-Hao Guo · Miao Wang · Shi-Min Hu, ,https://arxiv.org/abs/2403.08426,,2403.08426.pdf,Language-Driven Visual Consensus for Zero-Shot Semantic Segmentation,"The pre-trained vision-language model, exemplified by CLIP, advances +zero-shot semantic segmentation by aligning visual features with class +embeddings through a transformer decoder to generate semantic masks. Despite +its effectiveness, prevailing methods within this paradigm encounter +challenges, including overfitting on seen classes and small fragmentation in +masks. To mitigate these issues, we propose a Language-Driven Visual Consensus +(LDVC) approach, fostering improved alignment of semantic and visual +information.Specifically, we leverage class embeddings as anchors due to their +discrete and abstract nature, steering vision features toward class embeddings. +Moreover, to circumvent noisy alignments from the vision part due to its +redundant nature, we introduce route attention into self-attention for finding +visual consensus, thereby enhancing semantic consistency within the same +object. Equipped with a vision-language prompting strategy, our approach +significantly boosts the generalization capacity of segmentation models for +unseen classes. Experimental results underscore the effectiveness of our +approach, showcasing mIoU gains of 4.5 on the PASCAL VOC 2012 and 3.6 on the +COCO-Stuff 164k for unseen classes compared with the state-of-the-art methods.",cs.CV,"['cs.CV', 'cs.AI']" +Selective nonlinearities removal from digital signals,Krzysztof Maliszewski · Magdalena Urbanska · Varvara Vetrova · Sylwia Kolenderska, ,https://arxiv.org/abs/2403.09731,,2403.09731.pdf,Selective nonlinearities removal from digital signals,"Many instruments performing optical and non-optical imaging and sensing, such +as Optical Coherence Tomography (OCT), Magnetic Resonance Imaging or +Fourier-transform spectrometry, produce digital signals containing modulations, +sine-like components, which only after Fourier transformation give information +about the structure or characteristics of the investigated object. Due to the +fundamental physics-related limitations of such methods, the distribution of +these signal components is often nonlinear and, when not properly compensated, +leads to the resolution, precision or quality drop in the final image. Here, we +propose an innovative approach that has the potential to allow cleaning of the +signal from the nonlinearities but most of all, it now allows to switch the +given order off, leaving all others intact. The latter provides a tool for more +in-depth analysis of the nonlinearity-inducing properties of the investigated +object, which can lead to applications in early disease detection or more +sensitive sensing of chemical compounds. We consider OCT signals and +nonlinearities up to the third order. In our approach, we propose two neural +networks: one to remove solely the second-order nonlinearity and the other for +removing solely the third-order nonlinearity. The input of the networks is a +novel two-dimensional data structure with all the information needed for the +network to infer a nonlinearity-free signal. 
We describe the developed networks +and present the results for second-order and third-order nonlinearity removal +in OCT data representing the images of various objects: a mirror, glass, and +fruits.",eess.IV,"['eess.IV', 'physics.data-an', 'physics.optics']" +Efficient Model Stealing Defense with Noise Transition Matrix,Dong-Dong Wu · Chilin Fu · Weichang Wu · Wenwen Xia · Xiaolu Zhang · JUN ZHOU · Min-Ling Zhang, ,https://arxiv.org/abs/2309.01838,,2309.01838.pdf,Efficient Defense Against Model Stealing Attacks on Convolutional Neural Networks,"Model stealing attacks have become a serious concern for deep learning +models, where an attacker can steal a trained model by querying its black-box +API. This can lead to intellectual property theft and other security and +privacy risks. The current state-of-the-art defenses against model stealing +attacks suggest adding perturbations to the prediction probabilities. However, +they suffer from heavy computations and make impracticable assumptions about +the adversary. They often require the training of auxiliary models. This can be +time-consuming and resource-intensive which hinders the deployment of these +defenses in real-world applications. In this paper, we propose a simple yet +effective and efficient defense alternative. We introduce a heuristic approach +to perturb the output probabilities. The proposed defense can be easily +integrated into models without additional training. We show that our defense is +effective in defending against three state-of-the-art stealing attacks. We +evaluate our approach on large and quantized (i.e., compressed) Convolutional +Neural Networks (CNNs) trained on several vision datasets. Our technique +outperforms the state-of-the-art defenses with a $\times37$ faster inference +latency without requiring any additional model and with a low impact on the +model's performance. We validate that our defense is also effective for +quantized CNNs targeting edge devices.",cs.LG,"['cs.LG', 'cs.AI', 'cs.CR']" +Unsupervised Universal Image Segmentation,XuDong Wang · Dantong Niu · Xinyang Han · Long Lian · Roei Herzig · Trevor Darrell, ,https://arxiv.org/abs/2312.17243,,2312.17243.pdf,Unsupervised Universal Image Segmentation,"Several unsupervised image segmentation approaches have been proposed which +eliminate the need for dense manually-annotated segmentation masks; current +models separately handle either semantic segmentation (e.g., STEGO) or +class-agnostic instance segmentation (e.g., CutLER), but not both (i.e., +panoptic segmentation). We propose an Unsupervised Universal Segmentation model +(U2Seg) adept at performing various image segmentation tasks -- instance, +semantic and panoptic -- using a novel unified framework. U2Seg generates +pseudo semantic labels for these segmentation tasks via leveraging +self-supervised models followed by clustering; each cluster represents +different semantic and/or instance membership of pixels. We then self-train the +model on these pseudo semantic labels, yielding substantial performance gains +over specialized methods tailored to each task: a +2.6 AP$^{\text{box}}$ boost +vs. CutLER in unsupervised instance segmentation on COCO and a +7.0 PixelAcc +increase (vs. STEGO) in unsupervised semantic segmentation on COCOStuff. +Moreover, our method sets up a new baseline for unsupervised panoptic +segmentation, which has not been previously explored. 
U2Seg is also a strong +pretrained model for few-shot segmentation, surpassing CutLER by +5.0 +AP$^{\text{mask}}$ when trained on a low-data regime, e.g., only 1% COCO +labels. We hope our simple yet effective method can inspire more research on +unsupervised universal image segmentation.",cs.CV,['cs.CV'] +HumanNorm: Learning Normal Diffusion Model for High-quality and Realistic 3D Human Generation,Xin Huang · Ruizhi Shao · Qi Zhang · Hongwen Zhang · Ying Feng · Yebin Liu · Qing Wang,https://humannorm.github.io,https://arxiv.org/abs/2310.01406,,2310.01406.pdf,HumanNorm: Learning Normal Diffusion Model for High-quality and Realistic 3D Human Generation,"Recent text-to-3D methods employing diffusion models have made significant +advancements in 3D human generation. However, these approaches face challenges +due to the limitations of text-to-image diffusion models, which lack an +understanding of 3D structures. Consequently, these methods struggle to achieve +high-quality human generation, resulting in smooth geometry and cartoon-like +appearances. In this paper, we propose HumanNorm, a novel approach for +high-quality and realistic 3D human generation. The main idea is to enhance the +model's 2D perception of 3D geometry by learning a normal-adapted diffusion +model and a normal-aligned diffusion model. The normal-adapted diffusion model +can generate high-fidelity normal maps corresponding to user prompts with +view-dependent and body-aware text. The normal-aligned diffusion model learns +to generate color images aligned with the normal maps, thereby transforming +physical geometry details into realistic appearance. Leveraging the proposed +normal diffusion model, we devise a progressive geometry generation strategy +and a multi-step Score Distillation Sampling (SDS) loss to enhance the +performance of 3D human generation. Comprehensive experiments substantiate +HumanNorm's ability to generate 3D humans with intricate geometry and realistic +appearances. HumanNorm outperforms existing text-to-3D methods in both geometry +and texture quality. The project page of HumanNorm is +https://humannorm.github.io/.",cs.CV,['cs.CV'] +SCE-MAE: Selective Correspondence Enhancement with Masked Autoencoder for Self-Supervised Landmark Estimation,Kejia Yin · Varshanth Rao · Ruowei Jiang · Xudong Liu · Parham Aarabi · David B. Lindell, ,https://arxiv.org/abs/2405.18322,,2405.18322.pdf,SCE-MAE: Selective Correspondence Enhancement with Masked Autoencoder for Self-Supervised Landmark Estimation,"Self-supervised landmark estimation is a challenging task that demands the +formation of locally distinct feature representations to identify sparse facial +landmarks in the absence of annotated data. To tackle this task, existing +state-of-the-art (SOTA) methods (1) extract coarse features from backbones that +are trained with instance-level self-supervised learning (SSL) paradigms, which +neglect the dense prediction nature of the task, (2) aggregate them into +memory-intensive hypercolumn formations, and (3) supervise lightweight +projector networks to naively establish full local correspondences among all +pairs of spatial features. 
In this paper, we introduce SCE-MAE, a framework +that (1) leverages the MAE, a region-level SSL method that naturally better +suits the landmark prediction task, (2) operates on the vanilla feature map +instead of on expensive hypercolumns, and (3) employs a Correspondence +Approximation and Refinement Block (CARB) that utilizes a simple density peak +clustering algorithm and our proposed Locality-Constrained Repellence Loss to +directly hone only select local correspondences. We demonstrate through +extensive experiments that SCE-MAE is highly effective and robust, +outperforming existing SOTA methods by large margins of approximately 20%-44% +on the landmark matching and approximately 9%-15% on the landmark detection +tasks.",cs.CV,"['cs.CV', 'cs.AI']" +Beyond Textual Constraints: Learning Novel Diffusion Conditions with Fewer Examples,Yuyang Yu · Bangzhen Liu · Chenxi Zheng · Xuemiao Xu · Huaidong Zhang · Shengfeng He,https://github.com/Yuyan9Yu/BeyondTextConstraint,https://arxiv.org/abs/2307.16424,,2307.16424.pdf,MetaDiff: Meta-Learning with Conditional Diffusion for Few-Shot Learning,"Equipping a deep model the abaility of few-shot learning, i.e., learning +quickly from only few examples, is a core challenge for artificial +intelligence. Gradient-based meta-learning approaches effectively address the +challenge by learning how to learn novel tasks. Its key idea is learning a deep +model in a bi-level optimization manner, where the outer-loop process learns a +shared gradient descent algorithm (i.e., its hyperparameters), while the +inner-loop process leverage it to optimize a task-specific model by using only +few labeled data. Although these existing methods have shown superior +performance, the outer-loop process requires calculating second-order +derivatives along the inner optimization path, which imposes considerable +memory burdens and the risk of vanishing gradients. Drawing inspiration from +recent progress of diffusion models, we find that the inner-loop gradient +descent process can be actually viewed as a reverse process (i.e., denoising) +of diffusion where the target of denoising is model weights but the origin +data. Based on this fact, in this paper, we propose to model the gradient +descent optimizer as a diffusion model and then present a novel +task-conditional diffusion-based meta-learning, called MetaDiff, that +effectively models the optimization process of model weights from Gaussion +noises to target weights in a denoising manner. Thanks to the training +efficiency of diffusion models, our MetaDiff do not need to differentiate +through the inner-loop path such that the memory burdens and the risk of +vanishing gradients can be effectvely alleviated. Experiment results show that +our MetaDiff outperforms the state-of-the-art gradient-based meta-learning +family in few-shot learning tasks.",cs.LG,['cs.LG'] +Joint2Human: High-quality 3D Human Generation via Compact Spherical Embedding of 3D Joints,Muxin Zhang · Qiao Feng · Zhuo Su · Chao Wen · Zhou Xue · Kun Li, ,https://arxiv.org/abs/2312.08591,,2312.08591.pdf,Joint2Human: High-quality 3D Human Generation via Compact Spherical Embedding of 3D Joints,"3D human generation is increasingly significant in various applications. +However, the direct use of 2D generative methods in 3D generation often results +in losing local details, while methods that reconstruct geometry from generated +images struggle with global view consistency. 
In this work, we introduce +Joint2Human, a novel method that leverages 2D diffusion models to generate +detailed 3D human geometry directly, ensuring both global structure and local +details. To achieve this, we employ the Fourier occupancy field (FOF) +representation, enabling the direct generation of 3D shapes as preliminary +results with 2D generative models. With the proposed high-frequency enhancer +and the multi-view recarving strategy, our method can seamlessly integrate the +details from different views into a uniform global shape. To better utilize the +3D human prior and enhance control over the generated geometry, we introduce a +compact spherical embedding of 3D joints. This allows for an effective guidance +of pose during the generation process. Additionally, our method can generate 3D +humans guided by textual inputs. Our experimental results demonstrate the +capability of our method to ensure global structure, local details, high +resolution, and low computational cost simultaneously. More results and the +code can be found on our project page at +http://cic.tju.edu.cn/faculty/likun/projects/Joint2Human.",cs.CV,['cs.CV'] +Rethinking Generalizable Face Anti-spoofing via Hierarchical Prototype-guided Distribution Refinement in Hyperbolic Space,Chengyang Hu · Ke-Yue Zhang · Taiping Yao · Shouhong Ding · Lizhuang Ma, ,https://arxiv.org/abs/2308.09107,,2308.09107.pdf,Hyperbolic Face Anti-Spoofing,"Learning generalized face anti-spoofing (FAS) models against presentation +attacks is essential for the security of face recognition systems. Previous FAS +methods usually encourage models to extract discriminative features, of which +the distances within the same class (bonafide or attack) are pushed close while +those between bonafide and attack are pulled away. However, these methods are +designed based on Euclidean distance, which lacks generalization ability for +unseen attack detection due to poor hierarchy embedding ability. According to +the evidence that different spoofing attacks are intrinsically hierarchical, we +propose to learn richer hierarchical and discriminative spoofing cues in +hyperbolic space. Specifically, for unimodal FAS learning, the feature +embeddings are projected into the Poincar\'e ball, and then the hyperbolic +binary logistic regression layer is cascaded for classification. To further +improve generalization, we conduct hyperbolic contrastive learning for the +bonafide only while relaxing the constraints on diverse spoofing attacks. To +alleviate the vanishing gradient problem in hyperbolic space, a new feature +clipping method is proposed to enhance the training stability of hyperbolic +models. Besides, we further design a multimodal FAS framework with Euclidean +multimodal feature decomposition and hyperbolic multimodal feature fusion & +classification. Extensive experiments on three benchmark datasets (i.e., WMCA, +PADISI-Face, and SiW-M) with diverse attack types demonstrate that the proposed +method can bring significant improvement compared to the Euclidean baselines on +unseen attack detection. 
In addition, the proposed framework is also +generalized well on four benchmark datasets (i.e., MSU-MFSD, IDIAP +REPLAY-ATTACK, CASIA-FASD, and OULU-NPU) with a limited number of attack types.",cs.CV,['cs.CV'] +NARUTO: Neural Active Reconstruction from Uncertain Target Observations,Ziyue Feng · Huangying Zhan · Zheng Chen · Qingan Yan · Xiangyu Xu · Changjiang Cai · Bing Li · Qilun Zhu · Yi Xu,https://oppo-us-research.github.io/NARUTO-website/,https://arxiv.org/abs/2402.18771v2,,2402.18771v2.pdf,NARUTO: Neural Active Reconstruction from Uncertain Target Observations,"We present NARUTO, a neural active reconstruction system that combines a +hybrid neural representation with uncertainty learning, enabling high-fidelity +surface reconstruction. Our approach leverages a multi-resolution hash-grid as +the mapping backbone, chosen for its exceptional convergence speed and capacity +to capture high-frequency local features.The centerpiece of our work is the +incorporation of an uncertainty learning module that dynamically quantifies +reconstruction uncertainty while actively reconstructing the environment. By +harnessing learned uncertainty, we propose a novel uncertainty aggregation +strategy for goal searching and efficient path planning. Our system +autonomously explores by targeting uncertain observations and reconstructs +environments with remarkable completeness and fidelity. We also demonstrate the +utility of this uncertainty-aware approach by enhancing SOTA neural SLAM +systems through an active ray sampling strategy. Extensive evaluations of +NARUTO in various environments, using an indoor scene simulator, confirm its +superior performance and state-of-the-art status in active reconstruction, as +evidenced by its impressive results on benchmark datasets like Replica and +MP3D.",cs.CV,"['cs.CV', 'cs.RO']" +CroSel: Cross Selection of Confident Pseudo Labels for Partial-Label Learning,Shiyu Tian · Hongxin Wei · Yiqun Wang · Lei Feng, ,,https://dblp.org/rec/journals/corr/abs-2303-10365,,,,,nan +Generative Proxemics: A Prior for 3D Social Interaction from Images,Vickie Ye · Vickie Ye · Georgios Pavlakos · Michael J. Black · Angjoo Kanazawa,https://muelea.github.io/buddi/,https://arxiv.org/abs/2306.09337,,2306.09337.pdf,Generative Proxemics: A Prior for 3D Social Interaction from Images,"Social interaction is a fundamental aspect of human behavior and +communication. The way individuals position themselves in relation to others, +also known as proxemics, conveys social cues and affects the dynamics of social +interaction. Reconstructing such interaction from images presents challenges +because of mutual occlusion and the limited availability of large training +datasets. To address this, we present a novel approach that learns a prior over +the 3D proxemics two people in close social interaction and demonstrate its use +for single-view 3D reconstruction. We start by creating 3D training data of +interacting people using image datasets with contact annotations. We then model +the proxemics using a novel denoising diffusion model called BUDDI that learns +the joint distribution over the poses of two people in close social +interaction. Sampling from our generative proxemics model produces realistic 3D +human interactions, which we validate through a perceptual study. We use BUDDI +in reconstructing two people in close proximity from a single image without any +contact annotation via an optimization approach that uses the diffusion model +as a prior. 
Our approach recovers accurate and plausible 3D social interactions +from noisy initial estimates, outperforming state-of-the-art methods. Our code, +data, and model are availableat our project website at: muelea.github.io/buddi.",cs.CV,['cs.CV'] +Learning Degradation Independent Representations for Camera ISP Pipelines,Yanhui Guo · Fangzhou Luo · Xiaolin Wu, ,https://arxiv.org/abs/2307.00761v3,,2307.00761v3.pdf,Learning Degradation-Independent Representations for Camera ISP Pipelines,"Image signal processing (ISP) pipeline plays a fundamental role in digital +cameras, which converts raw Bayer sensor data to RGB images. However, +ISP-generated images usually suffer from imperfections due to the compounded +degradations that stem from sensor noises, demosaicing noises, compression +artifacts, and possibly adverse effects of erroneous ISP hyperparameter +settings such as ISO and gamma values. In a general sense, these ISP +imperfections can be considered as degradations. The highly complex mechanisms +of ISP degradations, some of which are even unknown, pose great challenges to +the generalization capability of deep neural networks (DNN) for image +restoration and to their adaptability to downstream tasks. To tackle the +issues, we propose a novel DNN approach to learn degradation-independent +representations (DiR) through the refinement of a self-supervised learned +baseline representation. The proposed DiR learning technique has remarkable +domain generalization capability and consequently, it outperforms +state-of-the-art methods across various downstream tasks, including blind image +restoration, object detection, and instance segmentation, as verified in our +experiments.",cs.CV,['cs.CV'] +VideoCutLER: Surprisingly Simple Unsupervised Video Instance Segmentation,XuDong Wang · Ishan Misra · Ziyun Zeng · Rohit Girdhar · Trevor Darrell, ,https://arxiv.org/abs/2308.14710,,2308.14710.pdf,VideoCutLER: Surprisingly Simple Unsupervised Video Instance Segmentation,"Existing approaches to unsupervised video instance segmentation typically +rely on motion estimates and experience difficulties tracking small or +divergent motions. We present VideoCutLER, a simple method for unsupervised +multi-instance video segmentation without using motion-based learning signals +like optical flow or training on natural videos. Our key insight is that using +high-quality pseudo masks and a simple video synthesis method for model +training is surprisingly sufficient to enable the resulting video model to +effectively segment and track multiple instances across video frames. We show +the first competitive unsupervised learning results on the challenging +YouTubeVIS-2019 benchmark, achieving 50.7% APvideo^50 , surpassing the previous +state-of-the-art by a large margin. VideoCutLER can also serve as a strong +pretrained model for supervised video instance segmentation tasks, exceeding +DINO by 15.9% on YouTubeVIS-2019 in terms of APvideo.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +SD4Match: Learning to Prompt Stable Diffusion Model for Semantic Matching,Xinghui Li · Jingyi Lu · Kai Han · Victor Adrian Prisacariu, ,https://arxiv.org/abs/2310.17569,,2310.17569.pdf,SD4Match: Learning to Prompt Stable Diffusion Model for Semantic Matching,"In this paper, we address the challenge of matching semantically similar +keypoints across image pairs. Existing research indicates that the intermediate +output of the UNet within the Stable Diffusion (SD) can serve as robust image +feature maps for such a matching task. 
We demonstrate that by employing a basic +prompt tuning technique, the inherent potential of Stable Diffusion can be +harnessed, resulting in a significant enhancement in accuracy over previous +approaches. We further introduce a novel conditional prompting module that +conditions the prompt on the local details of the input image pairs, leading to +a further improvement in performance. We designate our approach as SD4Match, +short for Stable Diffusion for Semantic Matching. Comprehensive evaluations of +SD4Match on the PF-Pascal, PF-Willow, and SPair-71k datasets show that it sets +new benchmarks in accuracy across all these datasets. Particularly, SD4Match +outperforms the previous state-of-the-art by a margin of 12 percentage points +on the challenging SPair-71k dataset.",cs.CV,"['cs.CV', 'cs.LG']" +PoNQ: a Neural QEM-based Mesh Representation,Nissim Maruani · Maks Ovsjanikov · Pierre Alliez · Mathieu Desbrun,https://nissmar.github.io/projects/ponq/,https://arxiv.org/abs/2403.12870,,2403.12870.pdf,PoNQ: a Neural QEM-based Mesh Representation,"Although polygon meshes have been a standard representation in geometry +processing, their irregular and combinatorial nature hinders their suitability +for learning-based applications. In this work, we introduce a novel learnable +mesh representation through a set of local 3D sample Points and their +associated Normals and Quadric error metrics (QEM) w.r.t. the underlying shape, +which we denote PoNQ. A global mesh is directly derived from PoNQ by +efficiently leveraging the knowledge of the local quadric errors. Besides +marking the first use of QEM within a neural shape representation, our +contribution guarantees both topological and geometrical properties by ensuring +that a PoNQ mesh does not self-intersect and is always the boundary of a +volume. Notably, our representation does not rely on a regular grid, is +supervised directly by the target surface alone, and also handles open surfaces +with boundaries and/or sharp features. We demonstrate the efficacy of PoNQ +through a learning-based mesh prediction from SDF grids and show that our +method surpasses recent state-of-the-art techniques in terms of both surface +and edge-based metrics.",cs.CV,['cs.CV'] +M&M VTO: Multi-Garment Virtual Try-On and Editing,Luyang Zhu · Yingwei Li · Nan Liu · Hao Peng · Dawei Yang · Ira Kemelmacher-Shlizerman,https://mmvto.github.io/,https://arxiv.org/abs/2405.07472,,2405.07472.pdf,GaussianVTON: 3D Human Virtual Try-ON via Multi-Stage Gaussian Splatting Editing with Image Prompting,"The increasing prominence of e-commerce has underscored the importance of +Virtual Try-On (VTON). However, previous studies predominantly focus on the 2D +realm and rely heavily on extensive data for training. Research on 3D VTON +primarily centers on garment-body shape compatibility, a topic extensively +covered in 2D VTON. Thanks to advances in 3D scene editing, a 2D diffusion +model has now been adapted for 3D editing via multi-viewpoint editing. In this +work, we propose GaussianVTON, an innovative 3D VTON pipeline integrating +Gaussian Splatting (GS) editing with 2D VTON. To facilitate a seamless +transition from 2D to 3D VTON, we propose, for the first time, the use of only +images as editing prompts for 3D editing. To further address issues, e.g., face +blurring, garment inaccuracy, and degraded viewpoint quality during editing, we +devise a three-stage refinement strategy to gradually mitigate potential +issues. 
Furthermore, we introduce a new editing strategy termed Edit Recall +Reconstruction (ERR) to tackle the limitations of previous editing strategies +in leading to complex geometric changes. Our comprehensive experiments +demonstrate the superiority of GaussianVTON, offering a novel perspective on 3D +VTON while also establishing a novel starting point for image-prompting 3D +scene editing.",cs.CV,['cs.CV'] +One More Step: A Versatile Plug-and-Play Module for Rectifying Diffusion Schedule Flaws and Enhancing Low-Frequency Controls,Minghui Hu · Jianbin Zheng · Chuanxia Zheng · Chaoyue Wang · Dacheng Tao · Tat-Jen Cham, ,https://arxiv.org/abs/2311.15744,,2311.15744.pdf,One More Step: A Versatile Plug-and-Play Module for Rectifying Diffusion Schedule Flaws and Enhancing Low-Frequency Controls,"It is well known that many open-released foundational diffusion models have +difficulty in generating images that substantially depart from average +brightness, despite such images being present in the training data. This is due +to an inconsistency: while denoising starts from pure Gaussian noise during +inference, the training noise schedule retains residual data even in the final +timestep distribution, due to difficulties in numerical conditioning in +mainstream formulation, leading to unintended bias during inference. To +mitigate this issue, certain $\epsilon$-prediction models are combined with an +ad-hoc offset-noise methodology. In parallel, some contemporary models have +adopted zero-terminal SNR noise schedules together with +$\mathbf{v}$-prediction, which necessitate major alterations to pre-trained +models. However, such changes risk destabilizing a large multitude of +community-driven applications anchored on these pre-trained models. In light of +this, our investigation revisits the fundamental causes, leading to our +proposal of an innovative and principled remedy, called One More Step (OMS). By +integrating a compact network and incorporating an additional simple yet +effective step during inference, OMS elevates image fidelity and harmonizes the +dichotomy between training and inference, while preserving original model +parameters. Once trained, various pre-trained diffusion models with the same +latent domain can share the same OMS module.",cs.CV,['cs.CV'] +Back to 3D: Few-Shot 3D Keypoint Detection with Back-Projected 2D Features,Thomas Wimmer · Peter Wonka · Maks Ovsjanikov,https://wimmerth.github.io/back-to-3d.html,https://arxiv.org/abs/2311.18113,,2311.18113.pdf,Back to 3D: Few-Shot 3D Keypoint Detection with Back-Projected 2D Features,"With the immense growth of dataset sizes and computing resources in recent +years, so-called foundation models have become popular in NLP and vision tasks. +In this work, we propose to explore foundation models for the task of keypoint +detection on 3D shapes. A unique characteristic of keypoint detection is that +it requires semantic and geometric awareness while demanding high localization +accuracy. To address this problem, we propose, first, to back-project features +from large pre-trained 2D vision models onto 3D shapes and employ them for this +task. We show that we obtain robust 3D features that contain rich semantic +information and analyze multiple candidate features stemming from different 2D +foundation models. Second, we employ a keypoint candidate optimization module +which aims to match the average observed distribution of keypoints on the shape +and is guided by the back-projected features. 
The resulting approach achieves a +new state of the art for few-shot keypoint detection on the KeyPointNet +dataset, almost doubling the performance of the previous best methods.",cs.CV,"['cs.CV', 'cs.GR']" +Bidirectional Multi-Scale Implicit Neural Representations for Image Deraining,Xiang Chen · Jinshan Pan · Jiangxin Dong,https://github.com/cschenxiang/NeRD-Rain,https://arxiv.org/abs/2404.01547v1,,2404.01547v1.pdf,Bidirectional Multi-Scale Implicit Neural Representations for Image Deraining,"How to effectively explore multi-scale representations of rain streaks is +important for image deraining. In contrast to existing Transformer-based +methods that depend mostly on single-scale rain appearance, we develop an +end-to-end multi-scale Transformer that leverages the potentially useful +features in various scales to facilitate high-quality image reconstruction. To +better explore the common degradation representations from spatially-varying +rain streaks, we incorporate intra-scale implicit neural representations based +on pixel coordinates with the degraded inputs in a closed-loop design, enabling +the learned features to facilitate rain removal and improve the robustness of +the model in complex scenarios. To ensure richer collaborative representation +from different scales, we embed a simple yet effective inter-scale +bidirectional feedback operation into our multi-scale Transformer by performing +coarse-to-fine and fine-to-coarse information communication. Extensive +experiments demonstrate that our approach, named as NeRD-Rain, performs +favorably against the state-of-the-art ones on both synthetic and real-world +benchmark datasets. The source code and trained models are available at +https://github.com/cschenxiang/NeRD-Rain.",cs.CV,['cs.CV'] +InstanceDiffusion: Instance-level Control for Image Generation,XuDong Wang · Trevor Darrell · Sai Saketh Rambhatla · Rohit Girdhar · Ishan Misra, ,https://arxiv.org/abs/2402.03290,,2402.03290.pdf,InstanceDiffusion: Instance-level Control for Image Generation,"Text-to-image diffusion models produce high quality images but do not offer +control over individual instances in the image. We introduce InstanceDiffusion +that adds precise instance-level control to text-to-image diffusion models. +InstanceDiffusion supports free-form language conditions per instance and +allows flexible ways to specify instance locations such as simple single +points, scribbles, bounding boxes or intricate instance segmentation masks, and +combinations thereof. We propose three major changes to text-to-image models +that enable precise instance-level control. Our UniFusion block enables +instance-level conditions for text-to-image models, the ScaleU block improves +image fidelity, and our Multi-instance Sampler improves generations for +multiple instances. InstanceDiffusion significantly surpasses specialized +state-of-the-art models for each location condition. Notably, on the COCO +dataset, we outperform previous state-of-the-art by 20.4% AP$_{50}^\text{box}$ +for box inputs, and 25.4% IoU for mask inputs.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +Constructing and Exploring Intermediate Domains in Mixed Domain Semi-supervised Medical Image Segmentation,Qinghe Ma · Jian Zhang · Lei Qi · Qian Yu · Yinghuan Shi · Yang Gao, ,https://arxiv.org/abs/2404.08951,,2404.08951.pdf,Constructing and Exploring Intermediate Domains in Mixed Domain Semi-supervised Medical Image Segmentation,"Both limited annotation and domain shift are prevalent challenges in medical +image segmentation. 
Traditional semi-supervised segmentation and unsupervised +domain adaptation methods address one of these issues separately. However, the +coexistence of limited annotation and domain shift is quite common, which +motivates us to introduce a novel and challenging scenario: Mixed Domain +Semi-supervised medical image Segmentation (MiDSS). In this scenario, we handle +data from multiple medical centers, with limited annotations available for a +single domain and a large amount of unlabeled data from multiple domains. We +found that the key to solving the problem lies in how to generate reliable +pseudo labels for the unlabeled data in the presence of domain shift with +labeled data. To tackle this issue, we employ Unified Copy-Paste (UCP) between +images to construct intermediate domains, facilitating the knowledge transfer +from the domain of labeled data to the domains of unlabeled data. To fully +utilize the information within the intermediate domain, we propose a symmetric +Guidance training strategy (SymGD), which additionally offers direct guidance +to unlabeled data by merging pseudo labels from intermediate samples. +Subsequently, we introduce a Training Process aware Random Amplitude MixUp +(TP-RAM) to progressively incorporate style-transition components into +intermediate samples. Compared with existing state-of-the-art approaches, our +method achieves a notable 13.57% improvement in Dice score on Prostate dataset, +as demonstrated on three public datasets. Our code is available at +https://github.com/MQinghe/MiDSS .",cs.CV,"['cs.CV', 'cs.LG']" +NRDF: Neural Riemannian Distance Fields for Learning Articulated Pose Priors,Yannan He · Garvita Tiwari · Tolga Birdal · Jan Lenssen · Gerard Pons-Moll, ,https://arxiv.org/abs/2403.03122v1,,2403.03122v1.pdf,NRDF: Neural Riemannian Distance Fields for Learning Articulated Pose Priors,"Faithfully modeling the space of articulations is a crucial task that allows +recovery and generation of realistic poses, and remains a notorious challenge. +To this end, we introduce Neural Riemannian Distance Fields (NRDFs), +data-driven priors modeling the space of plausible articulations, represented +as the zero-level-set of a neural field in a high-dimensional +product-quaternion space. To train NRDFs only on positive examples, we +introduce a new sampling algorithm, ensuring that the geodesic distances follow +a desired distribution, yielding a principled distance field learning paradigm. +We then devise a projection algorithm to map any random pose onto the level-set +by an adaptive-step Riemannian optimizer, adhering to the product manifold of +joint rotations at all times. NRDFs can compute the Riemannian gradient via +backpropagation and by mathematical analogy, are related to Riemannian flow +matching, a recent generative model. We conduct a comprehensive evaluation of +NRDF against other pose priors in various downstream tasks, i.e., pose +generation, image-based pose estimation, and solving inverse kinematics, +highlighting NRDF's superior performance. 
Besides humans, NRDF's versatility +extends to hand and animal poses, as it can effectively represent any +articulation.",cs.CV,['cs.CV'] +Privacy-Preserving Face Recognition Using Trainable Feature Subtraction,Yuxi Mi · Zhizhou Zhong · Yuge Huang · Jiazhen Ji · Jianqing Xu · Jun Wang · ShaoMing Wang · Shouhong Ding · Shuigeng Zhou,https://github.com/Tencent/TFace/tree/master/recognition/tasks/minusface,https://arxiv.org/abs/2403.12457,,,Privacy-Preserving Face Recognition Using Trainable Feature Subtraction,"The widespread adoption of face recognition has led to increasing privacy +concerns, as unauthorized access to face images can expose sensitive personal +information. This paper explores face image protection against viewing and +recovery attacks. Inspired by image compression, we propose creating a visually +uninformative face image through feature subtraction between an original face +and its model-produced regeneration. Recognizable identity features within the +image are encouraged by co-training a recognition model on its high-dimensional +feature representation. To enhance privacy, the high-dimensional representation +is crafted through random channel shuffling, resulting in randomized +recognizable images devoid of attacker-leverageable texture details. We distill +our methodologies into a novel privacy-preserving face recognition method, +MinusFace. Experiments demonstrate its high recognition accuracy and effective +privacy protection. Its code is available at https://github.com/Tencent/TFace.",cs.CV,['cs.CV'] +Generating Human Motion in 3D Scenes from Text Descriptions,Zhi Cen · Huaijin Pi · Sida Peng · Zehong Shen · Minghui Yang · Shuai Zhu · Hujun Bao · Xiaowei Zhou,https://zju3dv.github.io/text_scene_motion/,https://arxiv.org/html/2405.07784v1,,2405.07784v1.pdf,Generating Human Motion in 3D Scenes from Text Descriptions,"Generating human motions from textual descriptions has gained growing +research interest due to its wide range of applications. However, only a few +works consider human-scene interactions together with text conditions, which is +crucial for visual and physical realism. This paper focuses on the task of +generating human motions in 3D indoor scenes given text descriptions of the +human-scene interactions. This task presents challenges due to the +multi-modality nature of text, scene, and motion, as well as the need for +spatial reasoning. To address these challenges, we propose a new approach that +decomposes the complex problem into two more manageable sub-problems: (1) +language grounding of the target object and (2) object-centric motion +generation. For language grounding of the target object, we leverage the power +of large language models. For motion generation, we design an object-centric +scene representation for the generative model to focus on the target object, +thereby reducing the scene complexity and facilitating the modeling of the +relationship between human motions and the object. 
Experiments demonstrate the +better motion quality of our approach compared to baselines and validate our +design choices.",cs.CV,['cs.CV'] +HumanGaussian: Text-Driven 3D Human Generation with Gaussian Splatting,Xian Liu · Xiaohang Zhan · Jiaxiang Tang · Ying Shan · Gang Zeng · Dahua Lin · Xihui Liu · Ziwei Liu,https://alvinliu0.github.io/projects/HumanGaussian,https://arxiv.org/abs/2311.17061,,2311.17061.pdf,HumanGaussian: Text-Driven 3D Human Generation with Gaussian Splatting,"Realistic 3D human generation from text prompts is a desirable yet +challenging task. Existing methods optimize 3D representations like mesh or +neural fields via score distillation sampling (SDS), which suffers from +inadequate fine details or excessive training time. In this paper, we propose +an efficient yet effective framework, HumanGaussian, that generates +high-quality 3D humans with fine-grained geometry and realistic appearance. Our +key insight is that 3D Gaussian Splatting is an efficient renderer with +periodic Gaussian shrinkage or growing, where such adaptive density control can +be naturally guided by intrinsic human structures. Specifically, 1) we first +propose a Structure-Aware SDS that simultaneously optimizes human appearance +and geometry. The multi-modal score function from both RGB and depth space is +leveraged to distill the Gaussian densification and pruning process. 2) +Moreover, we devise an Annealed Negative Prompt Guidance by decomposing SDS +into a noisier generative score and a cleaner classifier score, which well +addresses the over-saturation issue. The floating artifacts are further +eliminated based on Gaussian size in a prune-only phase to enhance generation +smoothness. Extensive experiments demonstrate the superior efficiency and +competitive quality of our framework, rendering vivid 3D humans under diverse +scenarios. Project Page: https://alvinliu0.github.io/projects/HumanGaussian",cs.CV,['cs.CV'] +"See, Say, and Segment: Correcting False Premises with LMMs",Tsung-Han Wu · Giscard Biamby · David Chan · Lisa Dunlap · Ritwik Gupta · XuDong Wang · Trevor Darrell · Joseph Gonzalez,https://see-say-segment.github.io/,https://arxiv.org/html/2312.08366v1,,2312.08366v1.pdf,"See, Say, and Segment: Teaching LMMs to Overcome False Premises","Current open-source Large Multimodal Models (LMMs) excel at tasks such as +open-vocabulary language grounding and segmentation but can suffer under false +premises when queries imply the existence of something that is not actually +present in the image. We observe that existing methods that fine-tune an LMM to +segment images significantly degrade their ability to reliably determine +(""see"") if an object is present and to interact naturally with humans (""say""), +a form of catastrophic forgetting. In this work, we propose a cascading and +joint training approach for LMMs to solve this task, avoiding catastrophic +forgetting of previous skills. Our resulting model can ""see"" by detecting +whether objects are present in an image, ""say"" by telling the user if they are +not, proposing alternative queries or correcting semantic errors in the query, +and finally ""segment"" by outputting the mask of the desired objects if they +exist. Additionally, we introduce a novel False Premise Correction benchmark +dataset, an extension of existing RefCOCO(+/g) referring segmentation datasets +(which we call FP-RefCOCO(+/g)). 
The results show that our method not only +detects false premises up to 55% better than existing approaches, but under +false premise conditions produces relative cIOU improvements of more than 31% +over baselines, and produces natural language feedback judged helpful up to 67% +of the time.",cs.CV,['cs.CV'] +Investigating and Mitigating the Side Effects of Noisy Views for Self-Supervised Clustering Algorithms in Practical Multi-View Scenarios,Jie Xu · Yazhou Ren · Xiaolong Wang · Lei Feng · Zheng Zhang · Gang Niu · Xiaofeng Zhu,https://github.com/SubmissionsIn/MVCAN,,https://submissionsin.github.io/,,,,,nan +Learned representation-guided diffusion models for large-image generation,Alexandros Graikos · Srikar Yellapragada · Minh-Quan Le · Saarthak Kapse · Prateek Prasanna · Joel Saltz · Dimitris Samaras,https://histodiffusion.github.io/docs/publications/cvpr_24,https://arxiv.org/abs/2312.07330,,2312.07330.pdf,Learned representation-guided diffusion models for large-image generation,"To synthesize high-fidelity samples, diffusion models typically require +auxiliary data to guide the generation process. However, it is impractical to +procure the painstaking patch-level annotation effort required in specialized +domains like histopathology and satellite imagery; it is often performed by +domain experts and involves hundreds of millions of patches. Modern-day +self-supervised learning (SSL) representations encode rich semantic and visual +information. In this paper, we posit that such representations are expressive +enough to act as proxies to fine-grained human labels. We introduce a novel +approach that trains diffusion models conditioned on embeddings from SSL. Our +diffusion models successfully project these features back to high-quality +histopathology and remote sensing images. In addition, we construct larger +images by assembling spatially consistent patches inferred from SSL embeddings, +preserving long-range dependencies. Augmenting real data by generating +variations of real images improves downstream classifier accuracy for +patch-level and larger, image-scale classification tasks. Our models are +effective even on datasets not encountered during training, demonstrating their +robustness and generalizability. Generating images from learned embeddings is +agnostic to the source of the embeddings. The SSL embeddings used to generate a +large image can either be extracted from a reference image, or sampled from an +auxiliary model conditioned on any related modality (e.g. class labels, text, +genomic data). As proof of concept, we introduce the text-to-large image +synthesis paradigm where we successfully synthesize large pathology and +satellite images out of text descriptions.",cs.CV,['cs.CV'] +PLACE: Adaptive Layout-Semantic Fusion for Semantic Image Synthesis,Zhengyao Lv · Yuxiang Wei · Wangmeng Zuo · Kwan-Yee K. Wong, ,https://arxiv.org/abs/2403.01852,,2403.01852.pdf,PLACE: Adaptive Layout-Semantic Fusion for Semantic Image Synthesis,"Recent advancements in large-scale pre-trained text-to-image models have led +to remarkable progress in semantic image synthesis. Nevertheless, synthesizing +high-quality images with consistent semantics and layout remains a challenge. +In this paper, we propose the adaPtive LAyout-semantiC fusion modulE (PLACE) +that harnesses pre-trained models to alleviate the aforementioned issues. +Specifically, we first employ the layout control map to faithfully represent +layouts in the feature space. 
Subsequently, we combine the layout and semantic +features in a timestep-adaptive manner to synthesize images with realistic +details. During fine-tuning, we propose the Semantic Alignment (SA) loss to +further enhance layout alignment. Additionally, we introduce the Layout-Free +Prior Preservation (LFP) loss, which leverages unlabeled data to maintain the +priors of pre-trained models, thereby improving the visual quality and semantic +consistency of synthesized images. Extensive experiments demonstrate that our +approach performs favorably in terms of visual quality, semantic consistency, +and layout alignment. The source code and model are available at +https://github.com/cszy98/PLACE/tree/main.",cs.CV,['cs.CV'] +Regressor-Segmenter Mutual Prompt Learning for Crowd Counting,Mingyue Guo · Li Yuan · Zhaoyi Yan · Binghui Chen · Yaowei Wang · Qixiang Ye, ,https://arxiv.org/abs/2312.01711v2,,2312.01711v2.pdf,Regressor-Segmenter Mutual Prompt Learning for Crowd Counting,"Crowd counting has achieved significant progress by training regressors to +predict instance positions. In heavily crowded scenarios, however, regressors +are challenged by uncontrollable annotation variance, which causes density map +bias and context information inaccuracy. In this study, we propose mutual +prompt learning (mPrompt), which leverages a regressor and a segmenter as +guidance for each other, solving bias and inaccuracy caused by annotation +variance while distinguishing foreground from background. In specific, mPrompt +leverages point annotations to tune the segmenter and predict pseudo head masks +in a way of point prompt learning. It then uses the predicted segmentation +masks, which serve as spatial constraint, to rectify biased point annotations +as context prompt learning. mPrompt defines a way of mutual information +maximization from prompt learning, mitigating the impact of annotation variance +while improving model accuracy. Experiments show that mPrompt significantly +reduces the Mean Average Error (MAE), demonstrating the potential to be general +framework for down-stream vision tasks.",cs.CV,['cs.CV'] +SplattingAvatar: Realistic Real-Time Human Avatars with Mesh-Embedded Gaussian Splatting,Zhijing Shao · Wang Zhaolong · Zhuang Li · Duotun Wang · Xiangru Lin · Yu Zhang · Mingming Fan · Zeyu Wang,https://initialneil.github.io/SplattingAvatar,https://arxiv.org/abs/2403.05087,,2403.05087.pdf,SplattingAvatar: Realistic Real-Time Human Avatars with Mesh-Embedded Gaussian Splatting,"We present SplattingAvatar, a hybrid 3D representation of photorealistic +human avatars with Gaussian Splatting embedded on a triangle mesh, which +renders over 300 FPS on a modern GPU and 30 FPS on a mobile device. We +disentangle the motion and appearance of a virtual human with explicit mesh +geometry and implicit appearance modeling with Gaussian Splatting. The +Gaussians are defined by barycentric coordinates and displacement on a triangle +mesh as Phong surfaces. We extend lifted optimization to simultaneously +optimize the parameters of the Gaussians while walking on the triangle mesh. +SplattingAvatar is a hybrid representation of virtual humans where the mesh +represents low-frequency motion and surface deformation, while the Gaussians +take over the high-frequency geometry and detailed appearance. 
Unlike existing +deformation methods that rely on an MLP-based linear blend skinning (LBS) field +for motion, we control the rotation and translation of the Gaussians directly +by mesh, which empowers its compatibility with various animation techniques, +e.g., skeletal animation, blend shapes, and mesh editing. Trainable from +monocular videos for both full-body and head avatars, SplattingAvatar shows +state-of-the-art rendering quality across multiple datasets.",cs.GR,"['cs.GR', 'cs.CV']" +FastMAC: Stochastic Spectral Sampling of Correspondence Graph,Yifei Zhang · Hao Zhao · Hongyang Li · Siheng Chen,https://github.com/Forrest-110/FastMAC,https://arxiv.org/abs/2403.08770,,2403.08770.pdf,FastMAC: Stochastic Spectral Sampling of Correspondence Graph,"3D correspondence, i.e., a pair of 3D points, is a fundamental concept in +computer vision. A set of 3D correspondences, when equipped with compatibility +edges, forms a correspondence graph. This graph is a critical component in +several state-of-the-art 3D point cloud registration approaches, e.g., the one +based on maximal cliques (MAC). However, its properties have not been well +understood. So we present the first study that introduces graph signal +processing into the domain of correspondence graph. We exploit the generalized +degree signal on correspondence graph and pursue sampling strategies that +preserve high-frequency components of this signal. To address time-consuming +singular value decomposition in deterministic sampling, we resort to a +stochastic approximate sampling strategy. As such, the core of our method is +the stochastic spectral sampling of correspondence graph. As an application, we +build a complete 3D registration algorithm termed as FastMAC, that reaches +real-time speed while leading to little to none performance drop. Through +extensive experiments, we validate that FastMAC works for both indoor and +outdoor benchmarks. For example, FastMAC can accelerate MAC by 80 times while +maintaining high registration success rate on KITTI. Codes are publicly +available at https://github.com/Forrest-110/FastMAC.",cs.CV,"['cs.CV', 'cs.AI', 'cs.RO']" +Fairy: Fast Parallellized Instruction-Guided Video-to-Video Synthesis,Bichen Wu · Ching-Yao Chuang · Xiaoyan Wang · Yichen Jia · Kapil Krishnakumar · Tong Xiao · Feng Liang · Licheng Yu · Peter Vajda, ,https://arxiv.org/abs/2312.13834,,2312.13834.pdf,Fairy: Fast Parallelized Instruction-Guided Video-to-Video Synthesis,"In this paper, we introduce Fairy, a minimalist yet robust adaptation of +image-editing diffusion models, enhancing them for video editing applications. +Our approach centers on the concept of anchor-based cross-frame attention, a +mechanism that implicitly propagates diffusion features across frames, ensuring +superior temporal coherence and high-fidelity synthesis. Fairy not only +addresses limitations of previous models, including memory and processing +speed. It also improves temporal consistency through a unique data augmentation +strategy. This strategy renders the model equivariant to affine transformations +in both source and target images. Remarkably efficient, Fairy generates +120-frame 512x384 videos (4-second duration at 30 FPS) in just 14 seconds, +outpacing prior works by at least 44x. 
A comprehensive user study, involving +1000 generated samples, confirms that our approach delivers superior quality, +decisively outperforming established methods.",cs.CV,['cs.CV'] +MMA: Multi-Modal Adapter for Vision-Language Models,Lingxiao Yang · Ru-Yuan Zhang · Yanchen Wang · Xiaohua Xie, ,https://arxiv.org/abs/2405.15684,,2405.15684.pdf,Prompt-Aware Adapter: Towards Learning Adaptive Visual Tokens for Multimodal Large Language Models,"To bridge the gap between vision and language modalities, Multimodal Large +Language Models (MLLMs) usually learn an adapter that converts visual inputs to +understandable tokens for Large Language Models (LLMs). However, most adapters +generate consistent visual tokens, regardless of the specific objects of +interest mentioned in the prompt. Since these adapters distribute equal +attention to every detail in the image and focus on the entire scene, they may +increase the cognitive load for LLMs, particularly when processing complex +scenes. To alleviate this problem, we propose prompt-aware adapters. These +adapters are designed with the capability to dynamically embed visual inputs +based on the specific focus of the prompt. Specifically, prompt-aware adapters +utilize both global and local textual features to capture the most relevant +visual clues from the prompt at both coarse and fine granularity levels. This +approach significantly enhances the ability of LLMs to understand and interpret +visual content. Experiments on various visual question answering tasks, such as +counting and position reasoning, demonstrate the effectiveness of prompt-aware +adapters.",cs.CV,"['cs.CV', 'cs.AI']" +RoDLA: Benchmarking the Robustness of Document Layout Analysis Models,Yufan Chen · Jiaming Zhang · Kunyu Peng · Junwei Zheng · Ruiping Liu · Philip H.S. Torr · Rainer Stiefelhagen,https://yufanchen96.github.io/projects/RoDLA/,https://arxiv.org/abs/2403.14442,,2403.14442.pdf,RoDLA: Benchmarking the Robustness of Document Layout Analysis Models,"Before developing a Document Layout Analysis (DLA) model in real-world +applications, conducting comprehensive robustness testing is essential. +However, the robustness of DLA models remains underexplored in the literature. +To address this, we are the first to introduce a robustness benchmark for DLA +models, which includes 450K document images of three datasets. To cover +realistic corruptions, we propose a perturbation taxonomy with 36 common +document perturbations inspired by real-world document processing. +Additionally, to better understand document perturbation impacts, we propose +two metrics, Mean Perturbation Effect (mPE) for perturbation assessment and +Mean Robustness Degradation (mRD) for robustness evaluation. Furthermore, we +introduce a self-titled model, i.e., Robust Document Layout Analyzer (RoDLA), +which improves attention mechanisms to boost extraction of robust features. +Experiments on the proposed benchmarks (PubLayNet-P, DocLayNet-P, and +M$^6$Doc-P) demonstrate that RoDLA obtains state-of-the-art mRD scores of +115.7, 135.4, and 150.4, respectively. 
Compared to previous methods, RoDLA +achieves notable improvements in mAP of +3.8%, +7.1% and +12.1%, respectively.",cs.CV,['cs.CV'] +LAKE-RED: Camouflaged Images Generation by Latent Background Knowledge Retrieval-Augmented Diffusion,Pancheng Zhao · Peng Xu · Pengda Qin · Deng-Ping Fan · Zhicheng Zhang · Guoli Jia · Bowen Zhou · Jufeng Yang, ,https://arxiv.org/abs/2404.00292,,2404.00292.pdf,LAKE-RED: Camouflaged Images Generation by Latent Background Knowledge Retrieval-Augmented Diffusion,"Camouflaged vision perception is an important vision task with numerous +practical applications. Due to the expensive collection and labeling costs, +this community struggles with a major bottleneck that the species category of +its datasets is limited to a small number of object species. However, the +existing camouflaged generation methods require specifying the background +manually, thus failing to extend the camouflaged sample diversity in a low-cost +manner. In this paper, we propose a Latent Background Knowledge +Retrieval-Augmented Diffusion (LAKE-RED) for camouflaged image generation. To +our knowledge, our contributions mainly include: (1) For the first time, we +propose a camouflaged generation paradigm that does not need to receive any +background inputs. (2) Our LAKE-RED is the first knowledge retrieval-augmented +method with interpretability for camouflaged generation, in which we propose an +idea that knowledge retrieval and reasoning enhancement are separated +explicitly, to alleviate the task-specific challenges. Moreover, our method is +not restricted to specific foreground targets or backgrounds, offering a +potential for extending camouflaged vision perception to more diverse domains. +(3) Experimental results demonstrate that our method outperforms the existing +approaches, generating more realistic camouflage images.",cs.CV,['cs.CV'] +Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding,Peng Jin · Ryuichi Takanobu · Cai Zhang · Xiaochun Cao · Li Yuan,https://github.com/PKU-YuanGroup/Chat-UniVi,https://arxiv.org/abs/2311.08046,,2311.08046.pdf,Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding,"Large language models have demonstrated impressive universal capabilities +across a wide range of open-ended tasks and have extended their utility to +encompass multimodal conversations. However, existing methods encounter +challenges in effectively handling both image and video understanding, +particularly with limited visual tokens. In this work, we introduce Chat-UniVi, +a Unified Vision-language model capable of comprehending and engaging in +conversations involving images and videos through a unified visual +representation. Specifically, we employ a set of dynamic visual tokens to +uniformly represent images and videos. This representation framework empowers +the model to efficiently utilize a limited number of visual tokens to +simultaneously capture the spatial details necessary for images and the +comprehensive temporal relationship required for videos. Moreover, we leverage +a multi-scale representation, enabling the model to perceive both high-level +semantic concepts and low-level visual details. Notably, Chat-UniVi is trained +on a mixed dataset containing both images and videos, allowing direct +application to tasks involving both mediums without requiring any +modifications. 
Extensive experimental results demonstrate that Chat-UniVi +consistently outperforms even existing methods exclusively designed for either +images or videos. Code is available at +https://github.com/PKU-YuanGroup/Chat-UniVi.",cs.CV,['cs.CV'] +HAVE-FUN: Human Avatar Reconstruction from Few-Shot Unconstrained Images,Xihe Yang · Xingyu Chen · Daiheng Gao · Finn Wong · Xiaoguang Han · Baoyuan Wang, ,https://arxiv.org/abs/2311.15672,,2311.15672.pdf,HAVE-FUN: Human Avatar Reconstruction from Few-Shot Unconstrained Images,"As for human avatar reconstruction, contemporary techniques commonly +necessitate the acquisition of costly data and struggle to achieve satisfactory +results from a small number of casual images. In this paper, we investigate +this task from a few-shot unconstrained photo album. The reconstruction of +human avatars from such data sources is challenging because of limited data +amount and dynamic articulated poses. For handling dynamic data, we integrate a +skinning mechanism with deep marching tetrahedra (DMTet) to form a drivable +tetrahedral representation, which drives arbitrary mesh topologies generated by +the DMTet for the adaptation of unconstrained images. To effectively mine +instructive information from few-shot data, we devise a two-phase optimization +method with few-shot reference and few-shot guidance. The former focuses on +aligning avatar identity with reference images, while the latter aims to +generate plausible appearances for unseen regions. Overall, our framework, +called HaveFun, can undertake avatar reconstruction, rendering, and animation. +Extensive experiments on our developed benchmarks demonstrate that HaveFun +exhibits substantially superior performance in reconstructing the human body +and hand. Project website: https://seanchenxy.github.io/HaveFunWeb/.",cs.CV,['cs.CV'] +BadCLIP: Trigger-Aware Prompt Learning for Backdoor Attacks on CLIP,Jiawang Bai · Kuofeng Gao · Shaobo Min · Shu-Tao Xia · Zhifeng Li · Wei Liu, ,https://arxiv.org/abs/2311.16194,,2311.16194.pdf,BadCLIP: Trigger-Aware Prompt Learning for Backdoor Attacks on CLIP,"Contrastive Vision-Language Pre-training, known as CLIP, has shown promising +effectiveness in addressing downstream image recognition tasks. However, recent +works revealed that the CLIP model can be implanted with a downstream-oriented +backdoor. On downstream tasks, one victim model performs well on clean samples +but predicts a specific target class whenever a specific trigger is present. +For injecting a backdoor, existing attacks depend on a large amount of +additional data to maliciously fine-tune the entire pre-trained CLIP model, +which makes them inapplicable to data-limited scenarios. In this work, +motivated by the recent success of learnable prompts, we address this problem +by injecting a backdoor into the CLIP model in the prompt learning stage. Our +method named BadCLIP is built on a novel and effective mechanism in backdoor +attacks on CLIP, i.e., influencing both the image and text encoders with the +trigger. It consists of a learnable trigger applied to images and a +trigger-aware context generator, such that the trigger can change text features +via trigger-aware prompts, resulting in a powerful and generalizable attack. +Extensive experiments conducted on 11 datasets verify that the clean accuracy +of BadCLIP is similar to those of advanced prompt learning methods and the +attack success rate is higher than 99% in most cases. 
BadCLIP is also +generalizable to unseen classes, and shows a strong generalization capability +under cross-dataset and cross-domain settings.",cs.CV,['cs.CV'] +PromptKD: Unsupervised Prompt Distillation for Vision-Language Models,Zheng Li · Xiang Li · xinyi fu · Xin Zhang · Weiqiang Wang · Shuo Chen · Jian Yang,https://zhengli97.github.io/PromptKD/,https://arxiv.org/abs/2403.02781v3,,2403.02781v3.pdf,PromptKD: Unsupervised Prompt Distillation for Vision-Language Models,"Prompt learning has emerged as a valuable technique in enhancing +vision-language models (VLMs) such as CLIP for downstream tasks in specific +domains. Existing work mainly focuses on designing various learning forms of +prompts, neglecting the potential of prompts as effective distillers for +learning from larger teacher models. In this paper, we introduce an +unsupervised domain prompt distillation framework, which aims to transfer the +knowledge of a larger teacher model to a lightweight target model through +prompt-driven imitation using unlabeled domain images. Specifically, our +framework consists of two distinct stages. In the initial stage, we pre-train a +large CLIP teacher model using domain (few-shot) labels. After pre-training, we +leverage the unique decoupled-modality characteristics of CLIP by pre-computing +and storing the text features as class vectors only once through the teacher +text encoder. In the subsequent stage, the stored class vectors are shared +across teacher and student image encoders for calculating the predicted logits. +Further, we align the logits of both the teacher and student models via KL +divergence, encouraging the student image encoder to generate similar +probability distributions to the teacher through the learnable prompts. The +proposed prompt distillation process eliminates the reliance on labeled data, +enabling the algorithm to leverage a vast amount of unlabeled images within the +domain. Finally, the well-trained student image encoders and pre-stored text +features (class vectors) are utilized for inference. To our best knowledge, we +are the first to (1) perform unsupervised domain-specific prompt-driven +knowledge distillation for CLIP, and (2) establish a practical pre-storing +mechanism of text features as shared class vectors between teacher and student. +Extensive experiments on 11 datasets demonstrate the effectiveness of our +method.",cs.CV,['cs.CV'] +IBD-SLAM: Learning Image-Based Depth Fusion for Generalizable SLAM,Minghao Yin · Shangzhe Wu · Kai Han, ,https://arxiv.org/html/2405.03413v2,,2405.03413v2.pdf,SL-SLAM: A robust visual-inertial SLAM based deep feature extraction and matching,"This paper explores how deep learning techniques can improve visual-based +SLAM performance in challenging environments. By combining deep feature +extraction and deep matching methods, we introduce a versatile hybrid visual +SLAM system designed to enhance adaptability in challenging scenarios, such as +low-light conditions, dynamic lighting, weak-texture areas, and severe jitter. +Our system supports multiple modes, including monocular, stereo, +monocular-inertial, and stereo-inertial configurations. We also perform +analysis how to combine visual SLAM with deep learning methods to enlighten +other researches. Through extensive experiments on both public datasets and +self-sampled data, we demonstrate the superiority of the SL-SLAM system over +traditional approaches. 
The experimental results show that SL-SLAM outperforms +state-of-the-art SLAM algorithms in terms of localization accuracy and tracking +robustness. For the benefit of community, we make public the source code at +https://github.com/zzzzxxxx111/SLslam.",cs.RO,['cs.RO'] +GES: Generalized Exponential Splatting for Efficient Radiance Field Rendering,Abdullah J Hamdi · Luke Melas-Kyriazi · Jinjie Mai · Guocheng Qian · Ruoshi Liu · Carl Vondrick · Bernard Ghanem · Andrea Vedaldi, ,https://arxiv.org/abs/2402.10128,,2402.10128.pdf,GES: Generalized Exponential Splatting for Efficient Radiance Field Rendering,"Advancements in 3D Gaussian Splatting have significantly accelerated 3D +reconstruction and generation. However, it may require a large number of +Gaussians, which creates a substantial memory footprint. This paper introduces +GES (Generalized Exponential Splatting), a novel representation that employs +Generalized Exponential Function (GEF) to model 3D scenes, requiring far fewer +particles to represent a scene and thus significantly outperforming Gaussian +Splatting methods in efficiency with a plug-and-play replacement ability for +Gaussian-based utilities. GES is validated theoretically and empirically in +both principled 1D setup and realistic 3D scenes. + It is shown to represent signals with sharp edges more accurately, which are +typically challenging for Gaussians due to their inherent low-pass +characteristics. Our empirical analysis demonstrates that GEF outperforms +Gaussians in fitting natural-occurring signals (e.g. squares, triangles, and +parabolic signals), thereby reducing the need for extensive splitting +operations that increase the memory footprint of Gaussian Splatting. With the +aid of a frequency-modulated loss, GES achieves competitive performance in +novel-view synthesis benchmarks while requiring less than half the memory +storage of Gaussian Splatting and increasing the rendering speed by up to 39%. +The code is available on the project website https://abdullahamdi.com/ges .",cs.CV,"['cs.CV', 'cs.GR', 'cs.LG']" +Hybrid Functional Maps for Crease-Aware Non-Isometric Shape Matching,Lennart Bastian · Yizheng Xie · Nassir Navab · Zorah Lähner, ,https://arxiv.org/abs/2312.03678,,2312.03678.pdf,Hybrid Functional Maps for Crease-Aware Non-Isometric Shape Matching,"Non-isometric shape correspondence remains a fundamental challenge in +computer vision. Traditional methods using Laplace-Beltrami operator (LBO) +eigenmodes face limitations in characterizing high-frequency extrinsic shape +changes like bending and creases. We propose a novel approach of combining the +non-orthogonal extrinsic basis of eigenfunctions of the elastic thin-shell +hessian with the intrinsic ones of the LBO, creating a hybrid spectral space in +which we construct functional maps. To this end, we present a theoretical +framework to effectively integrate non-orthogonal basis functions into +descriptor- and learning-based functional map methods. Our approach can be +incorporated easily into existing functional map pipelines across varying +applications and is able to handle complex deformations beyond isometries. We +show extensive evaluations across various supervised and unsupervised settings +and demonstrate significant improvements. 
Notably, our approach achieves up to +15% better mean geodesic error for non-isometric correspondence settings and up +to 45% improvement in scenarios with topological noise.",cs.CV,['cs.CV'] +DiffSal: Joint Audio and Video Learning for Diffusion Saliency Prediction,Junwen Xiong · Peng Zhang · Tao You · Chuanyue Li · Wei Huang · Yufei Zha,https://github.com/junwenxiong/diff_sal,https://arxiv.org/abs/2403.01226,,2403.01226.pdf,DiffSal: Joint Audio and Video Learning for Diffusion Saliency Prediction,"Audio-visual saliency prediction can draw support from diverse modality +complements, but further performance enhancement is still challenged by +customized architectures as well as task-specific loss functions. In recent +studies, denoising diffusion models have shown more promising in unifying task +frameworks owing to their inherent ability of generalization. Following this +motivation, a novel Diffusion architecture for generalized audio-visual +Saliency prediction (DiffSal) is proposed in this work, which formulates the +prediction problem as a conditional generative task of the saliency map by +utilizing input audio and video as the conditions. Based on the spatio-temporal +audio-visual features, an extra network Saliency-UNet is designed to perform +multi-modal attention modulation for progressive refinement of the ground-truth +saliency map from the noisy map. Extensive experiments demonstrate that the +proposed DiffSal can achieve excellent performance across six challenging +audio-visual benchmarks, with an average relative improvement of 6.3\% over the +previous state-of-the-art results by six metrics.",cs.CV,['cs.CV'] +Gaussian Shading: Provable Performance-Lossless Image Watermarking for Diffusion Models,Zijin Yang · Kai Zeng · Kejiang Chen · Han Fang · Weiming Zhang · Nenghai Yu, ,https://arxiv.org/abs/2404.04956,,2404.04956.pdf,Gaussian Shading: Provable Performance-Lossless Image Watermarking for Diffusion Models,"Ethical concerns surrounding copyright protection and inappropriate content +generation pose challenges for the practical implementation of diffusion +models. One effective solution involves watermarking the generated images. +However, existing methods often compromise the model performance or require +additional training, which is undesirable for operators and users. To address +this issue, we propose Gaussian Shading, a diffusion model watermarking +technique that is both performance-lossless and training-free, while serving +the dual purpose of copyright protection and tracing of offending content. Our +watermark embedding is free of model parameter modifications and thus is +plug-and-play. We map the watermark to latent representations following a +standard Gaussian distribution, which is indistinguishable from latent +representations obtained from the non-watermarked diffusion model. Therefore we +can achieve watermark embedding with lossless performance, for which we also +provide theoretical proof. Furthermore, since the watermark is intricately +linked with image semantics, it exhibits resilience to lossy processing and +erasure attempts. The watermark can be extracted by Denoising Diffusion +Implicit Models (DDIM) inversion and inverse sampling. 
We evaluate Gaussian +Shading on multiple versions of Stable Diffusion, and the results demonstrate +that Gaussian Shading not only is performance-lossless but also outperforms +existing methods in terms of robustness.",cs.CV,"['cs.CV', 'cs.CR']" +Benchmarking Audio Visual Segmentation for Long-Untrimmed Videos,Chen Liu · Peike Li · Qingtao Yu · Hongwei Sheng · Dadong Wang · Lincheng Li · Xin Yu, ,https://arxiv.org/abs/2307.16620,,2307.16620.pdf,Audio-Visual Segmentation by Exploring Cross-Modal Mutual Semantics,"The audio-visual segmentation (AVS) task aims to segment sounding objects +from a given video. Existing works mainly focus on fusing audio and visual +features of a given video to achieve sounding object masks. However, we +observed that prior arts are prone to segment a certain salient object in a +video regardless of the audio information. This is because sounding objects are +often the most salient ones in the AVS dataset. Thus, current AVS methods might +fail to localize genuine sounding objects due to the dataset bias. In this +work, we present an audio-visual instance-aware segmentation approach to +overcome the dataset bias. In a nutshell, our method first localizes potential +sounding objects in a video by an object segmentation network, and then +associates the sounding object candidates with the given audio. We notice that +an object could be a sounding object in one video but a silent one in another +video. This would bring ambiguity in training our object segmentation network +as only sounding objects have corresponding segmentation masks. We thus propose +a silent object-aware segmentation objective to alleviate the ambiguity. +Moreover, since the category information of audio is unknown, especially for +multiple sounding sources, we propose to explore the audio-visual semantic +correlation and then associate audio with potential objects. Specifically, we +attend predicted audio category scores to potential instance masks and these +scores will highlight corresponding sounding instances while suppressing +inaudible ones. When we enforce the attended instance masks to resemble the +ground-truth mask, we are able to establish audio-visual semantics correlation. +Experimental results on the AVS benchmarks demonstrate that our method can +effectively segment sounding objects without being biased to salient objects.",cs.SD,"['cs.SD', 'cs.CV', 'eess.AS']" +Modular Blind Video Quality Assessment,Wen Wen · Mu Li · Yabin ZHANG · Yiting Liao · Junlin Li · Li zhang · Kede Ma, ,https://arxiv.org/abs/2402.19276,,2402.19276.pdf,Modular Blind Video Quality Assessment,"Blind video quality assessment (BVQA) plays a pivotal role in evaluating and +improving the viewing experience of end-users across a wide range of +video-based platforms and services. Contemporary deep learning-based models +primarily analyze video content in its aggressively subsampled format, while +being blind to the impact of the actual spatial resolution and frame rate on +video quality. In this paper, we propose a modular BVQA model and a method of +training it to improve its modularity. Our model comprises a base quality +predictor, a spatial rectifier, and a temporal rectifier, responding to the +visual content and distortion, spatial resolution, and frame rate changes on +video quality, respectively. During training, spatial and temporal rectifiers +are dropped out with some probabilities to render the base quality predictor a +standalone BVQA model, which should work better with the rectifiers. 
Extensive +experiments on both professionally-generated content and user-generated content +video databases show that our quality model achieves superior or comparable +performance to current methods. Additionally, the modularity of our model +offers an opportunity to analyze existing video quality databases in terms of +their spatial and temporal complexity.",eess.IV,"['eess.IV', 'cs.CV']" +Image-Text Co-Decomposition for Text-Supervised Semantic Segmentation,Ji-Jia Wu · Andy Chia-Hao Chang · Chieh-Yu Chuang · Chun-Pei Chen · Yu-Lun Liu · Min-Hung Chen · Hou-Ning Hu · Yung-Yu Chuang · Yen-Yu Lin, ,https://arxiv.org/abs/2404.04231,,2404.04231.pdf,Image-Text Co-Decomposition for Text-Supervised Semantic Segmentation,"This paper addresses text-supervised semantic segmentation, aiming to learn a +model capable of segmenting arbitrary visual concepts within images by using +only image-text pairs without dense annotations. Existing methods have +demonstrated that contrastive learning on image-text pairs effectively aligns +visual segments with the meanings of texts. We notice that there is a +discrepancy between text alignment and semantic segmentation: A text often +consists of multiple semantic concepts, whereas semantic segmentation strives +to create semantically homogeneous segments. To address this issue, we propose +a novel framework, Image-Text Co-Decomposition (CoDe), where the paired image +and text are jointly decomposed into a set of image regions and a set of word +segments, respectively, and contrastive learning is developed to enforce +region-word alignment. To work with a vision-language model, we present a +prompt learning mechanism that derives an extra representation to highlight an +image segment or a word segment of interest, with which more effective features +can be extracted from that segment. Comprehensive experimental results +demonstrate that our method performs favorably against existing text-supervised +semantic segmentation methods on six benchmark datasets.",cs.CV,['cs.CV'] +Detector-Free Structure from Motion,Xingyi He · Jiaming Sun · Yifan Wang · Sida Peng · Qixing Huang · Hujun Bao · Xiaowei Zhou, ,https://arxiv.org/abs/2306.15669,,2306.15669.pdf,Detector-Free Structure from Motion,"We propose a new structure-from-motion framework to recover accurate camera +poses and point clouds from unordered images. Traditional SfM systems typically +rely on the successful detection of repeatable keypoints across multiple views +as the first step, which is difficult for texture-poor scenes, and poor +keypoint detection may break down the whole SfM system. We propose a new +detector-free SfM framework to draw benefits from the recent success of +detector-free matchers to avoid the early determination of keypoints, while +solving the multi-view inconsistency issue of detector-free matchers. +Specifically, our framework first reconstructs a coarse SfM model from +quantized detector-free matches. Then, it refines the model by a novel +iterative refinement pipeline, which iterates between an attention-based +multi-view matching module to refine feature tracks and a geometry refinement +module to improve the reconstruction accuracy. Experiments demonstrate that the +proposed framework outperforms existing detector-based SfM systems on common +benchmark datasets. We also collect a texture-poor SfM dataset to demonstrate +the capability of our framework to reconstruct texture-poor scenes. 
Based on +this framework, we take $\textit{first place}$ in Image Matching Challenge +2023.",cs.CV,['cs.CV'] +Simple Semantic-Aided Few-Shot Learning,Hai Zhang · Junzhe Xu · Shanlin Jiang · Zhenan He,https://github.com/zhangdoudou123/SemFew,https://arxiv.org/abs/2311.18649,,2311.18649.pdf,Simple Semantic-Aided Few-Shot Learning,"Learning from a limited amount of data, namely Few-Shot Learning, stands out +as a challenging computer vision task. Several works exploit semantics and +design complicated semantic fusion mechanisms to compensate for rare +representative features within restricted data. However, relying on naive +semantics such as class names introduces biases due to their brevity, while +acquiring extensive semantics from external knowledge takes a huge time and +effort. This limitation severely constrains the potential of semantics in +Few-Shot Learning. In this paper, we design an automatic way called Semantic +Evolution to generate high-quality semantics. The incorporation of high-quality +semantics alleviates the need for complex network structures and learning +algorithms used in previous works. Hence, we employ a simple two-layer network +termed Semantic Alignment Network to transform semantics and visual features +into robust class prototypes with rich discriminative features for few-shot +classification. The experimental results show our framework outperforms all +previous methods on six benchmarks, demonstrating a simple network with +high-quality semantics can beat intricate multi-modal modules on few-shot +classification tasks. Code is available at +https://github.com/zhangdoudou123/SemFew.",cs.CV,['cs.CV'] +iToF-flow-based High Frame Rate Depth Imaging,Yu Meng · Zhou Xue · Xu Chang · Xuemei Hu · Tao Yue, ,https://arxiv.org/abs/2306.17618,,2306.17618.pdf,Polarimetric iToF: Measuring High-Fidelity Depth through Scattering Media,"Indirect time-of-flight (iToF) imaging allows us to capture dense depth +information at a low cost. However, iToF imaging often suffers from multipath +interference (MPI) artifacts in the presence of scattering media, resulting in +severe depth-accuracy degradation. For instance, iToF cameras cannot measure +depth accurately through fog because ToF active illumination scatters back to +the sensor before reaching the farther target surface. In this work, we propose +a polarimetric iToF imaging method that can capture depth information robustly +through scattering media. Our observations on the principle of indirect ToF +imaging and polarization of light allow us to formulate a novel computational +model of scattering-aware polarimetric phase measurements that enables us to +correct MPI errors. We first devise a scattering-aware polarimetric iToF model +that can estimate the phase of unpolarized backscattered light. We then combine +the optical filtering of polarization and our computational modeling of +unpolarized backscattered light via scattering analysis of phase and amplitude. +This allows us to tackle the MPI problem by estimating the scattering energy +through the participating media. We validate our method on an experimental +setup using a customized off-the-shelf iToF camera. 
Our method outperforms +baseline methods by a significant margin by means of our scattering model and +polarimetric phase measurements.",cs.CV,['cs.CV'] +Perceptual-Oriented Video Frame Interpolation Via Asymmetric Synergistic Blending,Guangyang Wu · Xin Tao · Changlin Li · Wenyi Wang · Xiaohong Liu · Qingqing Zheng, ,https://arxiv.org/abs/2404.06692,,2404.06692.pdf,Perception-Oriented Video Frame Interpolation via Asymmetric Blending,"Previous methods for Video Frame Interpolation (VFI) have encountered +challenges, notably the manifestation of blur and ghosting effects. These +issues can be traced back to two pivotal factors: unavoidable motion errors and +misalignment in supervision. In practice, motion estimates often prove to be +error-prone, resulting in misaligned features. Furthermore, the reconstruction +loss tends to bring blurry results, particularly in misaligned regions. To +mitigate these challenges, we propose a new paradigm called PerVFI +(Perception-oriented Video Frame Interpolation). Our approach incorporates an +Asymmetric Synergistic Blending module (ASB) that utilizes features from both +sides to synergistically blend intermediate features. One reference frame +emphasizes primary content, while the other contributes complementary +information. To impose a stringent constraint on the blending process, we +introduce a self-learned sparse quasi-binary mask which effectively mitigates +ghosting and blur artifacts in the output. Additionally, we employ a +normalizing flow-based generator and utilize the negative log-likelihood loss +to learn the conditional distribution of the output, which further facilitates +the generation of clear and fine details. Experimental results validate the +superiority of PerVFI, demonstrating significant improvements in perceptual +quality compared to existing methods. Codes are available at +\url{https://github.com/mulns/PerVFI}",cs.CV,['cs.CV'] +SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos,Changan Chen · Kumar Ashutosh · Rohit Girdhar · David Harwath · Kristen Grauman,https://vision.cs.utexas.edu/projects/soundingactions/,https://arxiv.org/abs/2404.05206,,,SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos,"We propose a novel self-supervised embedding to learn how actions sound from +narrated in-the-wild egocentric videos. Whereas existing methods rely on +curated data with known audio-visual correspondence, our multimodal +contrastive-consensus coding (MC3) embedding reinforces the associations +between audio, language, and vision when all modality pairs agree, while +diminishing those associations when any one pair does not. We show our approach +can successfully discover how the long tail of human actions sound from +egocentric video, outperforming an array of recent multimodal embedding +techniques on two datasets (Ego4D and EPIC-Sounds) and multiple cross-modal +tasks.",cs.CV,"['cs.CV', 'cs.MM', 'cs.SD', 'eess.AS']" +Dynamic LiDAR Re-simulation using Compositional Neural Fields,Hanfeng Wu · Xingxing Zuo · Stefan Leutenegger · Or Litany · Konrad Schindler · Shengyu Huang, ,https://arxiv.org/abs/2312.05247,,2312.05247.pdf,Dynamic LiDAR Re-simulation using Compositional Neural Fields,"We introduce DyNFL, a novel neural field-based approach for high-fidelity +re-simulation of LiDAR scans in dynamic driving scenes. DyNFL processes LiDAR +measurements from dynamic environments, accompanied by bounding boxes of moving +objects, to construct an editable neural field. 
This field, comprising +separately reconstructed static background and dynamic objects, allows users to +modify viewpoints, adjust object positions, and seamlessly add or remove +objects in the re-simulated scene. A key innovation of our method is the neural +field composition technique, which effectively integrates reconstructed neural +assets from various scenes through a ray drop test, accounting for occlusions +and transparent surfaces. Our evaluation with both synthetic and real-world +environments demonstrates that DyNFL substantially improves dynamic scene LiDAR +simulation, offering a combination of physical fidelity and flexible editing +capabilities.",cs.CV,['cs.CV'] +GSNeRF: Generalizable Semantic Neural Radiance Fields with Enhanced 3D Scene Understanding,Zi-Ting Chou · Sheng-Yu Huang · I-Jieh Liu · Yu-Chiang Frank Wang,https://timchou-ntu.github.io/gsnerf/,https://arxiv.org/abs/2403.03608,,2403.03608.pdf,GSNeRF: Generalizable Semantic Neural Radiance Fields with Enhanced 3D Scene Understanding,"Utilizing multi-view inputs to synthesize novel-view images, Neural Radiance +Fields (NeRF) have emerged as a popular research topic in 3D vision. In this +work, we introduce a Generalizable Semantic Neural Radiance Field (GSNeRF), +which uniquely takes image semantics into the synthesis process so that both +novel view images and the associated semantic maps can be produced for unseen +scenes. Our GSNeRF is composed of two stages: Semantic Geo-Reasoning and +Depth-Guided Visual rendering. The former is able to observe multi-view image +inputs to extract semantic and geometry features from a scene. Guided by the +resulting image geometry information, the latter performs both image and +semantic rendering with improved performances. Our experiments not only confirm +that GSNeRF performs favorably against prior works on both novel-view image and +semantic segmentation synthesis but the effectiveness of our sampling strategy +for visual rendering is further verified.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +MVBench: A Comprehensive Multi-modal Video Understanding Benchmark,Kunchang Li · Yali Wang · Yinan He · Yizhuo Li · Yi Wang · Yi Liu · Zun Wang · Jilan Xu · Guo Chen · Ping Luo · Limin Wang · Yu Qiao,https://github.com/OpenGVLab/Ask-Anything/tree/main/video_chat2,https://arxiv.org/abs/2311.17005,,2311.17005.pdf,MVBench: A Comprehensive Multi-modal Video Understanding Benchmark,"With the rapid development of Multi-modal Large Language Models (MLLMs), a +number of diagnostic benchmarks have recently emerged to evaluate the +comprehension capabilities of these models. However, most benchmarks +predominantly assess spatial understanding in the static image tasks, while +overlooking temporal understanding in the dynamic video tasks. To alleviate +this issue, we introduce a comprehensive Multi-modal Video understanding +Benchmark, namely MVBench, which covers 20 challenging video tasks that cannot +be effectively solved with a single frame. Specifically, we first introduce a +novel static-to-dynamic method to define these temporal-related tasks. By +transforming various static tasks into dynamic ones, we enable the systematic +generation of video tasks that require a broad spectrum of temporal skills, +ranging from perception to cognition. Then, guided by the task definition, we +automatically convert public video annotations into multiple-choice QA to +evaluate each task. On one hand, such a distinct paradigm allows us to build +MVBench efficiently, without much manual intervention. 
On the other hand, it +guarantees evaluation fairness with ground-truth video annotations, avoiding +the biased scoring of LLMs. Moreover, we further develop a robust video MLLM +baseline, i.e., VideoChat2, by progressive multi-modal training with diverse +instruction-tuning data. The extensive results on our MVBench reveal that, the +existing MLLMs are far from satisfactory in temporal understanding, while our +VideoChat2 largely surpasses these leading models by over 15% on MVBench. All +models and data are available at https://github.com/OpenGVLab/Ask-Anything.",cs.CV,['cs.CV'] +Unsupervised Gaze Representation Learning from Multi-view Face Images,Yiwei Bao · Feng Lu, ,https://arxiv.org/abs/2309.04506,,2309.04506.pdf,Unsupervised Gaze-aware Contrastive Learning with Subject-specific Condition,"Appearance-based gaze estimation has shown great promise in many applications +by using a single general-purpose camera as the input device. However, its +success is highly depending on the availability of large-scale well-annotated +gaze datasets, which are sparse and expensive to collect. To alleviate this +challenge we propose ConGaze, a contrastive learning-based framework that +leverages unlabeled facial images to learn generic gaze-aware representations +across subjects in an unsupervised way. Specifically, we introduce the +gaze-specific data augmentation to preserve the gaze-semantic features and +maintain the gaze consistency, which are proven to be crucial for effective +contrastive gaze representation learning. Moreover, we devise a novel +subject-conditional projection module that encourages a share feature extractor +to learn gaze-aware and generic representations. Our experiments on three +public gaze estimation datasets show that ConGaze outperforms existing +unsupervised learning solutions by 6.7% to 22.5%; and achieves 15.1% to 24.6% +improvement over its supervised learning-based counterpart in cross-dataset +evaluations.",cs.CV,['cs.CV'] +DIOD: Self-Distillation Meets Object Discovery,Sandra Kara · Hejer AMMAR · Julien Denize · Florian Chabot · Quoc Cuong PHAM, ,https://arxiv.org/abs/2311.02633,,2311.02633.pdf,The Background Also Matters: Background-Aware Motion-Guided Objects Discovery,"Recent works have shown that objects discovery can largely benefit from the +inherent motion information in video data. However, these methods lack a proper +background processing, resulting in an over-segmentation of the non-object +regions into random segments. This is a critical limitation given the +unsupervised setting, where object segments and noise are not distinguishable. +To address this limitation we propose BMOD, a Background-aware Motion-guided +Objects Discovery method. Concretely, we leverage masks of moving objects +extracted from optical flow and design a learning mechanism to extend them to +the true foreground composed of both moving and static objects. The background, +a complementary concept of the learned foreground class, is then isolated in +the object discovery process. This enables a joint learning of the objects +discovery task and the object/non-object separation. The conducted experiments +on synthetic and real-world datasets show that integrating our background +handling with various cutting-edge methods brings each time a considerable +improvement. 
Specifically, we improve the objects discovery performance with a +large margin, while establishing a strong baseline for object/non-object +separation.",cs.CV,['cs.CV'] +$\textbf{LaRE}^2$: Latent Reconstruction Error Based Method for Diffusion-Generated Image Detection,Yunpeng Luo · Junlong Du · Ke Yan · Shouhong Ding, ,https://arxiv.org/abs/2403.17465,,2403.17465.pdf,LaRE^2: Latent Reconstruction Error Based Method for Diffusion-Generated Image Detection,"The evolution of Diffusion Models has dramatically improved image generation +quality, making it increasingly difficult to differentiate between real and +generated images. This development, while impressive, also raises significant +privacy and security concerns. In response to this, we propose a novel Latent +REconstruction error guided feature REfinement method (LaRE^2) for detecting +the diffusion-generated images. We come up with the Latent Reconstruction Error +(LaRE), the first reconstruction-error based feature in the latent space for +generated image detection. LaRE surpasses existing methods in terms of feature +extraction efficiency while preserving crucial cues required to differentiate +between the real and the fake. To exploit LaRE, we propose an Error-Guided +feature REfinement module (EGRE), which can refine the image feature guided by +LaRE to enhance the discriminativeness of the feature. Our EGRE utilizes an +align-then-refine mechanism, which effectively refines the image feature for +generated-image detection from both spatial and channel perspectives. Extensive +experiments on the large-scale GenImage benchmark demonstrate the superiority +of our LaRE^2, which surpasses the best SoTA method by up to 11.9%/12.1% +average ACC/AP across 8 different image generators. LaRE also surpasses +existing methods in terms of feature extraction cost, delivering an impressive +speed enhancement of 8 times.",cs.CV,"['cs.CV', 'cs.AI']" +MindBridge: A Cross-Subject Brain Decoding Framework,Shizun Wang · Songhua Liu · Zhenxiong Tan · Xinchao Wang,https://littlepure2333.github.io/MindBridge/,https://arxiv.org/abs/2404.07850,,2404.07850.pdf,MindBridge: A Cross-Subject Brain Decoding Framework,"Brain decoding, a pivotal field in neuroscience, aims to reconstruct stimuli +from acquired brain signals, primarily utilizing functional magnetic resonance +imaging (fMRI). Currently, brain decoding is confined to a +per-subject-per-model paradigm, limiting its applicability to the same +individual for whom the decoding model is trained. This constraint stems from +three key challenges: 1) the inherent variability in input dimensions across +subjects due to differences in brain size; 2) the unique intrinsic neural +patterns, influencing how different individuals perceive and process sensory +information; 3) limited data availability for new subjects in real-world +scenarios hampers the performance of decoding models. In this paper, we present +a novel approach, MindBridge, that achieves cross-subject brain decoding by +employing only one model. Our proposed framework establishes a generic paradigm +capable of addressing these challenges by introducing biological-inspired +aggregation function and novel cyclic fMRI reconstruction mechanism for +subject-invariant representation learning. Notably, by cycle reconstruction of +fMRI, MindBridge can enable novel fMRI synthesis, which also can serve as +pseudo data augmentation. Within the framework, we also devise a novel +reset-tuning method for adapting a pretrained model to a new subject. 
+Experimental results demonstrate MindBridge's ability to reconstruct images for +multiple subjects, which is competitive with dedicated subject-specific models. +Furthermore, with limited data for a new subject, we achieve a high level of +decoding accuracy, surpassing that of subject-specific models. This advancement +in cross-subject brain decoding suggests promising directions for wider +applications in neuroscience and indicates potential for more efficient +utilization of limited fMRI data in real-world scenarios. Project page: +https://littlepure2333.github.io/MindBridge",cs.CV,"['cs.CV', 'cs.AI']" +Capturing Closely Interacted Two-Person Motions with Reaction Priors,Qi Fang · Yinghui Fan · Yanjun Li · Junting Dong · Dingwei Wu · Weidong Zhang · Kang Chen, ,https://arxiv.org/abs/2404.05490,,2404.05490.pdf,Two-Person Interaction Augmentation with Skeleton Priors,"Close and continuous interaction with rich contacts is a crucial aspect of +human activities (e.g. hugging, dancing) and of interest in many domains like +activity recognition, motion prediction, character animation, etc. However, +acquiring such skeletal motion is challenging. While direct motion capture is +expensive and slow, motion editing/generation is also non-trivial, as complex +contact patterns with topological and geometric constraints have to be +retained. To this end, we propose a new deep learning method for two-body +skeletal interaction motion augmentation, which can generate variations of +contact-rich interactions with varying body sizes and proportions while +retaining the key geometric/topological relations between two bodies. Our +system can learn effectively from a relatively small amount of data and +generalize to drastically different skeleton sizes. Through exhaustive +evaluation and comparison, we show it can generate high-quality motions, has +strong generalizability and outperforms traditional optimization-based methods +and alternative deep learning solutions.",cs.CV,['cs.CV'] +Text-conditional Attribute Alignment across Latent Spaces for 3D Controllable Face Image Synthesis,FeiFan Xu · Rui Li · Si Wu · Yong Xu · Hau San Wong, ,,https://huggingface.co/papers/2306.17115,,,,,nan +Purified and Unified Steganographic Network,GuoBiao Li · Sheng Li · Zicong Luo · Zhenxing Qian · Xinpeng Zhang,https://github.com/albblgb/PUSNet,https://arxiv.org/abs/2402.17210,,2402.17210.pdf,Purified and Unified Steganographic Network,"Steganography is the art of hiding secret data into the cover media for +covert communication. In recent years, more and more deep neural network +(DNN)-based steganographic schemes are proposed to train steganographic +networks for secret embedding and recovery, which are shown to be promising. +Compared with the handcrafted steganographic tools, steganographic networks +tend to be large in size. It raises concerns on how to imperceptibly and +effectively transmit these networks to the sender and receiver to facilitate +the covert communication. To address this issue, we propose in this paper a +Purified and Unified Steganographic Network (PUSNet). It performs an ordinary +machine learning task in a purified network, which could be triggered into +steganographic networks for secret embedding or recovery using different keys. +We formulate the construction of the PUSNet into a sparse weight filling +problem to flexibly switch between the purified and steganographic networks. 
We +further instantiate our PUSNet as an image denoising network with two +steganographic networks concealed for secret image embedding and recovery. +Comprehensive experiments demonstrate that our PUSNet achieves good performance +on secret image embedding, secret image recovery, and image denoising in a +single architecture. It is also shown to be capable of imperceptibly carrying +the steganographic networks in a purified network. Code is available at +\url{https://github.com/albblgb/PUSNet}",cs.CR,"['cs.CR', 'cs.CV']" +Synergistic Global-space Camera and Human Reconstruction from Videos,Yizhou Zhao · Tuanfeng Y. Wang · Bhiksha Raj · Min Xu · Jimei Yang · Chun-Hao P. Huang,https://paulchhuang.github.io/synchmr/,https://arxiv.org/abs/2405.14855,,2405.14855.pdf,Synergistic Global-space Camera and Human Reconstruction from Videos,"Remarkable strides have been made in reconstructing static scenes or human +bodies from monocular videos. Yet, the two problems have largely been +approached independently, without much synergy. Most visual SLAM methods can +only reconstruct camera trajectories and scene structures up to scale, while +most HMR methods reconstruct human meshes in metric scale but fall short in +reasoning with cameras and scenes. This work introduces Synergistic Camera and +Human Reconstruction (SynCHMR) to marry the best of both worlds. Specifically, +we design Human-aware Metric SLAM to reconstruct metric-scale camera poses and +scene point clouds using camera-frame HMR as a strong prior, addressing depth, +scale, and dynamic ambiguities. Conditioning on the dense scene recovered, we +further learn a Scene-aware SMPL Denoiser to enhance world-frame HMR by +incorporating spatio-temporal coherency and dynamic scene constraints. +Together, they lead to consistent reconstructions of camera trajectories, human +meshes, and dense scene point clouds in a common world frame. Project page: +https://paulchhuang.github.io/synchmr",cs.CV,"['cs.CV', 'cs.AI']" +VRetouchEr: Learning Cross-frame Feature Interdependence with Imperfection Flow for Face Retouching in Videos,Wen Xue · Le Jiang · Lianxin Xie · Si Wu · Yong Xu · Hau San Wong, ,,https://ojs.aaai.org/index.php/AAAI/article/view/28404,,,,,nan +Task-Adaptive Saliency Guidance for Exemplar-free Class Incremental Learning,Xialei Liu · Jiang-Tian Zhai · Andrew Bagdanov · Ke Li · Ming-Ming Cheng, ,,https://www.youtube.com/watch?v=5VfpqIwrbWM,,,,,nan +HIMap: HybrId Representation Learning for End-to-end Vectorized HD Map Construction,Yi ZHOU · Hui Zhang · Jiaqian Yu · yifan yang · Sangil Jung · Seung-In Park · ByungIn Yoo, ,https://arxiv.org/abs/2403.08639,,2403.08639.pdf,HIMap: HybrId Representation Learning for End-to-end Vectorized HD Map Construction,"Vectorized High-Definition (HD) map construction requires predictions of the +category and point coordinates of map elements (e.g. road boundary, lane +divider, pedestrian crossing, etc.). State-of-the-art methods are mainly based +on point-level representation learning for regressing accurate point +coordinates. However, this pipeline has limitations in obtaining element-level +information and handling element-level failures, e.g. erroneous element shape +or entanglement between elements. To tackle the above issues, we propose a +simple yet effective HybrId framework named HIMap to sufficiently learn and +interact both point-level and element-level information. 
Concretely, we +introduce a hybrid representation called HIQuery to represent all map elements, +and propose a point-element interactor to interactively extract and encode the +hybrid information of elements, e.g. point position and element shape, into the +HIQuery. Additionally, we present a point-element consistency constraint to +enhance the consistency between the point-level and element-level information. +Finally, the output point-element integrated HIQuery can be directly converted +into map elements' class, point coordinates, and mask. We conduct extensive +experiments and consistently outperform previous methods on both nuScenes and +Argoverse2 datasets. Notably, our method achieves $77.8$ mAP on the nuScenes +dataset, remarkably superior to previous SOTAs by $8.3$ mAP at least.",cs.CV,['cs.CV'] +Making Vision Transformers Truly Shift-Equivariant,Renan A. Rojas-Gomez · Teck-Yian Lim · Minh Do · Raymond A. Yeh,https://renanrojasg.github.io/shifteq_vit/,,https://www.youtube.com/watch?v=PBNdb93NqiA,,,,,nan +Dynamic Cues-Assisted Transformer for Robust Point Cloud Registration,Hong Chen · Pei Yan · sihe xiang · Yihua Tan, ,https://arxiv.org/abs/2404.14034,,2404.14034.pdf,PointDifformer: Robust Point Cloud Registration With Neural Diffusion and Transformer,"Point cloud registration is a fundamental technique in 3-D computer vision +with applications in graphics, autonomous driving, and robotics. However, +registration tasks under challenging conditions, under which noise or +perturbations are prevalent, can be difficult. We propose a robust point cloud +registration approach that leverages graph neural partial differential +equations (PDEs) and heat kernel signatures. Our method first uses graph neural +PDE modules to extract high dimensional features from point clouds by +aggregating information from the 3-D point neighborhood, thereby enhancing the +robustness of the feature representations. Then, we incorporate heat kernel +signatures into an attention mechanism to efficiently obtain corresponding +keypoints. Finally, a singular value decomposition (SVD) module with learnable +weights is used to predict the transformation between two point clouds. +Empirical experiments on a 3-D point cloud dataset demonstrate that our +approach not only achieves state-of-the-art performance for point cloud +registration but also exhibits better robustness to additive noise or 3-D shape +perturbations.",cs.CV,['cs.CV'] +Generative Multi-modal Models are Good Class Incremental Learners,Xusheng Cao · Haori Lu · Linlan Huang · Xialei Liu · Ming-Ming Cheng, ,https://arxiv.org/abs/2403.18383,,2403.18383.pdf,Generative Multi-modal Models are Good Class-Incremental Learners,"In class-incremental learning (CIL) scenarios, the phenomenon of catastrophic +forgetting caused by the classifier's bias towards the current task has long +posed a significant challenge. It is mainly caused by the characteristic of +discriminative models. With the growing popularity of the generative +multi-modal models, we would explore replacing discriminative models with +generative ones for CIL. However, transitioning from discriminative to +generative models requires addressing two key challenges. The primary challenge +lies in transferring the generated textual information into the classification +of distinct categories. Additionally, it requires formulating the task of CIL +within a generative framework. To this end, we propose a novel generative +multi-modal model (GMM) framework for class-incremental learning. 
Our approach +directly generates labels for images using an adapted generative model. After +obtaining the detailed text, we use a text encoder to extract text features and +employ feature matching to determine the most similar label as the +classification prediction. In the conventional CIL settings, we achieve +significantly better results in long-sequence task scenarios. Under the +Few-shot CIL setting, we have improved by at least 14\% accuracy over all the +current state-of-the-art methods with significantly less forgetting. Our code +is available at \url{https://github.com/DoubleClass/GMM}.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +From Pixels to Graphs: Open-Vocabulary Scene Graph Generation with Vision-Language Models,Rongjie Li · Songyang Zhang · Dahua Lin · Kai Chen · Xuming He, ,https://arxiv.org/abs/2404.00906,,2404.00906.pdf,From Pixels to Graphs: Open-Vocabulary Scene Graph Generation with Vision-Language Models,"Scene graph generation (SGG) aims to parse a visual scene into an +intermediate graph representation for downstream reasoning tasks. Despite +recent advancements, existing methods struggle to generate scene graphs with +novel visual relation concepts. To address this challenge, we introduce a new +open-vocabulary SGG framework based on sequence generation. Our framework +leverages vision-language pre-trained models (VLM) by incorporating an +image-to-graph generation paradigm. Specifically, we generate scene graph +sequences via image-to-text generation with VLM and then construct scene graphs +from these sequences. By doing so, we harness the strong capabilities of VLM +for open-vocabulary SGG and seamlessly integrate explicit relational modeling +for enhancing the VL tasks. Experimental results demonstrate that our design +not only achieves superior performance with an open vocabulary but also +enhances downstream vision-language task performance through explicit relation +modeling knowledge.",cs.CV,['cs.CV'] +Let's Think Outside the Box: Exploring Leap-of-Thought in Large Language Models with Creative Humor Generation,Shanshan Zhong · Zhongzhan Huang · Shanghua Gao · Wushao Wen · Liang Lin · Marinka Zitnik · Pan Zhou,https://zhongshsh.github.io/CLoT/,https://arxiv.org/abs/2312.02439,,2312.02439.pdf,Let's Think Outside the Box: Exploring Leap-of-Thought in Large Language Models with Creative Humor Generation,"Chain-of-Thought (CoT) guides large language models (LLMs) to reason +step-by-step, and can motivate their logical reasoning ability. While effective +for logical tasks, CoT is not conducive to creative problem-solving which often +requires out-of-box thoughts and is crucial for innovation advancements. In +this paper, we explore the Leap-of-Thought (LoT) abilities within LLMs -- a +non-sequential, creative paradigm involving strong associations and knowledge +leaps. To this end, we study LLMs on the popular Oogiri game which needs +participants to have good creativity and strong associative thinking for +responding unexpectedly and humorously to the given image, text, or both, and +thus is suitable for LoT study. Then to investigate LLMs' LoT ability in the +Oogiri game, we first build a multimodal and multilingual Oogiri-GO dataset +which contains over 130,000 samples from the Oogiri game, and observe the +insufficient LoT ability or failures of most existing LLMs on the Oogiri game. +Accordingly, we introduce a creative Leap-of-Thought (CLoT) paradigm to improve +LLM's LoT ability. 
CLoT first formulates the Oogiri-GO dataset into +LoT-oriented instruction tuning data to train pretrained LLM for achieving +certain LoT humor generation and discrimination abilities. Then CLoT designs an +explorative self-refinement that encourages the LLM to generate more creative +LoT data via exploring parallels between seemingly unrelated concepts and +selects high-quality data to train itself for self-refinement. CLoT not only +excels in humor generation in the Oogiri game but also boosts creative +abilities in various tasks like cloud guessing game and divergent association +task. These findings advance our understanding and offer a pathway to improve +LLMs' creative capacities for innovative applications across domains. The +dataset, code, and models will be released online. +https://zhongshsh.github.io/CLoT/.",cs.AI,"['cs.AI', 'cs.CL', 'cs.CV']" +Enhancing Visual Continual Learning with Language-Guided Supervision,Bolin Ni · Hongbo Zhao · Chenghao Zhang · Ke Hu · Gaofeng Meng · Zhaoxiang Zhang · Shiming Xiang, ,https://arxiv.org/abs/2403.16124,,2403.16124.pdf,Enhancing Visual Continual Learning with Language-Guided Supervision,"Continual learning (CL) aims to empower models to learn new tasks without +forgetting previously acquired knowledge. Most prior works concentrate on the +techniques of architectures, replay data, regularization, \etc. However, the +category name of each class is largely neglected. Existing methods commonly +utilize the one-hot labels and randomly initialize the classifier head. We +argue that the scarce semantic information conveyed by the one-hot labels +hampers the effective knowledge transfer across tasks. In this paper, we +revisit the role of the classifier head within the CL paradigm and replace the +classifier with semantic knowledge from pretrained language models (PLMs). +Specifically, we use PLMs to generate semantic targets for each class, which +are frozen and serve as supervision signals during training. Such targets fully +consider the semantic correlation between all classes across tasks. Empirical +studies show that our approach mitigates forgetting by alleviating +representation drifting and facilitating knowledge transfer across tasks. The +proposed method is simple to implement and can seamlessly be plugged into +existing methods with negligible adjustments. Extensive experiments based on +eleven mainstream baselines demonstrate the effectiveness and generalizability +of our approach to various protocols. For example, under the class-incremental +learning setting on ImageNet-100, our method significantly improves the Top-1 +accuracy by 3.2\% to 6.1\% while reducing the forgetting rate by 2.6\% to +13.1\%.",cs.CV,['cs.CV'] +Learned Trajectory Embedding for Subspace Clustering,Yaroslava Lochman · Christopher Zach · Carl Olsson,https://ylochman.github.io/trajectory-embedding,,https://link.springer.com/article/10.1007/s44267-024-00043-0,,,,,nan +Denoising Point Clouds in Latent Space via Graph Convolution and Invertible Neural Network,Aihua Mao · Biao Yan · Zijing Ma · Ying He, ,https://arxiv.org/abs/2401.09721,,2401.09721.pdf,Fast graph-based denoising for point cloud color information,"Point clouds are utilized in various 3D applications such as cross-reality +(XR) and realistic 3D displays. In some applications, e.g., for live streaming +using a 3D point cloud, real-time point cloud denoising methods are required to +enhance the visual quality. 
However, conventional high-precision denoising +methods cannot be executed in real time for large-scale point clouds owing to +the complexity of graph constructions with K nearest neighbors and noise level +estimation. This paper proposes a fast graph-based denoising (FGBD) for a +large-scale point cloud. First, high-speed graph construction is achieved by +scanning a point cloud in various directions and searching adjacent +neighborhoods on the scanning lines. Second, we propose a fast noise level +estimation method using eigenvalues of the covariance matrix on a graph. +Finally, we also propose a new low-cost filter selection method to enhance +denoising accuracy to compensate for the degradation caused by the acceleration +algorithms. In our experiments, we succeeded in reducing the processing time +dramatically while maintaining accuracy relative to conventional denoising +methods. Denoising was performed at 30fps, with frames containing approximately +1 million points.",cs.CV,"['cs.CV', 'eess.IV', 'eess.SP']" +LASO: Language-guided Affordance Segmentation on 3D Object,Yicong Li · Na Zhao · Junbin Xiao · Chun Feng · Xiang Wang · Tat-seng Chua, ,https://arxiv.org/abs/2309.10911,,2309.10911.pdf,Language-Conditioned Affordance-Pose Detection in 3D Point Clouds,"Affordance detection and pose estimation are of great importance in many +robotic applications. Their combination helps the robot gain an enhanced +manipulation capability, in which the generated pose can facilitate the +corresponding affordance task. Previous methods for affodance-pose joint +learning are limited to a predefined set of affordances, thus limiting the +adaptability of robots in real-world environments. In this paper, we propose a +new method for language-conditioned affordance-pose joint learning in 3D point +clouds. Given a 3D point cloud object, our method detects the affordance region +and generates appropriate 6-DoF poses for any unconstrained affordance label. +Our method consists of an open-vocabulary affordance detection branch and a +language-guided diffusion model that generates 6-DoF poses based on the +affordance text. We also introduce a new high-quality dataset for the task of +language-driven affordance-pose joint learning. Intensive experimental results +demonstrate that our proposed method works effectively on a wide range of +open-vocabulary affordances and outperforms other baselines by a large margin. +In addition, we illustrate the usefulness of our method in real-world robotic +applications. Our code and dataset are publicly available at +https://3DAPNet.github.io",cs.RO,['cs.RO'] +MonoCD: Monocular 3D Object Detection with Complementary Depths,Longfei Yan · Pei Yan · Shengzhou Xiong · Xuanyu Xiang · Yihua Tan,https://github.com/elvintanhust/MonoCD,https://arxiv.org/abs/2404.03181v1,,2404.03181v1.pdf,MonoCD: Monocular 3D Object Detection with Complementary Depths,"Monocular 3D object detection has attracted widespread attention due to its +potential to accurately obtain object 3D localization from a single image at a +low cost. Depth estimation is an essential but challenging subtask of monocular +3D object detection due to the ill-posedness of 2D to 3D mapping. Many methods +explore multiple local depth clues such as object heights and keypoints and +then formulate the object depth estimation as an ensemble of multiple depth +predictions to mitigate the insufficiency of single-depth information. 
However, +the errors of existing multiple depths tend to have the same sign, which +hinders them from neutralizing each other and limits the overall accuracy of +combined depth. To alleviate this problem, we propose to increase the +complementarity of depths with two novel designs. First, we add a new depth +prediction branch named complementary depth that utilizes global and efficient +depth clues from the entire image rather than the local clues to reduce the +correlation of depth predictions. Second, we propose to fully exploit the +geometric relations between multiple depth clues to achieve complementarity in +form. Benefiting from these designs, our method achieves higher +complementarity. Experiments on the KITTI benchmark demonstrate that our method +achieves state-of-the-art performance without introducing extra data. In +addition, complementary depth can also be a lightweight and plug-and-play +module to boost multiple existing monocular 3d object detectors. Code is +available at https://github.com/elvintanhust/MonoCD.",cs.CV,['cs.CV'] +All Rivers Run to the Sea: Private Learning with Asymmetric Flows,Yue Niu · Ramy E. Ali · Saurav Prakash · Salman Avestimehr, ,https://arxiv.org/abs/2312.05264,,2312.05264.pdf,All Rivers Run to the Sea: Private Learning with Asymmetric Flows,"Data privacy is of great concern in cloud machine-learning service platforms, +when sensitive data are exposed to service providers. While private computing +environments (e.g., secure enclaves), and cryptographic approaches (e.g., +homomorphic encryption) provide strong privacy protection, their computing +performance still falls short compared to cloud GPUs. To achieve privacy +protection with high computing performance, we propose Delta, a new private +training and inference framework, with comparable model performance as +non-private centralized training. Delta features two asymmetric data flows: the +main information-sensitive flow and the residual flow. The main part flows into +a small model while the residuals are offloaded to a large model. Specifically, +Delta embeds the information-sensitive representations into a low-dimensional +space while pushing the information-insensitive part into high-dimension +residuals. To ensure privacy protection, the low-dimensional +information-sensitive part is secured and fed to a small model in a private +environment. On the other hand, the residual part is sent to fast cloud GPUs, +and processed by a large model. To further enhance privacy and reduce the +communication cost, Delta applies a random binary quantization technique along +with a DP-based technique to the residuals before sharing them with the public +platform. We theoretically show that Delta guarantees differential privacy in +the public environment and greatly reduces the complexity in the private +environment. 
We conduct empirical analyses on CIFAR-10, CIFAR-100 and ImageNet +datasets and ResNet-18 and ResNet-34, showing that Delta achieves strong +privacy protection, fast training, and inference without significantly +compromising the model utility.",cs.CR,"['cs.CR', 'cs.LG']" +PH-Net: Semi-Supervised Breast Lesion Segmentation via Patch-wise Hardness,Siyao Jiang · Huisi Wu · Junyang Chen · Qin Zhang · Jing Qin, ,,https://link.springer.com/article/10.1007/s11517-023-02970-4,,,,,nan +Generalizing 6-DoF Grasp Detection via Domain Prior Knowledge,Haoxiang Ma · Modi Shi · Boyang GAO · Di Huang, ,https://arxiv.org/abs/2404.01727v1,,2404.01727v1.pdf,Generalizing 6-DoF Grasp Detection via Domain Prior Knowledge,"We focus on the generalization ability of the 6-DoF grasp detection method in +this paper. While learning-based grasp detection methods can predict grasp +poses for unseen objects using the grasp distribution learned from the training +set, they often exhibit a significant performance drop when encountering +objects with diverse shapes and structures. To enhance the grasp detection +methods' generalization ability, we incorporate domain prior knowledge of +robotic grasping, enabling better adaptation to objects with significant shape +and structure differences. More specifically, we employ the physical constraint +regularization during the training phase to guide the model towards predicting +grasps that comply with the physical rule on grasping. For the unstable grasp +poses predicted on novel objects, we design a contact-score joint optimization +using the projection contact map to refine these poses in cluttered scenarios. +Extensive experiments conducted on the GraspNet-1billion benchmark demonstrate +a substantial performance gain on the novel object set and the real-world +grasping experiments also demonstrate the effectiveness of our generalizing +6-DoF grasp detection method.",cs.RO,"['cs.RO', 'cs.CV']" +Event Stream-based Visual Object Tracking: A High-Resolution Benchmark Dataset and A Novel Baseline,Xiao Wang · Shiao Wang · Chuanming Tang · Lin Zhu · Bo Jiang · Yonghong Tian · Jin Tang, ,https://arxiv.org/abs/2309.14611,,2309.14611.pdf,Event Stream-based Visual Object Tracking: A High-Resolution Benchmark Dataset and A Novel Baseline,"Tracking using bio-inspired event cameras has drawn more and more attention +in recent years. Existing works either utilize aligned RGB and event data for +accurate tracking or directly learn an event-based tracker. The first category +needs more cost for inference and the second one may be easily influenced by +noisy events or sparse spatial resolution. In this paper, we propose a novel +hierarchical knowledge distillation framework that can fully utilize +multi-modal / multi-view information during training to facilitate knowledge +transfer, enabling us to achieve high-speed and low-latency visual tracking +during testing by using only event signals. Specifically, a teacher +Transformer-based multi-modal tracking framework is first trained by feeding +the RGB frame and event stream simultaneously. Then, we design a new +hierarchical knowledge distillation strategy which includes pairwise +similarity, feature representation, and response maps-based knowledge +distillation to guide the learning of the student Transformer network. +Moreover, since existing event-based tracking datasets are all low-resolution +($346 \times 260$), we propose the first large-scale high-resolution ($1280 +\times 720$) dataset named EventVOT. 
It contains 1141 videos and covers a wide +range of categories such as pedestrians, vehicles, UAVs, ping pongs, etc. +Extensive experiments on both low-resolution (FE240hz, VisEvent, COESOT), and +our newly proposed high-resolution EventVOT dataset fully validated the +effectiveness of our proposed method. The dataset, evaluation toolkit, and +source code are available on +\url{https://github.com/Event-AHU/EventVOT_Benchmark}",cs.CV,"['cs.CV', 'cs.NE']" +An Empirical Study of Scaling Law for Scene Text Recognition,Miao Rang · Zhenni Bi · Chuanjian Liu · Yunhe Wang · Kai Han, ,https://arxiv.org/abs/2401.00028,,2401.00028.pdf,An Empirical Study of Scaling Law for OCR,"The laws of model size, data volume, computation and model performance have +been extensively studied in the field of Natural Language Processing (NLP). +However, the scaling laws in Optical Character Recognition (OCR) have not yet +been investigated. To address this, we conducted comprehensive studies that +involved examining the correlation between performance and the scale of models, +data volume and computation in the field of text recognition.Conclusively, the +study demonstrates smooth power laws between performance and model size, as +well as training data volume, when other influencing factors are held constant. +Additionally, we have constructed a large-scale dataset called REBU-Syn, which +comprises 6 million real samples and 18 million synthetic samples. Based on our +scaling law and new dataset, we have successfully trained a scene text +recognition model, achieving a new state-ofthe-art on 6 common test benchmarks +with a top-1 average accuracy of 97.42%. The models and dataset are publicly +available at https://github.com/large-ocr-model/large-ocr-model.github.io.",cs.CV,['cs.CV'] +Dual-scale Transformer for Large-scale Single-Pixel Imaging,Gang Qu · Ping Wang · Xin Yuan, ,https://arxiv.org/abs/2404.05001,,2404.05001.pdf,Dual-Scale Transformer for Large-Scale Single-Pixel Imaging,"Single-pixel imaging (SPI) is a potential computational imaging technique +which produces image by solving an illposed reconstruction problem from few +measurements captured by a single-pixel detector. Deep learning has achieved +impressive success on SPI reconstruction. However, previous poor reconstruction +performance and impractical imaging model limit its real-world applications. In +this paper, we propose a deep unfolding network with hybrid-attention +Transformer on Kronecker SPI model, dubbed HATNet, to improve the imaging +quality of real SPI cameras. Specifically, we unfold the computation graph of +the iterative shrinkagethresholding algorithm (ISTA) into two alternative +modules: efficient tensor gradient descent and hybrid-attention multiscale +denoising. By virtue of Kronecker SPI, the gradient descent module can avoid +high computational overheads rooted in previous gradient descent modules based +on vectorized SPI. The denoising module is an encoder-decoder architecture +powered by dual-scale spatial attention for high- and low-frequency aggregation +and channel attention for global information recalibration. Moreover, we build +a SPI prototype to verify the effectiveness of the proposed method. Extensive +experiments on synthetic and real data demonstrate that our method achieves the +state-of-the-art performance. 
The source code and pre-trained models are +available at https://github.com/Gang-Qu/HATNet-SPI.",cs.CV,['cs.CV'] +Learning Intra-view and Cross-view Geometric Knowledge for Stereo Matching,Rui Gong · Weide Liu · ZAIWANG GU · Xulei Yang · Jun Cheng, ,https://arxiv.org/abs/2402.19270,,2402.19270.pdf,Learning Intra-view and Cross-view Geometric Knowledge for Stereo Matching,"Geometric knowledge has been shown to be beneficial for the stereo matching +task. However, prior attempts to integrate geometric insights into stereo +matching algorithms have largely focused on geometric knowledge from single +images while crucial cross-view factors such as occlusion and matching +uniqueness have been overlooked. To address this gap, we propose a novel +Intra-view and Cross-view Geometric knowledge learning Network (ICGNet), +specifically crafted to assimilate both intra-view and cross-view geometric +knowledge. ICGNet harnesses the power of interest points to serve as a channel +for intra-view geometric understanding. Simultaneously, it employs the +correspondences among these points to capture cross-view geometric +relationships. This dual incorporation empowers the proposed ICGNet to leverage +both intra-view and cross-view geometric knowledge in its learning process, +substantially improving its ability to estimate disparities. Our extensive +experiments demonstrate the superiority of the ICGNet over contemporary leading +models.",cs.CV,['cs.CV'] +ParameterNet: Parameters Are All You Need for Large-scale Visual Pretraining of Mobile Networks,Kai Han · Yunhe Wang · Jianyuan Guo · Enhua Wu,https://parameternet.github.io/,https://arxiv.org/abs/2306.14525,,2306.14525.pdf,ParameterNet: Parameters Are All You Need,"The large-scale visual pretraining has significantly improve the performance +of large vision models. However, we observe the \emph{low FLOPs pitfall} that +the existing low-FLOPs models cannot benefit from large-scale pretraining. In +this paper, we introduce a novel design principle, termed ParameterNet, aimed +at augmenting the number of parameters in large-scale visual pretraining models +while minimizing the increase in FLOPs. We leverage dynamic convolutions to +incorporate additional parameters into the networks with only a marginal rise +in FLOPs. The ParameterNet approach allows low-FLOPs networks to take advantage +of large-scale visual pretraining. Furthermore, we extend the ParameterNet +concept to the language domain to enhance inference results while preserving +inference speed. Experiments on the large-scale ImageNet-22K have shown the +superiority of our ParameterNet scheme. For example, ParameterNet-600M can +achieve higher accuracy on ImageNet than the widely-used Swin Transformer +(81.6\% \emph{vs.} 80.9\%) and has much lower FLOPs (0.6G \emph{vs.} 4.5G). In +the language domain, LLaMA-1B enhanced with ParameterNet achieves 2\% higher +accuracy over vanilla LLaMA. 
The code will be released at +\url{https://parameternet.github.io/}.",cs.CV,['cs.CV'] +Relational Matching for Weakly Semi-Supervised Oriented Object Detection,Wenhao Wu · Hau San Wong · Si Wu · Tianyou Zhang, ,,https://paperswithcode.com/paper/weakly-semi-supervised-object-detection-in,,,,,nan +A2XP: Towards Private Domain Generalization,Geunhyeok Yu · Hyoseok Hwang,https://airlabkhu.github.io/A2XP/,https://arxiv.org/abs/2311.10339,,2311.10339.pdf,A2XP: Towards Private Domain Generalization,"Deep Neural Networks (DNNs) have become pivotal in various fields, especially +in computer vision, outperforming previous methodologies. A critical challenge +in their deployment is the bias inherent in data across different domains, such +as image style and environmental conditions, leading to domain gaps. This +necessitates techniques for learning general representations from biased +training data, known as domain generalization. This paper presents Attend to +eXpert Prompts (A2XP), a novel approach for domain generalization that +preserves the privacy and integrity of the network architecture. A2XP consists +of two phases: Expert Adaptation and Domain Generalization. In the first phase, +prompts for each source domain are optimized to guide the model towards the +optimal direction. In the second phase, two embedder networks are trained to +effectively amalgamate these expert prompts, aiming for an optimal output. Our +extensive experiments demonstrate that A2XP achieves state-of-the-art results +over existing non-private domain generalization methods. The experimental +results validate that the proposed approach not only tackles the domain +generalization challenge in DNNs but also offers a privacy-preserving, +efficient solution to the broader field of computer vision.",cs.CV,['cs.CV'] +Insect-Foundation: A Foundation Model and Large-scale 1M Dataset for Visual Insect Understanding,Hoang-Quan Nguyen · Thanh-Dat Truong · Xuan-Bac Nguyen · Ashley Dowling · Xin Li · Khoa Luu,https://uark-cviu.github.io/projects/insect_foundation.html,https://arxiv.org/abs/2311.15206,,2311.15206.pdf,Insect-Foundation: A Foundation Model and Large-scale 1M Dataset for Visual Insect Understanding,"In precision agriculture, the detection and recognition of insects play an +essential role in the ability of crops to grow healthy and produce a +high-quality yield. The current machine vision model requires a large volume of +data to achieve high performance. However, there are approximately 5.5 million +different insect species in the world. None of the existing insect datasets can +cover even a fraction of them due to varying geographic locations and +acquisition costs. In this paper, we introduce a novel ""Insect-1M"" dataset, a +game-changing resource poised to revolutionize insect-related foundation model +training. Covering a vast spectrum of insect species, our dataset, including 1 +million images with dense identification labels of taxonomy hierarchy and +insect descriptions, offers a panoramic view of entomology, enabling foundation +models to comprehend visual and semantic information about insects like never +before. Then, to efficiently establish an Insect Foundation Model, we develop a +micro-feature self-supervised learning method with a Patch-wise Relevant +Attention mechanism capable of discerning the subtle differences among insect +images. In addition, we introduce Description Consistency loss to improve +micro-feature modeling via insect descriptions. 
Through our experiments, we +illustrate the effectiveness of our proposed approach in insect modeling and +achieve State-of-the-Art performance on standard benchmarks of insect-related +tasks. Our Insect Foundation Model and Dataset promise to empower the next +generation of insect-related vision models, bringing them closer to the +ultimate goal of precision agriculture.",cs.CV,['cs.CV'] +PostureHMR: Posture Transformation for 3D Human Mesh Recovery,Yu-Pei Song · Xiao WU · Zhaoquan Yuan · Jian-Jun Qiao · Qiang Peng, ,https://arxiv.org/abs/2403.12473,,2403.12473.pdf,PostoMETRO: Pose Token Enhanced Mesh Transformer for Robust 3D Human Mesh Recovery,"With the recent advancements in single-image-based human mesh recovery, there +is a growing interest in enhancing its performance in certain extreme +scenarios, such as occlusion, while maintaining overall model accuracy. +Although obtaining accurately annotated 3D human poses under occlusion is +challenging, there is still a wealth of rich and precise 2D pose annotations +that can be leveraged. However, existing works mostly focus on directly +leveraging 2D pose coordinates to estimate 3D pose and mesh. In this paper, we +present PostoMETRO($\textbf{Pos}$e $\textbf{to}$ken enhanced $\textbf{ME}$sh +$\textbf{TR}$ansf$\textbf{O}$rmer), which integrates occlusion-resilient 2D +pose representation into transformers in a token-wise manner. Utilizing a +specialized pose tokenizer, we efficiently condense 2D pose data to a compact +sequence of pose tokens and feed them to the transformer together with the +image tokens. This process not only ensures a rich depiction of texture from +the image but also fosters a robust integration of pose and image information. +Subsequently, these combined tokens are queried by vertex and joint tokens to +decode 3D coordinates of mesh vertices and human joints. Facilitated by the +robust pose token representation and the effective combination, we are able to +produce more precise 3D coordinates, even under extreme scenarios like +occlusion. Experiments on both standard and occlusion-specific benchmarks +demonstrate the effectiveness of PostoMETRO. Qualitative results further +illustrate the clarity of how 2D pose can help 3D reconstruction. Code will be +made available.",cs.CV,['cs.CV'] +InstantBooth: Personalized Text-to-Image Generation without Test-Time Finetuning,Jing Shi · Wei Xiong · Zhe Lin · HyunJoon Jung, ,https://arxiv.org/html/2403.11284v1,,2403.11284v1.pdf,Fast Personalized Text-to-Image Syntheses With Attention Injection,"Currently, personalized image generation methods mostly require considerable +time to finetune and often overfit the concept resulting in generated images +that are similar to custom concepts but difficult to edit by prompts. We +propose an effective and fast approach that could balance the text-image +consistency and identity consistency of the generated image and reference +image. Our method can generate personalized images without any fine-tuning +while maintaining the inherent text-to-image generation ability of diffusion +models. Given a prompt and a reference image, we merge the custom concept into +generated images by manipulating cross-attention and self-attention layers of +the original diffusion model to generate personalized images that match the +text description. 
Comprehensive experiments highlight the superiority of our +method.",cs.CV,['cs.CV'] +Exact Fusion via Feature Distribution Matching for Few-shot Image Generation,Yingbo Zhou · Yutong Ye · Pengyu Zhang · Xian Wei · Mingsong Chen, ,https://arxiv.org/abs/2307.14638v1,,2307.14638v1.pdf,EqGAN: Feature Equalization Fusion for Few-shot Image Generation,"Due to the absence of fine structure and texture information, existing +fusion-based few-shot image generation methods suffer from unsatisfactory +generation quality and diversity. To address this problem, we propose a novel +feature Equalization fusion Generative Adversarial Network (EqGAN) for few-shot +image generation. Unlike existing fusion strategies that rely on either deep +features or local representations, we design two separate branches to fuse +structures and textures by disentangling encoded features into shallow and deep +contents. To refine image contents at all feature levels, we equalize the fused +structure and texture semantics at different scales and supplement the decoder +with richer information by skip connections. Since the fused structures and +textures may be inconsistent with each other, we devise a consistent +equalization loss between the equalized features and the intermediate output of +the decoder to further align the semantics. Comprehensive experiments on three +public datasets demonstrate that, EqGAN not only significantly improves +generation performance with FID score (by up to 32.7%) and LPIPS score (by up +to 4.19%), but also outperforms the state-of-the-arts in terms of accuracy (by +up to 1.97%) for downstream classification tasks.",cs.CV,['cs.CV'] +Data Poisoning based Backdoor Attacks to Contrastive Learning,Jinghuai Zhang · Hongbin Liu · Jinyuan Jia · Neil Zhenqiang Gong,https://github.com/jzhang538/CorruptEncoder,,,,,,,nan +GenesisTex: Adapting Image Denoising Diffusion to Texture Space,Chenjian Gao · Boyan Jiang · Xinghui Li · YingPeng Zhang · Qian Yu,https://cjeen.github.io/GenesisTexPaper/,https://arxiv.org/abs/2403.17782,,2403.17782.pdf,GenesisTex: Adapting Image Denoising Diffusion to Texture Space,"We present GenesisTex, a novel method for synthesizing textures for 3D +geometries from text descriptions. GenesisTex adapts the pretrained image +diffusion model to texture space by texture space sampling. Specifically, we +maintain a latent texture map for each viewpoint, which is updated with +predicted noise on the rendering of the corresponding viewpoint. The sampled +latent texture maps are then decoded into a final texture map. During the +sampling process, we focus on both global and local consistency across multiple +viewpoints: global consistency is achieved through the integration of style +consistency mechanisms within the noise prediction network, and low-level +consistency is achieved by dynamically aligning latent textures. Finally, we +apply reference-based inpainting and img2img on denser views for texture +refinement. Our approach overcomes the limitations of slow optimization in +distillation-based methods and instability in inpainting-based methods. 
+Experiments on meshes from various sources demonstrate that our method +surpasses the baseline methods quantitatively and qualitatively.",cs.CV,"['cs.CV', 'cs.GR']" +On Scaling up a Multilingual Vision and Language Model,Xi Chen · Josip Djolonga · Piotr Padlewski · Basil Mustafa · Soravit Changpinyo · Jialin Wu · Carlos Riquelme Ruiz · Sebastian Goodman · Xiao Wang · Yi Tay · Siamak Shakeri · Mostafa Dehghani · Daniel Salz · Mario Lučić · Michael Tschannen · Arsha Nagrani · Hexiang Hu · Mandar Joshi · Bo Pang · Ceslee Montgomery · Paulina Pietrzyk · Marvin Ritter · AJ Piergiovanni · Matthias Minderer · Filip Pavetic · Austin Waters · Gang Li · Ibrahim Alabdulmohsin · Lucas Beyer · Julien Amelot · Kenton Lee · Andreas Steiner · Yang Li · Daniel Keysers · Anurag Arnab · Yuanzhong Xu · Keran Rong · Alexander Kolesnikov · Mojtaba Seyedhosseini · Anelia Angelova · Xiaohua Zhai · Neil Houlsby · Radu Soricut, ,https://ar5iv.labs.arxiv.org/html/2312.07533,,2312.07533.pdf,VILA: On Pre-training for Visual Language Models,"Visual language models (VLMs) rapidly progressed with the recent success of +large language models. There have been growing efforts on visual instruction +tuning to extend the LLM with visual inputs, but lacks an in-depth study of the +visual language pre-training process, where the model learns to perform joint +modeling on both modalities. In this work, we examine the design options for +VLM pre-training by augmenting LLM towards VLM through step-by-step +controllable comparisons. We introduce three main findings: (1) freezing LLMs +during pre-training can achieve decent zero-shot performance, but lack +in-context learning capability, which requires unfreezing the LLM; (2) +interleaved pre-training data is beneficial whereas image-text pairs alone are +not optimal; (3) re-blending text-only instruction data to image-text data +during instruction fine-tuning not only remedies the degradation of text-only +tasks, but also boosts VLM task accuracy. With an enhanced pre-training recipe +we build VILA, a Visual Language model family that consistently outperforms the +state-of-the-art models, e.g., LLaVA-1.5, across main benchmarks without bells +and whistles. Multi-modal pre-training also helps unveil appealing properties +of VILA, including multi-image reasoning, enhanced in-context learning, and +better world knowledge.",cs.CV,['cs.CV'] +$V_kD:$ Improving knowledge distillation using orthogonal projections,Roy Miles · Ismail Elezi · Jiankang Deng, ,https://arxiv.org/abs/2403.06213,,2403.06213.pdf,$V_kD:$ Improving Knowledge Distillation using Orthogonal Projections,"Knowledge distillation is an effective method for training small and +efficient deep learning models. However, the efficacy of a single method can +degenerate when transferring to other tasks, modalities, or even other +architectures. To address this limitation, we propose a novel constrained +feature distillation method. This method is derived from a small set of core +principles, which results in two emerging components: an orthogonal projection +and a task-specific normalisation. Equipped with both of these components, our +transformer models can outperform all previous methods on ImageNet and reach up +to a 4.4% relative improvement over the previous state-of-the-art methods. To +further demonstrate the generality of our method, we apply it to object +detection and image generation, whereby we obtain consistent and substantial +performance improvements over state-of-the-art. 
Code and models are publicly +available: https://github.com/roymiles/vkd",cs.CV,"['cs.CV', 'cs.AI']" +Towards Modern Image Manipulation Localization: A Large-Scale Dataset and Novel Methods,Chenfan Qu · Yiwu Zhong · Chongyu Liu · Guitao Xu · Dezhi Peng · Fengjun Guo · Lianwen Jin, ,https://arxiv.org/abs/2309.01858,,2309.01858.pdf,Towards Universal Image Embeddings: A Large-Scale Dataset and Challenge for Generic Image Representations,"Fine-grained and instance-level recognition methods are commonly trained and +evaluated on specific domains, in a model per domain scenario. Such an +approach, however, is impractical in real large-scale applications. In this +work, we address the problem of universal image embedding, where a single +universal model is trained and used in multiple domains. First, we leverage +existing domain-specific datasets to carefully construct a new large-scale +public benchmark for the evaluation of universal image embeddings, with 241k +query images, 1.4M index images and 2.8M training images across 8 different +domains and 349k classes. We define suitable metrics, training and evaluation +protocols to foster future research in this area. Second, we provide a +comprehensive experimental evaluation on the new dataset, demonstrating that +existing approaches and simplistic extensions lead to worse performance than an +assembly of models trained for each domain separately. Finally, we conducted a +public research competition on this topic, leveraging industrial datasets, +which attracted the participation of more than 1k teams worldwide. This +exercise generated many interesting research ideas and findings which we +present in detail. Project webpage: https://cmp.felk.cvut.cz/univ_emb/",cs.CV,['cs.CV'] +Permutation Equivariance of Transformers and Its Applications,Hengyuan Xu · Liyao Xiang · Hangyu Ye · Dixi Yao · Pengzhi Chu · Baochun Li,https://github.com/Doby-Xu/ST,https://arxiv.org/abs/2403.05842,,2403.05842.pdf,Hufu: A Modality-Agnositc Watermarking System for Pre-Trained Transformers via Permutation Equivariance,"With the blossom of deep learning models and services, it has become an +imperative concern to safeguard the valuable model parameters from being +stolen. Watermarking is considered an important tool for ownership +verification. However, current watermarking schemes are customized for +different models and tasks, hard to be integrated as an integrated intellectual +protection service. We propose Hufu, a modality-agnostic watermarking system +for pre-trained Transformer-based models, relying on the permutation +equivariance property of Transformers. Hufu embeds watermark by fine-tuning the +pre-trained model on a set of data samples specifically permuted, and the +embedded model essentially contains two sets of weights -- one for normal use +and the other for watermark extraction which is triggered on permuted inputs. +The permutation equivariance ensures minimal interference between these two +sets of model weights and thus high fidelity on downstream tasks. Since our +method only depends on the model itself, it is naturally modality-agnostic, +task-independent, and trigger-sample-free. 
Extensive experiments on the +state-of-the-art vision Transformers, BERT, and GPT2 have demonstrated Hufu's +superiority in meeting watermarking requirements including effectiveness, +efficiency, fidelity, and robustness, showing its great potential to be +deployed as a uniform ownership verification service for various Transformers.",cs.CR,"['cs.CR', 'cs.AI']" +CLOAF: CoLlisiOn-Aware Human Flow,Andrey Davydov · Martin Engilberge · Mathieu Salzmann · Pascal Fua,https://arxiv.org/abs/2403.09050,https://arxiv.org/abs/2403.09050,,2403.09050.pdf,CLOAF: CoLlisiOn-Aware Human Flow,"Even the best current algorithms for estimating body 3D shape and pose yield +results that include body self-intersections. In this paper, we present CLOAF, +which exploits the diffeomorphic nature of Ordinary Differential Equations to +eliminate such self-intersections while still imposing body shape constraints. +We show that, unlike earlier approaches to addressing this issue, ours +completely eliminates the self-intersections without compromising the accuracy +of the reconstructions. Being differentiable, CLOAF can be used to fine-tune +pose and shape estimation baselines to improve their overall performance and +eliminate self-intersections in their predictions. Furthermore, we demonstrate +how our CLOAF strategy can be applied to practically any motion field induced +by the user. CLOAF also makes it possible to edit motion to interact with the +environment without worrying about potential collision or loss of body-shape +prior.",cs.CV,['cs.CV'] +A Physics-informed Low-rank Deep Neural Network for Blind and Universal Lens Aberration Correction,Jin Gong · Runzhao Yang · Weihang Zhang · Jinli Suo · Qionghai Dai, ,https://arxiv.org/abs/2310.09528,,2310.09528.pdf,Hypernetwork-based Meta-Learning for Low-Rank Physics-Informed Neural Networks,"In various engineering and applied science applications, repetitive numerical +simulations of partial differential equations (PDEs) for varying input +parameters are often required (e.g., aircraft shape optimization over many +design parameters) and solvers are required to perform rapid execution. In this +study, we suggest a path that potentially opens up a possibility for +physics-informed neural networks (PINNs), emerging deep-learning-based solvers, +to be considered as one such solver. Although PINNs have pioneered a proper +integration of deep-learning and scientific computing, they require repetitive +time-consuming training of neural networks, which is not suitable for +many-query scenarios. To address this issue, we propose a lightweight low-rank +PINNs containing only hundreds of model parameters and an associated +hypernetwork-based meta-learning algorithm, which allows efficient +approximation of solutions of PDEs for varying ranges of PDE input parameters. +Moreover, we show that the proposed method is effective in overcoming a +challenging issue, known as ""failure modes"" of PINNs.",cs.LG,"['cs.LG', 'cs.NA', 'math.NA', 'physics.comp-ph']" +Pre-training Vision Models with Mandelbulb Variations,Benjamin N. Chiche · Yuto Horikawa · Ryo Fujita, ,https://arxiv.org/abs/2403.03346,,2403.03346.pdf,Enhancing Vision-Language Pre-training with Rich Supervisions,"We propose Strongly Supervised pre-training with ScreenShots (S4) - a novel +pre-training paradigm for Vision-Language Models using data from large-scale +web screenshot rendering. Using web screenshots unlocks a treasure trove of +visual and textual cues that are not present in using image-text pairs. 
In S4, +we leverage the inherent tree-structured hierarchy of HTML elements and the +spatial localization to carefully design 10 pre-training tasks with large scale +annotated data. These tasks resemble downstream tasks across different domains +and the annotations are cheap to obtain. We demonstrate that, compared to +current screenshot pre-training objectives, our innovative pre-training method +significantly enhances performance of image-to-text model in nine varied and +popular downstream tasks - up to 76.1% improvements on Table Detection, and at +least 1% on Widget Captioning.",cs.CV,['cs.CV'] +Ranni: Taming Text-to-Image Diffusion for Accurate Instruction Following,Yutong Feng · Biao Gong · Di Chen · Yujun Shen · Yu Liu · Jingren Zhou, ,https://arxiv.org/abs/2311.17002,,2311.17002.pdf,Ranni: Taming Text-to-Image Diffusion for Accurate Instruction Following,"Existing text-to-image (T2I) diffusion models usually struggle in +interpreting complex prompts, especially those with quantity, object-attribute +binding, and multi-subject descriptions. In this work, we introduce a semantic +panel as the middleware in decoding texts to images, supporting the generator +to better follow instructions. The panel is obtained through arranging the +visual concepts parsed from the input text by the aid of large language models, +and then injected into the denoising network as a detailed control signal to +complement the text condition. To facilitate text-to-panel learning, we come up +with a carefully designed semantic formatting protocol, accompanied by a +fully-automatic data preparation pipeline. Thanks to such a design, our +approach, which we call Ranni, manages to enhance a pre-trained T2I generator +regarding its textual controllability. More importantly, the introduction of +the generative middleware brings a more convenient form of interaction (i.e., +directly adjusting the elements in the panel or using language instructions) +and further allows users to finely customize their generation, based on which +we develop a practical system and showcase its potential in continuous +generation and chatting-based editing. Our project page is at +https://ranni-t2i.github.io/Ranni.",cs.CV,['cs.CV'] +Defense without Forgetting: Continual Adversarial Defense with Anisotropic & Isotropic Pseudo Replay,Yuhang Zhou · Zhongyun Hua, ,https://arxiv.org/abs/2404.01828,,2404.01828.pdf,Defense without Forgetting: Continual Adversarial Defense with Anisotropic & Isotropic Pseudo Replay,"Deep neural networks have demonstrated susceptibility to adversarial attacks. +Adversarial defense techniques often focus on one-shot setting to maintain +robustness against attack. However, new attacks can emerge in sequences in +real-world deployment scenarios. As a result, it is crucial for a defense model +to constantly adapt to new attacks, but the adaptation process can lead to +catastrophic forgetting of previously defended against attacks. In this paper, +we discuss for the first time the concept of continual adversarial defense +under a sequence of attacks, and propose a lifelong defense baseline called +Anisotropic \& Isotropic Replay (AIR), which offers three advantages: (1) +Isotropic replay ensures model consistency in the neighborhood distribution of +new data, indirectly aligning the output preference between old and new tasks. +(2) Anisotropic replay enables the model to learn a compromise data manifold +with fresh mixed semantics for further replay constraints and potential future +attacks. 
(3) A straightforward regularizer mitigates the 'plasticity-stability' +trade-off by aligning model output between new and old tasks. Experiment +results demonstrate that AIR can approximate or even exceed the empirical +performance upper bounds achieved by Joint Training.",cs.LG,"['cs.LG', 'cs.AI']" +CDFormer: When Degradation Prediction Embraces Diffusion Model for Blind Image Super-Resolution,Qingguo Liu · Chenyi Zhuang · Pan Gao · Jie Qin, ,https://arxiv.org/abs/2405.07648,,2405.07648.pdf,CDFormer:When Degradation Prediction Embraces Diffusion Model for Blind Image Super-Resolution,"Existing Blind image Super-Resolution (BSR) methods focus on estimating +either kernel or degradation information, but have long overlooked the +essential content details. In this paper, we propose a novel BSR approach, +Content-aware Degradation-driven Transformer (CDFormer), to capture both +degradation and content representations. However, low-resolution images cannot +provide enough content details, and thus we introduce a diffusion-based module +$CDFormer_{diff}$ to first learn Content Degradation Prior (CDP) in both low- +and high-resolution images, and then approximate the real distribution given +only low-resolution information. Moreover, we apply an adaptive SR network +$CDFormer_{SR}$ that effectively utilizes CDP to refine features. Compared to +previous diffusion-based SR methods, we treat the diffusion model as an +estimator that can overcome the limitations of expensive sampling time and +excessive diversity. Experiments show that CDFormer can outperform existing +methods, establishing a new state-of-the-art performance on various benchmarks +under blind settings. Codes and models will be available at +\href{https://github.com/I2-Multimedia-Lab/CDFormer}{https://github.com/I2-Multimedia-Lab/CDFormer}.",cs.CV,"['cs.CV', 'eess.IV']" +Modality-Agnostic Structural Image Representation Learning for Deformable Multi-Modality Medical Image Registration,Tony C. W. MOK · Zi Li · Yunhao Bai · Jianpeng Zhang · Wei Liu · Yan-Jie Zhou · Ke Yan · Dakai Jin · Yu Shi · Xiaoli Yin · Le Lu · Ling Zhang, ,https://arxiv.org/abs/2402.18933,,2402.18933.pdf,Modality-Agnostic Structural Image Representation Learning for Deformable Multi-Modality Medical Image Registration,"Establishing dense anatomical correspondence across distinct imaging +modalities is a foundational yet challenging procedure for numerous medical +image analysis studies and image-guided radiotherapy. Existing multi-modality +image registration algorithms rely on statistical-based similarity measures or +local structural image representations. However, the former is sensitive to +locally varying noise, while the latter is not discriminative enough to cope +with complex anatomical structures in multimodal scans, causing ambiguity in +determining the anatomical correspondence across scans with different +modalities. In this paper, we propose a modality-agnostic structural +representation learning method, which leverages Deep Neighbourhood +Self-similarity (DNS) and anatomy-aware contrastive learning to learn +discriminative and contrast-invariance deep structural image representations +(DSIR) without the need for anatomical delineations or pre-aligned training +images. We evaluate our method on multiphase CT, abdomen MR-CT, and brain MR +T1w-T2w registration. 
Comprehensive results demonstrate that our method is +superior to the conventional local structural representation and +statistical-based similarity measures in terms of discriminability and +accuracy.",cs.CV,['cs.CV'] +Towards Accurate and Robust Architectures via Neural Architecture Search,Yuwei Ou · Yuqi Feng · Yanan Sun, ,https://arxiv.org/abs/2405.05502,,2405.05502.pdf,Towards Accurate and Robust Architectures via Neural Architecture Search,"To defend deep neural networks from adversarial attacks, adversarial training +has been drawing increasing attention for its effectiveness. However, the +accuracy and robustness resulting from the adversarial training are limited by +the architecture, because adversarial training improves accuracy and robustness +by adjusting the weight connection affiliated to the architecture. In this +work, we propose ARNAS to search for accurate and robust architectures for +adversarial training. First we design an accurate and robust search space, in +which the placement of the cells and the proportional relationship of the +filter numbers are carefully determined. With the design, the architectures can +obtain both accuracy and robustness by deploying accurate and robust structures +to their sensitive positions, respectively. Then we propose a differentiable +multi-objective search strategy, performing gradient descent towards directions +that are beneficial for both natural loss and adversarial loss, thus the +accuracy and robustness can be guaranteed at the same time. We conduct +comprehensive experiments in terms of white-box attacks, black-box attacks, and +transferability. Experimental results show that the searched architecture has +the strongest robustness with the competitive accuracy, and breaks the +traditional idea that NAS-based architectures cannot transfer well to complex +tasks in robustness scenarios. By analyzing outstanding architectures searched, +we also conclude that accurate and robust neural architectures tend to deploy +different structures near the input and output, which has great practical +significance on both hand-crafting and automatically designing of accurate and +robust architectures.",cs.CV,"['cs.CV', 'cs.CR', 'cs.LG']" +Fast Adaptation for Human Pose Estimation via Meta-Optimization,Shengxiang Hu · Huaijiang Sun · Bin Li · Dong Wei · Weiqing Li · Jianfeng Lu, ,https://arxiv.org/abs/2405.05216,,2405.05216.pdf,FinePOSE: Fine-Grained Prompt-Driven 3D Human Pose Estimation via Diffusion Models,"The 3D Human Pose Estimation (3D HPE) task uses 2D images or videos to +predict human joint coordinates in 3D space. Despite recent advancements in +deep learning-based methods, they mostly ignore the capability of coupling +accessible texts and naturally feasible knowledge of humans, missing out on +valuable implicit supervision to guide the 3D HPE task. Moreover, previous +efforts often study this task from the perspective of the whole human body, +neglecting fine-grained guidance hidden in different body parts. To this end, +we present a new Fine-Grained Prompt-Driven Denoiser based on a diffusion model +for 3D HPE, named \textbf{FinePOSE}. It consists of three core blocks enhancing +the reverse process of the diffusion model: (1) Fine-grained Part-aware Prompt +learning (FPP) block constructs fine-grained part-aware prompts via coupling +accessible texts and naturally feasible knowledge of body parts with learnable +prompts to model implicit guidance. 
(2) Fine-grained Prompt-pose Communication +(FPC) block establishes fine-grained communications between learned part-aware +prompts and poses to improve the denoising quality. (3) Prompt-driven Timestamp +Stylization (PTS) block integrates learned prompt embedding and temporal +information related to the noise level to enable adaptive adjustment at each +denoising step. Extensive experiments on public single-human pose estimation +datasets show that FinePOSE outperforms state-of-the-art methods. We further +extend FinePOSE to multi-human pose estimation. Achieving 34.3mm average MPJPE +on the EgoHumans dataset demonstrates the potential of FinePOSE to deal with +complex multi-human scenarios. Code is available at +https://github.com/PKU-ICST-MIPL/FinePOSE_CVPR2024.",cs.CV,['cs.CV'] +PrPSeg: Universal Proposition Learning for Panoramic Renal Pathology Segmentation,Ruining Deng · Quan Liu · Can Cui · Tianyuan Yao · Jialin Yue · Juming Xiong · Lining yu · Yifei Wu · Mengmeng Yin · Yu Wang · Shilin Zhao · Yucheng Tang · Haichun Yang · Yuankai Huo, ,https://arxiv.org/abs/2402.19286,,2402.19286.pdf,PrPSeg: Universal Proposition Learning for Panoramic Renal Pathology Segmentation,"Understanding the anatomy of renal pathology is crucial for advancing disease +diagnostics, treatment evaluation, and clinical research. The complex kidney +system comprises various components across multiple levels, including regions +(cortex, medulla), functional units (glomeruli, tubules), and cells (podocytes, +mesangial cells in glomerulus). Prior studies have predominantly overlooked the +intricate spatial interrelations among objects from clinical knowledge. In this +research, we introduce a novel universal proposition learning approach, called +panoramic renal pathology segmentation (PrPSeg), designed to segment +comprehensively panoramic structures within kidney by integrating extensive +knowledge of kidney anatomy. + In this paper, we propose (1) the design of a comprehensive universal +proposition matrix for renal pathology, facilitating the incorporation of +classification and spatial relationships into the segmentation process; (2) a +token-based dynamic head single network architecture, with the improvement of +the partial label image segmentation and capability for future data +enlargement; and (3) an anatomy loss function, quantifying the inter-object +relationships across the kidney.",eess.IV,"['eess.IV', 'cs.CV']" +Analyzing and Improving the Training Dynamics of Diffusion Models,Tero Karras · Miika Aittala · Jaakko Lehtinen · Janne Hellsten · Timo Aila · Samuli Laine, ,https://arxiv.org/abs/2312.02696,,2312.02696.pdf,Analyzing and Improving the Training Dynamics of Diffusion Models,"Diffusion models currently dominate the field of data-driven image synthesis +with their unparalleled scaling to large datasets. In this paper, we identify +and rectify several causes for uneven and ineffective training in the popular +ADM diffusion model architecture, without altering its high-level structure. +Observing uncontrolled magnitude changes and imbalances in both the network +activations and weights over the course of training, we redesign the network +layers to preserve activation, weight, and update magnitudes on expectation. We +find that systematic application of this philosophy eliminates the observed +drifts and imbalances, resulting in considerably better networks at equal +computational complexity. 
Our modifications improve the previous record FID of +2.41 in ImageNet-512 synthesis to 1.81, achieved using fast deterministic +sampling. + As an independent contribution, we present a method for setting the +exponential moving average (EMA) parameters post-hoc, i.e., after completing +the training run. This allows precise tuning of EMA length without the cost of +performing several training runs, and reveals its surprising interactions with +network architecture, training time, and guidance.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG', 'cs.NE', 'stat.ML']" +POCE: Primal Policy Optimization with Conservative Estimation for Multi-constraint Offline Reinforcement Learning,Jiayi Guan · Li Shen · Ao Zhou · Lusong Li · Han Hu · Xiaodong He · Guang Chen · Changjun Jiang, ,https://arxiv.org/abs/2401.14758,,2401.14758.pdf,Off-Policy Primal-Dual Safe Reinforcement Learning,"Primal-dual safe RL methods commonly perform iterations between the primal +update of the policy and the dual update of the Lagrange Multiplier. Such a +training paradigm is highly susceptible to the error in cumulative cost +estimation since this estimation serves as the key bond connecting the primal +and dual update processes. We show that this problem causes significant +underestimation of cost when using off-policy methods, leading to the failure +to satisfy the safety constraint. To address this issue, we propose +conservative policy optimization, which learns a policy in a +constraint-satisfying area by considering the uncertainty in cost estimation. +This improves constraint satisfaction but also potentially hinders reward +maximization. We then introduce local policy convexification to help eliminate +such suboptimality by gradually reducing the estimation uncertainty. We provide +theoretical interpretations of the joint coupling effect of these two +ingredients and further verify them by extensive experiments. Results on +benchmark tasks show that our method not only achieves an asymptotic +performance comparable to state-of-the-art on-policy methods while using much +fewer samples, but also significantly reduces constraint violation during +training. Our code is available at https://github.com/ZifanWu/CAL.",cs.LG,['cs.LG'] +CAMEL: CAusal Motion Enhancement tailored for Lifting Text-driven Video Editing,Guiwei Zhang · Tianyu Zhang · Guanglin Niu · Zichang Tan · Zichang Tan · Yalong Bai · Qing Yang, ,,https://openreview.net/forum?id=5a79AqFr0c,,,,,nan +VicTR: Video-conditioned Text Representations for Activity Recognition,Kumara Kahatapitiya · Anurag Arnab · Arsha Nagrani · Michael Ryoo, ,https://ar5iv.labs.arxiv.org/html/2309.00696,,2309.00696.pdf,AAN: Attributes-Aware Network for Temporal Action Detection,"The challenge of long-term video understanding remains constrained by the +efficient extraction of object semantics and the modelling of their +relationships for downstream tasks. Although the CLIP visual features exhibit +discriminative properties for various vision tasks, particularly in object +encoding, they are suboptimal for long-term video understanding. To address +this issue, we present the Attributes-Aware Network (AAN), which consists of +two key components: the Attributes Extractor and a Graph Reasoning block. These +components facilitate the extraction of object-centric attributes and the +modelling of their relationships within the video. 
By leveraging CLIP features, +AAN outperforms state-of-the-art approaches on two popular action detection +datasets: Charades and Toyota Smarthome Untrimmed datasets.",cs.CV,['cs.CV'] +Enhancing Quality of Compressed Images by Mitigating Enhancement Bias Towards Compression Domain,Qunliang Xing · Mai Xu · Shengxi Li · Xin Deng · Meisong Zheng · huaida liu · Ying Chen, ,https://arxiv.org/abs/2402.17200,,2402.17200.pdf,Enhancing Quality of Compressed Images by Mitigating Enhancement Bias Towards Compression Domain,"Existing quality enhancement methods for compressed images focus on aligning +the enhancement domain with the raw domain to yield realistic images. However, +these methods exhibit a pervasive enhancement bias towards the compression +domain, inadvertently regarding it as more realistic than the raw domain. This +bias makes enhanced images closely resemble their compressed counterparts, thus +degrading their perceptual quality. In this paper, we propose a simple yet +effective method to mitigate this bias and enhance the quality of compressed +images. Our method employs a conditional discriminator with the compressed +image as a key condition, and then incorporates a domain-divergence +regularization to actively distance the enhancement domain from the compression +domain. Through this dual strategy, our method enables the discrimination +against the compression domain, and brings the enhancement domain closer to the +raw domain. Comprehensive quality evaluations confirm the superiority of our +method over other state-of-the-art methods without incurring inference +overheads.",cs.CV,"['cs.CV', 'eess.IV']" +SocialCircle: Learning the Angle-based Social Interaction Representation for Pedestrian Trajectory Prediction,Conghao Wong · Beihao Xia · Ziqian Zou · Yulong Wang · Xinge You,https://cocoon2wong.github.io/SocialCircle,https://arxiv.org/abs/2310.05370,,2310.05370.pdf,SocialCircle: Learning the Angle-based Social Interaction Representation for Pedestrian Trajectory Prediction,"Analyzing and forecasting trajectories of agents like pedestrians and cars in +complex scenes has become more and more significant in many intelligent systems +and applications. The diversity and uncertainty in socially interactive +behaviors among a rich variety of agents make this task more challenging than +other deterministic computer vision tasks. Researchers have made a lot of +efforts to quantify the effects of these interactions on future trajectories +through different mathematical models and network structures, but this problem +has not been well solved. Inspired by marine animals that localize the +positions of their companions underwater through echoes, we build a new +anglebased trainable social interaction representation, named SocialCircle, for +continuously reflecting the context of social interactions at different angular +orientations relative to the target agent. 
We validate the effect of the +proposed SocialCircle by training it along with several newly released +trajectory prediction models, and experiments show that the SocialCircle not +only quantitatively improves the prediction performance, but also qualitatively +helps better simulate social interactions when forecasting pedestrian +trajectories in a way that is consistent with human intuitions.",cs.CV,['cs.CV'] +Dr.Hair: Reconstructing Scalp-Connected Hair Strands without Pre-training via Differentiable Rendering of Line Segments,Yusuke Takimoto · Hikari Takehara · Hiroyuki Sato · Zihao Zhu · Bo Zheng,https://dr-hair.github.io/Dr-Hair/,https://arxiv.org/abs/2403.17496,,2403.17496.pdf,Dr.Hair: Reconstructing Scalp-Connected Hair Strands without Pre-training via Differentiable Rendering of Line Segments,"In the film and gaming industries, achieving a realistic hair appearance +typically involves the use of strands originating from the scalp. However, +reconstructing these strands from observed surface images of hair presents +significant challenges. The difficulty in acquiring Ground Truth (GT) data has +led state-of-the-art learning-based methods to rely on pre-training with +manually prepared synthetic CG data. This process is not only labor-intensive +and costly but also introduces complications due to the domain gap when +compared to real-world data. In this study, we propose an optimization-based +approach that eliminates the need for pre-training. Our method represents hair +strands as line segments growing from the scalp and optimizes them using a +novel differentiable rendering algorithm. To robustly optimize a substantial +number of slender explicit geometries, we introduce 3D orientation estimation +utilizing global optimization, strand initialization based on Laplace's +equation, and reparameterization that leverages geometric connectivity and +spatial proximity. Unlike existing optimization-based methods, our method is +capable of reconstructing internal hair flow in an absolute direction. Our +method exhibits robust and accurate inverse rendering, surpassing the quality +of existing methods and significantly improving processing speed.",cs.CV,"['cs.CV', 'cs.GR']" +SnAG: Scalable and Accurate Video Grounding,Fangzhou Mu · Sicheng Mo · Yin Li, ,https://arxiv.org/abs/2404.02257,,2404.02257.pdf,SnAG: Scalable and Accurate Video Grounding,"Temporal grounding of text descriptions in videos is a central problem in +vision-language learning and video understanding. Existing methods often +prioritize accuracy over scalability -- they have been optimized for grounding +only a few text queries within short videos, and fail to scale up to long +videos with hundreds of queries. In this paper, we study the effect of +cross-modal fusion on the scalability of video grounding models. Our analysis +establishes late fusion as a more cost-effective fusion scheme for long-form +videos with many text queries. Moreover, it leads us to a novel, video-centric +sampling scheme for efficient training. Based on these findings, we present +SnAG, a simple baseline for scalable and accurate video grounding. 
Without +bells and whistles, SnAG is 43% more accurate and 1.5x faster than CONE, a +state of the art for long-form video grounding on the challenging MAD dataset, +while achieving highly competitive results on short videos.",cs.CV,['cs.CV'] +Hide in Thicket: Generating Imperceptible and Rational Adversarial Perturbations on 3D Point Clouds,Tianrui Lou · Xiaojun Jia · Jindong Gu · Li Liu · Siyuan Liang · Bangyan He · Xiaochun Cao, ,https://arxiv.org/abs/2403.05247,,2403.05247.pdf,Hide in Thicket: Generating Imperceptible and Rational Adversarial Perturbations on 3D Point Clouds,"Adversarial attack methods based on point manipulation for 3D point cloud +classification have revealed the fragility of 3D models, yet the adversarial +examples they produce are easily perceived or defended against. The trade-off +between the imperceptibility and adversarial strength leads most point attack +methods to inevitably introduce easily detectable outlier points upon a +successful attack. Another promising strategy, shape-based attack, can +effectively eliminate outliers, but existing methods often suffer significant +reductions in imperceptibility due to irrational deformations. We find that +concealing deformation perturbations in areas insensitive to human eyes can +achieve a better trade-off between imperceptibility and adversarial strength, +specifically in parts of the object surface that are complex and exhibit +drastic curvature changes. Therefore, we propose a novel shape-based +adversarial attack method, HiT-ADV, which initially conducts a two-stage search +for attack regions based on saliency and imperceptibility scores, and then adds +deformation perturbations in each attack region using Gaussian kernel +functions. Additionally, HiT-ADV is extendable to physical attack. We propose +that by employing benign resampling and benign rigid transformations, we can +further enhance physical adversarial strength with little sacrifice to +imperceptibility. Extensive experiments have validated the superiority of our +method in terms of adversarial and imperceptible properties in both digital and +physical spaces. Our code is avaliable at: https://github.com/TRLou/HiT-ADV.",cs.CV,"['cs.CV', 'eess.IV']" +PracticalDG: Perturbation Distillation on Vision-Language Models for Hybrid Domain Generalization,Zining Chen · Weiqiu Wang · Zhicheng Zhao · Fei Su · Aidong Men · Hongying Meng, ,https://arxiv.org/abs/2404.09011,,2404.09011.pdf,PracticalDG: Perturbation Distillation on Vision-Language Models for Hybrid Domain Generalization,"Domain Generalization (DG) aims to resolve distribution shifts between source +and target domains, and current DG methods are default to the setting that data +from source and target domains share identical categories. Nevertheless, there +exists unseen classes from target domains in practical scenarios. To address +this issue, Open Set Domain Generalization (OSDG) has emerged and several +methods have been exclusively proposed. However, most existing methods adopt +complex architectures with slight improvement compared with DG methods. +Recently, vision-language models (VLMs) have been introduced in DG following +the fine-tuning paradigm, but consume huge training overhead with large vision +models. Therefore, in this paper, we innovate to transfer knowledge from VLMs +to lightweight vision models and improve the robustness by introducing +Perturbation Distillation (PD) from three perspectives, including Score, Class +and Instance (SCI), named SCI-PD. 
Moreover, previous methods are oriented by +the benchmarks with identical and fixed splits, ignoring the divergence between +source domains. These methods are revealed to suffer from sharp performance +decay with our proposed new benchmark Hybrid Domain Generalization (HDG) and a +novel metric $H^{2}$-CV, which construct various splits to comprehensively +assess the robustness of algorithms. Extensive experiments demonstrate that our +method outperforms state-of-the-art algorithms on multiple datasets, especially +improving the robustness when confronting data scarcity.",cs.CV,"['cs.CV', 'cs.LG']" +Kernel Adaptive Convolution for Scene Text Detection via Distance Map Prediction,Jinzhi Zheng · Heng Fan · Libo Zhang, ,https://arxiv.org/html/2401.11704v1,,2401.11704v1.pdf,EK-Net:Real-time Scene Text Detection with Expand Kernel Distance,"Recently, scene text detection has received significant attention due to its +wide application. However, accurate detection in complex scenes of multiple +scales, orientations, and curvature remains a challenge. Numerous detection +methods adopt the Vatti clipping (VC) algorithm for multiple-instance training +to address the issue of arbitrary-shaped text. Yet we identify several bias +results from these approaches called the ""shrinked kernel"". Specifically, it +refers to a decrease in accuracy resulting from an output that overly favors +the text kernel. In this paper, we propose a new approach named Expand Kernel +Network (EK-Net) with expand kernel distance to compensate for the previous +deficiency, which includes three-stages regression to complete instance +detection. Moreover, EK-Net not only realize the precise positioning of +arbitrary-shaped text, but also achieve a trade-off between performance and +speed. Evaluation results demonstrate that EK-Net achieves state-of-the-art or +competitive performance compared to other advanced methods, e.g., F-measure of +85.72% at 35.42 FPS on ICDAR 2015, F-measure of 85.75% at 40.13 FPS on CTW1500.",cs.CV,['cs.CV'] +CGI-DM: Digital Copyright Authentication for Diffusion Models via Contrasting Gradient Inversion,Xiaoyu Wu · Yang Hua · Chumeng Liang · Jiaru Zhang · Hao Wang · Tao Song · Haibing Guan,https://github.com/Nicholas0228/Revelio,https://arxiv.org/abs/2403.11162,,2403.11162.pdf,CGI-DM: Digital Copyright Authentication for Diffusion Models via Contrasting Gradient Inversion,"Diffusion Models (DMs) have evolved into advanced image generation tools, +especially for few-shot generation where a pretrained model is fine-tuned on a +small set of images to capture a specific style or object. Despite their +success, concerns exist about potential copyright violations stemming from the +use of unauthorized data in this process. In response, we present Contrasting +Gradient Inversion for Diffusion Models (CGI-DM), a novel method featuring +vivid visual representations for digital copyright authentication. Our approach +involves removing partial information of an image and recovering missing +details by exploiting conceptual differences between the pretrained and +fine-tuned models. We formulate the differences as KL divergence between latent +variables of the two models when given the same input image, which can be +maximized through Monte Carlo sampling and Projected Gradient Descent (PGD). +The similarity between original and recovered images serves as a strong +indicator of potential infringements. 
Extensive experiments on the WikiArt and +Dreambooth datasets demonstrate the high accuracy of CGI-DM in digital +copyright authentication, surpassing alternative validation techniques. Code +implementation is available at https://github.com/Nicholas0228/Revelio.",cs.CV,"['cs.CV', 'cs.AI', 'cs.CR', 'cs.CY', 'cs.LG']" +Editable Scene Simulation for Autonomous Driving via LLM-Agent Collaboration,Yuxi Wei · Zi Wang · Yifan Lu · Chenxin Xu · Changxing Liu · Hao Zhao · Siheng Chen · Yanfeng Wang,https://yifanlu0227.github.io/ChatSim/,https://arxiv.org/abs/2402.05746,,2402.05746.pdf,Editable Scene Simulation for Autonomous Driving via Collaborative LLM-Agents,"Scene simulation in autonomous driving has gained significant attention +because of its huge potential for generating customized data. However, existing +editable scene simulation approaches face limitations in terms of user +interaction efficiency, multi-camera photo-realistic rendering and external +digital assets integration. To address these challenges, this paper introduces +ChatSim, the first system that enables editable photo-realistic 3D driving +scene simulations via natural language commands with external digital assets. +To enable editing with high command flexibility,~ChatSim leverages a large +language model (LLM) agent collaboration framework. To generate photo-realistic +outcomes, ChatSim employs a novel multi-camera neural radiance field method. +Furthermore, to unleash the potential of extensive high-quality digital assets, +ChatSim employs a novel multi-camera lighting estimation method to achieve +scene-consistent assets' rendering. Our experiments on Waymo Open Dataset +demonstrate that ChatSim can handle complex language commands and generate +corresponding photo-realistic scene videos.",cs.CV,['cs.CV'] +An Upload-Efficient Scheme for Transferring Knowledge From a Server-Side Pre-trained Generator to Clients in Heterogeneous Federated Learning,Jianqing Zhang · Yang Liu · Yang Hua · Jian Cao,https://github.com/TsingZ0/FedKTL,https://arxiv.org/abs/2403.15760,,2403.15760.pdf,An Upload-Efficient Scheme for Transferring Knowledge From a Server-Side Pre-trained Generator to Clients in Heterogeneous Federated Learning,"Heterogeneous Federated Learning (HtFL) enables collaborative learning on +multiple clients with different model architectures while preserving privacy. +Despite recent research progress, knowledge sharing in HtFL is still difficult +due to data and model heterogeneity. To tackle this issue, we leverage the +knowledge stored in pre-trained generators and propose a new upload-efficient +knowledge transfer scheme called Federated Knowledge-Transfer Loop (FedKTL). +Our FedKTL can produce client-task-related prototypical image-vector pairs via +the generator's inference on the server. With these pairs, each client can +transfer pre-existing knowledge from the generator to its local model through +an additional supervised local task. We conduct extensive experiments on four +datasets under two types of data heterogeneity with 14 kinds of models +including CNNs and ViTs. Results show that our upload-efficient FedKTL +surpasses seven state-of-the-art methods by up to 7.31% in accuracy. Moreover, +our knowledge transfer scheme is applicable in scenarios with only one edge +client. 
Code: https://github.com/TsingZ0/FedKTL",cs.AI,"['cs.AI', 'cs.DC']" +Language-conditioned Detection Transformer,Jang Hyun Cho · Philipp Krähenbühl,https://janghyuncho.github.io/DECOLA/,,https://www.semanticscholar.org/paper/Language-conditioned-Detection-Transformer-Cho-Krähenbühl/d590b8cabee3630327fa72149a2b137b2c0892f9/figure/0,,,,,nan +Audio-Visual Segmentation via Unlabeled Frame Exploitation,Jinxiang Liu · Yikun Liu · Ferenas · Chen Ju · Ya Zhang · Yanfeng Wang, ,https://arxiv.org/abs/2403.11074,,2403.11074.pdf,Audio-Visual Segmentation via Unlabeled Frame Exploitation,"Audio-visual segmentation (AVS) aims to segment the sounding objects in video +frames. Although great progress has been witnessed, we experimentally reveal +that current methods reach marginal performance gain within the use of the +unlabeled frames, leading to the underutilization issue. To fully explore the +potential of the unlabeled frames for AVS, we explicitly divide them into two +categories based on their temporal characteristics, i.e., neighboring frame +(NF) and distant frame (DF). NFs, temporally adjacent to the labeled frame, +often contain rich motion information that assists in the accurate localization +of sounding objects. Contrary to NFs, DFs have long temporal distances from the +labeled frame, which share semantic-similar objects with appearance variations. +Considering their unique characteristics, we propose a versatile framework that +effectively leverages them to tackle AVS. Specifically, for NFs, we exploit the +motion cues as the dynamic guidance to improve the objectness localization. +Besides, we exploit the semantic cues in DFs by treating them as valid +augmentations to the labeled frames, which are then used to enrich data +diversity in a self-training manner. Extensive experimental results demonstrate +the versatility and superiority of our method, unleashing the power of the +abundant unlabeled frames.",cs.CV,"['cs.CV', 'cs.AI', 'cs.MM', 'cs.SD', 'eess.AS']" +Distilling ODE Solvers of Diffusion Models into Smaller Steps,Sanghwan Kim · Hao Tang · Fisher Yu, ,https://arxiv.org/abs/2309.16421,,2309.16421.pdf,Distilling ODE Solvers of Diffusion Models into Smaller Steps,"Abstract Diffusion models have recently gained prominence as a novel category +of generative models. Despite their success, these models face a notable +drawback in terms of slow sampling speeds, requiring a high number of function +evaluations (NFE) in the order of hundreds or thousands. In response, both +learning-free and learning-based sampling strategies have been explored to +expedite the sampling process. Learning-free sampling employs various ordinary +differential equation (ODE) solvers based on the formulation of diffusion ODEs. +However, it encounters challenges in faithfully tracking the true sampling +trajectory, particularly for small NFE. Conversely, learning-based sampling +methods, such as knowledge distillation, demand extensive additional training, +limiting their practical applicability. To overcome these limitations, we +introduce Distilled-ODE solvers (D-ODE solvers), a straightforward distillation +approach grounded in ODE solver formulations. Our method seamlessly integrates +the strengths of both learning-free and learning-based sampling. D-ODE solvers +are constructed by introducing a single parameter adjustment to existing ODE +solvers. Furthermore, we optimize D-ODE solvers with smaller steps using +knowledge distillation from ODE solvers with larger steps across a batch of +samples. 
Comprehensive experiments demonstrate the superior performance of +D-ODE solvers compared to existing ODE solvers, including DDIM, PNDM, +DPM-Solver, DEIS, and EDM, particularly in scenarios with fewer NFE. Notably, +our method incurs negligible computational overhead compared to previous +distillation techniques, facilitating straightforward and rapid integration +with existing samplers. Qualitative analysis reveals that D-ODE solvers not +only enhance image quality but also faithfully follow the target ODE +trajectory.",cs.CV,['cs.CV'] +Boosting Self-Supervision for Single-View Scene Completion via Knowledge Distillation,Keonhee Han · Dominik Muhle · Felix Wimbauer · Daniel Cremers,https://keonhee-han.github.io/publications/kdbts/,https://arxiv.org/abs/2404.07933,,2404.07933.pdf,Boosting Self-Supervision for Single-View Scene Completion via Knowledge Distillation,"Inferring scene geometry from images via Structure from Motion is a +long-standing and fundamental problem in computer vision. While classical +approaches and, more recently, depth map predictions only focus on the visible +parts of a scene, the task of scene completion aims to reason about geometry +even in occluded regions. With the popularity of neural radiance fields +(NeRFs), implicit representations also became popular for scene completion by +predicting so-called density fields. Unlike explicit approaches. e.g. +voxel-based methods, density fields also allow for accurate depth prediction +and novel-view synthesis via image-based rendering. In this work, we propose to +fuse the scene reconstruction from multiple images and distill this knowledge +into a more accurate single-view scene reconstruction. To this end, we propose +Multi-View Behind the Scenes (MVBTS) to fuse density fields from multiple posed +images, trained fully self-supervised only from image data. Using knowledge +distillation, we use MVBTS to train a single-view scene completion network via +direct supervision called KDBTS. It achieves state-of-the-art performance on +occupancy prediction, especially in occluded regions.",cs.CV,['cs.CV'] +Your Image is My Video: Reshaping the Receptive Field via Image-To-Video Differentiable AutoAugmentation and Fusion,Sofia Casarin · Cynthia Ugwu · Sergio Escalera · Oswald Lanz, ,https://arxiv.org/abs/2403.15194,,2403.15194.pdf,Your Image is My Video: Reshaping the Receptive Field via Image-To-Video Differentiable AutoAugmentation and Fusion,"The landscape of deep learning research is moving towards innovative +strategies to harness the true potential of data. Traditionally, emphasis has +been on scaling model architectures, resulting in large and complex neural +networks, which can be difficult to train with limited computational resources. +However, independently of the model size, data quality (i.e. amount and +variability) is still a major factor that affects model generalization. In this +work, we propose a novel technique to exploit available data through the use of +automatic data augmentation for the tasks of image classification and semantic +segmentation. We introduce the first Differentiable Augmentation Search method +(DAS) to generate variations of images that can be processed as videos. +Compared to previous approaches, DAS is extremely fast and flexible, allowing +the search on very large search spaces in less than a GPU day. Our intuition is +that the increased receptive field in the temporal dimension provided by DAS +could lead to benefits also to the spatial receptive field. 
More specifically, +we leverage DAS to guide the reshaping of the spatial receptive field by +selecting task-dependant transformations. As a result, compared to standard +augmentation alternatives, we improve in terms of accuracy on ImageNet, +Cifar10, Cifar100, Tiny-ImageNet, Pascal-VOC-2012 and CityScapes datasets when +plugging-in our DAS over different light-weight video backbones.",cs.CV,"['cs.CV', 'cs.LG']" +A Dynamic Kernel Prior Model for Unsupervised Blind Image Super-Resolution,Zhixiong Yang · Jingyuan Xia · Shengxi Li · Xinghua Huang · Shuanghui Zhang · Zhen Liu · Yaowen Fu · Yongxiang Liu, ,https://arxiv.org/abs/2404.15620,,2404.15620.pdf,A Dynamic Kernel Prior Model for Unsupervised Blind Image Super-Resolution,"Deep learning-based methods have achieved significant successes on solving +the blind super-resolution (BSR) problem. However, most of them request +supervised pre-training on labelled datasets. This paper proposes an +unsupervised kernel estimation model, named dynamic kernel prior (DKP), to +realize an unsupervised and pre-training-free learning-based algorithm for +solving the BSR problem. DKP can adaptively learn dynamic kernel priors to +realize real-time kernel estimation, and thereby enables superior HR image +restoration performances. This is achieved by a Markov chain Monte Carlo +sampling process on random kernel distributions. The learned kernel prior is +then assigned to optimize a blur kernel estimation network, which entails a +network-based Langevin dynamic optimization strategy. These two techniques +ensure the accuracy of the kernel estimation. DKP can be easily used to replace +the kernel estimation models in the existing methods, such as Double-DIP and +FKP-DIP, or be added to the off-the-shelf image restoration model, such as +diffusion model. In this paper, we incorporate our DKP model with DIP and +diffusion model, referring to DIP-DKP and Diff-DKP, for validations. Extensive +simulations on Gaussian and motion kernel scenarios demonstrate that the +proposed DKP model can significantly improve the kernel estimation with +comparable runtime and memory usage, leading to state-of-the-art BSR results. +The code is available at https://github.com/XYLGroup/DKP.",eess.IV,['eess.IV'] +DiffusionTrack: Point Set Diffusion Model for Visual Object Tracking,Fei Xie · Zhongdao Wang · Chao Ma, ,https://arxiv.org/abs/2308.09905,,2308.09905.pdf,DiffusionTrack: Diffusion Model For Multi-Object Tracking,"Multi-object tracking (MOT) is a challenging vision task that aims to detect +individual objects within a single frame and associate them across multiple +frames. Recent MOT approaches can be categorized into two-stage +tracking-by-detection (TBD) methods and one-stage joint detection and tracking +(JDT) methods. Despite the success of these approaches, they also suffer from +common problems, such as harmful global or local inconsistency, poor trade-off +between robustness and model complexity, and lack of flexibility in different +scenes within the same video. In this paper we propose a simple but robust +framework that formulates object detection and association jointly as a +consistent denoising diffusion process from paired noise boxes to paired +ground-truth boxes. This novel progressive denoising diffusion strategy +substantially augments the tracker's effectiveness, enabling it to discriminate +between various objects. 
During the training stage, paired object boxes diffuse +from paired ground-truth boxes to random distribution, and the model learns +detection and tracking simultaneously by reversing this noising process. In +inference, the model refines a set of paired randomly generated boxes to the +detection and tracking results in a flexible one-step or multi-step denoising +diffusion process. Extensive experiments on three widely used MOT benchmarks, +including MOT17, MOT20, and Dancetrack, demonstrate that our approach achieves +competitive performance compared to the current state-of-the-art methods.",cs.CV,['cs.CV'] +SurroundSDF: Implicit 3D Scene Understanding Based on Signed Distance Field,Lizhe Liu · Bohua Wang · Hongwei Xie · Daqi Liu · Li Liu · Kuiyuan Yang · Bing Wang · Zhiqiang Tian, ,https://arxiv.org/abs/2403.14366,,2403.14366.pdf,SurroundSDF: Implicit 3D Scene Understanding Based on Signed Distance Field,"Vision-centric 3D environment understanding is both vital and challenging for +autonomous driving systems. Recently, object-free methods have attracted +considerable attention. Such methods perceive the world by predicting the +semantics of discrete voxel grids but fail to construct continuous and accurate +obstacle surfaces. To this end, in this paper, we propose SurroundSDF to +implicitly predict the signed distance field (SDF) and semantic field for the +continuous perception from surround images. Specifically, we introduce a +query-based approach and utilize SDF constrained by the Eikonal formulation to +accurately describe the surfaces of obstacles. Furthermore, considering the +absence of precise SDF ground truth, we propose a novel weakly supervised +paradigm for SDF, referred to as the Sandwich Eikonal formulation, which +emphasizes applying correct and dense constraints on both sides of the surface, +thereby enhancing the perceptual accuracy of the surface. Experiments suggest +that our method achieves SOTA for both occupancy prediction and 3D scene +reconstruction tasks on the nuScenes dataset.",cs.CV,['cs.CV'] +DreamControl: Control-Based Text-to-3D Generation with 3D Self-Prior,Tianyu Huang · Yihan Zeng · Zhilu Zhang · Wan Xu · Hang Xu · Songcen Xu · Rynson W.H. Lau · Wangmeng Zuo,https://github.com/tyhuang0428/DreamControl,https://arxiv.org/abs/2312.06439,,2312.06439.pdf,DreamControl: Control-Based Text-to-3D Generation with 3D Self-Prior,"3D generation has raised great attention in recent years. With the success of +text-to-image diffusion models, the 2D-lifting technique becomes a promising +route to controllable 3D generation. However, these methods tend to present +inconsistent geometry, which is also known as the Janus problem. We observe +that the problem is caused mainly by two aspects, i.e., viewpoint bias in 2D +diffusion models and overfitting of the optimization objective. To address it, +we propose a two-stage 2D-lifting framework, namely DreamControl, which +optimizes coarse NeRF scenes as 3D self-prior and then generates fine-grained +objects with control-based score distillation. Specifically, adaptive viewpoint +sampling and boundary integrity metric are proposed to ensure the consistency +of generated priors. The priors are then regarded as input conditions to +maintain reasonable geometries, in which conditional LoRA and weighted score +are further proposed to optimize detailed textures. DreamControl can generate +high-quality 3D content in terms of both geometry consistency and texture +fidelity. 
Moreover, our control-based optimization guidance is applicable to +more downstream tasks, including user-guided generation and 3D animation. The +project page is available at https://github.com/tyhuang0428/DreamControl.",cs.CV,['cs.CV'] +Harnessing the Power of MLLMs for Transferable Text-to-Image Person ReID,Wentao Tan · Changxing Ding · Jiayu Jiang · Fei Wang · Yibing Zhan · Dapeng Tao, ,https://arxiv.org/abs/2405.04940,,,Harnessing the Power of MLLMs for Transferable Text-to-Image Person ReID,"Text-to-image person re-identification (ReID) retrieves pedestrian images +according to textual descriptions. Manually annotating textual descriptions is +time-consuming, restricting the scale of existing datasets and therefore the +generalization ability of ReID models. As a result, we study the transferable +text-to-image ReID problem, where we train a model on our proposed large-scale +database and directly deploy it to various datasets for evaluation. We obtain +substantial training data via Multi-modal Large Language Models (MLLMs). +Moreover, we identify and address two key challenges in utilizing the obtained +textual descriptions. First, an MLLM tends to generate descriptions with +similar structures, causing the model to overfit specific sentence patterns. +Thus, we propose a novel method that uses MLLMs to caption images according to +various templates. These templates are obtained using a multi-turn dialogue +with a Large Language Model (LLM). Therefore, we can build a large-scale +dataset with diverse textual descriptions. Second, an MLLM may produce +incorrect descriptions. Hence, we introduce a novel method that automatically +identifies words in a description that do not correspond with the image. This +method is based on the similarity between one text and all patch token +embeddings in the image. Then, we mask these words with a larger probability in +the subsequent training epoch, alleviating the impact of noisy textual +descriptions. The experimental results demonstrate that our methods +significantly boost the direct transfer text-to-image ReID performance. +Benefiting from the pre-trained model weights, we also achieve state-of-the-art +performance in the traditional evaluation settings.",cs.CV,['cs.CV'] +Bayes' Rays: Uncertainty Quantification for Neural Radiance Fields,Leili Goli · Cody Reading · Silvia Sellán · Alec Jacobson · Andrea Tagliasacchi, ,https://arxiv.org/abs/2309.03185,,2309.03185.pdf,Bayes' Rays: Uncertainty Quantification for Neural Radiance Fields,"Neural Radiance Fields (NeRFs) have shown promise in applications like view +synthesis and depth estimation, but learning from multiview images faces +inherent uncertainties. Current methods to quantify them are either heuristic +or computationally demanding. We introduce BayesRays, a post-hoc framework to +evaluate uncertainty in any pre-trained NeRF without modifying the training +process. Our method establishes a volumetric uncertainty field using spatial +perturbations and a Bayesian Laplace approximation. We derive our algorithm +statistically and show its superior performance in key metrics and +applications. 
Additional results available at: https://bayesrays.github.io.",cs.CV,['cs.CV'] +CPLIP: Zero-Shot Learning for Histopathology with Comprehensive Vision-Language Alignment,Sajid Javed · Arif Mahmood · IYYAKUTTI IYAPPAN GANAPATHI · Fayaz Ali · Naoufel Werghi · Mohammed Bennamoun, ,https://arxiv.org/abs/2306.07831,,2306.07831.pdf,Visual Language Pretrained Multiple Instance Zero-Shot Transfer for Histopathology Images,"Contrastive visual language pretraining has emerged as a powerful method for +either training new language-aware image encoders or augmenting existing +pretrained models with zero-shot visual recognition capabilities. However, +existing works typically train on large datasets of image-text pairs and have +been designed to perform downstream tasks involving only small to medium +sized-images, neither of which are applicable to the emerging field of +computational pathology where there are limited publicly available paired +image-text datasets and each image can span up to 100,000 x 100,000 pixels. In +this paper we present MI-Zero, a simple and intuitive framework for unleashing +the zero-shot transfer capabilities of contrastively aligned image and text +models on gigapixel histopathology whole slide images, enabling multiple +downstream diagnostic tasks to be carried out by pretrained encoders without +requiring any additional labels. MI-Zero reformulates zero-shot transfer under +the framework of multiple instance learning to overcome the computational +challenge of inference on extremely large images. We used over 550k pathology +reports and other available in-domain text corpora to pre-train our text +encoder. By effectively leveraging strong pre-trained encoders, our best model +pretrained on over 33k histopathology image-caption pairs achieves an average +median zero-shot accuracy of 70.2% across three different real-world cancer +subtyping tasks. Our code is available at: +https://github.com/mahmoodlab/MI-Zero.",cs.CV,['cs.CV'] +UniMODE: Unified Monocular 3D Object Detection,Zhuoling Li · Xiaogang Xu · Ser-Nam Lim · Hengshuang Zhao, ,https://arxiv.org/abs/2402.18573,,2402.18573.pdf,UniMODE: Unified Monocular 3D Object Detection,"Realizing unified monocular 3D object detection, including both indoor and +outdoor scenes, holds great importance in applications like robot navigation. +However, involving various scenarios of data to train models poses challenges +due to their significantly different characteristics, e.g., diverse geometry +properties and heterogeneous domain distributions. To address these challenges, +we build a detector based on the bird's-eye-view (BEV) detection paradigm, +where the explicit feature projection is beneficial to addressing the geometry +learning ambiguity when employing multiple scenarios of data to train +detectors. Then, we split the classical BEV detection architecture into two +stages and propose an uneven BEV grid design to handle the convergence +instability caused by the aforementioned challenges. Moreover, we develop a +sparse BEV feature projection strategy to reduce computational cost and a +unified domain alignment method to handle heterogeneous domains. 
Combining +these techniques, a unified detector UniMODE is derived, which surpasses the +previous state-of-the-art on the challenging Omni3D dataset (a large-scale +dataset including both indoor and outdoor scenes) by 4.9% AP_3D, revealing the +first successful generalization of a BEV detector to unified 3D object +detection.",cs.CV,['cs.CV'] +Perceptual Assessment and Optimization of HDR Image Rendering,Peibei Cao · Rafal Mantiuk · Kede Ma, ,https://arxiv.org/abs/2310.12877v4,,2310.12877v4.pdf,Perceptual Assessment and Optimization of High Dynamic Range Image Rendering,"High dynamic range (HDR) rendering has the ability to faithfully reproduce +the wide luminance ranges in natural scenes, but how to accurately assess the +rendering quality is relatively underexplored. Existing quality models are +mostly designed for low dynamic range (LDR) images, and do not align well with +human perception of HDR image quality. To fill this gap, we propose a family of +HDR quality metrics, in which the key step is employing a simple inverse +display model to decompose an HDR image into a stack of LDR images with varying +exposures. Subsequently, these decomposed images are assessed through +well-established LDR quality metrics. Our HDR quality models present three +distinct benefits. First, they directly inherit the recent advancements of LDR +quality metrics. Second, they do not rely on human perceptual data of HDR image +quality for re-calibration. Third, they facilitate the alignment and +prioritization of specific luminance ranges for more accurate and detailed +quality assessment. Experimental results show that our HDR quality metrics +consistently outperform existing models in terms of quality assessment on four +HDR image quality datasets and perceptual optimization of HDR novel view +synthesis.",eess.IV,"['eess.IV', 'cs.CV']" +From-Ground-To-Objects: Coarse-to-Fine Self-supervised Monocular Depth Estimation of Dynamic Objects with Ground Contact Prior,Jaeho Moon · Juan Luis Gonzalez Bello · Byeongjun Kwon · Munchurl Kim,https://kaist-viclab.github.io/From_Ground_To_Objects_site/,https://arxiv.org/abs/2312.10118,,2312.10118.pdf,From-Ground-To-Objects: Coarse-to-Fine Self-supervised Monocular Depth Estimation of Dynamic Objects with Ground Contact Prior,"Self-supervised monocular depth estimation (DE) is an approach to learning +depth without costly depth ground truths. However, it often struggles with +moving objects that violate the static scene assumption during training. To +address this issue, we introduce a coarse-to-fine training strategy leveraging +the ground contacting prior based on the observation that most moving objects +in outdoor scenes contact the ground. In the coarse training stage, we exclude +the objects in dynamic classes from the reprojection loss calculation to avoid +inaccurate depth learning. To provide precise supervision on the depth of the +objects, we present a novel Ground-contacting-prior Disparity Smoothness Loss +(GDS-Loss) that encourages a DE network to align the depth of the objects with +their ground-contacting points. Subsequently, in the fine training stage, we +refine the DE network to learn the detailed depth of the objects from the +reprojection loss, while ensuring accurate DE on the moving object regions by +employing our regularization loss with a cost-volume-based weighting factor. 
+Our overall coarse-to-fine training strategy can easily be integrated with +existing DE methods without any modifications, significantly enhancing DE +performance on challenging Cityscapes and KITTI datasets, especially in the +moving object regions.",cs.CV,['cs.CV'] +DiffuseMix: Label-Preserving Data Augmentation with Diffusion Models,Khawar Islam · Muhammad Zaigham Zaheer · Arif Mahmood · Karthik Nandakumar,https://diffusemix.github.io/,https://arxiv.org/abs/2405.14881,,2405.14881.pdf,DiffuseMix: Label-Preserving Data Augmentation with Diffusion Models,"Recently, a number of image-mixing-based augmentation techniques have been +introduced to improve the generalization of deep neural networks. In these +techniques, two or more randomly selected natural images are mixed together to +generate an augmented image. Such methods may not only omit important portions +of the input images but also introduce label ambiguities by mixing images +across labels resulting in misleading supervisory signals. To address these +limitations, we propose DiffuseMix, a novel data augmentation technique that +leverages a diffusion model to reshape training images, supervised by our +bespoke conditional prompts. First, concatenation of a partial natural image +and its generated counterpart is obtained which helps in avoiding the +generation of unrealistic images or label ambiguities. Then, to enhance +resilience against adversarial attacks and improves safety measures, a randomly +selected structural pattern from a set of fractal images is blended into the +concatenated image to form the final augmented image for training. Our +empirical results on seven different datasets reveal that DiffuseMix achieves +superior performance compared to existing state-of the-art methods on tasks +including general classification,fine-grained classification, fine-tuning, data +scarcity, and adversarial robustness. Augmented datasets and codes are +available here: https://diffusemix.github.io/",cs.CV,['cs.CV'] +Neural Exposure Fusion for High-Dynamic Range Object Detection,Emmanuel Onzon · Maximilian Bömer · Fahim Mannan · Felix Heide, ,https://arxiv.org/abs/2405.16038,,2405.16038.pdf,Rethinking Early-Fusion Strategies for Improved Multispectral Object Detection,"Most recent multispectral object detectors employ a two-branch structure to +extract features from RGB and thermal images. While the two-branch structure +achieves better performance than a single-branch structure, it overlooks +inference efficiency. This conflict is increasingly aggressive, as recent works +solely pursue higher performance rather than both performance and efficiency. +In this paper, we address this issue by improving the performance of efficient +single-branch structures. We revisit the reasons causing the performance gap +between these structures. For the first time, we reveal the information +interference problem in the naive early-fusion strategy adopted by previous +single-branch structures. Besides, we find that the domain gap between +multispectral images, and weak feature representation of the single-branch +structure are also key obstacles for performance. Focusing on these three +problems, we propose corresponding solutions, including a novel shape-priority +early-fusion strategy, a weakly supervised learning method, and a core +knowledge distillation technique. Experiments demonstrate that single-branch +networks equipped with these three contributions achieve significant +performance enhancements while retaining high efficiency. 
Our code will be +available at +\url{https://github.com/XueZ-phd/Efficient-RGB-T-Early-Fusion-Detection}.",cs.CV,['cs.CV'] +Cross-view and Cross-pose Completion for 3D Human Understanding,Matthieu Armando · Salma Galaaoui · Fabien Baradel · Thomas Lucas · Vincent Leroy · Romain BRÉGIER · Philippe Weinzaepfel · Grégory Rogez, ,https://arxiv.org/abs/2311.09104,,2311.09104.pdf,Cross-view and Cross-pose Completion for 3D Human Understanding,"Human perception and understanding is a major domain of computer vision +which, like many other vision subdomains recently, stands to gain from the use +of large models pre-trained on large datasets. We hypothesize that the most +common pre-training strategy of relying on general purpose, object-centric +image datasets such as ImageNet, is limited by an important domain shift. On +the other hand, collecting domain-specific ground truth such as 2D or 3D labels +does not scale well. Therefore, we propose a pre-training approach based on +self-supervised learning that works on human-centric data using only images. +Our method uses pairs of images of humans: the first is partially masked and +the model is trained to reconstruct the masked parts given the visible ones and +a second image. It relies on both stereoscopic (cross-view) pairs, and temporal +(cross-pose) pairs taken from videos, in order to learn priors about 3D as well +as human motion. We pre-train a model for body-centric tasks and one for +hand-centric tasks. With a generic transformer architecture, these models +outperform existing self-supervised pre-training methods on a wide set of +human-centric downstream tasks, and obtain state-of-the-art performance for +instance when fine-tuning for model-based and model-free human mesh recovery.",cs.CV,['cs.CV'] +Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation,Zhiwu Qing · Shiwei Zhang · Jiayu Wang · Xiang Wang · Yujie Wei · Yingya Zhang · Changxin Gao · Nong Sang, ,https://arxiv.org/abs/2312.04483,,2312.04483.pdf,Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation,"Despite diffusion models having shown powerful abilities to generate +photorealistic images, generating videos that are realistic and diverse still +remains in its infancy. One of the key reasons is that current methods +intertwine spatial content and temporal dynamics together, leading to a notably +increased complexity of text-to-video generation (T2V). In this work, we +propose HiGen, a diffusion model-based method that improves performance by +decoupling the spatial and temporal factors of videos from two perspectives, +i.e., structure level and content level. At the structure level, we decompose +the T2V task into two steps, including spatial reasoning and temporal +reasoning, using a unified denoiser. Specifically, we generate spatially +coherent priors using text during spatial reasoning and then generate +temporally coherent motions from these priors during temporal reasoning. At the +content level, we extract two subtle cues from the content of the input video +that can express motion and appearance changes, respectively. These two cues +then guide the model's training for generating videos, enabling flexible +content variations and enhancing temporal stability. Through the decoupled +paradigm, HiGen can effectively reduce the complexity of this task and generate +realistic videos with semantics accuracy and motion stability. 
Extensive +experiments demonstrate the superior performance of HiGen over the +state-of-the-art T2V methods.",cs.CV,['cs.CV'] +SpikingResformer: Bridging ResNet and Vision Transformer in Spiking Neural Networks,Xinyu Shi · Zecheng Hao · Zhaofei Yu, ,https://arxiv.org/abs/2403.14302,,2403.14302.pdf,SpikingResformer: Bridging ResNet and Vision Transformer in Spiking Neural Networks,"The remarkable success of Vision Transformers in Artificial Neural Networks +(ANNs) has led to a growing interest in incorporating the self-attention +mechanism and transformer-based architecture into Spiking Neural Networks +(SNNs). While existing methods propose spiking self-attention mechanisms that +are compatible with SNNs, they lack reasonable scaling methods, and the overall +architectures proposed by these methods suffer from a bottleneck in effectively +extracting local features. To address these challenges, we propose a novel +spiking self-attention mechanism named Dual Spike Self-Attention (DSSA) with a +reasonable scaling method. Based on DSSA, we propose a novel spiking Vision +Transformer architecture called SpikingResformer, which combines the +ResNet-based multi-stage architecture with our proposed DSSA to improve both +performance and energy efficiency while reducing parameters. Experimental +results show that SpikingResformer achieves higher accuracy with fewer +parameters and lower energy consumption than other spiking Vision Transformer +counterparts. Notably, our SpikingResformer-L achieves 79.40% top-1 accuracy on +ImageNet with 4 time-steps, which is the state-of-the-art result in the SNN +field.",cs.NE,"['cs.NE', 'cs.CV', 'cs.LG']" +C$^2$KD: Bridging the Modality Gap for Cross-Modal Knowledge Distillation,Fushuo Huo · Wenchao Xu · Jingcai Guo · Haozhao Wang · Song Guo, ,https://arxiv.org/abs/2312.17648,,2312.17648.pdf,Bridging Modality Gap for Visual Grounding with Effecitve Cross-modal Distillation,"Visual grounding aims to align visual information of specific regions of +images with corresponding natural language expressions. Current visual +grounding methods leverage pre-trained visual and language backbones separately +to obtain visual features and linguistic features. Although these two types of +features are then fused via delicately designed networks, the heterogeneity of +the features makes them inapplicable for multi-modal reasoning. This problem +arises from the domain gap between the single-modal pre-training backbone used +in current visual grounding methods, which can hardly be overcome by the +traditional end-to-end training method. To alleviate this, our work proposes an +Empowering pre-trained model for Visual Grounding (EpmVG) framework, which +distills a multimodal pre-trained model to guide the visual grounding task. +EpmVG is based on a novel cross-modal distillation mechanism, which can +effectively introduce the consistency information of images and texts in the +pre-trained model, to reduce the domain gap existing in the backbone networks, +thereby improving the performance of the model in the visual grounding task. 
+Extensive experiments are carried out on five conventionally used datasets, and +results demonstrate that our method achieves better performance than +state-of-the-art methods.",cs.CV,"['cs.CV', 'cs.AI']" +Map-Relative Pose Regression for Visual Re-Localization,Shuai Chen · Tommaso Cavallari · Victor Adrian Prisacariu · Eric Brachmann, ,https://arxiv.org/abs/2404.09884,,2404.09884.pdf,Map-Relative Pose Regression for Visual Re-Localization,"Pose regression networks predict the camera pose of a query image relative to +a known environment. Within this family of methods, absolute pose regression +(APR) has recently shown promising accuracy in the range of a few centimeters +in position error. APR networks encode the scene geometry implicitly in their +weights. To achieve high accuracy, they require vast amounts of training data +that, realistically, can only be created using novel view synthesis in a +days-long process. This process has to be repeated for each new scene again and +again. We present a new approach to pose regression, map-relative pose +regression (marepo), that satisfies the data hunger of the pose regression +network in a scene-agnostic fashion. We condition the pose regressor on a +scene-specific map representation such that its pose predictions are relative +to the scene map. This allows us to train the pose regressor across hundreds of +scenes to learn the generic relation between a scene-specific map +representation and the camera pose. Our map-relative pose regressor can be +applied to new map representations immediately or after mere minutes of +fine-tuning for the highest accuracy. Our approach outperforms previous pose +regression methods by far on two public datasets, indoor and outdoor. Code is +available: https://nianticlabs.github.io/marepo",cs.CV,"['cs.CV', 'cs.LG']" +Lookahead Exploration with Neural Radiance Representation for Continuous Vision-Language Navigation,Zihan Wang · Xiangyang Li · Jiahao Yang · Yeqi Liu · Junjie Hu · Ming Jiang · Shuqiang Jiang,https://github.com/MrZihan/HNR-VLN,https://arxiv.org/abs/2404.01943,,2404.01943.pdf,Lookahead Exploration with Neural Radiance Representation for Continuous Vision-Language Navigation,"Vision-and-language navigation (VLN) enables the agent to navigate to a +remote location following the natural language instruction in 3D environments. +At each navigation step, the agent selects from possible candidate locations +and then makes the move. For better navigation planning, the lookahead +exploration strategy aims to effectively evaluate the agent's next action by +accurately anticipating the future environment of candidate locations. To this +end, some existing works predict RGB images for future environments, while this +strategy suffers from image distortion and high computational cost. To address +these issues, we propose the pre-trained hierarchical neural radiance +representation model (HNR) to produce multi-level semantic features for future +environments, which are more robust and efficient than pixel-wise RGB +reconstruction. Furthermore, with the predicted future environmental +representations, our lookahead VLN model is able to construct the navigable +future path tree and select the optimal path via efficient parallel evaluation. 
+Extensive experiments on the VLN-CE datasets confirm the effectiveness of our +method.",cs.CV,"['cs.CV', 'cs.RO']" +CAT-Seg: Cost Aggregation for Open-vocabulary Semantic Segmentation,Seokju Cho · Heeseong Shin · Sunghwan Hong · Anurag Arnab · Paul Hongsuck Seo · Seungryong Kim, ,,https://openreview.net/forum?id=ZWytHTcnTy,,,,,nan +Living Scenes: Multi-object Relocalization and Reconstruction in Changing 3D Environments,Liyuan Zhu · Shengyu Huang · Konrad Schindler · Iro Armeni,https://www.zhuliyuan.net/livingscenes,https://arxiv.org/abs/2312.09138,,2312.09138.pdf,Living Scenes: Multi-object Relocalization and Reconstruction in Changing 3D Environments,"Research into dynamic 3D scene understanding has primarily focused on +short-term change tracking from dense observations, while little attention has +been paid to long-term changes with sparse observations. We address this gap +with MoRE, a novel approach for multi-object relocalization and reconstruction +in evolving environments. We view these environments as ""living scenes"" and +consider the problem of transforming scans taken at different points in time +into a 3D reconstruction of the object instances, whose accuracy and +completeness increase over time. At the core of our method lies an +SE(3)-equivariant representation in a single encoder-decoder network, trained +on synthetic data. This representation enables us to seamlessly tackle instance +matching, registration, and reconstruction. We also introduce a joint +optimization algorithm that facilitates the accumulation of point clouds +originating from the same instance across multiple scans taken at different +points in time. We validate our method on synthetic and real-world data and +demonstrate state-of-the-art performance in both end-to-end performance and +individual subtasks.",cs.CV,['cs.CV'] +CoDi: Conditional Diffusion Distillation for Higher-Fidelity and Faster Image Generation,Kangfu Mei · Mauricio Delbracio · Hossein Talebi · Zhengzhong Tu · Vishal M. Patel · Peyman Milanfar,https://fast-codi.github.io/,https://arxiv.org/abs/2310.01407,,2310.01407.pdf,CoDi: Conditional Diffusion Distillation for Higher-Fidelity and Faster Image Generation,"Large generative diffusion models have revolutionized text-to-image +generation and offer immense potential for conditional generation tasks such as +image enhancement, restoration, editing, and compositing. However, their +widespread adoption is hindered by the high computational cost, which limits +their real-time application. To address this challenge, we introduce a novel +method dubbed CoDi, that adapts a pre-trained latent diffusion model to accept +additional image conditioning inputs while significantly reducing the sampling +steps required to achieve high-quality results. Our method can leverage +architectures such as ControlNet to incorporate conditioning inputs without +compromising the model's prior knowledge gained during large scale +pre-training. Additionally, a conditional consistency loss enforces consistent +predictions across diffusion steps, effectively compelling the model to +generate high-quality images with conditions in a few steps. 
Our +conditional-task learning and distillation approach outperforms previous +distillation methods, achieving a new state-of-the-art in producing +high-quality images with very few steps (e.g., 1-4) across multiple tasks, +including super-resolution, text-guided image editing, and depth-to-image +generation.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +NeRFDeformer: NeRF Transformation from a Single View via 3D Scene Flows,Zhenggang Tang · Jason Ren · Xiaoming Zhao · Bowen Wen · Jonathan Tremblay · Stan Birchfield · Alexander G. Schwing, ,https://arxiv.org/abs/2405.05010,,2405.05010.pdf,${M^2D}$NeRF: Multi-Modal Decomposition NeRF with 3D Feature Fields,"Neural fields (NeRF) have emerged as a promising approach for representing +continuous 3D scenes. Nevertheless, the lack of semantic encoding in NeRFs +poses a significant challenge for scene decomposition. To address this +challenge, we present a single model, Multi-Modal Decomposition NeRF +(${M^2D}$NeRF), that is capable of both text-based and visual patch-based +edits. Specifically, we use multi-modal feature distillation to integrate +teacher features from pretrained visual and language models into 3D semantic +feature volumes, thereby facilitating consistent 3D editing. To enforce +consistency between the visual and language features in our 3D feature volumes, +we introduce a multi-modal similarity constraint. We also introduce a +patch-based joint contrastive loss that helps to encourage object-regions to +coalesce in the 3D feature space, resulting in more precise boundaries. +Experiments on various real-world scenes show superior performance in 3D scene +decomposition tasks compared to prior NeRF-based methods.",cs.CV,['cs.CV'] +Stationary Representations: Optimally Approximating Compatibility and Implications for Improved Model Replacements,Niccolò Biondi · Federico Pernici · Simone Ricci · Alberto Del Bimbo,https://github.com/miccunifi/iamcl2r,https://arxiv.org/abs/2405.02581,,2405.02581.pdf,Stationary Representations: Optimally Approximating Compatibility and Implications for Improved Model Replacements,"Learning compatible representations enables the interchangeable use of +semantic features as models are updated over time. This is particularly +relevant in search and retrieval systems where it is crucial to avoid +reprocessing of the gallery images with the updated model. While recent +research has shown promising empirical evidence, there is still a lack of +comprehensive theoretical understanding about learning compatible +representations. In this paper, we demonstrate that the stationary +representations learned by the $d$-Simplex fixed classifier optimally +approximate compatibility representation according to the two inequality +constraints of its formal definition. This not only establishes a solid +foundation for future works in this line of research but also presents +implications that can be exploited in practical learning scenarios. An +exemplary application is the now-standard practice of downloading and +fine-tuning new pre-trained models. Specifically, we show the strengths and +critical issues of stationary representations in the case in which a model +undergoing sequential fine-tuning is asynchronously replaced by downloading a +better-performing model pre-trained elsewhere. Such a representation enables +seamless delivery of retrieval service (i.e., no reprocessing of gallery +images) and offers improved performance without operational disruptions during +model replacement. 
Code available at: https://github.com/miccunifi/iamcl2r.",cs.CV,['cs.CV'] +Boosting Flow-based Generative Super-Resolution Models via Learned Prior,Li-Yuan Tsao · Yi-Chen Lo · Chia-Che Chang · Hao-Wei Chen · Roy Tseng · Chien Feng · Chun-Yi Lee,https://github.com/liyuantsao/FlowSR-LP,https://arxiv.org/abs/2403.10988,,2403.10988.pdf,Boosting Flow-based Generative Super-Resolution Models via Learned Prior,"Flow-based super-resolution (SR) models have demonstrated astonishing +capabilities in generating high-quality images. However, these methods +encounter several challenges during image generation, such as grid artifacts, +exploding inverses, and suboptimal results due to a fixed sampling temperature. +To overcome these issues, this work introduces a conditional learned prior to +the inference phase of a flow-based SR model. This prior is a latent code +predicted by our proposed latent module conditioned on the low-resolution +image, which is then transformed by the flow model into an SR image. Our +framework is designed to seamlessly integrate with any contemporary flow-based +SR model without modifying its architecture or pre-trained weights. We evaluate +the effectiveness of our proposed framework through extensive experiments and +ablation analyses. The proposed framework successfully addresses all the +inherent issues in flow-based SR models and enhances their performance in +various SR scenarios. Our code is available at: +https://github.com/liyuantsao/BFSR",cs.CV,"['cs.CV', 'cs.AI']" +Video Frame Interpolation via Direct Synthesis with the Event-based Reference,Yuhan Liu · Yongjian Deng · Hao Chen · Zhen Yang, ,https://arxiv.org/abs/2404.18156,,2404.18156.pdf,Event-based Video Frame Interpolation with Edge Guided Motion Refinement,"Video frame interpolation, the process of synthesizing intermediate frames +between sequential video frames, has made remarkable progress with the use of +event cameras. These sensors, with microsecond-level temporal resolution, fill +information gaps between frames by providing precise motion cues. However, +contemporary Event-Based Video Frame Interpolation (E-VFI) techniques often +neglect the fact that event data primarily supply high-confidence features at +scene edges during multi-modal feature fusion, thereby diminishing the role of +event signals in optical flow (OF) estimation and warping refinement. To +address this overlooked aspect, we introduce an end-to-end E-VFI learning +method (referred to as EGMR) to efficiently utilize edge features from event +signals for motion flow and warping enhancement. Our method incorporates an +Edge Guided Attentive (EGA) module, which rectifies estimated video motion +through attentive aggregation based on the local correlation of multi-modal +features in a coarse-to-fine strategy. Moreover, given that event data can +provide accurate visual references at scene edges between consecutive frames, +we introduce a learned visibility map derived from event data to adaptively +mitigate the occlusion problem in the warping refinement process. 
Extensive +experiments on both synthetic and real datasets show the effectiveness of the +proposed approach, demonstrating its potential for higher quality video frame +interpolation.",cs.CV,['cs.CV'] +Universal Robustness via Median Random Smoothing for Real-World Super-Resolution,Zakariya Chaouai · Mohamed Tamaazousti, ,https://arxiv.org/abs/2405.14934,,2405.14934.pdf,Universal Robustness via Median Randomized Smoothing for Real-World Super-Resolution,"Most of the recent literature on image Super-Resolution (SR) can be +classified into two main approaches. The first one involves learning a +corruption model tailored to a specific dataset, aiming to mimic the noise and +corruption in low-resolution images, such as sensor noise. However, this +approach is data-specific, tends to lack adaptability, and its accuracy +diminishes when faced with unseen types of image corruptions. A second and more +recent approach, referred to as Robust Super-Resolution (RSR), proposes to +improve real-world SR by harnessing the generalization capabilities of a model +by making it robust to adversarial attacks. To delve further into this second +approach, our paper explores the universality of various methods for enhancing +the robustness of deep learning SR models. In other words, we inquire: ""Which +robustness method exhibits the highest degree of adaptability when dealing with +a wide range of adversarial attacks ?"". Our extensive experimentation on both +synthetic and real-world images empirically demonstrates that median randomized +smoothing (MRS) is more general in terms of robustness compared to adversarial +learning techniques, which tend to focus on specific types of attacks. +Furthermore, as expected, we also illustrate that the proposed universal robust +method enables the SR model to handle standard corruptions more effectively, +such as blur and Gaussian noise, and notably, corruptions naturally present in +real-world images. These results support the significance of shifting the +paradigm in the development of real-world SR methods towards RSR, especially +via MRS.",eess.IV,"['eess.IV', 'cs.CV']" +AAMDM: Accelerated Auto-regressive Motion Diffusion Model,Tianyu Li · Calvin Zhuhan Qiao · Ren Guanqiao · KangKang Yin · Sehoon Ha, ,https://arxiv.org/abs/2401.06146,,2401.06146.pdf,AAMDM: Accelerated Auto-regressive Motion Diffusion Model,"Interactive motion synthesis is essential in creating immersive experiences +in entertainment applications, such as video games and virtual reality. +However, generating animations that are both high-quality and contextually +responsive remains a challenge. Traditional techniques in the game industry can +produce high-fidelity animations but suffer from high computational costs and +poor scalability. Trained neural network models alleviate the memory and speed +issues, yet fall short on generating diverse motions. Diffusion models offer +diverse motion synthesis with low memory usage, but require expensive reverse +diffusion processes. This paper introduces the Accelerated Auto-regressive +Motion Diffusion Model (AAMDM), a novel motion synthesis framework designed to +achieve quality, diversity, and efficiency all together. AAMDM integrates +Denoising Diffusion GANs as a fast Generation Module, and an Auto-regressive +Diffusion Model as a Polishing Module. Furthermore, AAMDM operates in a +lower-dimensional embedded space rather than the full-dimensional pose space, +which reduces the training complexity as well as further improves the +performance. 
We show that AAMDM outperforms existing methods in motion quality, +diversity, and runtime efficiency, through comprehensive quantitative analyses +and visual comparisons. We also demonstrate the effectiveness of each +algorithmic component through ablation studies.",cs.CV,"['cs.CV', 'cs.GR']" +SiTH: Single-view Textured Human Reconstruction with Image-Conditioned Diffusion,Hsuan-I Ho · Jie Song · Otmar Hilliges,https://ait.ethz.ch/sith,https://arxiv.org/abs/2311.15855,,2311.15855.pdf,SiTH: Single-view Textured Human Reconstruction with Image-Conditioned Diffusion,"A long-standing goal of 3D human reconstruction is to create lifelike and +fully detailed 3D humans from single-view images. The main challenge lies in +inferring unknown body shapes, appearances, and clothing details in areas not +visible in the images. To address this, we propose SiTH, a novel pipeline that +uniquely integrates an image-conditioned diffusion model into a 3D mesh +reconstruction workflow. At the core of our method lies the decomposition of +the challenging single-view reconstruction problem into generative +hallucination and reconstruction subproblems. For the former, we employ a +powerful generative diffusion model to hallucinate unseen back-view appearance +based on the input images. For the latter, we leverage skinned body meshes as +guidance to recover full-body texture meshes from the input and back-view +images. SiTH requires as few as 500 3D human scans for training while +maintaining its generality and robustness to diverse images. Extensive +evaluations on two 3D human benchmarks, including our newly created one, +highlighted our method's superior accuracy and perceptual quality in 3D +textured human reconstruction. Our code and evaluation benchmark are available +at https://ait.ethz.ch/sith",cs.CV,['cs.CV'] +HUGS: Human Gaussian Splatting,Muhammed Kocabas · Jen-Hao Rick Chang · James Gabriel · Oncel Tuzel · Anurag Ranjan,https://machinelearning.apple.com/research/hugs,https://arxiv.org/abs/2311.17910v1,,2311.17910v1.pdf,HUGS: Human Gaussian Splats,"Recent advances in neural rendering have improved both training and rendering +times by orders of magnitude. While these methods demonstrate state-of-the-art +quality and speed, they are designed for photogrammetry of static scenes and do +not generalize well to freely moving humans in the environment. In this work, +we introduce Human Gaussian Splats (HUGS) that represents an animatable human +together with the scene using 3D Gaussian Splatting (3DGS). Our method takes +only a monocular video with a small number of (50-100) frames, and it +automatically learns to disentangle the static scene and a fully animatable +human avatar within 30 minutes. We utilize the SMPL body model to initialize +the human Gaussians. To capture details that are not modeled by SMPL (e.g. +cloth, hairs), we allow the 3D Gaussians to deviate from the human body model. +Utilizing 3D Gaussians for animated humans brings new challenges, including the +artifacts created when articulating the Gaussians. We propose to jointly +optimize the linear blend skinning weights to coordinate the movements of +individual Gaussians during animation. Our approach enables novel-pose +synthesis of human and novel view synthesis of both the human and the scene. We +achieve state-of-the-art rendering quality with a rendering speed of 60 FPS +while being ~100x faster to train over previous work. 
Our code will be +announced here: https://github.com/apple/ml-hugs",cs.CV,"['cs.CV', 'cs.GR']" +SpiderMatch: 3D Shape Matching with Global Optimality and Geometric Consistency,Paul Roetzer · Florian Bernard, ,https://arxiv.org/abs/2310.08230,,2310.08230.pdf,Fast Discrete Optimisation for Geometrically Consistent 3D Shape Matching,"In this work we propose to combine the advantages of learning-based and +combinatorial formalisms for 3D shape matching. While learning-based shape +matching solutions lead to state-of-the-art matching performance, they do not +ensure geometric consistency, so that obtained matchings are locally unsmooth. +On the contrary, axiomatic methods allow to take geometric consistency into +account by explicitly constraining the space of valid matchings. However, +existing axiomatic formalisms are impractical since they do not scale to +practically relevant problem sizes, or they require user input for the +initialisation of non-convex optimisation problems. In this work we aim to +close this gap by proposing a novel combinatorial solver that combines a unique +set of favourable properties: our approach is (i) initialisation free, (ii) +massively parallelisable powered by a quasi-Newton method, (iii) provides +optimality gaps, and (iv) delivers decreased runtime and globally optimal +results for many instances.",cs.CV,['cs.CV'] +Building Optimal Neural Architectures using Interpretable Knowledge,Keith Mills · Fred Han · Mohammad Salameh · Shengyao Lu · CHUNHUA ZHOU · Jiao He · Fengyu Sun · Di Niu,https://github.com/Ascend-Research/AutoBuild,https://arxiv.org/abs/2403.13293,,2403.13293.pdf,Building Optimal Neural Architectures using Interpretable Knowledge,"Neural Architecture Search is a costly practice. The fact that a search space +can span a vast number of design choices with each architecture evaluation +taking nontrivial overhead makes it hard for an algorithm to sufficiently +explore candidate networks. In this paper, we propose AutoBuild, a scheme which +learns to align the latent embeddings of operations and architecture modules +with the ground-truth performance of the architectures they appear in. By doing +so, AutoBuild is capable of assigning interpretable importance scores to +architecture modules, such as individual operation features and larger macro +operation sequences such that high-performance neural networks can be +constructed without any need for search. Through experiments performed on +state-of-the-art image classification, segmentation, and Stable Diffusion +models, we show that by mining a relatively small set of evaluated +architectures, AutoBuild can learn to build high-quality architectures directly +or help to reduce search space to focus on relevant areas, finding better +architectures that outperform both the original labeled ones and ones found by +search baselines. Code available at +https://github.com/Ascend-Research/AutoBuild",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +Using Human Feedback to Fine-tune Diffusion Models without Any Reward Model,Kai Yang · Jian Tao · Jiafei Lyu · Chunjiang Ge · Jiaxin Chen · Weihan Shen · Xiaolong Zhu · Xiu Li,https://github.com/yk7333/d3po/,https://arxiv.org/abs/2311.13231,,2311.13231.pdf,Using Human Feedback to Fine-tune Diffusion Models without Any Reward Model,"Using reinforcement learning with human feedback (RLHF) has shown significant +promise in fine-tuning diffusion models. 
Previous methods start by training a +reward model that aligns with human preferences, then leverage RL techniques to +fine-tune the underlying models. However, crafting an efficient reward model +demands extensive datasets, optimal architecture, and manual hyperparameter +tuning, making the process both time and cost-intensive. The direct preference +optimization (DPO) method, effective in fine-tuning large language models, +eliminates the necessity for a reward model. However, the extensive GPU memory +requirement of the diffusion model's denoising process hinders the direct +application of the DPO method. To address this issue, we introduce the Direct +Preference for Denoising Diffusion Policy Optimization (D3PO) method to +directly fine-tune diffusion models. The theoretical analysis demonstrates that +although D3PO omits training a reward model, it effectively functions as the +optimal reward model trained using human feedback data to guide the learning +process. This approach requires no training of a reward model, proving to be +more direct, cost-effective, and minimizing computational overhead. In +experiments, our method uses the relative scale of objectives as a proxy for +human preference, delivering comparable results to methods using ground-truth +rewards. Moreover, D3PO demonstrates the ability to reduce image distortion +rates and generate safer images, overcoming challenges lacking robust reward +models. Our code is publicly available at https://github.com/yk7333/D3PO.",cs.LG,"['cs.LG', 'cs.AI', 'cs.CV']" +X-MIC: Cross-Modal Instance Conditioning for Egocentric Action Generalization,Anna Kukleva · Fadime Sener · Edoardo Remelli · Bugra Tekin · Eric Sauser · Bernt Schiele · Shugao Ma, ,https://arxiv.org/abs/2403.19811,,2403.19811.pdf,X-MIC: Cross-Modal Instance Conditioning for Egocentric Action Generalization,"Lately, there has been growing interest in adapting vision-language models +(VLMs) to image and third-person video classification due to their success in +zero-shot recognition. However, the adaptation of these models to egocentric +videos has been largely unexplored. To address this gap, we propose a simple +yet effective cross-modal adaptation framework, which we call X-MIC. Using a +video adapter, our pipeline learns to align frozen text embeddings to each +egocentric video directly in the shared embedding space. Our novel adapter +architecture retains and improves generalization of the pre-trained VLMs by +disentangling learnable temporal modeling and frozen visual encoder. This +results in an enhanced alignment of text embeddings to each egocentric video, +leading to a significant improvement in cross-dataset generalization. We +evaluate our approach on the Epic-Kitchens, Ego4D, and EGTEA datasets for +fine-grained cross-dataset action generalization, demonstrating the +effectiveness of our method. Code is available at +https://github.com/annusha/xmic",cs.CV,['cs.CV'] +MART: Masked Affective RepresenTation Learning via Masked Temporal Distribution Distillation,Zhicheng Zhang · Pancheng Zhao · Eunil Park · Jufeng Yang, ,https://arxiv.org/abs/2306.15876,,2306.15876.pdf,Hybrid Distillation: Connecting Masked Autoencoders with Contrastive Learners,"Representation learning has been evolving from traditional supervised +training to Contrastive Learning (CL) and Masked Image Modeling (MIM). 
Previous +works have demonstrated their pros and cons in specific scenarios, i.e., CL and +supervised pre-training excel at capturing longer-range global patterns and +enabling better feature discrimination, while MIM can introduce more local and +diverse attention across all transformer layers. In this paper, we explore how +to obtain a model that combines their strengths. We start by examining previous +feature distillation and mask feature reconstruction methods and identify their +limitations. We find that their increasing diversity mainly derives from the +asymmetric designs, but these designs may in turn compromise the discrimination +ability. In order to better obtain both discrimination and diversity, we +propose a simple but effective Hybrid Distillation strategy, which utilizes +both the supervised/CL teacher and the MIM teacher to jointly guide the student +model. Hybrid Distill imitates the token relations of the MIM teacher to +alleviate attention collapse, as well as distills the feature maps of the +supervised/CL teacher to enable discrimination. Furthermore, a progressive +redundant token masking strategy is also utilized to reduce the distilling +costs and avoid falling into local optima. Experiment results prove that Hybrid +Distill can achieve superior performance on different benchmarks.",cs.CV,['cs.CV'] +Monocular Identity-Conditioned Facial Reflectance Reconstruction,Xingyu Ren · Jiankang Deng · Yuhao Cheng · Jia Guo · Chao Ma · Yichao Yan · Wenhan Zhu · Xiaokang Yang,https://xingyuren.github.io/id2reflectance/,https://arxiv.org/abs/2404.00301,,2404.00301.pdf,Monocular Identity-Conditioned Facial Reflectance Reconstruction,"Recent 3D face reconstruction methods have made remarkable advancements, yet +there remain huge challenges in monocular high-quality facial reflectance +reconstruction. Existing methods rely on a large amount of light-stage captured +data to learn facial reflectance models. However, the lack of subject diversity +poses challenges in achieving good generalization and widespread applicability. +In this paper, we learn the reflectance prior in image space rather than UV +space and present a framework named ID2Reflectance. Our framework can directly +estimate the reflectance maps of a single image while using limited reflectance +data for training. Our key insight is that reflectance data shares facial +structures with RGB faces, which enables obtaining expressive facial prior from +inexpensive RGB data thus reducing the dependency on reflectance data. We first +learn a high-quality prior for facial reflectance. Specifically, we pretrain +multi-domain facial feature codebooks and design a codebook fusion method to +align the reflectance and RGB domains. Then, we propose an identity-conditioned +swapping module that injects facial identity from the target image into the +pre-trained autoencoder to modify the identity of the source reflectance image. +Finally, we stitch multi-view swapped reflectance images to obtain renderable +assets. Extensive experiments demonstrate that our method exhibits excellent +generalization capability and achieves state-of-the-art facial reflectance +reconstruction results for in-the-wild faces. 
Our project page is +https://xingyuren.github.io/id2reflectance/.",cs.CV,['cs.CV'] +CFPL-FAS: Class Free Prompt Learning for Generalizable Face Anti-spoofing,Ajian Liu · Shuai Xue · Gan Jianwen · Jun Wan · Yanyan Liang · Jiankang Deng · Sergio Escalera · Zhen Lei, ,https://arxiv.org/abs/2403.14333,,2403.14333.pdf,CFPL-FAS: Class Free Prompt Learning for Generalizable Face Anti-spoofing,"Domain generalization (DG) based Face Anti-Spoofing (FAS) aims to improve the +model's performance on unseen domains. Existing methods either rely on domain +labels to align domain-invariant feature spaces, or disentangle generalizable +features from the whole sample, which inevitably lead to the distortion of +semantic feature structures and achieve limited generalization. In this work, +we make use of large-scale VLMs like CLIP and leverage the textual feature to +dynamically adjust the classifier's weights for exploring generalizable visual +features. Specifically, we propose a novel Class Free Prompt Learning (CFPL) +paradigm for DG FAS, which utilizes two lightweight transformers, namely +Content Q-Former (CQF) and Style Q-Former (SQF), to learn the different +semantic prompts conditioned on content and style features by using a set of +learnable query vectors, respectively. Thus, the generalizable prompt can be +learned by two improvements: (1) A Prompt-Text Matched (PTM) supervision is +introduced to ensure CQF learns visual representation that is most informative +of the content description. (2) A Diversified Style Prompt (DSP) technology is +proposed to diversify the learning of style prompts by mixing feature +statistics between instance-specific styles. Finally, the learned text features +modulate visual features to generalization through the designed Prompt +Modulation (PM). Extensive experiments show that the CFPL is effective and +outperforms the state-of-the-art methods on several cross-domain datasets.",cs.CV,['cs.CV'] +BEM: Balanced and Entropy-based Mix for Long-Tailed Semi-Supervised Learning,Hongwei Zheng · Linyuan Zhou · Han Li · Jinming Su · Xiaoming Wei · Xu Xiaoming, ,https://arxiv.org/abs/2404.01179,,2404.01179.pdf,BEM: Balanced and Entropy-based Mix for Long-Tailed Semi-Supervised Learning,"Data mixing methods play a crucial role in semi-supervised learning (SSL), +but their application is unexplored in long-tailed semi-supervised learning +(LTSSL). The primary reason is that the in-batch mixing manner fails to address +class imbalance. Furthermore, existing LTSSL methods mainly focus on +re-balancing data quantity but ignore class-wise uncertainty, which is also +vital for class balance. For instance, some classes with sufficient samples +might still exhibit high uncertainty due to indistinguishable features. To this +end, this paper introduces the Balanced and Entropy-based Mix (BEM), a +pioneering mixing approach to re-balance the class distribution of both data +quantity and uncertainty. Specifically, we first propose a class balanced mix +bank to store data of each class for mixing. This bank samples data based on +the estimated quantity distribution, thus re-balancing data quantity. Then, we +present an entropy-based learning approach to re-balance class-wise +uncertainty, including entropy-based sampling strategy, entropy-based selection +module, and entropy-based class balanced loss. Our BEM first leverages data +mixing for improving LTSSL, and it can also serve as a complement to the +existing re-balancing methods. 
Experimental results show that BEM significantly +enhances various LTSSL frameworks and achieves state-of-the-art performances +across multiple benchmarks.",cs.CV,"['cs.CV', 'cs.LG']" +Relightable Gaussian Codec Avatars,Shunsuke Saito · Gabriel Schwartz · Tomas Simon · Junxuan Li · Giljoo Nam, ,https://arxiv.org/abs/2312.03704,,2312.03704.pdf,Relightable Gaussian Codec Avatars,"The fidelity of relighting is bounded by both geometry and appearance +representations. For geometry, both mesh and volumetric approaches have +difficulty modeling intricate structures like 3D hair geometry. For appearance, +existing relighting models are limited in fidelity and often too slow to render +in real-time with high-resolution continuous environments. In this work, we +present Relightable Gaussian Codec Avatars, a method to build high-fidelity +relightable head avatars that can be animated to generate novel expressions. +Our geometry model based on 3D Gaussians can capture 3D-consistent +sub-millimeter details such as hair strands and pores on dynamic face +sequences. To support diverse materials of human heads such as the eyes, skin, +and hair in a unified manner, we present a novel relightable appearance model +based on learnable radiance transfer. Together with global illumination-aware +spherical harmonics for the diffuse components, we achieve real-time relighting +with all-frequency reflections using spherical Gaussians. This appearance model +can be efficiently relit under both point light and continuous illumination. We +further improve the fidelity of eye reflections and enable explicit gaze +control by introducing relightable explicit eye models. Our method outperforms +existing approaches without compromising real-time performance. We also +demonstrate real-time relighting of avatars on a tethered consumer VR headset, +showcasing the efficiency and fidelity of our avatars.",cs.GR,"['cs.GR', 'cs.CV']" +4D-DRESS: A 4D Dataset of Real-World Human Clothing With Semantic Annotations,Wenbo Wang · Hsuan-I Ho · Chen Guo · Boxiang Rong · Artur Grigorev · Jie Song · Juan Jose Zarate · Otmar Hilliges,https://ait.ethz.ch/4d-dress,https://arxiv.org/abs/2404.18630,,2404.18630.pdf,4D-DRESS: A 4D Dataset of Real-world Human Clothing with Semantic Annotations,"The studies of human clothing for digital avatars have predominantly relied +on synthetic datasets. While easy to collect, synthetic data often fall short +in realism and fail to capture authentic clothing dynamics. Addressing this +gap, we introduce 4D-DRESS, the first real-world 4D dataset advancing human +clothing research with its high-quality 4D textured scans and garment meshes. +4D-DRESS captures 64 outfits in 520 human motion sequences, amounting to 78k +textured scans. Creating a real-world clothing dataset is challenging, +particularly in annotating and segmenting the extensive and complex 4D human +scans. To address this, we develop a semi-automatic 4D human parsing pipeline. +We efficiently combine a human-in-the-loop process with automation to +accurately label 4D scans in diverse garments and body movements. Leveraging +precise annotations and high-quality garment meshes, we establish several +benchmarks for clothing simulation and reconstruction. 4D-DRESS offers +realistic and challenging data that complements synthetic sources, paving the +way for advancements in research of lifelike human clothing. 
Website: +https://ait.ethz.ch/4d-dress.",cs.CV,['cs.CV'] +DRESS: Instructing Large Vision-Language Models to Align and Interact with Humans via Natural Language Feedback,Yangyi Chen · Karan Sikka · Michael Cogswell · Heng Ji · Ajay Divakaran, ,https://arxiv.org/abs/2311.10081,,2311.10081.pdf,DRESS: Instructing Large Vision-Language Models to Align and Interact with Humans via Natural Language Feedback,"We present DRESS, a large vision language model (LVLM) that innovatively +exploits Natural Language feedback (NLF) from Large Language Models to enhance +its alignment and interactions by addressing two key limitations in the +state-of-the-art LVLMs. First, prior LVLMs generally rely only on the +instruction finetuning stage to enhance alignment with human preferences. +Without incorporating extra feedback, they are still prone to generate +unhelpful, hallucinated, or harmful responses. Second, while the visual +instruction tuning data is generally structured in a multi-turn dialogue +format, the connections and dependencies among consecutive conversational turns +are weak. This reduces the capacity for effective multi-turn interactions. To +tackle these, we propose a novel categorization of the NLF into two key types: +critique and refinement. The critique NLF identifies the strengths and +weaknesses of the responses and is used to align the LVLMs with human +preferences. The refinement NLF offers concrete suggestions for improvement and +is adopted to improve the interaction ability of the LVLMs-- which focuses on +LVLMs' ability to refine responses by incorporating feedback in multi-turn +interactions. To address the non-differentiable nature of NLF, we generalize +conditional reinforcement learning for training. Our experimental results +demonstrate that DRESS can generate more helpful (9.76%), honest (11.52%), and +harmless (21.03%) responses, and more effectively learn from feedback during +multi-turn interactions compared to SOTA LVMLs.",cs.CV,"['cs.CV', 'cs.CL', 'cs.LG']" +Gaussian Shadow Casting for Neural Characters,Luis Bolanos · Shih-Yang Su · Helge Rhodin, ,https://arxiv.org/abs/2401.06116v1,,2401.06116v1.pdf,Gaussian Shadow Casting for Neural Characters,"Neural character models can now reconstruct detailed geometry and texture +from video, but they lack explicit shadows and shading, leading to artifacts +when generating novel views and poses or during relighting. It is particularly +difficult to include shadows as they are a global effect and the required +casting of secondary rays is costly. We propose a new shadow model using a +Gaussian density proxy that replaces sampling with a simple analytic formula. +It supports dynamic motion and is tailored for shadow computation, thereby +avoiding the affine projection approximation and sorting required by the +closely related Gaussian splatting. Combined with a deferred neural rendering +model, our Gaussian shadows enable Lambertian shading and shadow casting with +minimal overhead. We demonstrate improved reconstructions, with better +separation of albedo, shading, and shadows in challenging outdoor scenes with +direct sun light and hard shadows. Our method is able to optimize the light +direction without any input from the user. 
As a result, novel poses have fewer +shadow artifacts and relighting in novel scenes is more realistic compared to +the state-of-the-art methods, providing new ways to pose neural characters in +novel environments, increasing their applicability.",cs.CV,['cs.CV'] +CLOVA: A Closed-Loop Visual Assistant with Tool Usage and Update,Zhi Gao · Yuntao Du. · Xintong Zhang · Xiaojian Ma · Wenjuan Han · Song-Chun Zhu · Qing Li, ,https://arxiv.org/abs/2312.10908,,2312.10908.pdf,CLOVA: A Closed-Loop Visual Assistant with Tool Usage and Update,"Utilizing large language models (LLMs) to compose off-the-shelf visual tools +represents a promising avenue of research for developing robust visual +assistants capable of addressing diverse visual tasks. However, these methods +often overlook the potential for continual learning, typically by freezing the +utilized tools, thus limiting their adaptation to environments requiring new +knowledge. To tackle this challenge, we propose CLOVA, a Closed-Loop Visual +Assistant, which operates within a framework encompassing inference, +reflection, and learning phases. During the inference phase, LLMs generate +programs and execute corresponding tools to complete assigned tasks. In the +reflection phase, a multimodal global-local reflection scheme analyzes human +feedback to determine which tools require updating. Lastly, the learning phase +employs three flexible approaches to automatically gather training data and +introduces a novel prompt tuning scheme to update the tools, allowing CLOVA to +efficiently acquire new knowledge. Experimental findings demonstrate that CLOVA +surpasses existing tool-usage methods by 5% in visual question answering and +multiple-image reasoning, by 10% in knowledge tagging, and by 20% in image +editing. These results underscore the significance of the continual learning +capability in general visual assistants.",cs.CV,['cs.CV'] +Enhancing the Power of OOD Detection via Sample-Aware Model Selection,Feng Xue · Zi He · Yuan Zhang · Chuanlong Xie · Zhenguo Li · Falong Tan, ,,https://www.youtube.com/watch?v=XNso9qsWxHo,,,,,nan +Domain-Agnostic Mutual Prompting for Unsupervised Domain Adaptation,Zhekai Du · Xinyao Li · Fengling Li · Ke Lu · Lei Zhu · Jingjing Li,https://github.com/TL-UESTC/DAMP,https://arxiv.org/abs/2403.02899,,2403.02899.pdf,Domain-Agnostic Mutual Prompting for Unsupervised Domain Adaptation,"Conventional Unsupervised Domain Adaptation (UDA) strives to minimize +distribution discrepancy between domains, which neglects to harness rich +semantics from data and struggles to handle complex domain shifts. A promising +technique is to leverage the knowledge of large-scale pre-trained +vision-language models for more guided adaptation. Despite some endeavors, +current methods often learn textual prompts to embed domain semantics for +source and target domains separately and perform classification within each +domain, limiting cross-domain knowledge transfer. Moreover, prompting only the +language branch lacks flexibility to adapt both modalities dynamically. To +bridge this gap, we propose Domain-Agnostic Mutual Prompting (DAMP) to exploit +domain-invariant semantics by mutually aligning visual and textual embeddings. +Specifically, the image contextual information is utilized to prompt the +language branch in a domain-agnostic and instance-conditioned way. Meanwhile, +visual prompts are imposed based on the domain-agnostic textual prompt to +elicit domain-invariant visual embeddings. 
These two branches of prompts are +learned mutually with a cross-attention module and regularized with a +semantic-consistency loss and an instance-discrimination contrastive loss. +Experiments on three UDA benchmarks demonstrate the superiority of DAMP over +state-of-the-art approaches.",cs.AI,['cs.AI'] +DiffMOT: A Real-time Diffusion-based Multiple Object Tracker with Non-linear Prediction,Weiyi Lv · Yuhang Huang · NING Zhang · Ruei-Sung Lin · Mei Han · Dan Zeng,https://diffmot.github.io/,https://arxiv.org/abs/2403.02075,,2403.02075.pdf,DiffMOT: A Real-time Diffusion-based Multiple Object Tracker with Non-linear Prediction,"In Multiple Object Tracking, objects often exhibit non-linear motion of +acceleration and deceleration, with irregular direction changes. +Tacking-by-detection (TBD) trackers with Kalman Filter motion prediction work +well in pedestrian-dominant scenarios but fall short in complex situations when +multiple objects perform non-linear and diverse motion simultaneously. To +tackle the complex non-linear motion, we propose a real-time diffusion-based +MOT approach named DiffMOT. Specifically, for the motion predictor component, +we propose a novel Decoupled Diffusion-based Motion Predictor (D$^2$MP). It +models the entire distribution of various motion presented by the data as a +whole. It also predicts an individual object's motion conditioning on an +individual's historical motion information. Furthermore, it optimizes the +diffusion process with much fewer sampling steps. As a MOT tracker, the DiffMOT +is real-time at 22.7FPS, and also outperforms the state-of-the-art on +DanceTrack and SportsMOT datasets with $62.3\%$ and $76.2\%$ in HOTA metrics, +respectively. To the best of our knowledge, DiffMOT is the first to introduce a +diffusion probabilistic model into the MOT to tackle non-linear motion +prediction.",cs.CV,['cs.CV'] +Dynamic Support Information Mining for Category-Agnostic Pose Estimation,Pengfei Ren · Yuanyuan Gao · Haifeng Sun · Qi Qi · Jingyu Wang · Jianxin Liao, ,https://arxiv.org/abs/2403.13647,,,Meta-Point Learning and Refining for Category-Agnostic Pose Estimation,"Category-agnostic pose estimation (CAPE) aims to predict keypoints for +arbitrary classes given a few support images annotated with keypoints. Existing +methods only rely on the features extracted at support keypoints to predict or +refine the keypoints on query image, but a few support feature vectors are +local and inadequate for CAPE. Considering that human can quickly perceive +potential keypoints of arbitrary objects, we propose a novel framework for CAPE +based on such potential keypoints (named as meta-points). Specifically, we +maintain learnable embeddings to capture inherent information of various +keypoints, which interact with image feature maps to produce meta-points +without any support. The produced meta-points could serve as meaningful +potential keypoints for CAPE. Due to the inevitable gap between inherency and +annotation, we finally utilize the identities and details offered by support +keypoints to assign and refine meta-points to desired keypoints in query image. +In addition, we propose a progressive deformable point decoder and a slacked +regression loss for better prediction and supervision. Our novel framework not +only reveals the inherency of keypoints but also outperforms existing methods +of CAPE. 
Comprehensive experiments and in-depth studies on large-scale MP-100 +dataset demonstrate the effectiveness of our framework.",cs.CV,['cs.CV'] +Implicit Event-RGBD Neural SLAM,Delin Qu · Chi Yan · Dong Wang · Jie Yin · Qizhi Chen · Dan Xu · Yiting Zhang · Bin Zhao · Xuelong Li,https://delinqu.github.io/EN-SLAM,https://arxiv.org/abs/2311.11013,,2311.11013.pdf,Implicit Event-RGBD Neural SLAM,"Implicit neural SLAM has achieved remarkable progress recently. Nevertheless, +existing methods face significant challenges in non-ideal scenarios, such as +motion blur or lighting variation, which often leads to issues like convergence +failures, localization drifts, and distorted mapping. To address these +challenges, we propose EN-SLAM, the first event-RGBD implicit neural SLAM +framework, which effectively leverages the high rate and high dynamic range +advantages of event data for tracking and mapping. Specifically, EN-SLAM +proposes a differentiable CRF (Camera Response Function) rendering technique to +generate distinct RGB and event camera data via a shared radiance field, which +is optimized by learning a unified implicit representation with the captured +event and RGBD supervision. Moreover, based on the temporal difference property +of events, we propose a temporal aggregating optimization strategy for the +event joint tracking and global bundle adjustment, capitalizing on the +consecutive difference constraints of events, significantly enhancing tracking +accuracy and robustness. Finally, we construct the simulated dataset +DEV-Indoors and real captured dataset DEV-Reals containing 6 scenes, 17 +sequences with practical motion blur and lighting changes for evaluations. +Experimental results show that our method outperforms the SOTA methods in both +tracking ATE and mapping ACC with a real-time 17 FPS in various challenging +environments. Project page: https://delinqu.github.io/EN-SLAM.",cs.CV,['cs.CV'] +DiffCast: A Unified Framework via Residual Diffusion for Precipitation Nowcasting,Demin Yu · Xutao Li · Yunming Ye · Baoquan Zhang · Luo Chuyao · Kuai Dai · wangrui · Chenxunlai, ,https://arxiv.org/abs/2312.06734,,2312.06734.pdf,DiffCast: A Unified Framework via Residual Diffusion for Precipitation Nowcasting,"Precipitation nowcasting is an important spatio-temporal prediction task to +predict the radar echoes sequences based on current observations, which can +serve both meteorological science and smart city applications. Due to the +chaotic evolution nature of the precipitation systems, it is a very challenging +problem. Previous studies address the problem either from the perspectives of +deterministic modeling or probabilistic modeling. However, their predictions +suffer from the blurry, high-value echoes fading away and position inaccurate +issues. The root reason of these issues is that the chaotic evolutionary +precipitation systems are not appropriately modeled. Inspired by the nature of +the systems, we propose to decompose and model them from the perspective of +global deterministic motion and local stochastic variations with residual +mechanism. A unified and flexible framework that can equip any type of +spatio-temporal models is proposed based on residual diffusion, which +effectively tackles the shortcomings of previous methods. Extensive +experimental results on four publicly available radar datasets demonstrate the +effectiveness and superiority of the proposed framework, compared to +state-of-the-art techniques. 
Our code is publicly available at +https://github.com/DeminYu98/DiffCast.",cs.CV,['cs.CV'] +Zero-shot Referring Expression Comprehension via Structural Similarity Between Images and Captions,Zeyu Han · Fangrui Zhu · Qianru Lao · Huaizu Jiang, ,https://arxiv.org/abs/2311.17048,,2311.17048.pdf,Zero-shot Referring Expression Comprehension via Structural Similarity Between Images and Captions,"Zero-shot referring expression comprehension aims at localizing bounding +boxes in an image corresponding to provided textual prompts, which requires: +(i) a fine-grained disentanglement of complex visual scene and textual context, +and (ii) a capacity to understand relationships among disentangled entities. +Unfortunately, existing large vision-language alignment (VLA) models, e.g., +CLIP, struggle with both aspects so cannot be directly used for this task. To +mitigate this gap, we leverage large foundation models to disentangle both +images and texts into triplets in the format of (subject, predicate, object). +After that, grounding is accomplished by calculating the structural similarity +matrix between visual and textual triplets with a VLA model, and subsequently +propagate it to an instance-level similarity matrix. Furthermore, to equip VLA +models with the ability of relationship understanding, we design a +triplet-matching objective to fine-tune the VLA models on a collection of +curated dataset containing abundant entity relationships. Experiments +demonstrate that our visual grounding performance increase of up to 19.5% over +the SOTA zero-shot model on RefCOCO/+/g. On the more challenging Who's Waldo +dataset, our zero-shot approach achieves comparable accuracy to the fully +supervised model. Code is available at +https://github.com/Show-han/Zeroshot_REC.",cs.CV,['cs.CV'] +Self-Supervised Multi-Object Tracking with Path Consistency,Zijia Lu · Bing Shuai · Yanbei Chen · Zhenlin Xu · Davide Modolo, ,https://arxiv.org/abs/2404.05136,,2404.05136.pdf,Self-Supervised Multi-Object Tracking with Path Consistency,"In this paper, we propose a novel concept of path consistency to learn robust +object matching without using manual object identity supervision. Our key idea +is that, to track a object through frames, we can obtain multiple different +association results from a model by varying the frames it can observe, i.e., +skipping frames in observation. As the differences in observations do not alter +the identities of objects, the obtained association results should be +consistent. Based on this rationale, we generate multiple observation paths, +each specifying a different set of frames to be skipped, and formulate the Path +Consistency Loss that enforces the association results are consistent across +different observation paths. We use the proposed loss to train our object +matching model with only self-supervision. 
By extensive experiments on three +tracking datasets (MOT17, PersonPath22, KITTI), we demonstrate that our method +outperforms existing unsupervised methods with consistent margins on various +evaluation metrics, and even achieves performance close to supervised methods.",cs.CV,"['cs.CV', 'cs.AI']" +Correcting Diffusion Generation through Resampling,Yujian Liu · Yang Zhang · Tommi Jaakkola · Shiyu Chang, ,https://arxiv.org/abs/2312.06038,,2312.06038.pdf,Correcting Diffusion Generation through Resampling,"Despite diffusion models' superior capabilities in modeling complex +distributions, there are still non-trivial distributional discrepancies between +generated and ground-truth images, which has resulted in several notable +problems in image generation, including missing object errors in text-to-image +generation and low image quality. Existing methods that attempt to address +these problems mostly do not tend to address the fundamental cause behind these +problems, which is the distributional discrepancies, and hence achieve +sub-optimal results. In this paper, we propose a particle filtering framework +that can effectively address both problems by explicitly reducing the +distributional discrepancies. Specifically, our method relies on a set of +external guidance, including a small set of real images and a pre-trained +object detector, to gauge the distribution gap, and then design the resampling +weight accordingly to correct the gap. Experiments show that our methods can +effectively correct missing object errors and improve image quality in various +image generation tasks. Notably, our method outperforms the existing strongest +baseline by 5% in object occurrence and 1.0 in FID on MS-COCO. Our code is +publicly available at +https://github.com/UCSB-NLP-Chang/diffusion_resampling.git.",cs.CV,"['cs.CV', 'cs.LG']" +Exploring Orthogonality in Open World Object Detection,Zhicheng Sun · Jinghan Li · Yadong Mu,https://github.com/feifeiobama/OrthogonalDet,,https://www.youtube.com/watch?v=fNDF2pIWbmM,,,,,nan +YolOOD: Utilizing Object Detection Concepts for Multi-Label Out-of-Distribution Detection,Alon Zolfi · Guy AmiT · Amit Baras · Satoru Koda · Ikuya Morikawa · Yuval Elovici · Asaf Shabtai, ,https://arxiv.org/abs/2402.18162,,2402.18162.pdf,Out-of-Distribution Detection using Neural Activation Prior,"Out-of-distribution detection (OOD) is a crucial technique for deploying +machine learning models in the real world to handle the unseen scenarios. In +this paper, we first propose a simple yet effective Neural Activation Prior +(NAP) for OOD detection. Our neural activation prior is based on a key +observation that, for a channel before the global pooling layer of a fully +trained neural network, the probability of a few neurons being activated with a +large response by an in-distribution (ID) sample is significantly higher than +that by an OOD sample. An intuitive explanation is that for a model fully +trained on ID dataset, each channel would play a role in detecting a certain +pattern in the ID dataset, and a few neurons can be activated with a large +response when the pattern is detected in an input sample. Then, a new scoring +function based on this prior is proposed to highlight the role of these +strongly activated neurons in OOD detection. Our approach is plug-and-play and +does not lead to any performance degradation on ID data classification and +requires no extra training or statistics from training or external datasets. 
+Notice that previous methods primarily rely on post-global-pooling features of +the neural networks, while the within-channel distribution information we +leverage would be discarded by the global pooling operator. Consequently, our +method is orthogonal to existing approaches and can be effectively combined +with them in various applications. Experimental results show that our method +achieves the state-of-the-art performance on CIFAR benchmark and ImageNet +dataset, which demonstrates the power of the proposed prior. Finally, we extend +our method to Transformers and the experimental findings indicate that NAP can +also significantly enhance the performance of OOD detection on Transformers, +thereby demonstrating the broad applicability of this prior knowledge.",cs.CV,['cs.CV'] +3DSFLabelling: Boosting 3D Scene Flow Estimation by Pseudo Auto-labelling,Chaokang Jiang · Guangming Wang · Jiuming Liu · Hesheng Wang · Zhuang Ma · Zhenqiang Liu · LIANG · Yi Shan · Dalong Du,https://jiangchaokang.github.io/3DSFLabelling-Page/,https://arxiv.org/abs/2402.18146,,2402.18146.pdf,3DSFLabelling: Boosting 3D Scene Flow Estimation by Pseudo Auto-labelling,"Learning 3D scene flow from LiDAR point clouds presents significant +difficulties, including poor generalization from synthetic datasets to real +scenes, scarcity of real-world 3D labels, and poor performance on real sparse +LiDAR point clouds. We present a novel approach from the perspective of +auto-labelling, aiming to generate a large number of 3D scene flow pseudo +labels for real-world LiDAR point clouds. Specifically, we employ the +assumption of rigid body motion to simulate potential object-level rigid +movements in autonomous driving scenarios. By updating different motion +attributes for multiple anchor boxes, the rigid motion decomposition is +obtained for the whole scene. Furthermore, we developed a novel 3D scene flow +data augmentation method for global and local motion. By perfectly synthesizing +target point clouds based on augmented motion parameters, we easily obtain lots +of 3D scene flow labels in point clouds highly consistent with real scenarios. +On multiple real-world datasets including LiDAR KITTI, nuScenes, and Argoverse, +our method outperforms all previous supervised and unsupervised methods without +requiring manual labelling. Impressively, our method achieves a tenfold +reduction in EPE3D metric on the LiDAR KITTI dataset, reducing it from $0.190m$ +to a mere $0.008m$ error.",cs.CV,['cs.CV'] +Unsupervised Semantic Segmentation Through Depth-Guided Feature Correlation and Sampling,Leon Sick · Dominik Engel · Pedro Hermosilla · Timo Ropinski,https://leonsick.github.io/depthg/,https://arxiv.org/abs/2309.12378,,2309.12378.pdf,Unsupervised Semantic Segmentation Through Depth-Guided Feature Correlation and Sampling,"Traditionally, training neural networks to perform semantic segmentation +required expensive human-made annotations. But more recently, advances in the +field of unsupervised learning have made significant progress on this issue and +towards closing the gap to supervised algorithms. To achieve this, semantic +knowledge is distilled by learning to correlate randomly sampled features from +images across an entire dataset. In this work, we build upon these advances by +incorporating information about the structure of the scene into the training +process through the use of depth information. 
We achieve this by (1) learning +depth-feature correlation by spatially correlate the feature maps with the +depth maps to induce knowledge about the structure of the scene and (2) +implementing farthest-point sampling to more effectively select relevant +features by utilizing 3D sampling techniques on depth information of the scene. +Finally, we demonstrate the effectiveness of our technical contributions +through extensive experimentation and present significant improvements in +performance across multiple benchmark datasets.",cs.CV,['cs.CV'] +Super-Resolution Reconstruction from Bayer-Pattern Spike Streams,Yanchen Dong · Ruiqin Xiong · Jian Zhang · Zhaofei Yu · Xiaopeng Fan · Shuyuan Zhu · Tiejun Huang,https://github.com/csycdong/CSCSR,,https://ojs.aaai.org/index.php/AAAI/article/view/27924,,,,,nan +Random Entangled Tokens for Adversarially Robust Vision Transformer,Huihui Gong · Minjing Dong · Siqi Ma · Seyit Camtepe · Surya Nepal · Chang Xu, ,https://arxiv.org/abs/2402.07183,,2402.07183.pdf,A Random Ensemble of Encrypted Vision Transformers for Adversarially Robust Defense,"Deep neural networks (DNNs) are well known to be vulnerable to adversarial +examples (AEs). In previous studies, the use of models encrypted with a secret +key was demonstrated to be robust against white-box attacks, but not against +black-box ones. In this paper, we propose a novel method using the vision +transformer (ViT) that is a random ensemble of encrypted models for enhancing +robustness against both white-box and black-box attacks. In addition, a +benchmark attack method, called AutoAttack, is applied to models to test +adversarial robustness objectively. In experiments, the method was demonstrated +to be robust against not only white-box attacks but also black-box ones in an +image classification task on the CIFAR-10 and ImageNet datasets. The method was +also compared with the state-of-the-art in a standardized benchmark for +adversarial robustness, RobustBench, and it was verified to outperform +conventional defenses in terms of clean accuracy and robust accuracy.",cs.AI,['cs.AI'] +PhotoMaker: Customizing Realistic Human Photos via Stacked ID Embedding,Zhen Li · Mingdeng Cao · Xintao Wang · Zhongang Qi · Ming-Ming Cheng · Ying Shan,https://github.com/TencentARC/PhotoMaker,https://arxiv.org/abs/2312.04461,,2312.04461.pdf,PhotoMaker: Customizing Realistic Human Photos via Stacked ID Embedding,"Recent advances in text-to-image generation have made remarkable progress in +synthesizing realistic human photos conditioned on given text prompts. However, +existing personalized generation methods cannot simultaneously satisfy the +requirements of high efficiency, promising identity (ID) fidelity, and flexible +text controllability. In this work, we introduce PhotoMaker, an efficient +personalized text-to-image generation method, which mainly encodes an arbitrary +number of input ID images into a stack ID embedding for preserving ID +information. Such an embedding, serving as a unified ID representation, can not +only encapsulate the characteristics of the same input ID comprehensively, but +also accommodate the characteristics of different IDs for subsequent +integration. This paves the way for more intriguing and practically valuable +applications. Besides, to drive the training of our PhotoMaker, we propose an +ID-oriented data construction pipeline to assemble the training data. 
Under the +nourishment of the dataset constructed through the proposed pipeline, our +PhotoMaker demonstrates better ID preservation ability than test-time +fine-tuning based methods, yet provides significant speed improvements, +high-quality generation results, strong generalization capabilities, and a wide +range of applications. Our project page is available at +https://photo-maker.github.io/",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG', 'cs.MM']" +Hierarchical Correlation Clustering and Tree Preserving Embedding,Morteza Haghir Chehreghani · Mostafa Haghir Chehreghani, ,https://arxiv.org/abs/2402.03587,,2402.03587.pdf,Information-Theoretic Active Correlation Clustering,"We study correlation clustering where the pairwise similarities are not known +in advance. For this purpose, we employ active learning to query pairwise +similarities in a cost-efficient way. We propose a number of effective +information-theoretic acquisition functions based on entropy and information +gain. We extensively investigate the performance of our methods in different +settings and demonstrate their superior performance compared to the +alternatives.",cs.LG,"['cs.LG', 'stat.ML']" +Referring Image Editing: Object-level Image Editing via Referring Expressions,Chang Liu · Xiangtai Li · Henghui Ding, ,,https://link.springer.com/article/10.1007/s11063-024-11487-2,,,,,nan +CORE-MPI: Consistency Object Removal with Embedding MultiPlane Image,Donggeun Yoon · Donghyeon Cho, ,https://arxiv.org/abs/2310.08092,,2310.08092.pdf,Consistent123: Improve Consistency for One Image to 3D Object Synthesis,"Large image diffusion models enable novel view synthesis with high quality +and excellent zero-shot capability. However, such models based on +image-to-image translation have no guarantee of view consistency, limiting the +performance for downstream tasks like 3D reconstruction and image-to-3D +generation. To empower consistency, we propose Consistent123 to synthesize +novel views simultaneously by incorporating additional cross-view attention +layers and the shared self-attention mechanism. The proposed attention +mechanism improves the interaction across all synthesized views, as well as the +alignment between the condition view and novel views. In the sampling stage, +such architecture supports simultaneously generating an arbitrary number of +views while training at a fixed length. We also introduce a progressive +classifier-free guidance strategy to achieve the trade-off between texture and +geometry for synthesized object views. Qualitative and quantitative experiments +show that Consistent123 outperforms baselines in view consistency by a large +margin. Furthermore, we demonstrate a significant improvement of Consistent123 +on varying downstream tasks, showing its great potential in the 3D generation +field. The project page is available at consistent-123.github.io.",cs.CV,['cs.CV'] +WOUAF: Weight Modulation for User Attribution and Fingerprinting in Text-to-Image Diffusion Models,Changhoon Kim · Kyle Min · Maitreya Patel · Sheng Cheng · 'YZ' Yezhou Yang, ,https://arxiv.org/abs/2306.04744,,2306.04744.pdf,WOUAF: Weight Modulation for User Attribution and Fingerprinting in Text-to-Image Diffusion Models,"The rapid advancement of generative models, facilitating the creation of +hyper-realistic images from textual descriptions, has concurrently escalated +critical societal concerns such as misinformation. 
Although providing some +mitigation, traditional fingerprinting mechanisms fall short in attributing +responsibility for the malicious use of synthetic images. This paper introduces +a novel approach to model fingerprinting that assigns responsibility for the +generated images, thereby serving as a potential countermeasure to model +misuse. Our method modifies generative models based on each user's unique +digital fingerprint, imprinting a unique identifier onto the resultant content +that can be traced back to the user. This approach, incorporating fine-tuning +into Text-to-Image (T2I) tasks using the Stable Diffusion Model, demonstrates +near-perfect attribution accuracy with a minimal impact on output quality. +Through extensive evaluation, we show that our method outperforms baseline +methods with an average improvement of 11\% in handling image post-processes. +Our method presents a promising and novel avenue for accountable model +distribution and responsible use. Our code is available in +\url{https://github.com/kylemin/WOUAF}.",cs.CV,['cs.CV'] +Improving Unsupervised Hierarchical Representation with Reinforcement Learning,Ruyi An · Yewen Li · Xu He · Pengjie Gu · Mengchen Zhao · Dong Li · Jianye Hao · Bo An · Chaojie Wang · Mingyuan Zhou, ,,https://www2.scut.edu.cn/sse/2024/0226/c16789a534834/page.htm,,,,,nan +Joint-Task Regularization for Partially Labeled Multi-Task Learning,Kento Nishi · Junsik Kim · Wanhua Li · Hanspeter Pfister,https://kentonishi.com/JTR-CVPR-2024/,https://arxiv.org/abs/2404.01976,,2404.01976.pdf,Joint-Task Regularization for Partially Labeled Multi-Task Learning,"Multi-task learning has become increasingly popular in the machine learning +field, but its practicality is hindered by the need for large, labeled +datasets. Most multi-task learning methods depend on fully labeled datasets +wherein each input example is accompanied by ground-truth labels for all target +tasks. Unfortunately, curating such datasets can be prohibitively expensive and +impractical, especially for dense prediction tasks which require per-pixel +labels for each image. With this in mind, we propose Joint-Task Regularization +(JTR), an intuitive technique which leverages cross-task relations to +simultaneously regularize all tasks in a single joint-task latent space to +improve learning when data is not fully labeled for all tasks. JTR stands out +from existing approaches in that it regularizes all tasks jointly rather than +separately in pairs -- therefore, it achieves linear complexity relative to the +number of tasks while previous methods scale quadratically. To demonstrate the +validity of our approach, we extensively benchmark our method across a wide +variety of partially labeled scenarios based on NYU-v2, Cityscapes, and +Taskonomy.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +HMD-Poser: On-Device Real-time Human Motion Tracking from Scalable Sparse Observations,Peng Dai · Yang Zhang · Tao Liu · ZhenFan · Tianyuan Du · Zhuo Su · Xiaozheng Zheng · Zeming Li,https://pico-ai-team.github.io/hmd-poser,https://arxiv.org/abs/2403.03561,,2403.03561.pdf,HMD-Poser: On-Device Real-time Human Motion Tracking from Scalable Sparse Observations,"It is especially challenging to achieve real-time human motion tracking on a +standalone VR Head-Mounted Display (HMD) such as Meta Quest and PICO. In this +paper, we propose HMD-Poser, the first unified approach to recover full-body +motions using scalable sparse observations from HMD and body-worn IMUs. 
In +particular, it can support a variety of input scenarios, such as HMD, +HMD+2IMUs, HMD+3IMUs, etc. The scalability of inputs may accommodate users' +choices for both high tracking accuracy and easy-to-wear. A lightweight +temporal-spatial feature learning network is proposed in HMD-Poser to guarantee +that the model runs in real-time on HMDs. Furthermore, HMD-Poser presents +online body shape estimation to improve the position accuracy of body joints. +Extensive experimental results on the challenging AMASS dataset show that +HMD-Poser achieves new state-of-the-art results in both accuracy and real-time +performance. We also build a new free-dancing motion dataset to evaluate +HMD-Poser's on-device performance and investigate the performance gap between +synthetic data and real-captured sensor data. Finally, we demonstrate our +HMD-Poser with a real-time Avatar-driving application on a commercial HMD. Our +code and free-dancing motion dataset are available +https://pico-ai-team.github.io/hmd-poser",cs.CV,['cs.CV'] +BA-SAM: Scalable Bias-Mode Attention Mask for Segment Anything Model,song yiran · Qianyu Zhou · Xiangtai Li · Deng-Ping Fan · Xuequan Lu · Lizhuang Ma, ,https://arxiv.org/abs/2401.02317,,2401.02317.pdf,BA-SAM: Scalable Bias-Mode Attention Mask for Segment Anything Model,"In this paper, we address the challenge of image resolution variation for the +Segment Anything Model (SAM). SAM, known for its zero-shot generalizability, +exhibits a performance degradation when faced with datasets with varying image +sizes. Previous approaches tend to resize the image to a fixed size or adopt +structure modifications, hindering the preservation of SAM's rich prior +knowledge. Besides, such task-specific tuning necessitates a complete +retraining of the model, which is cost-expensive and unacceptable for +deployment in the downstream tasks. In this paper, we reformulate this issue as +a length extrapolation problem, where token sequence length varies while +maintaining a consistent patch size for images of different sizes. To this end, +we propose Scalable Bias-Mode Attention Mask (BA-SAM) to enhance SAM's +adaptability to varying image resolutions while eliminating the need for +structure modifications. Firstly, we introduce a new scaling factor to ensure +consistent magnitude in the attention layer's dot product values when the token +sequence length changes. Secondly, we present a bias-mode attention mask that +allows each token to prioritize neighboring information, mitigating the impact +of untrained distant information. Our BA-SAM demonstrates efficacy in two +scenarios: zero-shot and fine-tuning. Extensive evaluation on diverse datasets, +including DIS5K, DUTS, ISIC, COD10K, and COCO, reveals its ability to +significantly mitigate performance degradation in the zero-shot setting and +achieve state-of-the-art performance with minimal fine-tuning. Furthermore, we +propose a generalized model and benchmark, showcasing BA-SAM's generalizability +across all four datasets simultaneously. 
Code is available at +https://github.com/zongzi13545329/BA-SAM",cs.CV,['cs.CV'] +CosalPure: Learning Concept from Group Images for Robust Co-Saliency Detection,Jiayi Zhu · Qing Guo · Felix Juefei Xu · Yihao Huang · Yang Liu · Geguang Pu, ,https://arxiv.org/abs/2403.18554,,2403.18554.pdf,CosalPure: Learning Concept from Group Images for Robust Co-Saliency Detection,"Co-salient object detection (CoSOD) aims to identify the common and salient +(usually in the foreground) regions across a given group of images. Although +achieving significant progress, state-of-the-art CoSODs could be easily +affected by some adversarial perturbations, leading to substantial accuracy +reduction. The adversarial perturbations can mislead CoSODs but do not change +the high-level semantic information (e.g., concept) of the co-salient objects. +In this paper, we propose a novel robustness enhancement framework by first +learning the concept of the co-salient objects based on the input group images +and then leveraging this concept to purify adversarial perturbations, which are +subsequently fed to CoSODs for robustness enhancement. Specifically, we propose +CosalPure containing two modules, i.e., group-image concept learning and +concept-guided diffusion purification. For the first module, we adopt a +pre-trained text-to-image diffusion model to learn the concept of co-salient +objects within group images where the learned concept is robust to adversarial +examples. For the second module, we map the adversarial image to the latent +space and then perform diffusion generation by embedding the learned concept +into the noise prediction function as an extra condition. Our method can +effectively alleviate the influence of the SOTA adversarial attack containing +different adversarial patterns, including exposure and noise. The extensive +results demonstrate that our method could enhance the robustness of CoSODs +significantly.",cs.CV,['cs.CV'] +Adaptive Bidirectional Displacement for Semi-Supervised Medical Image Segmentation,Hanyang Chi · Jian Pang · Bingfeng Zhang · Weifeng Liu, ,https://arxiv.org/abs/2405.00378,,2405.00378.pdf,Adaptive Bidirectional Displacement for Semi-Supervised Medical Image Segmentation,"Consistency learning is a central strategy to tackle unlabeled data in +semi-supervised medical image segmentation (SSMIS), which enforces the model to +produce consistent predictions under the perturbation. However, most current +approaches solely focus on utilizing a specific single perturbation, which can +only cope with limited cases, while employing multiple perturbations +simultaneously is hard to guarantee the quality of consistency learning. In +this paper, we propose an Adaptive Bidirectional Displacement (ABD) approach to +solve the above challenge. Specifically, we first design a bidirectional patch +displacement based on reliable prediction confidence for unlabeled data to +generate new samples, which can effectively suppress uncontrollable regions and +still retain the influence of input perturbations. Meanwhile, to enforce the +model to learn the potentially uncontrollable content, a bidirectional +displacement operation with inverse confidence is proposed for the labeled +images, which generates samples with more unreliable information to facilitate +model learning. Extensive experiments show that ABD achieves new +state-of-the-art performances for SSMIS, significantly improving different +baselines. 
Source code is available at https://github.com/chy-upc/ABD.",cs.CV,['cs.CV'] +UniMix: Towards Domain Adaptive and Generalizable LiDAR Semantic Segmentation in Adverse Weather,Haimei Zhao · Jing Zhang · Zhuo Chen · Shanshan Zhao · Dacheng Tao, ,https://arxiv.org/abs/2404.05145,,2404.05145.pdf,UniMix: Towards Domain Adaptive and Generalizable LiDAR Semantic Segmentation in Adverse Weather,"LiDAR semantic segmentation (LSS) is a critical task in autonomous driving +and has achieved promising progress. However, prior LSS methods are +conventionally investigated and evaluated on datasets within the same domain in +clear weather. The robustness of LSS models in unseen scenes and all weather +conditions is crucial for ensuring safety and reliability in real applications. +To this end, we propose UniMix, a universal method that enhances the +adaptability and generalizability of LSS models. UniMix first leverages +physically valid adverse weather simulation to construct a Bridge Domain, which +serves to bridge the domain gap between the clear weather scenes and the +adverse weather scenes. Then, a Universal Mixing operator is defined regarding +spatial, intensity, and semantic distributions to create the intermediate +domain with mixed samples from given domains. Integrating the proposed two +techniques into a teacher-student framework, UniMix efficiently mitigates the +domain gap and enables LSS models to learn weather-robust and domain-invariant +representations. We devote UniMix to two main setups: 1) unsupervised domain +adaption, adapting the model from the clear weather source domain to the +adverse weather target domain; 2) domain generalization, learning a model that +generalizes well to unseen scenes in adverse weather. Extensive experiments +validate the effectiveness of UniMix across different tasks and datasets, all +achieving superior performance over state-of-the-art methods. The code will be +released.",cs.CV,['cs.CV'] +Estimating Extreme 3D Image Rotations using Cascaded Attention,Shay Dekel · Yosi Keller · Martin Čadík, ,,https://www.youtube.com/watch?v=LzUPefef_8Q,,,,,nan +PhysGaussian: Physics-Integrated 3D Gaussians for Generative Dynamics,Tianyi Xie · Zeshun Zong · Yuxing Qiu · Xuan Li · Yutao Feng · Yin Yang · Chenfanfu Jiang, ,https://arxiv.org/abs/2311.12198,,2311.12198.pdf,PhysGaussian: Physics-Integrated 3D Gaussians for Generative Dynamics,"We introduce PhysGaussian, a new method that seamlessly integrates physically +grounded Newtonian dynamics within 3D Gaussians to achieve high-quality novel +motion synthesis. Employing a custom Material Point Method (MPM), our approach +enriches 3D Gaussian kernels with physically meaningful kinematic deformation +and mechanical stress attributes, all evolved in line with continuum mechanics +principles. A defining characteristic of our method is the seamless integration +between physical simulation and visual rendering: both components utilize the +same 3D Gaussian kernels as their discrete representations. This negates the +necessity for triangle/tetrahedron meshing, marching cubes, ""cage meshes,"" or +any other geometry embedding, highlighting the principle of ""what you see is +what you simulate (WS$^2$)."" Our method demonstrates exceptional versatility +across a wide variety of materials--including elastic entities, metals, +non-Newtonian fluids, and granular materials--showcasing its strong +capabilities in creating diverse visual content with novel viewpoints and +movements. 
Our project page is at: https://xpandora.github.io/PhysGaussian/",cs.GR,"['cs.GR', 'cs.AI', 'cs.CV', 'cs.LG']" +RegionPLC: Regional Point-Language Contrastive Learning for Open-World 3D Scene Understanding,Jihan Yang · Runyu Ding · Weipeng DENG · Zhe Wang · Xiaojuan Qi, ,https://arxiv.org/abs/2308.00353,,2308.00353.pdf,Lowis3D: Language-Driven Open-World Instance-Level 3D Scene Understanding,"Open-world instance-level scene understanding aims to locate and recognize +unseen object categories that are not present in the annotated dataset. This +task is challenging because the model needs to both localize novel 3D objects +and infer their semantic categories. A key factor for the recent progress in 2D +open-world perception is the availability of large-scale image-text pairs from +the Internet, which cover a wide range of vocabulary concepts. However, this +success is hard to replicate in 3D scenarios due to the scarcity of 3D-text +pairs. To address this challenge, we propose to harness pre-trained +vision-language (VL) foundation models that encode extensive knowledge from +image-text pairs to generate captions for multi-view images of 3D scenes. This +allows us to establish explicit associations between 3D shapes and +semantic-rich captions. Moreover, to enhance the fine-grained visual-semantic +representation learning from captions for object-level categorization, we +design hierarchical point-caption association methods to learn semantic-aware +embeddings that exploit the 3D geometry between 3D points and multi-view +images. In addition, to tackle the localization challenge for novel classes in +the open-world setting, we develop debiased instance localization, which +involves training object grouping modules on unlabeled data using +instance-level pseudo supervision. This significantly improves the +generalization capabilities of instance grouping and thus the ability to +accurately locate novel objects. We conduct extensive experiments on 3D +semantic, instance, and panoptic segmentation tasks, covering indoor and +outdoor scenes across three datasets. Our method outperforms baseline methods +by a significant margin in semantic segmentation (e.g. 34.5%$\sim$65.3%), +instance segmentation (e.g. 21.8%$\sim$54.0%) and panoptic segmentation (e.g. +14.7%$\sim$43.3%). Code will be available.",cs.CV,['cs.CV'] +Modality-Collaborative Test-Time Adaptation for Action Recognition,Baochen Xiong · Xiaoshan Yang · Yaguang Song · Yaowei Wang · Changsheng Xu, ,,https://dl.acm.org/doi/pdf/10.1145/3581783.3611757,,,,,nan +3D Human Pose Perception from Egocentric Stereo Videos,Hiroyasu Akada · Jian Wang · Vladislav Golyanik · Christian Theobalt, ,https://arxiv.org/abs/2401.00889,,2401.00889.pdf,3D Human Pose Perception from Egocentric Stereo Videos,"While head-mounted devices are becoming more compact, they provide egocentric +views with significant self-occlusions of the device user. Hence, existing +methods often fail to accurately estimate complex 3D poses from egocentric +views. In this work, we propose a new transformer-based framework to improve +egocentric stereo 3D human pose estimation, which leverages the scene +information and temporal context of egocentric stereo videos. Specifically, we +utilize 1) depth features from our 3D scene reconstruction module with +uniformly sampled windows of egocentric stereo frames, and 2) human joint +queries enhanced by temporal features of the video inputs. 
Our method is able +to accurately estimate human poses even in challenging scenarios, such as +crouching and sitting. Furthermore, we introduce two new benchmark datasets, +i.e., UnrealEgo2 and UnrealEgo-RW (RealWorld). The proposed datasets offer a +much larger number of egocentric stereo views with a wider variety of human +motions than the existing datasets, allowing comprehensive evaluation of +existing and upcoming methods. Our extensive experiments show that the proposed +approach significantly outperforms previous methods. We will release +UnrealEgo2, UnrealEgo-RW, and trained models on our project page.",cs.CV,['cs.CV'] +Deep Generative Model based Rate-Distortion for Image Downscaling Assessment,yuanbang liang · Bhavesh Garg · Paul L. Rosin · Yipeng Qin, ,https://arxiv.org/abs/2403.15139,,2403.15139.pdf,Deep Generative Model based Rate-Distortion for Image Downscaling Assessment,"In this paper, we propose Image Downscaling Assessment by Rate-Distortion +(IDA-RD), a novel measure to quantitatively evaluate image downscaling +algorithms. In contrast to image-based methods that measure the quality of +downscaled images, ours is process-based that draws ideas from rate-distortion +theory to measure the distortion incurred during downscaling. Our main idea is +that downscaling and super-resolution (SR) can be viewed as the encoding and +decoding processes in the rate-distortion model, respectively, and that a +downscaling algorithm that preserves more details in the resulting +low-resolution (LR) images should lead to less distorted high-resolution (HR) +images in SR. In other words, the distortion should increase as the downscaling +algorithm deteriorates. However, it is non-trivial to measure this distortion +as it requires the SR algorithm to be blind and stochastic. Our key insight is +that such requirements can be met by recent SR algorithms based on deep +generative models that can find all matching HR images for a given LR image on +their learned image manifolds. Extensive experimental results show the +effectiveness of our IDA-RD measure.",cs.CV,"['cs.CV', 'eess.IV']" +Is Ego Status All You Need for Open-Loop End-to-End Autonomous Driving?,Zhiqi Li · Zhiding Yu · Shiyi Lan · Jiahan Li · Jan Kautz · Tong Lu · Jose M. Alvarez, ,https://arxiv.org/abs/2312.03031,,2312.03031.pdf,Is Ego Status All You Need for Open-Loop End-to-End Autonomous Driving?,"End-to-end autonomous driving recently emerged as a promising research +direction to target autonomy from a full-stack perspective. Along this line, +many of the latest works follow an open-loop evaluation setting on nuScenes to +study the planning behavior. In this paper, we delve deeper into the problem by +conducting thorough analyses and demystifying more devils in the details. We +initially observed that the nuScenes dataset, characterized by relatively +simple driving scenarios, leads to an under-utilization of perception +information in end-to-end models incorporating ego status, such as the ego +vehicle's velocity. These models tend to rely predominantly on the ego +vehicle's status for future path planning. Beyond the limitations of the +dataset, we also note that current metrics do not comprehensively assess the +planning quality, leading to potentially biased conclusions drawn from existing +benchmarks. To address this issue, we introduce a new metric to evaluate +whether the predicted trajectories adhere to the road. 
We further propose a +simple baseline able to achieve competitive results without relying on +perception annotations. Given the current limitations on the benchmark and +metrics, we suggest the community reassess relevant prevailing research and be +cautious whether the continued pursuit of state-of-the-art would yield +convincing and universal conclusions. Code and models are available at +\url{https://github.com/NVlabs/BEV-Planner}",cs.CV,['cs.CV'] +FlashAvatar: High-fidelity Head Avatar with Efficient Gaussian Embedding,Jun Xiang · Xuan Gao · Yudong Guo · Juyong Zhang, ,https://arxiv.org/abs/2312.02214,,2312.02214.pdf,FlashAvatar: High-fidelity Head Avatar with Efficient Gaussian Embedding,"We propose FlashAvatar, a novel and lightweight 3D animatable avatar +representation that could reconstruct a digital avatar from a short monocular +video sequence in minutes and render high-fidelity photo-realistic images at +300FPS on a consumer-grade GPU. To achieve this, we maintain a uniform 3D +Gaussian field embedded in the surface of a parametric face model and learn +extra spatial offset to model non-surface regions and subtle facial details. +While full use of geometric priors can capture high-frequency facial details +and preserve exaggerated expressions, proper initialization can help reduce the +number of Gaussians, thus enabling super-fast rendering speed. Extensive +experimental results demonstrate that FlashAvatar outperforms existing works +regarding visual quality and personalized details and is almost an order of +magnitude faster in rendering speed. Project page: +https://ustc3dv.github.io/FlashAvatar/",cs.CV,"['cs.CV', 'cs.GR']" +The Manga Whisperer: Automatically Generating Transcriptions for Comics,Ragav Sachdeva · Andrew Zisserman,https://github.com/ragavsachdeva/magi,https://arxiv.org/abs/2401.10224,,2401.10224.pdf,The Manga Whisperer: Automatically Generating Transcriptions for Comics,"In the past few decades, Japanese comics, commonly referred to as Manga, have +transcended both cultural and linguistic boundaries to become a true worldwide +sensation. Yet, the inherent reliance on visual cues and illustration within +manga renders it largely inaccessible to individuals with visual impairments. +In this work, we seek to address this substantial barrier, with the aim of +ensuring that manga can be appreciated and actively engaged by everyone. +Specifically, we tackle the problem of diarisation i.e. generating a +transcription of who said what and when, in a fully automatic way. + To this end, we make the following contributions: (1) we present a unified +model, Magi, that is able to (a) detect panels, text boxes and character boxes, +(b) cluster characters by identity (without knowing the number of clusters +apriori), and (c) associate dialogues to their speakers; (2) we propose a novel +approach that is able to sort the detected text boxes in their reading order +and generate a dialogue transcript; (3) we annotate an evaluation benchmark for +this task using publicly available [English] manga pages. 
The code, evaluation +datasets and the pre-trained model can be found at: +https://github.com/ragavsachdeva/magi.",cs.CV,['cs.CV'] +SNIDA: Unlocking Few-Shot Object Detection with Non-linear Semantic Decoupling Augmentation,Yanjie Wang · Xu Zou · Luxin Yan · Sheng Zhong · Jiahuan Zhou, ,https://arxiv.org/abs/2401.11140,,2401.11140.pdf,Stability Plasticity Decoupled Fine-tuning For Few-shot end-to-end Object Detection,"Few-shot object detection(FSOD) aims to design methods to adapt object +detectors efficiently with only few annotated samples. Fine-tuning has been +shown to be an effective and practical approach. However, previous works often +take the classical base-novel two stage fine-tuning procedure but ignore the +implicit stability-plasticity contradiction among different modules. +Specifically, the random re-initialized classifiers need more plasticity to +adapt to novel samples. The other modules inheriting pre-trained weights demand +more stability to reserve their class-agnostic knowledge. Regular fine-tuning +which couples the optimization of these two parts hurts the model +generalization in FSOD scenarios. In this paper, we find that this problem is +prominent in the end-to-end object detector Sparse R-CNN for its +multi-classifier cascaded architecture. We propose to mitigate this +contradiction by a new three-stage fine-tuning procedure by introducing an +addtional plasticity classifier fine-tuning(PCF) stage. We further design the +multi-source ensemble(ME) technique to enhance the generalization of the model +in the final fine-tuning stage. Extensive experiments verify that our method is +effective in regularizing Sparse R-CNN, outperforming previous methods in the +FSOD benchmark.",cs.CV,"['cs.CV', 'cs.AI']" +Salience DETR: Enhancing Detection Transformer with Hierarchical Salience Filtering Refinement,Xiuquan Hou · Meiqin Liu · Senlin Zhang · Ping Wei · Badong Chen,https://github.com/xiuqhou/Salience-DETR,https://arxiv.org/abs/2403.16131,,2403.16131.pdf,Salience DETR: Enhancing Detection Transformer with Hierarchical Salience Filtering Refinement,"DETR-like methods have significantly increased detection performance in an +end-to-end manner. The mainstream two-stage frameworks of them perform dense +self-attention and select a fraction of queries for sparse cross-attention, +which is proven effective for improving performance but also introduces a heavy +computational burden and high dependence on stable query selection. This paper +demonstrates that suboptimal two-stage selection strategies result in scale +bias and redundancy due to the mismatch between selected queries and objects in +two-stage initialization. To address these issues, we propose hierarchical +salience filtering refinement, which performs transformer encoding only on +filtered discriminative queries, for a better trade-off between computational +efficiency and precision. The filtering process overcomes scale bias through a +novel scale-independent salience supervision. To compensate for the semantic +misalignment among queries, we introduce elaborate query refinement modules for +stable two-stage initialization. Based on above improvements, the proposed +Salience DETR achieves significant improvements of +4.0% AP, +0.2% AP, +4.4% AP +on three challenging task-specific detection datasets, as well as 49.2% AP on +COCO 2017 with less FLOPs. 
The code is available at +https://github.com/xiuqhou/Salience-DETR.",cs.CV,['cs.CV'] +One-2-3-45++: Fast Single Image to 3D Objects with Consistent Multi-View Generation and 3D Diffusion,Minghua Liu · Ruoxi Shi · Linghao Chen · Zhuoyang Zhang · Chao Xu · Xinyue Wei · Hansheng Chen · Chong Zeng · Jiayuan Gu · Hao Su,https://sudo-ai-3d.github.io/One2345plus_page/,,https://github.com/SUDO-AI-3D/One2345plus,,,,,nan +Physical 3D Adversarial Attacks against Monocular Depth Estimation in Autonomous Driving,Junhao Zheng · Chenhao Lin · Jiahao Sun · Zhengyu Zhao · Qian Li · Chao Shen,https://github.com/gandolfczjh/3d2fool,https://arxiv.org/abs/2403.17301,,2403.17301.pdf,Physical 3D Adversarial Attacks against Monocular Depth Estimation in Autonomous Driving,"Deep learning-based monocular depth estimation (MDE), extensively applied in +autonomous driving, is known to be vulnerable to adversarial attacks. Previous +physical attacks against MDE models rely on 2D adversarial patches, so they +only affect a small, localized region in the MDE map but fail under various +viewpoints. To address these limitations, we propose 3D Depth Fool +(3D$^2$Fool), the first 3D texture-based adversarial attack against MDE models. +3D$^2$Fool is specifically optimized to generate 3D adversarial textures +agnostic to model types of vehicles and to have improved robustness in bad +weather conditions, such as rain and fog. Experimental results validate the +superior performance of our 3D$^2$Fool across various scenarios, including +vehicles, MDE models, weather conditions, and viewpoints. Real-world +experiments with printed 3D textures on physical vehicle models further +demonstrate that our 3D$^2$Fool can cause an MDE error of over 10 meters.",cs.CV,"['cs.CV', 'cs.CR']" +VecFusion: Vector Font Generation with Diffusion,Vikas Thamizharasan · Difan Liu · Shantanu Agarwal · Matthew Fisher · Michaël Gharbi · Oliver Wang · Alec Jacobson · Evangelos Kalogerakis, ,https://arxiv.org/abs/2312.10540,,2312.10540.pdf,VecFusion: Vector Font Generation with Diffusion,"We present VecFusion, a new neural architecture that can generate vector +fonts with varying topological structures and precise control point positions. +Our approach is a cascaded diffusion model which consists of a raster diffusion +model followed by a vector diffusion model. The raster model generates +low-resolution, rasterized fonts with auxiliary control point information, +capturing the global style and shape of the font, while the vector model +synthesizes vector fonts conditioned on the low-resolution raster fonts from +the first stage. To synthesize long and complex curves, our vector diffusion +model uses a transformer architecture and a novel vector representation that +enables the modeling of diverse vector geometry and the precise prediction of +control points. 
Our experiments show that, in contrast to previous generative +models for vector graphics, our new cascaded vector diffusion model generates +higher quality vector fonts, with complex structures and diverse styles.",cs.CV,"['cs.CV', 'cs.GR']" +LAA-Net: Localized Artifact Attention Network for Quality-Agnostic and Generalizable Deepfake Detection,Dat NGUYEN · Nesryne Mejri · Inder Pal Singh · Polina Kuleshova · Marcella Astrid · Anis Kacem · Enjie Ghorbel · Djamila Aouada,https://github.com/10Ring/LAA-Net,https://arxiv.org/abs/2401.13856,,2401.13856.pdf,LAA-Net: Localized Artifact Attention Network for Quality-Agnostic and Generalizable Deepfake Detection,"This paper introduces a novel approach for high-quality deepfake detection +called Localized Artifact Attention Network (LAA-Net). Existing methods for +high-quality deepfake detection are mainly based on a supervised binary +classifier coupled with an implicit attention mechanism. As a result, they do +not generalize well to unseen manipulations. To handle this issue, two main +contributions are made. First, an explicit attention mechanism within a +multi-task learning framework is proposed. By combining heatmap-based and +self-consistency attention strategies, LAA-Net is forced to focus on a few +small artifact-prone vulnerable regions. Second, an Enhanced Feature Pyramid +Network (E-FPN) is proposed as a simple and effective mechanism for spreading +discriminative low-level features into the final feature output, with the +advantage of limiting redundancy. Experiments performed on several benchmarks +show the superiority of our approach in terms of Area Under the Curve (AUC) and +Average Precision (AP). The code is available at +https://github.com/10Ring/LAA-Net.",cs.CV,['cs.CV'] +SAI3D: Segment Any Instance in 3D Scenes,Yingda Yin · Yuzheng Liu · Yang Xiao · Daniel Cohen-Or · Jingwei Huang · Baoquan Chen,https://yd-yin.github.io/SAI3D/,https://arxiv.org/abs/2312.11557,,,SAI3D: Segment Any Instance in 3D Scenes,"Advancements in 3D instance segmentation have traditionally been tethered to +the availability of annotated datasets, limiting their application to a narrow +spectrum of object categories. Recent efforts have sought to harness +vision-language models like CLIP for open-set semantic reasoning, yet these +methods struggle to distinguish between objects of the same categories and rely +on specific prompts that are not universally applicable. In this paper, we +introduce SAI3D, a novel zero-shot 3D instance segmentation approach that +synergistically leverages geometric priors and semantic cues derived from +Segment Anything Model (SAM). Our method partitions a 3D scene into geometric +primitives, which are then progressively merged into 3D instance segmentations +that are consistent with the multi-view SAM masks. Moreover, we design a +hierarchical region-growing algorithm with a dynamic thresholding mechanism, +which largely improves the robustness of finegrained 3D scene parsing.Empirical +evaluations on ScanNet, Matterport3D and the more challenging ScanNet++ +datasets demonstrate the superiority of our approach. Notably, SAI3D +outperforms existing open-vocabulary baselines and even surpasses +fully-supervised methods in class-agnostic segmentation on ScanNet++. 
Our +project page is at https://yd-yin.github.io/SAI3D.",cs.CV,['cs.CV'] +InteractDiffusion: Interaction Control in Text-to-Image Diffusion Models,Jiun Tian Hoe · Xudong Jiang · Chee Seng Chan · Yap-peng Tan · Weipeng Hu,https://jiuntian.github.io/interactdiffusion/,https://arxiv.org/abs/2312.05849,,2312.05849.pdf,InteractDiffusion: Interaction Control in Text-to-Image Diffusion Models,"Large-scale text-to-image (T2I) diffusion models have showcased incredible +capabilities in generating coherent images based on textual descriptions, +enabling vast applications in content generation. While recent advancements +have introduced control over factors such as object localization, posture, and +image contours, a crucial gap remains in our ability to control the +interactions between objects in the generated content. Well-controlling +interactions in generated images could yield meaningful applications, such as +creating realistic scenes with interacting characters. In this work, we study +the problems of conditioning T2I diffusion models with Human-Object Interaction +(HOI) information, consisting of a triplet label (person, action, object) and +corresponding bounding boxes. We propose a pluggable interaction control model, +called InteractDiffusion that extends existing pre-trained T2I diffusion models +to enable them being better conditioned on interactions. Specifically, we +tokenize the HOI information and learn their relationships via interaction +embeddings. A conditioning self-attention layer is trained to map HOI tokens to +visual tokens, thereby conditioning the visual tokens better in existing T2I +diffusion models. Our model attains the ability to control the interaction and +location on existing T2I diffusion models, which outperforms existing baselines +by a large margin in HOI detection score, as well as fidelity in FID and KID. +Project page: https://jiuntian.github.io/interactdiffusion.",cs.CV,"['cs.CV', 'cs.GR', 'cs.MM']" +G3DR: Generative 3D Reconstruction in ImageNet,Pradyumna Reddy · Ismail Elezi · Jiankang Deng,https://preddy5.github.io/g3dr_website/,https://arxiv.org/abs/2403.00939,,2403.00939.pdf,G3DR: Generative 3D Reconstruction in ImageNet,"We introduce a novel 3D generative method, Generative 3D Reconstruction +(G3DR) in ImageNet, capable of generating diverse and high-quality 3D objects +from single images, addressing the limitations of existing methods. At the +heart of our framework is a novel depth regularization technique that enables +the generation of scenes with high-geometric fidelity. G3DR also leverages a +pretrained language-vision model, such as CLIP, to enable reconstruction in +novel views and improve the visual realism of generations. Additionally, G3DR +designs a simple but effective sampling procedure to further improve the +quality of generations. G3DR offers diverse and efficient 3D asset generation +based on class or text conditioning. Despite its simplicity, G3DR is able to +beat state-of-theart methods, improving over them by up to 22% in perceptual +metrics and 90% in geometry scores, while needing only half of the training +time. 
Code is available at https://github.com/preddy5/G3DR",cs.CV,"['cs.CV', 'cs.GR']" +ZeroRF: Fast Sparse View 360° Reconstruction with Zero Pretraining,Ruoxi Shi · Xinyue Wei · Cheng Wang · Hao Su, ,https://arxiv.org/abs/2312.09249,,2312.09249.pdf,ZeroRF: Fast Sparse View 360° Reconstruction with Zero Pretraining,"We present ZeroRF, a novel per-scene optimization method addressing the +challenge of sparse view 360{\deg} reconstruction in neural field +representations. Current breakthroughs like Neural Radiance Fields (NeRF) have +demonstrated high-fidelity image synthesis but struggle with sparse input +views. Existing methods, such as Generalizable NeRFs and per-scene optimization +approaches, face limitations in data dependency, computational cost, and +generalization across diverse scenarios. To overcome these challenges, we +propose ZeroRF, whose key idea is to integrate a tailored Deep Image Prior into +a factorized NeRF representation. Unlike traditional methods, ZeroRF +parametrizes feature grids with a neural network generator, enabling efficient +sparse view 360{\deg} reconstruction without any pretraining or additional +regularization. Extensive experiments showcase ZeroRF's versatility and +superiority in terms of both quality and speed, achieving state-of-the-art +results on benchmark datasets. ZeroRF's significance extends to applications in +3D content generation and editing. Project page: +https://sarahweiii.github.io/zerorf/",cs.CV,"['cs.CV', 'cs.GR']" +HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual Instruction Data,Qifan Yu · Juncheng Li · Longhui Wei · Liang Pang · Wentao Ye · Bosheng Qin · Siliang Tang · Qi Tian · Yueting Zhuang, ,https://arxiv.org/abs/2311.13614,,2311.13614.pdf,HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual Instruction Data,"Multi-modal Large Language Models (MLLMs) tuned on machine-generated +instruction-following data have demonstrated remarkable performance in various +multi-modal understanding and generation tasks. However, the hallucinations +inherent in machine-generated data, which could lead to hallucinatory outputs +in MLLMs, remain under-explored. This work aims to investigate various +hallucinations (i.e., object, relation, attribute hallucinations) and mitigate +those hallucinatory toxicities in large-scale machine-generated visual +instruction datasets. Drawing on the human ability to identify factual errors, +we present a novel hallucination detection and elimination framework, +HalluciDoctor, based on the cross-checking paradigm. We use our framework to +identify and eliminate hallucinations in the training data automatically. +Interestingly, HalluciDoctor also indicates that spurious correlations arising +from long-tail object co-occurrences contribute to hallucinations. Based on +that, we execute counterfactual visual instruction expansion to balance data +distribution, thereby enhancing MLLMs' resistance to hallucinations. +Comprehensive experiments on hallucination evaluation benchmarks show that our +method successfully mitigates 44.6% hallucinations relatively and maintains +competitive performance compared to LLaVA. The data and code for this paper are +publicly available. 
\url{https://github.com/Yuqifan1117/HalluciDoctor}.",cs.CV,"['cs.CV', 'cs.AI']" +Mudslide: A Universal Nuclear Instance Segmentation Method,Jun Wang, ,https://arxiv.org/abs/2311.15939,,2311.15939.pdf,Unleashing the Power of Prompt-driven Nucleus Instance Segmentation,"Nucleus instance segmentation in histology images is crucial for a broad +spectrum of clinical applications. Current dominant algorithms rely on +regression of nuclear proxy maps. Distinguishing nucleus instances from the +estimated maps requires carefully curated post-processing, which is error-prone +and parameter-sensitive. Recently, the Segment Anything Model (SAM) has earned +huge attention in medical image segmentation, owing to its impressive +generalization ability and promptable property. Nevertheless, its potential on +nucleus instance segmentation remains largely underexplored. In this paper, we +present a novel prompt-driven framework that consists of a nucleus prompter and +SAM for automatic nucleus instance segmentation. Specifically, the prompter +learns to generate a unique point prompt for each nucleus while the SAM is +fine-tuned to output the corresponding mask for the prompted nucleus. +Furthermore, we propose the inclusion of adjacent nuclei as negative prompts to +enhance the model's capability to identify overlapping nuclei. Without +complicated post-processing, our proposed method sets a new state-of-the-art +performance on three challenging benchmarks. Code is available at +\url{github.com/windygoo/PromptNucSeg}",cs.CV,['cs.CV'] +MMCert: Provable Defense against Adversarial Attacks to Multi-modal Models,Yanting Wang · Hongye Fu · Wei Zou · Jinyuan Jia, ,https://arxiv.org/abs/2403.19080,,2403.19080.pdf,MMCert: Provable Defense against Adversarial Attacks to Multi-modal Models,"Different from a unimodal model whose input is from a single modality, the +input (called multi-modal input) of a multi-modal model is from multiple +modalities such as image, 3D points, audio, text, etc. Similar to unimodal +models, many existing studies show that a multi-modal model is also vulnerable +to adversarial perturbation, where an attacker could add small perturbation to +all modalities of a multi-modal input such that the multi-modal model makes +incorrect predictions for it. Existing certified defenses are mostly designed +for unimodal models, which achieve sub-optimal certified robustness guarantees +when extended to multi-modal models as shown in our experimental results. In +our work, we propose MMCert, the first certified defense against adversarial +attacks to a multi-modal model. We derive a lower bound on the performance of +our MMCert under arbitrary adversarial attacks with bounded perturbations to +both modalities (e.g., in the context of auto-driving, we bound the number of +changed pixels in both RGB image and depth image). We evaluate our MMCert using +two benchmark datasets: one for the multi-modal road segmentation task and the +other for the multi-modal emotion recognition task. Moreover, we compare our +MMCert with a state-of-the-art certified defense extended from unimodal models. 
+Our experimental results show that our MMCert outperforms the baseline.",cs.CV,"['cs.CV', 'cs.CR']" +NTO3D: Neural Target Object 3D Reconstruction with Segment Anything,Xiaobao Wei · Renrui Zhang · Jiarui Wu · Jiaming Liu · Ming Lu · Yandong Guo · Shanghang Zhang, ,https://arxiv.org/abs/2309.12790,,2309.12790.pdf,NTO3D: Neural Target Object 3D Reconstruction with Segment Anything,"Neural 3D reconstruction from multi-view images has recently attracted +increasing attention from the community. Existing methods normally learn a +neural field for the whole scene, while it is still under-explored how to +reconstruct a target object indicated by users. Considering the Segment +Anything Model (SAM) has shown effectiveness in segmenting any 2D images, in +this paper, we propose NTO3D, a novel high-quality Neural Target Object 3D +(NTO3D) reconstruction method, which leverages the benefits of both neural +field and SAM. We first propose a novel strategy to lift the multi-view 2D +segmentation masks of SAM into a unified 3D occupancy field. The 3D occupancy +field is then projected into 2D space and generates the new prompts for SAM. +This process is iterative until convergence to separate the target object from +the scene. After this, we then lift the 2D features of the SAM encoder into a +3D feature field in order to improve the reconstruction quality of the target +object. NTO3D lifts the 2D masks and features of SAM into the 3D neural field +for high-quality neural target object 3D reconstruction. We conduct detailed +experiments on several benchmark datasets to demonstrate the advantages of our +method. The code will be available at: https://github.com/ucwxb/NTO3D.",cs.CV,['cs.CV'] +A Bayesian Approach to OOD Robustness in Image Classification,Prakhar Kaushik · Adam Kortylewski · Alan L. Yuille, ,https://arxiv.org/abs/2403.07277v1,,2403.07277v1.pdf,A Bayesian Approach to OOD Robustness in Image Classification,"An important and unsolved problem in computer vision is to ensure that the +algorithms are robust to changes in image domains. We address this problem in +the scenario where we have access to images from the target domains but no +annotations. Motivated by the challenges of the OOD-CV benchmark where we +encounter real world Out-of-Domain (OOD) nuisances and occlusion, we introduce +a novel Bayesian approach to OOD robustness for object classification. Our work +extends Compositional Neural Networks (CompNets), which have been shown to be +robust to occlusion but degrade badly when tested on OOD data. We exploit the +fact that CompNets contain a generative head defined over feature vectors +represented by von Mises-Fisher (vMF) kernels, which correspond roughly to +object parts, and can be learned without supervision. We obverse that some vMF +kernels are similar between different domains, while others are not. This +enables us to learn a transitional dictionary of vMF kernels that are +intermediate between the source and target domains and train the generative +model on this dictionary using the annotations on the source domain, followed +by iterative refinement. This approach, termed Unsupervised Generative +Transition (UGT), performs very well in OOD scenarios even when occlusion is +present. UGT is evaluated on different OOD benchmarks including the OOD-CV +dataset, several popular datasets (e.g., ImageNet-C [9]), artificial image +corruptions (including adding occluders), and synthetic-to-real domain +transfer, and does well in all scenarios outperforming SOTA alternatives (e.g. 
+up to 10% top-1 accuracy on Occluded OOD-CV dataset).",cs.CV,"['cs.CV', 'cs.AI']" +SNI-SLAM: Semantic Neural Implicit SLAM,Siting Zhu · Guangming Wang · Hermann Blum · Jiuming Liu · LiangSong · Marc Pollefeys · Hesheng Wang, ,https://arxiv.org/abs/2311.11016,,2311.11016.pdf,SNI-SLAM: Semantic Neural Implicit SLAM,"We propose SNI-SLAM, a semantic SLAM system utilizing neural implicit +representation, that simultaneously performs accurate semantic mapping, +high-quality surface reconstruction, and robust camera tracking. In this +system, we introduce hierarchical semantic representation to allow multi-level +semantic comprehension for top-down structured semantic mapping of the scene. +In addition, to fully utilize the correlation between multiple attributes of +the environment, we integrate appearance, geometry and semantic features +through cross-attention for feature collaboration. This strategy enables a more +multifaceted understanding of the environment, thereby allowing SNI-SLAM to +remain robust even when single attribute is defective. Then, we design an +internal fusion-based decoder to obtain semantic, RGB, Truncated Signed +Distance Field (TSDF) values from multi-level features for accurate decoding. +Furthermore, we propose a feature loss to update the scene representation at +the feature level. Compared with low-level losses such as RGB loss and depth +loss, our feature loss is capable of guiding the network optimization on a +higher-level. Our SNI-SLAM method demonstrates superior performance over all +recent NeRF-based SLAM methods in terms of mapping and tracking accuracy on +Replica and ScanNet datasets, while also showing excellent capabilities in +accurate semantic segmentation and real-time semantic mapping.",cs.RO,['cs.RO'] +PBWR: Parametric Building Wireframe Reconstruction from Aerial LiDAR Point Clouds,Shangfeng Huang · Ruisheng Wang · Bo Guo · Hongxin Yang, ,https://arxiv.org/abs/2311.12062,,2311.12062.pdf,PBWR: Parametric Building Wireframe Reconstruction from Aerial LiDAR Point Clouds,"In this paper, we present an end-to-end 3D building wireframe reconstruction +method to regress edges directly from aerial LiDAR point clouds.Our method, +named Parametric Building Wireframe Reconstruction (PBWR), takes aerial LiDAR +point clouds and initial edge entities as input, and fully uses self-attention +mechanism of transformers to regress edge parameters without any intermediate +steps such as corner prediction. We propose an edge non-maximum suppression +(E-NMS) module based on edge similarityto remove redundant edges. Additionally, +a dedicated edge loss function is utilized to guide the PBWR in regressing +edges parameters, where simple use of edge distance loss isn't suitable. In our +experiments, we demonstrate state-of-the-art results on the Building3D dataset, +achieving an improvement of approximately 36% in entry-level dataset edge +accuracy and around 42% improvement in the Tallinn dataset.",cs.CV,"['cs.CV', 'cs.AI']" +Animatable Gaussians: Learning Pose-dependent Gaussian Maps for High-fidelity Human Avatar Modeling,Zhe Li · Zerong Zheng · Lizhen Wang · Yebin Liu,https://animatable-gaussians.github.io/,https://arxiv.org/abs/2311.16096,,2311.16096.pdf,Animatable and Relightable Gaussians for High-fidelity Human Avatar Modeling,"Modeling animatable human avatars from RGB videos is a long-standing and +challenging problem. 
Recent works usually adopt MLP-based neural radiance +fields (NeRF) to represent 3D humans, but it remains difficult for pure MLPs to +regress pose-dependent garment details. To this end, we introduce Animatable +Gaussians, a new avatar representation that leverages powerful 2D CNNs and 3D +Gaussian splatting to create high-fidelity avatars. To associate 3D Gaussians +with the animatable avatar, we learn a parametric template from the input +videos, and then parameterize the template on two front & back canonical +Gaussian maps where each pixel represents a 3D Gaussian. The learned template +is adaptive to the wearing garments for modeling looser clothes like dresses. +Such template-guided 2D parameterization enables us to employ a powerful +StyleGAN-based CNN to learn the pose-dependent Gaussian maps for modeling +detailed dynamic appearances. Furthermore, we introduce a pose projection +strategy for better generalization given novel poses. To tackle the realistic +relighting of animatable avatars, we introduce physically-based rendering into +the avatar representation for decomposing avatar materials and environment +illumination. Overall, our method can create lifelike avatars with dynamic, +realistic, generalized and relightable appearances. Experiments show that our +method outperforms other state-of-the-art approaches.",cs.CV,"['cs.CV', 'cs.GR']" +Genuine Knowledge from Practice: Diffusion Test-Time Adaptation for Video Adverse Weather Removal,Yijun Yang · Hongtao Wu · Angelica I. Aviles-Rivero · Yulun Zhang · Jing Qin · Lei Zhu, ,https://arxiv.org/abs/2403.07684,,2403.07684.pdf,Genuine Knowledge from Practice: Diffusion Test-Time Adaptation for Video Adverse Weather Removal,"Real-world vision tasks frequently suffer from the appearance of unexpected +adverse weather conditions, including rain, haze, snow, and raindrops. In the +last decade, convolutional neural networks and vision transformers have yielded +outstanding results in single-weather video removal. However, due to the +absence of appropriate adaptation, most of them fail to generalize to other +weather conditions. Although ViWS-Net is proposed to remove adverse weather +conditions in videos with a single set of pre-trained weights, it is seriously +blinded by seen weather at train-time and degenerates when coming to unseen +weather during test-time. In this work, we introduce test-time adaptation into +adverse weather removal in videos, and propose the first framework that +integrates test-time adaptation into the iterative diffusion reverse process. +Specifically, we devise a diffusion-based network with a novel temporal noise +model to efficiently explore frame-correlated information in degraded video +clips at training stage. During inference stage, we introduce a proxy task +named Diffusion Tubelet Self-Calibration to learn the primer distribution of +test video stream and optimize the model by approximating the temporal noise +model for online adaptation. Experimental results, on benchmark datasets, +demonstrate that our Test-Time Adaptation method with Diffusion-based +network(Diff-TTA) outperforms state-of-the-art methods in terms of restoring +videos degraded by seen weather conditions. 
Its generalizable capability is +also validated with unseen weather conditions in both synthesized and +real-world videos.",cs.CV,['cs.CV'] +Generalizable Novel-View Synthesis using a Stereo Camera,Haechan Lee · Wonjoon Jin · Seung-Hwan Baek · Sunghyun Cho,https://jinwonjoon.github.io/stereonerf/,https://arxiv.org/abs/2404.13541,,2404.13541.pdf,Generalizable Novel-View Synthesis using a Stereo Camera,"In this paper, we propose the first generalizable view synthesis approach +that specifically targets multi-view stereo-camera images. Since recent stereo +matching has demonstrated accurate geometry prediction, we introduce stereo +matching into novel-view synthesis for high-quality geometry reconstruction. To +this end, this paper proposes a novel framework, dubbed StereoNeRF, which +integrates stereo matching into a NeRF-based generalizable view synthesis +approach. StereoNeRF is equipped with three key components to effectively +exploit stereo matching in novel-view synthesis: a stereo feature extractor, a +depth-guided plane-sweeping, and a stereo depth loss. Moreover, we propose the +StereoNVS dataset, the first multi-view dataset of stereo-camera images, +encompassing a wide variety of both real and synthetic scenes. Our experimental +results demonstrate that StereoNeRF surpasses previous approaches in +generalizable view synthesis.",cs.CV,['cs.CV'] +PhysPT: Physics-aware Pretrained Transformer for Estimating Human Dynamics from Monocular Videos,Yufei Zhang · Jeffrey Kephart · Zijun Cui · Qiang Ji, ,https://arxiv.org/abs/2404.04430,,2404.04430.pdf,PhysPT: Physics-aware Pretrained Transformer for Estimating Human Dynamics from Monocular Videos,"While current methods have shown promising progress on estimating 3D human +motion from monocular videos, their motion estimates are often physically +unrealistic because they mainly consider kinematics. In this paper, we +introduce Physics-aware Pretrained Transformer (PhysPT), which improves +kinematics-based motion estimates and infers motion forces. PhysPT exploits a +Transformer encoder-decoder backbone to effectively learn human dynamics in a +self-supervised manner. Moreover, it incorporates physics principles governing +human motion. Specifically, we build a physics-based body representation and +contact force model. We leverage them to impose novel physics-inspired training +losses (i.e., force loss, contact loss, and Euler-Lagrange loss), enabling +PhysPT to capture physical properties of the human body and the forces it +experiences. Experiments demonstrate that, once trained, PhysPT can be directly +applied to kinematics-based estimates to significantly enhance their physical +plausibility and generate favourable motion forces. Furthermore, we show that +these physically meaningful quantities translate into improved accuracy of an +important downstream task: human action recognition.",cs.CV,['cs.CV'] +Self-Distilled Masked Auto-Encoders are Efficient Video Anomaly Detectors,Nicolae Ristea · Florinel Croitoru · Radu Tudor Ionescu · Marius Popescu · Fahad Shahbaz Khan · Mubarak Shah,https://github.com/ristea/aed-mae/tree/main,https://arxiv.org/abs/2306.12041v2,,2306.12041v2.pdf,Self-Distilled Masked Auto-Encoders are Efficient Video Anomaly Detectors,"We propose an efficient abnormal event detection model based on a lightweight +masked auto-encoder (AE) applied at the video frame level. The novelty of the +proposed model is threefold. 
First, we introduce an approach to weight tokens +based on motion gradients, thus shifting the focus from the static background +scene to the foreground objects. Second, we integrate a teacher decoder and a +student decoder into our architecture, leveraging the discrepancy between the +outputs given by the two decoders to improve anomaly detection. Third, we +generate synthetic abnormal events to augment the training videos, and task the +masked AE model to jointly reconstruct the original frames (without anomalies) +and the corresponding pixel-level anomaly maps. Our design leads to an +efficient and effective model, as demonstrated by the extensive experiments +carried out on four benchmarks: Avenue, ShanghaiTech, UBnormal and UCSD Ped2. +The empirical results show that our model achieves an excellent trade-off +between speed and accuracy, obtaining competitive AUC scores, while processing +1655 FPS. Hence, our model is between 8 and 70 times faster than competing +methods. We also conduct an ablation study to justify our design. Our code is +freely available at: https://github.com/ristea/aed-mae.",cs.CV,"['cs.CV', 'cs.LG']" +Prompt-Driven Dynamic Object-Centric Learning for Single Domain Generalization,Deng Li · Aming Wu · Yaowei Wang · Yahong Han, ,https://arxiv.org/abs/2402.18447,,2402.18447.pdf,Prompt-Driven Dynamic Object-Centric Learning for Single Domain Generalization,"Single-domain generalization aims to learn a model from single source domain +data to achieve generalized performance on other unseen target domains. +Existing works primarily focus on improving the generalization ability of +static networks. However, static networks are unable to dynamically adapt to +the diverse variations in different image scenes, leading to limited +generalization capability. Different scenes exhibit varying levels of +complexity, and the complexity of images further varies significantly in +cross-domain scenarios. In this paper, we propose a dynamic object-centric +perception network based on prompt learning, aiming to adapt to the variations +in image complexity. Specifically, we propose an object-centric gating module +based on prompt learning to focus attention on the object-centric features +guided by the various scene prompts. Then, with the object-centric gating +masks, the dynamic selective module dynamically selects highly correlated +feature regions in both spatial and channel dimensions enabling the model to +adaptively perceive object-centric relevant features, thereby enhancing the +generalization capability. Extensive experiments were conducted on +single-domain generalization tasks in image classification and object +detection. The experimental results demonstrate that our approach outperforms +state-of-the-art methods, which validates the effectiveness and generally of +our proposed method.",cs.CV,['cs.CV'] +PairAug: What Can Augmented Image-Text Pairs Do for Radiology?,Yutong Xie · Qi Chen · Sinuo Wang · Minh-Son To · Iris Lee · Ee Win Khoo · Kerolos Hendy · Daniel Koh · Yong Xia · Qi Wu, ,https://arxiv.org/abs/2404.04960,,2404.04960.pdf,PairAug: What Can Augmented Image-Text Pairs Do for Radiology?,"Current vision-language pre-training (VLP) methodologies predominantly depend +on paired image-text datasets, a resource that is challenging to acquire in +radiology due to privacy considerations and labelling complexities. 
Data +augmentation provides a practical solution to overcome the issue of data +scarcity, however, most augmentation methods exhibit a limited focus, +prioritising either image or text augmentation exclusively. Acknowledging this +limitation, our objective is to devise a framework capable of concurrently +augmenting medical image and text data. We design a Pairwise Augmentation +(PairAug) approach that contains an Inter-patient Augmentation (InterAug) +branch and an Intra-patient Augmentation (IntraAug) branch. Specifically, the +InterAug branch of our approach generates radiology images using synthesised +yet plausible reports derived from a Large Language Model (LLM). The generated +pairs can be considered a collection of new patient cases since they are +artificially created and may not exist in the original dataset. In contrast, +the IntraAug branch uses newly generated reports to manipulate images. This +process allows us to create new paired data for each individual with diverse +medical conditions. Our extensive experiments on various downstream tasks +covering medical image classification zero-shot and fine-tuning analysis +demonstrate that our PairAug, concurrently expanding both image and text data, +substantially outperforms image-/text-only expansion baselines and advanced +medical VLP baselines. Our code is released at +\url{https://github.com/YtongXie/PairAug}.",cs.CV,['cs.CV'] +CLIP-Driven Open-Vocabulary 3D Scene Graph Generation via Cross-Modality Contrastive Learning,Lianggangxu Chen · Xuejiao Wang · Jiale Lu · Shaohui Lin · Changbo Wang · Gaoqi He, ,https://arxiv.org/abs/2309.16650,,2309.16650.pdf,ConceptGraphs: Open-Vocabulary 3D Scene Graphs for Perception and Planning,"For robots to perform a wide variety of tasks, they require a 3D +representation of the world that is semantically rich, yet compact and +efficient for task-driven perception and planning. Recent approaches have +attempted to leverage features from large vision-language models to encode +semantics in 3D representations. However, these approaches tend to produce maps +with per-point feature vectors, which do not scale well in larger environments, +nor do they contain semantic spatial relationships between entities in the +environment, which are useful for downstream planning. In this work, we propose +ConceptGraphs, an open-vocabulary graph-structured representation for 3D +scenes. ConceptGraphs is built by leveraging 2D foundation models and fusing +their output to 3D by multi-view association. The resulting representations +generalize to novel semantic classes, without the need to collect large 3D +datasets or finetune models. We demonstrate the utility of this representation +through a number of downstream planning tasks that are specified through +abstract (language) prompts and require complex reasoning over spatial and +semantic concepts. (Project page: https://concept-graphs.github.io/ Explainer +video: https://youtu.be/mRhNkQwRYnc )",cs.RO,"['cs.RO', 'cs.CV']" +Initialization Matters for Adversarial Transfer Learning,Andong Hua · Jindong Gu · Zhiyu Xue · Nicholas Carlini · Eric Wong · Yao Qin, ,https://arxiv.org/abs/2312.05716,,2312.05716.pdf,Initialization Matters for Adversarial Transfer Learning,"With the prevalence of the Pretraining-Finetuning paradigm in transfer +learning, the robustness of downstream tasks has become a critical concern. 
In +this work, we delve into adversarial robustness in transfer learning and reveal +the critical role of initialization, including both the pretrained model and +the linear head. First, we discover the necessity of an adversarially robust +pretrained model. Specifically, we reveal that with a standard pretrained +model, Parameter-Efficient Finetuning (PEFT) methods either fail to be +adversarially robust or continue to exhibit significantly degraded adversarial +robustness on downstream tasks, even with adversarial training during +finetuning. Leveraging a robust pretrained model, surprisingly, we observe that +a simple linear probing can outperform full finetuning and other PEFT methods +with random initialization on certain datasets. We further identify that linear +probing excels in preserving robustness from the robust pretraining. Based on +this, we propose Robust Linear Initialization (RoLI) for adversarial +finetuning, which initializes the linear head with the weights obtained by +adversarial linear probing to maximally inherit the robustness from +pretraining. Across five different image classification datasets, we +demonstrate the effectiveness of RoLI and achieve new state-of-the-art results. +Our code is available at \url{https://github.com/DongXzz/RoLI}.",cs.CV,['cs.CV'] +PEGASUS: Personalized Generative 3D Avatars with Composable Attributes,Hyunsoo Cha · Byungjun Kim · Hanbyul Joo, ,https://arxiv.org/abs/2402.10636,,2402.10636.pdf,PEGASUS: Personalized Generative 3D Avatars with Composable Attributes,"We present PEGASUS, a method for constructing a personalized generative 3D +face avatar from monocular video sources. Our generative 3D avatar enables +disentangled controls to selectively alter the facial attributes (e.g., hair or +nose) while preserving the identity. Our approach consists of two stages: +synthetic database generation and constructing a personalized generative +avatar. We generate a synthetic video collection of the target identity with +varying facial attributes, where the videos are synthesized by borrowing the +attributes from monocular videos of diverse identities. Then, we build a +person-specific generative 3D avatar that can modify its attributes +continuously while preserving its identity. Through extensive experiments, we +demonstrate that our method of generating a synthetic database and creating a +3D generative avatar is the most effective in preserving identity while +achieving high realism. Subsequently, we introduce a zero-shot approach to +achieve the same goal of generative modeling more efficiently by leveraging a +previously constructed personalized generative model.",cs.CV,['cs.CV'] +FedHCA$^2$: Towards Hetero-Client Federated Multi-Task Learning,Yuxiang Lu · Suizhi Huang · Yuwen Yang · Shalayiding Sirejiding · Yue Ding · Hongtao Lu,https://github.com/innovator-zero/FedHCA2,https://arxiv.org/abs/2311.13250v2,,2311.13250v2.pdf,FedHCA$^2$: Towards Hetero-Client Federated Multi-Task Learning,"Federated Learning (FL) enables joint training across distributed clients +using their local data privately. Federated Multi-Task Learning (FMTL) builds +on FL to handle multiple tasks, assuming model congruity that identical model +architecture is deployed in each client. To relax this assumption and thus +extend real-world applicability, we introduce a novel problem setting, +Hetero-Client Federated Multi-Task Learning (HC-FMTL), to accommodate diverse +task setups. 
The main challenge of HC-FMTL is the model incongruity issue that +invalidates conventional aggregation methods. It also escalates the +difficulties in accurate model aggregation to deal with data and task +heterogeneity inherent in FMTL. To address these challenges, we propose the +FedHCA$^2$ framework, which allows for federated training of personalized +models by modeling relationships among heterogeneous clients. Drawing on our +theoretical insights into the difference between multi-task and federated +optimization, we propose the Hyper Conflict-Averse Aggregation scheme to +mitigate conflicts during encoder updates. Additionally, inspired by task +interaction in MTL, the Hyper Cross Attention Aggregation scheme uses +layer-wise cross attention to enhance decoder interactions while alleviating +model incongruity. Moreover, we employ learnable Hyper Aggregation Weights for +each client to customize personalized parameter updates. Extensive experiments +demonstrate the superior performance of FedHCA$^2$ in various HC-FMTL scenarios +compared to representative methods. Our code will be made publicly available.",cs.CV,"['cs.CV', 'cs.LG']" +Adapt Before Comparison: A New Perspective on Cross-Domain Few-Shot Segmentation,Jonas Herzog, ,https://arxiv.org/abs/2402.17614,,2402.17614.pdf,Adapt Before Comparison: A New Perspective on Cross-Domain Few-Shot Segmentation,"Few-shot segmentation performance declines substantially when facing images +from a domain different than the training domain, effectively limiting +real-world use cases. To alleviate this, recently cross-domain few-shot +segmentation (CD-FSS) has emerged. Works that address this task mainly +attempted to learn segmentation on a source domain in a manner that generalizes +across domains. Surprisingly, we can outperform these approaches while +eliminating the training stage and removing their main segmentation network. We +show test-time task-adaption is the key for successful CD-FSS instead. +Task-adaption is achieved by appending small networks to the feature pyramid of +a conventionally classification-pretrained backbone. To avoid overfitting to +the few labeled samples in supervised fine-tuning, consistency across augmented +views of input images serves as guidance while learning the parameters of the +attached layers. Despite our self-restriction not to use any images other than +the few labeled samples at test time, we achieve new state-of-the-art +performance in CD-FSS, evidencing the need to rethink approaches for the task.",cs.CV,['cs.CV'] +TI2V-Zero: Zero-Shot Image Conditioning for Text-to-Video Diffusion Models,Haomiao Ni · Bernhard Egger · Suhas Lohit · Anoop Cherian · Ye Wang · Toshiaki Koike-Akino · Sharon X. Huang · Tim Marks, ,https://arxiv.org/abs/2404.16306,,2404.16306.pdf,TI2V-Zero: Zero-Shot Image Conditioning for Text-to-Video Diffusion Models,"Text-conditioned image-to-video generation (TI2V) aims to synthesize a +realistic video starting from a given image (e.g., a woman's photo) and a text +description (e.g., ""a woman is drinking water.""). Existing TI2V frameworks +often require costly training on video-text datasets and specific model designs +for text and image conditioning. In this paper, we propose TI2V-Zero, a +zero-shot, tuning-free method that empowers a pretrained text-to-video (T2V) +diffusion model to be conditioned on a provided image, enabling TI2V generation +without any optimization, fine-tuning, or introducing external modules. 
Our +approach leverages a pretrained T2V diffusion foundation model as the +generative prior. To guide video generation with the additional image input, we +propose a ""repeat-and-slide"" strategy that modulates the reverse denoising +process, allowing the frozen diffusion model to synthesize a video +frame-by-frame starting from the provided image. To ensure temporal continuity, +we employ a DDPM inversion strategy to initialize Gaussian noise for each newly +synthesized frame and a resampling technique to help preserve visual details. +We conduct comprehensive experiments on both domain-specific and open-domain +datasets, where TI2V-Zero consistently outperforms a recent open-domain TI2V +model. Furthermore, we show that TI2V-Zero can seamlessly extend to other tasks +such as video infilling and prediction when provided with more images. Its +autoregressive design also supports long video generation.",cs.CV,['cs.CV'] +Atom-Level Optical Chemical Structure Recognition with Limited Supervision,Martijn Oldenhof · Edward De Brouwer · Adam Arany · Yves Moreau,https://github.com/molden/atomlenz,https://arxiv.org/abs/2404.01743,,2404.01743.pdf,Atom-Level Optical Chemical Structure Recognition with Limited Supervision,"Identifying the chemical structure from a graphical representation, or image, +of a molecule is a challenging pattern recognition task that would greatly +benefit drug development. Yet, existing methods for chemical structure +recognition do not typically generalize well, and show diminished effectiveness +when confronted with domains where data is sparse, or costly to generate, such +as hand-drawn molecule images. To address this limitation, we propose a new +chemical structure recognition tool that delivers state-of-the-art performance +and can adapt to new domains with a limited number of data samples and +supervision. Unlike previous approaches, our method provides atom-level +localization, and can therefore segment the image into the different atoms and +bonds. Our model is the first model to perform OCSR with atom-level entity +detection with only SMILES supervision. Through rigorous and extensive +benchmarking, we demonstrate the preeminence of our chemical structure +recognition approach in terms of data efficiency, accuracy, and atom-level +entity prediction.",cs.CV,['cs.CV'] +SubT-MRS Datasets: Pushing SLAM Towards All-weather Environments,Shibo Zhao · Yuanjun Gao · Tianhao Wu · Damanpreet Singh · Rushan Jiang · Haoxiang Sun · Mansi Sarawata · Warren Whittaker · Ian Higgins · Shaoshu Su · Yi Du · Can Xu · John Keller · Jay Karhade · Lucas Nogueira · Sourojit Saha · Yuheng Qiu · Ji Zhang · Wenshan Wang · Chen Wang · Sebastian Scherer,https://superodometry.com/datasets,https://arxiv.org/abs/2307.07607,,2307.07607.pdf,SubT-MRS Dataset: Pushing SLAM Towards All-weather Environments,"Simultaneous localization and mapping (SLAM) is a fundamental task for +numerous applications such as autonomous navigation and exploration. Despite +many SLAM datasets have been released, current SLAM solutions still struggle to +have sustained and resilient performance. One major issue is the absence of +high-quality datasets including diverse all-weather conditions and a reliable +metric for assessing robustness. This limitation significantly restricts the +scalability and generalizability of SLAM technologies, impacting their +development, validation, and deployment. 
To address this problem, we present +SubT-MRS, an extremely challenging real-world dataset designed to push SLAM +towards all-weather environments to pursue the most robust SLAM performance. It +contains multi-degraded environments including over 30 diverse scenes such as +structureless corridors, varying lighting conditions, and perceptual obscurants +like smoke and dust; multimodal sensors such as LiDAR, fisheye camera, IMU, and +thermal camera; and multiple locomotions like aerial, legged, and wheeled +robots. We develop accuracy and robustness evaluation tracks for SLAM and +introduced novel robustness metrics. Comprehensive studies are performed, +revealing new observations, challenges, and opportunities for future research.",cs.RO,['cs.RO'] +Class Incremental Learning with Multi-Teacher Distillation,Haitao Wen · Lili Pan · Yu Dai · Heqian Qiu · Lanxiao Wang · Qingbo Wu · Hongliang Li, ,https://arxiv.org/abs/2306.17560,,2306.17560.pdf,Class-Incremental Learning using Diffusion Model for Distillation and Replay,"Class-incremental learning aims to learn new classes in an incremental +fashion without forgetting the previously learned ones. Several research works +have shown how additional data can be used by incremental models to help +mitigate catastrophic forgetting. In this work, following the recent +breakthrough in text-to-image generative models and their wide distribution, we +propose the use of a pretrained Stable Diffusion model as a source of +additional data for class-incremental learning. Compared to competitive methods +that rely on external, often unlabeled, datasets of real images, our approach +can generate synthetic samples belonging to the same classes as the previously +encountered images. This allows us to use those additional data samples not +only in the distillation loss but also for replay in the classification loss. +Experiments on the competitive benchmarks CIFAR100, ImageNet-Subset, and +ImageNet demonstrate how this new approach can be used to further improve the +performance of state-of-the-art methods for class-incremental learning on large +scale datasets.",cs.LG,"['cs.LG', 'cs.CV']" +MPOD123: One Image to 3D Content Generation Using Mask-enhanced Progressive Outline-to-Detail Optimization,Jimin Xu · Tianbao Wang · Tao Jin · Shengyu Zhang · Dongjie Fu · Zhe Wang · Jiangjing Lyu · Chengfei Lv · Chaoyue Niu · Zhou Yu · Zhou Zhao · Fei Wu,https://mpod-123.github.io/,https://arxiv.org/abs/2306.17843,,2306.17843.pdf,Magic123: One Image to High-Quality 3D Object Generation Using Both 2D and 3D Diffusion Priors,"We present Magic123, a two-stage coarse-to-fine approach for high-quality, +textured 3D meshes generation from a single unposed image in the wild using +both2D and 3D priors. In the first stage, we optimize a neural radiance field +to produce a coarse geometry. In the second stage, we adopt a memory-efficient +differentiable mesh representation to yield a high-resolution mesh with a +visually appealing texture. In both stages, the 3D content is learned through +reference view supervision and novel views guided by a combination of 2D and 3D +diffusion priors. We introduce a single trade-off parameter between the 2D and +3D priors to control exploration (more imaginative) and exploitation (more +precise) of the generated geometry. Additionally, we employ textual inversion +and monocular depth regularization to encourage consistent appearances across +views and to prevent degenerate solutions, respectively. 
Magic123 demonstrates +a significant improvement over previous image-to-3D techniques, as validated +through extensive experiments on synthetic benchmarks and diverse real-world +images. Our code, models, and generated 3D assets are available at +https://github.com/guochengqian/Magic123.",cs.CV,['cs.CV'] +RadSimReal: Bridging the Gap Between Synthetic and Real Data in Radar Object Detection With Simulation,Oded Bialer · Yuval Haitman,https://yuvalhg.github.io/RadSimReal/,https://arxiv.org/abs/2404.18150,,2404.18150.pdf,RadSimReal: Bridging the Gap Between Synthetic and Real Data in Radar Object Detection With Simulation,"Object detection in radar imagery with neural networks shows great potential +for improving autonomous driving. However, obtaining annotated datasets from +real radar images, crucial for training these networks, is challenging, +especially in scenarios with long-range detection and adverse weather and +lighting conditions where radar performance excels. To address this challenge, +we present RadSimReal, an innovative physical radar simulation capable of +generating synthetic radar images with accompanying annotations for various +radar types and environmental conditions, all without the need for real data +collection. Remarkably, our findings demonstrate that training object detection +models on RadSimReal data and subsequently evaluating them on real-world data +produce performance levels comparable to models trained and tested on real data +from the same dataset, and even achieves better performance when testing across +different real datasets. RadSimReal offers advantages over other physical radar +simulations that it does not necessitate knowledge of the radar design details, +which are often not disclosed by radar suppliers, and has faster run-time. This +innovative tool has the potential to advance the development of computer vision +algorithms for radar-based autonomous driving applications.",cs.CV,['cs.CV'] +AEROBLADE: Training-Free Detection of Latent Diffusion Images Using Autoencoder Reconstruction Error,Jonas Ricker · Denis Lukovnikov · Asja Fischer, ,https://arxiv.org/abs/2401.17879,,2401.17879.pdf,AEROBLADE: Training-Free Detection of Latent Diffusion Images Using Autoencoder Reconstruction Error,"With recent text-to-image models, anyone can generate deceptively realistic +images with arbitrary contents, fueling the growing threat of visual +disinformation. A key enabler for generating high-resolution images with low +computational cost has been the development of latent diffusion models (LDMs). +In contrast to conventional diffusion models, LDMs perform the denoising +process in the low-dimensional latent space of a pre-trained autoencoder (AE) +instead of the high-dimensional image space. Despite their relevance, the +forensic analysis of LDMs is still in its infancy. In this work we propose +AEROBLADE, a novel detection method which exploits an inherent component of +LDMs: the AE used to transform images between image and latent space. We find +that generated images can be more accurately reconstructed by the AE than real +images, allowing for a simple detection approach based on the reconstruction +error. Most importantly, our method is easy to implement and does not require +any training, yet nearly matches the performance of detectors that rely on +extensive training. We empirically demonstrate that AEROBLADE is effective +against state-of-the-art LDMs, including Stable Diffusion and Midjourney. 
+Beyond detection, our approach allows for the qualitative analysis of images, +which can be leveraged for identifying inpainted regions. We release our code +and data at https://github.com/jonasricker/aeroblade .",cs.CV,['cs.CV'] +"Move as You Say, Interact as You Can: Language-guided Human Motion Generation with Scene Affordance",Zan Wang · Yixin Chen · Baoxiong Jia · Puhao Li · Jinlu Zhang · Jingze Zhang · Tengyu Liu · Yixin Zhu · Wei Liang · Siyuan Huang,https://afford-motion.github.io/,https://arxiv.org/abs/2403.18036,,2403.18036.pdf,"Move as You Say, Interact as You Can: Language-guided Human Motion Generation with Scene Affordance","Despite significant advancements in text-to-motion synthesis, generating +language-guided human motion within 3D environments poses substantial +challenges. These challenges stem primarily from (i) the absence of powerful +generative models capable of jointly modeling natural language, 3D scenes, and +human motion, and (ii) the generative models' intensive data requirements +contrasted with the scarcity of comprehensive, high-quality, +language-scene-motion datasets. To tackle these issues, we introduce a novel +two-stage framework that employs scene affordance as an intermediate +representation, effectively linking 3D scene grounding and conditional motion +generation. Our framework comprises an Affordance Diffusion Model (ADM) for +predicting explicit affordance map and an Affordance-to-Motion Diffusion Model +(AMDM) for generating plausible human motions. By leveraging scene affordance +maps, our method overcomes the difficulty in generating human motion under +multimodal condition signals, especially when training with limited data +lacking extensive language-scene-motion pairs. Our extensive experiments +demonstrate that our approach consistently outperforms all baselines on +established benchmarks, including HumanML3D and HUMANISE. Additionally, we +validate our model's exceptional generalization capabilities on a specially +curated evaluation set featuring previously unseen descriptions and scenes.",cs.CV,['cs.CV'] +SignGraph: A Sign Sequence is Worth Graphs of Nodes,Shiwei Gan · Yafeng Yin · Zhiwei Jiang · Hongkai Wen · Lei Xie · Sanglu Lu,https://github.com/gswycf/SignGraph,,https://www.semanticscholar.org/paper/Towards-Real-Time-Sign-Language-Recognition-and-on-Gan-Yin/dba462bcf68db62a4722c7f220f38461ff981f15,,,,,nan +Animating General Image with Large Visual Motion Model,Dengsheng Chen · Xiaoming Wei · Xiaolin Wei, ,https://arxiv.org/abs/2311.12886,,2311.12886.pdf,AnimateAnything: Fine-Grained Open Domain Image Animation with Motion Guidance,"Image animation is a key task in computer vision which aims to generate +dynamic visual content from static image. Recent image animation methods employ +neural based rendering technique to generate realistic animations. Despite +these advancements, achieving fine-grained and controllable image animation +guided by text remains challenging, particularly for open-domain images +captured in diverse real environments. In this paper, we introduce an open +domain image animation method that leverages the motion prior of video +diffusion model. Our approach introduces targeted motion area guidance and +motion strength guidance, enabling precise control the movable area and its +motion speed. This results in enhanced alignment between the animated visual +elements and the prompting text, thereby facilitating a fine-grained and +interactive animation generation process for intricate motion sequences. 
We +validate the effectiveness of our method through rigorous experiments on an +open-domain dataset, with the results showcasing its superior performance. +Project page can be found at https://animationai.github.io/AnimateAnything.",cs.CV,['cs.CV'] +DyBluRF: Dynamic Neural Radiance Fields from Blurry Monocular Video,Huiqiang Sun · Xingyi Li · Liao Shen · Xinyi Ye · Ke Xian · Zhiguo Cao, ,https://arxiv.org/abs/2403.10103,,2403.10103.pdf,DyBluRF: Dynamic Neural Radiance Fields from Blurry Monocular Video,"Recent advancements in dynamic neural radiance field methods have yielded +remarkable outcomes. However, these approaches rely on the assumption of sharp +input images. When faced with motion blur, existing dynamic NeRF methods often +struggle to generate high-quality novel views. In this paper, we propose +DyBluRF, a dynamic radiance field approach that synthesizes sharp novel views +from a monocular video affected by motion blur. To account for motion blur in +input images, we simultaneously capture the camera trajectory and object +Discrete Cosine Transform (DCT) trajectories within the scene. Additionally, we +employ a global cross-time rendering approach to ensure consistent temporal +coherence across the entire scene. We curate a dataset comprising diverse +dynamic scenes that are specifically tailored for our task. Experimental +results on our dataset demonstrate that our method outperforms existing +approaches in generating sharp novel views from motion-blurred inputs while +maintaining spatial-temporal consistency of the scene.",cs.CV,['cs.CV'] +Dynamic Policy-Driven Adaptive Multi-Instance Learning for Whole Slide Image Classification,Tingting Zheng · Kui Jiang · Hongxun Yao,https://vilab.hit.edu.cn/projects/pamil,https://arxiv.org/abs/2403.07939,,2403.07939.pdf,Dynamic Policy-Driven Adaptive Multi-Instance Learning for Whole Slide Image Classification,"Multi-Instance Learning (MIL) has shown impressive performance for +histopathology whole slide image (WSI) analysis using bags or pseudo-bags. It +involves instance sampling, feature representation, and decision-making. +However, existing MIL-based technologies at least suffer from one or more of +the following problems: 1) requiring high storage and intensive pre-processing +for numerous instances (sampling); 2) potential over-fitting with limited +knowledge to predict bag labels (feature representation); 3) pseudo-bag counts +and prior biases affect model robustness and generalizability +(decision-making). Inspired by clinical diagnostics, using the past sampling +instances can facilitate the final WSI analysis, but it is barely explored in +prior technologies. To break free these limitations, we integrate the dynamic +instance sampling and reinforcement learning into a unified framework to +improve the instance selection and feature aggregation, forming a novel Dynamic +Policy Instance Selection (DPIS) scheme for better and more credible +decision-making. Specifically, the measurement of feature distance and reward +function are employed to boost continuous instance sampling. To alleviate the +over-fitting, we explore the latent global relations among instances for more +robust and discriminative feature representation while establishing reward and +punishment mechanisms to correct biases in pseudo-bags using contrastive +learning. These strategies form the final Dynamic Policy-Driven Adaptive +Multi-Instance Learning (PAMIL) method for WSI tasks. 
Extensive experiments +reveal that our PAMIL method outperforms the state-of-the-art by 3.8\% on +CAMELYON16 and 4.4\% on TCGA lung cancer datasets.",cs.CV,['cs.CV'] +OmniLocalRF: Omnidirectional Local Radiance Fields from Dynamic Videos,Dongyoung Choi · Hyeonjoong Jang · Min H. Kim,https://vclab.kaist.ac.kr/cvpr2024p1,https://arxiv.org/abs/2404.00676,,2404.00676.pdf,OmniLocalRF: Omnidirectional Local Radiance Fields from Dynamic Videos,"Omnidirectional cameras are extensively used in various applications to +provide a wide field of vision. However, they face a challenge in synthesizing +novel views due to the inevitable presence of dynamic objects, including the +photographer, in their wide field of view. In this paper, we introduce a new +approach called Omnidirectional Local Radiance Fields (OmniLocalRF) that can +render static-only scene views, removing and inpainting dynamic objects +simultaneously. Our approach combines the principles of local radiance fields +with the bidirectional optimization of omnidirectional rays. Our input is an +omnidirectional video, and we evaluate the mutual observations of the entire +angle between the previous and current frames. To reduce ghosting artifacts of +dynamic objects and inpaint occlusions, we devise a multi-resolution motion +mask prediction module. Unlike existing methods that primarily separate dynamic +components through the temporal domain, our method uses multi-resolution neural +feature planes for precise segmentation, which is more suitable for long +360-degree videos. Our experiments validate that OmniLocalRF outperforms +existing methods in both qualitative and quantitative metrics, especially in +scenarios with complex real-world scenes. In particular, our approach +eliminates the need for manual interaction, such as drawing motion masks by +hand and additional pose estimation, making it a highly effective and efficient +solution.",cs.CV,"['cs.CV', 'cs.GR']" +VBench: Comprehensive Benchmark Suite for Video Generative Models,Ziqi Huang · Yinan He · Jiashuo Yu · Fan Zhang · Chenyang Si · Yuming Jiang · Yuanhan Zhang · Tianxing Wu · Jin Qingyang · Nattapol Chanpaisit · Yaohui Wang · Xinyuan Chen · Limin Wang · Dahua Lin · Yu Qiao · Ziwei Liu,https://vchitect.github.io/VBench-project/,https://arxiv.org/abs/2311.17982,,2311.17982.pdf,VBench: Comprehensive Benchmark Suite for Video Generative Models,"Video generation has witnessed significant advancements, yet evaluating these +models remains a challenge. A comprehensive evaluation benchmark for video +generation is indispensable for two reasons: 1) Existing metrics do not fully +align with human perceptions; 2) An ideal evaluation system should provide +insights to inform future developments of video generation. To this end, we +present VBench, a comprehensive benchmark suite that dissects ""video generation +quality"" into specific, hierarchical, and disentangled dimensions, each with +tailored prompts and evaluation methods. VBench has three appealing properties: +1) Comprehensive Dimensions: VBench comprises 16 dimensions in video generation +(e.g., subject identity inconsistency, motion smoothness, temporal flickering, +and spatial relationship, etc). The evaluation metrics with fine-grained levels +reveal individual models' strengths and weaknesses. 2) Human Alignment: We also +provide a dataset of human preference annotations to validate our benchmarks' +alignment with human perception, for each evaluation dimension respectively. 
3) +Valuable Insights: We look into current models' ability across various +evaluation dimensions, and various content types. We also investigate the gaps +between video and image generation models. We will open-source VBench, +including all prompts, evaluation methods, generated videos, and human +preference annotations, and also include more video generation models in VBench +to drive forward the field of video generation.",cs.CV,['cs.CV'] +Privacy-preserving Optics for Enhancing Protection in Face De-identification,Jhon Lopez · Carlos Hinojosa · Henry Arguello · Bernard Ghanem,https://carloshinojosa.me/project/privacy-face-deid/,https://arxiv.org/abs/2404.00777,,2404.00777.pdf,Privacy-preserving Optics for Enhancing Protection in Face De-identification,"The modern surge in camera usage alongside widespread computer vision +technology applications poses significant privacy and security concerns. +Current artificial intelligence (AI) technologies aid in recognizing relevant +events and assisting in daily tasks in homes, offices, hospitals, etc. The need +to access or process personal information for these purposes raises privacy +concerns. While software-level solutions like face de-identification provide a +good privacy/utility trade-off, they present vulnerabilities to sniffing +attacks. In this paper, we propose a hardware-level face de-identification +method to solve this vulnerability. Specifically, our approach first learns an +optical encoder along with a regression model to obtain a face heatmap while +hiding the face identity from the source image. We also propose an +anonymization framework that generates a new face using the privacy-preserving +image, face heatmap, and a reference face image from a public dataset as input. +We validate our approach with extensive simulations and hardware experiments.",cs.CV,"['cs.CV', 'cs.AI', 'cs.CR', 'cs.LG', 'eess.IV']" +Bridging the Gap: A Unified Video Comprehension Framework for Moment Retrieval and Highlight Detection,Yicheng Xiao · Zhuoyan Luo · Yong Liu · Yue Ma · Hengwei Bian · Yatai Ji · Yujiu Yang · Xiu Li, ,https://arxiv.org/abs/2311.16464,,2311.16464.pdf,Bridging the Gap: A Unified Video Comprehension Framework for Moment Retrieval and Highlight Detection,"Video Moment Retrieval (MR) and Highlight Detection (HD) have attracted +significant attention due to the growing demand for video analysis. Recent +approaches treat MR and HD as similar video grounding problems and address them +together with transformer-based architecture. However, we observe that the +emphasis of MR and HD differs, with one necessitating the perception of local +relationships and the other prioritizing the understanding of global contexts. +Consequently, the lack of task-specific design will inevitably lead to +limitations in associating the intrinsic specialty of two tasks. To tackle the +issue, we propose a Unified Video COMprehension framework (UVCOM) to bridge the +gap and jointly solve MR and HD effectively. By performing progressive +integration on intra and inter-modality across multi-granularity, UVCOM +achieves the comprehensive understanding in processing a video. Moreover, we +present multi-aspect contrastive learning to consolidate the local relation +modeling and global knowledge accumulation via well aligned multi-modal space. 
+Extensive experiments on QVHighlights, Charades-STA, TACoS , YouTube Highlights +and TVSum datasets demonstrate the effectiveness and rationality of UVCOM which +outperforms the state-of-the-art methods by a remarkable margin.",cs.CV,"['cs.CV', 'cs.AI']" +Hyperbolic Learning with Synthetic Captions for Open-World Detection,Fanjie Kong · Yanbei Chen · Jiarui Cai · Davide Modolo, ,https://arxiv.org/abs/2404.05016,,2404.05016.pdf,Hyperbolic Learning with Synthetic Captions for Open-World Detection,"Open-world detection poses significant challenges, as it requires the +detection of any object using either object class labels or free-form texts. +Existing related works often use large-scale manual annotated caption datasets +for training, which are extremely expensive to collect. Instead, we propose to +transfer knowledge from vision-language models (VLMs) to enrich the +open-vocabulary descriptions automatically. Specifically, we bootstrap dense +synthetic captions using pre-trained VLMs to provide rich descriptions on +different regions in images, and incorporate these captions to train a novel +detector that generalizes to novel concepts. To mitigate the noise caused by +hallucination in synthetic captions, we also propose a novel hyperbolic +vision-language learning approach to impose a hierarchy between visual and +caption embeddings. We call our detector ``HyperLearner''. We conduct extensive +experiments on a wide variety of open-world detection benchmarks (COCO, LVIS, +Object Detection in the Wild, RefCOCO) and our results show that our model +consistently outperforms existing state-of-the-art methods, such as GLIP, +GLIPv2 and Grounding DINO, when using the same backbone.",cs.CV,['cs.CV'] +Coherence As Texture -- Passive Textureless 3D Reconstruction by Self-interference,Wei-Yu Chen · Aswin C. Sankaranarayanan · Anat Levin · Matthew O’Toole, ,,https://onlinelibrary.wiley.com/doi/10.1002/lpor.202301155,,,,,nan +Efficient and Effective Weakly-Supervised Action Segmentation via Action-Transition-Aware Boundary Alignment,Angchi Xu · Wei-Shi Zheng, ,https://arxiv.org/abs/2403.19225,,2403.19225.pdf,Efficient and Effective Weakly-Supervised Action Segmentation via Action-Transition-Aware Boundary Alignment,"Weakly-supervised action segmentation is a task of learning to partition a +long video into several action segments, where training videos are only +accompanied by transcripts (ordered list of actions). Most of existing methods +need to infer pseudo segmentation for training by serial alignment between all +frames and the transcript, which is time-consuming and hard to be parallelized +while training. In this work, we aim to escape from this inefficient alignment +with massive but redundant frames, and instead to directly localize a few +action transitions for pseudo segmentation generation, where a transition +refers to the change from an action segment to its next adjacent one in the +transcript. As the true transitions are submerged in noisy boundaries due to +intra-segment visual variation, we propose a novel Action-Transition-Aware +Boundary Alignment (ATBA) framework to efficiently and effectively filter out +noisy boundaries and detect transitions. In addition, to boost the semantic +learning in the case that noise is inevitably present in the pseudo +segmentation, we also introduce video-level losses to utilize the trusted +video-level supervision. 
Extensive experiments show the effectiveness of our +approach on both performance and training speed.",cs.CV,['cs.CV'] +Physics-aware Hand-object Interaction Denoising,Haowen Luo · Yunze Liu · Li Yi, ,https://arxiv.org/abs/2405.11481,,2405.11481.pdf,Physics-aware Hand-object Interaction Denoising,"The credibility and practicality of a reconstructed hand-object interaction +sequence depend largely on its physical plausibility. However, due to high +occlusions during hand-object interaction, physical plausibility remains a +challenging criterion for purely vision-based tracking methods. To address this +issue and enhance the results of existing hand trackers, this paper proposes a +novel physically-aware hand motion de-noising method. Specifically, we +introduce two learned loss terms that explicitly capture two crucial aspects of +physical plausibility: grasp credibility and manipulation feasibility. These +terms are used to train a physically-aware de-noising network. Qualitative and +quantitative experiments demonstrate that our approach significantly improves +both fine-grained physical plausibility and overall pose accuracy, surpassing +current state-of-the-art de-noising methods.",cs.CV,['cs.CV'] +ToNNO: Tomographic Reconstruction of a Neural Network’s Output for Weakly Supervised Segmentation of 3D Medical Images,Marius Schmidt-Mengin · Alexis Benichoux · Shibeshih Belachew · Nikos Komodakis · Nikos Paragios, ,https://arxiv.org/abs/2404.13103,,2404.13103.pdf,ToNNO: Tomographic Reconstruction of a Neural Network's Output for Weakly Supervised Segmentation of 3D Medical Images,"Annotating lots of 3D medical images for training segmentation models is +time-consuming. The goal of weakly supervised semantic segmentation is to train +segmentation models without using any ground truth segmentation masks. Our work +addresses the case where only image-level categorical labels, indicating the +presence or absence of a particular region of interest (such as tumours or +lesions), are available. Most existing methods rely on class activation mapping +(CAM). We propose a novel approach, ToNNO, which is based on the Tomographic +reconstruction of a Neural Network's Output. Our technique extracts stacks of +slices with different angles from the input 3D volume, feeds these slices to a +2D encoder, and applies the inverse Radon transform in order to reconstruct a +3D heatmap of the encoder's predictions. This generic method allows to perform +dense prediction tasks on 3D volumes using any 2D image encoder. We apply it to +weakly supervised medical image segmentation by training the 2D encoder to +output high values for slices containing the regions of interest. We test it on +four large scale medical image datasets and outperform 2D CAM methods. We then +extend ToNNO by combining tomographic reconstruction with CAM methods, +proposing Averaged CAM and Tomographic CAM, which obtain even better results.",eess.IV,"['eess.IV', 'cs.CV', 'cs.LG']" +An Aggregation-Free Federated Learning for Tackling Data Heterogeneity,Yuan Wang · Huazhu Fu · Renuga Kanagavelu · Qingsong Wei · Yong Liu · Rick Goh, ,https://arxiv.org/abs/2404.18962,,2404.18962.pdf,An Aggregation-Free Federated Learning for Tackling Data Heterogeneity,"The performance of Federated Learning (FL) hinges on the effectiveness of +utilizing knowledge from distributed datasets. 
Traditional FL methods adopt an +aggregate-then-adapt framework, where clients update local models based on a +global model aggregated by the server from the previous training round. This +process can cause client drift, especially with significant cross-client data +heterogeneity, impacting model performance and convergence of the FL algorithm. +To address these challenges, we introduce FedAF, a novel aggregation-free FL +algorithm. In this framework, clients collaboratively learn condensed data by +leveraging peer knowledge, the server subsequently trains the global model +using the condensed data and soft labels received from the clients. FedAF +inherently avoids the issue of client drift, enhances the quality of condensed +data amid notable data heterogeneity, and improves the global model +performance. Extensive numerical studies on several popular benchmark datasets +show FedAF surpasses various state-of-the-art FL algorithms in handling +label-skew and feature-skew data heterogeneity, leading to superior global +model accuracy and faster convergence.",cs.CV,"['cs.CV', 'cs.LG']" +HoloVIC: Large-scale Dataset and Benchmark for Multi-Sensor Holographic Intersection and Vehicle-Infrastructure Cooperative,CONG MA · Qiao Lei · Chengkai Zhu · Kai Liu · Zelong Kong · Liqing · Xueqi Zhou · Yuheng KAN · Wei Wu, ,https://arxiv.org/abs/2403.02640,,2403.02640.pdf,HoloVIC: Large-scale Dataset and Benchmark for Multi-Sensor Holographic Intersection and Vehicle-Infrastructure Cooperative,"Vehicle-to-everything (V2X) is a popular topic in the field of Autonomous +Driving in recent years. Vehicle-infrastructure cooperation (VIC) becomes one +of the important research area. Due to the complexity of traffic conditions +such as blind spots and occlusion, it greatly limits the perception +capabilities of single-view roadside sensing systems. To further enhance the +accuracy of roadside perception and provide better information to the vehicle +side, in this paper, we constructed holographic intersections with various +layouts to build a large-scale multi-sensor holographic vehicle-infrastructure +cooperation dataset, called HoloVIC. Our dataset includes 3 different types of +sensors (Camera, Lidar, Fisheye) and employs 4 sensor-layouts based on the +different intersections. Each intersection is equipped with 6-18 sensors to +capture synchronous data. While autonomous vehicles pass through these +intersections for collecting VIC data. HoloVIC contains in total on 100k+ +synchronous frames from different sensors. Additionally, we annotated 3D +bounding boxes based on Camera, Fisheye, and Lidar. We also associate the IDs +of the same objects across different devices and consecutive frames in +sequence. Based on HoloVIC, we formulated four tasks to facilitate the +development of related research. We also provide benchmarks for these tasks.",cs.CV,['cs.CV'] +OneFormer3D: One Transformer for Unified Point Cloud Segmentation,Maksim Kolodiazhnyi · Anna Vorontsova · Anton Konushin · Danila Rukhovich,https://github.com/oneformer3d/oneformer3d,https://arxiv.org/abs/2311.14405,,2311.14405.pdf,OneFormer3D: One Transformer for Unified Point Cloud Segmentation,"Semantic, instance, and panoptic segmentation of 3D point clouds have been +addressed using task-specific models of distinct design. Thereby, the +similarity of all segmentation tasks and the implicit relationship between them +have not been utilized effectively. This paper presents a unified, simple, and +effective model addressing all these tasks jointly. 
The model, named +OneFormer3D, performs instance and semantic segmentation consistently, using a +group of learnable kernels, where each kernel is responsible for generating a +mask for either an instance or a semantic category. These kernels are trained +with a transformer-based decoder with unified instance and semantic queries +passed as an input. Such a design enables training a model end-to-end in a +single run, so that it achieves top performance on all three segmentation tasks +simultaneously. Specifically, our OneFormer3D ranks 1st and sets a new +state-of-the-art (+2.1 mAP50) in the ScanNet test leaderboard. We also +demonstrate the state-of-the-art results in semantic, instance, and panoptic +segmentation of ScanNet (+21 PQ), ScanNet200 (+3.8 mAP50), and S3DIS (+0.8 +mIoU) datasets.",cs.CV,['cs.CV'] +Federated Online Adaptation for Deep Stereo,Matteo Poggi · Fabio Tosi,https://fedstereo.github.io/,http://export.arxiv.org/abs/2405.14873,,2405.14873.pdf,Federated Online Adaptation for Deep Stereo,"We introduce a novel approach for adapting deep stereo networks in a +collaborative manner. By building over principles of federated learning, we +develop a distributed framework allowing for demanding the optimization process +to a number of clients deployed in different environments. This makes it +possible, for a deep stereo network running on resourced-constrained devices, +to capitalize on the adaptation process carried out by other instances of the +same architecture, and thus improve its accuracy in challenging environments +even when it cannot carry out adaptation on its own. Experimental results show +how federated adaptation performs equivalently to on-device adaptation, and +even better when dealing with challenging environments.",cs.CV,['cs.CV'] +Learning Transferable Negative Prompts for Out-of-Distribution Detection,Tianqi Li · Guansong Pang · wenjun miao · Xiao Bai · Jin Zheng, ,,https://paperswithcode.com/paper/learning-transferable-negative-prompts-for,,,,,nan +JRDB-Social: A Multifaceted Robotic Dataset for Understanding of Context and Dynamics of Human Interactions Within Social Groups,Simindokht Jahangard · Zhixi Cai · Shiki Wen · Hamid Rezatofighi, ,https://arxiv.org/abs/2404.04458,,2404.04458.pdf,JRDB-Social: A Multifaceted Robotic Dataset for Understanding of Context and Dynamics of Human Interactions Within Social Groups,"Understanding human social behaviour is crucial in computer vision and +robotics. Micro-level observations like individual actions fall short, +necessitating a comprehensive approach that considers individual behaviour, +intra-group dynamics, and social group levels for a thorough understanding. To +address dataset limitations, this paper introduces JRDB-Social, an extension of +JRDB. Designed to fill gaps in human understanding across diverse indoor and +outdoor social contexts, JRDB-Social provides annotations at three levels: +individual attributes, intra-group interactions, and social group context. This +dataset aims to enhance our grasp of human social dynamics for robotic +applications. 
Utilizing the recent cutting-edge multi-modal large language +models, we evaluated our benchmark to explore their capacity to decipher social +human behaviour.",cs.CV,['cs.CV'] +Region-Based Representations Revisited,Michal Shlapentokh-Rothman · Ansel Blume · Yao Xiao · Yuqun Wu · Sethuraman T V · Heyi Tao · Jae Yong Lee · Wilfredo Torres-Calderon · Yu-Xiong Wang · Derek Hoiem, ,https://arxiv.org/abs/2402.02352,,2402.02352.pdf,Region-Based Representations Revisited,"We investigate whether region-based representations are effective for +recognition. Regions were once a mainstay in recognition approaches, but pixel +and patch-based features are now used almost exclusively. We show that recent +class-agnostic segmenters like SAM can be effectively combined with strong +unsupervised representations like DINOv2 and used for a wide variety of tasks, +including semantic segmentation, object-based image retrieval, and multi-image +analysis. Once the masks and features are extracted, these representations, +even with linear decoders, enable competitive performance, making them well +suited to applications that require custom queries. The compactness of the +representation also makes it well-suited to video analysis and other problems +requiring inference across many images.",cs.CV,['cs.CV'] +CycleINR: Cycle Implicit Neural Representation for Arbitrary-Scale Volumetric Super-Resolution of Medical Data,Wei Fang · Yuxing Tang · Heng Guo · Mingze Yuan · Tony C. W. MOK · Ke Yan · Jiawen Yao · Xin Chen · Zaiyi Liu · Le Lu · Ling Zhang · Minfeng Xu, ,https://arxiv.org/abs/2404.04878,,2404.04878.pdf,CycleINR: Cycle Implicit Neural Representation for Arbitrary-Scale Volumetric Super-Resolution of Medical Data,"In the realm of medical 3D data, such as CT and MRI images, prevalent +anisotropic resolution is characterized by high intra-slice but diminished +inter-slice resolution. The lowered resolution between adjacent slices poses +challenges, hindering optimal viewing experiences and impeding the development +of robust downstream analysis algorithms. Various volumetric super-resolution +algorithms aim to surmount these challenges, enhancing inter-slice resolution +and overall 3D medical imaging quality. However, existing approaches confront +inherent challenges: 1) often tailored to specific upsampling factors, lacking +flexibility for diverse clinical scenarios; 2) newly generated slices +frequently suffer from over-smoothing, degrading fine details, and leading to +inter-slice inconsistency. In response, this study presents CycleINR, a novel +enhanced Implicit Neural Representation model for 3D medical data volumetric +super-resolution. Leveraging the continuity of the learned implicit function, +the CycleINR model can achieve results with arbitrary up-sampling rates, +eliminating the need for separate training. Additionally, we enhance the grid +sampling in CycleINR with a local attention mechanism and mitigate +over-smoothing by integrating cycle-consistent loss. We introduce a new metric, +Slice-wise Noise Level Inconsistency (SNLI), to quantitatively assess +inter-slice noise level inconsistency. 
The effectiveness of our approach is +demonstrated through image quality evaluations on an in-house dataset and a +downstream task analysis on the Medical Segmentation Decathlon liver tumor +dataset.",eess.IV,"['eess.IV', 'cs.CV']" +"Video2Game: Real-time, Interactive, Realistic and Browser-Compatible Environment from a Single Video",Hongchi Xia · Chih-Hao Lin · Wei-Chiu Ma · Shenlong Wang, ,https://arxiv.org/abs/2404.09833v1,,2404.09833v1.pdf,"Video2Game: Real-time, Interactive, Realistic and Browser-Compatible Environment from a Single Video","Creating high-quality and interactive virtual environments, such as games and +simulators, often involves complex and costly manual modeling processes. In +this paper, we present Video2Game, a novel approach that automatically converts +videos of real-world scenes into realistic and interactive game environments. +At the heart of our system are three core components:(i) a neural radiance +fields (NeRF) module that effectively captures the geometry and visual +appearance of the scene; (ii) a mesh module that distills the knowledge from +NeRF for faster rendering; and (iii) a physics module that models the +interactions and physical dynamics among the objects. By following the +carefully designed pipeline, one can construct an interactable and actionable +digital replica of the real world. We benchmark our system on both indoor and +large-scale outdoor scenes. We show that we can not only produce +highly-realistic renderings in real-time, but also build interactive games on +top.",cs.CV,"['cs.CV', 'cs.AI']" +Task-Driven Exploration: Decoupling and Inter-Task Feedback for Joint Moment Retrieval and Highlight Detection,Jin Yang · Ping Wei · Huan Li · Ziyang Ren, ,https://arxiv.org/abs/2404.09263,,2404.09263.pdf,Task-Driven Exploration: Decoupling and Inter-Task Feedback for Joint Moment Retrieval and Highlight Detection,"Video moment retrieval and highlight detection are two highly valuable tasks +in video understanding, but until recently they have been jointly studied. +Although existing studies have made impressive advancement recently, they +predominantly follow the data-driven bottom-up paradigm. Such paradigm +overlooks task-specific and inter-task effects, resulting in poor model +performance. In this paper, we propose a novel task-driven top-down framework +TaskWeave for joint moment retrieval and highlight detection. The framework +introduces a task-decoupled unit to capture task-specific and common +representations. To investigate the interplay between the two tasks, we propose +an inter-task feedback mechanism, which transforms the results of one task as +guiding masks to assist the other task. Different from existing methods, we +present a task-dependent joint loss function to optimize the model. +Comprehensive experiments and in-depth ablation studies on QVHighlights, TVSum, +and Charades-STA datasets corroborate the effectiveness and flexibility of the +proposed framework. 
Codes are available at +https://github.com/EdenGabriel/TaskWeave.",cs.CV,"['cs.CV', 'cs.AI']" +Egocentric Full Body Motion Capture with FisheyeViT and Diffusion-Based Motion Refinement,Jian Wang · Zhe Cao · Diogo Luvizon · Lingjie Liu · Kripasindhu Sarkar · Danhang Tang · Thabo Beeler · Christian Theobalt, ,https://arxiv.org/abs/2311.16495,,2311.16495.pdf,Egocentric Whole-Body Motion Capture with FisheyeViT and Diffusion-Based Motion Refinement,"In this work, we explore egocentric whole-body motion capture using a single +fisheye camera, which simultaneously estimates human body and hand motion. This +task presents significant challenges due to three factors: the lack of +high-quality datasets, fisheye camera distortion, and human body +self-occlusion. To address these challenges, we propose a novel approach that +leverages FisheyeViT to extract fisheye image features, which are subsequently +converted into pixel-aligned 3D heatmap representations for 3D human body pose +prediction. For hand tracking, we incorporate dedicated hand detection and hand +pose estimation networks for regressing 3D hand poses. Finally, we develop a +diffusion-based whole-body motion prior model to refine the estimated +whole-body motion while accounting for joint uncertainties. To train these +networks, we collect a large synthetic dataset, EgoWholeBody, comprising +840,000 high-quality egocentric images captured across a diverse range of +whole-body motion sequences. Quantitative and qualitative evaluations +demonstrate the effectiveness of our method in producing high-quality +whole-body motion estimates from a single egocentric camera.",cs.CV,['cs.CV'] +PSDPM: Prototype-based Secondary Discriminative Pixels Mining for Weakly Supervised Semantic Segmentation,Xinqiao Zhao · Ziqian Yang · Tianhong Dai · Bingfeng Zhang · Jimin Xiao, ,https://arxiv.org/abs/2405.06586,,2405.06586.pdf,Enhancing Weakly Supervised Semantic Segmentation with Multi-modal Foundation Models: An End-to-End Approach,"Semantic segmentation is a core computer vision problem, but the high costs +of data annotation have hindered its wide application. Weakly-Supervised +Semantic Segmentation (WSSS) offers a cost-efficient workaround to extensive +labeling in comparison to fully-supervised methods by using partial or +incomplete labels. Existing WSSS methods have difficulties in learning the +boundaries of objects leading to poor segmentation results. We propose a novel +and effective framework that addresses these issues by leveraging visual +foundation models inside the bounding box. Adopting a two-stage WSSS framework, +our proposed network consists of a pseudo-label generation module and a +segmentation module. The first stage leverages Segment Anything Model (SAM) to +generate high-quality pseudo-labels. To alleviate the problem of delineating +precise boundaries, we adopt SAM inside the bounding box with the help of +another pre-trained foundation model (e.g., Grounding-DINO). Furthermore, we +eliminate the necessity of using the supervision of image labels, by employing +CLIP in classification. 
Then in the second stage, the generated high-quality +pseudo-labels are used to train an off-the-shelf segmenter that achieves the +state-of-the-art performance on PASCAL VOC 2012 and MS COCO 2014.",cs.CV,['cs.CV'] +Improving the Generalization of Segmentation Foundation Model under Distribution Shift via Weakly Supervised Adaptation,Haojie Zhang · Yongyi Su · Xun Xu · Kui Jia, ,https://arxiv.org/abs/2312.03502,,2312.03502.pdf,Improving the Generalization of Segmentation Foundation Model under Distribution Shift via Weakly Supervised Adaptation,"The success of large language models has inspired the computer vision +community to explore image segmentation foundation model that is able to +zero/few-shot generalize through prompt engineering. Segment-Anything(SAM), +among others, is the state-of-the-art image segmentation foundation model +demonstrating strong zero/few-shot generalization. Despite the success, recent +studies reveal the weakness of SAM under strong distribution shift. In +particular, SAM performs awkwardly on corrupted natural images, camouflaged +images, medical images, etc. Motivated by the observations, we aim to develop a +self-training based strategy to adapt SAM to target distribution. Given the +unique challenges of large source dataset, high computation cost and incorrect +pseudo label, we propose a weakly supervised self-training architecture with +anchor regularization and low-rank finetuning to improve the robustness and +computation efficiency of adaptation. We validate the effectiveness on 5 types +of downstream segmentation tasks including natural clean/corrupted images, +medical images, camouflaged images and robotic images. Our proposed method is +task-agnostic in nature and outperforms pre-trained SAM and state-of-the-art +domain adaptation methods on almost all downstream tasks with the same testing +prompt inputs.",cs.CV,['cs.CV'] +SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities,Boyuan Chen · Zhuo Xu · Sean Kirmani · brian ichter · Dorsa Sadigh · Leonidas Guibas · Fei Xia,https://spatial-vlm.github.io/,https://arxiv.org/abs/2401.12168,,2401.12168.pdf,SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities,"Understanding and reasoning about spatial relationships is a fundamental +capability for Visual Question Answering (VQA) and robotics. While Vision +Language Models (VLM) have demonstrated remarkable performance in certain VQA +benchmarks, they still lack capabilities in 3D spatial reasoning, such as +recognizing quantitative relationships of physical objects like distances or +size differences. We hypothesize that VLMs' limited spatial reasoning +capability is due to the lack of 3D spatial knowledge in training data and aim +to solve this problem by training VLMs with Internet-scale spatial reasoning +data. To this end, we present a system to facilitate this approach. We first +develop an automatic 3D spatial VQA data generation framework that scales up to +2 billion VQA examples on 10 million real-world images. We then investigate +various factors in the training recipe, including data quality, training +pipeline, and VLM architecture. Our work features the first internet-scale 3D +spatial reasoning dataset in metric space. By training a VLM on such data, we +significantly enhance its ability on both qualitative and quantitative spatial +VQA. 
Finally, we demonstrate that this VLM unlocks novel downstream +applications in chain-of-thought spatial reasoning and robotics due to its +quantitative estimation capability. Project website: +https://spatial-vlm.github.io/",cs.CV,"['cs.CV', 'cs.CL', 'cs.LG', 'cs.RO']" +Learning to Transform Dynamically for Better Adversarial Transferability,Rongyi Zhu · Zeliang Zhang · Susan Liang · Zhuo Liu · Chenliang Xu, ,https://arxiv.org/abs/2405.14077,,2405.14077.pdf,Learning to Transform Dynamically for Better Adversarial Transferability,"Adversarial examples, crafted by adding perturbations imperceptible to +humans, can deceive neural networks. Recent studies identify the adversarial +transferability across various models, \textit{i.e.}, the cross-model attack +ability of adversarial samples. To enhance such adversarial transferability, +existing input transformation-based methods diversify input data with +transformation augmentation. However, their effectiveness is limited by the +finite number of available transformations. In our study, we introduce a novel +approach named Learning to Transform (L2T). L2T increases the diversity of +transformed images by selecting the optimal combination of operations from a +pool of candidates, consequently improving adversarial transferability. We +conceptualize the selection of optimal transformation combinations as a +trajectory optimization problem and employ a reinforcement learning strategy to +effectively solve the problem. Comprehensive experiments on the ImageNet +dataset, as well as practical tests with Google Vision and GPT-4V, reveal that +L2T surpasses current methodologies in enhancing adversarial transferability, +thereby confirming its effectiveness and practical significance. The code is +available at https://github.com/RongyiZhu/L2T.",cs.CV,"['cs.CV', 'cs.AI']" +Can’t make an Omelette without Breaking some Eggs: Plausible Action Anticipation using Large Video-Language Models,Himangi Mittal · Nakul Agarwal · Shao-Yuan Lo · Kwonjoon Lee, ,https://arxiv.org/abs/2405.20305,,2405.20305.pdf,Can't make an Omelette without Breaking some Eggs: Plausible Action Anticipation using Large Video-Language Models,"We introduce PlausiVL, a large video-language model for anticipating action +sequences that are plausible in the real-world. While significant efforts have +been made towards anticipating future actions, prior approaches do not take +into account the aspect of plausibility in an action sequence. To address this +limitation, we explore the generative capability of a large video-language +model in our work and further, develop the understanding of plausibility in an +action sequence by introducing two objective functions, a counterfactual-based +plausible action sequence learning loss and a long-horizon action repetition +loss. We utilize temporal logical constraints as well as verb-noun action pair +logical constraints to create implausible/counterfactual action sequences and +use them to train the model with plausible action sequence learning loss. This +loss helps the model to differentiate between plausible and not plausible +action sequences and also helps the model to learn implicit temporal cues +crucial for the task of action anticipation. The long-horizon action repetition +loss puts a higher penalty on the actions that are more prone to repetition +over a longer temporal window. With this penalization, the model is able to +generate diverse, plausible action sequences. 
We evaluate our approach on two +large-scale datasets, Ego4D and EPIC-Kitchens-100, and show improvements on the +task of action anticipation.",cs.CV,['cs.CV'] +Adapting to Length Shift: FlexiLength Network for Trajectory Prediction,Yi Xu · Yun Fu, ,https://arxiv.org/abs/2404.00742,,2404.00742.pdf,Adapting to Length Shift: FlexiLength Network for Trajectory Prediction,"Trajectory prediction plays an important role in various applications, +including autonomous driving, robotics, and scene understanding. Existing +approaches mainly focus on developing compact neural networks to increase +prediction precision on public datasets, typically employing a standardized +input duration. However, a notable issue arises when these models are evaluated +with varying observation lengths, leading to a significant performance drop, a +phenomenon we term the Observation Length Shift. To address this issue, we +introduce a general and effective framework, the FlexiLength Network (FLN), to +enhance the robustness of existing trajectory prediction techniques against +varying observation periods. Specifically, FLN integrates trajectory data with +diverse observation lengths, incorporates FlexiLength Calibration (FLC) to +acquire temporal invariant representations, and employs FlexiLength Adaptation +(FLA) to further refine these representations for more accurate future +trajectory predictions. Comprehensive experiments on multiple datasets, ie, +ETH/UCY, nuScenes, and Argoverse 1, demonstrate the effectiveness and +flexibility of our proposed FLN framework.",cs.CV,['cs.CV'] +Learning Group Activity Features Through Person Attribute Prediction,Chihiro Nakatani · Hiroaki Kawashima · Norimichi Ukita, ,https://arxiv.org/abs/2403.02753,,2403.02753.pdf,Learning Group Activity Features Through Person Attribute Prediction,"This paper proposes Group Activity Feature (GAF) learning in which features +of multi-person activity are learned as a compact latent vector. Unlike prior +work in which the manual annotation of group activities is required for +supervised learning, our method learns the GAF through person attribute +prediction without group activity annotations. By learning the whole network in +an end-to-end manner so that the GAF is required for predicting the person +attributes of people in a group, the GAF is trained as the features of +multi-person activity. As a person attribute, we propose to use a person's +action class and appearance features because the former is easy to annotate due +to its simpleness, and the latter requires no manual annotation. In addition, +we introduce a location-guided attribute prediction to disentangle the complex +GAF for extracting the features of each target person properly. Various +experimental results validate that our method outperforms SOTA methods +quantitatively and qualitatively on two public datasets. Visualization of our +GAF also demonstrates that our method learns the GAF representing fined-grained +group activity classes. 
Code: https://github.com/chihina/GAFL-CVPR2024.",cs.CV,['cs.CV'] +Weakly-Supervised Emotion Transition Learning for Diverse 3D Co-speech Gesture Generation,Xingqun Qi · Jiahao Pan · Peng Li · Ruibin Yuan · Xiaowei Chi · Mengfei Li · Wenhan Luo · Wei Xue · Shanghang Zhang · Qifeng Liu · Yike Guo, ,https://arxiv.org/abs/2311.17532,,2311.17532.pdf,Weakly-Supervised Emotion Transition Learning for Diverse 3D Co-speech Gesture Generation,"Generating vivid and emotional 3D co-speech gestures is crucial for virtual +avatar animation in human-machine interaction applications. While the existing +methods enable generating the gestures to follow a single emotion label, they +overlook that long gesture sequence modeling with emotion transition is more +practical in real scenes. In addition, the lack of large-scale available +datasets with emotional transition speech and corresponding 3D human gestures +also limits the addressing of this task. To fulfill this goal, we first +incorporate the ChatGPT-4 and an audio inpainting approach to construct the +high-fidelity emotion transition human speeches. Considering obtaining the +realistic 3D pose annotations corresponding to the dynamically inpainted +emotion transition audio is extremely difficult, we propose a novel weakly +supervised training strategy to encourage authority gesture transitions. +Specifically, to enhance the coordination of transition gestures w.r.t +different emotional ones, we model the temporal association representation +between two different emotional gesture sequences as style guidance and infuse +it into the transition generation. We further devise an emotion mixture +mechanism that provides weak supervision based on a learnable mixed emotion +label for transition gestures. Last, we present a keyframe sampler to supply +effective initial posture cues in long sequences, enabling us to generate +diverse gestures. Extensive experiments demonstrate that our method outperforms +the state-of-the-art models constructed by adapting single emotion-conditioned +counterparts on our newly defined emotion transition task and datasets. Our +code and dataset will be released on the project page: +https://xingqunqi-lab.github.io/Emo-Transition-Gesture/.",cs.CV,['cs.CV'] +FocusMAE: Gallbladder Cancer Detection from Ultrasound Videos with Focused Masked Autoencoders,Soumen Basu · Mayuna Gupta · Chetan Madan · Pankaj Gupta · Chetan Arora,https://gbc-iitd.github.io/focusmae,https://arxiv.org/abs/2403.08848,,2403.08848.pdf,FocusMAE: Gallbladder Cancer Detection from Ultrasound Videos with Focused Masked Autoencoders,"In recent years, automated Gallbladder Cancer (GBC) detection has gained the +attention of researchers. Current state-of-the-art (SOTA) methodologies relying +on ultrasound sonography (US) images exhibit limited generalization, +emphasizing the need for transformative approaches. We observe that individual +US frames may lack sufficient information to capture disease manifestation. +This study advocates for a paradigm shift towards video-based GBC detection, +leveraging the inherent advantages of spatiotemporal representations. Employing +the Masked Autoencoder (MAE) for representation learning, we address +shortcomings in conventional image-based methods. We propose a novel design +called FocusMAE to systematically bias the selection of masking tokens from +high-information regions, fostering a more refined representation of +malignancy. Additionally, we contribute the most extensive US video dataset for +GBC detection. 
We also note that, this is the first study on US video-based GBC +detection. We validate the proposed methods on the curated dataset, and report +a new state-of-the-art (SOTA) accuracy of 96.4% for the GBC detection problem, +against an accuracy of 84% by current Image-based SOTA - GBCNet, and RadFormer, +and 94.7% by Video-based SOTA - AdaMAE. We further demonstrate the generality +of the proposed FocusMAE on a public CT-based Covid detection dataset, +reporting an improvement in accuracy by 3.3% over current baselines. The source +code and pretrained models are available at: +https://gbc-iitd.github.io/focusmae",eess.IV,"['eess.IV', 'cs.CV']" +Learning to Predict Activity Progress by Self-Supervised Video Alignment,Gerard Donahue · Ehsan Elhamifar, ,https://arxiv.org/abs/2405.15160,,2405.15160.pdf,ARVideo: Autoregressive Pretraining for Self-Supervised Video Representation Learning,"This paper presents a new self-supervised video representation learning +framework, ARVideo, which autoregressively predicts the next video token in a +tailored sequence order. Two key designs are included. First, we organize +autoregressive video tokens into clusters that span both spatially and +temporally, thereby enabling a richer aggregation of contextual information +compared to the standard spatial-only or temporal-only clusters. Second, we +adopt a randomized spatiotemporal prediction order to facilitate learning from +multi-dimensional data, addressing the limitations of a handcrafted +spatial-first or temporal-first sequence order. Extensive experiments establish +ARVideo as an effective paradigm for self-supervised video representation +learning. For example, when trained with the ViT-B backbone, ARVideo +competitively attains 81.2% on Kinetics-400 and 70.9% on Something-Something +V2, which are on par with the strong benchmark set by VideoMAE. Importantly, +ARVideo also demonstrates higher training efficiency, i.e., it trains 14% +faster and requires 58% less GPU memory compared to VideoMAE.",cs.CV,['cs.CV'] +Revisiting Global Translation Estimation with Feature Tracks,Peilin Tao · Hainan Cui · Mengqi Rong · Shuhan Shen, ,https://arxiv.org/abs/2403.14118,,2403.14118.pdf,From Handcrafted Features to LLMs: A Brief Survey for Machine Translation Quality Estimation,"Machine Translation Quality Estimation (MTQE) is the task of estimating the +quality of machine-translated text in real time without the need for reference +translations, which is of great importance for the development of MT. After two +decades of evolution, QE has yielded a wealth of results. This article provides +a comprehensive overview of QE datasets, annotation methods, shared tasks, +methodologies, challenges, and future research directions. It begins with an +introduction to the background and significance of QE, followed by an +explanation of the concepts and evaluation metrics for word-level QE, +sentence-level QE, document-level QE, and explainable QE. The paper categorizes +the methods developed throughout the history of QE into those based on +handcrafted features, deep learning, and Large Language Models (LLMs), with a +further division of deep learning-based methods into classic deep learning and +those incorporating pre-trained language models (LMs). Additionally, the +article details the advantages and limitations of each method and offers a +straightforward comparison of different approaches. 
Finally, the paper +discusses the current challenges in QE research and provides an outlook on +future research directions.",cs.CL,['cs.CL'] +Directed Decentralized Collaboration for Personalized Federated Learning,Yingqi Liu · Yifan Shi · Qinglun Li · Baoyuan Wu · Xueqian Wang · Li Shen, ,https://arxiv.org/abs/2405.17876,,2405.17876.pdf,Decentralized Directed Collaboration for Personalized Federated Learning,"Personalized Federated Learning (PFL) is proposed to find the greatest +personalized models for each client. To avoid the central failure and +communication bottleneck in the server-based FL, we concentrate on the +Decentralized Personalized Federated Learning (DPFL) that performs distributed +model training in a Peer-to-Peer (P2P) manner. Most personalized works in DPFL +are based on undirected and symmetric topologies, however, the data, +computation and communication resources heterogeneity result in large variances +in the personalized models, which lead the undirected aggregation to suboptimal +personalized performance and unguaranteed convergence. To address these issues, +we propose a directed collaboration DPFL framework by incorporating stochastic +gradient push and partial model personalized, called \textbf{D}ecentralized +\textbf{Fed}erated \textbf{P}artial \textbf{G}radient \textbf{P}ush +(\textbf{DFedPGP}). It personalizes the linear classifier in the modern deep +model to customize the local solution and learns a consensus representation in +a fully decentralized manner. Clients only share gradients with a subset of +neighbors based on the directed and asymmetric topologies, which guarantees +flexible choices for resource efficiency and better convergence. Theoretically, +we show that the proposed DFedPGP achieves a superior convergence rate of +$\mathcal{O}(\frac{1}{\sqrt{T}})$ in the general non-convex setting, and prove +the tighter connectivity among clients will speed up the convergence. The +proposed method achieves state-of-the-art (SOTA) accuracy in both data and +computation heterogeneity scenarios, demonstrating the efficiency of the +directed collaboration and partial gradient push.",cs.LG,"['cs.LG', 'cs.DC', 'math.OC']" +Towards Calibrated Multi-label Deep Neural Networks,Jiacheng Cheng · Nuno Vasconcelos, ,,https://paperswithcode.com/paper/towards-calibrated-deep-clustering-network,,,,,nan +PolarRec: Improving Radio Interferometric Data Reconstruction Using Polar Coordinates,Ruoqi Wang · Zhuoyang Chen · Jiayi Zhu · Qiong Luo · Feng Wang, ,https://arxiv.org/abs/2308.14610,,2308.14610.pdf,PolarRec: Radio Interferometric Data Reconstruction with Polar Coordinate Representation,"In radio astronomy, visibility data, which are measurements of wave signals +from radio telescopes, are transformed into images for observation of distant +celestial objects. However, these resultant images usually contain both real +sources and artifacts, due to signal sparsity and other factors. One way to +obtain cleaner images is to reconstruct samples into dense forms before +imaging. Unfortunately, existing reconstruction methods often miss some +components of visibility in frequency domain, so blurred object edges and +persistent artifacts remain in the images. Furthermore, the computation +overhead is high on irregular visibility samples due to the data skew. To +address these problems, we propose PolarRec, a transformer-encoder-conditioned +reconstruction pipeline with visibility samples converted into the polar +coordinate representation. 
This representation matches the way in which radio +telescopes observe a celestial area as the Earth rotates. As a result, +visibility samples distribute in the polar system more uniformly than in the +Cartesian space. Therefore, we propose to use radial distance in the loss +function, to help reconstruct complete visibility effectively. Also, we group +visibility samples by their polar angles and propose a group-based encoding +scheme to improve the efficiency. Our experiments demonstrate that PolarRec +markedly improves imaging results by faithfully reconstructing all frequency +components in the visibility domain while significantly reducing the +computation cost in visibility data encoding. We believe this high-quality and +high-efficiency imaging of PolarRec will better facilitate astronomers to +conduct their research.",astro-ph.IM,"['astro-ph.IM', 'cs.AI', 'cs.CV']" +SeeSR: Towards Semantics-Aware Real-World Image Super-Resolution,Rongyuan Wu · Tao Yang · Lingchen Sun · Zhengqiang ZHANG · Shuai Li · Lei Zhang, ,https://arxiv.org/abs/2311.16518,,2311.16518.pdf,SeeSR: Towards Semantics-Aware Real-World Image Super-Resolution,"Owe to the powerful generative priors, the pre-trained text-to-image (T2I) +diffusion models have become increasingly popular in solving the real-world +image super-resolution problem. However, as a consequence of the heavy quality +degradation of input low-resolution (LR) images, the destruction of local +structures can lead to ambiguous image semantics. As a result, the content of +reproduced high-resolution image may have semantic errors, deteriorating the +super-resolution performance. To address this issue, we present a +semantics-aware approach to better preserve the semantic fidelity of generative +real-world image super-resolution. First, we train a degradation-aware prompt +extractor, which can generate accurate soft and hard semantic prompts even +under strong degradation. The hard semantic prompts refer to the image tags, +aiming to enhance the local perception ability of the T2I model, while the soft +semantic prompts compensate for the hard ones to provide additional +representation information. These semantic prompts can encourage the T2I model +to generate detailed and semantically accurate results. Furthermore, during the +inference process, we integrate the LR images into the initial sampling noise +to mitigate the diffusion model's tendency to generate excessive random +details. The experiments show that our method can reproduce more realistic +image details and hold better the semantics.",cs.CV,['cs.CV'] +PanoContext-Former: Panoramic Total Scene Understanding with a Transformer,Yuan Dong · Chuan Fang · Liefeng Bo · Zilong Dong · Ping Tan,https://fangchuan.github.io/PanoContext-Former/,https://arxiv.org/abs/2312.07378v1,,2312.07378v1.pdf,X4D-SceneFormer: Enhanced Scene Understanding on 4D Point Cloud Videos through Cross-modal Knowledge Transfer,"The field of 4D point cloud understanding is rapidly developing with the goal +of analyzing dynamic 3D point cloud sequences. However, it remains a +challenging task due to the sparsity and lack of texture in point clouds. +Moreover, the irregularity of point cloud poses a difficulty in aligning +temporal information within video sequences. To address these issues, we +propose a novel cross-modal knowledge transfer framework, called +X4D-SceneFormer. 
This framework enhances 4D-Scene understanding by transferring +texture priors from RGB sequences using a Transformer architecture with +temporal relationship mining. Specifically, the framework is designed with a +dual-branch architecture, consisting of an 4D point cloud transformer and a +Gradient-aware Image Transformer (GIT). During training, we employ multiple +knowledge transfer techniques, including temporal consistency losses and masked +self-attention, to strengthen the knowledge transfer between modalities. This +leads to enhanced performance during inference using single-modal 4D point +cloud inputs. Extensive experiments demonstrate the superior performance of our +framework on various 4D point cloud video understanding tasks, including action +recognition, action segmentation and semantic segmentation. The results achieve +1st places, i.e., 85.3% (+7.9%) accuracy and 47.3% (+5.0%) mIoU for 4D action +segmentation and semantic segmentation, on the HOI4D +challenge\footnote{\url{http://www.hoi4d.top/}.}, outperforming previous +state-of-the-art by a large margin. We release the code at +https://github.com/jinglinglingling/X4D",cs.CV,['cs.CV'] +Text-image Alignment for Diffusion-based Perception,Neehar Kondapaneni · Markus Marks · Manuel Knott · Rogério Guimarães · Pietro Perona,https://www.vision.caltech.edu/tadp/,https://arxiv.org/abs/2310.00031,,2310.00031.pdf,Text-image Alignment for Diffusion-based Perception,"Diffusion models are generative models with impressive text-to-image +synthesis capabilities and have spurred a new wave of creative methods for +classical machine learning tasks. However, the best way to harness the +perceptual knowledge of these generative models for visual tasks is still an +open question. Specifically, it is unclear how to use the prompting interface +when applying diffusion backbones to vision tasks. We find that automatically +generated captions can improve text-image alignment and significantly enhance a +model's cross-attention maps, leading to better perceptual performance. Our +approach improves upon the current state-of-the-art (SOTA) in diffusion-based +semantic segmentation on ADE20K and the current overall SOTA for depth +estimation on NYUv2. Furthermore, our method generalizes to the cross-domain +setting. We use model personalization and caption modifications to align our +model to the target domain and find improvements over unaligned baselines. Our +cross-domain object detection model, trained on Pascal VOC, achieves SOTA +results on Watercolor2K. Our cross-domain segmentation method, trained on +Cityscapes, achieves SOTA results on Dark Zurich-val and Nighttime Driving. +Project page: https://www.vision.caltech.edu/tadp/. Code: +https://github.com/damaggu/TADP.",cs.CV,['cs.CV'] +DifFlow3D: Toward Robust Uncertainty-Aware Scene Flow Estimation with Iterative Diffusion-Based Refinement,Jiuming Liu · Guangming Wang · Weicai Ye · Chaokang Jiang · Jinru Han · Zhe Liu · Guofeng Zhang · Dalong Du · Hesheng Wang, ,https://arxiv.org/abs/2311.17456,,2311.17456.pdf,DifFlow3D: Toward Robust Uncertainty-Aware Scene Flow Estimation with Diffusion Model,"Scene flow estimation, which aims to predict per-point 3D displacements of +dynamic scenes, is a fundamental task in the computer vision field. However, +previous works commonly suffer from unreliable correlation caused by locally +constrained searching ranges, and struggle with accumulated inaccuracy arising +from the coarse-to-fine structure. 
To alleviate these problems, we propose a +novel uncertainty-aware scene flow estimation network (DifFlow3D) with the +diffusion probabilistic model. Iterative diffusion-based refinement is designed +to enhance the correlation robustness and resilience to challenging cases, e.g. +dynamics, noisy inputs, repetitive patterns, etc. To restrain the generation +diversity, three key flow-related features are leveraged as conditions in our +diffusion model. Furthermore, we also develop an uncertainty estimation module +within diffusion to evaluate the reliability of estimated scene flow. Our +DifFlow3D achieves state-of-the-art performance, with 24.0% and 29.1% EPE3D +reduction respectively on FlyingThings3D and KITTI 2015 datasets. Notably, our +method achieves an unprecedented millimeter-level accuracy (0.0078m in EPE3D) +on the KITTI dataset. Additionally, our diffusion-based refinement paradigm can +be readily integrated as a plug-and-play module into existing scene flow +networks, significantly increasing their estimation accuracy. Codes are +released at https://github.com/IRMVLab/DifFlow3D.",cs.CV,['cs.CV'] +Mind Artist: Creating Artistic Snapshots with Human Thought,Jiaxuan Chen · Yu Qi · Yueming Wang · Gang Pan, ,https://ar5iv.labs.arxiv.org/html/2309.15729,,2309.15729.pdf,MindGPT: Interpreting What You See with Non-invasive Brain Recordings,"Decoding of seen visual contents with non-invasive brain recordings has +important scientific and practical values. Efforts have been made to recover +the seen images from brain signals. However, most existing approaches cannot +faithfully reflect the visual contents due to insufficient image quality or +semantic mismatches. Compared with reconstructing pixel-level visual images, +speaking is a more efficient and effective way to explain visual information. +Here we introduce a non-invasive neural decoder, termed as MindGPT, which +interprets perceived visual stimuli into natural languages from fMRI signals. +Specifically, our model builds upon a visually guided neural encoder with a +cross-attention mechanism, which permits us to guide latent neural +representations towards a desired language semantic direction in an end-to-end +manner by the collaborative use of the large language model GPT. By doing so, +we found that the neural representations of the MindGPT are explainable, which +can be used to evaluate the contributions of visual properties to language +semantics. Our experiments show that the generated word sequences truthfully +represented the visual information (with essential details) conveyed in the +seen stimuli. The results also suggested that with respect to language decoding +tasks, the higher visual cortex (HVC) is more semantically informative than the +lower visual cortex (LVC), and using only the HVC can recover most of the +semantic information. 
The code of the MindGPT model will be publicly available +at https://github.com/JxuanC/MindGPT.",cs.CV,"['cs.CV', 'cs.AI']" +Pre-trained Model Guided Fine-Tuning for Zero-Shot Adversarial Robustness,Sibo Wang · Jie Zhang · Zheng Yuan · Shiguang Shan,https://github.com/serendipity1122/Pre-trained-Model-Guided-Fine-Tuning-for-Zero-Shot-Adversarial-Robustness,https://arxiv.org/html/2401.04350v3,,2401.04350v3.pdf,Pre-trained Model Guided Fine-Tuning for Zero-Shot Adversarial Robustness,"Large-scale pre-trained vision-language models like CLIP have demonstrated +impressive performance across various tasks, and exhibit remarkable zero-shot +generalization capability, while they are also vulnerable to imperceptible +adversarial examples. Existing works typically employ adversarial training +(fine-tuning) as a defense method against adversarial examples. However, direct +application to the CLIP model may result in overfitting, compromising the +model's capacity for generalization. In this paper, we propose Pre-trained +Model Guided Adversarial Fine-Tuning (PMG-AFT) method, which leverages +supervision from the original pre-trained model by carefully designing an +auxiliary branch, to enhance the model's zero-shot adversarial robustness. +Specifically, PMG-AFT minimizes the distance between the features of +adversarial examples in the target model and those in the pre-trained model, +aiming to preserve the generalization features already captured by the +pre-trained model. Extensive Experiments on 15 zero-shot datasets demonstrate +that PMG-AFT significantly outperforms the state-of-the-art method, improving +the top-1 robust accuracy by an average of 4.99%. Furthermore, our approach +consistently improves clean accuracy by an average of 8.72%. Our code is +available at +https://github.com/serendipity1122/Pre-trained-Model-Guided-Fine-Tuning-for-Zero-Shot-Adversarial-Robustness.",cs.CV,['cs.CV'] +ANIM: Accurate Neural Implicit Model for Human Reconstruction from a single RGB-D image,Marco Pesavento · Yuanlu Xu · Nikolaos Sarafianos · Robert Maier · Ziyan Wang · Chun-Han Yao · Marco Volino · Edmond Boyer · Adrian Hilton · Tony Tung, ,https://arxiv.org/abs/2403.10357,,2403.10357.pdf,ANIM: Accurate Neural Implicit Model for Human Reconstruction from a single RGB-D image,"Recent progress in human shape learning, shows that neural implicit models +are effective in generating 3D human surfaces from limited number of views, and +even from a single RGB image. However, existing monocular approaches still +struggle to recover fine geometric details such as face, hands or cloth +wrinkles. They are also easily prone to depth ambiguities that result in +distorted geometries along the camera optical axis. In this paper, we explore +the benefits of incorporating depth observations in the reconstruction process +by introducing ANIM, a novel method that reconstructs arbitrary 3D human shapes +from single-view RGB-D images with an unprecedented level of accuracy. Our +model learns geometric details from both multi-resolution pixel-aligned and +voxel-aligned features to leverage depth information and enable spatial +relationships, mitigating depth ambiguities. We further enhance the quality of +the reconstructed shape by introducing a depth-supervision strategy, which +improves the accuracy of the signed distance field estimation of points that +lie on the reconstructed surface. Experiments demonstrate that ANIM outperforms +state-of-the-art works that use RGB, surface normals, point cloud or RGB-D data +as input. 
In addition, we introduce ANIM-Real, a new multi-modal dataset +comprising high-quality scans paired with consumer-grade RGB-D camera, and our +protocol to fine-tune ANIM, enabling high-quality reconstruction from +real-world human capture.",cs.CV,"['cs.CV', 'cs.GR']" +GLOW: Global Layout Aware Attacks on Object Detection,Jun Bao · Buyu Liu · Kui Ren · Jun Yu, ,,https://paperswithcode.com/search?q=author:Jun+Yu,,,,,nan +ContextSeg: Sketch Semantic Segmentation by Querying the Context with Attention,Jiawei Wang · Changjian Li,https://enigma-li.github.io/projects/contextSeg/contextSeg.html,https://arxiv.org/abs/2311.16682,,2311.16682.pdf,ContextSeg: Sketch Semantic Segmentation by Querying the Context with Attention,"Sketch semantic segmentation is a well-explored and pivotal problem in +computer vision involving the assignment of pre-defined part labels to +individual strokes. This paper presents ContextSeg - a simple yet highly +effective approach to tackling this problem with two stages. In the first +stage, to better encode the shape and positional information of strokes, we +propose to predict an extra dense distance field in an autoencoder network to +reinforce structural information learning. In the second stage, we treat an +entire stroke as a single entity and label a group of strokes within the same +semantic part using an auto-regressive Transformer with the default attention +mechanism. By group-based labeling, our method can fully leverage the context +information when making decisions for the remaining groups of strokes. Our +method achieves the best segmentation accuracy compared with state-of-the-art +approaches on two representative datasets and has been extensively evaluated +demonstrating its superior performance. Additionally, we offer insights into +solving part imbalance in training data and the preliminary experiment on +cross-category training, which can inspire future research in this field.",cs.CV,"['cs.CV', 'cs.GR']" +GEARS: Local Geometry-aware Hand-object Interaction Synthesis,Keyang Zhou · Bharat Lal Bhatnagar · Jan Lenssen · Gerard Pons-Moll, ,https://arxiv.org/abs/2404.01758,,2404.01758.pdf,GEARS: Local Geometry-aware Hand-object Interaction Synthesis,"Generating realistic hand motion sequences in interaction with objects has +gained increasing attention with the growing interest in digital humans. Prior +work has illustrated the effectiveness of employing occupancy-based or +distance-based virtual sensors to extract hand-object interaction features. +Nonetheless, these methods show limited generalizability across object +categories, shapes and sizes. We hypothesize that this is due to two reasons: +1) the limited expressiveness of employed virtual sensors, and 2) scarcity of +available training data. To tackle this challenge, we introduce a novel +joint-centered sensor designed to reason about local object geometry near +potential interaction regions. The sensor queries for object surface points in +the neighbourhood of each hand joint. As an important step towards mitigating +the learning complexity, we transform the points from global frame to hand +template frame and use a shared module to process sensor features of each +individual joint. This is followed by a spatio-temporal transformer network +aimed at capturing correlation among the joints in different dimensions. +Moreover, we devise simple heuristic rules to augment the limited training +sequences with vast static hand grasping samples. 
This leads to a broader +spectrum of grasping types observed during training, in turn enhancing our +model's generalization capability. We evaluate on two public datasets, GRAB and +InterCap, where our method shows superiority over baselines both quantitatively +and perceptually.",cs.CV,['cs.CV'] +Training Generative Image Super-Resolution Models by Wavelet-Domain Losses Enables Better Control of Artifacts,Cansu Korkmaz · Ahmet Murat Tekalp · Zafer Dogan,https://github.com/mandalinadagi/WGSR,,https://paperswithcode.com/paper/training-generative-image-super-resolution,,,,,nan +OrthCaps: An Orthogonal CapsNet with Sparse Attention Routing and Pruning,Geng Xinyu · Jiaming Wang · Jiawei Gong · yuerong xue · Jun Xu · Fanglin Chen · Xiaolin Huang, ,https://arxiv.org/abs/2403.13351v1,,2403.13351v1.pdf,OrthCaps: An Orthogonal CapsNet with Sparse Attention Routing and Pruning,"Redundancy is a persistent challenge in Capsule Networks (CapsNet),leading to +high computational costs and parameter counts. Although previous works have +introduced pruning after the initial capsule layer, dynamic routing's fully +connected nature and non-orthogonal weight matrices reintroduce redundancy in +deeper layers. Besides, dynamic routing requires iterating to converge, further +increasing computational demands. In this paper, we propose an Orthogonal +Capsule Network (OrthCaps) to reduce redundancy, improve routing performance +and decrease parameter counts. Firstly, an efficient pruned capsule layer is +introduced to discard redundant capsules. Secondly, dynamic routing is replaced +with orthogonal sparse attention routing, eliminating the need for iterations +and fully connected structures. Lastly, weight matrices during routing are +orthogonalized to sustain low capsule similarity, which is the first approach +to introduce orthogonality into CapsNet as far as we know. Our experiments on +baseline datasets affirm the efficiency and robustness of OrthCaps in +classification tasks, in which ablation studies validate the criticality of +each component. Remarkably, OrthCaps-Shallow outperforms other Capsule Network +benchmarks on four datasets, utilizing only 110k parameters, which is a mere +1.25% of a standard Capsule Network's total. To the best of our knowledge, it +achieves the smallest parameter count among existing Capsule Networks. +Similarly, OrthCaps-Deep demonstrates competitive performance across four +datasets, utilizing only 1.2% of the parameters required by its counterparts.",cs.CV,['cs.CV'] +Multimodal Industrial Anomaly Detection by Crossmodal Feature Mapping,Alex Costanzino · Pierluigi Zama Ramirez · Giuseppe Lisanti · Luigi Di Stefano,https://cvlab-unibo.github.io/CrossmodalFeatureMapping/,https://arxiv.org/abs/2312.04521,,2312.04521.pdf,Multimodal Industrial Anomaly Detection by Crossmodal Feature Mapping,"The paper explores the industrial multimodal Anomaly Detection (AD) task, +which exploits point clouds and RGB images to localize anomalies. We introduce +a novel light and fast framework that learns to map features from one modality +to the other on nominal samples. At test time, anomalies are detected by +pinpointing inconsistencies between observed and mapped features. Extensive +experiments show that our approach achieves state-of-the-art detection and +segmentation performance in both the standard and few-shot settings on the +MVTec 3D-AD dataset while achieving faster inference and occupying less memory +than previous multimodal AD methods. 
Moreover, we propose a layer-pruning +technique to improve memory and time efficiency with a marginal sacrifice in +performance.",cs.CV,['cs.CV'] +Discover and Mitigate Multiple Biased Subgroups in Image Classifiers,Zeliang Zhang · Mingqian Feng · Zhiheng Li · Chenliang Xu, ,https://arxiv.org/abs/2403.12777,,2403.12777.pdf,Discover and Mitigate Multiple Biased Subgroups in Image Classifiers,"Machine learning models can perform well on in-distribution data but often +fail on biased subgroups that are underrepresented in the training data, +hindering the robustness of models for reliable applications. Such subgroups +are typically unknown due to the absence of subgroup labels. Discovering biased +subgroups is the key to understanding models' failure modes and further +improving models' robustness. Most previous works of subgroup discovery make an +implicit assumption that models only underperform on a single biased subgroup, +which does not hold on in-the-wild data where multiple biased subgroups exist. + In this work, we propose Decomposition, Interpretation, and Mitigation (DIM), +a novel method to address a more challenging but also more practical problem of +discovering multiple biased subgroups in image classifiers. Our approach +decomposes the image features into multiple components that represent multiple +subgroups. This decomposition is achieved via a bilinear dimension reduction +method, Partial Least Square (PLS), guided by useful supervision from the image +classifier. We further interpret the semantic meaning of each subgroup +component by generating natural language descriptions using vision-language +foundation models. Finally, DIM mitigates multiple biased subgroups +simultaneously via two strategies, including the data- and model-centric +strategies. Extensive experiments on CIFAR-100 and Breeds datasets demonstrate +the effectiveness of DIM in discovering and mitigating multiple biased +subgroups. Furthermore, DIM uncovers the failure modes of the classifier on +Hard ImageNet, showcasing its broader applicability to understanding model bias +in image classifiers. The code is available at +https://github.com/ZhangAIPI/DIM.",cs.CV,"['cs.CV', 'cs.AI']" +RMT: Retentive Networks Meet Vision Transformers,Qihang Fan · Huaibo Huang · Mingrui Chen · Hongmin Liu · Ran He,https://github.com/qhfan/RMT,https://arxiv.org/abs/2309.11523,,2309.11523.pdf,RMT: Retentive Networks Meet Vision Transformers,"Vision Transformer (ViT) has gained increasing attention in the computer +vision community in recent years. However, the core component of ViT, +Self-Attention, lacks explicit spatial priors and bears a quadratic +computational complexity, thereby constraining the applicability of ViT. To +alleviate these issues, we draw inspiration from the recent Retentive Network +(RetNet) in the field of NLP, and propose RMT, a strong vision backbone with +explicit spatial prior for general purposes. Specifically, we extend the +RetNet's temporal decay mechanism to the spatial domain, and propose a spatial +decay matrix based on the Manhattan distance to introduce the explicit spatial +prior to Self-Attention. Additionally, an attention decomposition form that +adeptly adapts to explicit spatial prior is proposed, aiming to reduce the +computational burden of modeling global information without disrupting the +spatial decay matrix. Based on the spatial decay matrix and the attention +decomposition form, we can flexibly integrate explicit spatial prior into the +vision backbone with linear complexity. 
Extensive experiments demonstrate that +RMT exhibits exceptional performance across various vision tasks. Specifically, +without extra training data, RMT achieves **84.8%** and **86.1%** top-1 acc on +ImageNet-1k with **27M/4.5GFLOPs** and **96M/18.2GFLOPs**. For downstream +tasks, RMT achieves **54.5** box AP and **47.2** mask AP on the COCO detection +task, and **52.8** mIoU on the ADE20K semantic segmentation task. Code is +available at https://github.com/qhfan/RMT",cs.CV,['cs.CV'] +No More Ambiguity in 360$^\circ$ Room Layout via Bi-Layout Estimation,Yu-Ju Tsai · Jin-Cheng Jhang · JINGJING ZHENG · Wei Wang · Albert Chen · Min Sun · Cheng-Hao Kuo · Ming-Hsuan Yang, ,https://arxiv.org/abs/2404.09993,,2404.09993.pdf,No More Ambiguity in 360° Room Layout via Bi-Layout Estimation,"Inherent ambiguity in layout annotations poses significant challenges to +developing accurate 360{\deg} room layout estimation models. To address this +issue, we propose a novel Bi-Layout model capable of predicting two distinct +layout types. One stops at ambiguous regions, while the other extends to +encompass all visible areas. Our model employs two global context embeddings, +where each embedding is designed to capture specific contextual information for +each layout type. With our novel feature guidance module, the image feature +retrieves relevant context from these embeddings, generating layout-aware +features for precise bi-layout predictions. A unique property of our Bi-Layout +model is its ability to inherently detect ambiguous regions by comparing the +two predictions. To circumvent the need for manual correction of ambiguous +annotations during testing, we also introduce a new metric for disambiguating +ground truth layouts. Our method demonstrates superior performance on benchmark +datasets, notably outperforming leading approaches. Specifically, on the +MatterportLayout dataset, it improves 3DIoU from 81.70% to 82.57% across the +full test set and notably from 54.80% to 59.97% in subsets with significant +ambiguity. Project page: https://liagm.github.io/Bi_Layout/",cs.CV,['cs.CV'] +AVID: Any-Length Video Inpainting with Diffusion Model,Zhixing Zhang · Bichen Wu · Xiaoyan Wang · Yaqiao Luo · Luxin Zhang · Yinan Zhao · Peter Vajda · Dimitris N. Metaxas · Licheng Yu,https://zhang-zx.github.io/AVID/,https://arxiv.org/abs/2312.03816,,2312.03816.pdf,AVID: Any-Length Video Inpainting with Diffusion Model,"Recent advances in diffusion models have successfully enabled text-guided +image inpainting. While it seems straightforward to extend such editing +capability into the video domain, there have been fewer works regarding +text-guided video inpainting. Given a video, a masked region at its initial +frame, and an editing prompt, it requires a model to do infilling at each frame +following the editing guidance while keeping the out-of-mask region intact. +There are three main challenges in text-guided video inpainting: ($i$) temporal +consistency of the edited video, ($ii$) supporting different inpainting types +at different structural fidelity levels, and ($iii$) dealing with variable +video length. To address these challenges, we introduce Any-Length Video +Inpainting with Diffusion Model, dubbed as AVID. At its core, our model is +equipped with effective motion modules and adjustable structure guidance, for +fixed-length video inpainting. 
Building on top of that, we propose a novel +Temporal MultiDiffusion sampling pipeline with a middle-frame attention +guidance mechanism, facilitating the generation of videos with any desired +duration. Our comprehensive experiments show our model can robustly deal with +various inpainting types at different video duration ranges, with high quality. +More visualization results are made publicly available at +https://zhang-zx.github.io/AVID/ .",cs.CV,['cs.CV'] +PaReNeRF: Toward Fast Large-scale Dynamic NeRF with Patch-based Reference,Xiao Tang · Min Yang · Penghui Sun · Hui Li · Yuchao Dai · feng zhu · Hojae Lee, ,https://arxiv.org/abs/2405.08609,,2405.08609.pdf,Dynamic NeRF: A Review,"Neural Radiance Field(NeRF) is an novel implicit method to achieve the 3D +reconstruction and representation with a high resolution. After the first +research of NeRF is proposed, NeRF has gained a robust developing power and is +booming in the 3D modeling, representation and reconstruction areas. However +the first and most of the followed research projects based on NeRF is static, +which are weak in the practical applications. Therefore, more researcher are +interested and focused on the study of dynamic NeRF that is more feasible and +useful in practical applications or situations. Compared with the static NeRF, +implementing the Dynamic NeRF is more difficult and complex. But Dynamic is +more potential in the future even is the basic of Editable NeRF. In this +review, we made a detailed and abundant statement for the development and +important implementation principles of Dynamci NeRF. The analysis of main +principle and development of Dynamic NeRF is from 2021 to 2023, including the +most of the Dynamic NeRF projects. What is more, with colorful and novel +special designed figures and table, We also made a detailed comparison and +analysis of different features of various of Dynamic. Besides, we analyzed and +discussed the key methods to implement a Dynamic NeRF. The volume of the +reference papers is large. The statements and comparisons are multidimensional. +With a reading of this review, the whole development history and most of the +main design method or principles of Dynamic NeRF can be easy understood and +gained.",cs.CV,['cs.CV'] +LoCoNet: Long-Short Context Network for Active Speaker Detection,Xizi Wang · Feng Cheng · Gedas Bertasius, ,https://ar5iv.labs.arxiv.org/html/2301.08237,,2301.08237.pdf,LoCoNet: Long-Short Context Network for Active Speaker Detection,"Active Speaker Detection (ASD) aims to identify who is speaking in each frame +of a video. ASD reasons from audio and visual information from two contexts: +long-term intra-speaker context and short-term inter-speaker context. Long-term +intra-speaker context models the temporal dependencies of the same speaker, +while short-term inter-speaker context models the interactions of speakers in +the same scene. These two contexts are complementary to each other and can help +infer the active speaker. Motivated by these observations, we propose LoCoNet, +a simple yet effective Long-Short Context Network that models the long-term +intra-speaker context and short-term inter-speaker context. We use +self-attention to model long-term intra-speaker context due to its +effectiveness in modeling long-range dependencies, and convolutional blocks +that capture local patterns to model short-term inter-speaker context. 
+Extensive experiments show that LoCoNet achieves state-of-the-art performance +on multiple datasets, achieving an mAP of 95.2%(+1.1%) on AVA-ActiveSpeaker, +68.1%(+22%) on Columbia dataset, 97.2%(+2.8%) on Talkies dataset and +59.7%(+8.0%) on Ego4D dataset. Moreover, in challenging cases where multiple +speakers are present, or face of active speaker is much smaller than other +faces in the same scene, LoCoNet outperforms previous state-of-the-art methods +by 3.4% on the AVA-ActiveSpeaker dataset. The code will be released at +https://github.com/SJTUwxz/LoCoNet_ASD.",cs.CV,['cs.CV'] +Realigning Confidence with Temporal Saliency Information for Point-Level Weakly-Supervised Temporal Action Localization,Ziying Xia · Jian Cheng · Siyu Liu · Yongxiang Hu · Shiguang Wang · Zhang Yijie · Wanli Dang,https://github.com/zyxia1009/CVPR2024-TSPNet,,https://link.springer.com/article/10.1007/s11063-024-11598-w,,,,,nan +3DGStream: On-the-Fly Training of 3D Gaussians for Efficient Streaming of Photo-Realistic Free-Viewpoint Videos,Jiakai Sun · Han Jiao · Guangyuan Li · Zhanjie Zhang · Lei Zhao · Wei Xing,https://sjojok.github.io/3dgstream/,https://arxiv.org/abs/2403.01444,,2403.01444.pdf,3DGStream: On-the-Fly Training of 3D Gaussians for Efficient Streaming of Photo-Realistic Free-Viewpoint Videos,"Constructing photo-realistic Free-Viewpoint Videos (FVVs) of dynamic scenes +from multi-view videos remains a challenging endeavor. Despite the remarkable +advancements achieved by current neural rendering techniques, these methods +generally require complete video sequences for offline training and are not +capable of real-time rendering. To address these constraints, we introduce +3DGStream, a method designed for efficient FVV streaming of real-world dynamic +scenes. Our method achieves fast on-the-fly per-frame reconstruction within 12 +seconds and real-time rendering at 200 FPS. Specifically, we utilize 3D +Gaussians (3DGs) to represent the scene. Instead of the na\""ive approach of +directly optimizing 3DGs per-frame, we employ a compact Neural Transformation +Cache (NTC) to model the translations and rotations of 3DGs, markedly reducing +the training time and storage required for each FVV frame. Furthermore, we +propose an adaptive 3DG addition strategy to handle emerging objects in dynamic +scenes. Experiments demonstrate that 3DGStream achieves competitive performance +in terms of rendering speed, image quality, training time, and model storage +when compared with state-of-the-art methods.",cs.CV,['cs.CV'] +Aerial Lifting: Neural Urban Semantic and Building Instance Lifting from Aerial Imagery,Yuqi Zhang · Guanying Chen · Jiaxing Chen · Shuguang Cui,https://zyqz97.github.io/Aerial_Lifting/,https://arxiv.org/abs/2403.11812,,2403.11812.pdf,Aerial Lifting: Neural Urban Semantic and Building Instance Lifting from Aerial Imagery,"We present a neural radiance field method for urban-scale semantic and +building-level instance segmentation from aerial images by lifting noisy 2D +labels to 3D. This is a challenging problem due to two primary reasons. +Firstly, objects in urban aerial images exhibit substantial variations in size, +including buildings, cars, and roads, which pose a significant challenge for +accurate 2D segmentation. Secondly, the 2D labels generated by existing +segmentation methods suffer from the multi-view inconsistency problem, +especially in the case of aerial images, where each image captures only a small +portion of the entire scene. 
To overcome these limitations, we first introduce +a scale-adaptive semantic label fusion strategy that enhances the segmentation +of objects of varying sizes by combining labels predicted from different +altitudes, harnessing the novel-view synthesis capabilities of NeRF. We then +introduce a novel cross-view instance label grouping strategy based on the 3D +scene representation to mitigate the multi-view inconsistency problem in the 2D +instance labels. Furthermore, we exploit multi-view reconstructed depth priors +to improve the geometric quality of the reconstructed radiance field, resulting +in enhanced segmentation results. Experiments on multiple real-world +urban-scale datasets demonstrate that our approach outperforms existing +methods, highlighting its effectiveness.",cs.CV,['cs.CV'] +NetTrack: Tracking Highly Dynamic Objects with a Net,Guangze Zheng · Shijie Lin · Haobo Zuo · Changhong Fu · Jia Pan, ,https://arxiv.org/abs/2403.11186,,2403.11186.pdf,NetTrack: Tracking Highly Dynamic Objects with a Net,"The complex dynamicity of open-world objects presents non-negligible +challenges for multi-object tracking (MOT), often manifested as severe +deformations, fast motion, and occlusions. Most methods that solely depend on +coarse-grained object cues, such as boxes and the overall appearance of the +object, are susceptible to degradation due to distorted internal relationships +of dynamic objects. To address this problem, this work proposes NetTrack, an +efficient, generic, and affordable tracking framework to introduce fine-grained +learning that is robust to dynamicity. Specifically, NetTrack constructs a +dynamicity-aware association with a fine-grained Net, leveraging point-level +visual cues. Correspondingly, a fine-grained sampler and matching method have +been incorporated. Furthermore, NetTrack learns object-text correspondence for +fine-grained localization. To evaluate MOT in extremely dynamic open-world +scenarios, a bird flock tracking (BFT) dataset is constructed, which exhibits +high dynamicity with diverse species and open-world scenarios. Comprehensive +evaluation on BFT validates the effectiveness of fine-grained learning on +object dynamicity, and thorough transfer experiments on challenging open-world +benchmarks, i.e., TAO, TAO-OW, AnimalTrack, and GMOT-40, validate the strong +generalization ability of NetTrack even without finetuning. Project page: +https://george-zhuang.github.io/nettrack/.",cs.CV,['cs.CV'] +"Advancing Saliency Ranking with Human Fixations: Dataset, Models and Benchmarks",Bowen Deng · Siyang Song · Andrew French · Denis Schluppeck · Michael Pound, ,,https://github.com/topics/saliency-ranking-dateset,,,,,nan +Breathing Life Into Sketches Using Text-to-Video Priors,Rinon Gal · Yael Vinker · Yuval Alaluf · Amit H. Bermano · Daniel Cohen-Or · Ariel Shamir · Gal Chechik, ,https://arxiv.org/abs/2311.13608,,2311.13608.pdf,Breathing Life Into Sketches Using Text-to-Video Priors,"A sketch is one of the most intuitive and versatile tools humans use to +convey their ideas visually. An animated sketch opens another dimension to the +expression of ideas and is widely used by designers for a variety of purposes. +Animating sketches is a laborious process, requiring extensive experience and +professional design skills. In this work, we present a method that +automatically adds motion to a single-subject sketch (hence, ""breathing life +into it""), merely by providing a text prompt indicating the desired motion. 
The +output is a short animation provided in vector representation, which can be +easily edited. Our method does not require extensive training, but instead +leverages the motion prior of a large pretrained text-to-video diffusion model +using a score-distillation loss to guide the placement of strokes. To promote +natural and smooth motion and to better preserve the sketch's appearance, we +model the learned motion through two components. The first governs small local +deformations and the second controls global affine transformations. +Surprisingly, we find that even models that struggle to generate sketch videos +on their own can still serve as a useful backbone for animating abstract +representations.",cs.CV,"['cs.CV', 'cs.GR', 'cs.LG']" +BrainWash: A Poisoning Attack to Forget in Continual Learning,Ali Abbasi · Parsa Nooralinejad · Hamed Pirsiavash · Soheil Kolouri, ,https://arxiv.org/abs/2311.11995,,2311.11995.pdf,BrainWash: A Poisoning Attack to Forget in Continual Learning,"Continual learning has gained substantial attention within the deep learning +community, offering promising solutions to the challenging problem of +sequential learning. Yet, a largely unexplored facet of this paradigm is its +susceptibility to adversarial attacks, especially with the aim of inducing +forgetting. In this paper, we introduce ""BrainWash,"" a novel data poisoning +method tailored to impose forgetting on a continual learner. By adding the +BrainWash noise to a variety of baselines, we demonstrate how a trained +continual learner can be induced to forget its previously learned tasks +catastrophically, even when using these continual learning baselines. An +important feature of our approach is that the attacker requires no access to +previous tasks' data and is armed merely with the model's current parameters +and the data belonging to the most recent task. Our extensive experiments +highlight the efficacy of BrainWash, showcasing degradation in performance +across various regularization-based continual learning methods.",cs.LG,"['cs.LG', 'cs.AI', 'cs.CR']" +ConTex-Human: Free-View Rendering of Human from a Single Image with Texture-Consistent Synthesis,Xiangjun Gao · Xiaoyu Li · Chaopeng Zhang · Qi Zhang · Yan-Pei Cao · Ying Shan · Long Quan, ,https://arxiv.org/abs/2311.17123,,2311.17123.pdf,ConTex-Human: Free-View Rendering of Human from a Single Image with Texture-Consistent Synthesis,"In this work, we propose a method to address the challenge of rendering a 3D +human from a single image in a free-view manner. Some existing approaches could +achieve this by using generalizable pixel-aligned implicit fields to +reconstruct a textured mesh of a human or by employing a 2D diffusion model as +guidance with the Score Distillation Sampling (SDS) method, to lift the 2D +image into 3D space. However, a generalizable implicit field often results in +an over-smooth texture field, while the SDS method tends to lead to a +texture-inconsistent novel view with the input image. In this paper, we +introduce a texture-consistent back view synthesis module that could transfer +the reference image content to the back view through depth and text-guided +attention injection. Moreover, to alleviate the color distortion that occurs in +the side region, we propose a visibility-aware patch consistency regularization +for texture mapping and refinement combined with the synthesized back view +texture. With the above techniques, we could achieve high-fidelity and +texture-consistent human rendering from a single image. 
Experiments conducted +on both real and synthetic data demonstrate the effectiveness of our method and +show that our approach outperforms previous baseline methods.",cs.CV,"['cs.CV', 'cs.AI']" +NoiseCollage: A Layout-Aware Text-to-Image Diffusion Model Based on Noise Cropping and Merging,Takahiro Shirakawa · Seiichi Uchida, ,https://arxiv.org/abs/2403.03485,,2403.03485.pdf,NoiseCollage: A Layout-Aware Text-to-Image Diffusion Model Based on Noise Cropping and Merging,"Layout-aware text-to-image generation is a task to generate multi-object +images that reflect layout conditions in addition to text conditions. The +current layout-aware text-to-image diffusion models still have several issues, +including mismatches between the text and layout conditions and quality +degradation of generated images. This paper proposes a novel layout-aware +text-to-image diffusion model called NoiseCollage to tackle these issues. +During the denoising process, NoiseCollage independently estimates noises for +individual objects and then crops and merges them into a single noise. This +operation helps avoid condition mismatches; in other words, it can put the +right objects in the right places. Qualitative and quantitative evaluations +show that NoiseCollage outperforms several state-of-the-art models. These +successful results indicate that the crop-and-merge operation of noises is a +reasonable strategy to control image generation. We also show that NoiseCollage +can be integrated with ControlNet to use edges, sketches, and pose skeletons as +additional conditions. Experimental results show that this integration boosts +the layout accuracy of ControlNet. The code is available at +https://github.com/univ-esuty/noisecollage.",cs.CV,['cs.CV'] +SkillDiffuser: Interpretable Hierarchical Planning via Skill Abstractions in Diffusion-Based Task Execution,Zhixuan Liang · Yao Mu · Hengbo Ma · Masayoshi Tomizuka · Mingyu Ding · Ping Luo,https://skilldiffuser.github.io/,https://arxiv.org/abs/2312.11598,,2312.11598.pdf,SkillDiffuser: Interpretable Hierarchical Planning via Skill Abstractions in Diffusion-Based Task Execution,"Diffusion models have demonstrated strong potential for robotic trajectory +planning. However, generating coherent trajectories from high-level +instructions remains challenging, especially for long-range composition tasks +requiring multiple sequential skills. We propose SkillDiffuser, an end-to-end +hierarchical planning framework integrating interpretable skill learning with +conditional diffusion planning to address this problem. At the higher level, +the skill abstraction module learns discrete, human-understandable skill +representations from visual observations and language instructions. These +learned skill embeddings are then used to condition the diffusion model to +generate customized latent trajectories aligned with the skills. This allows +generating diverse state trajectories that adhere to the learnable skills. By +integrating skill learning with conditional trajectory generation, +SkillDiffuser produces coherent behavior following abstract instructions across +diverse tasks. Experiments on multi-task robotic manipulation benchmarks like +Meta-World and LOReL demonstrate state-of-the-art performance and +human-interpretable skill representations from SkillDiffuser. 
More +visualization results and information could be found on our website.",cs.RO,"['cs.RO', 'cs.CV', 'cs.LG']" +CurveCloudNet: Processing Point Clouds with 1D Structure,Colton Stearns · Alex Fu · Jiateng Liu · Jeong Joon Park · Davis Rempe · Despoina Paschalidou · Leonidas Guibas, ,https://arxiv.org/abs/2312.12743,,2312.12743.pdf,PointeNet: A Lightweight Framework for Effective and Efficient Point Cloud Analysis,"Current methodologies in point cloud analysis predominantly explore 3D +geometries, often achieved through the introduction of intricate learnable +geometric extractors in the encoder or by deepening networks with repeated +blocks. However, these approaches inevitably lead to a significant number of +learnable parameters, resulting in substantial computational costs and imposing +memory burdens on CPU/GPU. Additionally, the existing strategies are primarily +tailored for object-level point cloud classification and segmentation tasks, +with limited extensions to crucial scene-level applications, such as autonomous +driving. In response to these limitations, we introduce PointeNet, an efficient +network designed specifically for point cloud analysis. PointeNet distinguishes +itself with its lightweight architecture, low training cost, and plug-and-play +capability, effectively capturing representative features. The network consists +of a Multivariate Geometric Encoding (MGE) module and an optional +Distance-aware Semantic Enhancement (DSE) module. The MGE module employs +operations of sampling, grouping, and multivariate geometric aggregation to +lightweightly capture and adaptively aggregate multivariate geometric features, +providing a comprehensive depiction of 3D geometries. The DSE module, designed +for real-world autonomous driving scenarios, enhances the semantic perception +of point clouds, particularly for distant points. Our method demonstrates +flexibility by seamlessly integrating with a classification/segmentation head +or embedding into off-the-shelf 3D object detection networks, achieving notable +performance improvements at a minimal cost. Extensive experiments on +object-level datasets, including ModelNet40, ScanObjectNN, ShapeNetPart, and +the scene-level dataset KITTI, demonstrate the superior performance of +PointeNet over state-of-the-art methods in point cloud analysis.",cs.CV,['cs.CV'] +LAN: Learning to Adapt Noise for Image Denoising,Changjin Kim · Tae Hyun Kim · Sungyong Baik, ,https://arxiv.org/abs/2403.15132,,2403.15132.pdf,Transfer CLIP for Generalizable Image Denoising,"Image denoising is a fundamental task in computer vision. While prevailing +deep learning-based supervised and self-supervised methods have excelled in +eliminating in-distribution noise, their susceptibility to out-of-distribution +(OOD) noise remains a significant challenge. The recent emergence of +contrastive language-image pre-training (CLIP) model has showcased exceptional +capabilities in open-world image recognition and segmentation. Yet, the +potential for leveraging CLIP to enhance the robustness of low-level tasks +remains largely unexplored. This paper uncovers that certain dense features +extracted from the frozen ResNet image encoder of CLIP exhibit +distortion-invariant and content-related properties, which are highly desirable +for generalizable denoising. 
Leveraging these properties, we devise an +asymmetrical encoder-decoder denoising network, which incorporates dense +features including the noisy image and its multi-scale features from the frozen +ResNet encoder of CLIP into a learnable image decoder to achieve generalizable +denoising. The progressive feature augmentation strategy is further proposed to +mitigate feature overfitting and improve the robustness of the learnable +decoder. Extensive experiments and comparisons conducted across diverse OOD +noises, including synthetic noise, real-world sRGB noise, and low-dose CT image +noise, demonstrate the superior generalization ability of our method.",cs.CV,"['cs.CV', 'eess.IV']" +Structured Gradient-based Interpretations via Norm-Regularized Adversarial Training,Shizhan Gong · Qi Dou · Farzan Farnia, ,https://arxiv.org/abs/2404.04647,,2404.04647.pdf,Structured Gradient-based Interpretations via Norm-Regularized Adversarial Training,"Gradient-based saliency maps have been widely used to explain the decisions +of deep neural network classifiers. However, standard gradient-based +interpretation maps, including the simple gradient and integrated gradient +algorithms, often lack desired structures such as sparsity and connectedness in +their application to real-world computer vision models. A frequently used +approach to inducing sparsity structures into gradient-based saliency maps is +to alter the simple gradient scheme using sparsification or norm-based +regularization. A drawback with such post-processing methods is their +frequently-observed significant loss in fidelity to the original simple +gradient map. In this work, we propose to apply adversarial training as an +in-processing scheme to train neural networks with structured simple gradient +maps. We show a duality relation between the regularized norms of the +adversarial perturbations and gradient-based maps, based on which we design +adversarial training loss functions promoting sparsity and group-sparsity +properties in simple gradient maps. We present several numerical results to +show the influence of our proposed norm-based adversarial training methods on +the standard gradient-based maps of standard neural network architectures on +benchmark image datasets.",cs.CV,['cs.CV'] +MoSAR: Monocular Semi-Supervised Model for Avatar Reconstruction using Differentiable Shading,Abdallah Dib · Luiz Gustavo Hafemann · Emeline Got · Trevor Anderson · Amin Fadaeinejad · Rafael M. O. Cruz · Marc-André Carbonneau, ,https://arxiv.org/abs/2312.13091v2,,2312.13091v2.pdf,MoSAR: Monocular Semi-Supervised Model for Avatar Reconstruction using Differentiable Shading,"Reconstructing an avatar from a portrait image has many applications in +multimedia, but remains a challenging research problem. Extracting reflectance +maps and geometry from one image is ill-posed: recovering geometry is a +one-to-many mapping problem and reflectance and light are difficult to +disentangle. Accurate geometry and reflectance can be captured under the +controlled conditions of a light stage, but it is costly to acquire large +datasets in this fashion. Moreover, training solely with this type of data +leads to poor generalization with in-the-wild images. This motivates the +introduction of MoSAR, a method for 3D avatar generation from monocular images. +We propose a semi-supervised training scheme that improves generalization by +learning from both light stage and in-the-wild datasets. This is achieved using +a novel differentiable shading formulation. 
We show that our approach +effectively disentangles the intrinsic face parameters, producing relightable +avatars. As a result, MoSAR estimates a richer set of skin reflectance maps, +and generates more realistic avatars than existing state-of-the-art methods. We +also introduce a new dataset, named FFHQ-UV-Intrinsics, the first public +dataset providing intrinsic face attributes at scale (diffuse, specular, +ambient occlusion and translucency maps) for a total of 10k subjects. The +project website and the dataset are available on the following link: +https://ubisoft-laforge.github.io/character/mosar/",cs.CV,"['cs.CV', 'cs.GR', 'cs.LG', '68T45 (Primary) 68T07, 68T01 (Secondary)', 'I.2.10; I.4; I.3.3; I.5']" +Cinematic Behavior Transfer via NeRF-based Differentiable Filming,Xuekun Jiang · Anyi Rao · Jingbo Wang · Dahua Lin · Bo Dai, ,https://arxiv.org/abs/2311.17754,,2311.17754.pdf,Cinematic Behavior Transfer via NeRF-based Differentiable Filming,"In the evolving landscape of digital media and video production, the precise +manipulation and reproduction of visual elements like camera movements and +character actions are highly desired. Existing SLAM methods face limitations in +dynamic scenes and human pose estimation often focuses on 2D projections, +neglecting 3D statuses. To address these issues, we first introduce a reverse +filming behavior estimation technique. It optimizes camera trajectories by +leveraging NeRF as a differentiable renderer and refining SMPL tracks. We then +introduce a cinematic transfer pipeline that is able to transfer various shot +types to a new 2D video or a 3D virtual environment. The incorporation of 3D +engine workflow enables superior rendering and control abilities, which also +achieves a higher rating in the user study.",cs.CV,"['cs.CV', 'cs.GR', 'cs.HC', 'cs.MM']" +Instance-aware Exploration-Verification-Exploitation for Instance ImageGoal Navigation,Xiaohan Lei · Min Wang · Wengang Zhou · Li Li · Houqiang Li,https://xiaohanlei.github.io/projects/IEVE/,https://arxiv.org/abs/2402.17587,,2402.17587.pdf,Instance-aware Exploration-Verification-Exploitation for Instance ImageGoal Navigation,"As a new embodied vision task, Instance ImageGoal Navigation (IIN) aims to +navigate to a specified object depicted by a goal image in an unexplored +environment. + The main challenge of this task lies in identifying the target object from +different viewpoints while rejecting similar distractors. + Existing ImageGoal Navigation methods usually adopt the simple +Exploration-Exploitation framework and ignore the identification of specific +instance during navigation. + In this work, we propose to imitate the human behaviour of ``getting closer +to confirm"" when distinguishing objects from a distance. + Specifically, we design a new modular navigation framework named +Instance-aware Exploration-Verification-Exploitation (IEVE) for instance-level +image goal navigation. + Our method allows for active switching among the exploration, verification, +and exploitation actions, thereby facilitating the agent in making reasonable +decisions under different situations. + On the challenging HabitatMatterport 3D semantic (HM3D-SEM) dataset, our +method surpasses previous state-of-the-art work, with a classical segmentation +model (0.684 vs. 0.561 success) or a robust model (0.702 vs. 
0.561 success)",cs.CV,"['cs.CV', 'cs.RO']" +TextNeRF: A Novel Scene-Text Image Synthesis Method based on Neural Radiance Fields,Jialei Cui · Jianwei Du · Wenzhuo Liu · Zhouhui Lian, ,https://arxiv.org/abs/2403.01325,,2403.01325.pdf,NeRF-VPT: Learning Novel View Representations with Neural Radiance Fields via View Prompt Tuning,"Neural Radiance Fields (NeRF) have garnered remarkable success in novel view +synthesis. Nonetheless, the task of generating high-quality images for novel +views persists as a critical challenge. While the existing efforts have +exhibited commendable progress, capturing intricate details, enhancing +textures, and achieving superior Peak Signal-to-Noise Ratio (PSNR) metrics +warrant further focused attention and advancement. In this work, we propose +NeRF-VPT, an innovative method for novel view synthesis to address these +challenges. Our proposed NeRF-VPT employs a cascading view prompt tuning +paradigm, wherein RGB information gained from preceding rendering outcomes +serves as instructive visual prompts for subsequent rendering stages, with the +aspiration that the prior knowledge embedded in the prompts can facilitate the +gradual enhancement of rendered image quality. NeRF-VPT only requires sampling +RGB data from previous stage renderings as priors at each training stage, +without relying on extra guidance or complex techniques. Thus, our NeRF-VPT is +plug-and-play and can be readily integrated into existing methods. By +conducting comparative analyses of our NeRF-VPT against several NeRF-based +approaches on demanding real-scene benchmarks, such as Realistic Synthetic 360, +Real Forward-Facing, Replica dataset, and a user-captured dataset, we +substantiate that our NeRF-VPT significantly elevates baseline performance and +proficiently generates more high-quality novel view images than all the +compared state-of-the-art methods. Furthermore, the cascading learning of +NeRF-VPT introduces adaptability to scenarios with sparse inputs, resulting in +a significant enhancement of accuracy for sparse-view novel view synthesis. The +source code and dataset are available at +\url{https://github.com/Freedomcls/NeRF-VPT}.",cs.CV,['cs.CV'] +Sparse Global Matching for Video Frame Interpolation with Large Motion,Chunxu Liu · Guozhen Zhang · Rui Zhao · Limin Wang, ,https://arxiv.org/abs/2404.06913,,2404.06913.pdf,Sparse Global Matching for Video Frame Interpolation with Large Motion,"Large motion poses a critical challenge in Video Frame Interpolation (VFI) +task. Existing methods are often constrained by limited receptive fields, +resulting in sub-optimal performance when handling scenarios with large motion. +In this paper, we introduce a new pipeline for VFI, which can effectively +integrate global-level information to alleviate issues associated with large +motion. Specifically, we first estimate a pair of initial intermediate flows +using a high-resolution feature map for extracting local details. Then, we +incorporate a sparse global matching branch to compensate for flow estimation, +which consists of identifying flaws in initial flows and generating sparse flow +compensation with a global receptive field. Finally, we adaptively merge the +initial flow estimation with global flow compensation, yielding a more accurate +intermediate flow. To evaluate the effectiveness of our method in handling +large motion, we carefully curate a more challenging subset from commonly used +benchmarks. 
Our method demonstrates the state-of-the-art performance on these +VFI subsets with large motion.",cs.CV,['cs.CV'] +StraightPCF: Straight Point Cloud Filtering,Dasith de Silva Edirimuni · Xuequan Lu · Gang Li · Lei Wei · Antonio Robles-Kelly · Hongdong Li,https://ddsediri.github.io/ projects/StraightPCF/,https://arxiv.org/abs/2405.08322,,2405.08322.pdf,StraightPCF: Straight Point Cloud Filtering,"Point cloud filtering is a fundamental 3D vision task, which aims to remove +noise while recovering the underlying clean surfaces. State-of-the-art methods +remove noise by moving noisy points along stochastic trajectories to the clean +surfaces. These methods often require regularization within the training +objective and/or during post-processing, to ensure fidelity. In this paper, we +introduce StraightPCF, a new deep learning based method for point cloud +filtering. It works by moving noisy points along straight paths, thus reducing +discretization errors while ensuring faster convergence to the clean surfaces. +We model noisy patches as intermediate states between high noise patch variants +and their clean counterparts, and design the VelocityModule to infer a constant +flow velocity from the former to the latter. This constant flow leads to +straight filtering trajectories. In addition, we introduce a DistanceModule +that scales the straight trajectory using an estimated distance scalar to +attain convergence near the clean surface. Our network is lightweight and only +has $\sim530K$ parameters, being 17% of IterativePFN (a most recent point cloud +filtering network). Extensive experiments on both synthetic and real-world data +show our method achieves state-of-the-art results. Our method also demonstrates +nice distributions of filtered points without the need for regularization. The +implementation code can be found at: https://github.com/ddsediri/StraightPCF.",cs.CV,['cs.CV'] +MULDE: Multiscale Log-Density Estimation via Denoising Score Matching for Video Anomaly Detection,Jakub Micorek · Horst Possegger · Dominik Narnhofer · Horst Bischof · Mateusz Kozinski,https://github.com/jakubmicorek/MULDE-Multiscale-Log-Density-Estimation-via-Denoising-Score-Matching-for-Video-Anomaly-Detection,https://arxiv.org/abs/2403.14497,,2403.14497.pdf,MULDE: Multiscale Log-Density Estimation via Denoising Score Matching for Video Anomaly Detection,"We propose a novel approach to video anomaly detection: we treat feature +vectors extracted from videos as realizations of a random variable with a fixed +distribution and model this distribution with a neural network. This lets us +estimate the likelihood of test videos and detect video anomalies by +thresholding the likelihood estimates. We train our video anomaly detector +using a modification of denoising score matching, a method that injects +training data with noise to facilitate modeling its distribution. To eliminate +hyperparameter selection, we model the distribution of noisy video features +across a range of noise levels and introduce a regularizer that tends to align +the models for different levels of noise. At test time, we combine anomaly +indications at multiple noise scales with a Gaussian mixture model. Running our +video anomaly detector induces minimal delays as inference requires merely +extracting the features and forward-propagating them through a shallow neural +network and a Gaussian mixture model. 
Our experiments on five popular video +anomaly detection benchmarks demonstrate state-of-the-art performance, both in +the object-centric and in the frame-centric setup.",cs.CV,['cs.CV'] +Watermark-embedded Adversarial Examples for Copyright Protection against Diffusion Models,Peifei Zhu · Tsubasa Takahashi · Hirokatsu Kataoka, ,https://arxiv.org/abs/2404.09401,,2404.09401.pdf,Watermark-embedded Adversarial Examples for Copyright Protection against Diffusion Models,"Diffusion Models (DMs) have shown remarkable capabilities in various +image-generation tasks. However, there are growing concerns that DMs could be +used to imitate unauthorized creations and thus raise copyright issues. To +address this issue, we propose a novel framework that embeds personal +watermarks in the generation of adversarial examples. Such examples can force +DMs to generate images with visible watermarks and prevent DMs from imitating +unauthorized images. We construct a generator based on conditional adversarial +networks and design three losses (adversarial loss, GAN loss, and perturbation +loss) to generate adversarial examples that have subtle perturbation but can +effectively attack DMs to prevent copyright violations. Training a generator +for a personal watermark by our method only requires 5-10 samples within 2-3 +minutes, and once the generator is trained, it can generate adversarial +examples with that watermark significantly fast (0.2s per image). We conduct +extensive experiments in various conditional image-generation scenarios. +Compared to existing methods that generate images with chaotic textures, our +method adds visible watermarks on the generated images, which is a more +straightforward way to indicate copyright violations. We also observe that our +adversarial examples exhibit good transferability across unknown generative +models. Therefore, this work provides a simple yet powerful way to protect +copyright from DM-based imitation.",cs.CV,"['cs.CV', 'cs.AI']" +Dr. Bokeh: DiffeRentiable Occlusion-aware Bokeh Rendering,Yichen Sheng · Zixun Yu · Lu Ling · Zhiwen Cao · Xuaner Zhang · Xin Lu · Ke Xian · Haiting Lin · Bedrich Benes, ,https://arxiv.org/abs/2308.08843,,2308.08843.pdf,Dr.Bokeh: DiffeRentiable Occlusion-aware Bokeh Rendering,"Bokeh is widely used in photography to draw attention to the subject while +effectively isolating distractions in the background. Computational methods +simulate bokeh effects without relying on a physical camera lens. However, in +the realm of digital bokeh synthesis, the two main challenges for bokeh +synthesis are color bleeding and partial occlusion at object boundaries. Our +primary goal is to overcome these two major challenges using physics principles +that define bokeh formation. To achieve this, we propose a novel and accurate +filtering-based bokeh rendering equation and a physically-based occlusion-aware +bokeh renderer, dubbed Dr.Bokeh, which addresses the aforementioned challenges +during the rendering stage without the need of post-processing or data-driven +approaches. Our rendering algorithm first preprocesses the input RGBD to obtain +a layered scene representation. Dr.Bokeh then takes the layered representation +and user-defined lens parameters to render photo-realistic lens blur. By +softening non-differentiable operations, we make Dr.Bokeh differentiable such +that it can be plugged into a machine-learning framework. 
We perform +quantitative and qualitative evaluations on synthetic and real-world images to +validate the effectiveness of the rendering quality and the differentiability +of our method. We show Dr.Bokeh not only outperforms state-of-the-art bokeh +rendering algorithms in terms of photo-realism but also improves the depth +quality from depth-from-defocus.",cs.GR,['cs.GR'] +XScale-NVS: Cross-Scale Novel View Synthesis with Hash Featurized Manifold,Guangyu Wang · Jinzhi Zhang · Fan Wang · Ruqi Huang · Lu Fang, ,https://arxiv.org/abs/2403.19517,,2403.19517.pdf,XScale-NVS: Cross-Scale Novel View Synthesis with Hash Featurized Manifold,"We propose XScale-NVS for high-fidelity cross-scale novel view synthesis of +real-world large-scale scenes. Existing representations based on explicit +surface suffer from discretization resolution or UV distortion, while implicit +volumetric representations lack scalability for large scenes due to the +dispersed weight distribution and surface ambiguity. In light of the above +challenges, we introduce hash featurized manifold, a novel hash-based +featurization coupled with a deferred neural rendering framework. This approach +fully unlocks the expressivity of the representation by explicitly +concentrating the hash entries on the 2D manifold, thus effectively +representing highly detailed contents independent of the discretization +resolution. We also introduce a novel dataset, namely GigaNVS, to benchmark +cross-scale, high-resolution novel view synthesis of realworld large-scale +scenes. Our method significantly outperforms competing baselines on various +real-world scenes, yielding an average LPIPS that is 40% lower than prior +state-of-the-art on the challenging GigaNVS benchmark. Please see our project +page at: xscalenvs.github.io.",cs.CV,['cs.CV'] +Improved Zero-Shot Classification by Adapting VLMs with Text Descriptions,Oindrila Saha · Grant Horn · Subhransu Maji,https://github.com/cvl-umass/AdaptCLIPZS/,https://arxiv.org/abs/2401.02460,,2401.02460.pdf,Improved Zero-Shot Classification by Adapting VLMs with Text Descriptions,"The zero-shot performance of existing vision-language models (VLMs) such as +CLIP is limited by the availability of large-scale, aligned image and text +datasets in specific domains. In this work, we leverage two complementary +sources of information -- descriptions of categories generated by large +language models (LLMs) and abundant, fine-grained image classification datasets +-- to improve the zero-shot classification performance of VLMs across +fine-grained domains. On the technical side, we develop methods to train VLMs +with this ""bag-level"" image-text supervision. We find that simply using these +attributes at test-time does not improve performance, but our training +strategy, for example, on the iNaturalist dataset, leads to an average +improvement of 4-5% in zero-shot classification accuracy for novel categories +of birds and flowers. Similar improvements are observed in domains where a +subset of the categories was used to fine-tune the model. By prompting LLMs in +various ways, we generate descriptions that capture visual appearance, habitat, +and geographic regions and pair them with existing attributes such as the +taxonomic structure of the categories. We systematically evaluate their ability +to improve zero-shot categorization in natural domains. Our findings suggest +that geographic priors can be just as effective and are complementary to visual +appearance. 
Our method also outperforms prior work on prompt-based tuning of +VLMs. We release the benchmark, consisting of 14 datasets at +https://github.com/cvl-umass/AdaptCLIPZS , which will contribute to future +research in zero-shot recognition.",cs.CV,['cs.CV'] +Contrastive Learning for DeepFake Classification and Localization via Multi-Label Ranking,Cheng-Yao Hong · Yen-Chi Hsu · Tyng-Luh Liu, ,https://arxiv.org/abs/2401.01448,,2401.01448.pdf,ProbMCL: Simple Probabilistic Contrastive Learning for Multi-label Visual Classification,"Multi-label image classification presents a challenging task in many domains, +including computer vision and medical imaging. Recent advancements have +introduced graph-based and transformer-based methods to improve performance and +capture label dependencies. However, these methods often include complex +modules that entail heavy computation and lack interpretability. In this paper, +we propose Probabilistic Multi-label Contrastive Learning (ProbMCL), a novel +framework to address these challenges in multi-label image classification +tasks. Our simple yet effective approach employs supervised contrastive +learning, in which samples that share enough labels with an anchor image based +on a decision threshold are introduced as a positive set. This structure +captures label dependencies by pulling positive pair embeddings together and +pushing away negative samples that fall below the threshold. We enhance +representation learning by incorporating a mixture density network into +contrastive learning and generating Gaussian mixture distributions to explore +the epistemic uncertainty of the feature encoder. We validate the effectiveness +of our framework through experimentation with datasets from the computer vision +and medical imaging domains. Our method outperforms the existing +state-of-the-art methods while achieving a low computational footprint on both +datasets. Visualization analyses also demonstrate that ProbMCL-learned +classifiers maintain a meaningful semantic topology.",cs.CV,"['cs.CV', 'cs.LG']" +"Towards Automatic Power Battery Detection: New Challenge, Benchmark Dataset and Baseline",Xiaoqi Zhao · Youwei Pang · Zhenyu Chen · Qian Yu · Lihe Zhang · Hanqi Liu · Jiaming Zuo · Huchuan Lu, ,https://arxiv.org/abs/2312.02528,,2312.02528.pdf,"Towards Automatic Power Battery Detection: New Challenge, Benchmark Dataset and Baseline","We conduct a comprehensive study on a new task named power battery detection +(PBD), which aims to localize the dense cathode and anode plates endpoints from +X-ray images to evaluate the quality of power batteries. Existing manufacturers +usually rely on human eye observation to complete PBD, which makes it difficult +to balance the accuracy and efficiency of detection. To address this issue and +drive more attention into this meaningful task, we first elaborately collect a +dataset, called X-ray PBD, which has $1,500$ diverse X-ray images selected from +thousands of power batteries of $5$ manufacturers, with $7$ different visual +interference. Then, we propose a novel segmentation-based solution for PBD, +termed multi-dimensional collaborative network (MDCNet). With the help of line +and counting predictors, the representation of the point segmentation branch +can be improved at both semantic and detail aspects.Besides, we design an +effective distance-adaptive mask generation strategy, which can alleviate the +visual challenge caused by the inconsistent distribution density of plates to +provide MDCNet with stable supervision. 
Without any bells and whistles, our +segmentation-based MDCNet consistently outperforms various other corner +detection, crowd counting and general/tiny object detection-based solutions, +making it a strong baseline that can help facilitate future research in PBD. +Finally, we share some potential difficulties and works for future researches. +The source code and datasets will be publicly available at +\href{https://github.com/Xiaoqi-Zhao-DLUT/X-ray-PBD}{X-ray PBD}.",cs.CV,['cs.CV'] +SPU-PMD: Self-Supervised Point Cloud Upsampling via Progressive Mesh Deformation,Yanzhe Liu · Rong Chen · Yushi Li · Yixi Li · Xuehou Tan, ,,https://dl.acm.org/doi/10.1109/TPAMI.2023.3287628,,,,,nan +Bidirectional Autoregessive Diffusion Model for Dance Generation,Canyu Zhang · Youbao Tang · NING Zhang · Ruei-Sung Lin · Mei Han · Jing Xiao · Song Wang, ,https://arxiv.org/abs/2402.04356,,2402.04356.pdf,Bidirectional Autoregressive Diffusion Model for Dance Generation,"Dance serves as a powerful medium for expressing human emotions, but the +lifelike generation of dance is still a considerable challenge. Recently, +diffusion models have showcased remarkable generative abilities across various +domains. They hold promise for human motion generation due to their adaptable +many-to-many nature. Nonetheless, current diffusion-based motion generation +models often create entire motion sequences directly and unidirectionally, +lacking focus on the motion with local and bidirectional enhancement. When +choreographing high-quality dance movements, people need to take into account +not only the musical context but also the nearby music-aligned dance motions. +To authentically capture human behavior, we propose a Bidirectional +Autoregressive Diffusion Model (BADM) for music-to-dance generation, where a +bidirectional encoder is built to enforce that the generated dance is +harmonious in both the forward and backward directions. To make the generated +dance motion smoother, a local information decoder is built for local motion +enhancement. The proposed framework is able to generate new motions based on +the input conditions and nearby motions, which foresees individual motion +slices iteratively and consolidates all predictions. To further refine the +synchronicity between the generated dance and the beat, the beat information is +incorporated as an input to generate better music-aligned dance movements. +Experimental results demonstrate that the proposed model achieves +state-of-the-art performance compared to existing unidirectional approaches on +the prominent benchmark for music-to-dance generation.",cs.SD,"['cs.SD', 'cs.CV', 'eess.AS']" +Enhancing Intrinsic Features for Debiasing via Investigating Class-Discerning Common Attributes in Bias-Contrastive Pair,Jeonghoon Park · Chaeyeon Chung · Jaegul Choo, ,https://arxiv.org/abs/2404.19250,,2404.19250.pdf,Enhancing Intrinsic Features for Debiasing via Investigating Class-Discerning Common Attributes in Bias-Contrastive Pair,"In the image classification task, deep neural networks frequently rely on +bias attributes that are spuriously correlated with a target class in the +presence of dataset bias, resulting in degraded performance when applied to +data without bias attributes. The task of debiasing aims to compel classifiers +to learn intrinsic attributes that inherently define a target class rather than +focusing on bias attributes. 
While recent approaches mainly focus on +emphasizing the learning of data samples without bias attributes (i.e., +bias-conflicting samples) compared to samples with bias attributes (i.e., +bias-aligned samples), they fall short of directly guiding models where to +focus for learning intrinsic features. To address this limitation, this paper +proposes a method that provides the model with explicit spatial guidance that +indicates the region of intrinsic features. We first identify the intrinsic +features by investigating the class-discerning common features between a +bias-aligned (BA) sample and a bias-conflicting (BC) sample (i.e., +bias-contrastive pair). Next, we enhance the intrinsic features in the BA +sample that are relatively under-exploited for prediction compared to the BC +sample. To construct the bias-contrastive pair without using bias information, +we introduce a bias-negative score that distinguishes BC samples from BA +samples employing a biased model. The experiments demonstrate that our method +achieves state-of-the-art performance on synthetic and real-world datasets with +various levels of bias severity.",cs.CV,['cs.CV'] +ID-Blau: Image Deblurring by Implicit Diffusion-based reBLurring AUgmentation,Jia-Hao Wu · Fu-Jen Tsai · Yan-Tsung Peng · Charles Tsai · Chia-Wen Lin · Yen-Yu Lin,https://github.com/plusgood-steven/ID-Blau,https://arxiv.org/abs/2312.10998v1,,2312.10998v1.pdf,ID-Blau: Image Deblurring by Implicit Diffusion-based reBLurring AUgmentation,"Image deblurring aims to remove undesired blurs from an image captured in a +dynamic scene. Much research has been dedicated to improving deblurring +performance through model architectural designs. However, there is little work +on data augmentation for image deblurring. Since continuous motion causes +blurred artifacts during image exposure, we aspire to develop a groundbreaking +blur augmentation method to generate diverse blurred images by simulating +motion trajectories in a continuous space. This paper proposes Implicit +Diffusion-based reBLurring AUgmentation (ID-Blau), utilizing a sharp image +paired with a controllable blur condition map to produce a corresponding +blurred image. We parameterize the blur patterns of a blurred image with their +orientations and magnitudes as a pixel-wise blur condition map to simulate +motion trajectories and implicitly represent them in a continuous space. By +sampling diverse blur conditions, ID-Blau can generate various blurred images +unseen in the training set. Experimental results demonstrate that ID-Blau can +produce realistic blurred images for training and thus significantly improve +performance for state-of-the-art deblurring models.",cs.CV,['cs.CV'] +SLICE: Stabilized LIME for Consistent Explanations for Image Classification,Revoti Prasad Bora · Kiran Raja · Philipp Terhörst · Raymond Veldhuis · Raghavendra Ramachandra, ,https://arxiv.org/abs/2403.17742,,2403.17742.pdf,Using Stratified Sampling to Improve LIME Image Explanations,"We investigate the use of a stratified sampling approach for LIME Image, a +popular model-agnostic explainable AI method for computer vision tasks, in +order to reduce the artifacts generated by typical Monte Carlo sampling. Such +artifacts are due to the undersampling of the dependent variable in the +synthetic neighborhood around the image being explained, which may result in +inadequate explanations due to the impossibility of fitting a linear regressor +on the sampled data. 
We then highlight a connection with the Shapley theory, +where similar arguments about undersampling and sample relevance were suggested +in the past. We derive all the formulas and adjustment factors required for an +unbiased stratified sampling estimator. Experiments show the efficacy of the +proposed approach.",cs.AI,['cs.AI'] +Leak and Learn: An Attacker's Cookbook to Train Using Leaked Data from Federated Learning,Joshua C. Zhao · Ahaan Dabholkar · Atul Sharma · Saurabh Bagchi, ,https://arxiv.org/abs/2403.18144,,2403.18144.pdf,Leak and Learn: An Attacker's Cookbook to Train Using Leaked Data from Federated Learning,"Federated learning is a decentralized learning paradigm introduced to +preserve privacy of client data. Despite this, prior work has shown that an +attacker at the server can still reconstruct the private training data using +only the client updates. These attacks are known as data reconstruction attacks +and fall into two major categories: gradient inversion (GI) and linear layer +leakage attacks (LLL). However, despite demonstrating the effectiveness of +these attacks in breaching privacy, prior work has not investigated the +usefulness of the reconstructed data for downstream tasks. In this work, we +explore data reconstruction attacks through the lens of training and improving +models with leaked data. We demonstrate the effectiveness of both GI and LLL +attacks in maliciously training models using the leaked data more accurately +than a benign federated learning strategy. Counter-intuitively, this bump in +training quality can occur despite limited reconstruction quality or a small +total number of leaked images. Finally, we show the limitations of these +attacks for downstream training, individually for GI attacks and for LLL +attacks.",cs.CR,"['cs.CR', 'cs.CV']" +Beyond Seen Primitive Concepts and Attribute-Object Compositional Learning,Nirat Saini · Khoi Pham · Abhinav Shrivastava, ,https://arxiv.org/html/2403.05924v1,,2403.05924v1.pdf,CSCNET: Class-Specified Cascaded Network for Compositional Zero-Shot Learning,"Attribute and object (A-O) disentanglement is a fundamental and critical +problem for Compositional Zero-shot Learning (CZSL), whose aim is to recognize +novel A-O compositions based on foregone knowledge. Existing methods based on +disentangled representation learning lose sight of the contextual dependency +between the A-O primitive pairs. Inspired by this, we propose a novel A-O +disentangled framework for CZSL, namely Class-specified Cascaded Network +(CSCNet). The key insight is to firstly classify one primitive and then +specifies the predicted class as a priori for guiding another primitive +recognition in a cascaded fashion. To this end, CSCNet constructs +Attribute-to-Object and Object-to-Attribute cascaded branches, in addition to a +composition branch modeling the two primitives as a whole. Notably, we devise a +parametric classifier (ParamCls) to improve the matching between visual and +semantic embeddings. 
By improving the A-O disentanglement, our framework +achieves superior results than previous competitive methods.",cs.CV,['cs.CV'] +Improving Graph Contrastive Learning via Adaptive Positive Sampling,Jiaming Zhuo · Feiyang Qin · Can Cui · Kun Fu · Bingxin Niu · Mengzhu Wang · Yuanfang Guo · Chuan Wang · Zhen Wang · Xiaochun Cao · Liang Yang, ,,https://ieeexplore.ieee.org/document/10181235,,,,,nan +ViewDiff: 3D-Consistent Image Generation with Text-To-Image Models,Lukas Höllein · Aljaž Božič · Norman Müller · David Novotny · Hung-Yu Tseng · Christian Richardt · Michael Zollhoefer · Matthias Nießner,https://lukashoel.github.io/ViewDiff/,https://arxiv.org/abs/2403.01807,,2403.01807.pdf,ViewDiff: 3D-Consistent Image Generation with Text-to-Image Models,"3D asset generation is getting massive amounts of attention, inspired by the +recent success of text-guided 2D content creation. Existing text-to-3D methods +use pretrained text-to-image diffusion models in an optimization problem or +fine-tune them on synthetic data, which often results in non-photorealistic 3D +objects without backgrounds. In this paper, we present a method that leverages +pretrained text-to-image models as a prior, and learn to generate multi-view +images in a single denoising process from real-world data. Concretely, we +propose to integrate 3D volume-rendering and cross-frame-attention layers into +each block of the existing U-Net network of the text-to-image model. Moreover, +we design an autoregressive generation that renders more 3D-consistent images +at any viewpoint. We train our model on real-world datasets of objects and +showcase its capabilities to generate instances with a variety of high-quality +shapes and textures in authentic surroundings. Compared to the existing +methods, the results generated by our method are consistent, and have favorable +visual quality (-30% FID, -37% KID).",cs.CV,['cs.CV'] +FACT: Frame-Action Cross-Attention Temporal Modeling for Efficient Action Segmentation,Zijia Lu · Ehsan Elhamifar, ,https://arxiv.org/abs/2308.14900,,2308.14900.pdf,BIT: Bi-Level Temporal Modeling for Efficient Supervised Action Segmentation,"We address the task of supervised action segmentation which aims to partition +a video into non-overlapping segments, each representing a different action. +Recent works apply transformers to perform temporal modeling at the +frame-level, which suffer from high computational cost and cannot well capture +action dependencies over long temporal horizons. To address these issues, we +propose an efficient BI-level Temporal modeling (BIT) framework that learns +explicit action tokens to represent action segments, in parallel performs +temporal modeling on frame and action levels, while maintaining a low +computational cost. Our model contains (i) a frame branch that uses convolution +to learn frame-level relationships, (ii) an action branch that uses transformer +to learn action-level dependencies with a small set of action tokens and (iii) +cross-attentions to allow communication between the two branches. We apply and +extend a set-prediction objective to allow each action token to represent one +or multiple action segments, thus can avoid learning a large number of tokens +over long videos with many segments. Thanks to the design of our action branch, +we can also seamlessly leverage textual transcripts of videos (when available) +to help action segmentation by using them to initialize the action tokens. 
We +evaluate our model on four video datasets (two egocentric and two third-person) +for action segmentation with and without transcripts, showing that BIT +significantly improves the state-of-the-art accuracy with much lower +computational cost (30 times faster) compared to existing transformer-based +methods.",cs.CV,['cs.CV'] +Misalignment-Robust Frequency Distribution Loss for Image Transformation,Zhangkai Ni · Juncheng Wu · Zian Wang · Wenhan Yang · Hanli Wang · Lin Ma, ,https://arxiv.org/html/2402.18192v1,,2402.18192v1.pdf,Misalignment-Robust Frequency Distribution Loss for Image Transformation,"This paper aims to address a common challenge in deep learning-based image +transformation methods, such as image enhancement and super-resolution, which +heavily rely on precisely aligned paired datasets with pixel-level alignments. +However, creating precisely aligned paired images presents significant +challenges and hinders the advancement of methods trained on such data. To +overcome this challenge, this paper introduces a novel and simple Frequency +Distribution Loss (FDL) for computing distribution distance within the +frequency domain. Specifically, we transform image features into the frequency +domain using Discrete Fourier Transformation (DFT). Subsequently, frequency +components (amplitude and phase) are processed separately to form the FDL loss +function. Our method is empirically proven effective as a training constraint +due to the thoughtful utilization of global information in the frequency +domain. Extensive experimental evaluations, focusing on image enhancement and +super-resolution tasks, demonstrate that FDL outperforms existing +misalignment-robust loss functions. Furthermore, we explore the potential of +our FDL for image style transfer that relies solely on completely misaligned +data. Our code is available at: https://github.com/eezkni/FDL",cs.CV,"['cs.CV', 'eess.IV']" +DIEM: Decomposition-Integration Enhancing Multimodal Insights,Xinyi Jiang · Guoming Wang · Junhao Guo · Juncheng Li · Wenqiao Zhang · Rongxing Lu · Siliang Tang, ,,https://ieeexplore.ieee.org/document/10423001,,,,,nan +Can Language Beat Numerical Regression? Language-Based Multimodal Trajectory Prediction,Inhwan Bae · Junoh Lee · Hae-Gon Jeon,https://github.com/InhwanBae/LMTrajectory,https://arxiv.org/abs/2403.18447,,2403.18447.pdf,Can Language Beat Numerical Regression? Language-Based Multimodal Trajectory Prediction,"Language models have demonstrated impressive ability in context understanding +and generative performance. Inspired by the recent success of language +foundation models, in this paper, we propose LMTraj (Language-based Multimodal +Trajectory predictor), which recasts the trajectory prediction task into a sort +of question-answering problem. Departing from traditional numerical regression +models, which treat the trajectory coordinate sequence as continuous signals, +we consider them as discrete signals like text prompts. Specially, we first +transform an input space for the trajectory coordinate into the natural +language space. Here, the entire time-series trajectories of pedestrians are +converted into a text prompt, and scene images are described as text +information through image captioning. The transformed numerical and image data +are then wrapped into the question-answering template for use in a language +model. 
Next, to guide the language model in understanding and reasoning +high-level knowledge, such as scene context and social relationships between +pedestrians, we introduce an auxiliary multi-task question and answering. We +then train a numerical tokenizer with the prompt data. We encourage the +tokenizer to separate the integer and decimal parts well, and leverage it to +capture correlations between the consecutive numbers in the language model. +Lastly, we train the language model using the numerical tokenizer and all of +the question-answer prompts. Here, we propose a beam-search-based most-likely +prediction and a temperature-based multimodal prediction to implement both +deterministic and stochastic inferences. Applying our LMTraj, we show that the +language-based model can be a powerful pedestrian trajectory predictor, and +outperforms existing numerical-based predictor methods. Code is publicly +available at https://github.com/inhwanbae/LMTrajectory .",cs.CL,"['cs.CL', 'cs.CV', 'cs.LG', 'cs.RO']" +Dynamic Inertial Poser (DynaIP): Part-Based Motion Dynamics Learning for Enhanced Human Pose Estimation with Sparse Inertial Sensors,Yu Zhang · Songpengcheng Xia · Lei Chu · Jiarui Yang · Qi Wu · Ling Pei, ,https://arxiv.org/abs/2312.02196,,2312.02196.pdf,Dynamic Inertial Poser (DynaIP): Part-Based Motion Dynamics Learning for Enhanced Human Pose Estimation with Sparse Inertial Sensors,"This paper introduces a novel human pose estimation approach using sparse +inertial sensors, addressing the shortcomings of previous methods reliant on +synthetic data. It leverages a diverse array of real inertial motion capture +data from different skeleton formats to improve motion diversity and model +generalization. This method features two innovative components: a +pseudo-velocity regression model for dynamic motion capture with inertial +sensors, and a part-based model dividing the body and sensor data into three +regions, each focusing on their unique characteristics. The approach +demonstrates superior performance over state-of-the-art models across five +public datasets, notably reducing pose error by 19\% on the DIP-IMU dataset, +thus representing a significant improvement in inertial sensor-based human pose +estimation. Our codes are available at {\url{https://github.com/dx118/dynaip}}.",cs.CV,['cs.CV'] +Learning to Remove Wrinkled Transparent Film with Polarized Prior,Jiaqi Tang · RUIZHENG WU · Xiaogang Xu · Sixing Hu · Ying-Cong Chen,https://jqt.me/_FilmRemoval_/,https://arxiv.org/abs/2403.04368v1,,2403.04368v1.pdf,Learning to Remove Wrinkled Transparent Film with Polarized Prior,"In this paper, we study a new problem, Film Removal (FR), which attempts to +remove the interference of wrinkled transparent films and reconstruct the +original information under films for industrial recognition systems. We first +physically model the imaging of industrial materials covered by the film. +Considering the specular highlight from the film can be effectively recorded by +the polarized camera, we build a practical dataset with polarization +information containing paired data with and without transparent film. We aim to +remove interference from the film (specular highlights and other degradations) +with an end-to-end framework. To locate the specular highlight, we use an angle +estimation network to optimize the polarization angle with the minimized +specular highlight. The image with minimized specular highlight is set as a +prior for supporting the reconstruction network. 
Based on the prior and the +polarized images, the reconstruction network can decouple all degradations from +the film. Extensive experiments show that our framework achieves SOTA +performance in both image reconstruction and industrial downstream tasks. Our +code will be released at \url{https://github.com/jqtangust/FilmRemoval}.",cs.CV,['cs.CV'] +FCS: Feature Calibration and Separation for Non-Exemplar Class Incremental Learning,Qiwei Li · Yuxin Peng · Jiahuan Zhou, ,https://arxiv.org/abs/2312.12722,,2312.12722.pdf,Fine-Grained Knowledge Selection and Restoration for Non-Exemplar Class Incremental Learning,"Non-exemplar class incremental learning aims to learn both the new and old +tasks without accessing any training data from the past. This strict +restriction enlarges the difficulty of alleviating catastrophic forgetting +since all techniques can only be applied to current task data. Considering this +challenge, we propose a novel framework of fine-grained knowledge selection and +restoration. The conventional knowledge distillation-based methods place too +strict constraints on the network parameters and features to prevent +forgetting, which limits the training of new tasks. To loose this constraint, +we proposed a novel fine-grained selective patch-level distillation to +adaptively balance plasticity and stability. Some task-agnostic patches can be +used to preserve the decision boundary of the old task. While some patches +containing the important foreground are favorable for learning the new task. + Moreover, we employ a task-agnostic mechanism to generate more realistic +prototypes of old tasks with the current task sample for reducing classifier +bias for fine-grained knowledge restoration. Extensive experiments on CIFAR100, +TinyImageNet and ImageNet-Subset demonstrate the effectiveness of our method. +Code is available at https://github.com/scok30/vit-cil.",cs.CV,['cs.CV'] +Video-Based Human Pose Regression via Decoupled Space-Time Aggregation,Jijie He · Wenwu Yang,https://github.com/zgspose/DSTA,https://arxiv.org/abs/2403.19926,,2403.19926.pdf,Video-Based Human Pose Regression via Decoupled Space-Time Aggregation,"By leveraging temporal dependency in video sequences, multi-frame human pose +estimation algorithms have demonstrated remarkable results in complicated +situations, such as occlusion, motion blur, and video defocus. These algorithms +are predominantly based on heatmaps, resulting in high computation and storage +requirements per frame, which limits their flexibility and real-time +application in video scenarios, particularly on edge devices. In this paper, we +develop an efficient and effective video-based human pose regression method, +which bypasses intermediate representations such as heatmaps and instead +directly maps the input to the output joint coordinates. Despite the inherent +spatial correlation among adjacent joints of the human pose, the temporal +trajectory of each individual joint exhibits relative independence. In light of +this, we propose a novel Decoupled Space-Time Aggregation network (DSTA) to +separately capture the spatial contexts between adjacent joints and the +temporal cues of each individual joint, thereby avoiding the conflation of +spatiotemporal dimensions. Concretely, DSTA learns a dedicated feature token +for each joint to facilitate the modeling of their spatiotemporal dependencies. 
+With the proposed joint-wise local-awareness attention mechanism, our method is +capable of efficiently and flexibly utilizing the spatial dependency of +adjacent joints and the temporal dependency of each joint itself. Extensive +experiments demonstrate the superiority of our method. Compared to previous +regression-based single-frame human pose estimation methods, DSTA significantly +enhances performance, achieving an 8.9 mAP improvement on PoseTrack2017. +Furthermore, our approach either surpasses or is on par with the +state-of-the-art heatmap-based multi-frame human pose estimation methods. +Project page: https://github.com/zgspose/DSTA.",cs.CV,"['cs.CV', 'I.4.9']" +$L_0$-Sampler: An $L_{0}$ Model Guided Volume Sampling for NeRF,Liangchen Li · Juyong Zhang, ,https://arxiv.org/abs/2311.07044,,2311.07044.pdf,$L_0$-Sampler: An $L_{0}$ Model Guided Volume Sampling for NeRF,"Since being proposed, Neural Radiance Fields (NeRF) have achieved great +success in related tasks, mainly adopting the hierarchical volume sampling +(HVS) strategy for volume rendering. However, the HVS of NeRF approximates +distributions using piecewise constant functions, which provides a relatively +rough estimation. Based on the observation that a well-trained weight function +$w(t)$ and the $L_0$ distance between points and the surface have very high +similarity, we propose $L_0$-Sampler by incorporating the $L_0$ model into +$w(t)$ to guide the sampling process. Specifically, we propose to use piecewise +exponential functions rather than piecewise constant functions for +interpolation, which can not only approximate quasi-$L_0$ weight distributions +along rays quite well but also can be easily implemented with few lines of code +without additional computational burden. Stable performance improvements can be +achieved by applying $L_0$-Sampler to NeRF and its related tasks like 3D +reconstruction. Code is available at https://ustc3dv.github.io/L0-Sampler/ .",cs.CV,"['cs.CV', 'cs.GR']" +3DInAction: Understanding Human Actions in 3D Point Clouds,Yizhak Ben-Shabat · Oren Shrout · Stephen Gould, ,https://arxiv.org/html/2303.06346v2,,2303.06346v2.pdf,3DInAction: Understanding Human Actions in 3D Point Clouds,"We propose a novel method for 3D point cloud action recognition. +Understanding human actions in RGB videos has been widely studied in recent +years, however, its 3D point cloud counterpart remains under-explored. This is +mostly due to the inherent limitation of the point cloud data modality -- lack +of structure, permutation invariance, and varying number of points -- which +makes it difficult to learn a spatio-temporal representation. To address this +limitation, we propose the 3DinAction pipeline that first estimates patches +moving in time (t-patches) as a key building block, alongside a hierarchical +architecture that learns an informative spatio-temporal representation. We show +that our method achieves improved performance on existing datasets, including +DFAUST and IKEA ASM. Code is publicly available at +https://github.com/sitzikbs/3dincaction.",cs.CV,['cs.CV'] +Poly Kernel Inception Network for Remote Sensing Detection,Xinhao Cai · Qiuxia Lai · Yuwei Wang · Wenguan Wang · Zeren Sun · Yazhou Yao, ,https://arxiv.org/abs/2403.06258,,2403.06258.pdf,Poly Kernel Inception Network for Remote Sensing Detection,"Object detection in remote sensing images (RSIs) often suffers from several +increasing challenges, including the large variation in object scales and the +diverse-ranging context. 
Prior methods tried to address these challenges by +expanding the spatial receptive field of the backbone, either through +large-kernel convolution or dilated convolution. However, the former typically +introduces considerable background noise, while the latter risks generating +overly sparse feature representations. In this paper, we introduce the Poly +Kernel Inception Network (PKINet) to handle the above challenges. PKINet +employs multi-scale convolution kernels without dilation to extract object +features of varying scales and capture local context. In addition, a Context +Anchor Attention (CAA) module is introduced in parallel to capture long-range +contextual information. These two components work jointly to advance the +performance of PKINet on four challenging remote sensing detection benchmarks, +namely DOTA-v1.0, DOTA-v1.5, HRSC2016, and DIOR-R.",cs.CV,['cs.CV'] +Mitigating Noisy Correspondence by Geometrical Structure Consistency Learning,Zihua Zhao · Mengxi Chen · Tianjie Dai · Jiangchao Yao · Bo Han · Ya Zhang · Yanfeng Wang, ,https://arxiv.org/abs/2405.16996,,2405.16996.pdf,Mitigating Noisy Correspondence by Geometrical Structure Consistency Learning,"Noisy correspondence that refers to mismatches in cross-modal data pairs, is +prevalent on human-annotated or web-crawled datasets. Prior approaches to +leverage such data mainly consider the application of uni-modal noisy label +learning without amending the impact on both cross-modal and intra-modal +geometrical structures in multimodal learning. Actually, we find that both +structures are effective to discriminate noisy correspondence through +structural differences when being well-established. Inspired by this +observation, we introduce a Geometrical Structure Consistency (GSC) method to +infer the true correspondence. Specifically, GSC ensures the preservation of +geometrical structures within and between modalities, allowing for the accurate +discrimination of noisy samples based on structural differences. Utilizing +these inferred true correspondence labels, GSC refines the learning of +geometrical structures by filtering out the noisy samples. Experiments across +four cross-modal datasets confirm that GSC effectively identifies noisy samples +and significantly outperforms the current leading methods.",cs.CV,['cs.CV'] +Rethinking Diffusion Model for Multi-Contrast MRI Super-Resolution,Guangyuan Li · Chen Rao · Juncheng Mo · Zhanjie Zhang · Wei Xing · Lei Zhao, ,https://arxiv.org/abs/2404.04785,,2404.04785.pdf,Rethinking Diffusion Model for Multi-Contrast MRI Super-Resolution,"Recently, diffusion models (DM) have been applied in magnetic resonance +imaging (MRI) super-resolution (SR) reconstruction, exhibiting impressive +performance, especially with regard to detailed reconstruction. However, the +current DM-based SR reconstruction methods still face the following issues: (1) +They require a large number of iterations to reconstruct the final image, which +is inefficient and consumes a significant amount of computational resources. +(2) The results reconstructed by these methods are often misaligned with the +real high-resolution images, leading to remarkable distortion in the +reconstructed MR images. To address the aforementioned issues, we propose an +efficient diffusion model for multi-contrast MRI SR, named as DiffMSR. +Specifically, we apply DM in a highly compact low-dimensional latent space to +generate prior knowledge with high-frequency detail information. 
The highly +compact latent space ensures that DM requires only a few simple iterations to +produce accurate prior knowledge. In addition, we design the Prior-Guide Large +Window Transformer (PLWformer) as the decoder for DM, which can extend the +receptive field while fully utilizing the prior knowledge generated by DM to +ensure that the reconstructed MR image remains undistorted. Extensive +experiments on public and clinical datasets demonstrate that our DiffMSR +outperforms state-of-the-art methods.",cs.CV,['cs.CV'] +HIR-Diff: Unsupervised Hyperspectral Image Restoration Via Improved Diffusion Models,Li Pang · Xiangyu Rui · Long Cui · Hongzhong Wang · Deyu Meng · Xiangyong Cao, ,https://arxiv.org/abs/2402.15865,,2402.15865.pdf,HIR-Diff: Unsupervised Hyperspectral Image Restoration Via Improved Diffusion Models,"Hyperspectral image (HSI) restoration aims at recovering clean images from +degraded observations and plays a vital role in downstream tasks. Existing +model-based methods have limitations in accurately modeling the complex image +characteristics with handcraft priors, and deep learning-based methods suffer +from poor generalization ability. To alleviate these issues, this paper +proposes an unsupervised HSI restoration framework with pre-trained diffusion +model (HIR-Diff), which restores the clean HSIs from the product of two +low-rank components, i.e., the reduced image and the coefficient matrix. +Specifically, the reduced image, which has a low spectral dimension, lies in +the image field and can be inferred from our improved diffusion model where a +new guidance function with total variation (TV) prior is designed to ensure +that the reduced image can be well sampled. The coefficient matrix can be +effectively pre-estimated based on singular value decomposition (SVD) and +rank-revealing QR (RRQR) factorization. Furthermore, a novel exponential noise +schedule is proposed to accelerate the restoration process (about 5$\times$ +acceleration for denoising) with little performance decrease. Extensive +experimental results validate the superiority of our method in both performance +and speed on a variety of HSI restoration tasks, including HSI denoising, noisy +HSI super-resolution, and noisy HSI inpainting. The code is available at +https://github.com/LiPang/HIRDiff.",cs.CV,"['cs.CV', 'eess.IV']" +Zero-Shot Structure-Preserving Diffusion Model for High Dynamic Range Tone Mapping,Ruoxi Zhu · Shusong Xu · Peiye Liu · Sicheng Li · Yanheng Lu · Dimin Niu · Zihao Liu · Zihao Meng · Li Zhiyong · Xinhua Chen · Yibo Fan, ,https://arxiv.org/abs/2309.16975,,2309.16975.pdf,Perceptual Tone Mapping Model for High Dynamic Range Imaging,"One of the key challenges in tone mapping is to preserve the perceptual +quality of high dynamic range (HDR) images when mapping them to standard +dynamic range (SDR) displays. Traditional tone mapping operators (TMOs) +compress the luminance of HDR images without considering the surround and +display conditions emanating into suboptimal results. Current research +addresses this challenge by incorporating perceptual color appearance +attributes. In this work, we propose a TMO (TMOz) that leverages CIECAM16 +perceptual attributes, i.e., brightness, colorfulness, and hue. TMOz accounts +for the effects of both the surround and the display conditions to achieve more +optimal colorfulness reproduction. 
The perceptual brightness is compressed, and +the perceptual color scales, i.e., colorfulness and hue are derived from HDR +images by employing CIECAM16 color adaptation equations. A psychophysical +experiment was conducted to automate the brightness compression parameter. The +model employs fully automatic and adaptive approach, obviating the requirement +for manual parameter selection. TMOz was evaluated in terms of contrast, +colorfulness and overall image quality. The objective and subjective evaluation +methods revealed that the proposed model outperformed the state-of-the-art +TMOs.",cs.CV,"['cs.CV', 'eess.IV']" +VTQA: Visual Text Question Answering via Entity Alignment and Cross-Media Reasoning,Kang Chen · Xiangqian Wu,https://visual-text-qa.github.io/,,https://dl.acm.org/doi/pdf/10.1145/3581783.3612850,,,,,nan +Composing Object Relations and Attributes for Image-Text Matching,Khoi Pham · Chuong Huynh · Ser-Nam Lim · Abhinav Shrivastava, ,,https://hmchuong.github.io/,,,,,nan +HashPoint: Accelerated Point Searching and Sampling for Neural Rendering,Jiahao Ma · Miaomiao Liu · David Ahmedt-Aristizabal · Chuong Nguyen,https://jiahao-ma.github.io/hashpoint/,https://arxiv.org/abs/2404.14044,,2404.14044.pdf,HashPoint: Accelerated Point Searching and Sampling for Neural Rendering,"In this paper, we address the problem of efficient point searching and +sampling for volume neural rendering. Within this realm, two typical approaches +are employed: rasterization and ray tracing. The rasterization-based methods +enable real-time rendering at the cost of increased memory and lower fidelity. +In contrast, the ray-tracing-based methods yield superior quality but demand +longer rendering time. We solve this problem by our HashPoint method combining +these two strategies, leveraging rasterization for efficient point searching +and sampling, and ray marching for rendering. Our method optimizes point +searching by rasterizing points within the camera's view, organizing them in a +hash table, and facilitating rapid searches. Notably, we accelerate the +rendering process by adaptive sampling on the primary surface encountered by +the ray. Our approach yields substantial speed-up for a range of +state-of-the-art ray-tracing-based methods, maintaining equivalent or superior +accuracy across synthetic and real test datasets. The code will be available at +https://jiahao-ma.github.io/hashpoint/.",cs.CV,['cs.CV'] +Improving Depth Completion via Depth Feature Upsampling,Yufei Wang · Ge Zhang · Shaoqian Wang · Bo Li · Qi Liu · Le Hui · Yuchao Dai, ,https://arxiv.org/abs/2310.08956,,2310.08956.pdf,LRRU: Long-short Range Recurrent Updating Networks for Depth Completion,"Existing deep learning-based depth completion methods generally employ +massive stacked layers to predict the dense depth map from sparse input data. +Although such approaches greatly advance this task, their accompanied huge +computational complexity hinders their practical applications. To accomplish +depth completion more efficiently, we propose a novel lightweight deep network +framework, the Long-short Range Recurrent Updating (LRRU) network. Without +learning complex feature representations, LRRU first roughly fills the sparse +input to obtain an initial dense depth map, and then iteratively updates it +through learned spatially-variant kernels. 
Our iterative update process is +content-adaptive and highly flexible, where the kernel weights are learned by +jointly considering the guidance RGB images and the depth map to be updated, +and large-to-small kernel scopes are dynamically adjusted to capture +long-to-short range dependencies. Our initial depth map has coarse but complete +scene depth information, which helps relieve the burden of directly regressing +the dense depth from sparse ones, while our proposed method can effectively +refine it to an accurate depth map with less learnable parameters and inference +time. Experimental results demonstrate that our proposed LRRU variants achieve +state-of-the-art performance across different parameter regimes. In particular, +the LRRU-Base model outperforms competing approaches on the NYUv2 dataset, and +ranks 1st on the KITTI depth completion benchmark at the time of submission. +Project page: https://npucvr.github.io/LRRU/.",cs.CV,['cs.CV'] +Improving Out-of-Distribution Generalization in Graphs via Hierarchical Semantic Environments,Yinhua Piao · Sangseon Lee · Yijingxiu Lu · Sun Kim,https://github.com/qkrdmsghk/GOODHSE,https://arxiv.org/abs/2403.01773,,2403.01773.pdf,Improving out-of-distribution generalization in graphs via hierarchical semantic environments,"Out-of-distribution (OOD) generalization in the graph domain is challenging +due to complex distribution shifts and a lack of environmental contexts. Recent +methods attempt to enhance graph OOD generalization by generating flat +environments. However, such flat environments come with inherent limitations to +capture more complex data distributions. Considering the DrugOOD dataset, which +contains diverse training environments (e.g., scaffold, size, etc.), flat +contexts cannot sufficiently address its high heterogeneity. Thus, a new +challenge is posed to generate more semantically enriched environments to +enhance graph invariant learning for handling distribution shifts. In this +paper, we propose a novel approach to generate hierarchical semantic +environments for each graph. Firstly, given an input graph, we explicitly +extract variant subgraphs from the input graph to generate proxy predictions on +local environments. Then, stochastic attention mechanisms are employed to +re-extract the subgraphs for regenerating global environments in a hierarchical +manner. In addition, we introduce a new learning objective that guides our +model to learn the diversity of environments within the same hierarchy while +maintaining consistency across different hierarchies. This approach enables our +model to consider the relationships between environments and facilitates robust +graph invariant learning. Extensive experiments on real-world graph data have +demonstrated the effectiveness of our framework. Particularly, in the +challenging dataset DrugOOD, our method achieves up to 1.29% and 2.83% +improvement over the best baselines on IC50 and EC50 prediction tasks, +respectively.",cs.LG,"['cs.LG', 'cs.AI']" +Learning to Rematch Mismatched Pairs for Robust Cross-Modal Retrieval,Haochen Han · Qinghua Zheng · Guang Dai · Minnan Luo · Jingdong Wang,https://github.com/hhc1997/L2RM,https://arxiv.org/abs/2403.05105,,2403.05105.pdf,Learning to Rematch Mismatched Pairs for Robust Cross-Modal Retrieval,"Collecting well-matched multimedia datasets is crucial for training +cross-modal retrieval models. 
However, in real-world scenarios, massive +multimodal data are harvested from the Internet, which inevitably contains +Partially Mismatched Pairs (PMPs). Undoubtedly, such semantical irrelevant data +will remarkably harm the cross-modal retrieval performance. Previous efforts +tend to mitigate this problem by estimating a soft correspondence to +down-weight the contribution of PMPs. In this paper, we aim to address this +challenge from a new perspective: the potential semantic similarity among +unpaired samples makes it possible to excavate useful knowledge from mismatched +pairs. To achieve this, we propose L2RM, a general framework based on Optimal +Transport (OT) that learns to rematch mismatched pairs. In detail, L2RM aims to +generate refined alignments by seeking a minimal-cost transport plan across +different modalities. To formalize the rematching idea in OT, first, we propose +a self-supervised cost function that automatically learns from explicit +similarity-cost mapping relation. Second, we present to model a partial OT +problem while restricting the transport among false positives to further boost +refined alignments. Extensive experiments on three benchmarks demonstrate our +L2RM significantly improves the robustness against PMPs for existing models. +The code is available at https://github.com/hhc1997/L2RM.",cs.CV,"['cs.CV', 'cs.AI', 'cs.MM']" +Spatio-Temporal Turbulence Mitigation: A Translational Perspective,Xingguang Zhang · Nicholas M Chimitt · Yiheng Chi · Zhiyuan Mao · Stanley H. Chan, ,https://arxiv.org/abs/2401.04244,,2401.04244.pdf,Spatio-Temporal Turbulence Mitigation: A Translational Perspective,"Recovering images distorted by atmospheric turbulence is a challenging +inverse problem due to the stochastic nature of turbulence. Although numerous +turbulence mitigation (TM) algorithms have been proposed, their efficiency and +generalization to real-world dynamic scenarios remain severely limited. +Building upon the intuitions of classical TM algorithms, we present the Deep +Atmospheric TUrbulence Mitigation network (DATUM). DATUM aims to overcome major +challenges when transitioning from classical to deep learning approaches. By +carefully integrating the merits of classical multi-frame TM methods into a +deep network structure, we demonstrate that DATUM can efficiently perform +long-range temporal aggregation using a recurrent fashion, while deformable +attention and temporal-channel attention seamlessly facilitate pixel +registration and lucky imaging. With additional supervision, tilt and blur +degradation can be jointly mitigated. These inductive biases empower DATUM to +significantly outperform existing methods while delivering a tenfold increase +in processing speed. A large-scale training dataset, ATSyn, is presented as a +co-invention to enable generalization in real turbulence. Our code and datasets +are available at https://xg416.github.io/DATUM.",eess.IV,"['eess.IV', 'cs.CV']" +Seamless Human Motion Composition with Blended Positional Encodings,German Barquero · Sergio Escalera · Cristina Palmero,https://barquerogerman.github.io/FlowMDM/,https://arxiv.org/abs/2402.15509,,2402.15509.pdf,Seamless Human Motion Composition with Blended Positional Encodings,"Conditional human motion generation is an important topic with many +applications in virtual reality, gaming, and robotics. While prior works have +focused on generating motion guided by text, music, or scenes, these typically +result in isolated motions confined to short durations. 
Instead, we address the +generation of long, continuous sequences guided by a series of varying textual +descriptions. In this context, we introduce FlowMDM, the first diffusion-based +model that generates seamless Human Motion Compositions (HMC) without any +postprocessing or redundant denoising steps. For this, we introduce the Blended +Positional Encodings, a technique that leverages both absolute and relative +positional encodings in the denoising chain. More specifically, global motion +coherence is recovered at the absolute stage, whereas smooth and realistic +transitions are built at the relative stage. As a result, we achieve +state-of-the-art results in terms of accuracy, realism, and smoothness on the +Babel and HumanML3D datasets. FlowMDM excels when trained with only a single +description per motion sequence thanks to its Pose-Centric Cross-ATtention, +which makes it robust against varying text descriptions at inference time. +Finally, to address the limitations of existing HMC metrics, we propose two new +metrics: the Peak Jerk and the Area Under the Jerk, to detect abrupt +transitions.",cs.CV,['cs.CV'] +CRKD: Enhanced Camera-Radar Object Detection with Cross-modality Knowledge Distillation,Lingjun Zhao · Jingyu Song · Katherine Skinner,https://song-jingyu.github.io/CRKD,https://arxiv.org/abs/2403.19104,,2403.19104.pdf,CRKD: Enhanced Camera-Radar Object Detection with Cross-modality Knowledge Distillation,"In the field of 3D object detection for autonomous driving, LiDAR-Camera (LC) +fusion is the top-performing sensor configuration. Still, LiDAR is relatively +high cost, which hinders adoption of this technology for consumer automobiles. +Alternatively, camera and radar are commonly deployed on vehicles already on +the road today, but performance of Camera-Radar (CR) fusion falls behind LC +fusion. In this work, we propose Camera-Radar Knowledge Distillation (CRKD) to +bridge the performance gap between LC and CR detectors with a novel +cross-modality KD framework. We use the Bird's-Eye-View (BEV) representation as +the shared feature space to enable effective knowledge distillation. To +accommodate the unique cross-modality KD path, we propose four distillation +losses to help the student learn crucial features from the teacher model. We +present extensive evaluations on the nuScenes dataset to demonstrate the +effectiveness of the proposed CRKD framework. The project page for CRKD is +https://song-jingyu.github.io/CRKD.",cs.CV,"['cs.CV', 'cs.RO']" +RCooper: A Real-world Large-scale Dataset for Roadside Cooperative Perception,Ruiyang Hao · Siqi Fan · Yingru Dai · Zhenlin Zhang · Chenxi Li · YuntianWang · Haibao Yu · Wenxian Yang · Jirui Yuan · Zaiqing Nie,https://github.com/AIR-THU/DAIR-RCooper,https://arxiv.org/abs/2403.10145,,2403.10145.pdf,RCooper: A Real-world Large-scale Dataset for Roadside Cooperative Perception,"The value of roadside perception, which could extend the boundaries of +autonomous driving and traffic management, has gradually become more prominent +and acknowledged in recent years. However, existing roadside perception +approaches only focus on the single-infrastructure sensor system, which cannot +realize a comprehensive understanding of a traffic area because of the limited +sensing range and blind spots. Orienting high-quality roadside perception, we +need Roadside Cooperative Perception (RCooper) to achieve practical +area-coverage roadside perception for restricted traffic areas. 
Rcooper has its +own domain-specific challenges, but further exploration is hindered due to the +lack of datasets. We hence release the first real-world, large-scale RCooper +dataset to bloom the research on practical roadside cooperative perception, +including detection and tracking. The manually annotated dataset comprises 50k +images and 30k point clouds, including two representative traffic scenes (i.e., +intersection and corridor). The constructed benchmarks prove the effectiveness +of roadside cooperation perception and demonstrate the direction of further +research. Codes and dataset can be accessed at: +https://github.com/AIR-THU/DAIR-RCooper.",cs.CV,"['cs.CV', 'cs.RO', 'I.4.8; I.5.4']" +Scene Adaptive Sparse Transformer for Event-based Object Detection,Yansong Peng · Li Hebei · Yueyi Zhang · Xiaoyan Sun · Feng Wu, ,https://arxiv.org/abs/2404.01882,,2404.01882.pdf,Scene Adaptive Sparse Transformer for Event-based Object Detection,"While recent Transformer-based approaches have shown impressive performances +on event-based object detection tasks, their high computational costs still +diminish the low power consumption advantage of event cameras. Image-based +works attempt to reduce these costs by introducing sparse Transformers. +However, they display inadequate sparsity and adaptability when applied to +event-based object detection, since these approaches cannot balance the fine +granularity of token-level sparsification and the efficiency of window-based +Transformers, leading to reduced performance and efficiency. Furthermore, they +lack scene-specific sparsity optimization, resulting in information loss and a +lower recall rate. To overcome these limitations, we propose the Scene Adaptive +Sparse Transformer (SAST). SAST enables window-token co-sparsification, +significantly enhancing fault tolerance and reducing computational overhead. +Leveraging the innovative scoring and selection modules, along with the Masked +Sparse Window Self-Attention, SAST showcases remarkable scene-aware +adaptability: It focuses only on important objects and dynamically optimizes +sparsity level according to scene complexity, maintaining a remarkable balance +between performance and computational cost. The evaluation results show that +SAST outperforms all other dense and sparse networks in both performance and +efficiency on two large-scale event-based object detection datasets (1Mpx and +Gen1). Code: https://github.com/Peterande/SAST",cs.CV,['cs.CV'] +Local-consistent Transformation Learning for Rotation-invariant Point Cloud Analysis,Yiyang Chen · Lunhao Duan · Shanshan Zhao · Changxing Ding · Dacheng Tao,https://github.com/wdttt/LocoTrans,https://arxiv.org/abs/2403.11113,,2403.11113.pdf,Local-consistent Transformation Learning for Rotation-invariant Point Cloud Analysis,"Rotation invariance is an important requirement for point shape analysis. To +achieve this, current state-of-the-art methods attempt to construct the local +rotation-invariant representation through learning or defining the local +reference frame (LRF). Although efficient, these LRF-based methods suffer from +perturbation of local geometric relations, resulting in suboptimal local +rotation invariance. To alleviate this issue, we propose a Local-consistent +Transformation (LocoTrans) learning strategy. Specifically, we first construct +the local-consistent reference frame (LCRF) by considering the symmetry of the +two axes in LRF. 
In comparison with previous LRFs, our LCRF is able to preserve +local geometric relationships better through performing local-consistent +transformation. However, as the consistency only exists in local regions, the +relative pose information is still lost in the intermediate layers of the +network. We mitigate such a relative pose issue by developing a relative pose +recovery (RPR) module. RPR aims to restore the relative pose between adjacent +transformed patches. Equipped with LCRF and RPR, our LocoTrans is capable of +learning local-consistent transformation and preserving local geometry, which +benefits rotation invariance learning. Competitive performance under arbitrary +rotations on both shape classification and part segmentation tasks and +ablations can demonstrate the effectiveness of our method. Code will be +available publicly at https://github.com/wdttt/LocoTrans.",cs.CV,['cs.CV'] +Context-based and Diversity-driven Specificity in Compositional Zero-Shot Learning,Yun Li · Zhe Liu · Hang Chen · Lina Yao, ,https://arxiv.org/abs/2402.17251,,2402.17251.pdf,Context-based and Diversity-driven Specificity in Compositional Zero-Shot Learning,"Compositional Zero-Shot Learning (CZSL) aims to recognize unseen +attribute-object pairs based on a limited set of observed examples. Current +CZSL methodologies, despite their advancements, tend to neglect the distinct +specificity levels present in attributes. For instance, given images of sliced +strawberries, they may fail to prioritize `Sliced-Strawberry' over a generic +`Red-Strawberry', despite the former being more informative. They also suffer +from ballooning search space when shifting from Close-World (CW) to Open-World +(OW) CZSL. To address the issues, we introduce the Context-based and +Diversity-driven Specificity learning framework for CZSL (CDS-CZSL). Our +framework evaluates the specificity of attributes by considering the diversity +of objects they apply to and their related context. This novel approach allows +for more accurate predictions by emphasizing specific attribute-object pairs +and improves composition filtering in OW-CZSL. We conduct experiments in both +CW and OW scenarios, and our model achieves state-of-the-art results across +three datasets.",cs.CV,['cs.CV'] +MemFlow: Optical Flow Estimation and Prediction with Memory,Qiaole Dong · Yanwei Fu,https://dqiaole.github.io/MemFlow/,https://arxiv.org/abs/2404.04808,,2404.04808.pdf,MemFlow: Optical Flow Estimation and Prediction with Memory,"Optical flow is a classical task that is important to the vision community. +Classical optical flow estimation uses two frames as input, whilst some recent +methods consider multiple frames to explicitly model long-range information. +The former ones limit their ability to fully leverage temporal coherence along +the video sequence; and the latter ones incur heavy computational overhead, +typically not possible for real-time flow estimation. Some multi-frame-based +approaches even necessitate unseen future frames for current estimation, +compromising real-time applicability in safety-critical scenarios. To this end, +we present MemFlow, a real-time method for optical flow estimation and +prediction with memory. Our method enables memory read-out and update modules +for aggregating historical motion information in real-time. Furthermore, we +integrate resolution-adaptive re-scaling to accommodate diverse video +resolutions. Besides, our approach seamlessly extends to the future prediction +of optical flow based on past observations. 
Leveraging effective historical +motion aggregation, our method outperforms VideoFlow with fewer parameters and +faster inference speed on Sintel and KITTI-15 datasets in terms of +generalization performance. At the time of submission, MemFlow also leads in +performance on the 1080p Spring dataset. Codes and models will be available at: +https://dqiaole.github.io/MemFlow/.",cs.CV,['cs.CV'] +H-ViT: A Hierarchical Vision Transformer for Deformable Image Registration,Morteza Ghahremani · Mohammad Khateri · Bailiang Jian · Benedikt Wiestler · Ehsan Adeli · Christian Wachinger, ,https://arxiv.org/abs/2306.05688,,2306.05688.pdf,ModeT: Learning Deformable Image Registration via Motion Decomposition Transformer,"The Transformer structures have been widely used in computer vision and have +recently made an impact in the area of medical image registration. However, the +use of Transformer in most registration networks is straightforward. These +networks often merely use the attention mechanism to boost the feature learning +as the segmentation networks do, but do not sufficiently design to be adapted +for the registration task. In this paper, we propose a novel motion +decomposition Transformer (ModeT) to explicitly model multiple motion +modalities by fully exploiting the intrinsic capability of the Transformer +structure for deformation estimation. The proposed ModeT naturally transforms +the multi-head neighborhood attention relationship into the multi-coordinate +relationship to model multiple motion modes. Then the competitive weighting +module (CWM) fuses multiple deformation sub-fields to generate the resulting +deformation field. Extensive experiments on two public brain magnetic resonance +imaging (MRI) datasets show that our method outperforms current +state-of-the-art registration networks and Transformers, demonstrating the +potential of our ModeT for the challenging non-rigid deformation estimation +problem. The benchmarks and our code are publicly available at +https://github.com/ZAX130/SmileCode.",cs.CV,['cs.CV'] +Entangled View-Epipolar Information Aggregation for Generalizable Neural Radiance Fields,Zhiyuan Min · Yawei Luo · Wei Yang · Yuesong Wang · Yi Yang,https://github.com/tatakai1/EVENeRF,https://arxiv.org/abs/2311.11845,,2311.11845.pdf,Entangled View-Epipolar Information Aggregation for Generalizable Neural Radiance Fields,"Generalizable NeRF can directly synthesize novel views across new scenes, +eliminating the need for scene-specific retraining in vanilla NeRF. A critical +enabling factor in these approaches is the extraction of a generalizable 3D +representation by aggregating source-view features. In this paper, we propose +an Entangled View-Epipolar Information Aggregation method dubbed EVE-NeRF. +Different from existing methods that consider cross-view and along-epipolar +information independently, EVE-NeRF conducts the view-epipolar feature +aggregation in an entangled manner by injecting the scene-invariant appearance +continuity and geometry consistency priors to the aggregation process. Our +approach effectively mitigates the potential lack of inherent geometric and +appearance constraint resulting from one-dimensional interactions, thus further +boosting the 3D representation generalizablity. EVE-NeRF attains +state-of-the-art performance across various evaluation scenarios. Extensive +experiments demonstate that, compared to prevailing single-dimensional +aggregation, the entangled network excels in the accuracy of 3D scene geometry +and appearance reconstruction. 
Our code is publicly available at +https://github.com/tatakai1/EVENeRF.",cs.CV,['cs.CV'] +Hybrid Proposal Refiner: Revisiting DETR Series from the Faster R-CNN Perspective,Jinjing Zhao · Fangyun Wei · Chang Xu,https://github.com/ZhaoJingjing713/HPR,,,,,,,nan +Hyperbolic Anomaly Detection,Huimin Li · Zhentao Chen · Yunhao Xu · Junlin Hu, ,https://arxiv.org/abs/2403.20236,,2403.20236.pdf,Long-Tailed Anomaly Detection with Learnable Class Names,"Anomaly detection (AD) aims to identify defective images and localize their +defects (if any). Ideally, AD models should be able to detect defects over many +image classes; without relying on hard-coded class names that can be +uninformative or inconsistent across datasets; learn without anomaly +supervision; and be robust to the long-tailed distributions of real-world +applications. To address these challenges, we formulate the problem of +long-tailed AD by introducing several datasets with different levels of class +imbalance and metrics for performance evaluation. We then propose a novel +method, LTAD, to detect defects from multiple and long-tailed classes, without +relying on dataset class names. LTAD combines AD by reconstruction and semantic +AD modules. AD by reconstruction is implemented with a transformer-based +reconstruction module. Semantic AD is implemented with a binary classifier, +which relies on learned pseudo class names and a pretrained foundation model. +These modules are learned over two phases. Phase 1 learns the pseudo-class +names and a variational autoencoder (VAE) for feature synthesis that augments +the training data to combat long-tails. Phase 2 then learns the parameters of +the reconstruction and classification modules of LTAD. Extensive experiments +using the proposed long-tailed datasets show that LTAD substantially +outperforms the state-of-the-art methods for most forms of dataset imbalance. +The long-tailed dataset split is available at +https://zenodo.org/records/10854201 .",cs.CV,['cs.CV'] +HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models,Tianrui Guan · Fuxiao Liu · Xiyang Wu · Ruiqi Xian · Zongxia Li · Xiaoyu Liu · Xijun Wang · Lichang Chen · Furong Huang · Yaser Yacoob · Dinesh Manocha · Tianyi Zhou,https://github.com/tianyi-lab/HallusionBench,https://arxiv.org/abs/2310.14566,,2310.14566.pdf,HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models,"We introduce HallusionBench, a comprehensive benchmark designed for the +evaluation of image-context reasoning. This benchmark presents significant +challenges to advanced large visual-language models (LVLMs), such as +GPT-4V(Vision), Gemini Pro Vision, Claude 3, and LLaVA-1.5, by emphasizing +nuanced understanding and interpretation of visual data. The benchmark +comprises 346 images paired with 1129 questions, all meticulously crafted by +human experts. We introduce a novel structure for these visual questions +designed to establish control groups. This structure enables us to conduct a +quantitative analysis of the models' response tendencies, logical consistency, +and various failure modes. In our evaluation on HallusionBench, we benchmarked +15 different models, highlighting a 31.42% question-pair accuracy achieved by +the state-of-the-art GPT-4V. Notably, all other evaluated models achieve +accuracy below 16%. 
Moreover, our analysis not only highlights the observed +failure modes, including language hallucination and visual illusion, but also +deepens an understanding of these pitfalls. Our comprehensive case studies +within HallusionBench shed light on the challenges of hallucination and +illusion in LVLMs. Based on these insights, we suggest potential pathways for +their future improvement. The benchmark and codebase can be accessed at +https://github.com/tianyi-lab/HallusionBench.",cs.CV,"['cs.CV', 'cs.CL']" +Multiscale Vision Transformers meet Bipartite Matching for efficient single-stage Action Localization,Ioanna Ntinou · Enrique Sanchez · Georgios Tzimiropoulos,https://github.com/IoannaNti/BMViT,https://arxiv.org/abs/2312.17686,,2312.17686.pdf,Multiscale Vision Transformers meet Bipartite Matching for efficient single-stage Action Localization,"Action Localization is a challenging problem that combines detection and +recognition tasks, which are often addressed separately. State-of-the-art +methods rely on off-the-shelf bounding box detections pre-computed at high +resolution, and propose transformer models that focus on the classification +task alone. Such two-stage solutions are prohibitive for real-time deployment. +On the other hand, single-stage methods target both tasks by devoting part of +the network (generally the backbone) to sharing the majority of the workload, +compromising performance for speed. These methods build on adding a DETR head +with learnable queries that after cross- and self-attention can be sent to +corresponding MLPs for detecting a person's bounding box and action. However, +DETR-like architectures are challenging to train and can incur in big +complexity. + In this paper, we observe that \textbf{a straight bipartite matching loss can +be applied to the output tokens of a vision transformer}. This results in a +backbone + MLP architecture that can do both tasks without the need of an extra +encoder-decoder head and learnable queries. We show that a single MViTv2-S +architecture trained with bipartite matching to perform both tasks surpasses +the same MViTv2-S when trained with RoI align on pre-computed bounding boxes. +With a careful design of token pooling and the proposed training pipeline, our +Bipartite-Matching Vision Transformer model, \textbf{BMViT}, achieves +3 mAP on +AVA2.2. w.r.t. the two-stage MViTv2-S counterpart. Code is available at +\href{https://github.com/IoannaNti/BMViT}{https://github.com/IoannaNti/BMViT}",cs.CV,['cs.CV'] +UnO: Unsupervised Occupancy Fields for Perception and Forecasting,Ben Agro · Quinlan Sykora · Sergio Casas · Thomas Gilles · Raquel Urtasun, ,https://arxiv.org/abs/2308.01471,,2308.01471.pdf,Implicit Occupancy Flow Fields for Perception and Prediction in Self-Driving,"A self-driving vehicle (SDV) must be able to perceive its surroundings and +predict the future behavior of other traffic participants. Existing works +either perform object detection followed by trajectory forecasting of the +detected objects, or predict dense occupancy and flow grids for the whole +scene. The former poses a safety concern as the number of detections needs to +be kept low for efficiency reasons, sacrificing object recall. The latter is +computationally expensive due to the high-dimensionality of the output grid, +and suffers from the limited receptive field inherent to fully convolutional +networks. Furthermore, both approaches employ many computational resources +predicting areas or objects that might never be queried by the motion planner. 
+This motivates our unified approach to perception and future prediction that +implicitly represents occupancy and flow over time with a single neural +network. Our method avoids unnecessary computation, as it can be directly +queried by the motion planner at continuous spatio-temporal locations. +Moreover, we design an architecture that overcomes the limited receptive field +of previous explicit occupancy prediction methods by adding an efficient yet +effective global attention mechanism. Through extensive experiments in both +urban and highway settings, we demonstrate that our implicit model outperforms +the current state-of-the-art. For more information, visit the project website: +https://waabi.ai/research/implicito.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG', 'cs.RO']" +Do Vision and Language Encoders Represent the World Similarly?,Mayug Maniparambil · Raiymbek Akshulakov · YASSER ABDELAZIZ DAHOU DJILALI · Mohamed El Amine Seddik · Sanath Narayan · Karttikeya Mangalam · Noel O'Connor,https://github.com/mayug/0-shot-llm-vision,https://arxiv.org/abs/2401.05224,,2401.05224.pdf,Do Vision and Language Encoders Represent the World Similarly?,"Aligned text-image encoders such as CLIP have become the de facto model for +vision-language tasks. Furthermore, modality-specific encoders achieve +impressive performances in their respective domains. This raises a central +question: does an alignment exist between uni-modal vision and language +encoders since they fundamentally represent the same physical world? Analyzing +the latent spaces structure of vision and language models on image-caption +benchmarks using the Centered Kernel Alignment (CKA), we find that the +representation spaces of unaligned and aligned encoders are semantically +similar. In the absence of statistical similarity in aligned encoders like +CLIP, we show that a possible matching of unaligned encoders exists without any +training. We frame this as a seeded graph-matching problem exploiting the +semantic similarity between graphs and propose two methods - a Fast Quadratic +Assignment Problem optimization, and a novel localized CKA metric-based +matching/retrieval. We demonstrate the effectiveness of this on several +downstream tasks including cross-lingual, cross-domain caption matching and +image classification. Code available at github.com/mayug/0-shot-llm-vision.",cs.CV,"['cs.CV', 'cs.AI', 'cs.CL', 'cs.LG']" +You Only Need Less Attention Each Stage in Vision Transformers,Shuoxi Zhang · Hanpeng Liu · Stephen Lin · Kun He, ,,,,,,,nan +DemoCaricature: Democratising Caricature Generation with a Rough Sketch,Dar-Yen Chen · Ayan Kumar Bhunia · Subhadeep Koley · Aneeshan Sain · Pinaki Nath Chowdhury · Yi-Zhe Song,https://democaricature.github.io/,https://arxiv.org/abs/2312.04364v1,,2312.04364v1.pdf,DemoCaricature: Democratising Caricature Generation with a Rough Sketch,"In this paper, we democratise caricature generation, empowering individuals +to effortlessly craft personalised caricatures with just a photo and a +conceptual sketch. Our objective is to strike a delicate balance between +abstraction and identity, while preserving the creativity and subjectivity +inherent in a sketch. To achieve this, we present Explicit Rank-1 Model Editing +alongside single-image personalisation, selectively applying nuanced edits to +cross-attention layers for a seamless merge of identity and style. +Additionally, we propose Random Mask Reconstruction to enhance robustness, +directing the model to focus on distinctive identity and style features. 
+Crucially, our aim is not to replace artists but to eliminate accessibility +barriers, allowing enthusiasts to engage in the artistry.",cs.CV,['cs.CV'] +Prompt Highlighter: Interactive Control for Multi-Modal LLMs,Yuechen Zhang · Shengju Qian · Bohao Peng · Shu Liu · Jiaya Jia,https://github.com/dvlab-research/Prompt-Highlighter,https://arxiv.org/abs/2312.04302,,2312.04302.pdf,Prompt Highlighter: Interactive Control for Multi-Modal LLMs,"This study targets a critical aspect of multi-modal LLMs' (LLMs&VLMs) +inference: explicit controllable text generation. Multi-modal LLMs empower +multi-modality understanding with the capability of semantic generation yet +bring less explainability and heavier reliance on prompt contents due to their +autoregressive generative nature. While manipulating prompt formats could +improve outputs, designing specific and precise prompts per task can be +challenging and ineffective. To tackle this issue, we introduce a novel +inference method, Prompt Highlighter, which enables users to highlight specific +prompt spans to interactively control the focus during generation. Motivated by +the classifier-free diffusion guidance, we form regular and unconditional +context pairs based on highlighted tokens, demonstrating that the +autoregressive generation in models can be guided in a classifier-free way. +Notably, we find that, during inference, guiding the models with highlighted +tokens through the attention weights leads to more desired outputs. Our +approach is compatible with current LLMs and VLMs, achieving impressive +customized generation results without training. Experiments confirm its +effectiveness in focusing on input contexts and generating reliable content. +Without tuning on LLaVA-v1.5, our method secured 70.7 in the MMBench test and +1552.5 in MME-perception. The code is available at: +https://github.com/dvlab-research/Prompt-Highlighter/",cs.CV,"['cs.CV', 'cs.CL']" +Don't Look into the Dark: Latent Codes for Pluralistic Image Inpainting,Haiwei Chen · Yajie Zhao, ,https://arxiv.org/abs/2403.18186,,2403.18186.pdf,Don't Look into the Dark: Latent Codes for Pluralistic Image Inpainting,"We present a method for large-mask pluralistic image inpainting based on the +generative framework of discrete latent codes. Our method learns latent priors, +discretized as tokens, by only performing computations at the visible locations +of the image. This is realized by a restrictive partial encoder that predicts +the token label for each visible block, a bidirectional transformer that infers +the missing labels by only looking at these tokens, and a dedicated synthesis +network that couples the tokens with the partial image priors to generate +coherent and pluralistic complete image even under extreme mask settings. +Experiments on public benchmarks validate our design choices as the proposed +method outperforms strong baselines in both visual quality and diversity +metrics.",cs.CV,['cs.CV'] +Bayesian Differentiable Physics for Cloth Digitalization,Deshan Gong · Ningtao Mao · He Wang, ,https://arxiv.org/abs/2402.17664,,2402.17664.pdf,Bayesian Differentiable Physics for Cloth Digitalization,"We propose a new method for cloth digitalization. Deviating from existing +methods which learn from data captured under relatively casual settings, we +propose to learn from data captured in strictly tested measuring protocols, and +find plausible physical parameters of the cloths. 
However, such data is +currently absent, so we first propose a new dataset with accurate cloth +measurements. Further, the data size is considerably smaller than the ones in +current deep learning, due to the nature of the data capture process. To learn +from small data, we propose a new Bayesian differentiable cloth model to +estimate the complex material heterogeneity of real cloths. It can provide +highly accurate digitalization from very limited data samples. Through +exhaustive evaluation and comparison, we show our method is accurate in cloth +digitalization, efficient in learning from limited data samples, and general in +capturing material variations. Code and data are available +https://github.com/realcrane/Bayesian-Differentiable-Physics-for-Cloth-Digitalization",cs.CV,"['cs.CV', 'F.4.8; I.6.8']" +Few-Shot Object Detection with Foundation Models,Guangxing Han · Ser-Nam Lim, ,https://arxiv.org/abs/2312.14494,,2312.14494.pdf,Revisiting Few-Shot Object Detection with Vision-Language Models,"Few-shot object detection (FSOD) benchmarks have advanced techniques for +detecting new categories with limited annotations. Existing benchmarks +repurpose well-established datasets like COCO by partitioning categories into +base and novel classes for pre-training and fine-tuning respectively. However, +these benchmarks do not reflect how FSOD is deployed in practice. Rather than +only pre-training on a small number of base categories, we argue that it is +more practical to fine-tune a foundation model (e.g., a vision-language model +(VLM) pre-trained on web-scale data) for a target domain. Surprisingly, we find +that zero-shot inference from VLMs like GroundingDINO significantly outperforms +the state-of-the-art (48.3 vs. 33.1 AP) on COCO. However, such zero-shot models +can still be misaligned to target concepts of interest. For example, trailers +on the web may be different from trailers in the context of autonomous +vehicles. In this work, we propose Foundational FSOD, a new benchmark protocol +that evaluates detectors pre-trained on any external datasets and fine-tuned on +K-shots per target class. Further, we note that current FSOD benchmarks are +actually federated datasets containing exhaustive annotations for each category +on a subset of the data. We leverage this insight to propose simple strategies +for fine-tuning VLMs with federated losses. We demonstrate the effectiveness of +our approach on LVIS and nuImages, improving over prior work by 5.9 AP. Our +code is available at https://github.com/anishmadan23/foundational_fsod",cs.CV,['cs.CV'] +MonoHair: High-Fidelity Hair Modeling from a Monocular Video,Keyu Wu · LINGCHEN YANG · Zhiyi Kuang · Yao Feng · Xutao Han · Yuefan Shen · Hongbo Fu · Kun Zhou · Youyi Zheng,https://keyuwu-cs.github.io/MonoHair/,https://arxiv.org/abs/2403.18356,,2403.18356.pdf,MonoHair: High-Fidelity Hair Modeling from a Monocular Video,"Undoubtedly, high-fidelity 3D hair is crucial for achieving realism, artistic +expression, and immersion in computer graphics. While existing 3D hair modeling +methods have achieved impressive performance, the challenge of achieving +high-quality hair reconstruction persists: they either require strict capture +conditions, making practical applications difficult, or heavily rely on learned +prior data, obscuring fine-grained details in images. To address these +challenges, we propose MonoHair,a generic framework to achieve high-fidelity +hair reconstruction from a monocular video, without specific requirements for +environments. 
Our approach bifurcates the hair modeling process into two main +stages: precise exterior reconstruction and interior structure inference. The +exterior is meticulously crafted using our Patch-based Multi-View Optimization +(PMVO). This method strategically collects and integrates hair information from +multiple views, independent of prior data, to produce a high-fidelity exterior +3D line map. This map not only captures intricate details but also facilitates +the inference of the hair's inner structure. For the interior, we employ a +data-driven, multi-view 3D hair reconstruction method. This method utilizes 2D +structural renderings derived from the reconstructed exterior, mirroring the +synthetic 2D inputs used during training. This alignment effectively bridges +the domain gap between our training data and real-world data, thereby enhancing +the accuracy and reliability of our interior structure inference. Lastly, we +generate a strand model and resolve the directional ambiguity by our hair +growth algorithm. Our experiments demonstrate that our method exhibits +robustness across diverse hairstyles and achieves state-of-the-art performance. +For more results, please refer to our project page +https://keyuwu-cs.github.io/MonoHair/.",cs.CV,['cs.CV'] +Solving Masked Jigsaw Puzzles with Diffusion Transformers,Jinyang Liu · Wondmgezahu Teshome · Sandesh Ghimire · Mario Sznaier · Octavia Camps, ,https://arxiv.org/abs/2404.07292,,2404.07292.pdf,Solving Masked Jigsaw Puzzles with Diffusion Vision Transformers,"Solving image and video jigsaw puzzles poses the challenging task of +rearranging image fragments or video frames from unordered sequences to restore +meaningful images and video sequences. Existing approaches often hinge on +discriminative models tasked with predicting either the absolute positions of +puzzle elements or the permutation actions applied to the original data. +Unfortunately, these methods face limitations in effectively solving puzzles +with a large number of elements. In this paper, we propose JPDVT, an innovative +approach that harnesses diffusion transformers to address this challenge. +Specifically, we generate positional information for image patches or video +frames, conditioned on their underlying visual content. This information is +then employed to accurately assemble the puzzle pieces in their correct +positions, even in scenarios involving missing pieces. Our method achieves +state-of-the-art performance on several datasets.",cs.CV,['cs.CV'] +Shadow-Enlightened Image Outpainting,Hang Yu · Ruilin Li · Shaorong Xie · Jiayan Qiu, ,https://arxiv.org/html/2204.08563v2,,2204.08563v2.pdf,Cylin-Painting: Seamless {360\textdegree} Panoramic Image Outpainting and Beyond,"Image outpainting gains increasing attention since it can generate the +complete scene from a partial view, providing a valuable solution to construct +{360\textdegree} panoramic images. As image outpainting suffers from the +intrinsic issue of unidirectional completion flow, previous methods convert the +original problem into inpainting, which allows a bidirectional flow. However, +we find that inpainting has its own limitations and is inferior to outpainting +in certain situations. The question of how they may be combined for the best of +both has as yet remained under-explored. In this paper, we provide a deep +analysis of the differences between inpainting and outpainting, which +essentially depends on how the source pixels contribute to the unknown regions +under different spatial arrangements. 
Motivated by this analysis, we present a +Cylin-Painting framework that involves meaningful collaborations between +inpainting and outpainting and efficiently fuses the different arrangements, +with a view to leveraging their complementary benefits on a seamless cylinder. +Nevertheless, straightforwardly applying the cylinder-style convolution often +generates visually unpleasing results as it discards important positional +information. To address this issue, we further present a learnable positional +embedding strategy to incorporate the missing component of positional encoding +into the cylinder convolution, which significantly improves the panoramic +results. It is noted that while developed for image outpainting, the proposed +algorithm can be effectively extended to other panoramic vision tasks, such as +object detection, depth estimation, and image super-resolution. Code will be +made available at \url{https://github.com/KangLiao929/Cylin-Painting}.",cs.CV,['cs.CV'] +Motion2VecSets: 4D Latent Vector Set Diffusion for Non-rigid Shape Reconstruction and Tracking,Wei Cao · Chang Luo · Biao Zhang · Matthias Nießner · Jiapeng Tang, ,https://arxiv.org/abs/2401.06614,,2401.06614.pdf,Motion2VecSets: 4D Latent Vector Set Diffusion for Non-rigid Shape Reconstruction and Tracking,"We introduce Motion2VecSets, a 4D diffusion model for dynamic surface +reconstruction from point cloud sequences. While existing state-of-the-art +methods have demonstrated success in reconstructing non-rigid objects using +neural field representations, conventional feed-forward networks encounter +challenges with ambiguous observations from noisy, partial, or sparse point +clouds. To address these challenges, we introduce a diffusion model that +explicitly learns the shape and motion distribution of non-rigid objects +through an iterative denoising process of compressed latent representations. +The diffusion-based priors enable more plausible and probabilistic +reconstructions when handling ambiguous inputs. We parameterize 4D dynamics +with latent sets instead of using global latent codes. This novel 4D +representation allows us to learn local shape and deformation patterns, leading +to more accurate non-linear motion capture and significantly improving +generalizability to unseen motions and identities. For more temporally-coherent +object tracking, we synchronously denoise deformation latent sets and exchange +information across multiple frames. To avoid computational overhead, we +designed an interleaved space and time attention block to alternately aggregate +deformation latents along spatial and temporal domains. Extensive comparisons +against state-of-the-art methods demonstrate the superiority of our +Motion2VecSets in 4D reconstruction from various imperfect observations. More +detailed information can be found at +https://vveicao.github.io/projects/Motion2VecSets/.",cs.CV,['cs.CV'] +Test-Time Linear Out-of-Distribution Detection,Ke Fan · Tong Liu · Xingyu Qiu · Yikai Wang · Lian Huai · Zeyu Shangguan · Shuang Gou · FENGJIAN LIU · Yuqian Fu · Yanwei Fu · Xingqun Jiang, ,https://arxiv.org/abs/2311.16420,,2311.16420.pdf,Model-free Test Time Adaptation for Out-Of-Distribution Detection,"Out-of-distribution (OOD) detection is essential for the reliability of ML +models. Most existing methods for OOD detection learn a fixed decision +criterion from a given in-distribution dataset and apply it universally to +decide if a data point is OOD. 
Recent work~\cite{fang2022is} shows that given +only in-distribution data, it is impossible to reliably detect OOD data without +extra assumptions. Motivated by the theoretical result and recent exploration +of test-time adaptation methods, we propose a Non-Parametric Test Time +\textbf{Ada}ptation framework for \textbf{O}ut-Of-\textbf{D}istribution +\textbf{D}etection (\abbr). Unlike conventional methods, \abbr utilizes online +test samples for model adaptation during testing, enhancing adaptability to +changing data distributions. The framework incorporates detected OOD instances +into decision-making, reducing false positive rates, particularly when ID and +OOD distributions overlap significantly. We demonstrate the effectiveness of +\abbr through comprehensive experiments on multiple OOD detection benchmarks, +extensive empirical studies show that \abbr significantly improves the +performance of OOD detection over state-of-the-art methods. Specifically, \abbr +reduces the false positive rate (FPR95) by $23.23\%$ on the CIFAR-10 benchmarks +and $38\%$ on the ImageNet-1k benchmarks compared to the advanced methods. +Lastly, we theoretically verify the effectiveness of \abbr.",cs.LG,"['cs.LG', 'cs.CV']" +Spatial-Aware Regression for Keypoint Localization,Dongkai Wang · Shiliang Zhang, ,,https://dl.acm.org/doi/10.1145/3581783.3611989,,,,,nan +Adaptive Fusion of Single-View and Multi-View Depth for Autonomous Driving,JunDa Cheng · Wei Yin · Kaixuan Wang · Xiaozhi Chen · Shijie Wang · Xin Yang, ,https://arxiv.org/abs/2403.07535,,2403.07535.pdf,Adaptive Fusion of Single-View and Multi-View Depth for Autonomous Driving,"Multi-view depth estimation has achieved impressive performance over various +benchmarks. However, almost all current multi-view systems rely on given ideal +camera poses, which are unavailable in many real-world scenarios, such as +autonomous driving. In this work, we propose a new robustness benchmark to +evaluate the depth estimation system under various noisy pose settings. +Surprisingly, we find current multi-view depth estimation methods or +single-view and multi-view fusion methods will fail when given noisy pose +settings. To address this challenge, we propose a single-view and multi-view +fused depth estimation system, which adaptively integrates high-confident +multi-view and single-view results for both robust and accurate depth +estimations. The adaptive fusion module performs fusion by dynamically +selecting high-confidence regions between two branches based on a wrapping +confidence map. Thus, the system tends to choose the more reliable branch when +facing textureless scenes, inaccurate calibration, dynamic objects, and other +degradation or challenging conditions. Our method outperforms state-of-the-art +multi-view and fusion methods under robustness testing. Furthermore, we achieve +state-of-the-art performance on challenging benchmarks (KITTI and DDAD) when +given accurate pose estimations. Project website: +https://github.com/Junda24/AFNet/.",cs.CV,['cs.CV'] +ID-like Prompt Learning for Few-Shot Out-of-Distribution Detection,Yichen Bai · Zongbo Han · Bing Cao · Xiaoheng Jiang · Qinghua Hu · Changqing Zhang, ,https://arxiv.org/abs/2311.15243,,2311.15243.pdf,ID-like Prompt Learning for Few-Shot Out-of-Distribution Detection,"Out-of-distribution (OOD) detection methods often exploit auxiliary outliers +to train model identifying OOD samples, especially discovering challenging +outliers from auxiliary outliers dataset to improve OOD detection. 
However, +they may still face limitations in effectively distinguishing between the most +challenging OOD samples that are much like in-distribution (ID) data, i.e., +\idlike samples. To this end, we propose a novel OOD detection framework that +discovers \idlike outliers using CLIP \cite{DBLP:conf/icml/RadfordKHRGASAM21} +from the vicinity space of the ID samples, thus helping to identify these most +challenging OOD samples. Then a prompt learning framework is proposed that +utilizes the identified \idlike outliers to further leverage the capabilities +of CLIP for OOD detection. Benefiting from the powerful CLIP, we only need a +small number of ID samples to learn the prompts of the model without exposing +other auxiliary outlier datasets. By focusing on the most challenging \idlike +OOD samples and elegantly exploiting the capabilities of CLIP, our method +achieves superior few-shot learning performance on various real-world image +datasets (e.g., in 4-shot OOD detection on the ImageNet-1k dataset, our method +reduces the average FPR95 by 12.16\% and improves the average AUROC by 2.76\%, +compared to state-of-the-art methods). Code is available at +https://github.com/ycfate/ID-like.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +FlowVid: Taming Imperfect Optical Flows for Consistent Video-to-Video Synthesis,Feng Liang · Bichen Wu · Jialiang Wang · Licheng Yu · Kunpeng Li · Yinan Zhao · Ishan Misra · Jia-Bin Huang · Peizhao Zhang · Peter Vajda · Diana Marculescu, ,https://arxiv.org/abs/2312.17681,,2312.17681.pdf,FlowVid: Taming Imperfect Optical Flows for Consistent Video-to-Video Synthesis,"Diffusion models have transformed the image-to-image (I2I) synthesis and are +now permeating into videos. However, the advancement of video-to-video (V2V) +synthesis has been hampered by the challenge of maintaining temporal +consistency across video frames. This paper proposes a consistent V2V synthesis +framework by jointly leveraging spatial conditions and temporal optical flow +clues within the source video. Contrary to prior methods that strictly adhere +to optical flow, our approach harnesses its benefits while handling the +imperfection in flow estimation. We encode the optical flow via warping from +the first frame and serve it as a supplementary reference in the diffusion +model. This enables our model for video synthesis by editing the first frame +with any prevalent I2I models and then propagating edits to successive frames. +Our V2V model, FlowVid, demonstrates remarkable properties: (1) Flexibility: +FlowVid works seamlessly with existing I2I models, facilitating various +modifications, including stylization, object swaps, and local edits. (2) +Efficiency: Generation of a 4-second video with 30 FPS and 512x512 resolution +takes only 1.5 minutes, which is 3.1x, 7.2x, and 10.5x faster than CoDeF, +Rerender, and TokenFlow, respectively. (3) High-quality: In user studies, our +FlowVid is preferred 45.7% of the time, outperforming CoDeF (3.5%), Rerender +(10.2%), and TokenFlow (40.4%).",cs.CV,"['cs.CV', 'cs.MM']" +Prompt-enhanced Multiple Instance Learning for Weakly Supervised Anomaly Detection,Junxi Chen · Liang Li · Li Su · Zheng-Jun Zha · Qingming Huang, ,https://arxiv.org/abs/2306.14451,,2306.14451.pdf,Learning Prompt-Enhanced Context Features for Weakly-Supervised Video Anomaly Detection,"Video anomaly detection under weak supervision presents significant +challenges, particularly due to the lack of frame-level annotations during +training. 
While prior research has utilized graph convolution networks and +self-attention mechanisms alongside multiple instance learning (MIL)-based +classification loss to model temporal relations and learn discriminative +features, these methods often employ multi-branch architectures to capture +local and global dependencies separately, resulting in increased parameters and +computational costs. Moreover, the coarse-grained interclass separability +provided by the binary constraint of MIL-based loss neglects the fine-grained +discriminability within anomalous classes. In response, this paper introduces a +weakly supervised anomaly detection framework that focuses on efficient context +modeling and enhanced semantic discriminability. We present a Temporal Context +Aggregation (TCA) module that captures comprehensive contextual information by +reusing the similarity matrix and implementing adaptive fusion. Additionally, +we propose a Prompt-Enhanced Learning (PEL) module that integrates semantic +priors using knowledge-based prompts to boost the discriminative capacity of +context features while ensuring separability between anomaly sub-classes. +Extensive experiments validate the effectiveness of our method's components, +demonstrating competitive performance with reduced parameters and computational +effort on three challenging benchmarks: UCF-Crime, XD-Violence, and +ShanghaiTech datasets. Notably, our approach significantly improves the +detection accuracy of certain anomaly sub-classes, underscoring its practical +value and efficacy. Our code is available at: +https://github.com/yujiangpu20/PEL4VAD.",cs.CV,['cs.CV'] +Efficient Dataset Distillation via Minimax Diffusion,Jianyang Gu · Saeed Vahidian · Vyacheslav Kungurtsev · Haonan Wang · Wei Jiang · Yang You · Yiran Chen,https://github.com/vimar-gu/MinimaxDiffusion,https://arxiv.org/abs/2311.15529v1,,2311.15529v1.pdf,Efficient Dataset Distillation via Minimax Diffusion,"Dataset distillation reduces the storage and computational consumption of +training a network by generating a small surrogate dataset that encapsulates +rich information of the original large-scale one. However, previous +distillation methods heavily rely on the sample-wise iterative optimization +scheme. As the images-per-class (IPC) setting or image resolution grows larger, +the necessary computation will demand overwhelming time and resources. In this +work, we intend to incorporate generative diffusion techniques for computing +the surrogate dataset. Observing that key factors for constructing an effective +surrogate dataset are representativeness and diversity, we design additional +minimax criteria in the generative training to enhance these facets for the +generated images of diffusion models. We present a theoretical model of the +process as hierarchical diffusion control demonstrating the flexibility of the +diffusion process to target these criteria without jeopardizing the +faithfulness of the sample to the desired distribution. The proposed method +achieves state-of-the-art validation performance while demanding much less +computational resources. Under the 100-IPC setting on ImageWoof, our method +requires less than one-twentieth the distillation time of previous methods, yet +yields even better performance. 
Source code available in +https://github.com/vimar-gu/MinimaxDiffusion.",cs.CV,['cs.CV'] +State Space Models for Event Cameras,Nikola Zubic · Mathias Gehrig · Davide Scaramuzza,https://github.com/uzh-rpg/ssms_event_cameras,https://arxiv.org/abs/2402.15584,,2402.15584.pdf,State Space Models for Event Cameras,"Today, state-of-the-art deep neural networks that process event-camera data +first convert a temporal window of events into dense, grid-like input +representations. As such, they exhibit poor generalizability when deployed at +higher inference frequencies (i.e., smaller temporal windows) than the ones +they were trained on. We address this challenge by introducing state-space +models (SSMs) with learnable timescale parameters to event-based vision. This +design adapts to varying frequencies without the need to retrain the network at +different frequencies. Additionally, we investigate two strategies to +counteract aliasing effects when deploying the model at higher frequencies. We +comprehensively evaluate our approach against existing methods based on RNN and +Transformer architectures across various benchmarks, including Gen1 and 1 Mpx +event camera datasets. Our results demonstrate that SSM-based models train 33% +faster and also exhibit minimal performance degradation when tested at higher +frequencies than the training input. Traditional RNN and Transformer models +exhibit performance drops of more than 20 mAP, with SSMs having a drop of 3.76 +mAP, highlighting the effectiveness of SSMs in event-based vision tasks.",cs.CV,"['cs.CV', 'cs.LG']" +ReCoRe: Regularized Contrastive Representation Learning of World Model,"Rudra P,K. Poudel · Harit Pandya · Stephan Liwicki · Roberto Cipolla",https://www.toshiba.eu/pages/eu/Cambridge-Research-Laboratory/world_models,https://arxiv.org/abs/2312.09056v1,,2312.09056v1.pdf,ReCoRe: Regularized Contrastive Representation Learning of World Model,"While recent model-free Reinforcement Learning (RL) methods have demonstrated +human-level effectiveness in gaming environments, their success in everyday +tasks like visual navigation has been limited, particularly under significant +appearance variations. This limitation arises from (i) poor sample efficiency +and (ii) over-fitting to training scenarios. To address these challenges, we +present a world model that learns invariant features using (i) contrastive +unsupervised learning and (ii) an intervention-invariant regularizer. Learning +an explicit representation of the world dynamics i.e. a world model, improves +sample efficiency while contrastive learning implicitly enforces learning of +invariant features, which improves generalization. However, the naive +integration of contrastive loss to world models fails due to a lack of +supervisory signals to the visual encoder, as world-model-based RL methods +independently optimize representation learning and agent policy. To overcome +this issue, we propose an intervention-invariant regularizer in the form of an +auxiliary task such as depth prediction, image denoising, etc., that explicitly +enforces invariance to style-interventions. Our method outperforms current +state-of-the-art model-based and model-free RL methods and significantly on +out-of-distribution point navigation task evaluated on the iGibson benchmark. +We further demonstrate that our approach, with only visual observations, +outperforms recent language-guided foundation models for point navigation, +which is essential for deployment on robots with limited computation +capabilities. 
Finally, we demonstrate that our proposed model excels at the +sim-to-real transfer of its perception module on Gibson benchmark.",cs.LG,"['cs.LG', 'cs.AI', 'cs.CV', 'cs.RO', 'stat.ML']" +Semantic Human Mesh Reconstruction with Textures,xiaoyu zhan · Jianxin Yang · Yuanqi Li · Jie Guo · Yanwen Guo · Wenping Wang,https://zhanxy.xyz/projects/shert,https://arxiv.org/abs/2403.02561,,2403.02561.pdf,Semantic Human Mesh Reconstruction with Textures,"The field of 3D detailed human mesh reconstruction has made significant +progress in recent years. However, current methods still face challenges when +used in industrial applications due to unstable results, low-quality meshes, +and a lack of UV unwrapping and skinning weights. In this paper, we present +SHERT, a novel pipeline that can reconstruct semantic human meshes with +textures and high-precision details. SHERT applies semantic- and normal-based +sampling between the detailed surface (e.g. mesh and SDF) and the corresponding +SMPL-X model to obtain a partially sampled semantic mesh and then generates the +complete semantic mesh by our specifically designed self-supervised completion +and refinement networks. Using the complete semantic mesh as a basis, we employ +a texture diffusion model to create human textures that are driven by both +images and texts. Our reconstructed meshes have stable UV unwrapping, +high-quality triangle meshes, and consistent semantic information. The given +SMPL-X model provides semantic information and shape priors, allowing SHERT to +perform well even with incorrect and incomplete inputs. The semantic +information also makes it easy to substitute and animate different body parts +such as the face, body, and hands. Quantitative and qualitative experiments +demonstrate that SHERT is capable of producing high-fidelity and robust +semantic meshes that outperform state-of-the-art methods.",cs.CV,['cs.CV'] +Infrared Small Target Detection with Scale and Location Sensitivity,Qiankun Liu · Rui Liu · Bolun Zheng · Hongkui Wang · Ying Fu, ,https://arxiv.org/abs/2403.19366,,2403.19366.pdf,Infrared Small Target Detection with Scale and Location Sensitivity,"Recently, infrared small target detection (IRSTD) has been dominated by +deep-learning-based methods. However, these methods mainly focus on the design +of complex model structures to extract discriminative features, leaving the +loss functions for IRSTD under-explored. For example, the widely used +Intersection over Union (IoU) and Dice losses lack sensitivity to the scales +and locations of targets, limiting the detection performance of detectors. In +this paper, we focus on boosting detection performance with a more effective +loss but a simpler model structure. Specifically, we first propose a novel +Scale and Location Sensitive (SLS) loss to handle the limitations of existing +losses: 1) for scale sensitivity, we compute a weight for the IoU loss based on +target scales to help the detector distinguish targets with different scales: +2) for location sensitivity, we introduce a penalty term based on the center +points of targets to help the detector localize targets more precisely. Then, +we design a simple Multi-Scale Head to the plain U-Net (MSHNet). By applying +SLS loss to each scale of the predictions, our MSHNet outperforms existing +state-of-the-art methods by a large margin. In addition, the detection +performance of existing detectors can be further improved when trained with our +SLS loss, demonstrating the effectiveness and generalization of our SLS loss. 
+The code is available at https://github.com/ying-fu/MSHNet.",cs.CV,['cs.CV'] +Open3DSG: Open-Vocabulary 3D Scene Graphs from Point Clouds with Queryable Objects and Open-Set Relationships,Sebastian Koch · Narunas Vaskevicius · Mirco Colosi · Pedro Hermosilla · Timo Ropinski,https://kochsebastian.com/open3dsg,https://arxiv.org/abs/2402.12259,,2402.12259.pdf,Open3DSG: Open-Vocabulary 3D Scene Graphs from Point Clouds with Queryable Objects and Open-Set Relationships,"Current approaches for 3D scene graph prediction rely on labeled datasets to +train models for a fixed set of known object classes and relationship +categories. We present Open3DSG, an alternative approach to learn 3D scene +graph prediction in an open world without requiring labeled scene graph data. +We co-embed the features from a 3D scene graph prediction backbone with the +feature space of powerful open world 2D vision language foundation models. This +enables us to predict 3D scene graphs from 3D point clouds in a zero-shot +manner by querying object classes from an open vocabulary and predicting the +inter-object relationships from a grounded LLM with scene graph features and +queried object classes as context. Open3DSG is the first 3D point cloud method +to predict not only explicit open-vocabulary object classes, but also open-set +relationships that are not limited to a predefined label set, making it +possible to express rare as well as specific objects and relationships in the +predicted 3D scene graph. Our experiments show that Open3DSG is effective at +predicting arbitrary object classes as well as their complex inter-object +relationships describing spatial, supportive, semantic and comparative +relationships.",cs.CV,['cs.CV'] +SED: A Simple Encoder-Decoder for Open-Vocabulary Semantic Segmentation,Bin Xie · Jiale Cao · Jin Xie · Fahad Shahbaz Khan · Yanwei Pang,https://github.com/xb534/SED,https://arxiv.org/abs/2311.15537,,2311.15537.pdf,SED: A Simple Encoder-Decoder for Open-Vocabulary Semantic Segmentation,"Open-vocabulary semantic segmentation strives to distinguish pixels into +different semantic groups from an open set of categories. Most existing methods +explore utilizing pre-trained vision-language models, in which the key is to +adopt the image-level model for pixel-level segmentation task. In this paper, +we propose a simple encoder-decoder, named SED, for open-vocabulary semantic +segmentation, which comprises a hierarchical encoder-based cost map generation +and a gradual fusion decoder with category early rejection. The hierarchical +encoder-based cost map generation employs hierarchical backbone, instead of +plain transformer, to predict pixel-level image-text cost map. Compared to +plain transformer, hierarchical backbone better captures local spatial +information and has linear computational complexity with respect to input size. +Our gradual fusion decoder employs a top-down structure to combine cost map and +the feature maps of different backbone levels for segmentation. To accelerate +inference speed, we introduce a category early rejection scheme in the decoder +that rejects many no-existing categories at the early layer of decoder, +resulting in at most 4.7 times acceleration without accuracy degradation. +Experiments are performed on multiple open-vocabulary semantic segmentation +datasets, which demonstrates the efficacy of our SED method. 
When using +ConvNeXt-B, our SED method achieves mIoU score of 31.6\% on ADE20K with 150 +categories at 82 millisecond ($ms$) per image on a single A6000. We will +release it at \url{https://github.com/xb534/SED.git}.",cs.CV,['cs.CV'] +"Choose What You Need: Disentangled Representation Learning for Scene Text Recognition, Removal and Editing",Boqiang Zhang · Hongtao Xie · Zuan Gao · Yuxin Wang, ,https://arxiv.org/abs/2405.12724,,2405.12724.pdf,RemoCap: Disentangled Representation Learning for Motion Capture,"Reconstructing 3D human bodies from realistic motion sequences remains a +challenge due to pervasive and complex occlusions. Current methods struggle to +capture the dynamics of occluded body parts, leading to model penetration and +distorted motion. RemoCap leverages Spatial Disentanglement (SD) and Motion +Disentanglement (MD) to overcome these limitations. SD addresses occlusion +interference between the target human body and surrounding objects. It achieves +this by disentangling target features along the dimension axis. By aligning +features based on their spatial positions in each dimension, SD isolates the +target object's response within a global window, enabling accurate capture +despite occlusions. The MD module employs a channel-wise temporal shuffling +strategy to simulate diverse scene dynamics. This process effectively +disentangles motion features, allowing RemoCap to reconstruct occluded parts +with greater fidelity. Furthermore, this paper introduces a sequence velocity +loss that promotes temporal coherence. This loss constrains inter-frame +velocity errors, ensuring the predicted motion exhibits realistic consistency. +Extensive comparisons with state-of-the-art (SOTA) methods on benchmark +datasets demonstrate RemoCap's superior performance in 3D human body +reconstruction. On the 3DPW dataset, RemoCap surpasses all competitors, +achieving the best results in MPVPE (81.9), MPJPE (72.7), and PA-MPJPE (44.1) +metrics. Codes are available at https://wanghongsheng01.github.io/RemoCap/.",cs.CV,['cs.CV'] +SpecNeRF: Gaussian Directional Encoding for Specular Reflections,Li Ma · Vasu Agrawal · Haithem Turki · Changil Kim · Chen Gao · Pedro V. Sander · Michael Zollhoefer · Christian Richardt, ,https://arxiv.org/abs/2312.13102,,2312.13102.pdf,SpecNeRF: Gaussian Directional Encoding for Specular Reflections,"Neural radiance fields have achieved remarkable performance in modeling the +appearance of 3D scenes. However, existing approaches still struggle with the +view-dependent appearance of glossy surfaces, especially under complex lighting +of indoor environments. Unlike existing methods, which typically assume distant +lighting like an environment map, we propose a learnable Gaussian directional +encoding to better model the view-dependent effects under near-field lighting +conditions. Importantly, our new directional encoding captures the +spatially-varying nature of near-field lighting and emulates the behavior of +prefiltered environment maps. As a result, it enables the efficient evaluation +of preconvolved specular color at any 3D location with varying roughness +coefficients. We further introduce a data-driven geometry prior that helps +alleviate the shape radiance ambiguity in reflection modeling. 
We show that our +Gaussian directional encoding and geometry prior significantly improve the +modeling of challenging specular reflections in neural radiance fields, which +helps decompose appearance into more physically meaningful components.",cs.CV,['cs.CV'] +Continual-MAE: Adaptive Distribution Masked Autoencoders for Continual Test-Time Adaptation,Jiaming Liu · Ran Xu · Senqiao Yang · Renrui Zhang · Qizhe Zhang · Zehui Chen · Yandong Guo · Shanghang Zhang,https://sites.google.com/view/continual-mae/home,https://arxiv.org/abs/2312.12480,,2312.12480.pdf,Continual-MAE: Adaptive Distribution Masked Autoencoders for Continual Test-Time Adaptation,"Continual Test-Time Adaptation (CTTA) is proposed to migrate a source +pre-trained model to continually changing target distributions, addressing +real-world dynamism. Existing CTTA methods mainly rely on entropy minimization +or teacher-student pseudo-labeling schemes for knowledge extraction in +unlabeled target domains. However, dynamic data distributions cause +miscalibrated predictions and noisy pseudo-labels in existing self-supervised +learning methods, hindering the effective mitigation of error accumulation and +catastrophic forgetting problems during the continual adaptation process. To +tackle these issues, we propose a continual self-supervised method, Adaptive +Distribution Masked Autoencoders (ADMA), which enhances the extraction of +target domain knowledge while mitigating the accumulation of distribution +shifts. Specifically, we propose a Distribution-aware Masking (DaM) mechanism +to adaptively sample masked positions, followed by establishing consistency +constraints between the masked target samples and the original target samples. +Additionally, for masked tokens, we utilize an efficient decoder to reconstruct +a hand-crafted feature descriptor (e.g., Histograms of Oriented Gradients), +leveraging its invariant properties to boost task-relevant representations. +Through conducting extensive experiments on four widely recognized benchmarks, +our proposed method attains state-of-the-art performance in both classification +and segmentation CTTA tasks. Our project page: +https://sites.google.com/view/continual-mae/home.",cs.CV,['cs.CV'] +3D Face Tracking from 2D Video through Iterative Dense UV to Image Flow,Felix Taubner · Prashant Raina · Mathieu Tuli · Eu Wern Teh · Chul Lee · Jinmiao Huang,https://felixtaubner.github.io/flowface,https://arxiv.org/abs/2404.09819,,2404.09819.pdf,3D Face Tracking from 2D Video through Iterative Dense UV to Image Flow,"When working with 3D facial data, improving fidelity and avoiding the uncanny +valley effect is critically dependent on accurate 3D facial performance +capture. Because such methods are expensive and due to the widespread +availability of 2D videos, recent methods have focused on how to perform +monocular 3D face tracking. However, these methods often fall short in +capturing precise facial movements due to limitations in their network +architecture, training, and evaluation processes. Addressing these challenges, +we propose a novel face tracker, FlowFace, that introduces an innovative 2D +alignment network for dense per-vertex alignment. Unlike prior work, FlowFace +is trained on high-quality 3D scan annotations rather than weak supervision or +synthetic data. 
Our 3D model fitting module jointly fits a 3D face model from +one or many observations, integrating existing neutral shape priors for +enhanced identity and expression disentanglement and per-vertex deformations +for detailed facial feature reconstruction. Additionally, we propose a novel +metric and benchmark for assessing tracking accuracy. Our method exhibits +superior performance on both custom and publicly available benchmarks. We +further validate the effectiveness of our tracker by generating high-quality 3D +data from 2D videos, which leads to performance gains on downstream tasks.",cs.CV,['cs.CV'] +Deep Single Image Camera Calibration by Heatmap Regression to Recover Fisheye Images Under Manhattan World Assumption,Nobuhiko Wakai · Satoshi Sato · Yasunori Ishii · Takayoshi Yamashita, ,,https://paperswithcode.com/search?q=author:Yasunori+Ishii,,,,,nan +Orchestrate Latent Expertise: Advancing Online Continual Learning with Multi-Level Supervision and Reverse Self-Distillation,Hongwei Yan · Liyuan Wang · Kaisheng Ma · Yi Zhong, ,https://arxiv.org/abs/2404.00417,,2404.00417.pdf,Orchestrate Latent Expertise: Advancing Online Continual Learning with Multi-Level Supervision and Reverse Self-Distillation,"To accommodate real-world dynamics, artificial intelligence systems need to +cope with sequentially arriving content in an online manner. Beyond regular +Continual Learning (CL) attempting to address catastrophic forgetting with +offline training of each task, Online Continual Learning (OCL) is a more +challenging yet realistic setting that performs CL in a one-pass data stream. +Current OCL methods primarily rely on memory replay of old training samples. +However, a notable gap from CL to OCL stems from the additional +overfitting-underfitting dilemma associated with the use of rehearsal buffers: +the inadequate learning of new training samples (underfitting) and the repeated +learning of a few old training samples (overfitting). To this end, we introduce +a novel approach, Multi-level Online Sequential Experts (MOSE), which +cultivates the model as stacked sub-experts, integrating multi-level +supervision and reverse self-distillation. Supervision signals across multiple +stages facilitate appropriate convergence of the new task while gathering +various strengths from experts by knowledge distillation mitigates the +performance decline of old tasks. MOSE demonstrates remarkable efficacy in +learning new samples and preserving past knowledge through multi-level experts, +thereby significantly advancing OCL performance over state-of-the-art baselines +(e.g., up to 7.3% on Split CIFAR-100 and 6.1% on Split Tiny-ImageNet).",cs.LG,"['cs.LG', 'cs.AI', 'cs.CV']" +An Empirical Study of the Generalization Ability of Lidar 3D Object Detectors to Unseen Domains,George Eskandar, ,https://arxiv.org/abs/2402.17562v1,,2402.17562v1.pdf,An Empirical Study of the Generalization Ability of Lidar 3D Object Detectors to Unseen Domains,"3D Object Detectors (3D-OD) are crucial for understanding the environment in +many robotic tasks, especially autonomous driving. Including 3D information via +Lidar sensors improves accuracy greatly. However, such detectors perform poorly +on domains they were not trained on, i.e. different locations, sensors, +weather, etc., limiting their reliability in safety-critical applications. +There exist methods to adapt 3D-ODs to these domains; however, these methods +treat 3D-ODs as a black box, neglecting underlying architectural decisions and +source-domain training strategies. 
Instead, we dive deep into the details of +3D-ODs, focusing our efforts on fundamental factors that influence robustness +prior to domain adaptation. + We systematically investigate four design choices (and the interplay between +them) often overlooked in 3D-OD robustness and domain adaptation: architecture, +voxel encoding, data augmentations, and anchor strategies. We assess their +impact on the robustness of nine state-of-the-art 3D-ODs across six benchmarks +encompassing three types of domain gaps - sensor type, weather, and location. + Our main findings are: (1) transformer backbones with local point features +are more robust than 3D CNNs, (2) test-time anchor size adjustment is crucial +for adaptation across geographical locations, significantly boosting scores +without retraining, (3) source-domain augmentations allow the model to +generalize to low-resolution sensors, and (4) surprisingly, robustness to bad +weather is improved when training directly on more clean weather data than on +training with bad weather data. We outline our main conclusions and findings to +provide practical guidance on developing more robust 3D-ODs.",cs.CV,['cs.CV'] +FreeKD: Knowledge Distillation via Semantic Frequency Prompt,Yuan Zhang · Tao Huang · Jiaming Liu · Tao Jiang · Kuan Cheng · Shanghang Zhang, ,https://arxiv.org/abs/2311.12079,,2311.12079.pdf,FreeKD: Knowledge Distillation via Semantic Frequency Prompt,"Knowledge distillation (KD) has been applied to various tasks successfully, +and mainstream methods typically boost the student model via spatial imitation +losses. However, the consecutive downsamplings induced in the spatial domain of +teacher model is a type of corruption, hindering the student from analyzing +what specific information needs to be imitated, which results in accuracy +degradation. To better understand the underlying pattern of corrupted feature +maps, we shift our attention to the frequency domain. During frequency +distillation, we encounter a new challenge: the low-frequency bands convey +general but minimal context, while the high are more informative but also +introduce noise. Not each pixel within the frequency bands contributes equally +to the performance. To address the above problem: (1) We propose the Frequency +Prompt plugged into the teacher model, absorbing the semantic frequency context +during finetuning. (2) During the distillation period, a pixel-wise frequency +mask is generated via Frequency Prompt, to localize those pixel of interests +(PoIs) in various frequency bands. Additionally, we employ a position-aware +relational frequency loss for dense prediction tasks, delivering a high-order +spatial enhancement to the student model. We dub our Frequency Knowledge +Distillation method as FreeKD, which determines the optimal localization and +extent for the frequency distillation. Extensive experiments demonstrate that +FreeKD not only outperforms spatial-based distillation methods consistently on +dense prediction tasks (e.g., FreeKD brings 3.8 AP gains for RepPoints-R50 on +COCO2017 and 4.55 mIoU gains for PSPNet-R18 on Cityscapes), but also conveys +more robustness to the student. 
Notably, we also validate the generalization of +our approach on large-scale vision models (e.g., DINO and SAM).",cs.CV,['cs.CV'] +Diffusion-driven GAN Inversion for Multi-Modal Face Image Generation,Jihyun Kim · Changjae Oh · Hoseok Do · Soohyun Kim · Kwanghoon Sohn, ,https://arxiv.org/abs/2405.04356,,2405.04356.pdf,Diffusion-driven GAN Inversion for Multi-Modal Face Image Generation,"We present a new multi-modal face image generation method that converts a +text prompt and a visual input, such as a semantic mask or scribble map, into a +photo-realistic face image. To do this, we combine the strengths of Generative +Adversarial networks (GANs) and diffusion models (DMs) by employing the +multi-modal features in the DM into the latent space of the pre-trained GANs. +We present a simple mapping and a style modulation network to link two models +and convert meaningful representations in feature maps and attention maps into +latent codes. With GAN inversion, the estimated latent codes can be used to +generate 2D or 3D-aware facial images. We further present a multi-step training +strategy that reflects textual and structural representations into the +generated image. Our proposed network produces realistic 2D, multi-view, and +stylized face images, which align well with inputs. We validate our method by +using pre-trained 2D and 3D GANs, and our results outperform existing methods. +Our project page is available at +https://github.com/1211sh/Diffusion-driven_GAN-Inversion/.",cs.CV,['cs.CV'] +Probing Synergistic High-Order Interaction in Infrared and Visible Image Fusion,Naishan Zheng · Man Zhou · Jie Huang · Junming Hou · Haoying Li · Yuan Xu · Feng Zhao, ,,https://ieeexplore.ieee.org/document/10539339,,,,,nan +Bridging the Gap Between End-to-End and Two-Step Text Spotting,Mingxin Huang · Hongliang Li · Yuliang Liu · Xiang Bai · Lianwen Jin, ,https://arxiv.org/abs/2404.04624,,2404.04624.pdf,Bridging the Gap Between End-to-End and Two-Step Text Spotting,"Modularity plays a crucial role in the development and maintenance of complex +systems. While end-to-end text spotting efficiently mitigates the issues of +error accumulation and sub-optimal performance seen in traditional two-step +methodologies, the two-step methods continue to be favored in many competitions +and practical settings due to their superior modularity. In this paper, we +introduce Bridging Text Spotting, a novel approach that resolves the error +accumulation and suboptimal performance issues in two-step methods while +retaining modularity. To achieve this, we adopt a well-trained detector and +recognizer that are developed and trained independently and then lock their +parameters to preserve their already acquired capabilities. Subsequently, we +introduce a Bridge that connects the locked detector and recognizer through a +zero-initialized neural network. This zero-initialized neural network, +initialized with weights set to zeros, ensures seamless integration of the +large receptive field features in detection into the locked recognizer. +Furthermore, since the fixed detector and recognizer cannot naturally acquire +end-to-end optimization features, we adopt the Adapter to facilitate their +efficient learning of these features. We demonstrate the effectiveness of the +proposed method through extensive experiments: Connecting the latest detector +and recognizer through Bridging Text Spotting, we achieved an accuracy of 83.3% +on Total-Text, 69.8% on CTW1500, and 89.5% on ICDAR 2015. 
The code is available +at https://github.com/mxin262/Bridging-Text-Spotting.",cs.CV,['cs.CV'] +OED: Towards One-stage End-to-End Dynamic Scene Graph Generation,Guan Wang · Zhimin Li · Qingchao Chen · Yang Liu, ,https://arxiv.org/abs/2405.16925,,2405.16925.pdf,OED: Towards One-stage End-to-End Dynamic Scene Graph Generation,"Dynamic Scene Graph Generation (DSGG) focuses on identifying visual +relationships within the spatial-temporal domain of videos. Conventional +approaches often employ multi-stage pipelines, which typically consist of +object detection, temporal association, and multi-relation classification. +However, these methods exhibit inherent limitations due to the separation of +multiple stages, and independent optimization of these sub-problems may yield +sub-optimal solutions. To remedy these limitations, we propose a one-stage +end-to-end framework, termed OED, which streamlines the DSGG pipeline. This +framework reformulates the task as a set prediction problem and leverages +pair-wise features to represent each subject-object pair within the scene +graph. Moreover, another challenge of DSGG is capturing temporal dependencies, +we introduce a Progressively Refined Module (PRM) for aggregating temporal +context without the constraints of additional trackers or handcrafted +trajectories, enabling end-to-end optimization of the network. Extensive +experiments conducted on the Action Genome benchmark demonstrate the +effectiveness of our design. The code and models are available at +\url{https://github.com/guanw-pku/OED}.",cs.CV,['cs.CV'] +Unleashing the Potential of SAM for Medical Adaptation via Hierarchical Decoding,Zhiheng Cheng · Qingyue Wei · Hongru Zhu · Yan Wang · Liangqiong Qu · Wei Shao · Yuyin Zhou, ,https://arxiv.org/abs/2403.18271,,2403.18271.pdf,Unleashing the Potential of SAM for Medical Adaptation via Hierarchical Decoding,"The Segment Anything Model (SAM) has garnered significant attention for its +versatile segmentation abilities and intuitive prompt-based interface. However, +its application in medical imaging presents challenges, requiring either +substantial training costs and extensive medical datasets for full model +fine-tuning or high-quality prompts for optimal performance. This paper +introduces H-SAM: a prompt-free adaptation of SAM tailored for efficient +fine-tuning of medical images via a two-stage hierarchical decoding procedure. +In the initial stage, H-SAM employs SAM's original decoder to generate a prior +probabilistic mask, guiding a more intricate decoding process in the second +stage. Specifically, we propose two key designs: 1) A class-balanced, +mask-guided self-attention mechanism addressing the unbalanced label +distribution, enhancing image embedding; 2) A learnable mask cross-attention +mechanism spatially modulating the interplay among different image regions +based on the prior mask. Moreover, the inclusion of a hierarchical pixel +decoder in H-SAM enhances its proficiency in capturing fine-grained and +localized details. This approach enables SAM to effectively integrate learned +medical priors, facilitating enhanced adaptation for medical image segmentation +with limited samples. Our H-SAM demonstrates a 4.78% improvement in average +Dice compared to existing prompt-free SAM variants for multi-organ segmentation +using only 10% of 2D slices. Notably, without using any unlabeled data, H-SAM +even outperforms state-of-the-art semi-supervised models relying on extensive +unlabeled training data across various medical datasets. 
Our code is available +at https://github.com/Cccccczh404/H-SAM.",cs.CV,['cs.CV'] +Autoregressive Queries for Adaptive Tracking with Spatio-Temporal Transformers,Jinxia Xie · Bineng Zhong · Zhiyi Mo · Shengping Zhang · Liangtao Shi · Shuxiang Song · Rongrong Ji, ,https://arxiv.org/abs/2403.10574,,2403.10574.pdf,Autoregressive Queries for Adaptive Tracking with Spatio-TemporalTransformers,"The rich spatio-temporal information is crucial to capture the complicated +target appearance variations in visual tracking. However, most top-performing +tracking algorithms rely on many hand-crafted components for spatio-temporal +information aggregation. Consequently, the spatio-temporal information is far +away from being fully explored. To alleviate this issue, we propose an adaptive +tracker with spatio-temporal transformers (named AQATrack), which adopts simple +autoregressive queries to effectively learn spatio-temporal information without +many hand-designed components. Firstly, we introduce a set of learnable and +autoregressive queries to capture the instantaneous target appearance changes +in a sliding window fashion. Then, we design a novel attention mechanism for +the interaction of existing queries to generate a new query in current frame. +Finally, based on the initial target template and learnt autoregressive +queries, a spatio-temporal information fusion module (STM) is designed for +spatiotemporal formation aggregation to locate a target object. Benefiting from +the STM, we can effectively combine the static appearance and instantaneous +changes to guide robust tracking. Extensive experiments show that our method +significantly improves the tracker's performance on six popular tracking +benchmarks: LaSOT, LaSOText, TrackingNet, GOT-10k, TNL2K, and UAV123.",cs.CV,['cs.CV'] +GRAM: Global Reasoning for Multi-Page VQA,Itshak Blau · Sharon Fogel · Roi Ronen · Alona Golts · Shahar Tsiper · Elad Ben Avraham · Aviad Aberdam · Roy Ganz · Ron Litman, ,https://arxiv.org/abs/2401.03411,,2401.03411.pdf,GRAM: Global Reasoning for Multi-Page VQA,"The increasing use of transformer-based large language models brings forward +the challenge of processing long sequences. In document visual question +answering (DocVQA), leading methods focus on the single-page setting, while +documents can span hundreds of pages. We present GRAM, a method that seamlessly +extends pre-trained single-page models to the multi-page setting, without +requiring computationally-heavy pretraining. To do so, we leverage a +single-page encoder for local page-level understanding, and enhance it with +document-level designated layers and learnable tokens, facilitating the flow of +information across pages for global reasoning. To enforce our model to utilize +the newly introduced document tokens, we propose a tailored bias adaptation +method. For additional computational savings during decoding, we introduce an +optional compression stage using our compression-transformer +(C-Former),reducing the encoded sequence length, thereby allowing a tradeoff +between quality and latency. 
Extensive experiments showcase GRAM's +state-of-the-art performance on the benchmarks for multi-page DocVQA, +demonstrating the effectiveness of our approach.",cs.CL,"['cs.CL', 'cs.CV']" +Attribute-Guided Pedestrian Retrieval: Bridging Person Re-ID with Internal Attribute Variability,Yan Huang · Zhang Zhang · Qiang Wu · yi zhong · Liang Wang, ,,https://www.youtube.com/watch?v=5xrCUp_gdwg,,,,,nan +AnyScene: Customized Image Synthesis with Composited Foreground,Ruidong Chen · Lanjun Wang · Weizhi Nie · Yongdong Zhang · An-An Liu, ,https://ar5iv.labs.arxiv.org/html/2302.09778,,2302.09778.pdf,Composer: Creative and Controllable Image Synthesis with Composable Conditions,"Recent large-scale generative models learned on big data are capable of +synthesizing incredible images yet suffer from limited controllability. This +work offers a new generation paradigm that allows flexible control of the +output image, such as spatial layout and palette, while maintaining the +synthesis quality and model creativity. With compositionality as the core idea, +we first decompose an image into representative factors, and then train a +diffusion model with all these factors as the conditions to recompose the +input. At the inference stage, the rich intermediate representations work as +composable elements, leading to a huge design space (i.e., exponentially +proportional to the number of decomposed factors) for customizable content +creation. It is noteworthy that our approach, which we call Composer, supports +various levels of conditions, such as text description as the global +information, depth map and sketch as the local guidance, color histogram for +low-level details, etc. Besides improving controllability, we confirm that +Composer serves as a general framework and facilitates a wide range of +classical generative tasks without retraining. Code and models will be made +available.",cs.CV,"['cs.CV', 'cs.GR']" +Multiway Point Cloud Mosaicking with Diffusion and Global Optimization,Shengze Jin · Iro Armeni · Marc Pollefeys · Daniel Barath, ,https://arxiv.org/abs/2404.00429,,2404.00429.pdf,Multiway Point Cloud Mosaicking with Diffusion and Global Optimization,"We introduce a novel framework for multiway point cloud mosaicking (named +Wednesday), designed to co-align sets of partially overlapping point clouds -- +typically obtained from 3D scanners or moving RGB-D cameras -- into a unified +coordinate system. At the core of our approach is ODIN, a learned pairwise +registration algorithm that iteratively identifies overlaps and refines +attention scores, employing a diffusion-based process for denoising pairwise +correlation matrices to enhance matching accuracy. Further steps include +constructing a pose graph from all point clouds, performing rotation averaging, +a novel robust algorithm for re-estimating translations optimally in terms of +consensus maximization and translation optimization. Finally, the point cloud +rotations and positions are optimized jointly by a diffusion-based approach. +Tested on four diverse, large-scale datasets, our method achieves +state-of-the-art pairwise and multiway registration results by a large margin +on all benchmarks. 
Our code and models are available at +https://github.com/jinsz/Multiway-Point-Cloud-Mosaicking-with-Diffusion-and-Global-Optimization.",cs.CV,['cs.CV'] +Dexterous Grasp Transformer,Guo-Hao Xu · Yi-Lin Wei · Dian Zheng · Xiao-Ming Wu · Wei-Shi Zheng, ,https://arxiv.org/abs/2404.18135,,2404.18135.pdf,Dexterous Grasp Transformer,"In this work, we propose a novel discriminative framework for dexterous grasp +generation, named Dexterous Grasp TRansformer (DGTR), capable of predicting a +diverse set of feasible grasp poses by processing the object point cloud with +only one forward pass. We formulate dexterous grasp generation as a set +prediction task and design a transformer-based grasping model for it. However, +we identify that this set prediction paradigm encounters several optimization +challenges in the field of dexterous grasping and results in restricted +performance. To address these issues, we propose progressive strategies for +both the training and testing phases. First, the dynamic-static matching +training (DSMT) strategy is presented to enhance the optimization stability +during the training phase. Second, we introduce the adversarial-balanced +test-time adaptation (AB-TTA) with a pair of adversarial losses to improve +grasping quality during the testing phase. Experimental results on the +DexGraspNet dataset demonstrate the capability of DGTR to predict dexterous +grasp poses with both high quality and diversity. Notably, while keeping high +quality, the diversity of grasp poses predicted by DGTR significantly +outperforms previous works in multiple metrics without any data pre-processing. +Codes are available at https://github.com/iSEE-Laboratory/DGTR .",cs.RO,['cs.RO'] +MoCha-Stereo: Motif Channel Attention Network for Stereo Matching,Ziyang Chen · Wei Long · He Yao · Yongjun Zhang · Bingshu Wang · Yongbin Qin · Jia Wu,https://github.com/ZYangChen/MoCha-Stereo,https://arxiv.org/abs/2404.06842,,2404.06842.pdf,MoCha-Stereo: Motif Channel Attention Network for Stereo Matching,"Learning-based stereo matching techniques have made significant progress. +However, existing methods inevitably lose geometrical structure information +during the feature channel generation process, resulting in edge detail +mismatches. In this paper, the Motif Channel Attention Stereo Matching Network +(MoCha-Stereo) is designed to address this problem. We provide the Motif +Channel Correlation Volume (MCCV) to determine more accurate edge matching +costs. MCCV is achieved by projecting motif channels, which capture common +geometric structures in feature channels, onto feature maps and cost volumes. +In addition, edge variations in potential feature channels of the +reconstruction error map also affect details matching, we propose the +Reconstruction Error Motif Penalty (REMP) module to further refine the +full-resolution disparity estimation. REMP integrates the frequency information +of typical channel features from the reconstruction error. MoCha-Stereo ranks +1st on the KITTI-2015 and KITTI-2012 Reflective leaderboards. Our structure +also shows excellent performance in Multi-View Stereo. 
Code is avaliable at +https://github.com/ZYangChen/MoCha-Stereo.",cs.CV,['cs.CV'] +Spacetime Gaussian Feature Splatting for Real-Time Dynamic View Synthesis,Zhan Li · Zhang Chen · Zhong Li · Yi Xu,https://oppo-us-research.github.io/SpacetimeGaussians-website/,https://arxiv.org/abs/2312.16812,,2312.16812.pdf,Spacetime Gaussian Feature Splatting for Real-Time Dynamic View Synthesis,"Novel view synthesis of dynamic scenes has been an intriguing yet challenging +problem. Despite recent advancements, simultaneously achieving high-resolution +photorealistic results, real-time rendering, and compact storage remains a +formidable task. To address these challenges, we propose Spacetime Gaussian +Feature Splatting as a novel dynamic scene representation, composed of three +pivotal components. First, we formulate expressive Spacetime Gaussians by +enhancing 3D Gaussians with temporal opacity and parametric motion/rotation. +This enables Spacetime Gaussians to capture static, dynamic, as well as +transient content within a scene. Second, we introduce splatted feature +rendering, which replaces spherical harmonics with neural features. These +features facilitate the modeling of view- and time-dependent appearance while +maintaining small size. Third, we leverage the guidance of training error and +coarse depth to sample new Gaussians in areas that are challenging to converge +with existing pipelines. Experiments on several established real-world datasets +demonstrate that our method achieves state-of-the-art rendering quality and +speed, while retaining compact storage. At 8K resolution, our lite-version +model can render at 60 FPS on an Nvidia RTX 4090 GPU. Our code is available at +https://github.com/oppo-us-research/SpacetimeGaussians.",cs.CV,"['cs.CV', 'cs.GR']" +MoReVQA: Exploring Modular Reasoning Models for Video Question Answering,Juhong Min · Shyamal Buch · Arsha Nagrani · Minsu Cho · Cordelia Schmid, ,https://arxiv.org/abs/2404.06511,,2404.06511.pdf,MoReVQA: Exploring Modular Reasoning Models for Video Question Answering,"This paper addresses the task of video question answering (videoQA) via a +decomposed multi-stage, modular reasoning framework. Previous modular methods +have shown promise with a single planning stage ungrounded in visual content. +However, through a simple and effective baseline, we find that such systems can +lead to brittle behavior in practice for challenging videoQA settings. Thus, +unlike traditional single-stage planning methods, we propose a multi-stage +system consisting of an event parser, a grounding stage, and a final reasoning +stage in conjunction with an external memory. All stages are training-free, and +performed using few-shot prompting of large models, creating interpretable +intermediate outputs at each stage. By decomposing the underlying planning and +task complexity, our method, MoReVQA, improves over prior work on standard +videoQA benchmarks (NExT-QA, iVQA, EgoSchema, ActivityNet-QA) with +state-of-the-art results, and extensions to related tasks (grounded videoQA, +paragraph captioning).",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +DiffAvatar: Simulation-Ready Garment Optimization with Differentiable Simulation,Yifei Li · Hsiaoyu Chen · Egor Larionov · Nikolaos Sarafianos · Wojciech Matusik · Tuur Stuyck, ,https://arxiv.org/abs/2311.12194,,2311.12194.pdf,DiffAvatar: Simulation-Ready Garment Optimization with Differentiable Simulation,"The realism of digital avatars is crucial in enabling telepresence +applications with self-expression and customization. 
While physical simulations +can produce realistic motions for clothed humans, they require high-quality +garment assets with associated physical parameters for cloth simulations. +However, manually creating these assets and calibrating their parameters is +labor-intensive and requires specialized expertise. Current methods focus on +reconstructing geometry, but don't generate complete assets for physics-based +applications. To address this gap, we propose DiffAvatar, a novel approach that +performs body and garment co-optimization using differentiable simulation. By +integrating physical simulation into the optimization loop and accounting for +the complex nonlinear behavior of cloth and its intricate interaction with the +body, our framework recovers body and garment geometry and extracts important +material parameters in a physically plausible way. Our experiments demonstrate +that our approach generates realistic clothing and body shape suitable for +downstream applications. We provide additional insights and results on our +webpage: https://people.csail.mit.edu/liyifei/publication/diffavatar/",cs.CV,['cs.CV'] +SPOT: Self-Training with Patch-Order Permutation for Object-Centric Learning with Autoregressive Transformers,Ioannis Kakogeorgiou · Spyros Gidaris · Konstantinos Karantzalos · Nikos Komodakis, ,https://arxiv.org/abs/2312.00648,,2312.00648.pdf,SPOT: Self-Training with Patch-Order Permutation for Object-Centric Learning with Autoregressive Transformers,"Unsupervised object-centric learning aims to decompose scenes into +interpretable object entities, termed slots. Slot-based auto-encoders stand out +as a prominent method for this task. Within them, crucial aspects include +guiding the encoder to generate object-specific slots and ensuring the decoder +utilizes them during reconstruction. This work introduces two novel techniques, +(i) an attention-based self-training approach, which distills superior +slot-based attention masks from the decoder to the encoder, enhancing object +segmentation, and (ii) an innovative patch-order permutation strategy for +autoregressive transformers that strengthens the role of slot vectors in +reconstruction. The effectiveness of these strategies is showcased +experimentally. The combined approach significantly surpasses prior slot-based +autoencoder methods in unsupervised object segmentation, especially with +complex real-world images. We provide the implementation code at +https://github.com/gkakogeorgiou/spot .",cs.CV,['cs.CV'] +Color Shift Estimation-and-Correction for Image Enhancement,Yiyu Li · Ke Xu · Gerhard Hancke · Rynson W.H. Lau, ,https://arxiv.org/abs/2405.17725,,2405.17725.pdf,Color Shift Estimation-and-Correction for Image Enhancement,"Images captured under sub-optimal illumination conditions may contain both +over- and under-exposures. Current approaches mainly focus on adjusting image +brightness, which may exacerbate the color tone distortion in under-exposed +areas and fail to restore accurate colors in over-exposed regions. We observe +that over- and under-exposed regions display opposite color tone distribution +shifts with respect to each other, which may not be easily normalized in joint +modeling as they usually do not have ``normal-exposed'' regions/pixels as +reference. In this paper, we propose a novel method to enhance images with both +over- and under-exposures by learning to estimate and correct such color +shifts. 
Specifically, we first derive the color feature maps of the brightened +and darkened versions of the input image via a UNet-based network, followed by +a pseudo-normal feature generator to produce pseudo-normal color feature maps. +We then propose a novel COlor Shift Estimation (COSE) module to estimate the +color shifts between the derived brightened (or darkened) color feature maps +and the pseudo-normal color feature maps. The COSE module corrects the +estimated color shifts of the over- and under-exposed regions separately. We +further propose a novel COlor MOdulation (COMO) module to modulate the +separately corrected colors in the over- and under-exposed regions to produce +the enhanced image. Comprehensive experiments show that our method outperforms +existing approaches. Project webpage: https://github.com/yiyulics/CSEC.",cs.CV,['cs.CV'] +Human Gaussian Splatting : Real-time Rendering of Animatable Avatars,Arthur Moreau · Jifei Song · Helisa Dhamo · Richard Shaw · Yiren Zhou · Eduardo Pérez-Pellitero,https://perezpellitero.github.io/projects/hugs/index.html,https://arxiv.org/abs/2311.17113,,2311.17113.pdf,Human Gaussian Splatting: Real-time Rendering of Animatable Avatars,"This work addresses the problem of real-time rendering of photorealistic +human body avatars learned from multi-view videos. While the classical +approaches to model and render virtual humans generally use a textured mesh, +recent research has developed neural body representations that achieve +impressive visual quality. However, these models are difficult to render in +real-time and their quality degrades when the character is animated with body +poses different than the training observations. We propose an animatable human +model based on 3D Gaussian Splatting, that has recently emerged as a very +efficient alternative to neural radiance fields. The body is represented by a +set of gaussian primitives in a canonical space which is deformed with a coarse +to fine approach that combines forward skinning and local non-rigid refinement. +We describe how to learn our Human Gaussian Splatting (HuGS) model in an +end-to-end fashion from multi-view observations, and evaluate it against the +state-of-the-art approaches for novel pose synthesis of clothed body. Our +method achieves 1.5 dB PSNR improvement over the state-of-the-art on THuman4 +dataset while being able to render in real-time (80 fps for 512x512 +resolution).",cs.CV,"['cs.CV', 'cs.GR']" +Boosting Spike Camera Image Reconstruction from a Perspective of Dealing with Spike Fluctuations,Rui Zhao · Ruiqin Xiong · Jing Zhao · Jian Zhang · Xiaopeng Fan · Zhaofei Yu · Tiejun Huang, ,https://ar5iv.labs.arxiv.org/html/2303.11684,,2303.11684.pdf,SpikeCV: Open a Continuous Computer Vision Era,"SpikeCV is a new open-source computer vision platform for the spike camera, +which is a neuromorphic visual sensor that has developed rapidly in recent +years. In the spike camera, each pixel position directly accumulates the light +intensity and asynchronously fires spikes. The output binary spikes can reach a +frequency of 40,000 Hz. As a new type of visual expression, spike sequence has +high spatiotemporal completeness and preserves the continuous visual +information of the external world. Taking advantage of the low latency and high +dynamic range of the spike camera, many spike-based algorithms have made +significant progress, such as high-quality imaging and ultra-high-speed target +detection. 
+ To build up a community ecology for the spike vision to facilitate more users +to take advantage of the spike camera, SpikeCV provides a variety of +ultra-high-speed scene datasets, hardware interfaces, and an easy-to-use +modules library. SpikeCV focuses on encapsulation for spike data, +standardization for dataset interfaces, modularization for vision tasks, and +real-time applications for challenging scenes. With the advent of the +open-source Python ecosystem, modules of SpikeCV can be used as a Python +library to fulfilled most of the numerical analysis needs of researchers. We +demonstrate the efficiency of the SpikeCV on offline inference and real-time +applications. The project repository address are +\url{https://openi.pcl.ac.cn/Cordium/SpikeCV} and +\url{https://github.com/Zyj061/SpikeCV",cs.CV,['cs.CV'] +DiffHuman: Probabilistic Photorealistic 3D Reconstruction of Humans,Akash Sengupta · Thiemo Alldieck · NIKOS KOLOTOUROS · Enric Corona · Andrei Zanfir · Cristian Sminchisescu,https://akashsengupta1997.github.io/diffhuman/,https://arxiv.org/abs/2404.00485,,2404.00485.pdf,DiffHuman: Probabilistic Photorealistic 3D Reconstruction of Humans,"We present DiffHuman, a probabilistic method for photorealistic 3D human +reconstruction from a single RGB image. Despite the ill-posed nature of this +problem, most methods are deterministic and output a single solution, often +resulting in a lack of geometric detail and blurriness in unseen or uncertain +regions. In contrast, DiffHuman predicts a probability distribution over 3D +reconstructions conditioned on an input 2D image, which allows us to sample +multiple detailed 3D avatars that are consistent with the image. DiffHuman is +implemented as a conditional diffusion model that denoises pixel-aligned 2D +observations of an underlying 3D shape representation. During inference, we may +sample 3D avatars by iteratively denoising 2D renders of the predicted 3D +representation. Furthermore, we introduce a generator neural network that +approximates rendering with considerably reduced runtime (55x speed up), +resulting in a novel dual-branch diffusion framework. Our experiments show that +DiffHuman can produce diverse and detailed reconstructions for the parts of the +person that are unseen or uncertain in the input image, while remaining +competitive with the state-of-the-art when reconstructing visible surfaces.",cs.CV,['cs.CV'] +LASIL: Learner-Aware Supervised Imitation Learning For Long-term Microscopic Traffic Simulation,Ke Guo · Zhenwei Miao · Wei Jing · Weiwei Liu · Weizi Li · Dayang Hao · Jia Pan,https://sites.google.com/view/lasil,https://arxiv.org/abs/2403.17601,,2403.17601.pdf,LASIL: Learner-Aware Supervised Imitation Learning For Long-term Microscopic Traffic Simulation,"Microscopic traffic simulation plays a crucial role in transportation +engineering by providing insights into individual vehicle behavior and overall +traffic flow. However, creating a realistic simulator that accurately +replicates human driving behaviors in various traffic conditions presents +significant challenges. Traditional simulators relying on heuristic models +often fail to deliver accurate simulations due to the complexity of real-world +traffic environments. Due to the covariate shift issue, existing imitation +learning-based simulators often fail to generate stable long-term simulations. 
+In this paper, we propose a novel approach called learner-aware supervised +imitation learning to address the covariate shift problem in multi-agent +imitation learning. By leveraging a variational autoencoder simultaneously +modeling the expert and learner state distribution, our approach augments +expert states such that the augmented state is aware of learner state +distribution. Our method, applied to urban traffic simulation, demonstrates +significant improvements over existing state-of-the-art baselines in both +short-term microscopic and long-term macroscopic realism when evaluated on the +real-world dataset pNEUMA.",cs.AI,"['cs.AI', 'cs.LG']" +Sparse Semi-Detr: Sparse Learnable Queries for Semi-Supervised Object Detection,Tahira Shehzadi · Khurram Azeem Hashmi · Didier Stricker · Muhammad Zeshan Afzal, ,https://arxiv.org/abs/2404.01819,,2404.01819.pdf,Sparse Semi-DETR: Sparse Learnable Queries for Semi-Supervised Object Detection,"In this paper, we address the limitations of the DETR-based semi-supervised +object detection (SSOD) framework, particularly focusing on the challenges +posed by the quality of object queries. In DETR-based SSOD, the one-to-one +assignment strategy provides inaccurate pseudo-labels, while the one-to-many +assignments strategy leads to overlapping predictions. These issues compromise +training efficiency and degrade model performance, especially in detecting +small or occluded objects. We introduce Sparse Semi-DETR, a novel +transformer-based, end-to-end semi-supervised object detection solution to +overcome these challenges. Sparse Semi-DETR incorporates a Query Refinement +Module to enhance the quality of object queries, significantly improving +detection capabilities for small and partially obscured objects. Additionally, +we integrate a Reliable Pseudo-Label Filtering Module that selectively filters +high-quality pseudo-labels, thereby enhancing detection accuracy and +consistency. On the MS-COCO and Pascal VOC object detection benchmarks, Sparse +Semi-DETR achieves a significant improvement over current state-of-the-art +methods that highlight Sparse Semi-DETR's effectiveness in semi-supervised +object detection, particularly in challenging scenarios involving small or +partially obscured objects.",cs.CV,['cs.CV'] +CityDreamer: Compositional Generative Model of Unbounded 3D Cities,Haozhe Xie · Zhaoxi Chen · Fangzhou Hong · Ziwei Liu,https://www.infinitescript.com/project/city-dreamer,https://arxiv.org/abs/2309.00610,,2309.00610.pdf,CityDreamer: Compositional Generative Model of Unbounded 3D Cities,"3D city generation is a desirable yet challenging task, since humans are more +sensitive to structural distortions in urban environments. Additionally, +generating 3D cities is more complex than 3D natural scenes since buildings, as +objects of the same class, exhibit a wider range of appearances compared to the +relatively consistent appearance of objects like trees in natural scenes. To +address these challenges, we propose \textbf{CityDreamer}, a compositional +generative model designed specifically for unbounded 3D cities. Our key insight +is that 3D city generation should be a composition of different types of neural +fields: 1) various building instances, and 2) background stuff, such as roads +and green lands. Specifically, we adopt the bird's eye view scene +representation and employ a volumetric render for both instance-oriented and +stuff-oriented neural fields. 
The generative hash grid and periodic positional +embedding are tailored as scene parameterization to suit the distinct +characteristics of building instances and background stuff. Furthermore, we +contribute a suite of CityGen Datasets, including OSM and GoogleEarth, which +comprises a vast amount of real-world city imagery to enhance the realism of +the generated 3D cities both in their layouts and appearances. CityDreamer +achieves state-of-the-art performance not only in generating realistic 3D +cities but also in localized editing within the generated cities.",cs.CV,['cs.CV'] +"One-dimensional Adapter to Rule Them All: Concepts, Diffusion Models and Erasing Applications",Mengyao Lyu · Yuhong Yang · Haiwen Hong · Hui Chen · Xuan Jin · Yuan He · Hui Xue · Jungong Han · Guiguang Ding,https://lyumengyao.github.io/projects/spm,https://arxiv.org/abs/2312.16145,,2312.16145.pdf,"One-Dimensional Adapter to Rule Them All: Concepts, Diffusion Models and Erasing Applications","The prevalent use of commercial and open-source diffusion models (DMs) for +text-to-image generation prompts risk mitigation to prevent undesired +behaviors. Existing concept erasing methods in academia are all based on full +parameter or specification-based fine-tuning, from which we observe the +following issues: 1) Generation alternation towards erosion: Parameter drift +during target elimination causes alternations and potential deformations across +all generations, even eroding other concepts at varying degrees, which is more +evident with multi-concept erased; 2) Transfer inability & deployment +inefficiency: Previous model-specific erasure impedes the flexible combination +of concepts and the training-free transfer towards other models, resulting in +linear cost growth as the deployment scenarios increase. To achieve +non-invasive, precise, customizable, and transferable elimination, we ground +our erasing framework on one-dimensional adapters to erase multiple concepts +from most DMs at once across versatile erasing applications. The +concept-SemiPermeable structure is injected as a Membrane (SPM) into any DM to +learn targeted erasing, and meantime the alteration and erosion phenomenon is +effectively mitigated via a novel Latent Anchoring fine-tuning strategy. Once +obtained, SPMs can be flexibly combined and plug-and-play for other DMs without +specific re-tuning, enabling timely and efficient adaptation to diverse +scenarios. During generation, our Facilitated Transport mechanism dynamically +regulates the permeability of each SPM to respond to different input prompts, +further minimizing the impact on other concepts. Quantitative and qualitative +results across ~40 concepts, 7 DMs and 4 erasing applications have demonstrated +the superior erasing of SPM. Our code and pre-tuned SPMs are available on the +project page https://lyumengyao.github.io/projects/spm.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +Modality-agnostic Domain Generalizable Medical Image Segmentation by Multi-Frequency in Multi-Scale Attention,Ju-Hyeon Nam · Nur Suriza Syazwany · Su Jung Kim · Sang-Chul Lee,https://skawngus1111.github.io/MADGNet_project/,https://arxiv.org/abs/2405.06284,,2405.06284.pdf,Modality-agnostic Domain Generalizable Medical Image Segmentation by Multi-Frequency in Multi-Scale Attention,"Generalizability in deep neural networks plays a pivotal role in medical +image segmentation. 
However, deep learning-based medical image analyses tend to +overlook the importance of frequency variance, which is critical element for +achieving a model that is both modality-agnostic and domain-generalizable. +Additionally, various models fail to account for the potential information loss +that can arise from multi-task learning under deep supervision, a factor that +can impair the model representation ability. To address these challenges, we +propose a Modality-agnostic Domain Generalizable Network (MADGNet) for medical +image segmentation, which comprises two key components: a Multi-Frequency in +Multi-Scale Attention (MFMSA) block and Ensemble Sub-Decoding Module (E-SDM). +The MFMSA block refines the process of spatial feature extraction, particularly +in capturing boundary features, by incorporating multi-frequency and +multi-scale features, thereby offering informative cues for tissue outline and +anatomical structures. Moreover, we propose E-SDM to mitigate information loss +in multi-task learning with deep supervision, especially during substantial +upsampling from low resolution. We evaluate the segmentation performance of +MADGNet across six modalities and fifteen datasets. Through extensive +experiments, we demonstrate that MADGNet consistently outperforms +state-of-the-art models across various modalities, showcasing superior +segmentation performance. This affirms MADGNet as a robust solution for medical +image segmentation that excels in diverse imaging scenarios. Our MADGNet code +is available in GitHub Link.",eess.IV,"['eess.IV', 'cs.CV', 'cs.LG']" +CoG-DQA: Chain-of-Guiding Learning with Large Language Models for Diagram Question Answering,Shaowei Wang · Lingling Zhang · Longji Zhu · Tao Qin · Kim-Hui Yap · Xinyu Zhang · Jun Liu, ,https://arxiv.org/abs/2312.17269,,2312.17269.pdf,Conversational Question Answering with Reformulations over Knowledge Graph,"Conversational question answering (convQA) over knowledge graphs (KGs) +involves answering multi-turn natural language questions about information +contained in a KG. State-of-the-art methods of ConvQA often struggle with +inexplicit question-answer pairs. These inputs are easy for human beings to +understand given a conversation history, but hard for a machine to interpret, +which can degrade ConvQA performance. To address this problem, we propose a +reinforcement learning (RL) based model, CornNet, which utilizes question +reformulations generated by large language models (LLMs) to improve ConvQA +performance. CornNet adopts a teacher-student architecture where a teacher +model learns question representations using human writing reformulations, and a +student model to mimic the teacher model's output via reformulations generated +by LLMs. The learned question representation is then used by an RL model to +locate the correct answer in a KG. Extensive experimental results show that +CornNet outperforms state-of-the-art convQA models.",cs.CL,"['cs.CL', 'cs.AI']" +Lane2Seq: Towards Unified Lane Detection via Sequence Generation,Kunyang Zhou,https://zkyseu.github.io/lane2seq.github.io/,https://arxiv.org/abs/2402.17172,,2402.17172.pdf,Lane2Seq: Towards Unified Lane Detection via Sequence Generation,"In this paper, we present a novel sequence generation-based framework for +lane detection, called Lane2Seq. It unifies various lane detection formats by +casting lane detection as a sequence generation task. 
This is different from +previous lane detection methods, which depend on well-designed task-specific +head networks and corresponding loss functions. Lane2Seq only adopts a plain +transformer-based encoder-decoder architecture with a simple cross-entropy +loss. Additionally, we propose a new multi-format model tuning based on +reinforcement learning to incorporate the task-specific knowledge into +Lane2Seq. Experimental results demonstrate that such a simple sequence +generation paradigm not only unifies lane detection but also achieves +competitive performance on benchmarks. For example, Lane2Seq gets 97.95\% and +97.42\% F1 score on Tusimple and LLAMAS datasets, establishing a new +state-of-the-art result for two benchmarks.",cs.CV,['cs.CV'] +Scaling Up Dynamic 3D Human-Scene Interaction Modelling,Nan Jiang · Zhiyuan Zhang · Hongjie Li · Xiaoxuan Ma · Zan Wang · Yixin Chen · Tengyu Liu · Yixin Zhu · Siyuan Huang, ,https://arxiv.org/abs/2403.08629,,2403.08629.pdf,Scaling Up Dynamic Human-Scene Interaction Modeling,"Confronting the challenges of data scarcity and advanced motion synthesis in +human-scene interaction modeling, we introduce the TRUMANS dataset alongside a +novel HSI motion synthesis method. TRUMANS stands as the most comprehensive +motion-captured HSI dataset currently available, encompassing over 15 hours of +human interactions across 100 indoor scenes. It intricately captures whole-body +human motions and part-level object dynamics, focusing on the realism of +contact. This dataset is further scaled up by transforming physical +environments into exact virtual models and applying extensive augmentations to +appearance and motion for both humans and objects while maintaining interaction +fidelity. Utilizing TRUMANS, we devise a diffusion-based autoregressive model +that efficiently generates HSI sequences of any length, taking into account +both scene context and intended actions. In experiments, our approach shows +remarkable zero-shot generalizability on a range of 3D scene datasets (e.g., +PROX, Replica, ScanNet, ScanNet++), producing motions that closely mimic +original motion-captured sequences, as confirmed by quantitative experiments +and human studies.",cs.CV,['cs.CV'] +QUADify: Extracting Meshes with Pixel-level Details and Materials from Images,Maximilian Frühauf · Hayko Riemenschneider · Markus Gross · Christopher Schroers,https://maxfruehauf.com/publications/fruehauf2024quadify/drs_project_page/,,https://www.youtube.com/watch?v=n8M9c9yKGMk,,,,,nan +BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning,Ruyang Liu · Chen Li · Yixiao Ge · Thomas H. Li · Ying Shan · Ge Li, ,http://export.arxiv.org/abs/2309.15785,,2309.15785.pdf,One For All: Video Conversation is Feasible Without Video Instruction Tuning,"The recent progress in Large Language Models (LLM) has spurred various +advancements in image-language conversation agents, while how to build a +proficient video-based dialogue system is still under exploration. Considering +the extensive scale of LLM and visual backbone, minimal GPU memory is left for +facilitating effective temporal modeling, which is crucial for comprehending +and providing feedback on videos. To this end, we propose Branching Temporal +Adapter (BT-Adapter), a novel method for extending image-language pretrained +models into the video domain. Specifically, BT-Adapter serves as a plug-and-use +temporal modeling branch alongside the pretrained visual encoder, which is +tuned while keeping the backbone frozen. 
Just pretrained once, BT-Adapter can +be seamlessly integrated into all image conversation models using this version +of CLIP, enabling video conversations without the need for video instructions. +Besides, we develop a unique asymmetric token masking strategy inside the +branch with tailor-made training tasks for BT-Adapter, facilitating faster +convergence and better results. Thanks to BT-Adapter, we are able to empower +existing multimodal dialogue models with strong video understanding +capabilities without incurring excessive GPU costs. Without bells and whistles, +BT-Adapter achieves (1) state-of-the-art zero-shot results on various video +tasks using thousands of fewer GPU hours. (2) better performance than current +video chatbots without any video instruction tuning. (3) state-of-the-art +results of video chatting using video instruction tuning, outperforming +previous SOTAs by a large margin.",cs.CV,['cs.CV'] +Exploring the Potential of Large Foundation Models for Open-Vocabulary HOI Detection,Ting Lei · Shaofeng Yin · Yang Liu, ,https://arxiv.org/abs/2404.06194,,2404.06194.pdf,Exploring the Potential of Large Foundation Models for Open-Vocabulary HOI Detection,"Open-vocabulary human-object interaction (HOI) detection, which is concerned +with the problem of detecting novel HOIs guided by natural language, is crucial +for understanding human-centric scenes. However, prior zero-shot HOI detectors +often employ the same levels of feature maps to model HOIs with varying +distances, leading to suboptimal performance in scenes containing human-object +pairs with a wide range of distances. In addition, these detectors primarily +rely on category names and overlook the rich contextual information that +language can provide, which is essential for capturing open vocabulary concepts +that are typically rare and not well-represented by category names alone. In +this paper, we introduce a novel end-to-end open vocabulary HOI detection +framework with conditional multi-level decoding and fine-grained semantic +enhancement (CMD-SE), harnessing the potential of Visual-Language Models +(VLMs). Specifically, we propose to model human-object pairs with different +distances with different levels of feature maps by incorporating a soft +constraint during the bipartite matching process. Furthermore, by leveraging +large language models (LLMs) such as GPT models, we exploit their extensive +world knowledge to generate descriptions of human body part states for various +interactions. Then we integrate the generalizable and fine-grained semantics of +human body parts to improve interaction recognition. Experimental results on +two datasets, SWIG-HOI and HICO-DET, demonstrate that our proposed method +achieves state-of-the-art results in open vocabulary HOI detection. The code +and models are available at https://github.com/ltttpku/CMD-SE-release.",cs.CV,['cs.CV'] +SIFU: Side-view Conditioned Implicit Function for Real-world Usable Clothed Human Reconstruction,Zechuan Zhang · Zongxin Yang · Yi Yang,https://river-zhang.github.io/SIFU-projectpage/,https://arxiv.org/abs/2312.06704,,2312.06704.pdf,SIFU: Side-view Conditioned Implicit Function for Real-world Usable Clothed Human Reconstruction,"Creating high-quality 3D models of clothed humans from single images for +real-world applications is crucial. 
Despite recent advancements, accurately +reconstructing humans in complex poses or with loose clothing from in-the-wild +images, along with predicting textures for unseen areas, remains a significant +challenge. A key limitation of previous methods is their insufficient prior +guidance in transitioning from 2D to 3D and in texture prediction. In response, +we introduce SIFU (Side-view Conditioned Implicit Function for Real-world +Usable Clothed Human Reconstruction), a novel approach combining a Side-view +Decoupling Transformer with a 3D Consistent Texture Refinement pipeline.SIFU +employs a cross-attention mechanism within the transformer, using SMPL-X +normals as queries to effectively decouple side-view features in the process of +mapping 2D features to 3D. This method not only improves the precision of the +3D models but also their robustness, especially when SMPL-X estimates are not +perfect. Our texture refinement process leverages text-to-image diffusion-based +prior to generate realistic and consistent textures for invisible views. +Through extensive experiments, SIFU surpasses SOTA methods in both geometry and +texture reconstruction, showcasing enhanced robustness in complex scenarios and +achieving an unprecedented Chamfer and P2S measurement. Our approach extends to +practical applications such as 3D printing and scene building, demonstrating +its broad utility in real-world scenarios. Project page +https://river-zhang.github.io/SIFU-projectpage/ .",cs.CV,['cs.CV'] +Adapting Visual-Language Models for Generalizable Anomaly Detection in Medical Images,Chaoqin Huang · Aofan Jiang · Jinghao Feng · Ya Zhang · Xinchao Wang · Yanfeng Wang, ,https://arxiv.org/abs/2403.12570,,2403.12570.pdf,Adapting Visual-Language Models for Generalizable Anomaly Detection in Medical Images,"Recent advancements in large-scale visual-language pre-trained models have +led to significant progress in zero-/few-shot anomaly detection within natural +image domains. However, the substantial domain divergence between natural and +medical images limits the effectiveness of these methodologies in medical +anomaly detection. This paper introduces a novel lightweight multi-level +adaptation and comparison framework to repurpose the CLIP model for medical +anomaly detection. Our approach integrates multiple residual adapters into the +pre-trained visual encoder, enabling a stepwise enhancement of visual features +across different levels. This multi-level adaptation is guided by multi-level, +pixel-wise visual-language feature alignment loss functions, which recalibrate +the model's focus from object semantics in natural imagery to anomaly +identification in medical images. The adapted features exhibit improved +generalization across various medical data types, even in zero-shot scenarios +where the model encounters unseen medical modalities and anatomical regions +during training. Our experiments on medical anomaly detection benchmarks +demonstrate that our method significantly surpasses current state-of-the-art +models, with an average AUC improvement of 6.24% and 7.33% for anomaly +classification, 2.03% and 2.37% for anomaly segmentation, under the zero-shot +and few-shot settings, respectively. 
Source code is available at: +https://github.com/MediaBrain-SJTU/MVFA-AD",cs.CV,['cs.CV'] +Panacea: Panoramic and Controllable Video Generation for Autonomous Driving,Yuqing Wen · Yucheng Zhao · Yingfei Liu · Fan Jia · Yanhui Wang · Chong Luo · Chi Zhang · Tiancai Wang · Xiaoyan Sun · Xiangyu Zhang, ,https://arxiv.org/abs/2311.16813,,2311.16813.pdf,Panacea: Panoramic and Controllable Video Generation for Autonomous Driving,"The field of autonomous driving increasingly demands high-quality annotated +training data. In this paper, we propose Panacea, an innovative approach to +generate panoramic and controllable videos in driving scenarios, capable of +yielding an unlimited numbers of diverse, annotated samples pivotal for +autonomous driving advancements. Panacea addresses two critical challenges: +'Consistency' and 'Controllability.' Consistency ensures temporal and +cross-view coherence, while Controllability ensures the alignment of generated +content with corresponding annotations. Our approach integrates a novel 4D +attention and a two-stage generation pipeline to maintain coherence, +supplemented by the ControlNet framework for meticulous control by the +Bird's-Eye-View (BEV) layouts. Extensive qualitative and quantitative +evaluations of Panacea on the nuScenes dataset prove its effectiveness in +generating high-quality multi-view driving-scene videos. This work notably +propels the field of autonomous driving by effectively augmenting the training +dataset used for advanced BEV perception techniques.",cs.CV,['cs.CV'] +No Time to Train: Empowering Non-Parametric Networks for Few-shot 3D Scene Segmentation,Xiangyang Zhu · Renrui Zhang · Bowei He · Ziyu Guo · Jiaming Liu · Han Xiao · Chaoyou Fu · Hao Dong · Peng Gao, ,https://arxiv.org/abs/2404.04050,,2404.04050.pdf,No Time to Train: Empowering Non-Parametric Networks for Few-shot 3D Scene Segmentation,"To reduce the reliance on large-scale datasets, recent works in 3D +segmentation resort to few-shot learning. Current 3D few-shot segmentation +methods first pre-train models on 'seen' classes, and then evaluate their +generalization performance on 'unseen' classes. However, the prior pre-training +stage not only introduces excessive time overhead but also incurs a significant +domain gap on 'unseen' classes. To tackle these issues, we propose a +Non-parametric Network for few-shot 3D Segmentation, Seg-NN, and its Parametric +variant, Seg-PN. Without training, Seg-NN extracts dense representations by +hand-crafted filters and achieves comparable performance to existing parametric +models. Due to the elimination of pre-training, Seg-NN can alleviate the domain +gap issue and save a substantial amount of time. Based on Seg-NN, Seg-PN only +requires training a lightweight QUEry-Support Transferring (QUEST) module, +which enhances the interaction between the support set and query set. 
+Experiments suggest that Seg-PN outperforms previous state-of-the-art method by ++4.19% and +7.71% mIoU on S3DIS and ScanNet datasets respectively, while +reducing training time by -90%, indicating its effectiveness and efficiency.",cs.CV,['cs.CV'] +MoST: Motion Style Transformer between Diverse Action Contents,Boeun Kim · Jungho Kim · Hyung Jin Chang · Jin Young Choi,https://boeun-kim.github.io/page-MoST/,https://arxiv.org/abs/2403.06225,,2403.06225.pdf,MoST: Motion Style Transformer between Diverse Action Contents,"While existing motion style transfer methods are effective between two +motions with identical content, their performance significantly diminishes when +transferring style between motions with different contents. This challenge lies +in the lack of clear separation between content and style of a motion. To +tackle this challenge, we propose a novel motion style transformer that +effectively disentangles style from content and generates a plausible motion +with transferred style from a source motion. Our distinctive approach to +achieving the goal of disentanglement is twofold: (1) a new architecture for +motion style transformer with `part-attentive style modulator across body +parts' and `Siamese encoders that encode style and content features +separately'; (2) style disentanglement loss. Our method outperforms existing +methods and demonstrates exceptionally high quality, particularly in motion +pairs with different contents, without the need for heuristic post-processing. +Codes are available at https://github.com/Boeun-Kim/MoST.",cs.CV,"['cs.CV', 'cs.AI']" +Learned Lossless Image Compression based on Bit Plane Slicing,Zhe Zhang · Huairui Wang · Zhenzhong Chen · Shan Liu, ,https://arxiv.org/abs/2308.13287,,2308.13287.pdf,Efficient Learned Lossless JPEG Recompression,"JPEG is one of the most popular image compression methods. It is beneficial +to compress those existing JPEG files without introducing additional +distortion. In this paper, we propose a deep learning based method to further +compress JPEG images losslessly. Specifically, we propose a Multi-Level +Parallel Conditional Modeling (ML-PCM) architecture, which enables parallel +decoding in different granularities. First, luma and chroma are processed +independently to allow parallel coding. Second, we propose pipeline parallel +context model (PPCM) and compressed checkerboard context model (CCCM) for the +effective conditional modeling and efficient decoding within luma and chroma +components. Our method has much lower latency while achieves better compression +ratio compared with previous SOTA. After proper software optimization, we can +obtain a good throughput of 57 FPS for 1080P images on NVIDIA T4 GPU. +Furthermore, combined with quantization, our approach can also act as a lossy +JPEG codec which has obvious advantage over SOTA lossy compression methods in +high bit rate (bpp$>0.9$).",eess.IV,['eess.IV'] +Structure-Aware Sparse-View X-ray 3D Reconstruction,Yuanhao Cai · Jiahao Wang · Alan L. Yuille · Zongwei Zhou · Angtian Wang,https://github.com/caiyuanhao1998/SAX-NeRF,https://arxiv.org/abs/2311.10959,,2311.10959.pdf,Structure-Aware Sparse-View X-ray 3D Reconstruction,"X-ray, known for its ability to reveal internal structures of objects, is +expected to provide richer information for 3D reconstruction than visible +light. Yet, existing neural radiance fields (NeRF) algorithms overlook this +important nature of X-ray, leading to their limitations in capturing structural +contents of imaged objects. 
In this paper, we propose a framework, +Structure-Aware X-ray Neural Radiodensity Fields (SAX-NeRF), for sparse-view +X-ray 3D reconstruction. Firstly, we design a Line Segment-based Transformer +(Lineformer) as the backbone of SAX-NeRF. Linefomer captures internal +structures of objects in 3D space by modeling the dependencies within each line +segment of an X-ray. Secondly, we present a Masked Local-Global (MLG) ray +sampling strategy to extract contextual and geometric information in 2D +projection. Plus, we collect a larger-scale dataset X3D covering wider X-ray +applications. Experiments on X3D show that SAX-NeRF surpasses previous +NeRF-based methods by 12.56 and 2.49 dB on novel view synthesis and CT +reconstruction. Code, models, and data are released at +https://github.com/caiyuanhao1998/SAX-NeRF",eess.IV,"['eess.IV', 'cs.CV']" +Point Cloud Pre-training with Diffusion Models,xiao zheng · Xiaoshui Huang · Guofeng Mei · Zhaoyang Lyu · Yuenan Hou · Wanli Ouyang · Bo Dai · Yongshun Gong, ,https://arxiv.org/abs/2311.14960,,2311.14960.pdf,Point Cloud Pre-training with Diffusion Models,"Pre-training a model and then fine-tuning it on downstream tasks has +demonstrated significant success in the 2D image and NLP domains. However, due +to the unordered and non-uniform density characteristics of point clouds, it is +non-trivial to explore the prior knowledge of point clouds and pre-train a +point cloud backbone. In this paper, we propose a novel pre-training method +called Point cloud Diffusion pre-training (PointDif). We consider the point +cloud pre-training task as a conditional point-to-point generation problem and +introduce a conditional point generator. This generator aggregates the features +extracted by the backbone and employs them as the condition to guide the +point-to-point recovery from the noisy point cloud, thereby assisting the +backbone in capturing both local and global geometric priors as well as the +global point density distribution of the object. We also present a recurrent +uniform sampling optimization strategy, which enables the model to uniformly +recover from various noise levels and learn from balanced supervision. Our +PointDif achieves substantial improvement across various real-world datasets +for diverse downstream tasks such as classification, segmentation and +detection. Specifically, PointDif attains 70.0% mIoU on S3DIS Area 5 for the +segmentation task and achieves an average improvement of 2.4% on ScanObjectNN +for the classification task compared to TAP. Furthermore, our pre-training +framework can be flexibly applied to diverse point cloud backbones and bring +considerable gains.",cs.CV,['cs.CV'] +DiffLoc: Diffusion Model for Outdoor LiDAR Localization,Wen Li · Yuyang Yang · Shangshu Yu · Guosheng Hu · Chenglu Wen · Ming Cheng · Cheng Wang, ,,https://www.youtube.com/watch?v=sSW9nHQR0nc,,,,,nan +Instance-level Expert Knowledge and Aggregate Discriminative Attention for Radiology Report Generation,Shenshen Bu · Taiji Li · Zhiming Dai · Yuedong Yang,https://github.com/hnjzbss/EKAGen,https://arxiv.org/abs/2311.00399,,2311.00399.pdf,Enhanced Knowledge Injection for Radiology Report Generation,"Automatic generation of radiology reports holds crucial clinical value, as it +can alleviate substantial workload on radiologists and remind less experienced +ones of potential anomalies. 
Despite the remarkable performance of various +image captioning methods in the natural image field, generating accurate +reports for medical images still faces challenges, i.e., disparities in visual +and textual data, and lack of accurate domain knowledge. To address these +issues, we propose an enhanced knowledge injection framework, which utilizes +two branches to extract different types of knowledge. The Weighted Concept +Knowledge (WCK) branch is responsible for introducing clinical medical concepts +weighted by TF-IDF scores. The Multimodal Retrieval Knowledge (MRK) branch +extracts triplets from similar reports, emphasizing crucial clinical +information related to entity positions and existence. By integrating this +finer-grained and well-structured knowledge with the current image, we are able +to leverage the multi-source knowledge gain to ultimately facilitate more +accurate report generation. Extensive experiments have been conducted on two +public benchmarks, demonstrating that our method achieves superior performance +over other state-of-the-art methods. Ablation studies further validate the +effectiveness of two extracted knowledge sources.",cs.CV,"['cs.CV', 'cs.CL']" +Enhancing Multimodal Cooperation via Sample-level Modality Valuation,Yake Wei · Ruoxuan Feng · Zihe Wang · Di Hu, ,https://arxiv.org/html/2309.06255v3,,2309.06255v3.pdf,Enhancing Multimodal Cooperation via Fine-grained Modality Valuation,"One primary topic of multimodal learning is to jointly incorporate +heterogeneous information from different modalities. However, most models often +suffer from unsatisfactory multimodal cooperation, which cannot jointly utilize +all modalities well. Some methods are proposed to identify and enhance the +worse learnt modality, but they are often hard to provide the fine-grained +observation of multimodal cooperation at sample-level with theoretical support. +Hence, it is essential to reasonably observe and improve the fine-grained +cooperation between modalities, especially when facing realistic scenarios +where the modality discrepancy could vary across different samples. To this +end, we introduce a sample-level modality valuation metric to evaluate the +contribution of each modality for each sample. Via modality valuation, we +observe that modality discrepancy indeed could be different at sample-level, +beyond the global contribution discrepancy at dataset-level. We further analyze +this issue and improve cooperation between modalities at sample-level by +enhancing the discriminative ability of low-contributing modalities in a +targeted manner. Overall, our methods reasonably observe the fine-grained +uni-modal contribution and achieve considerable improvement. The source code +and dataset are available at +\url{https://github.com/GeWu-Lab/Valuate-and-Enhance-Multimodal-Cooperation}.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG', 'cs.MM']" +A Pedestrian is Worth One Prompt: Towards Language Guidance Person Re-Identification,Zexian Yang · Dayan Wu · Chenming Wu · Zheng Lin · JingziGU · Weiping Wang, ,,https://ieeexplore.ieee.org/document/10301577,,,,,nan +Data-Efficient Multimodal Fusion on a Single GPU,Noël Vouitsis · Zhaoyan Liu · Satya Krishna Gorti · Valentin Villecroze · Jesse C. Cresswell · Guangwei Yu · Gabriel Loaiza-Ganem · Maksims Volkovs, ,https://arxiv.org/abs/2312.10144,,2312.10144.pdf,Data-Efficient Multimodal Fusion on a Single GPU,"The goal of multimodal alignment is to learn a single latent space that is +shared between multimodal inputs. 
The most powerful models in this space have +been trained using massive datasets of paired inputs and large-scale +computational resources, making them prohibitively expensive to train in many +practical scenarios. We surmise that existing unimodal encoders pre-trained on +large amounts of unimodal data should provide an effective bootstrap to create +multimodal models from unimodal ones at much lower costs. We therefore propose +FuseMix, a multimodal augmentation scheme that operates on the latent spaces of +arbitrary pre-trained unimodal encoders. Using FuseMix for multimodal +alignment, we achieve competitive performance -- and in certain cases +outperform state-of-the art methods -- in both image-text and audio-text +retrieval, with orders of magnitude less compute and data: for example, we +outperform CLIP on the Flickr30K text-to-image retrieval task with $\sim \! +600\times$ fewer GPU days and $\sim \! 80\times$ fewer image-text pairs. +Additionally, we show how our method can be applied to convert pre-trained +text-to-image generative models into audio-to-image ones. Code is available at: +https://github.com/layer6ai-labs/fusemix.",cs.LG,"['cs.LG', 'cs.AI', 'cs.CV']" +SimDA: Simple Diffusion Adapter for Efficient Video Generation,Zhen Xing · Qi Dai · Han Hu · Zuxuan Wu · Yu-Gang Jiang, ,https://arxiv.org/abs/2308.09710,,2308.09710.pdf,SimDA: Simple Diffusion Adapter for Efficient Video Generation,"The recent wave of AI-generated content has witnessed the great development +and success of Text-to-Image (T2I) technologies. By contrast, Text-to-Video +(T2V) still falls short of expectations though attracting increasing interests. +Existing works either train from scratch or adapt large T2I model to videos, +both of which are computation and resource expensive. In this work, we propose +a Simple Diffusion Adapter (SimDA) that fine-tunes only 24M out of 1.1B +parameters of a strong T2I model, adapting it to video generation in a +parameter-efficient way. In particular, we turn the T2I model for T2V by +designing light-weight spatial and temporal adapters for transfer learning. +Besides, we change the original spatial attention to the proposed Latent-Shift +Attention (LSA) for temporal consistency. With similar model architecture, we +further train a video super-resolution model to generate high-definition +(1024x1024) videos. In addition to T2V generation in the wild, SimDA could also +be utilized in one-shot video editing with only 2 minutes tuning. Doing so, our +method could minimize the training effort with extremely few tunable parameters +for model adaptation.",cs.CV,"['cs.CV', 'cs.AI']" +Learning Adaptive Spatial Coherent Correlations for Speech-Preserving Facial Expression Manipulation,Tianshui Chen · Jianman Lin · Zhijing Yang · Chunmei Qing · Liang Lin, ,https://arxiv.org/abs/2401.11085,,2401.11085.pdf,Adaptive Global-Local Representation Learning and Selection for Cross-Domain Facial Expression Recognition,"Domain shift poses a significant challenge in Cross-Domain Facial Expression +Recognition (CD-FER) due to the distribution variation across different +domains. Current works mainly focus on learning domain-invariant features +through global feature adaptation, while neglecting the transferability of +local features. Additionally, these methods lack discriminative supervision +during training on target datasets, resulting in deteriorated feature +representation in target domain. 
To address these limitations, we propose an +Adaptive Global-Local Representation Learning and Selection (AGLRLS) framework. +The framework incorporates global-local adversarial adaptation and +semantic-aware pseudo label generation to enhance the learning of +domain-invariant and discriminative feature during training. Meanwhile, a +global-local prediction consistency learning is introduced to improve +classification results during inference. Specifically, the framework consists +of separate global-local adversarial learning modules that learn +domain-invariant global and local features independently. We also design a +semantic-aware pseudo label generation module, which computes semantic labels +based on global and local features. Moreover, a novel dynamic threshold +strategy is employed to learn the optimal thresholds by leveraging independent +prediction of global and local features, ensuring filtering out the unreliable +pseudo labels while retaining reliable ones. These labels are utilized for +model optimization through the adversarial learning process in an end-to-end +manner. During inference, a global-local prediction consistency module is +developed to automatically learn an optimal result from multiple predictions. +We conduct comprehensive experiments and analysis based on a fair evaluation +benchmark. The results demonstrate that the proposed framework outperforms the +current competing methods by a substantial margin.",cs.CV,"['cs.CV', 'cs.AI']" +Bring Event into RGB and LiDAR: Hierarchical Visual-Motion Fusion for Scene Flow,Hanyu Zhou · Yi Chang · Zhiwei Shi,https://hyzhouboy.github.io/,https://arxiv.org/abs/2403.07432,,2403.07432.pdf,Bring Event into RGB and LiDAR: Hierarchical Visual-Motion Fusion for Scene Flow,"Single RGB or LiDAR is the mainstream sensor for the challenging scene flow, +which relies heavily on visual features to match motion features. Compared with +single modality, existing methods adopt a fusion strategy to directly fuse the +cross-modal complementary knowledge in motion space. However, these direct +fusion methods may suffer the modality gap due to the visual intrinsic +heterogeneous nature between RGB and LiDAR, thus deteriorating motion features. +We discover that event has the homogeneous nature with RGB and LiDAR in both +visual and motion spaces. In this work, we bring the event as a bridge between +RGB and LiDAR, and propose a novel hierarchical visual-motion fusion framework +for scene flow, which explores a homogeneous space to fuse the cross-modal +complementary knowledge for physical interpretation. In visual fusion, we +discover that event has a complementarity (relative v.s. absolute) in luminance +space with RGB for high dynamic imaging, and has a complementarity (local +boundary v.s. global shape) in scene structure space with LiDAR for structure +integrity. In motion fusion, we figure out that RGB, event and LiDAR are +complementary (spatial-dense, temporal-dense v.s. spatiotemporal-sparse) to +each other in correlation space, which motivates us to fuse their motion +correlations for motion continuity. The proposed hierarchical fusion can +explicitly fuse the multimodal knowledge to progressively improve scene flow +from visual space to motion space. 
Extensive experiments have been performed to +verify the superiority of the proposed method.",cs.CV,['cs.CV'] +Efficient Vision-Language Pre-training by Cluster Masking,Zihao Wei · Zixuan Pan · Andrew Owens, ,https://arxiv.org/abs/2405.08815,,2405.08815.pdf,Efficient Vision-Language Pre-training by Cluster Masking,"We propose a simple strategy for masking image patches during visual-language +contrastive learning that improves the quality of the learned representations +and the training speed. During each iteration of training, we randomly mask +clusters of visually similar image patches, as measured by their raw pixel +intensities. This provides an extra learning signal, beyond the contrastive +training itself, since it forces a model to predict words for masked visual +structures solely from context. It also speeds up training by reducing the +amount of data used in each image. We evaluate the effectiveness of our model +by pre-training on a number of benchmarks, finding that it outperforms other +masking strategies, such as FLIP, on the quality of the learned representation.",cs.CV,['cs.CV'] +Seg2Reg: Differentiable 2D Segmentation to 1D Regression Rendering for 360 Room Layout Reconstruction,Cheng Sun · Wei-En Tai · Yu-Lin Shih · Kuan-Wei Chen · Yong-Jing Syu · Kent Selwyn The · Yu-Chiang Frank Wang · Hwann-Tzong Chen, ,https://arxiv.org/abs/2311.18695v1,,2311.18695v1.pdf,Seg2Reg: Differentiable 2D Segmentation to 1D Regression Rendering for 360 Room Layout Reconstruction,"State-of-the-art single-view 360-degree room layout reconstruction methods +formulate the problem as a high-level 1D (per-column) regression task. On the +other hand, traditional low-level 2D layout segmentation is simpler to learn +and can represent occluded regions, but it requires complex post-processing for +the targeting layout polygon and sacrifices accuracy. We present Seg2Reg to +render 1D layout depth regression from the 2D segmentation map in a +differentiable and occlusion-aware way, marrying the merits of both sides. +Specifically, our model predicts floor-plan density for the input +equirectangular 360-degree image. Formulating the 2D layout representation as a +density field enables us to employ `flattened' volume rendering to form 1D +layout depth regression. In addition, we propose a novel 3D warping +augmentation on layout to improve generalization. Finally, we re-implement +recent room layout reconstruction methods into our codebase for benchmarking +and explore modern backbones and training techniques to serve as the strong +baseline. Our model significantly outperforms previous arts. The code will be +made available upon publication.",cs.CV,"['cs.CV', 'cs.LG']" +Patch2Self2: Self-supervised Denoising on Coresets via Matrix Sketching,Shreyas Fadnavis · Agniva Chowdhury · Joshua Batson · Petros Drineas · Eleftherios Garyfallidis, ,,,,,,,nan +X-3D: Explicit 3D Structure Modeling for Point Cloud Recognition,Shuofeng Sun · Yongming Rao · Jiwen Lu · Haibin Yan,https://github.com/sunshuofeng/X-3D,https://arxiv.org/abs/2404.15010,,2404.15010.pdf,X-3D: Explicit 3D Structure Modeling for Point Cloud Recognition,"Numerous prior studies predominantly emphasize constructing relation vectors +for individual neighborhood points and generating dynamic kernels for each +vector and embedding these into high-dimensional spaces to capture implicit +local structures. 
However, we contend that such implicit high-dimensional +structure modeling approch inadequately represents the local geometric +structure of point clouds due to the absence of explicit structural +information. Hence, we introduce X-3D, an explicit 3D structure modeling +approach. X-3D functions by capturing the explicit local structural information +within the input 3D space and employing it to produce dynamic kernels with +shared weights for all neighborhood points within the current local region. +This modeling approach introduces effective geometric prior and significantly +diminishes the disparity between the local structure of the embedding space and +the original input point cloud, thereby improving the extraction of local +features. Experiments show that our method can be used on a variety of methods +and achieves state-of-the-art performance on segmentation, classification, +detection tasks with lower extra computational cost, such as \textbf{90.7\%} on +ScanObjectNN for classification, \textbf{79.2\%} on S3DIS 6 fold and +\textbf{74.3\%} on S3DIS Area 5 for segmentation, \textbf{76.3\%} on ScanNetV2 +for segmentation and \textbf{64.5\%} mAP , \textbf{46.9\%} mAP on SUN RGB-D and +\textbf{69.0\%} mAP , \textbf{51.1\%} mAP on ScanNetV2 . Our code is available +at +\href{https://github.com/sunshuofeng/X-3D}{https://github.com/sunshuofeng/X-3D}.",cs.CV,['cs.CV'] +NeRFCodec: Neural Feature Compression Meets Neural Radiance Fields for Memory-Efficient Scene Representation,Sicheng Li · Hao Li · Yiyi Liao · Lu Yu,https://jasonlsc.github.io/nerfcodec_homepage/,https://arxiv.org/abs/2404.02185,,2404.02185.pdf,NeRFCodec: Neural Feature Compression Meets Neural Radiance Fields for Memory-Efficient Scene Representation,"The emergence of Neural Radiance Fields (NeRF) has greatly impacted 3D scene +modeling and novel-view synthesis. As a kind of visual media for 3D scene +representation, compression with high rate-distortion performance is an eternal +target. Motivated by advances in neural compression and neural field +representation, we propose NeRFCodec, an end-to-end NeRF compression framework +that integrates non-linear transform, quantization, and entropy coding for +memory-efficient scene representation. Since training a non-linear transform +directly on a large scale of NeRF feature planes is impractical, we discover +that pre-trained neural 2D image codec can be utilized for compressing the +features when adding content-specific parameters. Specifically, we reuse neural +2D image codec but modify its encoder and decoder heads, while keeping the +other parts of the pre-trained decoder frozen. This allows us to train the full +pipeline via supervision of rendering loss and entropy loss, yielding the +rate-distortion balance by updating the content-specific parameters. At test +time, the bitstreams containing latent code, feature decoder head, and other +side information are transmitted for communication. Experimental results +demonstrate our method outperforms existing NeRF compression methods, enabling +high-quality novel view synthesis with a memory budget of 0.5 MB.",cs.CV,"['cs.CV', 'cs.GR', 'eess.IV']" +Desigen: A Pipeline for Controllable Design Template Generation,Haohan Weng · Danqing Huang · YU QIAO · Hu Zheng · Chin-Yew Lin · Tong Zhang · C. L. 
Philip Chen, ,https://arxiv.org/html/2403.09093v1,,2403.09093v1.pdf,Desigen: A Pipeline for Controllable Design Template Generation,"Templates serve as a good starting point to implement a design (e.g., banner, +slide) but it takes great effort from designers to manually create. In this +paper, we present Desigen, an automatic template creation pipeline which +generates background images as well as harmonious layout elements over the +background. Different from natural images, a background image should preserve +enough non-salient space for the overlaying layout elements. To equip existing +advanced diffusion-based models with stronger spatial control, we propose two +simple but effective techniques to constrain the saliency distribution and +reduce the attention weight in desired regions during the background generation +process. Then conditioned on the background, we synthesize the layout with a +Transformer-based autoregressive generator. To achieve a more harmonious +composition, we propose an iterative inference strategy to adjust the +synthesized background and layout in multiple rounds. We constructed a design +dataset with more than 40k advertisement banners to verify our approach. +Extensive experiments demonstrate that the proposed pipeline generates +high-quality templates comparable to human designers. More than a single-page +design, we further show an application of presentation generation that outputs +a set of theme-consistent slides. The data and code are available at +https://whaohan.github.io/desigen.",cs.CV,['cs.CV'] +"Sparse views, Near light: A practical paradigm for uncalibrated point-light photometric stereo",Mohammed Brahimi · Bjoern Haefner · Zhenzhang Ye · Bastian Goldluecke · Daniel Cremers, ,https://arxiv.org/abs/2404.00098,,2404.00098.pdf,"Sparse Views, Near Light: A Practical Paradigm for Uncalibrated Point-light Photometric Stereo","Neural approaches have shown a significant progress on camera-based +reconstruction. But they require either a fairly dense sampling of the viewing +sphere, or pre-training on an existing dataset, thereby limiting their +generalizability. In contrast, photometric stereo (PS) approaches have shown +great potential for achieving high-quality reconstruction under sparse +viewpoints. Yet, they are impractical because they typically require tedious +laboratory conditions, are restricted to dark rooms, and often multi-staged, +making them subject to accumulated errors. To address these shortcomings, we +propose an end-to-end uncalibrated multi-view PS framework for reconstructing +high-resolution shapes acquired from sparse viewpoints in a real-world +environment. We relax the dark room assumption, and allow a combination of +static ambient lighting and dynamic near LED lighting, thereby enabling easy +data capture outside the lab. Experimental validation confirms that it +outperforms existing baseline approaches in the regime of sparse viewpoints by +a large margin. 
This allows to bring high-accuracy 3D reconstruction from the +dark room to the real world, while maintaining a reasonable data capture +complexity.",cs.CV,['cs.CV'] +LeGO: Leveraging a Surface Deformation Network for Animatable Stylized Face Generation with One Example,Soyeon Yoon · Kwan Yun · Kwanggyoon Seo · Sihun Cha · Jung Eun Yoo · Junyong Noh,https://kwanyun.github.io/lego/,https://arxiv.org/abs/2403.15227,,2403.15227.pdf,LeGO: Leveraging a Surface Deformation Network for Animatable Stylized Face Generation with One Example,"Recent advances in 3D face stylization have made significant strides in few +to zero-shot settings. However, the degree of stylization achieved by existing +methods is often not sufficient for practical applications because they are +mostly based on statistical 3D Morphable Models (3DMM) with limited variations. +To this end, we propose a method that can produce a highly stylized 3D face +model with desired topology. Our methods train a surface deformation network +with 3DMM and translate its domain to the target style using a paired exemplar. +The network achieves stylization of the 3D face mesh by mimicking the style of +the target using a differentiable renderer and directional CLIP losses. +Additionally, during the inference process, we utilize a Mesh Agnostic Encoder +(MAGE) that takes deformation target, a mesh of diverse topologies as input to +the stylization process and encodes its shape into our latent space. The +resulting stylized face model can be animated by commonly used 3DMM blend +shapes. A set of quantitative and qualitative evaluations demonstrate that our +method can produce highly stylized face meshes according to a given style and +output them in a desired topology. We also demonstrate example applications of +our method including image-based stylized avatar generation, linear +interpolation of geometric styles, and facial animation of stylized avatars.",cs.CV,"['cs.CV', 'cs.GR', '68T45', 'I.4.9']" +Intelligent Grimm - Open-ended Visual Storytelling via Latent Diffusion Models,Chang Liu · Haoning Wu · Yujie Zhong · Xiaoyun Zhang · Yanfeng Wang · Weidi Xie,https://haoningwu3639.github.io/StoryGen_Webpage/,https://ar5iv.labs.arxiv.org/html/2312.03884,,2312.03884.pdf,WonderJourney: Going from Anywhere to Everywhere,"We introduce WonderJourney, a modularized framework for perpetual 3D scene +generation. Unlike prior work on view generation that focuses on a single type +of scenes, we start at any user-provided location (by a text description or an +image) and generate a journey through a long sequence of diverse yet coherently +connected 3D scenes. We leverage an LLM to generate textual descriptions of the +scenes in this journey, a text-driven point cloud generation pipeline to make a +compelling and coherent sequence of 3D scenes, and a large VLM to verify the +generated scenes. We show compelling, diverse visual results across various +scene types and styles, forming imaginary ""wonderjourneys"". Project website: +https://kovenyu.com/WonderJourney/",cs.CV,"['cs.CV', 'cs.GR']" +CSTA: CNN-based Spatiotemporal Attention for Video Summarization,Jaewon Son · Jaehun Park · Kwangsu Kim,https://github.com/thswodnjs3/CSTA,https://arxiv.org/abs/2405.11905,,2405.11905.pdf,CSTA: CNN-based Spatiotemporal Attention for Video Summarization,"Video summarization aims to generate a concise representation of a video, +capturing its essential content and key moments while reducing its overall +length. 
Although several methods employ attention mechanisms to handle +long-term dependencies, they often fail to capture the visual significance +inherent in frames. To address this limitation, we propose a CNN-based +SpatioTemporal Attention (CSTA) method that stacks each feature of frames from +a single video to form image-like frame representations and applies 2D CNN to +these frame features. Our methodology relies on CNN to comprehend the inter and +intra-frame relations and to find crucial attributes in videos by exploiting +its ability to learn absolute positions within images. In contrast to previous +work compromising efficiency by designing additional modules to focus on +spatial importance, CSTA requires minimal computational overhead as it uses CNN +as a sliding window. Extensive experiments on two benchmark datasets (SumMe and +TVSum) demonstrate that our proposed approach achieves state-of-the-art +performance with fewer MACs compared to previous methods. Codes are available +at https://github.com/thswodnjs3/CSTA.",cs.CV,['cs.CV'] +LP++: A Surprisingly Strong Linear Probe for Few-Shot CLIP,Yunshi HUANG · Fereshteh Shakeri · Jose Dolz · Malik Boudiaf · Houda Bahig · Ismail Ben Ayed, ,https://arxiv.org/abs/2404.02285,,2404.02285.pdf,LP++: A Surprisingly Strong Linear Probe for Few-Shot CLIP,"In a recent, strongly emergent literature on few-shot CLIP adaptation, Linear +Probe (LP) has been often reported as a weak baseline. This has motivated +intensive research building convoluted prompt learning or feature adaptation +strategies. In this work, we propose and examine from convex-optimization +perspectives a generalization of the standard LP baseline, in which the linear +classifier weights are learnable functions of the text embedding, with +class-wise multipliers blending image and text knowledge. As our objective +function depends on two types of variables, i.e., the class visual prototypes +and the learnable blending parameters, we propose a computationally efficient +block coordinate Majorize-Minimize (MM) descent algorithm. In our full-batch MM +optimizer, which we coin LP++, step sizes are implicit, unlike standard +gradient descent practices where learning rates are intensively searched over +validation sets. By examining the mathematical properties of our loss (e.g., +Lipschitz gradient continuity), we build majorizing functions yielding +data-driven learning rates and derive approximations of the loss's minima, +which provide data-informed initialization of the variables. Our image-language +objective function, along with these non-trivial optimization insights and +ingredients, yields, surprisingly, highly competitive few-shot CLIP +performances. Furthermore, LP++ operates in black-box, relaxes intensive +validation searches for the optimization hyper-parameters, and runs +orders-of-magnitudes faster than state-of-the-art few-shot CLIP adaptation +methods. Our code is available at: +\url{https://github.com/FereshteShakeri/FewShot-CLIP-Strong-Baseline.git}.",cs.CV,['cs.CV'] +EarthLoc: Astronaut Photography Localization by Indexing Earth from Space,Gabriele Berton · Alex Stoken · Barbara Caputo · Carlo Masone,https://github.com/gmberton/EarthLoc,https://arxiv.org/abs/2403.06758,,2403.06758.pdf,EarthLoc: Astronaut Photography Localization by Indexing Earth from Space,"Astronaut photography, spanning six decades of human spaceflight, presents a +unique Earth observations dataset with immense value for both scientific +research and disaster response. 
Despite its significance, accurately localizing +the geographical extent of these images, crucial for effective utilization, +poses substantial challenges. Current manual localization efforts are +time-consuming, motivating the need for automated solutions. We propose a novel +approach - leveraging image retrieval - to address this challenge efficiently. +We introduce innovative training techniques, including Year-Wise Data +Augmentation and a Neutral-Aware Multi-Similarity Loss, which contribute to the +development of a high-performance model, EarthLoc. We develop six evaluation +datasets and perform a comprehensive benchmark comparing EarthLoc to existing +methods, showcasing its superior efficiency and accuracy. Our approach marks a +significant advancement in automating the localization of astronaut +photography, which will help bridge a critical gap in Earth observations data. +Code and datasets are available at https://github.com/gmberton/EarthLoc",cs.CV,['cs.CV'] +Dynamic Graph Representation with Knowledge-aware Attention for Histopathology Whole Slide Image Analysis,Jiawen Li · Yuxuan Chen · Hongbo Chu · Sun Qiehe · Tian Guan · Anjia Han · Yonghong He, ,https://arxiv.org/abs/2403.07719,,2403.07719.pdf,Dynamic Graph Representation with Knowledge-aware Attention for Histopathology Whole Slide Image Analysis,"Histopathological whole slide images (WSIs) classification has become a +foundation task in medical microscopic imaging processing. Prevailing +approaches involve learning WSIs as instance-bag representations, emphasizing +significant instances but struggling to capture the interactions between +instances. Additionally, conventional graph representation methods utilize +explicit spatial positions to construct topological structures but restrict the +flexible interaction capabilities between instances at arbitrary locations, +particularly when spatially distant. In response, we propose a novel dynamic +graph representation algorithm that conceptualizes WSIs as a form of the +knowledge graph structure. Specifically, we dynamically construct neighbors and +directed edge embeddings based on the head and tail relationships between +instances. Then, we devise a knowledge-aware attention mechanism that can +update the head node features by learning the joint attention score of each +neighbor and edge. Finally, we obtain a graph-level embedding through the +global pooling process of the updated head, serving as an implicit +representation for the WSI classification. Our end-to-end graph representation +learning approach has outperformed the state-of-the-art WSI analysis methods on +three TCGA benchmark datasets and in-house test sets. Our code is available at +https://github.com/WonderLandxD/WiKG.",cs.CV,['cs.CV'] +Transcending the Limit of Local Window: Advanced Super-Resolution Transformer with Adaptive Token Dictionary,Leheng Zhang · Yawei Li · Xingyu Zhou · Xiaorui Zhao · Shuhang Gu, ,https://arxiv.org/abs/2401.08209,,2401.08209.pdf,Transcending the Limit of Local Window: Advanced Super-Resolution Transformer with Adaptive Token Dictionary,"Single Image Super-Resolution is a classic computer vision problem that +involves estimating high-resolution (HR) images from low-resolution (LR) ones. +Although deep neural networks (DNNs), especially Transformers for +super-resolution, have seen significant advancements in recent years, +challenges still remain, particularly in limited receptive field caused by +window-based self-attention. 
To address these issues, we introduce a group of +auxiliary Adaptive Token Dictionary to SR Transformer and establish an ATD-SR +method. The introduced token dictionary could learn prior information from +training data and adapt the learned prior to specific testing image through an +adaptive refinement step. The refinement strategy could not only provide global +information to all input tokens but also group image tokens into categories. +Based on category partitions, we further propose a category-based +self-attention mechanism designed to leverage distant but similar tokens for +enhancing input features. The experimental results show that our method +achieves the best performance on various single image super-resolution +benchmarks.",cs.CV,['cs.CV'] +Learning to Select Views for Efficient Multi-View Understanding,Yunzhong Hou · Stephen Gould · Liang Zheng, ,,https://openreview.net/forum?id=mzWQ2hOKNX,,,,,nan +MedBN: Robust Test-Time Adaptation against Malicious Test Samples,Hyejin Park · Jeongyeon Hwang · Sunung Mun · Sangdon Park · Jungseul Ok,http://hyejin-s.github.io/medbn,https://arxiv.org/abs/2403.19326,,2403.19326.pdf,MedBN: Robust Test-Time Adaptation against Malicious Test Samples,"Test-time adaptation (TTA) has emerged as a promising solution to address +performance decay due to unforeseen distribution shifts between training and +test data. While recent TTA methods excel in adapting to test data variations, +such adaptability exposes a model to vulnerability against malicious examples, +an aspect that has received limited attention. Previous studies have uncovered +security vulnerabilities within TTA even when a small proportion of the test +batch is maliciously manipulated. In response to the emerging threat, we +propose median batch normalization (MedBN), leveraging the robustness of the +median for statistics estimation within the batch normalization layer during +test-time inference. Our method is algorithm-agnostic, thus allowing seamless +integration with existing TTA frameworks. Our experimental results on benchmark +datasets, including CIFAR10-C, CIFAR100-C and ImageNet-C, consistently +demonstrate that MedBN outperforms existing approaches in maintaining robust +performance across different attack scenarios, encompassing both instant and +cumulative attacks. Through extensive experiments, we show that our approach +sustains the performance even in the absence of attacks, achieving a practical +balance between robustness and performance.",cs.LG,"['cs.LG', 'cs.CR', 'cs.CV']" +Text Grouping Adapter: Adapting Pre-trained Text Detector for Layout Analysis,Tianci Bi · Xiaoyi Zhang · Zhizheng Zhang · Wenxuan Xie · Cuiling Lan · Yan Lu · Nanning Zheng, ,https://arxiv.org/abs/2405.07481,,2405.07481.pdf,Text Grouping Adapter: Adapting Pre-trained Text Detector for Layout Analysis,"Significant progress has been made in scene text detection models since the +rise of deep learning, but scene text layout analysis, which aims to group +detected text instances as paragraphs, has not kept pace. Previous works either +treated text detection and grouping using separate models, or train a model +from scratch while using a unified one. All of them have not yet made full use +of the already well-trained text detectors and easily obtainable detection +datasets. 
In this paper, we present Text Grouping Adapter (TGA), a module that +can enable the utilization of various pre-trained text detectors to learn +layout analysis, allowing us to adopt a well-trained text detector right off +the shelf or just fine-tune it efficiently. Designed to be compatible with +various text detector architectures, TGA takes detected text regions and image +features as universal inputs to assemble text instance features. To capture +broader contextual information for layout analysis, we propose to predict text +group masks from text instance features by one-to-many assignment. Our +comprehensive experiments demonstrate that, even with frozen pre-trained +models, incorporating our TGA into various pre-trained text detectors and text +spotters can achieve superior layout analysis performance, simultaneously +inheriting generalized text detection ability from pre-training. In the case of +full parameter fine-tuning, we can further improve layout analysis performance.",cs.CV,['cs.CV'] +Masked Spatial Propagation Network for Sparsity-Adaptive Depth Refinement,Jinyoung Jun · Jae-Han Lee · Chang-Su Kim, ,https://arxiv.org/abs/2404.19294,,2404.19294.pdf,Masked Spatial Propagation Network for Sparsity-Adaptive Depth Refinement,"The main function of depth completion is to compensate for an insufficient +and unpredictable number of sparse depth measurements of hardware sensors. +However, existing research on depth completion assumes that the sparsity -- the +number of points or LiDAR lines -- is fixed for training and testing. Hence, +the completion performance drops severely when the number of sparse depths +changes significantly. To address this issue, we propose the sparsity-adaptive +depth refinement (SDR) framework, which refines monocular depth estimates using +sparse depth points. For SDR, we propose the masked spatial propagation network +(MSPN) to perform SDR with a varying number of sparse depths effectively by +gradually propagating sparse depth information throughout the entire depth map. +Experimental results demonstrate that MPSN achieves state-of-the-art +performance on both SDR and conventional depth completion scenarios.",cs.CV,['cs.CV'] +Depth-aware Test-Time Training for Zero-shot Video Object Segmentation,Weihuang Liu · Xi Shen · Haolun Li · Xiuli Bi · Bo Liu · Chi-Man Pun · Xiaodong Cun,https://nifangbaage.github.io/DATTT/,https://arxiv.org/abs/2403.04258,,2403.04258.pdf,Depth-aware Test-Time Training for Zero-shot Video Object Segmentation,"Zero-shot Video Object Segmentation (ZSVOS) aims at segmenting the primary +moving object without any human annotations. Mainstream solutions mainly focus +on learning a single model on large-scale video datasets, which struggle to +generalize to unseen videos. In this work, we introduce a test-time training +(TTT) strategy to address the problem. Our key insight is to enforce the model +to predict consistent depth during the TTT process. In detail, we first train a +single network to perform both segmentation and depth prediction tasks. This +can be effectively learned with our specifically designed depth modulation +layer. Then, for the TTT process, the model is updated by predicting consistent +depth maps for the same frame under different data augmentations. In addition, +we explore different TTT weight updating strategies. Our empirical results +suggest that the momentum-based weight initialization and looping-based +training scheme lead to more stable improvements. 
Experiments show that the +proposed method achieves clear improvements on ZSVOS. Our proposed video TTT +strategy provides significant superiority over state-of-the-art TTT methods. +Our code is available at: https://nifangbaage.github.io/DATTT.",cs.CV,['cs.CV'] +Viewpoint-Aware Visual Grounding in 3D Scenes,Xiangxi Shi · Zhonghua Wu · Stefan Lee, ,https://arxiv.org/abs/2403.03077,,2403.03077.pdf,MiKASA: Multi-Key-Anchor & Scene-Aware Transformer for 3D Visual Grounding,"3D visual grounding involves matching natural language descriptions with +their corresponding objects in 3D spaces. Existing methods often face +challenges with accuracy in object recognition and struggle in interpreting +complex linguistic queries, particularly with descriptions that involve +multiple anchors or are view-dependent. In response, we present the MiKASA +(Multi-Key-Anchor Scene-Aware) Transformer. Our novel end-to-end trained model +integrates a self-attention-based scene-aware object encoder and an original +multi-key-anchor technique, enhancing object recognition accuracy and the +understanding of spatial relationships. Furthermore, MiKASA improves the +explainability of decision-making, facilitating error diagnosis. Our model +achieves the highest overall accuracy in the Referit3D challenge for both the +Sr3D and Nr3D datasets, particularly excelling by a large margin in categories +that require viewpoint-dependent descriptions.",cs.CV,['cs.CV'] +Training Vision Transformers for Semi-Supervised Semantic Segmentation,Xinting Hu · Li Jiang · Bernt Schiele, ,,https://github.com/JoyHuYY1412/S4Former,,,,,nan +DeconfuseTrack: Dealing with Confusion for Multi-Object Tracking,Cheng Huang · Shoudong Han · Mengyu He · Wenbo Zheng · Yuhao Wei, ,https://arxiv.org/abs/2403.02767,,2403.02767.pdf,DeconfuseTrack:Dealing with Confusion for Multi-Object Tracking,"Accurate data association is crucial in reducing confusion, such as ID +switches and assignment errors, in multi-object tracking (MOT). However, +existing advanced methods often overlook the diversity among trajectories and +the ambiguity and conflicts present in motion and appearance cues, leading to +confusion among detections, trajectories, and associations when performing +simple global data association. To address this issue, we propose a simple, +versatile, and highly interpretable data association approach called Decomposed +Data Association (DDA). DDA decomposes the traditional association problem into +multiple sub-problems using a series of non-learning-based modules and +selectively addresses the confusion in each sub-problem by incorporating +targeted exploitation of new cues. Additionally, we introduce Occlusion-aware +Non-Maximum Suppression (ONMS) to retain more occluded detections, thereby +increasing opportunities for association with trajectories and indirectly +reducing the confusion caused by missed detections. Finally, based on DDA and +ONMS, we design a powerful multi-object tracker named DeconfuseTrack, +specifically focused on resolving confusion in MOT. Extensive experiments +conducted on the MOT17 and MOT20 datasets demonstrate that our proposed DDA and +ONMS significantly enhance the performance of several popular trackers. +Moreover, DeconfuseTrack achieves state-of-the-art performance on the MOT17 and +MOT20 test sets, significantly outperforms the baseline tracker ByteTrack in +metrics such as HOTA, IDF1, AssA. 
This validates that our tracking design +effectively reduces confusion caused by simple global association.",cs.CV,['cs.CV'] +SyncTalk: The Devil is in the Synchronization for Talking Head Synthesis,Ziqiao Peng · Wentao Hu · Yue Shi · Xiangyu Zhu · Xiaomei Zhang · Hao Zhao · Jun He · Hongyan Liu · Zhaoxin Fan,https://ziqiaopeng.github.io/synctalk/,https://arxiv.org/html/2311.17590v2,,2311.17590v2.pdf,SyncTalk: The Devil is in the Synchronization for Talking Head Synthesis,"Achieving high synchronization in the synthesis of realistic, speech-driven +talking head videos presents a significant challenge. Traditional Generative +Adversarial Networks (GAN) struggle to maintain consistent facial identity, +while Neural Radiance Fields (NeRF) methods, although they can address this +issue, often produce mismatched lip movements, inadequate facial expressions, +and unstable head poses. A lifelike talking head requires synchronized +coordination of subject identity, lip movements, facial expressions, and head +poses. The absence of these synchronizations is a fundamental flaw, leading to +unrealistic and artificial outcomes. To address the critical issue of +synchronization, identified as the ""devil"" in creating realistic talking heads, +we introduce SyncTalk. This NeRF-based method effectively maintains subject +identity, enhancing synchronization and realism in talking head synthesis. +SyncTalk employs a Face-Sync Controller to align lip movements with speech and +innovatively uses a 3D facial blendshape model to capture accurate facial +expressions. Our Head-Sync Stabilizer optimizes head poses, achieving more +natural head movements. The Portrait-Sync Generator restores hair details and +blends the generated head with the torso for a seamless visual experience. +Extensive experiments and user studies demonstrate that SyncTalk outperforms +state-of-the-art methods in synchronization and realism. We recommend watching +the supplementary video: https://ziqiaopeng.github.io/synctalk",cs.CV,['cs.CV'] +Efficient Test-Time Adaptation of Vision-Language Models,Adilbek Karmanov · Dayan Guan · Shijian Lu · Abdulmotaleb El Saddik · Eric P. Xing,https://kdiaaa.github.io/tda/,https://arxiv.org/abs/2403.18293,,2403.18293.pdf,Efficient Test-Time Adaptation of Vision-Language Models,"Test-time adaptation with pre-trained vision-language models has attracted +increasing attention for tackling distribution shifts during the test time. +Though prior studies have achieved very promising performance, they involve +intensive computation which is severely unaligned with test-time adaptation. We +design TDA, a training-free dynamic adapter that enables effective and +efficient test-time adaptation with vision-language models. TDA works with a +lightweight key-value cache that maintains a dynamic queue with few-shot pseudo +labels as values and the corresponding test-sample features as keys. Leveraging +the key-value cache, TDA allows adapting to test data gradually via progressive +pseudo label refinement which is super-efficient without incurring any +backpropagation. In addition, we introduce negative pseudo labeling that +alleviates the adverse impact of pseudo label noises by assigning pseudo labels +to certain negative classes when the model is uncertain about its pseudo label +predictions. Extensive experiments over two benchmarks demonstrate TDA's +superior effectiveness and efficiency as compared with the state-of-the-art. 
+The code has been released in \url{https://kdiaaa.github.io/tda/}.",cs.CV,['cs.CV'] +Systematic comparison of semi-supervised and self-supervised learning for medical image classification,Zhe Huang · Ruijie Jiang · Shuchin Aeron · Michael C. Hughes, ,https://arxiv.org/abs/2307.08919v2,,2307.08919v2.pdf,Systematic comparison of semi-supervised and self-supervised learning for medical image classification,"In many medical image classification problems, labeled data is scarce while +unlabeled data is more available. Semi-supervised learning and self-supervised +learning are two different research directions that can improve accuracy by +learning from extra unlabeled data. Recent methods from both directions have +reported significant gains on traditional benchmarks. Yet past benchmarks do +not focus on medical tasks and rarely compare self- and semi- methods together +on equal footing. Furthermore, past benchmarks often handle hyperparameter +tuning suboptimally. First, they may not tune hyperparameters at all, leading +to underfitting. Second, when tuning does occur, it often unrealistically uses +a labeled validation set much larger than the train set. Both cases make +previously published rankings of methods difficult to translate to practical +settings. This study contributes a systematic evaluation of self- and semi- +methods with a unified experimental protocol intended to guide a practitioner +with scarce overall labeled data and a limited compute budget. We answer two +key questions: Can hyperparameter tuning be effective with realistic-sized +validation sets? If so, when all methods are tuned well, which self- or +semi-supervised methods reach the best accuracy? Our study compares 13 +representative semi- and self-supervised methods to strong labeled-set-only +baselines on 4 medical datasets. From 20000+ total GPU hours of computation, we +provide valuable best practices to resource-constrained, results-focused +practitioners.",cs.CV,"['cs.CV', 'cs.LG']" +Grounded Text-to-Image Synthesis with Attention Refocusing,Quynh Phung · Songwei Ge · Jia-Bin Huang, ,https://arxiv.org/abs/2306.05427,,2306.05427.pdf,Grounded Text-to-Image Synthesis with Attention Refocusing,"Driven by the scalable diffusion models trained on large-scale datasets, +text-to-image synthesis methods have shown compelling results. However, these +models still fail to precisely follow the text prompt involving multiple +objects, attributes, or spatial compositions. In this paper, we reveal the +potential causes in the diffusion model's cross-attention and self-attention +layers. We propose two novel losses to refocus attention maps according to a +given spatial layout during sampling. Creating the layouts manually requires +additional effort and can be tedious. Therefore, we explore using large +language models (LLM) to produce these layouts for our method. We conduct +extensive experiments on the DrawBench, HRS, and TIFA benchmarks to evaluate +our proposed method. 
We show that our proposed attention refocusing effectively +improves the controllability of existing approaches.",cs.CV,['cs.CV'] +"Flexible Biometrics Recognition: Bridging the Multimodality Gap through Attention, Alignment and Prompt Tuning",Leslie Ching Ow Tiong · Dick Sigmund · Chen-Hui Chan · Andrew Beng Jin Teoh,https://github.com/MIS-DevWorks/FBR,,https://mdpi-res.com/d_attachment/sensors/sensors-23-06006/article_deploy/sensors-23-06006.pdf?version=1687952937,,,,,nan +Universal Semi-Supervised Domain Adaptation by Mitigating Common-Class Bias,Wenyu Zhang · Qingmu Liu · Felix Ong · Mohamed Ragab · Chuan-Sheng Foo, ,https://arxiv.org/abs/2403.11234,,2403.11234.pdf,Universal Semi-Supervised Domain Adaptation by Mitigating Common-Class Bias,"Domain adaptation is a critical task in machine learning that aims to improve +model performance on a target domain by leveraging knowledge from a related +source domain. In this work, we introduce Universal Semi-Supervised Domain +Adaptation (UniSSDA), a practical yet challenging setting where the target +domain is partially labeled, and the source and target label space may not +strictly match. UniSSDA is at the intersection of Universal Domain Adaptation +(UniDA) and Semi-Supervised Domain Adaptation (SSDA): the UniDA setting does +not allow for fine-grained categorization of target private classes not +represented in the source domain, while SSDA focuses on the restricted +closed-set setting where source and target label spaces match exactly. Existing +UniDA and SSDA methods are susceptible to common-class bias in UniSSDA +settings, where models overfit to data distributions of classes common to both +domains at the expense of private classes. We propose a new prior-guided +pseudo-label refinement strategy to reduce the reinforcement of common-class +bias due to pseudo-labeling, a common label propagation strategy in domain +adaptation. We demonstrate the effectiveness of the proposed strategy on +benchmark datasets Office-Home, DomainNet, and VisDA. The proposed strategy +attains the best performance across UniSSDA adaptation settings and establishes +a new baseline for UniSSDA.",cs.CV,['cs.CV'] +OA-CNNs: Omni-Adaptive Sparse CNNs for 3D Semantic Segmentation,Bohao Peng · Xiaoyang Wu · Li Jiang · Yukang Chen · Hengshuang Zhao · Zhuotao Tian · Jiaya Jia, ,https://arxiv.org/abs/2403.14418,,2403.14418.pdf,OA-CNNs: Omni-Adaptive Sparse CNNs for 3D Semantic Segmentation,"The booming of 3D recognition in the 2020s began with the introduction of +point cloud transformers. They quickly overwhelmed sparse CNNs and became +state-of-the-art models, especially in 3D semantic segmentation. However, +sparse CNNs are still valuable networks, due to their efficiency treasure, and +ease of application. In this work, we reexamine the design distinctions and +test the limits of what a sparse CNN can achieve. We discover that the key +credit to the performance difference is adaptivity. Specifically, we propose +two key components, i.e., adaptive receptive fields (spatially) and adaptive +relation, to bridge the gap. This exploration led to the creation of +Omni-Adaptive 3D CNNs (OA-CNNs), a family of networks that integrates a +lightweight module to greatly enhance the adaptivity of sparse CNNs at minimal +computational cost. Without any self-attention modules, OA-CNNs favorably +surpass point transformers in terms of accuracy in both indoor and outdoor +scenes, with much less latency and memory cost. 
Notably, it achieves 76.1%, +78.9%, and 70.6% mIoU on ScanNet v2, nuScenes, and SemanticKITTI validation +benchmarks respectively, while maintaining at most 5x better speed than +transformer counterparts. This revelation highlights the potential of pure +sparse CNNs to outperform transformer-related networks.",cs.CV,['cs.CV'] +Unified Language-driven Zero-shot Domain Adaptation,Senqiao Yang · Zhuotao Tian · Li Jiang · Jiaya Jia, ,https://arxiv.org/abs/2404.07155,,2404.07155.pdf,Unified Language-driven Zero-shot Domain Adaptation,"This paper introduces Unified Language-driven Zero-shot Domain Adaptation +(ULDA), a novel task setting that enables a single model to adapt to diverse +target domains without explicit domain-ID knowledge. We identify the +constraints in the existing language-driven zero-shot domain adaptation task, +particularly the requirement for domain IDs and domain-specific models, which +may restrict flexibility and scalability. To overcome these issues, we propose +a new framework for ULDA, consisting of Hierarchical Context Alignment (HCA), +Domain Consistent Representation Learning (DCRL), and Text-Driven Rectifier +(TDR). These components work synergistically to align simulated features with +target text across multiple visual levels, retain semantic correlations between +different regional representations, and rectify biases between simulated and +real target visual features, respectively. Our extensive empirical evaluations +demonstrate that this framework achieves competitive performance in both +settings, surpassing even the model that requires domain-ID, showcasing its +superiority and generalization ability. The proposed method is not only +effective but also maintains practicality and efficiency, as it does not +introduce additional computational costs during inference. Our project page is +https://senqiaoyang.com/project/ULDA .",cs.CV,['cs.CV'] +Delving into the Trajectory Long-tail Distribution for Muti-object Tracking,Sijia Chen · En Yu · Jinyang Li · Wenbing Tao, ,https://arxiv.org/abs/2403.04700,,2403.04700.pdf,Delving into the Trajectory Long-tail Distribution for Muti-object Tracking,"Multiple Object Tracking (MOT) is a critical area within computer vision, +with a broad spectrum of practical implementations. Current research has +primarily focused on the development of tracking algorithms and enhancement of +post-processing techniques. Yet, there has been a lack of thorough examination +concerning the nature of tracking data it self. In this study, we pioneer an +exploration into the distribution patterns of tracking data and identify a +pronounced long-tail distribution issue within existing MOT datasets. We note a +significant imbalance in the distribution of trajectory lengths across +different pedestrians, a phenomenon we refer to as ``pedestrians trajectory +long-tail distribution''. Addressing this challenge, we introduce a bespoke +strategy designed to mitigate the effects of this skewed distribution. +Specifically, we propose two data augmentation strategies, including Stationary +Camera View Data Augmentation (SVA) and Dynamic Camera View Data Augmentation +(DVA) , designed for viewpoint states and the Group Softmax (GS) module for +Re-ID. SVA is to backtrack and predict the pedestrian trajectory of tail +classes, and DVA is to use diffusion model to change the background of the +scene. GS divides the pedestrians into unrelated groups and performs softmax +operation on each group individually. 
Our proposed strategies can be integrated +into numerous existing tracking systems, and extensive experimentation +validates the efficacy of our method in reducing the influence of long-tail +distribution on multi-object tracking performance. The code is available at +https://github.com/chen-si-jia/Trajectory-Long-tail-Distribution-for-MOT.",cs.CV,['cs.CV'] +HUNTER: Unsupervised Human-centric 3D Detection via Transferring Knowledge from Synthetic Instances to Real Scenes,Yichen Yao · Zimo Jiang · YUJING SUN · Zhencai Zhu · Xinge Zhu · Runnan Chen · Yuexin Ma, ,https://arxiv.org/abs/2403.02769,,2403.02769.pdf,HUNTER: Unsupervised Human-centric 3D Detection via Transferring Knowledge from Synthetic Instances to Real Scenes,"Human-centric 3D scene understanding has recently drawn increasing attention, +driven by its critical impact on robotics. However, human-centric real-life +scenarios are extremely diverse and complicated, and humans have intricate +motions and interactions. With limited labeled data, supervised methods are +difficult to generalize to general scenarios, hindering real-life applications. +Mimicking human intelligence, we propose an unsupervised 3D detection method +for human-centric scenarios by transferring the knowledge from synthetic human +instances to real scenes. To bridge the gap between the distinct data +representations and feature distributions of synthetic models and real point +clouds, we introduce novel modules for effective instance-to-scene +representation transfer and synthetic-to-real feature alignment. Remarkably, +our method exhibits superior performance compared to current state-of-the-art +techniques, achieving 87.8% improvement in mAP and closely approaching the +performance of fully supervised methods (62.15 mAP vs. 69.02 mAP) on HuCenLife +Dataset.",cs.CV,['cs.CV'] +VAREN: Very Accurate and Realistic Equine Network,Silvia Zuffi · Ylva Mellbin · Ci Li · Markus Höschle · Hedvig Kjellström · Senya Polikovsky · Elin Hernlund · Michael J. Black,https://varen.is.tue.mpg.de/,,https://www.kth.se/is/rpl/rpl-news/accepted-publications-march-1.1339092,,,,,nan +Mask Grounding for Referring Image Segmentation,Yong Xien Chng · Henry Zheng · Yizeng Han · Xuchong QIU · Gao Huang,https://yxchng.github.io/projects/mask-grounding/,https://arxiv.org/abs/2312.12198,,2312.12198.pdf,Mask Grounding for Referring Image Segmentation,"Referring Image Segmentation (RIS) is a challenging task that requires an +algorithm to segment objects referred by free-form language expressions. +Despite significant progress in recent years, most state-of-the-art (SOTA) +methods still suffer from considerable language-image modality gap at the pixel +and word level. These methods generally 1) rely on sentence-level language +features for language-image alignment and 2) lack explicit training supervision +for fine-grained visual grounding. Consequently, they exhibit weak object-level +correspondence between visual and language features. Without well-grounded +features, prior methods struggle to understand complex expressions that require +strong reasoning over relationships among multiple objects, especially when +dealing with rarely used or ambiguous clauses. To tackle this challenge, we +introduce a novel Mask Grounding auxiliary task that significantly improves +visual grounding within language features, by explicitly teaching the model to +learn fine-grained correspondence between masked textual tokens and their +matching visual objects. 
Mask Grounding can be directly used on prior RIS +methods and consistently bring improvements. Furthermore, to holistically +address the modality gap, we also design a cross-modal alignment loss and an +accompanying alignment module. These additions work synergistically with Mask +Grounding. With all these techniques, our comprehensive approach culminates in +MagNet (Mask-grounded Network), an architecture that significantly outperforms +prior arts on three key benchmarks (RefCOCO, RefCOCO+ and G-Ref), demonstrating +our method's effectiveness in addressing current limitations of RIS algorithms. +Our code and pre-trained weights will be released.",cs.CV,['cs.CV'] +MimicDiffusion: Purifying Adversarial Perturbation via Mimicking Clean Diffusion Model,Kaiyu Song · Hanjiang Lai · Yan Pan · Jian Yin, ,https://arxiv.org/abs/2312.04802,,2312.04802.pdf,MimicDiffusion: Purifying Adversarial Perturbation via Mimicking Clean Diffusion Model,"Deep neural networks (DNNs) are vulnerable to adversarial perturbation, where +an imperceptible perturbation is added to the image that can fool the DNNs. +Diffusion-based adversarial purification focuses on using the diffusion model +to generate a clean image against such adversarial attacks. Unfortunately, the +generative process of the diffusion model is also inevitably affected by +adversarial perturbation since the diffusion model is also a deep network where +its input has adversarial perturbation. In this work, we propose +MimicDiffusion, a new diffusion-based adversarial purification technique, that +directly approximates the generative process of the diffusion model with the +clean image as input. Concretely, we analyze the differences between the guided +terms using the clean image and the adversarial sample. After that, we first +implement MimicDiffusion based on Manhattan distance. Then, we propose two +guidance to purify the adversarial perturbation and approximate the clean +diffusion model. Extensive experiments on three image datasets including +CIFAR-10, CIFAR-100, and ImageNet with three classifier backbones including +WideResNet-70-16, WideResNet-28-10, and ResNet50 demonstrate that +MimicDiffusion significantly performs better than the state-of-the-art +baselines. On CIFAR-10, CIFAR-100, and ImageNet, it achieves 92.67\%, 61.35\%, +and 61.53\% average robust accuracy, which are 18.49\%, 13.23\%, and 17.64\% +higher, respectively. The code is available in the supplementary material.",cs.CV,['cs.CV'] +Contrastive Denoising Score for Text-guided Latent Diffusion Image Editing,Hyelin Nam · Gihyun Kwon · Geon Yeong Park · Jong Chul Ye,https://hyelinnam.github.io/CDS/,https://arxiv.org/abs/2311.18608,,2311.18608.pdf,Contrastive Denoising Score for Text-guided Latent Diffusion Image Editing,"With the remarkable advent of text-to-image diffusion models, image editing +methods have become more diverse and continue to evolve. A promising recent +approach in this realm is Delta Denoising Score (DDS) - an image editing +technique based on Score Distillation Sampling (SDS) framework that leverages +the rich generative prior of text-to-image diffusion models. However, relying +solely on the difference between scoring functions is insufficient for +preserving specific structural elements from the original image, a crucial +aspect of image editing. To address this, here we present an embarrassingly +simple yet very powerful modification of DDS, called Contrastive Denoising +Score (CDS), for latent diffusion models (LDM). 
Inspired by the similarities +and differences between DDS and the contrastive learning for unpaired +image-to-image translation(CUT), we introduce a straightforward approach using +CUT loss within the DDS framework. Rather than employing auxiliary networks as +in the original CUT approach, we leverage the intermediate features of LDM, +specifically those from the self-attention layers, which possesses rich spatial +information. Our approach enables zero-shot image-to-image translation and +neural radiance field (NeRF) editing, achieving structural correspondence +between the input and output while maintaining content controllability. +Qualitative results and comparisons demonstrates the effectiveness of our +proposed method. Project page: https://hyelinnam.github.io/CDS/",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +Non-rigid Structure-from-Motion: Temporally-smooth Procrustean Alignment and Spatially-variant Deformation Modeling,Jiawei Shi · Hui Deng · Yuchao Dai,https://npucvr.github.io/TSM-NRSfM/,https://arxiv.org/abs/2405.04309,,2405.04309.pdf,Non-rigid Structure-from-Motion: Temporally-smooth Procrustean Alignment and Spatially-variant Deformation Modeling,"Even though Non-rigid Structure-from-Motion (NRSfM) has been extensively +studied and great progress has been made, there are still key challenges that +hinder their broad real-world applications: 1) the inherent motion/rotation +ambiguity requires either explicit camera motion recovery with extra constraint +or complex Procrustean Alignment; 2) existing low-rank modeling of the global +shape can over-penalize drastic deformations in the 3D shape sequence. This +paper proposes to resolve the above issues from a spatial-temporal modeling +perspective. First, we propose a novel Temporally-smooth Procrustean Alignment +module that estimates 3D deforming shapes and adjusts the camera motion by +aligning the 3D shape sequence consecutively. Our new alignment module remedies +the requirement of complex reference 3D shape during alignment, which is more +conductive to non-isotropic deformation modeling. Second, we propose a +spatial-weighted approach to enforce the low-rank constraint adaptively at +different locations to accommodate drastic spatially-variant deformation +reconstruction better. Our modeling outperform existing low-rank based methods, +and extensive experiments across different datasets validate the effectiveness +of our method.",cs.CV,['cs.CV'] +Defense Against Adversarial Attacks on No-Reference Image Quality Models with Gradient Norm Regularization,Yujia Liu · Chenxi Yang · Dingquan Li · Jianhao Ding · Tingting Jiang, ,https://arxiv.org/abs/2403.11397,,2403.11397.pdf,Defense Against Adversarial Attacks on No-Reference Image Quality Models with Gradient Norm Regularization,"The task of No-Reference Image Quality Assessment (NR-IQA) is to estimate the +quality score of an input image without additional information. NR-IQA models +play a crucial role in the media industry, aiding in performance evaluation and +optimization guidance. However, these models are found to be vulnerable to +adversarial attacks, which introduce imperceptible perturbations to input +images, resulting in significant changes in predicted scores. In this paper, we +propose a defense method to improve the stability in predicted scores when +attacked by small perturbations, thus enhancing the adversarial robustness of +NR-IQA models. 
To be specific, we present theoretical evidence showing that the +magnitude of score changes is related to the $\ell_1$ norm of the model's +gradient with respect to the input image. Building upon this theoretical +foundation, we propose a norm regularization training strategy aimed at +reducing the $\ell_1$ norm of the gradient, thereby boosting the robustness of +NR-IQA models. Experiments conducted on four NR-IQA baseline models demonstrate +the effectiveness of our strategy in reducing score changes in the presence of +adversarial attacks. To the best of our knowledge, this work marks the first +attempt to defend against adversarial attacks on NR-IQA models. Our study +offers valuable insights into the adversarial robustness of NR-IQA models and +provides a foundation for future research in this area.",cs.CV,"['cs.CV', 'eess.IV']" +A Unified Framework for Human-centric Point Cloud Video Understanding,Yiteng Xu · Kecheng Ye · xiao han · yiming ren · Xinge Zhu · Yuexin Ma, ,https://arxiv.org/abs/2403.20031,,2403.20031.pdf,A Unified Framework for Human-centric Point Cloud Video Understanding,"Human-centric Point Cloud Video Understanding (PVU) is an emerging field +focused on extracting and interpreting human-related features from sequences of +human point clouds, further advancing downstream human-centric tasks and +applications. Previous works usually focus on tackling one specific task and +rely on huge labeled data, which has poor generalization capability. +Considering that human has specific characteristics, including the structural +semantics of human body and the dynamics of human motions, we propose a unified +framework to make full use of the prior knowledge and explore the inherent +features in the data itself for generalized human-centric point cloud video +understanding. Extensive experiments demonstrate that our method achieves +state-of-the-art performance on various human-related tasks, including action +recognition and 3D pose estimation. All datasets and code will be released +soon.",cs.CV,['cs.CV'] +iKUN: Speak to Trackers without Retraining,Yunhao Du · Cheng Lei · Zhicheng Zhao · Fei Su,https://github.com/dyhBUPT/iKUN,https://arxiv.org/abs/2312.16245,,2312.16245.pdf,iKUN: Speak to Trackers without Retraining,"Referring multi-object tracking (RMOT) aims to track multiple objects based +on input textual descriptions. Previous works realize it by simply integrating +an extra textual module into the multi-object tracker. However, they typically +need to retrain the entire framework and have difficulties in optimization. In +this work, we propose an insertable Knowledge Unification Network, termed iKUN, +to enable communication with off-the-shelf trackers in a plug-and-play manner. +Concretely, a knowledge unification module (KUM) is designed to adaptively +extract visual features based on textual guidance. Meanwhile, to improve the +localization accuracy, we present a neural version of Kalman filter (NKF) to +dynamically adjust process noise and observation noise based on the current +motion status. Moreover, to address the problem of open-set long-tail +distribution of textual descriptions, a test-time similarity calibration method +is proposed to refine the confidence score with pseudo frequency. Extensive +experiments on Refer-KITTI dataset verify the effectiveness of our framework. +Finally, to speed up the development of RMOT, we also contribute a more +challenging dataset, Refer-Dance, by extending public DanceTrack dataset with +motion and dressing descriptions. 
The codes and dataset are available at +https://github.com/dyhBUPT/iKUN.",cs.CV,['cs.CV'] +Accelerating Diffusion Sampling with Optimized Time Steps,Shuchen Xue · Zhaoqiang Liu · Fei Chen · Shifeng Zhang · Tianyang Hu · Enze Xie · Zhenguo Li, ,https://arxiv.org/abs/2402.17376,,2402.17376.pdf,Accelerating Diffusion Sampling with Optimized Time Steps,"Diffusion probabilistic models (DPMs) have shown remarkable performance in +high-resolution image synthesis, but their sampling efficiency is still to be +desired due to the typically large number of sampling steps. Recent +advancements in high-order numerical ODE solvers for DPMs have enabled the +generation of high-quality images with much fewer sampling steps. While this is +a significant development, most sampling methods still employ uniform time +steps, which is not optimal when using a small number of steps. To address this +issue, we propose a general framework for designing an optimization problem +that seeks more appropriate time steps for a specific numerical ODE solver for +DPMs. This optimization problem aims to minimize the distance between the +ground-truth solution to the ODE and an approximate solution corresponding to +the numerical solver. It can be efficiently solved using the constrained trust +region method, taking less than $15$ seconds. Our extensive experiments on both +unconditional and conditional sampling using pixel- and latent-space DPMs +demonstrate that, when combined with the state-of-the-art sampling method +UniPC, our optimized time steps significantly improve image generation +performance in terms of FID scores for datasets such as CIFAR-10 and ImageNet, +compared to using uniform time steps.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +DemoFusion: Democratising High-Resolution Image Generation With No $$$,Ruoyi DU · Dongliang Chang · Timothy Hospedales · Yi-Zhe Song · Zhanyu Ma,https://ruoyidu.github.io/demofusion/demofusion.html,https://arxiv.org/abs/2311.16973,,2311.16973.pdf,DemoFusion: Democratising High-Resolution Image Generation With No $$$,"High-resolution image generation with Generative Artificial Intelligence +(GenAI) has immense potential but, due to the enormous capital investment +required for training, it is increasingly centralised to a few large +corporations, and hidden behind paywalls. This paper aims to democratise +high-resolution GenAI by advancing the frontier of high-resolution generation +while remaining accessible to a broad audience. We demonstrate that existing +Latent Diffusion Models (LDMs) possess untapped potential for higher-resolution +image generation. Our novel DemoFusion framework seamlessly extends open-source +GenAI models, employing Progressive Upscaling, Skip Residual, and Dilated +Sampling mechanisms to achieve higher-resolution image generation. 
The +progressive nature of DemoFusion requires more passes, but the intermediate +results can serve as ""previews"", facilitating rapid prompt iteration.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +Tumor Micro-environment Interactions Guided Graph Learning for Survival Analysis of Human Cancers from Whole-slide Pathological Images.,WEI SHAO · YangYang Shi · Daoqiang Zhang · Junjie Zhou · Peng Wan, ,,https://www.nature.com/articles/s41467-023-40890-x,,,,,nan +Adapting Short-Term Transformers for Action Detection in Untrimmed Videos,Min Yang · gaohuan · Ping Guo · Limin Wang, ,https://arxiv.org/abs/2312.01897,,2312.01897.pdf,Adapting Short-Term Transformers for Action Detection in Untrimmed Videos,"Vision Transformer (ViT) has shown high potential in video recognition, owing +to its flexible design, adaptable self-attention mechanisms, and the efficacy +of masked pre-training. Yet, it remains unclear how to adapt these pre-trained +short-term ViTs for temporal action detection (TAD) in untrimmed videos. The +existing works treat them as off-the-shelf feature extractors for each +short-trimmed snippet without capturing the fine-grained relation among +different snippets in a broader temporal context. To mitigate this issue, this +paper focuses on designing a new mechanism for adapting these pre-trained ViT +models as a unified long-form video transformer to fully unleash its modeling +power in capturing inter-snippet relation, while still keeping low computation +overhead and memory consumption for efficient TAD. To this end, we design +effective cross-snippet propagation modules to gradually exchange short-term +video information among different snippets from two levels. For inner-backbone +information propagation, we introduce a cross-snippet propagation strategy to +enable multi-snippet temporal feature interaction inside the backbone.For +post-backbone information propagation, we propose temporal transformer layers +for further clip-level modeling. With the plain ViT-B pre-trained with +VideoMAE, our end-to-end temporal action detector (ViT-TAD) yields a very +competitive performance to previous temporal action detectors, riching up to +69.5 average mAP on THUMOS14, 37.40 average mAP on ActivityNet-1.3 and 17.20 +average mAP on FineAction.",cs.CV,['cs.CV'] +LiveHPS: LiDAR-based Scene-level Human Pose and Shape Estimation in Free Environment,yiming ren · xiao han · Chengfeng Zhao · Jingya Wang · Lan Xu · Jingyi Yu · Yuexin Ma, ,https://arxiv.org/abs/2402.17171,,2402.17171.pdf,LiveHPS: LiDAR-based Scene-level Human Pose and Shape Estimation in Free Environment,"For human-centric large-scale scenes, fine-grained modeling for 3D human +global pose and shape is significant for scene understanding and can benefit +many real-world applications. In this paper, we present LiveHPS, a novel +single-LiDAR-based approach for scene-level human pose and shape estimation +without any limitation of light conditions and wearable devices. In particular, +we design a distillation mechanism to mitigate the distribution-varying effect +of LiDAR point clouds and exploit the temporal-spatial geometric and dynamic +information existing in consecutive frames to solve the occlusion and noise +disturbance. LiveHPS, with its efficient configuration and high-quality output, +is well-suited for real-world applications. Moreover, we propose a huge human +motion dataset, named FreeMotion, which is collected in various scenarios with +diverse human poses, shapes and translations. 
It consists of multi-modal and +multi-view acquisition data from calibrated and synchronized LiDARs, cameras, +and IMUs. Extensive experiments on our new dataset and other public datasets +demonstrate the SOTA performance and robustness of our approach. We will +release our code and dataset soon.",cs.CV,['cs.CV'] +CoDe: An Explicit Content Decoupling Framework for Image Restoration,Enxuan Gu · Hongwei Ge · Yong Guo, ,https://arxiv.org/abs/2312.05006,,2312.05006.pdf,Decoupling Degradation and Content Processing for Adverse Weather Image Restoration,"Adverse weather image restoration strives to recover clear images from those +affected by various weather types, such as rain, haze, and snow. Each weather +type calls for a tailored degradation removal approach due to its unique impact +on images. Conversely, content reconstruction can employ a uniform approach, as +the underlying image content remains consistent. Although previous techniques +can handle multiple weather types within a single network, they neglect the +crucial distinction between these two processes, limiting the quality of +restored images. This work introduces a novel adverse weather image restoration +method, called DDCNet, which decouples the degradation removal and content +reconstruction process at the feature level based on their channel statistics. +Specifically, we exploit the unique advantages of the Fourier transform in both +these two processes: (1) the degradation information is mainly located in the +amplitude component of the Fourier domain, and (2) the Fourier domain contains +global information. The former facilitates channel-dependent degradation +removal operation, allowing the network to tailor responses to various adverse +weather types; the latter, by integrating Fourier's global properties into +channel-independent content features, enhances network capacity for consistent +global content reconstruction. We further augment the degradation removal +process with a degradation mapping loss function. Extensive experiments +demonstrate our method achieves state-of-the-art performance in multiple +adverse weather removal benchmarks.",cs.CV,['cs.CV'] +MemSAM: Taming Segment Anything Model for Echocardiography Video Segmentation,Xiaolong Deng · Huisi Wu · Runhao Zeng · Jing Qin,https://github.com/dengxl0520/MemSAM,https://arxiv.org/abs/2311.10529,,2311.10529.pdf,Enhancing the Reliability of Segment Anything Model for Auto-Prompting Medical Image Segmentation with Uncertainty Rectification,"The Segment Anything Model (SAM) has recently emerged as a groundbreaking +foundation model for prompt-driven image segmentation tasks. However, both the +original SAM and its medical variants require slice-by-slice manual prompting +of target structures, which directly increase the burden for applications. +Despite attempts of auto-prompting to turn SAM into a fully automatic manner, +it still exhibits subpar performance and lacks of reliability especially in the +field of medical imaging. In this paper, we propose UR-SAM, an uncertainty +rectified SAM framework to enhance the reliability for auto-prompting medical +image segmentation. Building upon a localization framework for automatic prompt +generation, our method incorporates a prompt augmentation module to obtain a +series of input prompts for SAM for uncertainty estimation and an +uncertainty-based rectification module to further utilize the distribution of +estimated uncertainty to improve the segmentation performance. 
Extensive +experiments on two public 3D medical datasets covering the segmentation of 35 +organs demonstrate that without supplementary training or fine-tuning, our +method further improves the segmentation performance with up to 10.7 % and 13.8 +% in dice similarity coefficient, demonstrating efficiency and broad +capabilities for medical image segmentation without manual prompting.",cs.CV,['cs.CV'] +Generative Powers of Ten,Xiaojuan Wang · Janne Kontkanen · Brian Curless · Steve Seitz · Ira Kemelmacher-Shlizerman · Ben Mildenhall · Pratul P. Srinivasan · Dor Verbin · Aleksander Holynski, ,https://arxiv.org/abs/2312.02149,,2312.02149.pdf,Generative Powers of Ten,"We present a method that uses a text-to-image model to generate consistent +content across multiple image scales, enabling extreme semantic zooms into a +scene, e.g., ranging from a wide-angle landscape view of a forest to a macro +shot of an insect sitting on one of the tree branches. We achieve this through +a joint multi-scale diffusion sampling approach that encourages consistency +across different scales while preserving the integrity of each individual +sampling process. Since each generated scale is guided by a different text +prompt, our method enables deeper levels of zoom than traditional +super-resolution methods that may struggle to create new contextual structure +at vastly different scales. We compare our method qualitatively with +alternative techniques in image super-resolution and outpainting, and show that +our method is most effective at generating consistent multi-scale content.",cs.CV,"['cs.CV', 'cs.AI', 'cs.CL', 'cs.GR']" +When Visual Grounding Meets Gigapixel-level Large-scale Scenes: Benchmark and Approach,TAO MA · Bing Bai · Haozhe Lin · Heyuan Wang · Yu Wang · Lin Luo · Lu Fang, ,https://arxiv.org/abs/2307.11558,,2307.11558.pdf,Advancing Visual Grounding with Scene Knowledge: Benchmark and Method,"Visual grounding (VG) aims to establish fine-grained alignment between vision +and language. Ideally, it can be a testbed for vision-and-language models to +evaluate their understanding of the images and texts and their reasoning +abilities over their joint space. However, most existing VG datasets are +constructed using simple description texts, which do not require sufficient +reasoning over the images and texts. This has been demonstrated in a recent +study~\cite{luo2022goes}, where a simple LSTM-based text encoder without +pretraining can achieve state-of-the-art performance on mainstream VG datasets. +Therefore, in this paper, we propose a novel benchmark of \underline{S}cene +\underline{K}nowledge-guided \underline{V}isual \underline{G}rounding (SK-VG), +where the image content and referring expressions are not sufficient to ground +the target objects, forcing the models to have a reasoning ability on the +long-form scene knowledge. To perform this task, we propose two approaches to +accept the triple-type input, where the former embeds knowledge into the image +features before the image-query interaction; the latter leverages linguistic +structure to assist in computing the image-text matching. We conduct extensive +experiments to analyze the above methods and show that the proposed approaches +achieve promising results but still leave room for improvement, including +performance and interpretability. 
The dataset and code are available at +\url{https://github.com/zhjohnchan/SK-VG}.",cs.CV,"['cs.CV', 'cs.CL']" +Selectively Informative Description can Reduce Undesired Embedding Entanglements in Text-to-Image Personalization,Jimyeong Kim · Jungwon Park · Wonjong Rhee, ,https://arxiv.org/abs/2403.15330,,2403.15330.pdf,Selectively Informative Description can Reduce Undesired Embedding Entanglements in Text-to-Image Personalization,"In text-to-image personalization, a timely and crucial challenge is the +tendency of generated images overfitting to the biases present in the reference +images. We initiate our study with a comprehensive categorization of the biases +into background, nearby-object, tied-object, substance (in style +re-contextualization), and pose biases. These biases manifest in the generated +images due to their entanglement into the subject embedding. This undesired +embedding entanglement not only results in the reflection of biases from the +reference images into the generated images but also notably diminishes the +alignment of the generated images with the given generation prompt. To address +this challenge, we propose SID~(Selectively Informative Description), a text +description strategy that deviates from the prevalent approach of only +characterizing the subject's class identification. SID is generated utilizing +multimodal GPT-4 and can be seamlessly integrated into optimization-based +models. We present comprehensive experimental results along with analyses of +cross-attention maps, subject-alignment, non-subject-disentanglement, and +text-alignment.",cs.CV,['cs.CV'] +SFOD: Spiking Fusion Object Detector,Yimeng Fan · Wei Zhang · Changsong Liu · Mingyang Li · Wenrui Lu,https://github.com/yimeng-fan/SFOD,https://arxiv.org/abs/2403.15192,,2403.15192.pdf,SFOD: Spiking Fusion Object Detector,"Event cameras, characterized by high temporal resolution, high dynamic range, +low power consumption, and high pixel bandwidth, offer unique capabilities for +object detection in specialized contexts. Despite these advantages, the +inherent sparsity and asynchrony of event data pose challenges to existing +object detection algorithms. Spiking Neural Networks (SNNs), inspired by the +way the human brain codes and processes information, offer a potential solution +to these difficulties. However, their performance in object detection using +event cameras is limited in current implementations. In this paper, we propose +the Spiking Fusion Object Detector (SFOD), a simple and efficient approach to +SNN-based object detection. Specifically, we design a Spiking Fusion Module, +achieving the first-time fusion of feature maps from different scales in SNNs +applied to event cameras. Additionally, through integrating our analysis and +experiments conducted during the pretraining of the backbone network on the +NCAR dataset, we delve deeply into the impact of spiking decoding strategies +and loss functions on model performance. Thereby, we establish state-of-the-art +classification results based on SNNs, achieving 93.7\% accuracy on the NCAR +dataset. Experimental results on the GEN1 detection dataset demonstrate that +the SFOD achieves a state-of-the-art mAP of 32.1\%, outperforming existing +SNN-based approaches. Our research not only underscores the potential of SNNs +in object detection with event cameras but also propels the advancement of +SNNs. 
Code is available at https://github.com/yimeng-fan/SFOD.",cs.CV,"['cs.CV', 'cs.AI']" +A Backpack Full of Skills: Egocentric Video Understanding with Diverse Task Perspectives,Simone Alberto Peirone · Francesca Pistilli · Antonio Alliegro · Giuseppe Averta,https://sapeirone.github.io/EgoPack/,https://arxiv.org/abs/2403.03037,,2403.03037.pdf,A Backpack Full of Skills: Egocentric Video Understanding with Diverse Task Perspectives,"Human comprehension of a video stream is naturally broad: in a few instants, +we are able to understand what is happening, the relevance and relationship of +objects, and forecast what will follow in the near future, everything all at +once. We believe that - to effectively transfer such an holistic perception to +intelligent machines - an important role is played by learning to correlate +concepts and to abstract knowledge coming from different tasks, to +synergistically exploit them when learning novel skills. To accomplish this, we +seek for a unified approach to video understanding which combines shared +temporal modelling of human actions with minimal overhead, to support multiple +downstream tasks and enable cooperation when learning novel skills. We then +propose EgoPack, a solution that creates a collection of task perspectives that +can be carried across downstream tasks and used as a potential source of +additional insights, as a backpack of skills that a robot can carry around and +use when needed. We demonstrate the effectiveness and efficiency of our +approach on four Ego4D benchmarks, outperforming current state-of-the-art +methods.",cs.CV,"['cs.CV', 'cs.LG']" +How Far Can We Compress Instant NGP-Based NeRF?,Yihang Chen · Qianyi Wu · Mehrtash Harandi · Jianfei Cai, ,https://arxiv.org/abs/2310.14695,,2310.14695.pdf,CAwa-NeRF: Instant Learning of Compression-Aware NeRF Features,"Modeling 3D scenes by volumetric feature grids is one of the promising +directions of neural approximations to improve Neural Radiance Fields (NeRF). +Instant-NGP (INGP) introduced multi-resolution hash encoding from a lookup +table of trainable feature grids which enabled learning high-quality neural +graphics primitives in a matter of seconds. However, this improvement came at +the cost of higher storage size. In this paper, we address this challenge by +introducing instant learning of compression-aware NeRF features (CAwa-NeRF), +that allows exporting the zip compressed feature grids at the end of the model +training with a negligible extra time overhead without changing neither the +storage architecture nor the parameters used in the original INGP paper. +Nonetheless, the proposed method is not limited to INGP but could also be +adapted to any model. By means of extensive simulations, our proposed instant +learning pipeline can achieve impressive results on different kinds of static +scenes such as single object masked background scenes and real-life scenes +captured in our studio. 
In particular, for single object masked background +scenes CAwa-NeRF compresses the feature grids down to 6% (1.2 MB) of the +original size without any loss in the PSNR (33 dB) or down to 2.4% (0.53 MB) +with a slight virtual loss (32.31 dB).",cs.CV,['cs.CV'] +Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners,Yazhou Xing · Yingqing He · Zeyue Tian · Xintao Wang · Qifeng Chen, ,https://arxiv.org/abs/2402.17723,,2402.17723.pdf,Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners,"Video and audio content creation serves as the core technique for the movie +industry and professional users. Recently, existing diffusion-based methods +tackle video and audio generation separately, which hinders the technique +transfer from academia to industry. In this work, we aim at filling the gap, +with a carefully designed optimization-based framework for cross-visual-audio +and joint-visual-audio generation. We observe the powerful generation ability +of off-the-shelf video or audio generation models. Thus, instead of training +the giant models from scratch, we propose to bridge the existing strong models +with a shared latent representation space. Specifically, we propose a +multimodality latent aligner with the pre-trained ImageBind model. Our latent +aligner shares a similar core as the classifier guidance that guides the +diffusion denoising process during inference time. Through carefully designed +optimization strategy and loss functions, we show the superior performance of +our method on joint video-audio generation, visual-steered audio generation, +and audio-steered visual generation tasks. The project website can be found at +https://yzxing87.github.io/Seeing-and-Hearing/",cs.CV,"['cs.CV', 'cs.MM', 'cs.SD', 'eess.AS']" +VidToMe: Video Token Merging for Zero-Shot Video Editing,Xirui Li · Chao Ma · Xiaokang Yang · Ming-Hsuan Yang,https://vidtome-diffusion.github.io/,https://arxiv.org/abs/2312.10656,,2312.10656.pdf,VidToMe: Video Token Merging for Zero-Shot Video Editing,"Diffusion models have made significant advances in generating high-quality +images, but their application to video generation has remained challenging due +to the complexity of temporal motion. Zero-shot video editing offers a solution +by utilizing pre-trained image diffusion models to translate source videos into +new ones. Nevertheless, existing methods struggle to maintain strict temporal +consistency and efficient memory consumption. In this work, we propose a novel +approach to enhance temporal consistency in generated videos by merging +self-attention tokens across frames. By aligning and compressing temporally +redundant tokens across frames, our method improves temporal coherence and +reduces memory consumption in self-attention computations. The merging strategy +matches and aligns tokens according to the temporal correspondence between +frames, facilitating natural temporal consistency in generated video frames. To +manage the complexity of video processing, we divide videos into chunks and +develop intra-chunk local token merging and inter-chunk global token merging, +ensuring both short-term video continuity and long-term content consistency. 
+Our video editing approach seamlessly extends the advancements in image editing +to video editing, rendering favorable results in temporal consistency over +state-of-the-art methods.",cs.CV,['cs.CV'] +Real-Time Exposure Correction via Collaborative Transformations and Adaptive Sampling,Ziwen Li · Feng Zhang · Meng Cao · Jinpu Zhang · Yuanjie Shao · Yuehuan Wang · Nong Sang,https://github.com/HUST-IAL/CoTF,,https://www.semanticscholar.org/paper/An-Efficient-Method-for-Real-Time-Image-Exposure-Yang-Zhang/b40baf5034dcc98f06f53abe907b9ac0395e2bb2,,,,,nan +Multi-Scale 3D Gaussian Splatting for Anti-Aliased Rendering,Zhiwen Yan · Weng Fei Low · Yu Chen · Gim Hee Lee, ,https://arxiv.org/abs/2311.17089,,2311.17089.pdf,Multi-Scale 3D Gaussian Splatting for Anti-Aliased Rendering,"3D Gaussians have recently emerged as a highly efficient representation for +3D reconstruction and rendering. Despite its high rendering quality and speed +at high resolutions, they both deteriorate drastically when rendered at lower +resolutions or from far away camera position. During low resolution or far away +rendering, the pixel size of the image can fall below the Nyquist frequency +compared to the screen size of each splatted 3D Gaussian and leads to aliasing +effect. The rendering is also drastically slowed down by the sequential alpha +blending of more splatted Gaussians per pixel. To address these issues, we +propose a multi-scale 3D Gaussian splatting algorithm, which maintains +Gaussians at different scales to represent the same scene. Higher-resolution +images are rendered with more small Gaussians, and lower-resolution images are +rendered with fewer larger Gaussians. With similar training time, our algorithm +can achieve 13\%-66\% PSNR and 160\%-2400\% rendering speed improvement at +4$\times$-128$\times$ scale rendering on Mip-NeRF360 dataset compared to the +single scale 3D Gaussian splitting. Our code and more results are available on +our project website https://jokeryan.github.io/projects/ms-gs/",cs.CV,['cs.CV'] +Construct to Associate: Cooperative Context Learning for Domain Adaptive Point Cloud Segmentation,Guangrui Li, ,,https://ieeexplore.ieee.org/document/10330760,,,,,nan +Holistic Features are almost Sufficient for Text-to-Video Retrieval,Kaibin Tian · Ruixiang Zhao · Zijie Xin · Bangxiang Lan · Xirong Li,https://github.com/ruc-aimc-lab/TeachCLIP,,https://lixirong.net/research/cvpr2024-holistic-features-are-almost-sufficient-for-text-to-video-retrieval,,,,,nan +TE-TAD: Towards Fully End-to-End Temporal Action Detection via Time-Aligned Coordinate Expression,Ho-Joong Kim · Jung-Ho Hong · Heejo Kong · Seong-Whan Lee, ,https://arxiv.org/abs/2404.02405,,2404.02405.pdf,TE-TAD: Towards Full End-to-End Temporal Action Detection via Time-Aligned Coordinate Expression,"In this paper, we investigate that the normalized coordinate expression is a +key factor as reliance on hand-crafted components in query-based detectors for +temporal action detection (TAD). Despite significant advancements towards an +end-to-end framework in object detection, query-based detectors have been +limited in achieving full end-to-end modeling in TAD. To address this issue, we +propose \modelname{}, a full end-to-end temporal action detection transformer +that integrates time-aligned coordinate expression. We reformulate coordinate +expression utilizing actual timeline values, ensuring length-invariant +representations from the extremely diverse video duration environment. 
+Furthermore, our proposed adaptive query selection dynamically adjusts the +number of queries based on video length, providing a suitable solution for +varying video durations compared to a fixed query set. Our approach not only +simplifies the TAD process by eliminating the need for hand-crafted components +but also significantly improves the performance of query-based detectors. Our +TE-TAD outperforms the previous query-based detectors and achieves competitive +performance compared to state-of-the-art methods on popular benchmark datasets. +Code is available at: https://github.com/Dotori-HJ/TE-TAD",cs.CV,['cs.CV'] +Dual Prototype Attention for Unsupervised Video Object Segmentation,Suhwan Cho · Minhyeok Lee · Seunghoon Lee · Dogyoon Lee · Heeseung Choi · Ig-Jae Kim · Sangyoun Lee, ,https://arxiv.org/abs/2309.14786,,2309.14786.pdf,Treating Motion as Option with Output Selection for Unsupervised Video Object Segmentation,"Unsupervised video object segmentation (VOS) is a task that aims to detect +the most salient object in a video without external guidance about the object. +To leverage the property that salient objects usually have distinctive +movements compared to the background, recent methods collaboratively use motion +cues extracted from optical flow maps with appearance cues extracted from RGB +images. However, as optical flow maps are usually very relevant to segmentation +masks, the network is easy to be learned overly dependent on the motion cues +during network training. As a result, such two-stream approaches are vulnerable +to confusing motion cues, making their prediction unstable. To relieve this +issue, we design a novel motion-as-option network by treating motion cues as +optional. During network training, RGB images are randomly provided to the +motion encoder instead of optical flow maps, to implicitly reduce motion +dependency of the network. As the learned motion encoder can deal with both RGB +images and optical flow maps, two different predictions can be generated +depending on which source information is used as motion input. In order to +fully exploit this property, we also propose an adaptive output selection +algorithm to adopt optimal prediction result at test time. Our proposed +approach affords state-of-the-art performance on all public benchmark datasets, +even maintaining real-time inference speed.",cs.CV,['cs.CV'] +Adaptive Slot Attention: Object Discovery with Dynamic Slot Number,Ke Fan · Zechen Bai · Tianjun Xiao · Tong He · Max Horn · Yanwei Fu · Francesco Locatello · Zheng Zhang, ,https://arxiv.org/abs/2307.09437,,2307.09437.pdf,Grounded Object Centric Learning,"The extraction of modular object-centric representations for downstream tasks +is an emerging area of research. Learning grounded representations of objects +that are guaranteed to be stable and invariant promises robust performance +across different tasks and environments. Slot Attention (SA) learns +object-centric representations by assigning objects to \textit{slots}, but +presupposes a \textit{single} distribution from which all slots are randomly +initialised. This results in an inability to learn \textit{specialized} slots +which bind to specific object types and remain invariant to identity-preserving +changes in object appearance. To address this, we present +\emph{\textsc{Co}nditional \textsc{S}lot \textsc{A}ttention} (\textsc{CoSA}) +using a novel concept of \emph{Grounded Slot Dictionary} (GSD) inspired by +vector quantization. 
Our proposed GSD comprises (i) canonical object-level +property vectors and (ii) parametric Gaussian distributions, which define a +prior over the slots. We demonstrate the benefits of our method in multiple +downstream tasks such as scene generation, composition, and task adaptation, +whilst remaining competitive with SA in popular object discovery benchmarks.",cs.LG,"['cs.LG', 'cs.AI', 'cs.CV']" +Holoported Characters: Real-time Free-viewpoint Rendering of Humans from Sparse RGB Cameras,Ashwath Shetty · Marc Habermann · Guoxing Sun · Diogo Luvizon · Vladislav Golyanik · Christian Theobalt, ,https://arxiv.org/abs/2312.07423,,2312.07423.pdf,Holoported Characters: Real-time Free-viewpoint Rendering of Humans from Sparse RGB Cameras,"We present the first approach to render highly realistic free-viewpoint +videos of a human actor in general apparel, from sparse multi-view recording to +display, in real-time at an unprecedented 4K resolution. At inference, our +method only requires four camera views of the moving actor and the respective +3D skeletal pose. It handles actors in wide clothing, and reproduces even +fine-scale dynamic detail, e.g. clothing wrinkles, face expressions, and hand +gestures. At training time, our learning-based approach expects dense +multi-view video and a rigged static surface scan of the actor. Our method +comprises three main stages. Stage 1 is a skeleton-driven neural approach for +high-quality capture of the detailed dynamic mesh geometry. Stage 2 is a novel +solution to create a view-dependent texture using four test-time camera views +as input. Finally, stage 3 comprises a new image-based refinement network +rendering the final 4K image given the output from the previous stages. Our +approach establishes a new benchmark for real-time rendering resolution and +quality using sparse input camera views, unlocking possibilities for immersive +telepresence.",cs.CV,['cs.CV'] +Seeing the World through Your Eyes,Hadi Alzayer · Kevin Zhang · Brandon Y. Feng · Christopher Metzler · Jia-Bin Huang, ,https://arxiv.org/abs/2306.09348,,2306.09348.pdf,Seeing the World through Your Eyes,"The reflective nature of the human eye is an underappreciated source of +information about what the world around us looks like. By imaging the eyes of a +moving person, we can collect multiple views of a scene outside the camera's +direct line of sight through the reflections in the eyes. In this paper, we +reconstruct a 3D scene beyond the camera's line of sight using portrait images +containing eye reflections. This task is challenging due to 1) the difficulty +of accurately estimating eye poses and 2) the entangled appearance of the eye +iris and the scene reflections. Our method jointly refines the cornea poses, +the radiance field depicting the scene, and the observer's eye iris texture. We +further propose a simple regularization prior on the iris texture pattern to +improve reconstruction quality. 
Through various experiments on synthetic and +real-world captures featuring people with varied eye colors, we demonstrate the +feasibility of our approach to recover 3D scenes using eye reflections.",cs.CV,['cs.CV'] +NeRF Analogies - Example-Based Visual Attribute Transfer for NeRFs,Michael Fischer · Zhengqin Li · Thu Nguyen-Phuoc · Aljaž Božič · Zhao Dong · Carl Marshall · Tobias Ritschel, ,https://arxiv.org/abs/2402.08622,,2402.08622.pdf,NeRF Analogies: Example-Based Visual Attribute Transfer for NeRFs,"A Neural Radiance Field (NeRF) encodes the specific relation of 3D geometry +and appearance of a scene. We here ask the question whether we can transfer the +appearance from a source NeRF onto a target 3D geometry in a semantically +meaningful way, such that the resulting new NeRF retains the target geometry +but has an appearance that is an analogy to the source NeRF. To this end, we +generalize classic image analogies from 2D images to NeRFs. We leverage +correspondence transfer along semantic affinity that is driven by semantic +features from large, pre-trained 2D image models to achieve multi-view +consistent appearance transfer. Our method allows exploring the mix-and-match +product space of 3D geometry and appearance. We show that our method +outperforms traditional stylization-based methods and that a large majority of +users prefer our method over several typical baselines.",cs.CV,"['cs.CV', 'cs.GR']" +MaskINT: Video Editing via Interpolative Non-autoregressive Masked Transformers,Haoyu Ma · Shahin Mahdizadehaghdam · Bichen Wu · Zhipeng Fan · Yuchao Gu · Wenliang Zhao · Lior Shapira · Xiaohui Xie, ,https://arxiv.org/abs/2312.12468,,2312.12468.pdf,MaskINT: Video Editing via Interpolative Non-autoregressive Masked Transformers,"Recent advances in generative AI have significantly enhanced image and video +editing, particularly in the context of text prompt control. State-of-the-art +approaches predominantly rely on diffusion models to accomplish these tasks. +However, the computational demands of diffusion-based methods are substantial, +often necessitating large-scale paired datasets for training, and therefore +challenging their deployment in real applications. To address these issues, this +paper breaks down the text-based video editing task into two stages. First, we +leverage a pre-trained text-to-image diffusion model to simultaneously edit a +few keyframes in a zero-shot way. Second, we introduce an efficient model +called MaskINT, which is built on non-autoregressive masked generative +transformers and specializes in frame interpolation between the edited +keyframes, using the structural guidance from intermediate frames. Experimental +results suggest that our MaskINT achieves comparable performance with +diffusion-based methodologies, while significantly improving the inference time. 
+This research offers a practical solution for text-based video editing and +showcases the potential of non-autoregressive masked generative transformers in +this domain.",cs.CV,['cs.CV'] +Person in Place: Generating Associative Skeleton-Guidance Maps for Human-Object Interaction Image Editing,ChangHee Yang · ChanHee Kang · Kyeongbo Kong · Hanni Oh · Suk-Ju Kang,https://yangchanghee.github.io/Person-in-Place_page/,,https://vds.sogang.ac.kr/?cat=5,,,,,nan +Correspondence-Free Non-Rigid Point Set Registration Using Unsupervised Clustering Analysis,Mingyang Zhao · Jiang Jingen · Lei Ma · Shiqing Xin · Gaofeng Meng · Dong-Ming Yan, ,,https://link.springer.com/article/10.1007/s11042-023-16854-0,,,,,nan +Beyond First-Order Tweedie: Solving Inverse Problems using Latent Diffusion,Litu Rout · Yujia Chen · Abhishek Kumar · Constantine Caramanis · Sanjay Shakkottai · Wen-Sheng Chu,https://stsl-inverse-edit.github.io/,https://arxiv.org/abs/2312.00852,,2312.00852.pdf,Beyond First-Order Tweedie: Solving Inverse Problems using Latent Diffusion,"Sampling from the posterior distribution poses a major computational +challenge in solving inverse problems using latent diffusion models. Common +methods rely on Tweedie's first-order moments, which are known to induce a +quality-limiting bias. Existing second-order approximations are impractical due +to prohibitive computational costs, making standard reverse diffusion processes +intractable for posterior sampling. This paper introduces Second-order Tweedie +sampler from Surrogate Loss (STSL), a novel sampler that offers efficiency +comparable to first-order Tweedie with a tractable reverse process using +second-order approximation. Our theoretical results reveal that the +second-order approximation is lower bounded by our surrogate loss that only +requires $O(1)$ compute using the trace of the Hessian, and by the lower bound +we derive a new drift term to make the reverse process tractable. Our method +surpasses SoTA solvers PSLD and P2L, achieving 4X and 8X reduction in neural +function evaluations, respectively, while notably enhancing sampling quality on +FFHQ, ImageNet, and COCO benchmarks. In addition, we show STSL extends to +text-guided image editing and addresses residual distortions present from +corrupted images in leading text-guided image editing methods. To our best +knowledge, this is the first work to offer an efficient second-order +approximation in solving inverse problems using latent diffusion and editing +real-world images with corruptions.",cs.LG,"['cs.LG', 'cs.CV', 'stat.ML']" +Human Motion Prediction under Unexpected Perturbation,Jiangbei Yue · Baiyi Li · Julien Pettré · Armin Seyfried · He Wang, ,https://arxiv.org/abs/2403.15891,,2403.15891.pdf,Human Motion Prediction under Unexpected Perturbation,"We investigate a new task in human motion prediction, which is predicting +motions under unexpected physical perturbation potentially involving multiple +people. Compared with existing research, this task involves predicting less +controlled, unpremeditated and pure reactive motions in response to external +impact and how such motions can propagate through people. It brings new +challenges such as data scarcity and predicting complex interactions. To this +end, we propose a new method capitalizing differential physics and deep neural +networks, leading to an explicit Latent Differential Physics (LDP) model. 
+Through experiments, we demonstrate that LDP has high data efficiency, +outstanding prediction accuracy, strong generalizability and good +explainability. Since there is no similar research, a comprehensive comparison +with 11 adapted baselines from several relevant domains is conducted, showing +LDP outperforming existing research both quantitatively and qualitatively, +improving prediction accuracy by as much as 70%, and demonstrating +significantly stronger generalization.",cs.CV,['cs.CV'] +TASeg: Temporal Aggregation Network for LiDAR Semantic Segmentation,Xiaopei Wu · Yuenan Hou · Xiaoshui Huang · Binbin Lin · Tong He · Xinge Zhu · Yuexin Ma · Boxi Wu · Haifeng Liu · Deng Cai · Wanli Ouyang, ,https://arxiv.org/html/2309.07849v3,,2309.07849v3.pdf,TFNet: Exploiting Temporal Cues for Fast and Accurate LiDAR Semantic Segmentation,"LiDAR semantic segmentation plays a crucial role in enabling autonomous +driving and robots to understand their surroundings accurately and robustly. A +multitude of methods exist within this domain, including point-based, +range-image-based, polar-coordinate-based, and hybrid strategies. Among these, +range-image-based techniques have gained widespread adoption in practical +applications due to their efficiency. However, they face a significant +challenge known as the ``many-to-one'' problem caused by the range image's +limited horizontal and vertical angular resolution. As a result, around 20% of +the 3D points can be occluded. In this paper, we present TFNet, a +range-image-based LiDAR semantic segmentation method that utilizes temporal +information to address this issue. Specifically, we incorporate a temporal +fusion layer to extract useful information from previous scans and integrate it +with the current scan. We then design a max-voting-based post-processing +technique to correct false predictions, particularly those caused by the +``many-to-one'' issue. We evaluated the approach on two benchmarks and +demonstrated that the plug-in post-processing technique is generic and can be +applied to various networks.",cs.CV,['cs.CV'] +ECLIPSE: Efficient Continual Learning in Panoptic Segmentation with Visual Prompt Tuning,Beomyoung Kim · Joonsang Yu · Sung Ju Hwang,https://github.com/clovaai/ECLIPSE,https://arxiv.org/abs/2403.20126,,2403.20126.pdf,ECLIPSE: Efficient Continual Learning in Panoptic Segmentation with Visual Prompt Tuning,"Panoptic segmentation, combining semantic and instance segmentation, stands +as a cutting-edge computer vision task. Despite recent progress with deep +learning models, the dynamic nature of real-world applications necessitates +continual learning, where models adapt to new classes (plasticity) over time +without forgetting old ones (catastrophic forgetting). Current continual +segmentation methods often rely on distillation strategies like knowledge +distillation and pseudo-labeling, which are effective but result in increased +training complexity and computational overhead. In this paper, we introduce a +novel and efficient method for continual panoptic segmentation based on Visual +Prompt Tuning, dubbed ECLIPSE. Our approach involves freezing the base model +parameters and fine-tuning only a small set of prompt embeddings, addressing +both catastrophic forgetting and plasticity and significantly reducing the +trainable parameters. To mitigate inherent challenges such as error propagation +and semantic drift in continual segmentation, we propose logit manipulation to +effectively leverage common knowledge across the classes. 
Experiments on ADE20K +continual panoptic segmentation benchmark demonstrate the superiority of +ECLIPSE, notably its robustness against catastrophic forgetting and its +reasonable plasticity, achieving a new state-of-the-art. The code is available +at https://github.com/clovaai/ECLIPSE.",cs.CV,['cs.CV'] +LEOD: Label-Efficient Object Detection for Event Cameras,Ziyi Wu · Mathias Gehrig · Qing Lyu · Xudong Liu · Igor Gilitschenski,https://github.com/Wuziyi616/LEOD,https://arxiv.org/abs/2311.17286,,2311.17286.pdf,LEOD: Label-Efficient Object Detection for Event Cameras,"Object detection with event cameras benefits from the sensor's low latency +and high dynamic range. However, it is costly to fully label event streams for +supervised training due to their high temporal resolution. To reduce this cost, +we present LEOD, the first method for label-efficient event-based detection. +Our approach unifies weakly- and semi-supervised object detection with a +self-training mechanism. We first utilize a detector pre-trained on limited +labels to produce pseudo ground truth on unlabeled events. Then, the detector +is re-trained with both real and generated labels. Leveraging the temporal +consistency of events, we run bi-directional inference and apply tracking-based +post-processing to enhance the quality of pseudo labels. To stabilize training +against label noise, we further design a soft anchor assignment strategy. We +introduce new experimental protocols to evaluate the task of label-efficient +event-based detection on Gen1 and 1Mpx datasets. LEOD consistently outperforms +supervised baselines across various labeling ratios. For example, on Gen1, it +improves mAP by 8.6% and 7.8% for RVT-S trained with 1% and 2% labels. On 1Mpx, +RVT-S with 10% labels even surpasses its fully-supervised counterpart using +100% labels. LEOD maintains its effectiveness even when all labeled data are +available, reaching new state-of-the-art results. Finally, we show that our +method readily scales to improve larger detectors as well. Code is released at +https://github.com/Wuziyi616/LEOD",cs.CV,['cs.CV'] +Adapters Strike Back,Jan-Martin Steitz · Stefan Roth, ,,https://strikefans.com/the-ink-black-heart-has-wrapped/,,,,,nan +CodedEvents: Optimal Point-Spread-Function Engineering for 3D-Tracking with Event Cameras,Sachin Shah · Matthew Chan · Haoming Cai · Jingxi Chen · Sakshum Kulshrestha · Chahat Deep Singh · Yiannis Aloimonos · Christopher Metzler, ,https://arxiv.org/abs/2404.11511,,2404.11511.pdf,"Event Cameras Meet SPADs for High-Speed, Low-Bandwidth Imaging","Traditional cameras face a trade-off between low-light performance and +high-speed imaging: longer exposure times to capture sufficient light results +in motion blur, whereas shorter exposures result in Poisson-corrupted noisy +images. While burst photography techniques help mitigate this tradeoff, +conventional cameras are fundamentally limited in their sensor noise +characteristics. Event cameras and single-photon avalanche diode (SPAD) sensors +have emerged as promising alternatives to conventional cameras due to their +desirable properties. SPADs are capable of single-photon sensitivity with +microsecond temporal resolution, and event cameras can measure brightness +changes up to 1 MHz with low bandwidth requirements. We show that these +properties are complementary, and can help achieve low-light, high-speed image +reconstruction with low bandwidth requirements. 
We introduce a sensor fusion +framework to combine SPADs with event cameras to improve the reconstruction of +high-speed, low-light scenes while reducing the high bandwidth cost associated +with using every SPAD frame. Our evaluation, on both synthetic and real sensor +data, demonstrates significant enhancements ( > 5 dB PSNR) in reconstructing +low-light scenes at high temporal resolution (100 kHz) compared to conventional +cameras. Event-SPAD fusion shows great promise for real-world applications, +such as robotics or medical imaging.",eess.IV,"['eess.IV', 'cs.CV']" +Gaussian Head Avatar: Ultra High-fidelity Head Avatar via Dynamic Gaussians,Yuelang Xu · Benwang Chen · Zhe Li · Hongwen Zhang · Lizhen Wang · Zerong Zheng · Yebin Liu,https://yuelangx.github.io/gaussianheadavatar,https://arxiv.org/abs/2312.03029,,2312.03029.pdf,Gaussian Head Avatar: Ultra High-fidelity Head Avatar via Dynamic Gaussians,"Creating high-fidelity 3D head avatars has always been a research hotspot, +but there remains a great challenge under lightweight sparse view setups. In +this paper, we propose Gaussian Head Avatar represented by controllable 3D +Gaussians for high-fidelity head avatar modeling. We optimize the neutral 3D +Gaussians and a fully learned MLP-based deformation field to capture complex +expressions. The two parts benefit each other, so that our method can model +fine-grained dynamic details while ensuring expression accuracy. Furthermore, +we devise a well-designed geometry-guided initialization strategy based on +implicit SDF and Deep Marching Tetrahedra for the stability and convergence of +the training procedure. Experiments show our approach outperforms other +state-of-the-art sparse-view methods, achieving ultra high-fidelity rendering +quality at 2K resolution even under exaggerated expressions.",cs.CV,"['cs.CV', 'cs.GR']" +Rethinking the Objectives of Vector-Quantized Tokenizers for Image Synthesis,Yuchao Gu · Xintao Wang · Yixiao Ge · Ying Shan · Mike Zheng Shou, ,https://ar5iv.labs.arxiv.org/html/2310.01218,,2310.01218.pdf,Making LLaMA SEE and Draw with SEED Tokenizer,"The great success of Large Language Models (LLMs) has expanded the potential +of multimodality, contributing to the gradual evolution of General Artificial +Intelligence (AGI). A true AGI agent should not only possess the capability to +perform predefined multi-tasks but also exhibit emergent abilities in an +open-world context. However, despite the considerable advancements made by +recent multimodal LLMs, they still fall short in effectively unifying +comprehension and generation tasks, let alone open-world emergent abilities. We +contend that the key to overcoming the present impasse lies in enabling text +and images to be represented and processed interchangeably within a unified +autoregressive Transformer. To this end, we introduce SEED, an elaborate image +tokenizer that empowers LLMs with the ability to SEE and Draw at the same time. +We identify two crucial design principles: (1) Image tokens should be +independent of 2D physical patch positions and instead be produced with a 1D +causal dependency, exhibiting intrinsic interdependence that aligns with the +left-to-right autoregressive prediction mechanism in LLMs. (2) Image tokens +should capture high-level semantics consistent with the degree of semantic +abstraction in words, and be optimized for both discriminativeness and +reconstruction during the tokenizer training phase. 
With SEED tokens, LLM is +able to perform scalable multimodal autoregression under its original training +recipe, i.e., next-word prediction. SEED-LLaMA is therefore produced by +large-scale pretraining and instruction tuning on the interleaved textual and +visual data, demonstrating impressive performance on a broad range of +multimodal comprehension and generation tasks. More importantly, SEED-LLaMA has +exhibited compositional emergent abilities such as multi-turn in-context +multimodal generation, acting like your AI assistant.",cs.CV,['cs.CV'] +HDRFlow: Real-Time HDR Video Reconstruction with Large Motions,Gangwei Xu · Yujin Wang · Jinwei Gu · Tianfan Xue · Xin Yang, ,https://arxiv.org/abs/2403.03447,,2403.03447.pdf,HDRFlow: Real-Time HDR Video Reconstruction with Large Motions,"Reconstructing High Dynamic Range (HDR) video from image sequences captured +with alternating exposures is challenging, especially in the presence of large +camera or object motion. Existing methods typically align low dynamic range +sequences using optical flow or attention mechanism for deghosting. However, +they often struggle to handle large complex motions and are computationally +expensive. To address these challenges, we propose a robust and efficient flow +estimator tailored for real-time HDR video reconstruction, named HDRFlow. +HDRFlow has three novel designs: an HDR-domain alignment loss (HALoss), an +efficient flow network with a multi-size large kernel (MLK), and a new HDR flow +training scheme. The HALoss supervises our flow network to learn an +HDR-oriented flow for accurate alignment in saturated and dark regions. The MLK +can effectively model large motions at a negligible cost. In addition, we +incorporate synthetic data, Sintel, into our training dataset, utilizing both +its provided forward flow and backward flow generated by us to supervise our +flow network, enhancing our performance in large motion regions. Extensive +experiments demonstrate that our HDRFlow outperforms previous methods on +standard benchmarks. To the best of our knowledge, HDRFlow is the first +real-time HDR video reconstruction method for video sequences captured with +alternating exposures, capable of processing 720p resolution inputs at 25ms.",cs.CV,['cs.CV'] +LiSA: LiDAR Localization with Semantic Awareness,Bochun Yang · Zijun Li · Wen Li · zhipeng cai · Chenglu Wen · Yu Zang · Matthias Mueller · Cheng Wang, ,https://arxiv.org/abs/2402.18934,,2402.18934.pdf,RELEAD: Resilient Localization with Enhanced LiDAR Odometry in Adverse Environments,"LiDAR-based localization is valuable for applications like mining surveys and +underground facility maintenance. However, existing methods can struggle when +dealing with uninformative geometric structures in challenging scenarios. This +paper presents RELEAD, a LiDAR-centric solution designed to address +scan-matching degradation. Our method enables degeneracy-free point cloud +registration by solving constrained ESIKF updates in the front end and +incorporates multisensor constraints, even when dealing with outlier +measurements, through graph optimization based on Graduated Non-Convexity +(GNC). Additionally, we propose a robust Incremental Fixed Lag Smoother (rIFL) +for efficient GNC-based optimization. 
RELEAD has undergone extensive evaluation +in degenerate scenarios and has outperformed existing state-of-the-art +LiDAR-Inertial odometry and LiDAR-Visual-Inertial odometry methods.",cs.RO,['cs.RO'] +Language Models as Black-Box Optimizers for Vision-Language Models,Shihong Liu · Samuel Yu · Zhiqiu Lin · Deepak Pathak · Deva Ramanan,https://llm-can-optimize-vlm.github.io/,https://arxiv.org/abs/2309.05950,,2309.05950.pdf,Language Models as Black-Box Optimizers for Vision-Language Models,"Vision-language models (VLMs) pre-trained on web-scale datasets have +demonstrated remarkable capabilities on downstream tasks when fine-tuned with +minimal data. However, many VLMs rely on proprietary data and are not +open-source, which restricts the use of white-box approaches for fine-tuning. +As such, we aim to develop a black-box approach to optimize VLMs through +natural language prompts, thereby avoiding the need to access model parameters, +feature embeddings, or even output logits. We propose employing chat-based LLMs +to search for the best text prompt for VLMs. Specifically, we adopt an +automatic hill-climbing procedure that converges to an effective prompt by +evaluating the performance of current prompts and asking LLMs to refine them +based on textual feedback, all within a conversational process without +human-in-the-loop. In a challenging 1-shot image classification setup, our +simple approach surpasses the white-box continuous prompting method (CoOp) by +an average of 1.5% across 11 datasets including ImageNet. Our approach also +outperforms both human-engineered and LLM-generated prompts. We highlight the +advantage of conversational feedback that incorporates both positive and +negative prompts, suggesting that LLMs can utilize the implicit gradient +direction in textual feedback for a more efficient search. In addition, we find +that the text prompts generated through our strategy are not only more +interpretable but also transfer well across different VLM architectures in a +black-box manner. Lastly, we apply our framework to optimize the +state-of-the-art black-box VLM (DALL-E 3) for text-to-image generation, prompt +inversion, and personalization.",cs.CL,"['cs.CL', 'cs.CV', 'cs.LG', 'cs.MM']" +The Neglected Tails of Vision-Language Models,Shubham Parashar · Tian Liu · Zhiqiu Lin · Xiangjue Dong · Yanan Li · James Caverlee · Deva Ramanan · Shu Kong,https://shubhamprshr27.github.io/neglected-tails-of-vlms/,https://arxiv.org/abs/2401.12425,,2401.12425.pdf,The Neglected Tails in Vision-Language Models,"Vision-language models (VLMs) excel in zero-shot recognition but their +performance varies greatly across different visual concepts. For example, +although CLIP achieves impressive accuracy on ImageNet (60-80%), its +performance drops below 10% for more than ten concepts like night snake, +presumably due to their limited presence in the pretraining data. However, +measuring the frequency of concepts in VLMs' large-scale datasets is +challenging. We address this by using large language models (LLMs) to count the +number of pretraining texts that contain synonyms of these concepts. Our +analysis confirms that popular datasets, such as LAION, exhibit a long-tailed +concept distribution, yielding biased performance in VLMs. We also find that +downstream applications of VLMs, including visual chatbots (e.g., GPT-4V) and +text-to-image models (e.g., Stable Diffusion), often fail to recognize or +generate images of rare concepts identified by our method. 
To mitigate the +imbalanced performance of zero-shot VLMs, we propose REtrieval-Augmented +Learning (REAL). First, instead of prompting VLMs using the original class +names, REAL uses their most frequent synonyms found in pretraining texts. This +simple change already outperforms costly human-engineered and LLM-enriched +prompts over nine benchmark datasets. Second, REAL trains a linear classifier +on a small yet balanced set of pretraining data retrieved using concept +synonyms. REAL surpasses the previous zero-shot SOTA, using 400x less storage +and 10,000x less training time!",cs.CV,"['cs.CV', 'cs.CL', 'cs.LG']" +DiverGen: Improving Instance Segmentation by Learning Wider Data Distribution with More Diverse Generative Data,Chengxiang Fan · Muzhi Zhu · Hao Chen · Yang Liu · Weijia Wu · Huaqi Zhang · Chunhua Shen,https://github.com/aim-uofa/DiverGen,https://arxiv.org/abs/2405.10185,,2405.10185.pdf,DiverGen: Improving Instance Segmentation by Learning Wider Data Distribution with More Diverse Generative Data,"Instance segmentation is data-hungry, and as model capacity increases, data +scale becomes crucial for improving the accuracy. Most instance segmentation +datasets today require costly manual annotation, limiting their data scale. +Models trained on such data are prone to overfitting on the training set, +especially for those rare categories. While recent works have delved into +exploiting generative models to create synthetic datasets for data +augmentation, these approaches do not efficiently harness the full potential of +generative models. + To address these issues, we introduce a more efficient strategy to construct +generative datasets for data augmentation, termed DiverGen. Firstly, we provide +an explanation of the role of generative data from the perspective of +distribution discrepancy. We investigate the impact of different data on the +distribution learned by the model. We argue that generative data can expand the +data distribution that the model can learn, thus mitigating overfitting. +Additionally, we find that the diversity of generative data is crucial for +improving model performance and enhance it through various strategies, +including category diversity, prompt diversity, and generative model diversity. +With these strategies, we can scale the data to millions while maintaining the +trend of model performance improvement. On the LVIS dataset, DiverGen +significantly outperforms the strong model X-Paste, achieving +1.1 box AP and ++1.1 mask AP across all categories, and +1.9 box AP and +2.5 mask AP for rare +categories.",cs.CV,['cs.CV'] +Extend Your Own Correspondences: Unsupervised Distant Point Cloud Registration by Progressive Distance Extension,Quan Liu · Hongzi Zhu · Zhenxi Wang · Yunsong Zhou · Shan Chang · Minyi Guo, ,https://arxiv.org/abs/2403.03532,,2403.03532.pdf,Extend Your Own Correspondences: Unsupervised Distant Point Cloud Registration by Progressive Distance Extension,"Registration of point clouds collected from a pair of distant vehicles +provides a comprehensive and accurate 3D view of the driving scenario, which is +vital for driving safety related applications, yet existing literature suffers +from the expensive pose label acquisition and the deficiency to generalize to +new data distributions. In this paper, we propose EYOC, an unsupervised distant +point cloud registration method that adapts to new point cloud distributions on +the fly, requiring no global pose labels. 
The core idea of EYOC is to train a +feature extractor in a progressive fashion, where in each round, the feature +extractor, trained with near point cloud pairs, can label slightly farther +point cloud pairs, enabling self-supervision on such far point cloud pairs. +This process continues until the derived extractor can be used to register +distant point clouds. Particularly, to enable high-fidelity correspondence +label generation, we devise an effective spatial filtering scheme to select the +most representative correspondences to register a point cloud pair, and then +utilize the aligned point clouds to discover more correct correspondences. +Experiments show that EYOC can achieve comparable performance with +state-of-the-art supervised methods at a lower training cost. Moreover, it +outwits supervised methods regarding generalization performance on new data +distributions.",cs.CV,['cs.CV'] +Fine-grained Bipartite Concept Factorization for Clustering,Chong Peng · Pengfei Zhang · Yongyong Chen · zhao kang · Chenglizhao Chen · Qiang Cheng, ,,https://ieeexplore.ieee.org/document/10506642,,,,,nan +SuperNormal: Neural Surface Reconstruction via Multi-View Normal Integration,Xu Cao · Takafumi Taketomi, ,https://arxiv.org/abs/2312.04803,,2312.04803.pdf,SuperNormal: Neural Surface Reconstruction via Multi-View Normal Integration,"We present SuperNormal, a fast, high-fidelity approach to multi-view 3D +reconstruction using surface normal maps. With a few minutes, SuperNormal +produces detailed surfaces on par with 3D scanners. We harness volume rendering +to optimize a neural signed distance function (SDF) powered by multi-resolution +hash encoding. To accelerate training, we propose directional finite difference +and patch-based ray marching to approximate the SDF gradients numerically. +While not compromising reconstruction quality, this strategy is nearly twice as +efficient as analytical gradients and about three times faster than +axis-aligned finite difference. Experiments on the benchmark dataset +demonstrate the superiority of SuperNormal in efficiency and accuracy compared +to existing multi-view photometric stereo methods. On our captured objects, +SuperNormal produces more fine-grained geometry than recent neural 3D +reconstruction methods.",cs.CV,['cs.CV'] +SleepVST: Sleep Staging from Near-Infrared Video Signals using Pre-Trained Transformers,Jonathan F. Carter · Joao Jorge · Oliver Gibson · Lionel Tarassenko, ,https://arxiv.org/abs/2404.03831,,2404.03831.pdf,SleepVST: Sleep Staging from Near-Infrared Video Signals using Pre-Trained Transformers,"Advances in camera-based physiological monitoring have enabled the robust, +non-contact measurement of respiration and the cardiac pulse, which are known +to be indicative of the sleep stage. This has led to research into camera-based +sleep monitoring as a promising alternative to ""gold-standard"" polysomnography, +which is cumbersome, expensive to administer, and hence unsuitable for +longer-term clinical studies. In this paper, we introduce SleepVST, a +transformer model which enables state-of-the-art performance in camera-based +sleep stage classification (sleep staging). After pre-training on contact +sensor data, SleepVST outperforms existing methods for cardio-respiratory sleep +staging on the SHHS and MESA datasets, achieving total Cohen's kappa scores of +0.75 and 0.77 respectively. 
We then show that SleepVST can be successfully +transferred to cardio-respiratory waveforms extracted from video, enabling +fully contact-free sleep staging. Using a video dataset of 50 nights, we +achieve a total accuracy of 78.8\% and a Cohen's $\kappa$ of 0.71 in four-class +video-based sleep staging, setting a new state-of-the-art in the domain.",cs.CV,"['cs.CV', 'cs.HC', 'q-bio.NC']" +Progress-Aware Online Action Segmentation for Egocentric Procedural Task Videos,Yuhan Shen · Ehsan Elhamifar, ,https://arxiv.org/abs/2404.01933,,,PREGO: online mistake detection in PRocedural EGOcentric videos,"Promptly identifying procedural errors from egocentric videos in an online +setting is highly challenging and valuable for detecting mistakes as soon as +they happen. This capability has a wide range of applications across various +fields, such as manufacturing and healthcare. The nature of procedural mistakes +is open-set since novel types of failures might occur, which calls for +one-class classifiers trained on correctly executed procedures. However, no +technique can currently detect open-set procedural mistakes online. We propose +PREGO, the first online one-class classification model for mistake detection in +PRocedural EGOcentric videos. PREGO is based on an online action recognition +component to model the current action, and a symbolic reasoning module to +predict the next actions. Mistake detection is performed by comparing the +recognized current action with the expected future one. We evaluate PREGO on +two procedural egocentric video datasets, Assembly101 and Epic-tent, which we +adapt for online benchmarking of procedural mistake detection to establish +suitable benchmarks, thus defining the Assembly101-O and Epic-tent-O datasets, +respectively.",cs.CV,['cs.CV'] +Efficient Solution of Point-Line Absolute Pose,Petr Hruby · Timothy Duff · Marc Pollefeys,https://github.com/petrhruby97/efficient_absolute,https://arxiv.org/abs/2404.16552,,2404.16552.pdf,Efficient Solution of Point-Line Absolute Pose,"We revisit certain problems of pose estimation based on 3D--2D +correspondences between features which may be points or lines. Specifically, we +address the two previously-studied minimal problems of estimating camera +extrinsics from $p \in \{ 1, 2 \}$ point--point correspondences and $l=3-p$ +line--line correspondences. To the best of our knowledge, all of the +previously-known practical solutions to these problems required computing the +roots of degree $\ge 4$ (univariate) polynomials when $p=2$, or degree $\ge 8$ +polynomials when $p=1.$ We describe and implement two elementary solutions +which reduce the degrees of the needed polynomials from $4$ to $2$ and from $8$ +to $4$, respectively. We show experimentally that the resulting solvers are +numerically stable and fast: when compared to the previous state-of-the art, we +may obtain nearly an order of magnitude speedup. The code is available at +\url{https://github.com/petrhruby97/efficient\_absolute}",cs.CV,"['cs.CV', '68T45', 'I.4.5']" +ProTeCt: Prompt Tuning for Taxonomic Open Set Classification,Tz-Ying Wu · Chih-Hui Ho · Nuno Vasconcelos,http://www.svcl.ucsd.edu/projects/protect/,https://arxiv.org/abs/2306.02240,,2306.02240.pdf,ProTeCt: Prompt Tuning for Taxonomic Open Set Classification,"Visual-language foundation models, like CLIP, learn generalized +representations that enable zero-shot open-set classification. 
Few-shot +adaptation methods, based on prompt tuning, have been shown to further improve +performance on downstream datasets. However, these methods do not fare well in +the taxonomic open set (TOS) setting, where the classifier is asked to make +predictions from label sets across different levels of semantic granularity. +Frequently, they infer incorrect labels at coarser taxonomic class levels, even +when the inference at the leaf level (original class labels) is correct. To +address this problem, we propose a prompt tuning technique that calibrates the +hierarchical consistency of model predictions. A set of metrics of hierarchical +consistency, the Hierarchical Consistent Accuracy (HCA) and the Mean Treecut +Accuracy (MTA), are first proposed to evaluate TOS model performance. A new +Prompt Tuning for Hierarchical Consistency (ProTeCt) technique is then proposed +to calibrate classification across label set granularities. Results show that +ProTeCt can be combined with existing prompt tuning methods to significantly +improve TOS classification without degrading the leaf level classification +performance.",cs.CV,['cs.CV'] +In-N-Out: Faithful 3D GAN Inversion with Volumetric Decomposition for Face Editing,Yiran Xu · Zhixin Shu · Cameron Smith · Seoung Wug Oh · Jia-Bin Huang,https://in-n-out-3d.github.io/,,https://www.youtube.com/watch?v=JGbLEEANtnI,,,,,nan +On the Faithfulness of Vision Transformer Explanations,Junyi Wu · Weitai Kang · Hao Tang · Yuan Hong · Yan Yan, ,https://arxiv.org/abs/2404.01415,,2404.01415.pdf,On the Faithfulness of Vision Transformer Explanations,"To interpret Vision Transformers, post-hoc explanations assign salience +scores to input pixels, providing human-understandable heatmaps. However, +whether these interpretations reflect true rationales behind the model's output +is still underexplored. To address this gap, we study the faithfulness +criterion of explanations: the assigned salience scores should represent the +influence of the corresponding input pixels on the model's predictions. To +evaluate faithfulness, we introduce Salience-guided Faithfulness Coefficient +(SaCo), a novel evaluation metric leveraging essential information of salience +distribution. Specifically, we conduct pair-wise comparisons among distinct +pixel groups and then aggregate the differences in their salience scores, +resulting in a coefficient that indicates the explanation's degree of +faithfulness. Our explorations reveal that current metrics struggle to +differentiate between advanced explanation methods and Random Attribution, +thereby failing to capture the faithfulness property. In contrast, our proposed +SaCo offers a reliable faithfulness measurement, establishing a robust metric +for interpretations. Furthermore, our SaCo demonstrates that the use of +gradient and multi-layer aggregation can markedly enhance the faithfulness of +attention-based explanation, shedding light on potential paths for advancing +Vision Transformer explainability.",cs.CV,['cs.CV'] +Concept Weaver: Enabling Multi-Concept Fusion in Text-to-Image Models,Gihyun Kwon · Simon Jenni · Ding Li · Joon-Young Lee · Jong Chul Ye · Fabian Caba Heilbron, ,https://arxiv.org/abs/2404.03913,,2404.03913.pdf,Concept Weaver: Enabling Multi-Concept Fusion in Text-to-Image Models,"While there has been significant progress in customizing text-to-image +generation models, generating images that combine multiple personalized +concepts remains challenging. 
In this work, we introduce Concept Weaver, a +method for composing customized text-to-image diffusion models at inference +time. Specifically, the method breaks the process into two steps: creating a +template image aligned with the semantics of input prompts, and then +personalizing the template using a concept fusion strategy. The fusion strategy +incorporates the appearance of the target concepts into the template image +while retaining its structural details. The results indicate that our method +can generate multiple custom concepts with higher identity fidelity compared to +alternative approaches. Furthermore, the method is shown to seamlessly handle +more than two concepts and closely follow the semantic meaning of the input +prompt without blending appearances across different subjects.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +SPAD: Spatially Aware Multiview Diffusers,Yash Kant · Aliaksandr Siarohin · Ziyi Wu · Michael Vasilkovsky · Guocheng Qian · Jian Ren · Riza Alp Guler · Bernard Ghanem · Sergey Tulyakov · Igor Gilitschenski,https://yashkant.github.io/spad,https://arxiv.org/abs/2402.05235,,2402.05235.pdf,SPAD : Spatially Aware Multiview Diffusers,"We present SPAD, a novel approach for creating consistent multi-view images +from text prompts or single images. To enable multi-view generation, we +repurpose a pretrained 2D diffusion model by extending its self-attention +layers with cross-view interactions, and fine-tune it on a high quality subset +of Objaverse. We find that a naive extension of the self-attention proposed in +prior work (e.g. MVDream) leads to content copying between views. Therefore, we +explicitly constrain the cross-view attention based on epipolar geometry. To +further enhance 3D consistency, we utilize Plucker coordinates derived from +camera rays and inject them as positional encoding. This enables SPAD to reason +over spatial proximity in 3D well. In contrast to recent works that can only +generate views at fixed azimuth and elevation, SPAD offers full camera control +and achieves state-of-the-art results in novel view synthesis on unseen objects +from the Objaverse and Google Scanned Objects datasets. Finally, we demonstrate +that text-to-3D generation using SPAD prevents the multi-face Janus issue. See +more details at our webpage: https://yashkant.github.io/spad",cs.CV,['cs.CV'] +Content-Style Decoupling for Unsupervised Makeup Transfer without Generating Pseudo Ground Truth,Zhaoyang Sun · Shengwu Xiong · Yaxiong Chen · Yi Rong, ,https://arxiv.org/abs/2405.17240,,2405.17240.pdf,Content-Style Decoupling for Unsupervised Makeup Transfer without Generating Pseudo Ground Truth,"The absence of real targets to guide the model training is one of the main +problems with the makeup transfer task. Most existing methods tackle this +problem by synthesizing pseudo ground truths (PGTs). However, the generated +PGTs are often sub-optimal and their imprecision will eventually lead to +performance degradation. To alleviate this issue, in this paper, we propose a +novel Content-Style Decoupled Makeup Transfer (CSD-MT) method, which works in a +purely unsupervised manner and thus eliminates the negative effects of +generating PGTs. Specifically, based on the frequency characteristics analysis, +we assume that the low-frequency (LF) component of a face image is more +associated with its makeup style information, while the high-frequency (HF) +component is more related to its content details. 
This assumption allows CSD-MT +to decouple the content and makeup style information in each face image through +the frequency decomposition. After that, CSD-MT realizes makeup transfer by +maximizing the consistency of these two types of information between the +transferred result and input images, respectively. Two newly designed loss +functions are also introduced to further improve the transfer performance. +Extensive quantitative and qualitative analyses show the effectiveness of our +CSD-MT method. Our code is available at +https://github.com/Snowfallingplum/CSD-MT.",cs.CV,['cs.CV'] +SecondPose: SE(3)-Consistent Dual-Stream Feature Fusion for Category-Level Pose Estimation,Yamei Chen · Yan Di · Guangyao Zhai · Fabian Manhardt · Chenyangguang Zhang · Ruida Zhang · Federico Tombari · Nassir Navab · Benjamin Busam, ,https://arxiv.org/abs/2311.11125,,2311.11125.pdf,SecondPose: SE(3)-Consistent Dual-Stream Feature Fusion for Category-Level Pose Estimation,"Category-level object pose estimation, aiming to predict the 6D pose and 3D +size of objects from known categories, typically struggles with large +intra-class shape variation. Existing works utilizing mean shapes often fall +short of capturing this variation. To address this issue, we present +SecondPose, a novel approach integrating object-specific geometric features +with semantic category priors from DINOv2. Leveraging the advantage of DINOv2 +in providing SE(3)-consistent semantic features, we hierarchically extract two +types of SE(3)-invariant geometric features to further encapsulate +local-to-global object-specific information. These geometric features are then +point-aligned with DINOv2 features to establish a consistent object +representation under SE(3) transformations, facilitating the mapping from +camera space to the pre-defined canonical space, thus further enhancing pose +estimation. Extensive experiments on NOCS-REAL275 demonstrate that SecondPose +achieves a 12.4% leap forward over the state-of-the-art. Moreover, on a more +complex dataset HouseCat6D which provides photometrically challenging objects, +SecondPose still surpasses other competitors by a large margin.",cs.CV,['cs.CV'] +Rethinking FID: Towards a Better Evaluation Metric for Image Generation,Sadeep Jayasumana · Srikumar Ramalingam · Andreas Veit · Daniel Glasner · Ayan Chakrabarti · Sanjiv Kumar, ,https://arxiv.org/abs/2401.09603,,2401.09603.pdf,Rethinking FID: Towards a Better Evaluation Metric for Image Generation,"As with many machine learning problems, the progress of image generation +methods hinges on good evaluation metrics. One of the most popular is the +Frechet Inception Distance (FID). FID estimates the distance between a +distribution of Inception-v3 features of real images, and those of images +generated by the algorithm. We highlight important drawbacks of FID: +Inception's poor representation of the rich and varied content generated by +modern text-to-image models, incorrect normality assumptions, and poor sample +complexity. We call for a reevaluation of FID's use as the primary quality +metric for generated images. We empirically demonstrate that FID contradicts +human raters, it does not reflect gradual improvement of iterative +text-to-image models, it does not capture distortion levels, and that it +produces inconsistent results when varying the sample size. We also propose an +alternative new metric, CMMD, based on richer CLIP embeddings and the maximum +mean discrepancy distance with the Gaussian RBF kernel. 
It is an unbiased +estimator that does not make any assumptions on the probability distribution of +the embeddings and is sample efficient. Through extensive experiments and +analysis, we demonstrate that FID-based evaluations of text-to-image models may +be unreliable, and that CMMD offers a more robust and reliable assessment of +image quality.",cs.CV,['cs.CV'] +CustomListener: Text-guided Responsive Interaction for User-friendly Listening Head Generation,Xi Liu · Ying Guo · Cheng Zhen · Tong Li · Yingying Ao · Pengfei Yan,https://customlistener.github.io/,https://arxiv.org/abs/2403.00274,,2403.00274.pdf,CustomListener: Text-guided Responsive Interaction for User-friendly Listening Head Generation,"Listening head generation aims to synthesize a non-verbal responsive listener +head by modeling the correlation between the speaker and the listener in +dynamic conversion.The applications of listener agent generation in virtual +interaction have promoted many works achieving the diverse and fine-grained +motion generation. However, they can only manipulate motions through simple +emotional labels, but cannot freely control the listener's motions. Since +listener agents should have human-like attributes (e.g. identity, personality) +which can be freely customized by users, this limits their realism. In this +paper, we propose a user-friendly framework called CustomListener to realize +the free-form text prior guided listener generation. To achieve +speaker-listener coordination, we design a Static to Dynamic Portrait module +(SDP), which interacts with speaker information to transform static text into +dynamic portrait token with completion rhythm and amplitude information. To +achieve coherence between segments, we design a Past Guided Generation Module +(PGG) to maintain the consistency of customized listener attributes through the +motion prior, and utilize a diffusion-based structure conditioned on the +portrait token and the motion prior to realize the controllable generation. To +train and evaluate our model, we have constructed two text-annotated listening +head datasets based on ViCo and RealTalk, which provide text-video paired +labels. Extensive experiments have verified the effectiveness of our model.",cs.CV,"['cs.CV', 'cs.SD', 'eess.AS']" +Geometry-aware Reconstruction and Fusion-refined Rendering for Generalizable Neural Radiance Fields,TIANQI LIU · Xinyi Ye · Min Shi · Zihao Huang · Zhiyu Pan · Zhan Peng · Zhiguo Cao, ,https://arxiv.org/abs/2404.17528,,2404.17528.pdf,Geometry-aware Reconstruction and Fusion-refined Rendering for Generalizable Neural Radiance Fields,"Generalizable NeRF aims to synthesize novel views for unseen scenes. Common +practices involve constructing variance-based cost volumes for geometry +reconstruction and encoding 3D descriptors for decoding novel views. However, +existing methods show limited generalization ability in challenging conditions +due to inaccurate geometry, sub-optimal descriptors, and decoding strategies. +We address these issues point by point. First, we find the variance-based cost +volume exhibits failure patterns as the features of pixels corresponding to the +same point can be inconsistent across different views due to occlusions or +reflections. We introduce an Adaptive Cost Aggregation (ACA) approach to +amplify the contribution of consistent pixel pairs and suppress inconsistent +ones. 
Unlike previous methods that solely fuse 2D features into descriptors, +our approach introduces a Spatial-View Aggregator (SVA) to incorporate 3D +context into descriptors through spatial and inter-view interaction. When +decoding the descriptors, we observe the two existing decoding strategies excel +in different areas, which are complementary. A Consistency-Aware Fusion (CAF) +strategy is proposed to leverage the advantages of both. We incorporate the +above ACA, SVA, and CAF into a coarse-to-fine framework, termed Geometry-aware +Reconstruction and Fusion-refined Rendering (GeFu). GeFu attains +state-of-the-art performance across multiple datasets. Code is available at +https://github.com/TQTQliu/GeFu .",cs.CV,['cs.CV'] +Rethinking Few-shot 3D Point Cloud Semantic Segmentation,Zhaochong An · Guolei Sun · Yun Liu · Fayao Liu · Zongwei Wu · Dan Wang · Luc Van Gool · Serge Belongie, ,https://arxiv.org/abs/2403.00592,,2403.00592.pdf,Rethinking Few-shot 3D Point Cloud Semantic Segmentation,"This paper revisits few-shot 3D point cloud semantic segmentation (FS-PCS), +with a focus on two significant issues in the state-of-the-art: foreground +leakage and sparse point distribution. The former arises from non-uniform point +sampling, allowing models to distinguish the density disparities between +foreground and background for easier segmentation. The latter results from +sampling only 2,048 points, limiting semantic information and deviating from +the real-world practice. To address these issues, we introduce a standardized +FS-PCS setting, upon which a new benchmark is built. Moreover, we propose a +novel FS-PCS model. While previous methods are based on feature optimization by +mainly refining support features to enhance prototypes, our method is based on +correlation optimization, referred to as Correlation Optimization Segmentation +(COSeg). Specifically, we compute Class-specific Multi-prototypical Correlation +(CMC) for each query point, representing its correlations to category +prototypes. Then, we propose the Hyper Correlation Augmentation (HCA) module to +enhance CMC. Furthermore, tackling the inherent property of few-shot training +to incur base susceptibility for models, we propose to learn non-parametric +prototypes for the base classes during training. The learned base prototypes +are used to calibrate correlations for the background class through a Base +Prototypes Calibration (BPC) module. Experiments on popular datasets +demonstrate the superiority of COSeg over existing methods. The code is +available at: https://github.com/ZhaochongAn/COSeg",cs.CV,['cs.CV'] +MESA: Matching Everything by Segmenting Anything,Yesheng Zhang · Xu Zhao, ,https://arxiv.org/abs/2401.16741v1,,2401.16741v1.pdf,MESA: Matching Everything by Segmenting Anything,"Feature matching is a crucial task in the field of computer vision, which +involves finding correspondences between images. Previous studies achieve +remarkable performance using learning-based feature comparison. However, the +pervasive presence of matching redundancy between images gives rise to +unnecessary and error-prone computations in these methods, imposing limitations +on their accuracy. To address this issue, we propose MESA, a novel approach to +establish precise area (or region) matches for efficient matching redundancy +reduction. MESA first leverages the advanced image understanding capability of +SAM, a state-of-the-art foundation model for image segmentation, to obtain +image areas with implicit semantic. 
Then, a multi-relational graph is proposed +to model the spatial structure of these areas and construct their scale +hierarchy. Based on graphical models derived from the graph, the area matching +is reformulated as an energy minimization task and effectively resolved. +Extensive experiments demonstrate that MESA yields substantial precision +improvement for multiple point matchers in indoor and outdoor downstream tasks, +e.g. +13.61% for DKM in indoor pose estimation.",cs.CV,['cs.CV'] +DeiT-LT: Distillation Strikes Back for Vision Transformer Training on Long-Tailed Datasets,Harsh Rangwani · Pradipto Mondal · Mayank Mishra · Ashish Asokan · R. Venkatesh Babu,https://rangwani-harsh.github.io/DeiT-LT/,https://arxiv.org/abs/2404.02900,,2404.02900.pdf,DeiT-LT Distillation Strikes Back for Vision Transformer Training on Long-Tailed Datasets,"Vision Transformer (ViT) has emerged as a prominent architecture for various +computer vision tasks. In ViT, we divide the input image into patch tokens and +process them through a stack of self attention blocks. However, unlike +Convolutional Neural Networks (CNN), ViTs simple architecture has no +informative inductive bias (e.g., locality,etc. ). Due to this, ViT requires a +large amount of data for pre-training. Various data efficient approaches (DeiT) +have been proposed to train ViT on balanced datasets effectively. However, +limited literature discusses the use of ViT for datasets with long-tailed +imbalances. In this work, we introduce DeiT-LT to tackle the problem of +training ViTs from scratch on long-tailed datasets. In DeiT-LT, we introduce an +efficient and effective way of distillation from CNN via distillation DIST +token by using out-of-distribution images and re-weighting the distillation +loss to enhance focus on tail classes. This leads to the learning of local +CNN-like features in early ViT blocks, improving generalization for tail +classes. Further, to mitigate overfitting, we propose distilling from a flat +CNN teacher, which leads to learning low-rank generalizable features for DIST +tokens across all ViT blocks. With the proposed DeiT-LT scheme, the +distillation DIST token becomes an expert on the tail classes, and the +classifier CLS token becomes an expert on the head classes. The experts help to +effectively learn features corresponding to both the majority and minority +classes using a distinct set of tokens within the same ViT architecture. We +show the effectiveness of DeiT-LT for training ViT from scratch on datasets +ranging from small-scale CIFAR-10 LT to large-scale iNaturalist-2018.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +DiffuScene: Denoising Diffusion Models for Generative Indoor Scene Synthesis,Jiapeng Tang · Yinyu Nie · Lev Markhasin · Angela Dai · Justus Thies · Matthias Nießner,https://tangjiapeng.github.io/projects/DiffuScene/,,https://justusthies.github.io/posts/diffuscene/,,,,,nan +TokenCompose: Text-to-Image Diffusion with Token-level Supervision,Zirui Wang · Zhizhou Sha · Zheng Ding · Yilin Wang · Zhuowen Tu,https://mlpc-ucsd.github.io/TokenCompose/,https://arxiv.org/abs/2312.03626,,2312.03626.pdf,TokenCompose: Grounding Diffusion with Token-level Supervision,"We present TokenCompose, a Latent Diffusion Model for text-to-image +generation that achieves enhanced consistency between user-specified text +prompts and model-generated images. 
Despite its tremendous success, the +standard denoising process in the Latent Diffusion Model takes text prompts as +conditions only, absent explicit constraint for the consistency between the +text prompts and the image contents, leading to unsatisfactory results for +composing multiple object categories. TokenCompose aims to improve +multi-category instance composition by introducing the token-wise consistency +terms between the image content and object segmentation maps in the finetuning +stage. TokenCompose can be applied directly to the existing training pipeline +of text-conditioned diffusion models without extra human labeling information. +By finetuning Stable Diffusion, the model exhibits significant improvements in +multi-category instance composition and enhanced photorealism for its generated +images.",cs.CV,['cs.CV'] +Unbiased Estimator for Distorted Conic in Camera Calibration,Chaehyeon Song · Jaeho Shin · Myung-Hwan Jeon · Jongwoo Lim · Ayoung Kim,https://github.com/chaehyeonsong/discocal,https://arxiv.org/abs/2403.04583,,2403.04583.pdf,Unbiased Estimator for Distorted Conics in Camera Calibration,"In the literature, points and conics have been major features for camera +geometric calibration. Although conics are more informative features than +points, the loss of the conic property under distortion has critically limited +the utility of conic features in camera calibration. Many existing approaches +addressed conic-based calibration by ignoring distortion or introducing 3D +spherical targets to circumvent this limitation. In this paper, we present a +novel formulation for conic-based calibration using moments. Our derivation is +based on the mathematical finding that the first moment can be estimated +without bias even under distortion. This allows us to track moment changes +during projection and distortion, ensuring the preservation of the first moment +of the distorted conic. With an unbiased estimator, the circular patterns can +be accurately detected at the sub-pixel level and can now be fully exploited +for an entire calibration pipeline, resulting in significantly improved +calibration. The entire code is readily available from +https://github.com/ChaehyeonSong/discocal.",cs.CV,['cs.CV'] +Unleashing Channel Potential: Space-Frequency Selection Convolution for SAR Object Detection,Ke Li · Di Wang · Zhangyuan Hu · Wenxuan Zhu · Shaofeng Li · Quan Wang, ,https://arxiv.org/abs/2312.16943,,2312.16943.pdf,Multi-scale direction-aware SAR object detection network via global information fusion,"Deep learning has driven significant progress in object detection using +Synthetic Aperture Radar (SAR) imagery. Existing methods, while achieving +promising results, often struggle to effectively integrate local and global +information, particularly direction-aware features. This paper proposes +SAR-Net, a novel framework specifically designed for global fusion of +direction-aware information in SAR object detection. SAR-Net leverages two key +innovations: the Unity Compensation Mechanism (UCM) and the Direction-aware +Attention Module (DAM). UCM facilitates the establishment of complementary +relationships among features across different scales, enabling efficient global +information fusion and transmission. Additionally, DAM, through bidirectional +attention polymerization, captures direction-aware information, effectively +eliminating background interference. 
Extensive experiments demonstrate the +effectiveness of SAR-Net, achieving state-of-the-art results on aircraft +(SAR-AIRcraft-1.0) and ship datasets (SSDD, HRSID), confirming its +generalization capability and robustness.",cs.CV,['cs.CV'] +FlashEval: Towards Fast and Accurate Evaluation of Text-to-image Diffusion Generative Models,LIn Zhao · Tianchen Zhao · Zinan Lin · Xuefei Ning · Guohao Dai · Huazhong Yang · Yu Wang, ,https://arxiv.org/abs/2403.16379,,2403.16379.pdf,FlashEval: Towards Fast and Accurate Evaluation of Text-to-image Diffusion Generative Models,"In recent years, there has been significant progress in the development of +text-to-image generative models. Evaluating the quality of the generative +models is one essential step in the development process. Unfortunately, the +evaluation process could consume a significant amount of computational +resources, making the required periodic evaluation of model performance (e.g., +monitoring training progress) impractical. Therefore, we seek to improve the +evaluation efficiency by selecting the representative subset of the text-image +dataset. We systematically investigate the design choices, including the +selection criteria (textural features or image-based metrics) and the selection +granularity (prompt-level or set-level). We find that the insights from prior +work on subset selection for training data do not generalize to this problem, +and we propose FlashEval, an iterative search algorithm tailored to evaluation +data selection. We demonstrate the effectiveness of FlashEval on ranking +diffusion models with various configurations, including architectures, +quantization levels, and sampler schedules on COCO and DiffusionDB datasets. +Our searched 50-item subset could achieve comparable evaluation quality to the +randomly sampled 500-item subset for COCO annotations on unseen models, +achieving a 10x evaluation speedup. We release the condensed subset of these +commonly used datasets to help facilitate diffusion algorithm design and +evaluation, and open-source FlashEval as a tool for condensing future datasets, +accessible at https://github.com/thu-nics/FlashEval.",cs.CV,['cs.CV'] +Fair-VPT: Fair Visual Prompt Tuning for Image Classification,Sungho Park · Hyeran Byun, ,https://arxiv.org/abs/2404.05207,,2404.05207.pdf,iVPT: Improving Task-relevant Information Sharing in Visual Prompt Tuning by Cross-layer Dynamic Connection,"Recent progress has shown great potential of visual prompt tuning (VPT) when +adapting pre-trained vision transformers to various downstream tasks. However, +most existing solutions independently optimize prompts at each layer, thereby +neglecting the usage of task-relevant information encoded in prompt tokens +across layers. Additionally, existing prompt structures are prone to +interference from task-irrelevant noise in input images, which can do harm to +the sharing of task-relevant information. In this paper, we propose a novel VPT +approach, \textbf{iVPT}. It innovatively incorporates a cross-layer dynamic +connection (CDC) for input prompt tokens from adjacent layers, enabling +effective sharing of task-relevant information. Furthermore, we design a +dynamic aggregation (DA) module that facilitates selective sharing of +information between layers. The combination of CDC and DA enhances the +flexibility of the attention process within the VPT framework. 
Building upon +these foundations, iVPT introduces an attentive reinforcement (AR) mechanism, +by automatically identifying salient image tokens, which are further enhanced +by prompt tokens in an additive manner. Extensive experiments on 24 image +classification and semantic segmentation benchmarks clearly demonstrate the +advantage of the proposed iVPT, compared to the state-of-the-art counterparts.",cs.CV,['cs.CV'] +"Diffuse, Attend, and Segment: Unsupervised Zero-Shot Segmentation using Stable Diffusion",Junjiao Tian · Lavisha Aggarwal · Andrea Colaco · Zsolt Kira · Mar Gonzalez-Franco,https://sites.google.com/view/diffseg,https://arxiv.org/abs/2308.12469,,2308.12469.pdf,"Diffuse, Attend, and Segment: Unsupervised Zero-Shot Segmentation using Stable Diffusion","Producing quality segmentation masks for images is a fundamental problem in +computer vision. Recent research has explored large-scale supervised training +to enable zero-shot segmentation on virtually any image style and unsupervised +training to enable segmentation without dense annotations. However, +constructing a model capable of segmenting anything in a zero-shot manner +without any annotations is still challenging. In this paper, we propose to +utilize the self-attention layers in stable diffusion models to achieve this +goal because the pre-trained stable diffusion model has learned inherent +concepts of objects within its attention layers. Specifically, we introduce a +simple yet effective iterative merging process based on measuring KL divergence +among attention maps to merge them into valid segmentation masks. The proposed +method does not require any training or language dependency to extract quality +segmentation for any images. On COCO-Stuff-27, our method surpasses the prior +unsupervised zero-shot SOTA method by an absolute 26% in pixel accuracy and 17% +in mean IoU. The project page is at +\url{https://sites.google.com/view/diffseg/home}.",cs.CV,['cs.CV'] +DPHMs: Diffusion Parametric Head Models for Depth-based Tracking,Jiapeng Tang · Angela Dai · Yinyu Nie · Lev Markhasin · Justus Thies · Matthias Nießner,https://tangjiapeng.github.io/projects/DPHMs/,https://arxiv.org/abs/2312.01068,,2312.01068.pdf,DPHMs: Diffusion Parametric Head Models for Depth-based Tracking,"We introduce Diffusion Parametric Head Models (DPHMs), a generative model +that enables robust volumetric head reconstruction and tracking from monocular +depth sequences. While recent volumetric head models, such as NPHMs, can now +excel in representing high-fidelity head geometries, tracking and +reconstructing heads from real-world single-view depth sequences remains very +challenging, as the fitting to partial and noisy observations is +underconstrained. To tackle these challenges, we propose a latent +diffusion-based prior to regularize volumetric head reconstruction and +tracking. This prior-based regularizer effectively constrains the identity and +expression codes to lie on the underlying latent manifold which represents +plausible head shapes. To evaluate the effectiveness of the diffusion-based +prior, we collect a dataset of monocular Kinect sequences consisting of various +complex facial expression motions and rapid transitions. 
We compare our method +to state-of-the-art tracking methods and demonstrate improved head identity +reconstruction as well as robust expression tracking.",cs.CV,['cs.CV'] +A Unified Approach for Text- and Image-guided 4D Scene Generation,Yufeng Zheng · Xueting Li · Koki Nagano · Sifei Liu · Otmar Hilliges · Shalini De Mello, ,https://arxiv.org/abs/2311.16854,,2311.16854.pdf,A Unified Approach for Text- and Image-guided 4D Scene Generation,"Large-scale diffusion generative models are greatly simplifying image, video +and 3D asset creation from user-provided text prompts and images. However, the +challenging problem of text-to-4D dynamic 3D scene generation with diffusion +guidance remains largely unexplored. We propose Dream-in-4D, which features a +novel two-stage approach for text-to-4D synthesis, leveraging (1) 3D and 2D +diffusion guidance to effectively learn a high-quality static 3D asset in the +first stage; (2) a deformable neural radiance field that explicitly +disentangles the learned static asset from its deformation, preserving quality +during motion learning; and (3) a multi-resolution feature grid for the +deformation field with a displacement total variation loss to effectively learn +motion with video diffusion guidance in the second stage. Through a user +preference study, we demonstrate that our approach significantly advances image +and motion quality, 3D consistency and text fidelity for text-to-4D generation +compared to baseline approaches. Thanks to its motion-disentangled +representation, Dream-in-4D can also be easily adapted for controllable +generation where appearance is defined by one or multiple images, without the +need to modify the motion learning stage. Thus, our method offers, for the +first time, a unified approach for text-to-4D, image-to-4D and personalized 4D +generation tasks.",cs.CV,['cs.CV'] +Continuous Pose for Monocular Cameras in Neural Implicit Representation,Qi Ma · Danda Paudel · Ajad Chhatkuli · Luc Van Gool,https://github.com/qimaqi/Continuous-Pose-in-NeRF,https://arxiv.org/abs/2311.17119,,2311.17119.pdf,Continuous Pose for Monocular Cameras in Neural Implicit Representation,"In this paper, we showcase the effectiveness of optimizing monocular camera +poses as a continuous function of time. The camera poses are represented using +an implicit neural function which maps the given time to the corresponding +camera pose. The mapped camera poses are then used for the downstream tasks +where joint camera pose optimization is also required. While doing so, the +network parameters -- that implicitly represent camera poses -- are optimized. +We exploit the proposed method in four diverse experimental settings, namely, +(1) NeRF from noisy poses; (2) NeRF from asynchronous Events; (3) Visual +Simultaneous Localization and Mapping (vSLAM); and (4) vSLAM with IMUs. In all +four settings, the proposed method performs significantly better than the +compared baselines and the state-of-the-art methods. Additionally, using the +assumption of continuous motion, changes in pose may actually live in a +manifold that has lower than 6 degrees of freedom (DOF) is also realized. 
We +call this low DOF motion representation as the \emph{intrinsic motion} and use +the approach in vSLAM settings, showing impressive camera tracking performance.",cs.CV,['cs.CV'] +Descriptor and Word Soups: Overcoming the Parameter Efficiency Accuracy Tradeoff for Out-of-Distribution Few-shot Learning,Christopher Liao · Theodoros Tsiligkaridis · Brian Kulis, ,https://arxiv.org/abs/2311.13612,,2311.13612.pdf,Descriptor and Word Soups: Overcoming the Parameter Efficiency Accuracy Tradeoff for Out-of-Distribution Few-shot Learning,"Over the past year, a large body of multimodal research has emerged around +zero-shot evaluation using GPT descriptors. These studies boost the zero-shot +accuracy of pretrained VL models with an ensemble of label-specific text +generated by GPT. A recent study, WaffleCLIP, demonstrated that similar +zero-shot accuracy can be achieved with an ensemble of random descriptors. +However, both zero-shot methods are un-trainable and consequently sub-optimal +when some few-shot out-of-distribution (OOD) training data is available. +Inspired by these prior works, we present two more flexible methods called +descriptor and word soups, which do not require an LLM at test time and can +leverage training data to increase OOD target accuracy. Descriptor soup +greedily selects a small set of textual descriptors using generic few-shot +training data, then calculates robust class embeddings using the selected +descriptors. Word soup greedily assembles a chain of words in a similar manner. +Compared to existing few-shot soft prompt tuning methods, word soup requires +fewer parameters by construction and less GPU memory, since it does not require +backpropagation. Both soups outperform current published few-shot methods, even +when combined with SoTA zero-shot methods, on cross-dataset and domain +generalization benchmarks. Compared with SoTA prompt and descriptor ensembling +methods, such as ProDA and WaffleCLIP, word soup achieves higher OOD accuracy +with fewer ensemble members. Please checkout our code: +github.com/Chris210634/word_soups",cs.CV,['cs.CV'] +ConvoFusion: Multi-Modal Conversational Diffusion for Co-Speech Gesture Synthesis,Muhammad Hamza Mughal · Rishabh Dabral · Ikhsanul Habibie · Lucia Donatelli · Marc Habermann · Christian Theobalt, ,https://arxiv.org/abs/2403.17936,,2403.17936.pdf,ConvoFusion: Multi-Modal Conversational Diffusion for Co-Speech Gesture Synthesis,"Gestures play a key role in human communication. Recent methods for co-speech +gesture generation, while managing to generate beat-aligned motions, struggle +generating gestures that are semantically aligned with the utterance. Compared +to beat gestures that align naturally to the audio signal, semantically +coherent gestures require modeling the complex interactions between the +language and human motion, and can be controlled by focusing on certain words. +Therefore, we present ConvoFusion, a diffusion-based approach for multi-modal +gesture synthesis, which can not only generate gestures based on multi-modal +speech inputs, but can also facilitate controllability in gesture synthesis. +Our method proposes two guidance objectives that allow the users to modulate +the impact of different conditioning modalities (e.g. audio vs text) as well as +to choose certain words to be emphasized during gesturing. Our method is +versatile in that it can be trained either for generating monologue gestures or +even the conversational gestures. 
To further advance the research on +multi-party interactive gestures, the DnD Group Gesture dataset is released, +which contains 6 hours of gesture data showing 5 people interacting with one +another. We compare our method with several recent works and demonstrate +effectiveness of our method on a variety of tasks. We urge the reader to watch +our supplementary video at our website.",cs.CV,['cs.CV']
SEAS: ShapE-Aligned Supervision for Person Re-Identification,Haidong Zhu · Pranav Budhwant · Zhaoheng Zheng · Ram Nevatia, ,https://arxiv.org/abs/2312.05634,,2312.05634.pdf,PGDS: Pose-Guidance Deep Supervision for Mitigating Clothes-Changing in Person Re-Identification,"Person Re-Identification (Re-ID) task seeks to enhance the tracking of +multiple individuals by surveillance cameras. It supports multimodal tasks, +including text-based person retrieval and human matching. One of the most +significant challenges faced in Re-ID is clothes-changing, where the same +person may appear in different outfits. While previous methods have made +notable progress in maintaining clothing data consistency and handling clothing +change data, they still rely excessively on clothing information, which can +limit performance due to the dynamic nature of human appearances. To mitigate +this challenge, we propose the Pose-Guidance Deep Supervision (PGDS), an +effective framework for learning pose guidance within the Re-ID task. It +consists of three modules: a human encoder, a pose encoder, and a Pose-to-Human +Projection module (PHP). Our framework guides the human encoder, i.e., the main +re-identification model, with pose information from the pose encoder through +multiple layers via the knowledge transfer mechanism from the PHP module, +helping the human encoder learn body parts information without increasing +computation resources in the inference stage. Through extensive experiments, +our method surpasses the performance of current state-of-the-art methods, +demonstrating its robustness and effectiveness for real-world applications. Our +code is available at https://github.com/huyquoctrinh/PGDS.",cs.CV,['cs.CV']
Normalizing Flows on the Product Space of SO(3) Manifolds for Probabilistic Human Pose Modeling,Olaf Dünkel · Tim Salzmann · Florian Pfaff, ,https://arxiv.org/abs/2404.05675,,2404.05675.pdf,Normalizing Flows on the Product Space of SO(3) Manifolds for Probabilistic Human Pose Modeling,"Normalizing flows have proven their efficacy for density estimation in +Euclidean space, but their application to rotational representations, crucial +in various domains such as robotics or human pose modeling, remains +underexplored. Probabilistic models of the human pose can benefit from +approaches that rigorously consider the rotational nature of human joints. For +this purpose, we introduce HuProSO3, a normalizing flow model that operates on +a high-dimensional product space of SO(3) manifolds, modeling the joint +distribution for human joints with three degrees of freedom. HuProSO3's +advantage over state-of-the-art approaches is demonstrated through its superior +modeling accuracy in three different applications and its capability to +evaluate the exact likelihood. 
This work not only addresses the technical +challenge of learning densities on SO(3) manifolds, but it also has broader +implications for domains where the probabilistic regression of correlated 3D +rotations is of importance.",cs.CV,['cs.CV'] +ES$^3$: Evolving Self-Supervised Learning of Robust Audio-Visual Speech Representations,Yuanhang Zhang · Shuang Yang · Shiguang Shan · Xilin Chen, ,https://arxiv.org/abs/2312.10305,,2312.10305.pdf,Self-Supervised Disentangled Representation Learning for Robust Target Speech Extraction,"Speech signals are inherently complex as they encompass both global acoustic +characteristics and local semantic information. However, in the task of target +speech extraction, certain elements of global and local semantic information in +the reference speech, which are irrelevant to speaker identity, can lead to +speaker confusion within the speech extraction network. To overcome this +challenge, we propose a self-supervised disentangled representation learning +method. Our approach tackles this issue through a two-phase process, utilizing +a reference speech encoding network and a global information disentanglement +network to gradually disentangle the speaker identity information from other +irrelevant factors. We exclusively employ the disentangled speaker identity +information to guide the speech extraction network. Moreover, we introduce the +adaptive modulation Transformer to ensure that the acoustic representation of +the mixed signal remains undisturbed by the speaker embeddings. This component +incorporates speaker embeddings as conditional information, facilitating +natural and efficient guidance for the speech extraction network. Experimental +results substantiate the effectiveness of our meticulously crafted approach, +showcasing a substantial reduction in the likelihood of speaker confusion.",cs.SD,"['cs.SD', 'cs.AI', 'cs.LG', 'eess.AS']" +Video Interpolation with Diffusion Models,Siddhant Jain · Daniel Watson · Aleksander Holynski · Eric Tabellion · Ben Poole · Janne Kontkanen,https://vidim-interpolation.github.io/,https://arxiv.org/abs/2404.01203,,2404.01203.pdf,Video Interpolation with Diffusion Models,"We present VIDIM, a generative model for video interpolation, which creates +short videos given a start and end frame. In order to achieve high fidelity and +generate motions unseen in the input data, VIDIM uses cascaded diffusion models +to first generate the target video at low resolution, and then generate the +high-resolution video conditioned on the low-resolution generated video. We +compare VIDIM to previous state-of-the-art methods on video interpolation, and +demonstrate how such works fail in most settings where the underlying motion is +complex, nonlinear, or ambiguous while VIDIM can easily handle such cases. We +additionally demonstrate how classifier-free guidance on the start and end +frame and conditioning the super-resolution model on the original +high-resolution frames without additional parameters unlocks high-fidelity +results. 
VIDIM is fast to sample from as it jointly denoises all the frames to +be generated, requires less than a billion parameters per diffusion model to +produce compelling results, and still enjoys scalability and improved quality +at larger parameter counts.",cs.CV,['cs.CV'] +Infer from What You Have Seen Before: Temporally-dependent Classifier for Semi-supervised Video Semantic Segmentation,Jiafan Zhuang · Zilei Wang · Yixin Zhang · Zhun Fan, ,,https://www.youtube.com/watch?v=k50sUgxC09o,,,,,nan +IReNe: Instant Recoloring of Neural Radiance Fields,Alessio Mazzucchelli · Adrian Garcia-Garcia · Elena Garces · Fernando Rivas-Manzaneque · Francesc Moreno-Noguer · Adrian Penate-Sanchez,https://iviazz97.github.io/irene/,https://arxiv.org/abs/2405.19876,,2405.19876.pdf,IReNe: Instant Recoloring in Neural Radiance Fields,"Advances in NERFs have allowed for 3D scene reconstructions and novel view +synthesis. Yet, efficiently editing these representations while retaining +photorealism is an emerging challenge. Recent methods face three primary +limitations: they're slow for interactive use, lack precision at object +boundaries, and struggle to ensure multi-view consistency. We introduce IReNe +to address these limitations, enabling swift, near real-time color editing in +NeRF. Leveraging a pre-trained NeRF model and a single training image with +user-applied color edits, IReNe swiftly adjusts network parameters in seconds. +This adjustment allows the model to generate new scene views, accurately +representing the color changes from the training image while also controlling +object boundaries and view-specific effects. Object boundary control is +achieved by integrating a trainable segmentation module into the model. The +process gains efficiency by retraining only the weights of the last network +layer. We observed that neurons in this layer can be classified into those +responsible for view-dependent appearance and those contributing to diffuse +appearance. We introduce an automated classification approach to identify these +neuron types and exclusively fine-tune the weights of the diffuse neurons. This +further accelerates training and ensures consistent color edits across +different views. A thorough validation on a new dataset, with edited object +colors, shows significant quantitative and qualitative advancements over +competitors, accelerating speeds by 5x to 500x.",cs.CV,['cs.CV'] +FlowVQTalker: High-Quality Emotional Talking Face Generation through Normalizing Flow and Quantization,Shuai Tan · Bin Ji · Ye Pan, ,https://arxiv.org/abs/2403.06375,,2403.06375.pdf,FlowVQTalker: High-Quality Emotional Talking Face Generation through Normalizing Flow and Quantization,"Generating emotional talking faces is a practical yet challenging endeavor. +To create a lifelike avatar, we draw upon two critical insights from a human +perspective: 1) The connection between audio and the non-deterministic facial +dynamics, encompassing expressions, blinks, poses, should exhibit synchronous +and one-to-many mapping. 2) Vibrant expressions are often accompanied by +emotion-aware high-definition (HD) textures and finely detailed teeth. However, +both aspects are frequently overlooked by existing methods. To this end, this +paper proposes using normalizing Flow and Vector-Quantization modeling to +produce emotional talking faces that satisfy both insights concurrently +(FlowVQTalker). 
Specifically, we develop a flow-based coefficient generator +that encodes the dynamics of facial emotion into a multi-emotion-class latent +space represented as a mixture distribution. The generation process commences +with random sampling from the modeled distribution, guided by the accompanying +audio, enabling both lip-synchronization and the uncertain nonverbal facial +cues generation. Furthermore, our designed vector-quantization image generator +treats the creation of expressive facial images as a code query task, utilizing +a learned codebook to provide rich, high-quality textures that enhance the +emotional perception of the results. Extensive experiments are conducted to +showcase the effectiveness of our approach.",cs.CV,['cs.CV'] +Self-Supervised Class-Agnostic Motion Prediction with Spatial and Temporal Consistency Regularizations,Kewei Wang · Yizheng Wu · Jun Cen · Zhiyu Pan · Xingyi Li · Zhe Wang · Zhiguo Cao · Guosheng Lin, ,https://arxiv.org/abs/2403.13261,,2403.13261.pdf,Self-Supervised Class-Agnostic Motion Prediction with Spatial and Temporal Consistency Regularizations,"The perception of motion behavior in a dynamic environment holds significant +importance for autonomous driving systems, wherein class-agnostic motion +prediction methods directly predict the motion of the entire point cloud. While +most existing methods rely on fully-supervised learning, the manual labeling of +point cloud data is laborious and time-consuming. Therefore, several +annotation-efficient methods have been proposed to address this challenge. +Although effective, these methods rely on weak annotations or additional +multi-modal data like images, and the potential benefits inherent in the point +cloud sequence are still underexplored. To this end, we explore the feasibility +of self-supervised motion prediction with only unlabeled LiDAR point clouds. +Initially, we employ an optimal transport solver to establish coarse +correspondences between current and future point clouds as the coarse pseudo +motion labels. Training models directly using such coarse labels leads to +noticeable spatial and temporal prediction inconsistencies. To mitigate these +issues, we introduce three simple spatial and temporal regularization losses, +which facilitate the self-supervised training process effectively. Experimental +results demonstrate the significant superiority of our approach over the +state-of-the-art self-supervised methods.",cs.CV,['cs.CV'] +Latency Correction for Event-guided Deblurring and Frame Interpolation,Yixin Yang · Jinxiu Liang · Bohan Yu · Yan Chen · Jimmy S. Ren · Boxin Shi, ,https://arxiv.org/abs/2306.15507,,2306.15507.pdf,Self-supervised Learning of Event-guided Video Frame Interpolation for Rolling Shutter Frames,"This paper makes the first attempt to tackle the challenging task of +recovering arbitrary frame rate latent global shutter (GS) frames from two +consecutive rolling shutter (RS) frames, guided by the novel event camera data. +Although events possess high temporal resolution, beneficial for video frame +interpolation (VFI), a hurdle in tackling this task is the lack of paired GS +frames. Another challenge is that RS frames are susceptible to distortion when +capturing moving objects. To this end, we propose a novel self-supervised +framework that leverages events to guide RS frame correction and VFI in a +unified framework. 
Our key idea is to estimate the displacement field (DF) +non-linear dense 3D spatiotemporal information of all pixels during the +exposure time, allowing for the reciprocal reconstruction between RS and GS +frames as well as arbitrary frame rate VFI. Specifically, the displacement +field estimation (DFE) module is proposed to estimate the spatiotemporal motion +from events to correct the RS distortion and interpolate the GS frames in one +step. We then combine the input RS frames and DF to learn a mapping for +RS-to-GS frame interpolation. However, as the mapping is highly +under-constrained, we couple it with an inverse mapping (i.e., GS-to-RS) and RS +frame warping (i.e., RS-to-RS) for self-supervision. As there is a lack of +labeled datasets for evaluation, we generate two synthetic datasets and collect +a real-world dataset to train and test our method. Experimental results show +that our method yields comparable or better performance with prior supervised +methods.",cs.CV,"['cs.CV', 'cs.RO']" +Single Domain Generalization for Crowd Counting,Zhuoxuan Peng · S.-H. Gary Chan,https://github.com/Shimmer93/MPCount,https://arxiv.org/abs/2403.09124,,2403.09124.pdf,Single Domain Generalization for Crowd Counting,"Due to its promising results, density map regression has been widely employed +for image-based crowd counting. The approach, however, often suffers from +severe performance degradation when tested on data from unseen scenarios, the +so-called ""domain shift"" problem. To address the problem, we investigate in +this work single domain generalization (SDG) for crowd counting. The existing +SDG approaches are mainly for image classification and segmentation, and can +hardly be extended to our case due to its regression nature and label ambiguity +(i.e., ambiguous pixel-level ground truths). We propose MPCount, a novel +effective SDG approach even for narrow source distribution. MPCount stores +diverse density values for density map regression and reconstructs +domain-invariant features by means of only one memory bank, a content error +mask and attention consistency loss. By partitioning the image into grids, it +employs patch-wise classification as an auxiliary task to mitigate label +ambiguity. Through extensive experiments on different datasets, MPCount is +shown to significantly improve counting accuracy compared to the state of the +art under diverse scenarios unobserved in the training data characterized by +narrow source distribution. Code is available at +https://github.com/Shimmer93/MPCount.",cs.CV,['cs.CV'] +Scaling Up to Excellence: Practicing Model Scaling for Photo-Realistic Image Restoration In the Wild,Fanghua Yu · Jinjin Gu · Zheyuan Li · Jinfan Hu · Xiangtao Kong · Xintao Wang · Jingwen He · Yu Qiao · Chao Dong, ,https://arxiv.org/abs/2401.13627,,2401.13627.pdf,Scaling Up to Excellence: Practicing Model Scaling for Photo-Realistic Image Restoration In the Wild,"We introduce SUPIR (Scaling-UP Image Restoration), a groundbreaking image +restoration method that harnesses generative prior and the power of model +scaling up. Leveraging multi-modal techniques and advanced generative prior, +SUPIR marks a significant advance in intelligent and realistic image +restoration. As a pivotal catalyst within SUPIR, model scaling dramatically +enhances its capabilities and demonstrates new potential for image restoration. +We collect a dataset comprising 20 million high-resolution, high-quality images +for model training, each enriched with descriptive text annotations. 
SUPIR +provides the capability to restore images guided by textual prompts, broadening +its application scope and potential. Moreover, we introduce negative-quality +prompts to further improve perceptual quality. We also develop a +restoration-guided sampling method to suppress the fidelity issue encountered +in generative-based restoration. Experiments demonstrate SUPIR's exceptional +restoration effects and its novel capacity to manipulate restoration through +textual prompts.",cs.CV,['cs.CV'] +ChatScene: Knowledge-Enabled Safety-Critical Scenario Generation for Autonomous Vehicles,Jiawei Zhang · Chejian Xu · Bo Li, ,https://arxiv.org/abs/2405.14062,,2405.14062.pdf,ChatScene: Knowledge-Enabled Safety-Critical Scenario Generation for Autonomous Vehicles,"We present ChatScene, a Large Language Model (LLM)-based agent that leverages +the capabilities of LLMs to generate safety-critical scenarios for autonomous +vehicles. Given unstructured language instructions, the agent first generates +textually described traffic scenarios using LLMs. These scenario descriptions +are subsequently broken down into several sub-descriptions for specified +details such as behaviors and locations of vehicles. The agent then +distinctively transforms the textually described sub-scenarios into +domain-specific languages, which then generate actual code for prediction and +control in simulators, facilitating the creation of diverse and complex +scenarios within the CARLA simulation environment. A key part of our agent is a +comprehensive knowledge retrieval component, which efficiently translates +specific textual descriptions into corresponding domain-specific code snippets +by training a knowledge database containing the scenario description and code +pairs. Extensive experimental results underscore the efficacy of ChatScene in +improving the safety of autonomous vehicles. For instance, the scenarios +generated by ChatScene show a 15% increase in collision rates compared to +state-of-the-art baselines when tested against different reinforcement +learning-based ego vehicles. Furthermore, we show that by using our generated +safety-critical scenarios to fine-tune different RL-based autonomous driving +models, they can achieve a 9% reduction in collision rates, surpassing current +SOTA methods. ChatScene effectively bridges the gap between textual +descriptions of traffic scenarios and practical CARLA simulations, providing a +unified way to conveniently generate safety-critical scenarios for safety +testing and improvement for AVs.",cs.AI,"['cs.AI', 'cs.LG']" +KTPFormer: Kinematics and Trajectory Prior Knowledge-Enhanced Transformer for 3D Human Pose Estimation,Jihua Peng · Yanghong Zhou · Tracy P Y Mok, ,https://arxiv.org/abs/2404.00658,,2404.00658.pdf,KTPFormer: Kinematics and Trajectory Prior Knowledge-Enhanced Transformer for 3D Human Pose Estimation,"This paper presents a novel Kinematics and Trajectory Prior +Knowledge-Enhanced Transformer (KTPFormer), which overcomes the weakness in +existing transformer-based methods for 3D human pose estimation that the +derivation of Q, K, V vectors in their self-attention mechanisms are all based +on simple linear mapping. We propose two prior attention modules, namely +Kinematics Prior Attention (KPA) and Trajectory Prior Attention (TPA) to take +advantage of the known anatomical structure of the human body and motion +trajectory information, to facilitate effective learning of global dependencies +and features in the multi-head self-attention. 
KPA models kinematic +relationships in the human body by constructing a topology of kinematics, while +TPA builds a trajectory topology to learn the information of joint motion +trajectory across frames. Yielding Q, K, V vectors with prior knowledge, the +two modules enable KTPFormer to model both spatial and temporal correlations +simultaneously. Extensive experiments on three benchmarks (Human3.6M, +MPI-INF-3DHP and HumanEva) show that KTPFormer achieves superior performance in +comparison to state-of-the-art methods. More importantly, our KPA and TPA +modules have lightweight plug-and-play designs and can be integrated into +various transformer-based networks (i.e., diffusion-based) to improve the +performance with only a very small increase in the computational overhead. The +code is available at: https://github.com/JihuaPeng/KTPFormer.",cs.CV,['cs.CV'] +Magic Tokens: Select Diverse Tokens for Multi-modal Object Re-Identification,Pingping Zhang · Yuhao Wang · Yang Liu · Zhengzheng Tu · Huchuan Lu,https://github.com/924973292/EDITOR,https://arxiv.org/abs/2403.10254,,2403.10254.pdf,Magic Tokens: Select Diverse Tokens for Multi-modal Object Re-Identification,"Single-modal object re-identification (ReID) faces great challenges in +maintaining robustness within complex visual scenarios. In contrast, +multi-modal object ReID utilizes complementary information from diverse +modalities, showing great potentials for practical applications. However, +previous methods may be easily affected by irrelevant backgrounds and usually +ignore the modality gaps. To address above issues, we propose a novel learning +framework named \textbf{EDITOR} to select diverse tokens from vision +Transformers for multi-modal object ReID. We begin with a shared vision +Transformer to extract tokenized features from different input modalities. +Then, we introduce a Spatial-Frequency Token Selection (SFTS) module to +adaptively select object-centric tokens with both spatial and frequency +information. Afterwards, we employ a Hierarchical Masked Aggregation (HMA) +module to facilitate feature interactions within and across modalities. +Finally, to further reduce the effect of backgrounds, we propose a Background +Consistency Constraint (BCC) and an Object-Centric Feature Refinement (OCFR). +They are formulated as two new loss functions, which improve the feature +discrimination with background suppression. As a result, our framework can +generate more discriminative features for multi-modal object ReID. Extensive +experiments on three multi-modal ReID benchmarks verify the effectiveness of +our methods. The code is available at https://github.com/924973292/EDITOR.",cs.CV,"['cs.CV', 'cs.IR', 'cs.MM']" +ShapeWalk: Compositional Shape Editing through Language-Guided Chains,Habib Slim · Mohamed Elhoseiny,https://shapewalk.github.io/,https://arxiv.org/html/2405.20319v1,,2405.20319v1.pdf,ParSEL: Parameterized Shape Editing with Language,"The ability to edit 3D assets from natural language presents a compelling +paradigm to aid in the democratization of 3D content creation. However, while +natural language is often effective at communicating general intent, it is +poorly suited for specifying precise manipulation. To address this gap, we +introduce ParSEL, a system that enables controllable editing of high-quality 3D +assets from natural language. Given a segmented 3D mesh and an editing request, +ParSEL produces a parameterized editing program. 
Adjusting the program +parameters allows users to explore shape variations with a precise control over +the magnitudes of edits. To infer editing programs which align with an input +edit request, we leverage the abilities of large-language models (LLMs). +However, while we find that LLMs excel at identifying initial edit operations, +they often fail to infer complete editing programs, and produce outputs that +violate shape semantics. To overcome this issue, we introduce Analytical Edit +Propagation (AEP), an algorithm which extends a seed edit with additional +operations until a complete editing program has been formed. Unlike prior +methods, AEP searches for analytical editing operations compatible with a range +of possible user edits through the integration of computer algebra systems for +geometric analysis. Experimentally we demonstrate ParSEL's effectiveness in +enabling controllable editing of 3D objects through natural language requests +over alternative system designs.",cs.CV,"['cs.CV', 'cs.AI', 'cs.GR', 'cs.HC', 'cs.SC']" +Video-P2P: Video Editing with Cross-attention Control,Shaoteng Liu · Yuechen Zhang · Wenbo Li · Zhe Lin · Jiaya Jia, ,,https://www.researchgate.net/publication/380733385_Video-P2P_Video_Editing_with_Cross-attention_Control,,,,,nan +R-Cyclic Diffuser: Reductive and Cyclic Latent Diffusion for 3D Clothed Human Digitalization,Kennard Chan · Fayao Liu · Guosheng Lin · Chuan-Sheng Foo · Weisi Lin, ,https://arxiv.org/html/2401.12175v2,,2401.12175v2.pdf,Template-Free Single-View 3D Human Digitalization with Diffusion-Guided LRM,"Reconstructing 3D humans from a single image has been extensively +investigated. However, existing approaches often fall short on capturing fine +geometry and appearance details, hallucinating occluded parts with plausible +details, and achieving generalization across unseen and in-the-wild datasets. +We present Human-LRM, a diffusion-guided feed-forward model that predicts the +implicit field of a human from a single image. Leveraging the power of the +state-of-the-art reconstruction model (i.e., LRM) and generative model (i.e +Stable Diffusion), our method is able to capture human without any template +prior, e.g., SMPL, and effectively enhance occluded parts with rich and +realistic details. Our approach first uses a single-view LRM model with an +enhanced geometry decoder to get the triplane NeRF representation. The novel +view renderings from the triplane NeRF provide strong geometry and color prior, +from which we generate photo-realistic details for the occluded parts using a +diffusion model. The generated multiple views then enable reconstruction with +high-quality geometry and appearance, leading to superior overall performance +comparing to all existing human reconstruction methods.",cs.CV,['cs.CV'] +Arbitrary Motion Style Transfer with Multi-condition Motion Latent Diffusion Model,Wenfeng Song · Xingliang Jin · Shuai Li · Chenglizhao Chen · Aimin Hao · Xia HOU · Ning Li · Hong Qin,https://xingliangjin.github.io/MCM-LDM-Web/,https://arxiv.org/abs/2306.09330,,2306.09330.pdf,ArtFusion: Controllable Arbitrary Style Transfer using Dual Conditional Latent Diffusion Models,"Arbitrary Style Transfer (AST) aims to transform images by adopting the style +from any selected artwork. Nonetheless, the need to accommodate diverse and +subjective user preferences poses a significant challenge. While some users +wish to preserve distinct content structures, others might favor a more +pronounced stylization. 
Despite advances in feed-forward AST methods, their +limited customizability hinders their practical application. We propose a new +approach, ArtFusion, which provides a flexible balance between content and +style. In contrast to traditional methods reliant on biased similarity losses, +ArtFusion utilizes our innovative Dual Conditional Latent Diffusion +Probabilistic Models (Dual-cLDM). This approach mitigates repetitive patterns +and enhances subtle artistic aspects like brush strokes and genre-specific +features. Despite the promising results of conditional diffusion probabilistic +models (cDM) in various generative tasks, their introduction to style transfer +is challenging due to the requirement for paired training data. ArtFusion +successfully navigates this issue, offering more practical and controllable +stylization. A key element of our approach involves using a single image for +both content and style during model training, all the while maintaining +effective stylization during inference. ArtFusion outperforms existing +approaches on outstanding controllability and faithful presentation of artistic +details, providing evidence of its superior style transfer capabilities. +Furthermore, the Dual-cLDM utilized in ArtFusion carries the potential for a +variety of complex multi-condition generative tasks, thus greatly broadening +the impact of our research.",cs.CV,['cs.CV'] +HUGS: Holistic Urban 3D Scene Understanding via Gaussian Splatting,Hongyu Zhou · Jiahao Shao · Lu Xu · Dongfeng Bai · Weichao Qiu · Bingbing Liu · Yue Wang · Andreas Geiger · Yiyi Liao, ,https://arxiv.org/abs/2403.12722,,2403.12722.pdf,HUGS: Holistic Urban 3D Scene Understanding via Gaussian Splatting,"Holistic understanding of urban scenes based on RGB images is a challenging +yet important problem. It encompasses understanding both the geometry and +appearance to enable novel view synthesis, parsing semantic labels, and +tracking moving objects. Despite considerable progress, existing approaches +often focus on specific aspects of this task and require additional inputs such +as LiDAR scans or manually annotated 3D bounding boxes. In this paper, we +introduce a novel pipeline that utilizes 3D Gaussian Splatting for holistic +urban scene understanding. Our main idea involves the joint optimization of +geometry, appearance, semantics, and motion using a combination of static and +dynamic 3D Gaussians, where moving object poses are regularized via physical +constraints. Our approach offers the ability to render new viewpoints in +real-time, yielding 2D and 3D semantic information with high accuracy, and +reconstruct dynamic scenes, even in scenarios where 3D bounding box detection +are highly noisy. Experimental results on KITTI, KITTI-360, and Virtual KITTI 2 +demonstrate the effectiveness of our approach.",cs.CV,['cs.CV'] +HumMUSS: Human Motion Understanding using State Space Models,Arnab Mondal · Stefano Alletto · Denis Tome, ,https://arxiv.org/abs/2404.10880,,2404.10880.pdf,HumMUSS: Human Motion Understanding using State Space Models,"Understanding human motion from video is essential for a range of +applications, including pose estimation, mesh recovery and action recognition. +While state-of-the-art methods predominantly rely on transformer-based +architectures, these approaches have limitations in practical scenarios. +Transformers are slower when sequentially predicting on a continuous stream of +frames in real-time, and do not generalize to new frame rates. 
In light of +these constraints, we propose a novel attention-free spatiotemporal model for +human motion understanding building upon recent advancements in state space +models. Our model not only matches the performance of transformer-based models +in various motion understanding tasks but also brings added benefits like +adaptability to different video frame rates and enhanced training speed when +working with longer sequence of keypoints. Moreover, the proposed model +supports both offline and real-time applications. For real-time sequential +prediction, our model is both memory efficient and several times faster than +transformer-based approaches while maintaining their high accuracy.",cs.CV,"['cs.CV', 'cs.AI']" +SmartMask: Context Aware High-Fidelity Mask Generation for Fine-grained Object Insertion and Layout Control,Jaskirat Singh · Jianming Zhang · Qing Liu · Cameron Smith · Zhe Lin · Liang Zheng, ,https://arxiv.org/abs/2312.05039,,,SmartMask: Context Aware High-Fidelity Mask Generation for Fine-grained Object Insertion and Layout Control,"The field of generative image inpainting and object insertion has made +significant progress with the recent advent of latent diffusion models. +Utilizing a precise object mask can greatly enhance these applications. +However, due to the challenges users encounter in creating high-fidelity masks, +there is a tendency for these methods to rely on more coarse masks (e.g., +bounding box) for these applications. This results in limited control and +compromised background content preservation. To overcome these limitations, we +introduce SmartMask, which allows any novice user to create detailed masks for +precise object insertion. Combined with a ControlNet-Inpaint model, our +experiments demonstrate that SmartMask achieves superior object insertion +quality, preserving the background content more effectively than previous +methods. Notably, unlike prior works the proposed approach can also be used +even without user-mask guidance, which allows it to perform mask-free object +insertion at diverse positions and scales. Furthermore, we find that when used +iteratively with a novel instruction-tuning based planning model, SmartMask can +be used to design detailed layouts from scratch. As compared with user-scribble +based layout design, we observe that SmartMask allows for better quality +outputs with layout-to-image generation methods. Project page is available at +https://smartmask-gen.github.io",cs.CV,"['cs.CV', 'cs.AI', 'cs.HC', 'cs.LG', 'cs.MM']" +Towards Progressive Multi-Frequency Representation for Image Warping,Jun Xiao · Zihang Lyu · Cong Zhang · Yakun Ju · Changjian Shui · Kin-man Lam, ,https://arxiv.org/abs/2404.10716,,2404.10716.pdf,MOWA: Multiple-in-One Image Warping Model,"While recent image warping approaches achieved remarkable success on existing +benchmarks, they still require training separate models for each specific task +and cannot generalize well to different camera models or customized +manipulations. To address diverse types of warping in practice, we propose a +Multiple-in-One image WArping model (named MOWA) in this work. Specifically, we +mitigate the difficulty of multi-task learning by disentangling the motion +estimation at both the region level and pixel level. To further enable dynamic +task-aware image warping, we introduce a lightweight point-based classifier +that predicts the task type, serving as prompts to modulate the feature maps +for better estimation. 
To our knowledge, this is the first work that solves +multiple practical warping tasks in one single model. Extensive experiments +demonstrate that our MOWA, which is trained on six tasks for multiple-in-one +image warping, outperforms state-of-the-art task-specific models across most +tasks. Moreover, MOWA also exhibits promising potential to generalize into +unseen scenes, as evidenced by cross-domain and zero-shot evaluations. The code +will be made publicly available.",cs.CV,['cs.CV'] +V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs,Penghao Wu · Saining Xie,https://vstar-seal.github.io/,https://arxiv.org/abs/2312.14135,,2312.14135.pdf,V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs,"When we look around and perform complex tasks, how we see and selectively +process what we see is crucial. However, the lack of this visual search +mechanism in current multimodal LLMs (MLLMs) hinders their ability to focus on +important visual details, especially when handling high-resolution and visually +crowded images. To address this, we introduce V*, an LLM-guided visual search +mechanism that employs the world knowledge in LLMs for efficient visual +querying. When combined with an MLLM, this mechanism enhances collaborative +reasoning, contextual understanding, and precise targeting of specific visual +elements. This integration results in a new MLLM meta-architecture, named Show, +sEArch, and TelL (SEAL). We further create V*Bench, a benchmark specifically +designed to evaluate MLLMs in their ability to process high-resolution images +and focus on visual details. Our study highlights the necessity of +incorporating visual search capabilities into multimodal systems. The code is +available https://github.com/penghao-wu/vstar.",cs.CV,['cs.CV'] +Holistic Autonomous Driving Understanding by Bird's-Eye-View Injected Multi-Modal Large Models,Xinpeng Ding · Jianhua Han · Hang Xu · Xiaodan Liang · Wei Zhang · Xiaomeng Li, ,https://arxiv.org/abs/2401.00988v1,,2401.00988v1.pdf,Holistic Autonomous Driving Understanding by Bird's-Eye-View Injected Multi-Modal Large Models,"The rise of multimodal large language models (MLLMs) has spurred interest in +language-based driving tasks. However, existing research typically focuses on +limited tasks and often omits key multi-view and temporal information which is +crucial for robust autonomous driving. To bridge these gaps, we introduce +NuInstruct, a novel dataset with 91K multi-view video-QA pairs across 17 +subtasks, where each task demands holistic information (e.g., temporal, +multi-view, and spatial), significantly elevating the challenge level. To +obtain NuInstruct, we propose a novel SQL-based method to generate +instruction-response pairs automatically, which is inspired by the driving +logical progression of humans. We further present BEV-InMLLM, an end-to-end +method for efficiently deriving instruction-aware Bird's-Eye-View (BEV) +features, language-aligned for large language models. BEV-InMLLM integrates +multi-view, spatial awareness, and temporal semantics to enhance MLLMs' +capabilities on NuInstruct tasks. Moreover, our proposed BEV injection module +is a plug-and-play method for existing MLLMs. Our experiments on NuInstruct +demonstrate that BEV-InMLLM significantly outperforms existing MLLMs, e.g. +around 9% improvement on various tasks. 
We plan to release our NuInstruct for +future research development.",cs.CV,['cs.CV'] +Split to Merge: Unifying Separated Modalities for Unsupervised Domain Adaptation,Xinyao Li · Yuke Li · Zhekai Du · Fengling Li · Ke Lu · Jingjing Li,https://github.com/TL-UESTC/UniMoS,https://arxiv.org/abs/2403.06946,,2403.06946.pdf,Split to Merge: Unifying Separated Modalities for Unsupervised Domain Adaptation,"Large vision-language models (VLMs) like CLIP have demonstrated good +zero-shot learning performance in the unsupervised domain adaptation task. Yet, +most transfer approaches for VLMs focus on either the language or visual +branches, overlooking the nuanced interplay between both modalities. In this +work, we introduce a Unified Modality Separation (UniMoS) framework for +unsupervised domain adaptation. Leveraging insights from modality gap studies, +we craft a nimble modality separation network that distinctly disentangles +CLIP's features into language-associated and vision-associated components. Our +proposed Modality-Ensemble Training (MET) method fosters the exchange of +modality-agnostic information while maintaining modality-specific nuances. We +align features across domains using a modality discriminator. Comprehensive +evaluations on three benchmarks reveal our approach sets a new state-of-the-art +with minimal computational costs. Code: https://github.com/TL-UESTC/UniMoS",cs.CV,['cs.CV'] +Relation Rectification in Diffusion Model,Yinwei Wu · Xingyi Yang · Xinchao Wang,https://wuyinwei-hah.github.io/rrnet.github.io/,https://arxiv.org/abs/2403.20249,,2403.20249.pdf,Relation Rectification in Diffusion Model,"Despite their exceptional generative abilities, large text-to-image diffusion +models, much like skilled but careless artists, often struggle with accurately +depicting visual relationships between objects. This issue, as we uncover +through careful analysis, arises from a misaligned text encoder that struggles +to interpret specific relationships and differentiate the logical order of +associated objects. To resolve this, we introduce a novel task termed Relation +Rectification, aiming to refine the model to accurately represent a given +relationship it initially fails to generate. To address this, we propose an +innovative solution utilizing a Heterogeneous Graph Convolutional Network +(HGCN). It models the directional relationships between relation terms and +corresponding objects within the input prompts. Specifically, we optimize the +HGCN on a pair of prompts with identical relational words but reversed object +orders, supplemented by a few reference images. The lightweight HGCN adjusts +the text embeddings generated by the text encoder, ensuring the accurate +reflection of the textual relation in the embedding space. Crucially, our +method retains the parameters of the text encoder and diffusion model, +preserving the model's robust performance on unrelated descriptions. We +validated our approach on a newly curated dataset of diverse relational data, +demonstrating both quantitative and qualitative enhancements in generating +images with precise visual relations. 
Project page: +https://wuyinwei-hah.github.io/rrnet.github.io/.",cs.CV,['cs.CV'] +CoralSCOP: Segment any COral Image on this Planet,"Zheng Ziqiang · Liang Haixin · Binh-Son Hua · Tim, Yue Him Wong · Put ANG · Apple CHUI · Sai-Kit Yeung", ,,https://ais.hkust.edu.hk/whats-happening/news/isd-research-team-produces-first-model-segment-and-generalize-coral-reef-image,,,,,nan +Category-Level Multi-Part Multi-Joint 3D Shape Assembly,Yichen Li · Kaichun Mo · Yueqi Duan · He Wang · Jiequan Zhang · Lin Shao · Wojciech Matusik · Leonidas Guibas, ,,,,,,,nan +AirPlanes: Accurate Plane Estimation via 3D-Consistent Embeddings,Jamie Watson · Filippo Aleotti · Mohamed Sayed · Zawar Qureshi · Oisin Mac Aodha · Gabriel J. Brostow · Michael Firman · Sara Vicente,https://nianticlabs.github.io/airplanes/,,https://link.springer.com/article/10.1007/s00371-023-03110-7,,,,,nan +Fun with Flags: Robust Principal Directions via Flag Manifolds,Tolga Birdal · Nathan Mankovich, ,https://arxiv.org/abs/2401.04071v1,,2401.04071v1.pdf,Fun with Flags: Robust Principal Directions via Flag Manifolds,"Principal component analysis (PCA), along with its extensions to manifolds +and outlier contaminated data, have been indispensable in computer vision and +machine learning. In this work, we present a unifying formalism for PCA and its +variants, and introduce a framework based on the flags of linear subspaces, \ie +a hierarchy of nested linear subspaces of increasing dimension, which not only +allows for a common implementation but also yields novel variants, not explored +previously. We begin by generalizing traditional PCA methods that either +maximize variance or minimize reconstruction error. We expand these +interpretations to develop a wide array of new dimensionality reduction +algorithms by accounting for outliers and the data manifold. To devise a common +computational approach, we recast robust and dual forms of PCA as optimization +problems on flag manifolds. We then integrate tangent space approximations of +principal geodesic analysis (tangent-PCA) into this flag-based framework, +creating novel robust and dual geodesic PCA variations. The remarkable +flexibility offered by the 'flagification' introduced here enables even more +algorithmic variants identified by specific flag types. Last but not least, we +propose an effective convergent solver for these flag-formulations employing +the Stiefel manifold. Our empirical results on both real-world and synthetic +scenarios, demonstrate the superiority of our novel algorithms, especially in +terms of robustness to outliers on manifolds.",cs.CV,"['cs.CV', 'cs.LG', 'math.DG', 'math.OC', 'stat.ML']" +PixelRNN: In-pixel Recurrent Neural Networks for End-to-end-optimized Perception with Neural Sensors,Haley So · Laurie Bose · Piotr Dudek · Gordon Wetzstein, ,,,,,,,nan +Calibrating Multi-modal Representations: A Pursuit of Group Robustness without Annotations,Chenyu You · Yifei Min · Weicheng Dai · Jasjeet Sekhon · Lawrence Staib · James Duncan, ,https://arxiv.org/abs/2403.07241,,2403.07241.pdf,Calibrating Multi-modal Representations: A Pursuit of Group Robustness without Annotations,"Fine-tuning pre-trained vision-language models, like CLIP, has yielded +success on diverse downstream tasks. However, several pain points persist for +this paradigm: (i) directly tuning entire pre-trained models becomes both +time-intensive and computationally costly. 
Additionally, these tuned models +tend to become highly specialized, limiting their practicality for real-world +deployment; (ii) recent studies indicate that pre-trained vision-language +classifiers may overly depend on spurious features -- patterns that correlate +with the target in training data, but are not related to the true labeling +function; and (iii) existing studies on mitigating the reliance on spurious +features, largely based on the assumption that we can identify such features, +does not provide definitive assurance for real-world applications. As a +piloting study, this work focuses on exploring mitigating the reliance on +spurious features for CLIP without using any group annotation. To this end, we +systematically study the existence of spurious correlation on CLIP and +CILP+ERM. We first, following recent work on Deep Feature Reweighting (DFR), +verify that last-layer retraining can greatly improve group robustness on +pretrained CLIP. In view of them, we advocate a lightweight representation +calibration method for fine-tuning CLIP, by first generating a calibration set +using the pretrained CLIP, and then calibrating representations of samples +within this set through contrastive learning, all without the need for group +labels. Extensive experiments and in-depth visualizations on several benchmarks +validate the effectiveness of our proposals, largely reducing reliance and +significantly boosting the model generalization.",cs.CV,"['cs.CV', 'cs.LG']" +Guided Slot Attention for Unsupervised Video Object Segmentation,Minhyeok Lee · Suhwan Cho · Dogyoon Lee · Chaewon Park · Jungho Lee · Sangyoun Lee, ,https://arxiv.org/abs/2309.14786,,,Treating Motion as Option with Output Selection for Unsupervised Video Object Segmentation,"Unsupervised video object segmentation (VOS) is a task that aims to detect +the most salient object in a video without external guidance about the object. +To leverage the property that salient objects usually have distinctive +movements compared to the background, recent methods collaboratively use motion +cues extracted from optical flow maps with appearance cues extracted from RGB +images. However, as optical flow maps are usually very relevant to segmentation +masks, the network is easy to be learned overly dependent on the motion cues +during network training. As a result, such two-stream approaches are vulnerable +to confusing motion cues, making their prediction unstable. To relieve this +issue, we design a novel motion-as-option network by treating motion cues as +optional. During network training, RGB images are randomly provided to the +motion encoder instead of optical flow maps, to implicitly reduce motion +dependency of the network. As the learned motion encoder can deal with both RGB +images and optical flow maps, two different predictions can be generated +depending on which source information is used as motion input. In order to +fully exploit this property, we also propose an adaptive output selection +algorithm to adopt optimal prediction result at test time. 
Our proposed +approach affords state-of-the-art performance on all public benchmark datasets, +even maintaining real-time inference speed.",cs.CV,['cs.CV'] +A Novel Transformer based Network for Large Scale Multimodal and Multitask Learning,Siddharth Srivastava · Gaurav Sharma, ,https://arxiv.org/abs/2310.09276,,2310.09276.pdf,Transformer-based Multimodal Change Detection with Multitask Consistency Constraints,"Change detection plays a fundamental role in Earth observation for analyzing +temporal iterations over time. However, recent studies have largely neglected +the utilization of multimodal data that presents significant practical and +technical advantages compared to single-modal approaches. This research focuses +on leveraging {pre-event} digital surface model (DSM) data and {post-event} +digital aerial images captured at different times for detecting change beyond +2D. We observe that the current change detection methods struggle with the +multitask conflicts between semantic and height change detection tasks. To +address this challenge, we propose an efficient Transformer-based network that +learns shared representation between cross-dimensional inputs through +cross-attention. {It adopts a consistency constraint to establish the +multimodal relationship. Initially, pseudo-changes are derived by employing +height change thresholding. Subsequently, the $L2$ distance between semantic +and pseudo-changes within their overlapping regions is minimized. This +explicitly endows the height change detection (regression task) and semantic +change detection (classification task) with representation consistency.} A +DSM-to-image multimodal dataset encompassing three cities in the Netherlands +was constructed. It lays a new foundation for beyond-2D change detection from +cross-dimensional inputs. Compared to five state-of-the-art change detection +methods, our model demonstrates consistent multitask superiority in terms of +semantic and height change detection. Furthermore, the consistency strategy can +be seamlessly adapted to the other methods, yielding promising improvements.",cs.CV,['cs.CV'] +Collaborative Semantic Occupancy Prediction with Hybrid Feature Fusion in Connected Automated Vehicles,Rui Song · Chenwei Liang · Hu Cao · Zhiran Yan · Walter Zimmer · Markus Gross · Andreas Festag · Alois Knoll,https://rruisong.github.io/publications/CoHFF/,https://arxiv.org/abs/2402.07635,,2402.07635.pdf,Collaborative Semantic Occupancy Prediction with Hybrid Feature Fusion in Connected Automated Vehicles,"Collaborative perception in automated vehicles leverages the exchange of +information between agents, aiming to elevate perception results. Previous +camera-based collaborative 3D perception methods typically employ 3D bounding +boxes or bird's eye views as representations of the environment. However, these +approaches fall short in offering a comprehensive 3D environmental prediction. +To bridge this gap, we introduce the first method for collaborative 3D semantic +occupancy prediction. Particularly, it improves local 3D semantic occupancy +predictions by hybrid fusion of (i) semantic and occupancy task features, and +(ii) compressed orthogonal attention features shared between vehicles. +Additionally, due to the lack of a collaborative perception dataset designed +for semantic occupancy prediction, we augment a current collaborative +perception dataset to include 3D collaborative semantic occupancy labels for a +more robust evaluation. 
The experimental findings highlight that: (i) our +collaborative semantic occupancy predictions excel above the results from +single vehicles by over 30%, and (ii) models anchored on semantic occupancy +outpace state-of-the-art collaborative 3D detection techniques in subsequent +perception applications, showcasing enhanced accuracy and enriched +semantic-awareness in road environments.",cs.CV,['cs.CV'] +"EVS-assisted joint Deblurring, Rolling-Shutter Correction and Video Frame Interpolation through Sensor Inverse Modeling",Rui Jiang · Fangwen Tu · Yixuan Long · Aabhaas Vaish · Bowen Zhou · Qinyi Wang · Wei Zhang · Yuntan Fang · Luis Eduardo García Capel · Bo Mu · Tiejun Dai · Andreas Suess, ,https://arxiv.org/abs/2404.18156,,,Event-based Video Frame Interpolation with Edge Guided Motion Refinement,"Video frame interpolation, the process of synthesizing intermediate frames +between sequential video frames, has made remarkable progress with the use of +event cameras. These sensors, with microsecond-level temporal resolution, fill +information gaps between frames by providing precise motion cues. However, +contemporary Event-Based Video Frame Interpolation (E-VFI) techniques often +neglect the fact that event data primarily supply high-confidence features at +scene edges during multi-modal feature fusion, thereby diminishing the role of +event signals in optical flow (OF) estimation and warping refinement. To +address this overlooked aspect, we introduce an end-to-end E-VFI learning +method (referred to as EGMR) to efficiently utilize edge features from event +signals for motion flow and warping enhancement. Our method incorporates an +Edge Guided Attentive (EGA) module, which rectifies estimated video motion +through attentive aggregation based on the local correlation of multi-modal +features in a coarse-to-fine strategy. Moreover, given that event data can +provide accurate visual references at scene edges between consecutive frames, +we introduce a learned visibility map derived from event data to adaptively +mitigate the occlusion problem in the warping refinement process. Extensive +experiments on both synthetic and real datasets show the effectiveness of the +proposed approach, demonstrating its potential for higher quality video frame +interpolation.",cs.CV,['cs.CV'] +EpiDiff: Enhancing Multi-View Synthesis via Localized Epipolar-Constrained Diffusion,Zehuan Huang · Hao Wen · Junting Dong · Yaohui Wang · Yangguang Li · Xinyuan Chen · Yan-Pei Cao · Ding Liang · Yu Qiao · Bo Dai · Lu Sheng,https://huanngzh.github.io/EpiDiff/,https://arxiv.org/abs/2312.06725,,2312.06725.pdf,EpiDiff: Enhancing Multi-View Synthesis via Localized Epipolar-Constrained Diffusion,"Generating multiview images from a single view facilitates the rapid +generation of a 3D mesh conditioned on a single image. Recent methods that +introduce 3D global representation into diffusion models have shown the +potential to generate consistent multiviews, but they have reduced generation +speed and face challenges in maintaining generalizability and quality. To +address this issue, we propose EpiDiff, a localized interactive multiview +diffusion model. At the core of the proposed approach is to insert a +lightweight epipolar attention block into the frozen diffusion model, +leveraging epipolar constraints to enable cross-view interaction among feature +maps of neighboring views. 
The newly initialized 3D modeling module preserves +the original feature distribution of the diffusion model, exhibiting +compatibility with a variety of base diffusion models. Experiments show that +EpiDiff generates 16 multiview images in just 12 seconds, and it surpasses +previous methods in quality evaluation metrics, including PSNR, SSIM and LPIPS. +Additionally, EpiDiff can generate a more diverse distribution of views, +improving the reconstruction quality from generated multiviews. Please see our +project page at https://huanngzh.github.io/EpiDiff/.",cs.CV,['cs.CV'] +GaussianAvatar: Efficient Animatable Human Modeling from Monocular Video Using Gaussians-on-Mesh,Jing Wen · Xiaoming Zhao · Jason Ren · Alexander G. Schwing · Shenlong Wang, ,https://arxiv.org/abs/2404.07991,,2404.07991.pdf,GoMAvatar: Efficient Animatable Human Modeling from Monocular Video Using Gaussians-on-Mesh,"We introduce GoMAvatar, a novel approach for real-time, memory-efficient, +high-quality animatable human modeling. GoMAvatar takes as input a single +monocular video to create a digital avatar capable of re-articulation in new +poses and real-time rendering from novel viewpoints, while seamlessly +integrating with rasterization-based graphics pipelines. Central to our method +is the Gaussians-on-Mesh representation, a hybrid 3D model combining rendering +quality and speed of Gaussian splatting with geometry modeling and +compatibility of deformable meshes. We assess GoMAvatar on ZJU-MoCap data and +various YouTube videos. GoMAvatar matches or surpasses current monocular human +modeling algorithms in rendering quality and significantly outperforms them in +computational efficiency (43 FPS) while being memory-efficient (3.63 MB per +subject).",cs.CV,['cs.CV'] +Learning Instance-Aware Correspondences for Robust Multi-Instance Point Cloud Registration in Cluttered Scenes,Zhiyuan Yu · Zheng Qin · lintao zheng · Kai Xu, ,https://arxiv.org/abs/2404.04557,,2404.04557.pdf,Learning Instance-Aware Correspondences for Robust Multi-Instance Point Cloud Registration in Cluttered Scenes,"Multi-instance point cloud registration estimates the poses of multiple +instances of a model point cloud in a scene point cloud. Extracting accurate +point correspondence is to the center of the problem. Existing approaches +usually treat the scene point cloud as a whole, overlooking the separation of +instances. Therefore, point features could be easily polluted by other points +from the background or different instances, leading to inaccurate +correspondences oblivious to separate instances, especially in cluttered +scenes. In this work, we propose MIRETR, Multi-Instance REgistration +TRansformer, a coarse-to-fine approach to the extraction of instance-aware +correspondences. At the coarse level, it jointly learns instance-aware +superpoint features and predicts per-instance masks. With instance masks, the +influence from outside of the instance being concerned is minimized, such that +highly reliable superpoint correspondences can be extracted. The superpoint +correspondences are then extended to instance candidates at the fine level +according to the instance masks. At last, an efficient candidate selection and +refinement algorithm is devised to obtain the final registrations. Extensive +experiments on three public benchmarks demonstrate the efficacy of our +approach. In particular, MIRETR outperforms the state of the arts by 16.6 +points on F1 score on the challenging ROBI benchmark. 
Code and models are +available at https://github.com/zhiyuanYU134/MIRETR.",cs.CV,['cs.CV'] +RTracker: Recoverable Tracking via PN Tree Structured Memory,Yuqing Huang · Xin Li · Zikun Zhou · Yaowei Wang · Zhenyu He · Ming-Hsuan Yang, ,https://arxiv.org/abs/2403.19242,,2403.19242.pdf,RTracker: Recoverable Tracking via PN Tree Structured Memory,"Existing tracking methods mainly focus on learning better target +representation or developing more robust prediction models to improve tracking +performance. While tracking performance has significantly improved, the target +loss issue occurs frequently due to tracking failures, complete occlusion, or +out-of-view situations. However, considerably less attention is paid to the +self-recovery issue of tracking methods, which is crucial for practical +applications. To this end, we propose a recoverable tracking framework, +RTracker, that uses a tree-structured memory to dynamically associate a tracker +and a detector to enable self-recovery ability. Specifically, we propose a +Positive-Negative Tree-structured memory to chronologically store and maintain +positive and negative target samples. Upon the PN tree memory, we develop +corresponding walking rules for determining the state of the target and define +a set of control flows to unite the tracker and the detector in different +tracking scenarios. Our core idea is to use the support samples of positive and +negative target categories to establish a relative distance-based criterion for +a reliable assessment of target loss. The favorable performance in comparison +against the state-of-the-art methods on numerous challenging benchmarks +demonstrates the effectiveness of the proposed algorithm.",cs.CV,['cs.CV'] +Supervised Anomaly Detection for Complex Industrial Images,Aimira Baitieva · David Hurych · Victor Besnier · Olivier BERNARD, ,https://arxiv.org/abs/2405.04953,,2405.04953.pdf,Supervised Anomaly Detection for Complex Industrial Images,"Automating visual inspection in industrial production lines is essential for +increasing product quality across various industries. Anomaly detection (AD) +methods serve as robust tools for this purpose. However, existing public +datasets primarily consist of images without anomalies, limiting the practical +application of AD methods in production settings. To address this challenge, we +present (1) the Valeo Anomaly Dataset (VAD), a novel real-world industrial +dataset comprising 5000 images, including 2000 instances of challenging real +defects across more than 20 subclasses. Acknowledging that traditional AD +methods struggle with this dataset, we introduce (2) Segmentation-based Anomaly +Detector (SegAD). First, SegAD leverages anomaly maps as well as segmentation +maps to compute local statistics. Next, SegAD uses these statistics and an +optional supervised classifier score as input features for a Boosted Random +Forest (BRF) classifier, yielding the final anomaly score. Our SegAD achieves +state-of-the-art performance on both VAD (+2.1% AUROC) and the VisA dataset +(+0.4% AUROC). 
The code and the models are publicly available.",cs.CV,"['cs.CV', 'cs.LG']" +InitNO: Boosting Text-to-Image Diffusion Models via Initial Noise Optimization,Xiefan Guo · Jinlin Liu · Miaomiao Cui · Jiankai Li · Hongyu Yang · Di Huang, ,https://arxiv.org/abs/2404.04650,,2404.04650.pdf,InitNO: Boosting Text-to-Image Diffusion Models via Initial Noise Optimization,"Recent strides in the development of diffusion models, exemplified by +advancements such as Stable Diffusion, have underscored their remarkable +prowess in generating visually compelling images. However, the imperative of +achieving a seamless alignment between the generated image and the provided +prompt persists as a formidable challenge. This paper traces the root of these +difficulties to invalid initial noise, and proposes a solution in the form of +Initial Noise Optimization (InitNO), a paradigm that refines this noise. +Considering text prompts, not all random noises are effective in synthesizing +semantically-faithful images. We design the cross-attention response score and +the self-attention conflict score to evaluate the initial noise, bifurcating +the initial latent space into valid and invalid sectors. A strategically +crafted noise optimization pipeline is developed to guide the initial noise +towards valid regions. Our method, validated through rigorous experimentation, +shows a commendable proficiency in generating images in strict accordance with +text prompts. Our code is available at https://github.com/xiefan-guo/initno.",cs.CV,['cs.CV'] +MFP: Making Full use of Probability Maps for Interactive Image Segmentation,Chaewon Lee · Seon-Ho Lee · Chang-Su Kim, ,https://arxiv.org/abs/2404.18448,,2404.18448.pdf,MFP: Making Full Use of Probability Maps for Interactive Image Segmentation,"In recent interactive segmentation algorithms, previous probability maps are +used as network input to help predictions in the current segmentation round. +However, despite the utilization of previous masks, useful information +contained in the probability maps is not well propagated to the current +predictions. In this paper, to overcome this limitation, we propose a novel and +effective algorithm for click-based interactive image segmentation, called MFP, +which attempts to make full use of probability maps. We first modulate previous +probability maps to enhance their representations of user-specified objects. +Then, we feed the modulated probability maps as additional input to the +segmentation network. We implement the proposed MFP algorithm based on the +ResNet-34, HRNet-18, and ViT-B backbones and assess the performance extensively +on various datasets. It is demonstrated that MFP meaningfully outperforms the +existing algorithms using identical backbones. The source codes are available +at https://github.com/cwlee00/MFP.",cs.CV,['cs.CV'] +A Conditional Denoising Diffusion Probabilistic Model for Point Cloud Upsampling,Wentao Qu · Yuantian Shao · Lingwu Meng · Xiaoshui Huang · Liang Xiao, ,https://arxiv.org/abs/2312.02719,,2312.02719.pdf,A Conditional Denoising Diffusion Probabilistic Model for Point Cloud Upsampling,"Point cloud upsampling (PCU) enriches the representation of raw point clouds, +significantly improving the performance in downstream tasks such as +classification and reconstruction. Most of the existing point cloud upsampling +methods focus on sparse point cloud feature extraction and upsampling module +design. 
In a different way, we dive deeper into directly modelling the gradient +of data distribution from dense point clouds. In this paper, we proposed a +conditional denoising diffusion probability model (DDPM) for point cloud +upsampling, called PUDM. Specifically, PUDM treats the sparse point cloud as a +condition, and iteratively learns the transformation relationship between the +dense point cloud and the noise. Simultaneously, PUDM aligns with a dual +mapping paradigm to further improve the discernment of point features. In this +context, PUDM enables learning complex geometry details in the ground truth +through the dominant features, while avoiding an additional upsampling module +design. Furthermore, to generate high-quality arbitrary-scale point clouds +during inference, PUDM exploits the prior knowledge of the scale between sparse +point clouds and dense point clouds during training by parameterizing a rate +factor. Moreover, PUDM exhibits strong noise robustness in experimental +results. In the quantitative and qualitative evaluations on PU1K and PUGAN, +PUDM significantly outperformed existing methods in terms of Chamfer Distance +(CD) and Hausdorff Distance (HD), achieving state of the art (SOTA) +performance.",cs.CV,['cs.CV'] +Bridging the Synthetic-to-Authentic Gap: Distortion-Guided Unsupervised Domain Adaptation for Blind Image Quality Assessment,Aobo Li · Jinjian Wu · Yongxu Liu · Leida Li, ,https://arxiv.org/abs/2405.04167,,2405.04167.pdf,Bridging the Synthetic-to-Authentic Gap: Distortion-Guided Unsupervised Domain Adaptation for Blind Image Quality Assessment,"The annotation of blind image quality assessment (BIQA) is labor-intensive +and time-consuming, especially for authentic images. Training on synthetic data +is expected to be beneficial, but synthetically trained models often suffer +from poor generalization in real domains due to domain gaps. In this work, we +make a key observation that introducing more distortion types in the synthetic +dataset may not improve or even be harmful to generalizing authentic image +quality assessment. To solve this challenge, we propose distortion-guided +unsupervised domain adaptation for BIQA (DGQA), a novel framework that +leverages adaptive multi-domain selection via prior knowledge from distortion +to match the data distribution between the source domains and the target +domain, thereby reducing negative transfer from the outlier source domains. +Extensive experiments on two cross-domain settings (synthetic distortion to +authentic distortion and synthetic distortion to algorithmic distortion) have +demonstrated the effectiveness of our proposed DGQA. Besides, DGQA is +orthogonal to existing model-based BIQA methods, and can be used in combination +with such models to improve performance with less training data.",cs.CV,"['cs.CV', 'eess.IV']" +Boosting Continual Learning of Vision-Language Models via Mixture-of-Experts Adapters,Jiazuo Yu · Yunzhi Zhuge · Lu Zhang · Ping Hu · Dong Wang · Huchuan Lu · You He, ,https://arxiv.org/abs/2403.11549,,2403.11549.pdf,Boosting Continual Learning of Vision-Language Models via Mixture-of-Experts Adapters,"Continual learning can empower vision-language models to continuously acquire +new knowledge, without the need for access to the entire historical dataset. +However, mitigating the performance degradation in large-scale models is +non-trivial due to (i) parameter shifts throughout lifelong learning and (ii) +significant computational burdens associated with full-model tuning. 
In this +work, we present a parameter-efficient continual learning framework to +alleviate long-term forgetting in incremental learning with vision-language +models. Our approach involves the dynamic expansion of a pre-trained CLIP +model, through the integration of Mixture-of-Experts (MoE) adapters in response +to new tasks. To preserve the zero-shot recognition capability of +vision-language models, we further introduce a Distribution Discriminative +Auto-Selector (DDAS) that automatically routes in-distribution and +out-of-distribution inputs to the MoE Adapter and the original CLIP, +respectively. Through extensive experiments across various settings, our +proposed method consistently outperforms previous state-of-the-art approaches +while concurrently reducing parameter training burdens by 60%. Our code locates +at https://github.com/JiazuoYu/MoE-Adapters4CL",cs.CV,['cs.CV'] +Unsupervised Blind Image Deblurring Based on Self-Enhancement,Lufei Chen · Xiangpeng Tian · Shuhua Xiong · Yinjie Lei · Chao Ren, ,,https://dl.acm.org/doi/abs/10.1145/3581783.3612535,,,,,nan +UFineBench: Towards Text-based Person Retrieval with Ultra-fine Granularity,Jialong Zuo · Hanyu Zhou · Ying Nie · Feng Zhang · Tianyu Guo · Nong Sang · Yunhe Wang · Changxin Gao, ,https://arxiv.org/abs/2312.03441v4,,2312.03441v4.pdf,UFineBench: Towards Text-based Person Retrieval with Ultra-fine Granularity,"Existing text-based person retrieval datasets often have relatively +coarse-grained text annotations. This hinders the model to comprehend the +fine-grained semantics of query texts in real scenarios. To address this +problem, we contribute a new benchmark named \textbf{UFineBench} for text-based +person retrieval with ultra-fine granularity. + Firstly, we construct a new \textbf{dataset} named UFine6926. We collect a +large number of person images and manually annotate each image with two +detailed textual descriptions, averaging 80.8 words each. The average word +count is three to four times that of the previous datasets. In addition of +standard in-domain evaluation, we also propose a special \textbf{evaluation +paradigm} more representative of real scenarios. It contains a new evaluation +set with cross domains, cross textual granularity and cross textual styles, +named UFine3C, and a new evaluation metric for accurately measuring retrieval +ability, named mean Similarity Distribution (mSD). Moreover, we propose CFAM, a +more efficient \textbf{algorithm} especially designed for text-based person +retrieval with ultra fine-grained texts. It achieves fine granularity mining by +adopting a shared cross-modal granularity decoder and hard negative match +mechanism. + With standard in-domain evaluation, CFAM establishes competitive performance +across various datasets, especially on our ultra fine-grained UFine6926. +Furthermore, by evaluating on UFine3C, we demonstrate that training on our +UFine6926 significantly improves generalization to real scenarios compared with +other coarse-grained datasets. 
The dataset and code will be made publicly +available at \url{https://github.com/Zplusdragon/UFineBench}.",cs.CV,['cs.CV'] +Learning with Unreliability: Fast Few-shot Voxel Radiance Fields with Relative Geometric Consistency,Xu Yingjie · Bangzhen Liu · Hao Tang · Bailin Deng · Shengfeng He, ,https://arxiv.org/abs/2403.17638v1,,2403.17638v1.pdf,Learning with Unreliability: Fast Few-shot Voxel Radiance Fields with Relative Geometric Consistency,"We propose a voxel-based optimization framework, ReVoRF, for few-shot +radiance fields that strategically address the unreliability in pseudo novel +view synthesis. Our method pivots on the insight that relative depth +relationships within neighboring regions are more reliable than the absolute +color values in disoccluded areas. Consequently, we devise a bilateral +geometric consistency loss that carefully navigates the trade-off between color +fidelity and geometric accuracy in the context of depth consistency for +uncertain regions. Moreover, we present a reliability-guided learning strategy +to discern and utilize the variable quality across synthesized views, +complemented by a reliability-aware voxel smoothing algorithm that smoothens +the transition between reliable and unreliable data patches. Our approach +allows for a more nuanced use of all available data, promoting enhanced +learning from regions previously considered unsuitable for high-quality +reconstruction. Extensive experiments across diverse datasets reveal that our +approach attains significant gains in efficiency and accuracy, delivering +rendering speeds of 3 FPS, 7 mins to train a $360^\circ$ scene, and a 5\% +improvement in PSNR over existing few-shot methods. Code is available at +https://github.com/HKCLynn/ReVoRF.",cs.CV,['cs.CV'] +Entity-NeRF: Detecting and Removing Moving Entities in Urban Scenes,Takashi Otonari · Satoshi Ikehata · Kiyoharu Aizawa, ,https://arxiv.org/abs/2403.16141,,2403.16141.pdf,Entity-NeRF: Detecting and Removing Moving Entities in Urban Scenes,"Recent advancements in the study of Neural Radiance Fields (NeRF) for dynamic +scenes often involve explicit modeling of scene dynamics. However, this +approach faces challenges in modeling scene dynamics in urban environments, +where moving objects of various categories and scales are present. In such +settings, it becomes crucial to effectively eliminate moving objects to +accurately reconstruct static backgrounds. Our research introduces an +innovative method, termed here as Entity-NeRF, which combines the strengths of +knowledge-based and statistical strategies. This approach utilizes entity-wise +statistics, leveraging entity segmentation and stationary entity classification +through thing/stuff segmentation. To assess our methodology, we created an +urban scene dataset masked with moving objects. Our comprehensive experiments +demonstrate that Entity-NeRF notably outperforms existing techniques in +removing moving objects and reconstructing static urban backgrounds, both +quantitatively and qualitatively.",cs.CV,['cs.CV'] +PanoOcc: Unified Occupancy Representation for Camera-based 3D Panoptic Segmentation,Yuqi Wang · Yuntao Chen · Xingyu Liao · Lue Fan · Zhaoxiang Zhang, ,https://arxiv.org/abs/2306.10013,,2306.10013.pdf,PanoOcc: Unified Occupancy Representation for Camera-based 3D Panoptic Segmentation,"Comprehensive modeling of the surrounding 3D world is key to the success of +autonomous driving. 
However, existing perception tasks like object detection, +road structure segmentation, depth & elevation estimation, and open-set object +localization each only focus on a small facet of the holistic 3D scene +understanding task. This divide-and-conquer strategy simplifies the algorithm +development procedure at the cost of losing an end-to-end unified solution to +the problem. In this work, we address this limitation by studying camera-based +3D panoptic segmentation, aiming to achieve a unified occupancy representation +for camera-only 3D scene understanding. To achieve this, we introduce a novel +method called PanoOcc, which utilizes voxel queries to aggregate spatiotemporal +information from multi-frame and multi-view images in a coarse-to-fine scheme, +integrating feature learning and scene representation into a unified occupancy +representation. We have conducted extensive ablation studies to verify the +effectiveness and efficiency of the proposed method. Our approach achieves new +state-of-the-art results for camera-based semantic segmentation and panoptic +segmentation on the nuScenes dataset. Furthermore, our method can be easily +extended to dense occupancy prediction and has shown promising performance on +the Occ3D benchmark. The code will be released at +https://github.com/Robertwyq/PanoOcc.",cs.CV,"['cs.CV', 'cs.RO']" +HIT: Estimating Internal Human Implicit Tissues from the Body Surface,Marilyn Keller · Vaibhav ARORA · Abdelmouttaleb Dakri · Shivam Chandhok · Jürgen Machann · Andreas Fritsche · Michael J. Black · Sergi Pujades,https://hit.is.tue.mpg.de,,https://www.youtube.com/watch?v=3u4emFF3DcE,,,,,nan +Towards a Simultaneous and Granular Identity-Expression Control in Personalized Face Generation,Renshuai Liu · Bowen Ma · Wei Zhang · Zhipeng Hu · Changjie Fan · Tangjie Lv · Yu Ding · Xuan Cheng, ,https://arxiv.org/abs/2401.01207,,2401.01207.pdf,Towards a Simultaneous and Granular Identity-Expression Control in Personalized Face Generation,"In human-centric content generation, the pre-trained text-to-image models +struggle to produce user-wanted portrait images, which retain the identity of +individuals while exhibiting diverse expressions. This paper introduces our +efforts towards personalized face generation. To this end, we propose a novel +multi-modal face generation framework, capable of simultaneous +identity-expression control and more fine-grained expression synthesis. Our +expression control is so sophisticated that it can be specialized by the +fine-grained emotional vocabulary. We devise a novel diffusion model that can +undertake the task of simultaneously face swapping and reenactment. Due to the +entanglement of identity and expression, it's nontrivial to separately and +precisely control them in one framework, thus has not been explored yet. To +overcome this, we propose several innovative designs in the conditional +diffusion model, including balancing identity and expression encoder, improved +midpoint sampling, and explicitly background conditioning. 
Extensive +experiments have demonstrated the controllability and scalability of the +proposed framework, in comparison with state-of-the-art text-to-image, face +swapping, and face reenactment methods.",cs.CV,['cs.CV'] +3D-SceneDreamer: Text-Driven 3D-Consistent Scene Generation,Songchun Zhang · Yibo Zhang · Quan Zheng · Rui Ma · Wei Hua · Hujun Bao · Weiwei Xu · Changqing Zou, ,https://arxiv.org/abs/2403.09439,,2403.09439.pdf,3D-SceneDreamer: Text-Driven 3D-Consistent Scene Generation,"Text-driven 3D scene generation techniques have made rapid progress in recent +years. Their success is mainly attributed to using existing generative models +to iteratively perform image warping and inpainting to generate 3D scenes. +However, these methods heavily rely on the outputs of existing models, leading +to error accumulation in geometry and appearance that prevent the models from +being used in various scenarios (e.g., outdoor and unreal scenarios). To +address this limitation, we generatively refine the newly generated local views +by querying and aggregating global 3D information, and then progressively +generate the 3D scene. Specifically, we employ a tri-plane features-based NeRF +as a unified representation of the 3D scene to constrain global 3D consistency, +and propose a generative refinement network to synthesize new contents with +higher quality by exploiting the natural image prior from 2D diffusion model as +well as the global 3D information of the current scene. Our extensive +experiments demonstrate that, in comparison to previous methods, our approach +supports wide variety of scene generation and arbitrary camera trajectories +with improved visual quality and 3D consistency.",cs.CV,"['cs.CV', 'cs.AI']" +Accurate Spatial Gene Expression Prediction by Integrating Multi-Resolution Features,Youngmin Chung · Ji Hun Ha · Kyeong Chan Im · Joo Sang Lee, ,https://arxiv.org/abs/2403.07592v1,,2403.07592v1.pdf,Accurate Spatial Gene Expression Prediction by integrating Multi-resolution features,"Recent advancements in Spatial Transcriptomics (ST) technology have +facilitated detailed gene expression analysis within tissue contexts. However, +the high costs and methodological limitations of ST necessitate a more robust +predictive model. In response, this paper introduces TRIPLEX, a novel deep +learning framework designed to predict spatial gene expression from Whole Slide +Images (WSIs). TRIPLEX uniquely harnesses multi-resolution features, capturing +cellular morphology at individual spots, the local context around these spots, +and the global tissue organization. By integrating these features through an +effective fusion strategy, TRIPLEX achieves accurate gene expression +prediction. Our comprehensive benchmark study, conducted on three public ST +datasets and supplemented with Visium data from 10X Genomics, demonstrates that +TRIPLEX outperforms current state-of-the-art models in Mean Squared Error +(MSE), Mean Absolute Error (MAE), and Pearson Correlation Coefficient (PCC). 
+The model's predictions align closely with ground truth gene expression +profiles and tumor annotations, underscoring TRIPLEX's potential in advancing +cancer diagnosis and treatment.",cs.CV,['cs.CV'] +Fantastic Animals and Where to Find Them: Segment Any Marine Animal with Dual SAM,Pingping Zhang · Tianyu Yan · Yang Liu · Huchuan Lu, ,https://arxiv.org/abs/2404.04996,,2404.04996.pdf,Fantastic Animals and Where to Find Them: Segment Any Marine Animal with Dual SAM,"As an important pillar of underwater intelligence, Marine Animal Segmentation +(MAS) involves segmenting animals within marine environments. Previous methods +don't excel in extracting long-range contextual features and overlook the +connectivity between discrete pixels. Recently, Segment Anything Model (SAM) +offers a universal framework for general segmentation tasks. Unfortunately, +trained with natural images, SAM does not obtain the prior knowledge from +marine images. In addition, the single-position prompt of SAM is very +insufficient for prior guidance. To address these issues, we propose a novel +feature learning framework, named Dual-SAM for high-performance MAS. To this +end, we first introduce a dual structure with SAM's paradigm to enhance feature +learning of marine images. Then, we propose a Multi-level Coupled Prompt (MCP) +strategy to instruct comprehensive underwater prior information, and enhance +the multi-level features of SAM's encoder with adapters. Subsequently, we +design a Dilated Fusion Attention Module (DFAM) to progressively integrate +multi-level features from SAM's encoder. Finally, instead of directly +predicting the masks of marine animals, we propose a Criss-Cross Connectivity +Prediction (C$^3$P) paradigm to capture the inter-connectivity between discrete +pixels. With dual decoders, it generates pseudo-labels and achieves mutual +supervision for complementary feature representations, resulting in +considerable improvements over previous techniques. Extensive experiments +verify that our proposed method achieves state-of-the-art performances on five +widely-used MAS datasets. The code is available at +https://github.com/Drchip61/Dual_SAM.",cs.CV,"['cs.CV', 'cs.MM']" +Learnable Earth Parser: Discovering 3D Prototypes in Aerial Scans,Romain Loiseau · Elliot Vincent · Mathieu Aubry · Loic Landrieu,https://romainloiseau.fr/learnable-earth-parser/,,https://www.youtube.com/watch?v=0PkxeT17e8Q,,,,,nan +Efficient Deformable ConvNets: Rethinking Dynamic and Sparse Operator for Vision Applications,Yuwen Xiong · Zhiqi Li · Yuntao Chen · Feng Wang · Xizhou Zhu · Jiapeng Luo · Wenhai Wang · Tong Lu · Hongsheng Li · Yu Qiao · Lewei Lu · Jie Zhou · Jifeng Dai, ,https://arxiv.org/abs/2401.06197,,2401.06197.pdf,Efficient Deformable ConvNets: Rethinking Dynamic and Sparse Operator for Vision Applications,"We introduce Deformable Convolution v4 (DCNv4), a highly efficient and +effective operator designed for a broad spectrum of vision applications. DCNv4 +addresses the limitations of its predecessor, DCNv3, with two key enhancements: +1. removing softmax normalization in spatial aggregation to enhance its dynamic +property and expressive power and 2. optimizing memory access to minimize +redundant operations for speedup. These improvements result in a significantly +faster convergence compared to DCNv3 and a substantial increase in processing +speed, with DCNv4 achieving more than three times the forward speed. 
DCNv4 +demonstrates exceptional performance across various tasks, including image +classification, instance and semantic segmentation, and notably, image +generation. When integrated into generative models like U-Net in the latent +diffusion model, DCNv4 outperforms its baseline, underscoring its possibility +to enhance generative models. In practical applications, replacing DCNv3 with +DCNv4 in the InternImage model to create FlashInternImage results in up to 80% +speed increase and further performance improvement without further +modifications. The advancements in speed and efficiency of DCNv4, combined with +its robust performance across diverse vision tasks, show its potential as a +foundational building block for future vision models.",cs.CV,['cs.CV'] +A Simple and Effective Point-based Network for Event Camera 6-DOFs Pose Relocalization,Hongwei Ren · Jiadong Zhu · Yue Zhou · Haotian FU · Yulong Huang · Bojun Cheng, ,https://arxiv.org/abs/2403.19412,,2403.19412.pdf,A Simple and Effective Point-based Network for Event Camera 6-DOFs Pose Relocalization,"Event cameras exhibit remarkable attributes such as high dynamic range, +asynchronicity, and low latency, making them highly suitable for vision tasks +that involve high-speed motion in challenging lighting conditions. These +cameras implicitly capture movement and depth information in events, making +them appealing sensors for Camera Pose Relocalization (CPR) tasks. +Nevertheless, existing CPR networks based on events neglect the pivotal +fine-grained temporal information in events, resulting in unsatisfactory +performance. Moreover, the energy-efficient features are further compromised by +the use of excessively complex models, hindering efficient deployment on edge +devices. In this paper, we introduce PEPNet, a simple and effective point-based +network designed to regress six degrees of freedom (6-DOFs) event camera poses. +We rethink the relationship between the event camera and CPR tasks, leveraging +the raw Point Cloud directly as network input to harness the high-temporal +resolution and inherent sparsity of events. PEPNet is adept at abstracting the +spatial and implicit temporal features through hierarchical structure and +explicit temporal features by Attentive Bi-directional Long Short-Term Memory +(A-Bi-LSTM). By employing a carefully crafted lightweight design, PEPNet +delivers state-of-the-art (SOTA) performance on both indoor and outdoor +datasets with meager computational resources. Specifically, PEPNet attains a +significant 38% and 33% performance improvement on the random split IJRR and +M3ED datasets, respectively. Moreover, the lightweight design version +PEPNet$_{tiny}$ accomplishes results comparable to the SOTA while employing a +mere 0.5% of the parameters.",cs.CV,['cs.CV'] +Attention Calibration for Disentangled Text-to-Image Personalization,Yanbing Zhang · Mengping Yang · Qin Zhou · Zhe Wang, ,https://arxiv.org/abs/2403.18551,,2403.18551.pdf,Attention Calibration for Disentangled Text-to-Image Personalization,"Recent thrilling progress in large-scale text-to-image (T2I) models has +unlocked unprecedented synthesis quality of AI-generated content (AIGC) +including image generation, 3D and video composition. Further, personalized +techniques enable appealing customized production of a novel concept given only +several images as reference. However, an intriguing problem persists: Is it +possible to capture multiple, novel concepts from one single reference image? 
+In this paper, we identify that existing approaches fail to preserve visual +consistency with the reference image and eliminate cross-influence from +concepts. To alleviate this, we propose an attention calibration mechanism to +improve the concept-level understanding of the T2I model. Specifically, we +first introduce new learnable modifiers bound with classes to capture +attributes of multiple concepts. Then, the classes are separated and +strengthened following the activation of the cross-attention operation, +ensuring comprehensive and self-contained concepts. Additionally, we suppress +the attention activation of different classes to mitigate mutual influence +among concepts. Together, our proposed method, dubbed DisenDiff, can learn +disentangled multiple concepts from one single image and produce novel +customized images with learned concepts. We demonstrate that our method +outperforms the current state of the art in both qualitative and quantitative +evaluations. More importantly, our proposed techniques are compatible with LoRA +and inpainting pipelines, enabling more interactive experiences.",cs.CV,['cs.CV'] +LucidDreamer: Towards High-Fidelity Text-to-3D Generation via Interval Score Matching,Yixun Liang · Xin Yang · Jiantao Lin · Haodong LI · Xiaogang Xu · Ying-Cong Chen, ,https://arxiv.org/abs/2311.11284,,2311.11284.pdf,LucidDreamer: Towards High-Fidelity Text-to-3D Generation via Interval Score Matching,"The recent advancements in text-to-3D generation mark a significant milestone +in generative models, unlocking new possibilities for creating imaginative 3D +assets across various real-world scenarios. While recent advancements in +text-to-3D generation have shown promise, they often fall short in rendering +detailed and high-quality 3D models. This problem is especially prevalent as +many methods base themselves on Score Distillation Sampling (SDS). This paper +identifies a notable deficiency in SDS, that it brings inconsistent and +low-quality updating direction for the 3D model, causing the over-smoothing +effect. To address this, we propose a novel approach called Interval Score +Matching (ISM). ISM employs deterministic diffusing trajectories and utilizes +interval-based score matching to counteract over-smoothing. Furthermore, we +incorporate 3D Gaussian Splatting into our text-to-3D generation pipeline. +Extensive experiments show that our model largely outperforms the +state-of-the-art in quality and training efficiency.",cs.CV,"['cs.CV', 'cs.GR', 'cs.MM']" +Object Dynamics Modeling with Hierarchical Point Cloud-based Representations,Chanho Kim · Li Fuxin, ,https://arxiv.org/abs/2404.06044,,2404.06044.pdf,Object Dynamics Modeling with Hierarchical Point Cloud-based Representations,"Modeling object dynamics with a neural network is an important problem with +numerous applications. Most recent work has been based on graph neural +networks. However, physics happens in 3D space, where geometric information +potentially plays an important role in modeling physical phenomena. In this +work, we propose a novel U-net architecture based on continuous point +convolution which naturally embeds information from 3D coordinates and allows +for multi-scale feature representations with established downsampling and +upsampling procedures. Bottleneck layers in the downsampled point clouds lead +to better long-range interaction modeling. 
Besides, the flexibility of point +convolutions allows our approach to generalize to sparsely sampled points from +mesh vertices and dynamically generate features on important interaction points +on mesh faces. Experimental results demonstrate that our approach significantly +improves the state-of-the-art, especially in scenarios that require accurate +gravity or collision reasoning.",cs.CV,['cs.CV'] +CAD: Photorealistic 3D Generation via Adversarial Distillation,Ziyu Wan · Despoina Paschalidou · Ian Huang · Hongyu Liu · Bokui Shen · Xiaoyu Xiang · Jing Liao · Leonidas Guibas,http://raywzy.com/CAD/,https://arxiv.org/abs/2312.06663,,2312.06663.pdf,CAD: Photorealistic 3D Generation via Adversarial Distillation,"The increased demand for 3D data in AR/VR, robotics and gaming applications, +gave rise to powerful generative pipelines capable of synthesizing high-quality +3D objects. Most of these models rely on the Score Distillation Sampling (SDS) +algorithm to optimize a 3D representation such that the rendered image +maintains a high likelihood as evaluated by a pre-trained diffusion model. +However, finding a correct mode in the high-dimensional distribution produced +by the diffusion model is challenging and often leads to issues such as +over-saturation, over-smoothing, and Janus-like artifacts. In this paper, we +propose a novel learning paradigm for 3D synthesis that utilizes pre-trained +diffusion models. Instead of focusing on mode-seeking, our method directly +models the distribution discrepancy between multi-view renderings and diffusion +priors in an adversarial manner, which unlocks the generation of high-fidelity +and photorealistic 3D content, conditioned on a single image and prompt. +Moreover, by harnessing the latent space of GANs and expressive diffusion model +priors, our method facilitates a wide variety of 3D applications including +single-view reconstruction, high diversity generation and continuous 3D +interpolation in the open domain. The experiments demonstrate the superiority +of our pipeline compared to previous works in terms of generation quality and +diversity.",cs.CV,"['cs.CV', 'cs.GR']" +Gaussian Shell Maps for Efficient 3D Human Generation,Rameen Abdal · Wang Yifan · Zifan Shi · Yinghao Xu · Ryan Po · Zhengfei Kuang · Qifeng Chen · Dit-Yan Yeung · Gordon Wetzstein,https://rameenabdal.github.io/GaussianShellMaps/,https://arxiv.org/abs/2311.17857v1,,2311.17857v1.pdf,Gaussian Shell Maps for Efficient 3D Human Generation,"Efficient generation of 3D digital humans is important in several industries, +including virtual reality, social media, and cinematic production. 3D +generative adversarial networks (GANs) have demonstrated state-of-the-art +(SOTA) quality and diversity for generated assets. Current 3D GAN +architectures, however, typically rely on volume representations, which are +slow to render, thereby hampering the GAN training and requiring +multi-view-inconsistent 2D upsamplers. Here, we introduce Gaussian Shell Maps +(GSMs) as a framework that connects SOTA generator network architectures with +emerging 3D Gaussian rendering primitives using an articulable multi +shell--based scaffold. In this setting, a CNN generates a 3D texture stack with +features that are mapped to the shells. The latter represent inflated and +deflated versions of a template surface of a digital human in a canonical body +pose. Instead of rasterizing the shells directly, we sample 3D Gaussians on the +shells whose attributes are encoded in the texture features. 
These Gaussians +are efficiently and differentiably rendered. The ability to articulate the +shells is important during GAN training and, at inference time, to deform a +body into arbitrary user-defined poses. Our efficient rendering scheme bypasses +the need for view-inconsistent upsamplers and achieves high-quality multi-view +consistent renderings at a native resolution of $512 \times 512$ pixels. We +demonstrate that GSMs successfully generate 3D humans when trained on +single-view datasets, including SHHQ and DeepFashion.",cs.CV,"['cs.CV', 'cs.GR']" +3D-Aware Face Editing via Warping-Guided Latent Direction Learning,Yuhao Cheng · Zhuo Chen · Xingyu Ren · Wenhan Zhu · Zhengqin Xu · Di Xu · Yang Changpeng · Yichao Yan, ,https://arxiv.org/abs/2402.14000,,2402.14000.pdf,Real-time 3D-aware Portrait Editing from a Single Image,"This work presents 3DPE, a practical method that can efficiently edit a face +image following given prompts, like reference images or text descriptions, in a +3D-aware manner. To this end, a lightweight module is distilled from a 3D +portrait generator and a text-to-image model, which provide prior knowledge of +face geometry and superior editing capability, respectively. Such a design +brings two compelling advantages over existing approaches. First, our system +achieves real-time editing with a feedforward network (i.e., ~0.04s per image), +over 100x faster than the second competitor. Second, thanks to the powerful +priors, our module could focus on the learning of editing-related variations, +such that it manages to handle various types of editing simultaneously in the +training phase and further supports fast adaptation to user-specified +customized types of editing during inference (e.g., with ~5min fine-tuning per +style). The code, the model, and the interface will be made publicly available +to facilitate future research.",cs.CV,['cs.CV'] +NeRFiller: Completing Scenes via Generative 3D Inpainting,Ethan Weber · Aleksander Holynski · Varun Jampani · Saurabh Saxena · Noah Snavely · Abhishek Kar · Angjoo Kanazawa,https://ethanweber.me/nerfiller/,https://arxiv.org/abs/2312.04560,,2312.04560.pdf,NeRFiller: Completing Scenes via Generative 3D Inpainting,"We propose NeRFiller, an approach that completes missing portions of a 3D +capture via generative 3D inpainting using off-the-shelf 2D visual generative +models. Often parts of a captured 3D scene or object are missing due to mesh +reconstruction failures or a lack of observations (e.g., contact regions, such +as the bottom of objects, or hard-to-reach areas). We approach this challenging +3D inpainting problem by leveraging a 2D inpainting diffusion model. We +identify a surprising behavior of these models, where they generate more 3D +consistent inpaints when images form a 2$\times$2 grid, and show how to +generalize this behavior to more than four images. We then present an iterative +framework to distill these inpainted regions into a single consistent 3D scene. +In contrast to related works, we focus on completing scenes rather than +deleting foreground objects, and our approach does not require tight 2D object +masks or text. We compare our approach to relevant baselines adapted to our +setting on a variety of scenes, where NeRFiller creates the most 3D consistent +and plausible scene completions. 
Our project page is at +https://ethanweber.me/nerfiller.",cs.CV,"['cs.CV', 'cs.AI', 'cs.GR']" +Exploring the Transferability of Visual Prompting for Multimodal Large Language Models,Yichi Zhang · Yinpeng Dong · Siyuan Zhang · Tianzan Min · Hang Su · Jun Zhu, ,https://arxiv.org/abs/2404.11207v1,,2404.11207v1.pdf,Exploring the Transferability of Visual Prompting for Multimodal Large Language Models,"Although Multimodal Large Language Models (MLLMs) have demonstrated promising +versatile capabilities, their performance is still inferior to specialized +models on downstream tasks, which makes adaptation necessary to enhance their +utility. However, fine-tuning methods require independent training for every +model, leading to huge computation and memory overheads. In this paper, we +propose a novel setting where we aim to improve the performance of diverse +MLLMs with a group of shared parameters optimized for a downstream task. To +achieve this, we propose Transferable Visual Prompting (TVP), a simple and +effective approach to generate visual prompts that can transfer to different +models and improve their performance on downstream tasks after trained on only +one model. We introduce two strategies to address the issue of cross-model +feature corruption of existing visual prompting methods and enhance the +transferability of the learned prompts, including 1) Feature Consistency +Alignment: which imposes constraints to the prompted feature changes to +maintain task-agnostic knowledge; 2) Task Semantics Enrichment: which +encourages the prompted images to contain richer task-specific semantics with +language guidance. We validate the effectiveness of TVP through extensive +experiments with 6 modern MLLMs on a wide variety of tasks ranging from object +recognition and counting to multimodal reasoning and hallucination correction.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +Utility-Fairness Trade-Offs and How to Find Them,Sepehr Dehdashtian · Bashir Sadeghi · Vishnu Naresh Boddeti,https://sepehrdehdashtian.github.io/Papers/U-FaTE/index.html,https://arxiv.org/abs/2404.09454v1,,2404.09454v1.pdf,Utility-Fairness Trade-Offs and How to Find Them,"When building classification systems with demographic fairness +considerations, there are two objectives to satisfy: 1) maximizing utility for +the specific task and 2) ensuring fairness w.r.t. a known demographic +attribute. These objectives often compete, so optimizing both can lead to a +trade-off between utility and fairness. While existing works acknowledge the +trade-offs and study their limits, two questions remain unanswered: 1) What are +the optimal trade-offs between utility and fairness? and 2) How can we +numerically quantify these trade-offs from data for a desired prediction task +and demographic attribute of interest? This paper addresses these questions. We +introduce two utility-fairness trade-offs: the Data-Space and Label-Space +Trade-off. The trade-offs reveal three regions within the utility-fairness +plane, delineating what is fully and partially possible and impossible. We +propose U-FaTE, a method to numerically quantify the trade-offs for a given +prediction task and group fairness definition from data samples. Based on the +trade-offs, we introduce a new scheme for evaluating representations. 
An +extensive evaluation of fair representation learning methods and +representations from over 1000 pre-trained models revealed that most current +approaches are far from the estimated and achievable fairness-utility +trade-offs across multiple datasets and prediction tasks.",cs.CV,"['cs.CV', 'cs.CY', 'cs.LG']" +Observation-Guided Diffusion Probabilistic Models,Junoh Kang · Jinyoung Choi · Sungik Choi · Bohyung Han, ,https://arxiv.org/abs/2310.04041,,2310.04041.pdf,Observation-Guided Diffusion Probabilistic Models,"We propose a novel diffusion-based image generation method called the +observation-guided diffusion probabilistic model (OGDM), which effectively +addresses the tradeoff between quality control and fast sampling. Our approach +reestablishes the training objective by integrating the guidance of the +observation process with the Markov chain in a principled way. This is achieved +by introducing an additional loss term derived from the observation based on a +conditional discriminator on noise level, which employs a Bernoulli +distribution indicating whether its input lies on the (noisy) real manifold or +not. This strategy allows us to optimize the more accurate negative +log-likelihood induced in the inference stage especially when the number of +function evaluations is limited. The proposed training scheme is also +advantageous even when incorporated only into the fine-tuning process, and it +is compatible with various fast inference strategies since our method yields +better denoising networks using the exactly the same inference procedure +without incurring extra computational cost. We demonstrate the effectiveness of +our training algorithm using diverse inference techniques on strong diffusion +model baselines. Our implementation is available at +https://github.com/Junoh-Kang/OGDM_edm.",cs.LG,"['cs.LG', 'cs.AI']" +FreeCustom: Tuning-Free Customized Image Generation for Multi-Concept Composition,Ganggui Ding · Canyu Zhao · Wen Wang · Zhen Yang · Zide Liu · Hao Chen · Chunhua Shen,https://aim-uofa.github.io/FreeCustom/,https://arxiv.org/abs/2405.13870,,2405.13870.pdf,FreeCustom: Tuning-Free Customized Image Generation for Multi-Concept Composition,"Benefiting from large-scale pre-trained text-to-image (T2I) generative +models, impressive progress has been achieved in customized image generation, +which aims to generate user-specified concepts. Existing approaches have +extensively focused on single-concept customization and still encounter +challenges when it comes to complex scenarios that involve combining multiple +concepts. These approaches often require retraining/fine-tuning using a few +images, leading to time-consuming training processes and impeding their swift +implementation. Furthermore, the reliance on multiple images to represent a +singular concept increases the difficulty of customization. To this end, we +propose FreeCustom, a novel tuning-free method to generate customized images of +multi-concept composition based on reference concepts, using only one image per +concept as input. Specifically, we introduce a new multi-reference +self-attention (MRSA) mechanism and a weighted mask strategy that enables the +generated image to access and focus more on the reference concepts. In +addition, MRSA leverages our key finding that input concepts are better +preserved when providing images with context interactions. Experiments show +that our method's produced images are consistent with the given concepts and +better aligned with the input text. 
Our method outperforms or performs on par +with other training-based methods in terms of multi-concept composition and +single-concept customization, but is simpler. Codes can be found at +https://github.com/aim-uofa/FreeCustom.",cs.CV,['cs.CV'] +ModaVerse: Efficiently Transforming Modalities with LLMs,Xinyu Wang · Bohan Zhuang · Qi Wu, ,https://arxiv.org/abs/2401.06395,,2401.06395.pdf,ModaVerse: Efficiently Transforming Modalities with LLMs,"Humans possess the capability to comprehend diverse modalities and seamlessly +transfer information between them. In this work, we introduce ModaVerse, a +Multi-modal Large Language Model (MLLM) capable of comprehending and +transforming content across various modalities including images, videos, and +audio. Predominant MLLM frameworks have largely relied on the alignment of +latent spaces of textual and non-textual features. This alignment process, +which synchronizes a language model trained on textual data with encoders and +decoders trained on multi-modal data, often necessitates extensive training of +several projection layers in multiple stages. Inspired by LLM-as-agent +methodologies, we propose a novel Input/Output (I/O) alignment mechanism that +operates directly at the level of natural language. It aligns the LLM's output +with the input of generative models, avoiding the complexities associated with +latent feature alignments, and simplifying the multiple training stages of +existing MLLMs into a single, efficient process. This conceptual advancement +leads to significant reductions in both data and computational costs. By +conducting experiments on several benchmarks, we demonstrate that our approach +attains comparable performance with the state of the art while achieving +considerable efficiencies in data usage and training duration.",cs.CV,['cs.CV'] +Targeted Representation Alignment for Open-World Semi-Supervised Learning,Ruixuan Xiao · Lei Feng · Kai Tang · Junbo Zhao · Yixuan Li · Gang Chen · Haobo Wang, ,https://arxiv.org/abs/2311.03524,,2311.03524.pdf,A Graph-Theoretic Framework for Understanding Open-World Semi-Supervised Learning,"Open-world semi-supervised learning aims at inferring both known and novel +classes in unlabeled data, by harnessing prior knowledge from a labeled set +with known classes. Despite its importance, there is a lack of theoretical +foundations for this problem. This paper bridges the gap by formalizing a +graph-theoretic framework tailored for the open-world setting, where the +clustering can be theoretically characterized by graph factorization. Our +graph-theoretic framework illuminates practical algorithms and provides +guarantees. In particular, based on our graph formulation, we apply the +algorithm called Spectral Open-world Representation Learning (SORL), and show +that minimizing our loss is equivalent to performing spectral decomposition on +the graph. Such equivalence allows us to derive a provable error bound on the +clustering performance for both known and novel classes, and analyze rigorously +when labeled data helps. 
Empirically, SORL can match or outperform several +strong baselines on common benchmark datasets, which is appealing for practical +usage while enjoying theoretical guarantees.",cs.LG,['cs.LG'] +PELA: Learning Parameter-Efficient Models with Low-Rank Approximation,Yangyang Guo · Guangzhi Wang · Mohan Kankanhalli, ,https://arxiv.org/abs/2310.10700,,2310.10700.pdf,PELA: Learning Parameter-Efficient Models with Low-Rank Approximation,"Applying a pre-trained large model to downstream tasks is prohibitive under +resource-constrained conditions. Recent dominant approaches for addressing +efficiency issues involve adding a few learnable parameters to the fixed +backbone model. This strategy, however, leads to more challenges in loading +large models for downstream fine-tuning with limited resources. In this paper, +we propose a novel method for increasing the parameter efficiency of +pre-trained models by introducing an intermediate pre-training stage. To this +end, we first employ low-rank approximation to compress the original large +model and then devise a feature distillation module and a weight perturbation +regularization module. These modules are specifically designed to enhance the +low-rank model. In particular, we update only the low-rank model while freezing +the backbone parameters during pre-training. This allows for direct and +efficient utilization of the low-rank model for downstream fine-tuning tasks. +The proposed method achieves both efficiencies in terms of required parameters +and computation time while maintaining comparable results with minimal +modifications to the backbone architecture. Specifically, when applied to three +vision-only and one vision-language Transformer models, our approach often +demonstrates a merely $\sim$0.6 point decrease in performance while reducing +the original parameter size by 1/3 to 2/3.",cs.CV,['cs.CV'] +Progressive Semantic-Guided Vision Transformer for Zero-Shot Learning,Shiming Chen · Wenjin Hou · Salman Khan · Fahad Shahbaz Khan, ,https://arxiv.org/abs/2404.07713,,2404.07713.pdf,Progressive Semantic-Guided Vision Transformer for Zero-Shot Learning,"Zero-shot learning (ZSL) recognizes the unseen classes by conducting +visual-semantic interactions to transfer semantic knowledge from seen classes +to unseen ones, supported by semantic information (e.g., attributes). However, +existing ZSL methods simply extract visual features using a pre-trained network +backbone (i.e., CNN or ViT), which fail to learn matched visual-semantic +correspondences for representing semantic-related visual features as lacking of +the guidance of semantic information, resulting in undesirable visual-semantic +interactions. To tackle this issue, we propose a progressive semantic-guided +vision transformer for zero-shot learning (dubbed ZSLViT). ZSLViT mainly +considers two properties in the whole network: i) discover the semantic-related +visual representations explicitly, and ii) discard the semantic-unrelated +visual information. Specifically, we first introduce semantic-embedded token +learning to improve the visual-semantic correspondences via semantic +enhancement and discover the semantic-related visual tokens explicitly with +semantic-guided token attention. Then, we fuse low semantic-visual +correspondence visual tokens to discard the semantic-unrelated visual +information for visual enhancement. 
These two operations are integrated into +various encoders to progressively learn semantic-related visual representations +for accurate visual-semantic interactions in ZSL. The extensive experiments +show that our ZSLViT achieves significant performance gains on three popular +benchmark datasets, i.e., CUB, SUN, and AWA2.",cs.CV,"['cs.CV', 'cs.LG']" +Endow SAM with Keen Eyes: Temporal-spatial Prompt Learning for Video Camouflaged Object Detection,Wenjun Hui · Zhenfeng Zhu · Shuai Zheng · Yao Zhao, ,https://arxiv.org/html/2403.01968v1,,2403.01968v1.pdf,Explicit Motion Handling and Interactive Prompting for Video Camouflaged Object Detection,"Camouflage poses challenges in distinguishing a static target, whereas any +movement of the target can break this disguise. Existing video camouflaged +object detection (VCOD) approaches take noisy motion estimation as input or +model motion implicitly, restricting detection performance in complex dynamic +scenes. In this paper, we propose a novel Explicit Motion handling and +Interactive Prompting framework for VCOD, dubbed EMIP, which handles motion +cues explicitly using a frozen pre-trained optical flow fundamental model. EMIP +is characterized by a two-stream architecture for simultaneously conducting +camouflaged segmentation and optical flow estimation. Interactions across the +dual streams are realized in an interactive prompting way that is inspired by +emerging visual prompt learning. Two learnable modules, i.e. the camouflaged +feeder and motion collector, are designed to incorporate segmentation-to-motion +and motion-to-segmentation prompts, respectively, and enhance outputs of the +both streams. The prompt fed to the motion stream is learned by supervising +optical flow in a self-supervised manner. Furthermore, we show that long-term +historical information can also be incorporated as a prompt into EMIP and +achieve more robust results with temporal consistency. Experimental results +demonstrate that our EMIP achieves new state-of-the-art records on popular VCOD +benchmarks. The code will be publicly available.",cs.CV,['cs.CV'] +Linguistic-Aware Patch Slimming Framework for Fine-grained Cross-Modal Alignment,Zheren Fu · Lei Zhang · Hou Xia · Zhendong Mao,https://github.com/CrossmodalGroup/LAPS,https://arxiv.org/html/2312.05278v2,,2312.05278v2.pdf,Lyrics: Boosting Fine-grained Language-Vision Alignment and Comprehension via Semantic-aware Visual Objects,"Large Vision Language Models (LVLMs) have demonstrated impressive zero-shot +capabilities in various vision-language dialogue scenarios. However, the +absence of fine-grained visual object detection hinders the model from +understanding the details of images, leading to irreparable visual +hallucinations and factual errors. In this paper, we propose Lyrics, a novel +multi-modal pre-training and instruction fine-tuning paradigm that bootstraps +vision-language alignment from fine-grained cross-modal collaboration. Building +on the foundation of BLIP-2, Lyrics infuses local visual features extracted +from a visual refiner that includes image tagging, object detection and +semantic segmentation modules into the Querying Transformer, while on the text +side, the language inputs equip the boundary boxes and tags derived from the +visual refiner. We further introduce a two-stage training scheme, in which the +pre-training stage bridges the modality gap through explicit and comprehensive +vision-language alignment targets. 
During the instruction fine-tuning stage, we +introduce semantic-aware visual feature extraction, a crucial method that +enables the model to extract informative features from concrete visual objects. +Our approach achieves robust performance on 13 datasets across various +vision-language tasks, and demonstrates promising multi-modal understanding, +perception and conversation capabilities in 11 scenario-based benchmark +toolkits.",cs.CL,['cs.CL'] +"GenH2R: Learning Generalizable Human-to-Robot Handover via Scalable Simulation, Demonstration, and Imitation",Zifan Wang · Junyu Chen · Ziqing Chen · Pengwei Xie · Rui Chen · Li Yi, ,https://arxiv.org/abs/2401.00929,,2401.00929.pdf,"GenH2R: Learning Generalizable Human-to-Robot Handover via Scalable Simulation, Demonstration, and Imitation","This paper presents GenH2R, a framework for learning generalizable +vision-based human-to-robot (H2R) handover skills. The goal is to equip robots +with the ability to reliably receive objects with unseen geometry handed over +by humans in various complex trajectories. We acquire such generalizability by +learning H2R handover at scale with a comprehensive solution including +procedural simulation assets creation, automated demonstration generation, and +effective imitation learning. We leverage large-scale 3D model repositories, +dexterous grasp generation methods, and curve-based 3D animation to create an +H2R handover simulation environment named \simabbns, surpassing the number of +scenes in existing simulators by three orders of magnitude. We further +introduce a distillation-friendly demonstration generation method that +automatically generates a million high-quality demonstrations suitable for +learning. Finally, we present a 4D imitation learning method augmented by a +future forecasting objective to distill demonstrations into a visuo-motor +handover policy. Experimental evaluations in both simulators and the real world +demonstrate significant improvements (at least +10\% success rate) over +baselines in all cases. The project page is https://GenH2R.github.io/.",cs.RO,"['cs.RO', 'cs.CV']" +HOIDiffusion: Generating Realistic 3D Hand-Object Interaction Data,Mengqi Zhang · Yang Fu · Zheng Ding · Sifei Liu · Zhuowen Tu · Xiaolong Wang, ,https://arxiv.org/abs/2403.12011,,2403.12011.pdf,HOIDiffusion: Generating Realistic 3D Hand-Object Interaction Data,"3D hand-object interaction data is scarce due to the hardware constraints in +scaling up the data collection process. In this paper, we propose HOIDiffusion +for generating realistic and diverse 3D hand-object interaction data. Our model +is a conditional diffusion model that takes both the 3D hand-object geometric +structure and text description as inputs for image synthesis. This offers a +more controllable and realistic synthesis as we can specify the structure and +style inputs in a disentangled manner. HOIDiffusion is trained by leveraging a +diffusion model pre-trained on large-scale natural images and a few 3D human +demonstrations. Beyond controllable image synthesis, we adopt the generated 3D +data for learning 6D object pose estimation and show its effectiveness in +improving perception systems. 
Project page: +https://mq-zhang1.github.io/HOIDiffusion",cs.CV,['cs.CV'] +Learning to navigate efficiently and precisely in real environments,Guillaume Bono · Hervé Poirier · Leonid Antsfeld · Gianluca Monaci · Boris Chidlovskii · Christian Wolf, ,https://arxiv.org/abs/2401.14349,,2401.14349.pdf,Learning to navigate efficiently and precisely in real environments,"In the context of autonomous navigation of terrestrial robots, the creation +of realistic models for agent dynamics and sensing is a widespread habit in the +robotics literature and in commercial applications, where they are used for +model based control and/or for localization and mapping. The more recent +Embodied AI literature, on the other hand, focuses on modular or end-to-end +agents trained in simulators like Habitat or AI-Thor, where the emphasis is put +on photo-realistic rendering and scene diversity, but high-fidelity robot +motion is assigned a less privileged role. The resulting sim2real gap +significantly impacts transfer of the trained models to real robotic platforms. +In this work we explore end-to-end training of agents in simulation in settings +which minimize the sim2real gap both, in sensing and in actuation. Our agent +directly predicts (discretized) velocity commands, which are maintained through +closed-loop control in the real robot. The behavior of the real robot +(including the underlying low-level controller) is identified and simulated in +a modified Habitat simulator. Noise models for odometry and localization +further contribute in lowering the sim2real gap. We evaluate on real navigation +scenarios, explore different localization and point goal calculation methods +and report significant gains in performance and robustness compared to prior +work.",cs.RO,"['cs.RO', 'cs.CV']" +TexOct: Generating Textures of 3D Models with Octree-based Diffusion,Jialun Liu · Chenming Wu · Xinqi Liu · Xing Liu · Jinbo Wu · Haotian Peng · Chen Zhao · Haocheng Feng · Jingtuo Liu · Errui Ding, ,https://arxiv.org/html/2403.15009v1,,2403.15009v1.pdf,TexRO: Generating Delicate Textures of 3D Models by Recursive Optimization,"This paper presents TexRO, a novel method for generating delicate textures of +a known 3D mesh by optimizing its UV texture. The key contributions are +two-fold. We propose an optimal viewpoint selection strategy, that finds the +most miniature set of viewpoints covering all the faces of a mesh. Our +viewpoint selection strategy guarantees the completeness of a generated result. +We propose a recursive optimization pipeline that optimizes a UV texture at +increasing resolutions, with an adaptive denoising method that re-uses existing +textures for new texture generation. Through extensive experimentation, we +demonstrate the superior performance of TexRO in terms of texture quality, +detail preservation, visual consistency, and, notably runtime speed, +outperforming other current methods. The broad applicability of TexRO is +further confirmed through its successful use on diverse 3D models.",cs.CV,['cs.CV'] +Explaining CLIP's performance disparities on data from blind/low vision users,Daniela Massiceti · Camilla Longden · Agnieszka Słowik · Samuel Wills · Martin Grayson · Cecily Morrison, ,https://arxiv.org/abs/2311.17315,,2311.17315.pdf,Explaining CLIP's performance disparities on data from blind/low vision users,"Large multi-modal models (LMMs) hold the potential to usher in a new era of +automated visual assistance for people who are blind or low vision (BLV). 
Yet, +these models have not been systematically evaluated on data captured by BLV +users. We address this by empirically assessing CLIP, a widely-used LMM likely +to underpin many assistive technologies. Testing 25 CLIP variants in a +zero-shot classification task, we find that their accuracy is 15 percentage +points lower on average for images captured by BLV users than web-crawled +images. This disparity stems from CLIP's sensitivities to 1) image content +(e.g. not recognizing disability objects as well as other objects); 2) image +quality (e.g. not being robust to lighting variation); and 3) text content +(e.g. not recognizing objects described by tactile adjectives as well as visual +ones). We delve deeper with a textual analysis of three common pre-training +datasets: LAION-400M, LAION-2B and DataComp-1B, showing that disability content +is rarely mentioned. We then provide three examples that illustrate how the +performance disparities extend to three downstream models underpinned by CLIP: +OWL-ViT, CLIPSeg and DALL-E2. We find that few-shot learning with as few as 5 +images can mitigate CLIP's quality-of-service disparities for BLV users in some +scenarios, which we discuss alongside a set of other possible mitigations.",cs.CV,['cs.CV'] +Virtual Immunohistochemistry Staining for Histological Images Assisted by Weakly-supervised Learning,Jiahan Li · Jiuyang Dong · Shenjin Huang · Xi Li · Junjun Jiang · Xiaopeng Fan · Yongbing Zhang, ,,https://www.sciencedirect.com/science/article/pii/S0167779924000386,,,,,nan +Bridging Remote Sensors with Multisensor Geospatial Foundation Models,Boran Han · Shuai Zhang · Xingjian Shi · Markus Reichstein, ,https://arxiv.org/abs/2404.01260,,2404.01260.pdf,Bridging Remote Sensors with Multisensor Geospatial Foundation Models,"In the realm of geospatial analysis, the diversity of remote sensors, +encompassing both optical and microwave technologies, offers a wealth of +distinct observational capabilities. Recognizing this, we present msGFM, a +multisensor geospatial foundation model that effectively unifies data from four +key sensor modalities. This integration spans an expansive dataset of two +million multisensor images. msGFM is uniquely adept at handling both paired and +unpaired sensor data. For data originating from identical geolocations, our +model employs an innovative cross-sensor pretraining approach in masked image +modeling, enabling the synthesis of joint representations from diverse sensors. +msGFM, incorporating four remote sensors, upholds strong performance, forming a +comprehensive model adaptable to various sensor types. msGFM has demonstrated +enhanced proficiency in a range of both single-sensor and multisensor +downstream tasks. These include scene classification, segmentation, cloud +removal, and pan-sharpening. A key discovery of our research is that +representations derived from natural images are not always compatible with the +distinct characteristics of geospatial remote sensors, underscoring the +limitations of existing representations in this field. 
Our work can serve as a +guide for developing multisensor geospatial pretraining models, paving the way +for more advanced geospatial capabilities.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +Active Generalized Category Discovery,Shijie Ma · Fei Zhu · Zhun Zhong · Xu-Yao Zhang · Cheng-Lin Liu,https://github.com/mashijie1028/ActiveGCD,https://arxiv.org/abs/2403.04272v1,,2403.04272v1.pdf,Active Generalized Category Discovery,"Generalized Category Discovery (GCD) is a pragmatic and challenging +open-world task, which endeavors to cluster unlabeled samples from both novel +and old classes, leveraging some labeled data of old classes. Given that +knowledge learned from old classes is not fully transferable to new classes, +and that novel categories are fully unlabeled, GCD inherently faces intractable +problems, including imbalanced classification performance and inconsistent +confidence between old and new classes, especially in the low-labeling regime. +Hence, some annotations of new classes are deemed necessary. However, labeling +new classes is extremely costly. To address this issue, we take the spirit of +active learning and propose a new setting called Active Generalized Category +Discovery (AGCD). The goal is to improve the performance of GCD by actively +selecting a limited amount of valuable samples for labeling from the oracle. To +solve this problem, we devise an adaptive sampling strategy, which jointly +considers novelty, informativeness and diversity to adaptively select novel +samples with proper uncertainty. However, owing to the varied orderings of +label indices caused by the clustering of novel classes, the queried labels are +not directly applicable to subsequent training. To overcome this issue, we +further propose a stable label mapping algorithm that transforms ground truth +labels to the label space of the classifier, thereby ensuring consistent +training across different active selection stages. Our method achieves +state-of-the-art performance on both generic and fine-grained datasets. Our +code is available at https://github.com/mashijie1028/ActiveGCD",cs.CV,['cs.CV'] +PatchFusion: An End-to-End Tile-Based Framework for High-Resolution Monocular Metric Depth Estimation,Zhenyu Li · Shariq Bhat · Peter Wonka, ,https://arxiv.org/abs/2312.02284,,2312.02284.pdf,PatchFusion: An End-to-End Tile-Based Framework for High-Resolution Monocular Metric Depth Estimation,"Single image depth estimation is a foundational task in computer vision and +generative modeling. However, prevailing depth estimation models grapple with +accommodating the increasing resolutions commonplace in today's consumer +cameras and devices. Existing high-resolution strategies show promise, but they +often face limitations, ranging from error propagation to the loss of +high-frequency details. We present PatchFusion, a novel tile-based framework +with three key components to improve the current state of the art: (1) A +patch-wise fusion network that fuses a globally-consistent coarse prediction +with finer, inconsistent tiled predictions via high-level feature guidance, (2) +A Global-to-Local (G2L) module that adds vital context to the fusion network, +discarding the need for patch selection heuristics, and (3) A Consistency-Aware +Training (CAT) and Inference (CAI) approach, emphasizing patch overlap +consistency and thereby eradicating the necessity for post-processing. 
+Experiments on UnrealStereo4K, MVS-Synth, and Middleburry 2014 demonstrate that +our framework can generate high-resolution depth maps with intricate details. +PatchFusion is independent of the base model for depth estimation. Notably, our +framework built on top of SOTA ZoeDepth brings improvements for a total of +17.3% and 29.4% in terms of the root mean squared error (RMSE) on +UnrealStereo4K and MVS-Synth, respectively.",cs.CV,['cs.CV'] +Template Free Reconstruction of Human-object Interaction with Procedural Interaction Generation,Xianghui Xie · Bharat Lal Bhatnagar · Jan Lenssen · Gerard Pons-Moll, ,https://arxiv.org/abs/2312.07063,,2312.07063.pdf,Template Free Reconstruction of Human-object Interaction with Procedural Interaction Generation,"Reconstructing human-object interaction in 3D from a single RGB image is a +challenging task and existing data driven methods do not generalize beyond the +objects present in the carefully curated 3D interaction datasets. Capturing +large-scale real data to learn strong interaction and 3D shape priors is very +expensive due to the combinatorial nature of human-object interactions. In this +paper, we propose ProciGen (Procedural interaction Generation), a method to +procedurally generate datasets with both, plausible interaction and diverse +object variation. We generate 1M+ human-object interaction pairs in 3D and +leverage this large-scale data to train our HDM (Hierarchical Diffusion Model), +a novel method to reconstruct interacting human and unseen objects, without any +templates. Our HDM is an image-conditioned diffusion model that learns both +realistic interaction and highly accurate human and object shapes. Experiments +show that our HDM trained with ProciGen significantly outperforms prior methods +that requires template meshes and that our dataset allows training methods with +strong generalization ability to unseen object instances. Our code and data are +released.",cs.CV,['cs.CV'] +Transferable and Principled Efficiency for Open-Vocabulary Segmentation,Jingxuan Xu · Wuyang Chen · Yao Zhao · Yunchao Wei, ,https://arxiv.org/abs/2404.07448,,2404.07448.pdf,Transferable and Principled Efficiency for Open-Vocabulary Segmentation,"Recent success of pre-trained foundation vision-language models makes +Open-Vocabulary Segmentation (OVS) possible. Despite the promising performance, +this approach introduces heavy computational overheads for two challenges: 1) +large model sizes of the backbone; 2) expensive costs during the fine-tuning. +These challenges hinder this OVS strategy from being widely applicable and +affordable in real-world scenarios. Although traditional methods such as model +compression and efficient fine-tuning can address these challenges, they often +rely on heuristics. This means that their solutions cannot be easily +transferred and necessitate re-training on different models, which comes at a +cost. In the context of efficient OVS, we target achieving performance that is +comparable to or even better than prior OVS works based on large +vision-language foundation models, by utilizing smaller models that incur lower +training costs. The core strategy is to make our efficiency principled and thus +seamlessly transferable from one OVS framework to others without further +customization. Comprehensive experiments on diverse OVS benchmarks demonstrate +our superior trade-off between segmentation accuracy and computation costs over +previous works. 
Our code is available on https://github.com/Xujxyang/OpenTrans",cs.CV,"['cs.CV', 'cs.CL', 'eess.IV']" +Low-Res Leads the Way: Improving Generalization for Super-Resolution by Self-Supervised Learning,Haoyu Chen · Wenbo Li · Jinjin Gu · Jingjing Ren · Haoze Sun · Xueyi Zou · Youliang Yan · Zhensong Zhang · Lei Zhu, ,https://arxiv.org/abs/2403.02601,,2403.02601.pdf,Low-Res Leads the Way: Improving Generalization for Super-Resolution by Self-Supervised Learning,"For image super-resolution (SR), bridging the gap between the performance on +synthetic datasets and real-world degradation scenarios remains a challenge. +This work introduces a novel ""Low-Res Leads the Way"" (LWay) training framework, +merging Supervised Pre-training with Self-supervised Learning to enhance the +adaptability of SR models to real-world images. Our approach utilizes a +low-resolution (LR) reconstruction network to extract degradation embeddings +from LR images, merging them with super-resolved outputs for LR reconstruction. +Leveraging unseen LR images for self-supervised learning guides the model to +adapt its modeling space to the target domain, facilitating fine-tuning of SR +models without requiring paired high-resolution (HR) images. The integration of +Discrete Wavelet Transform (DWT) further refines the focus on high-frequency +details. Extensive evaluations show that our method significantly improves the +generalization and detail restoration capabilities of SR models on unseen +real-world datasets, outperforming existing methods. Our training regime is +universally compatible, requiring no network architecture modifications, making +it a practical solution for real-world SR applications.",eess.IV,"['eess.IV', 'cs.CV']" +BodyMAP - Jointly Predicting Body Mesh and 3D Applied Pressure Map for People in Bed,Abhishek Tandon · Anujraaj Goyal · Henry M. Clever · Zackory Erickson, ,https://arxiv.org/abs/2404.03183,,2404.03183.pdf,BodyMAP -- Jointly Predicting Body Mesh and 3D Applied Pressure Map for People in Bed,"Accurately predicting the 3D human posture and the pressure exerted on the +body for people resting in bed, visualized as a body mesh (3D pose & shape) +with a 3D pressure map, holds significant promise for healthcare applications, +particularly, in the prevention of pressure ulcers. Current methods focus on +singular facets of the problem -- predicting only 2D/3D poses, generating 2D +pressure images, predicting pressure only for certain body regions instead of +the full body, or forming indirect approximations to the 3D pressure map. In +contrast, we introduce BodyMAP, which jointly predicts the human body mesh and +3D applied pressure map across the entire human body. Our network leverages +multiple visual modalities, incorporating both a depth image of a person in bed +and its corresponding 2D pressure image acquired from a pressure-sensing +mattress. The 3D pressure map is represented as a pressure value at each mesh +vertex and thus allows for precise localization of high-pressure regions on the +body. Additionally, we present BodyMAP-WS, a new formulation of pressure +prediction in which we implicitly learn pressure in 3D by aligning sensed 2D +pressure images with a differentiable 2D projection of the predicted 3D +pressure maps. 
In evaluations with real-world human data, our method +outperforms the current state-of-the-art technique by 25% on both body mesh and +3D applied pressure map prediction tasks for people in bed.",cs.CV,['cs.CV'] +OOSTraj: Out-of-Sight Trajectory Prediction With Vision-Positioning Denoising,Haichao Zhang · Yi Xu · Hongsheng Lu · Takayuki Shimizu · Yun Fu, ,https://arxiv.org/abs/2404.02227,,2404.02227.pdf,OOSTraj: Out-of-Sight Trajectory Prediction With Vision-Positioning Denoising,"Trajectory prediction is fundamental in computer vision and autonomous +driving, particularly for understanding pedestrian behavior and enabling +proactive decision-making. Existing approaches in this field often assume +precise and complete observational data, neglecting the challenges associated +with out-of-view objects and the noise inherent in sensor data due to limited +camera range, physical obstructions, and the absence of ground truth for +denoised sensor data. Such oversights are critical safety concerns, as they can +result in missing essential, non-visible objects. To bridge this gap, we +present a novel method for out-of-sight trajectory prediction that leverages a +vision-positioning technique. Our approach denoises noisy sensor observations +in an unsupervised manner and precisely maps sensor-based trajectories of +out-of-sight objects into visual trajectories. This method has demonstrated +state-of-the-art performance in out-of-sight noisy sensor trajectory denoising +and prediction on the Vi-Fi and JRDB datasets. By enhancing trajectory +prediction accuracy and addressing the challenges of out-of-sight objects, our +work significantly contributes to improving the safety and reliability of +autonomous driving in complex environments. Our work represents the first +initiative towards Out-Of-Sight Trajectory prediction (OOSTraj), setting a new +benchmark for future research. The code is available at +\url{https://github.com/Hai-chao-Zhang/OOSTraj}.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG', 'cs.RO']" +PAD: Patch-Agnostic Defense against Adversarial Patch Attacks,Lihua Jing · Rui Wang · Wenqi Ren · Xin Dong · Cong Zou, ,https://arxiv.org/abs/2404.16452,,2404.16452.pdf,PAD: Patch-Agnostic Defense against Adversarial Patch Attacks,"Adversarial patch attacks present a significant threat to real-world object +detectors due to their practical feasibility. Existing defense methods, which +rely on attack data or prior knowledge, struggle to effectively address a wide +range of adversarial patches. In this paper, we show two inherent +characteristics of adversarial patches, semantic independence and spatial +heterogeneity, independent of their appearance, shape, size, quantity, and +location. Semantic independence indicates that adversarial patches operate +autonomously within their semantic context, while spatial heterogeneity +manifests as distinct image quality of the patch area that differs from +original clean image due to the independent generation process. Based on these +observations, we propose PAD, a novel adversarial patch localization and +removal method that does not require prior knowledge or additional training. +PAD offers patch-agnostic defense against various adversarial patches, +compatible with any pre-trained object detectors. Our comprehensive digital and +physical experiments involving diverse patch types, such as localized noise, +printable, and naturalistic patches, exhibit notable improvements over +state-of-the-art works. 
Our code is available at +https://github.com/Lihua-Jing/PAD.",cs.CV,['cs.CV'] +ESR-NeRF: Emissive Source Reconstruction Using LDR Multi-view Images,Jinseo Jeong · Junseo Koo · Qimeng Zhang · Gunhee Kim, ,https://arxiv.org/abs/2404.15707,,2404.15707.pdf,ESR-NeRF: Emissive Source Reconstruction Using LDR Multi-view Images,"Existing NeRF-based inverse rendering methods suppose that scenes are +exclusively illuminated by distant light sources, neglecting the potential +influence of emissive sources within a scene. In this work, we confront this +limitation using LDR multi-view images captured with emissive sources turned on +and off. Two key issues must be addressed: 1) ambiguity arising from the +limited dynamic range along with unknown lighting details, and 2) the expensive +computational cost in volume rendering to backtrace the paths leading to final +object colors. We present a novel approach, ESR-NeRF, leveraging neural +networks as learnable functions to represent ray-traced fields. By training +networks to satisfy light transport segments, we regulate outgoing radiances, +progressively identifying emissive sources while being aware of reflection +areas. The results on scenes encompassing emissive sources with various +properties demonstrate the superiority of ESR-NeRF in qualitative and +quantitative ways. Our approach also extends its applicability to the scenes +devoid of emissive sources, achieving lower CD metrics on the DTU dataset.",cs.CV,['cs.CV'] +Enhancing Video Super-Resolution via Implicit Resampling-based Alignment,Kai Xu · Ziwei Yu · Xin Wang · Michael Bi Mi · Angela Yao,https://github.com/kai422/IART,https://arxiv.org/html/2305.00163v2,,2305.00163v2.pdf,Enhancing Video Super-Resolution via Implicit Resampling-based Alignment,"In video super-resolution, it is common to use a frame-wise alignment to +support the propagation of information over time. The role of alignment is +well-studied for low-level enhancement in video, but existing works overlook a +critical step -- resampling. We show through extensive experiments that for +alignment to be effective, the resampling should preserve the reference +frequency spectrum while minimizing spatial distortions. However, most existing +works simply use a default choice of bilinear interpolation for resampling even +though bilinear interpolation has a smoothing effect and hinders +super-resolution. From these observations, we propose an implicit +resampling-based alignment. The sampling positions are encoded by a sinusoidal +positional encoding, while the value is estimated with a coordinate network and +a window-based cross-attention. We show that bilinear interpolation inherently +attenuates high-frequency information while an MLP-based coordinate network can +approximate more frequencies. Experiments on synthetic and real-world datasets +show that alignment with our proposed implicit resampling enhances the +performance of state-of-the-art frameworks with minimal impact on both compute +and parameters.",cs.CV,['cs.CV'] +UniPTS: A Unified Framework for Proficient Post-Training Sparsity,JingJing Xie · Yuxin Zhang · Mingbao Lin · ZhiHang Lin · Liujuan Cao · Rongrong Ji, ,https://arxiv.org/abs/2405.18810,,2405.18810.pdf,UniPTS: A Unified Framework for Proficient Post-Training Sparsity,"Post-training Sparsity (PTS) is a recently emerged avenue that chases +efficient network sparsity with limited data in need. 
Existing PTS methods, +however, undergo significant performance degradation compared with traditional +methods that retrain the sparse networks via the whole dataset, especially at +high sparsity ratios. In this paper, we attempt to reconcile this disparity by +transposing three cardinal factors that profoundly alter the performance of +conventional sparsity into the context of PTS. Our endeavors particularly +comprise (1) A base-decayed sparsity objective that promotes efficient +knowledge transferring from dense network to the sparse counterpart. (2) A +reducing-regrowing search algorithm designed to ascertain the optimal sparsity +distribution while circumventing overfitting to the small calibration set in +PTS. (3) The employment of dynamic sparse training predicated on the preceding +aspects, aimed at comprehensively optimizing the sparsity structure while +ensuring training stability. Our proposed framework, termed UniPTS, is +validated to be much superior to existing PTS methods across extensive +benchmarks. As an illustration, it amplifies the performance of POT, a recently +proposed recipe, from 3.9% to 68.6% when pruning ResNet-50 at 90% sparsity +ratio on ImageNet. We release the code of our paper at +https://github.com/xjjxmu/UniPTS.",cs.CV,"['cs.CV', 'cs.AI']" +MMA-Diffusion: MultiModal Attack on Diffusion Models,Yijun Yang · Ruiyuan Gao · Xiaosen Wang · Tsung-Yi Ho · Xu Nan · Qiang Xu,https://github.com/cure-lab/MMA-Diffusion,https://arxiv.org/abs/2311.17516,,2311.17516.pdf,MMA-Diffusion: MultiModal Attack on Diffusion Models,"In recent years, Text-to-Image (T2I) models have seen remarkable +advancements, gaining widespread adoption. However, this progress has +inadvertently opened avenues for potential misuse, particularly in generating +inappropriate or Not-Safe-For-Work (NSFW) content. Our work introduces +MMA-Diffusion, a framework that presents a significant and realistic threat to +the security of T2I models by effectively circumventing current defensive +measures in both open-source models and commercial online services. Unlike +previous approaches, MMA-Diffusion leverages both textual and visual modalities +to bypass safeguards like prompt filters and post-hoc safety checkers, thus +exposing and highlighting the vulnerabilities in existing defense mechanisms.",cs.CR,"['cs.CR', 'cs.CV']" +Improving Training Efficiency of Diffusion Models via Multi-Stage Framework and Tailored Multi-Decoder Architectures,Huijie Zhang · Yifu Lu · Ismail Alkhouri · Saiprasad Ravishankar · Dogyoon Song · Qing Qu, ,https://arxiv.org/abs/2312.09181,,2312.09181.pdf,Improving Efficiency of Diffusion Models via Multi-Stage Framework and Tailored Multi-Decoder Architectures,"Diffusion models, emerging as powerful deep generative tools, excel in +various applications. They operate through a two-steps process: introducing +noise into training samples and then employing a model to convert random noise +into new samples (e.g., images). However, their remarkable generative +performance is hindered by slow training and sampling. This is due to the +necessity of tracking extensive forward and reverse diffusion trajectories, and +employing a large model with numerous parameters across multiple timesteps +(i.e., noise levels). To tackle these challenges, we present a multi-stage +framework inspired by our empirical findings. These observations indicate the +advantages of employing distinct parameters tailored to each timestep while +retaining universal parameters shared across all time steps. 
Our approach +involves segmenting the time interval into multiple stages where we employ +custom multi-decoder U-net architecture that blends time-dependent models with +a universally shared encoder. Our framework enables the efficient distribution +of computational resources and mitigates inter-stage interference, which +substantially improves training efficiency. Extensive numerical experiments +affirm the effectiveness of our framework, showcasing significant training and +sampling efficiency enhancements on three state-of-the-art diffusion models, +including large-scale latent diffusion models. Furthermore, our ablation +studies illustrate the impact of two important components in our framework: (i) +a novel timestep clustering algorithm for stage division, and (ii) an +innovative multi-decoder U-net architecture, seamlessly integrating universal +and customized hyperparameters.",cs.CV,['cs.CV'] +DiffEditor: Boosting Accuracy and Flexibility on Diffusion-based Image Editing,Chong Mou · Xintao Wang · Jiechong Song · Ying Shan · Jian Zhang, ,https://arxiv.org/abs/2402.02583,,2402.02583.pdf,DiffEditor: Boosting Accuracy and Flexibility on Diffusion-based Image Editing,"Large-scale Text-to-Image (T2I) diffusion models have revolutionized image +generation over the last few years. Although owning diverse and high-quality +generation capabilities, translating these abilities to fine-grained image +editing remains challenging. In this paper, we propose DiffEditor to rectify +two weaknesses in existing diffusion-based image editing: (1) in complex +scenarios, editing results often lack editing accuracy and exhibit unexpected +artifacts; (2) lack of flexibility to harmonize editing operations, e.g., +imagine new content. In our solution, we introduce image prompts in +fine-grained image editing, cooperating with the text prompt to better describe +the editing content. To increase the flexibility while maintaining content +consistency, we locally combine stochastic differential equation (SDE) into the +ordinary differential equation (ODE) sampling. In addition, we incorporate +regional score-based gradient guidance and a time travel strategy into the +diffusion sampling, further improving the editing quality. Extensive +experiments demonstrate that our method can efficiently achieve +state-of-the-art performance on various fine-grained image editing tasks, +including editing within a single image (e.g., object moving, resizing, and +content dragging) and across images (e.g., appearance replacing and object +pasting). Our source code is released at +https://github.com/MC-E/DragonDiffusion.",cs.CV,"['cs.CV', 'cs.LG']" +DiVAS: Video and Audio Synchronization with Dynamic Frame Rates,Clara Maria Fernandez Labrador · Mertcan Akcay · Eitan Abecassis · Joan Massich · Christopher Schroers, ,,https://link.springer.com/article/10.1007/s11042-023-17728-1,,,,,nan +EvDiG: Event-guided Direct and Global Components Separation,xinyu zhou · Peiqi Duan · Boyu Li · Chu Zhou · Chao Xu · Boxin Shi, ,http://export.arxiv.org/abs/2312.16933v1,,2312.16933v1.pdf,EvPlug: Learn a Plug-and-Play Module for Event and Image Fusion,"Event cameras and RGB cameras exhibit complementary characteristics in +imaging: the former possesses high dynamic range (HDR) and high temporal +resolution, while the latter provides rich texture and color information. This +makes the integration of event cameras into middle- and high-level RGB-based +vision tasks highly promising. 
However, challenges arise in multi-modal fusion, +data annotation, and model architecture design. In this paper, we propose +EvPlug, which learns a plug-and-play event and image fusion module from the +supervision of the existing RGB-based model. The learned fusion module +integrates event streams with image features in the form of a plug-in, endowing +the RGB-based model to be robust to HDR and fast motion scenes while enabling +high temporal resolution inference. Our method only requires unlabeled +event-image pairs (no pixel-wise alignment required) and does not alter the +structure or weights of the RGB-based model. We demonstrate the superiority of +EvPlug in several vision tasks such as object detection, semantic segmentation, +and 3D hand pose estimation",cs.CV,"['cs.CV', 'cs.AI']" +Face2Diffusion for Fast and Editable Face Personalization,Kaede Shiohara · Toshihiko Yamasaki,https://mapooon.github.io/Face2DiffusionPage/,https://arxiv.org/abs/2403.05094,,2403.05094.pdf,Face2Diffusion for Fast and Editable Face Personalization,"Face personalization aims to insert specific faces, taken from images, into +pretrained text-to-image diffusion models. However, it is still challenging for +previous methods to preserve both the identity similarity and editability due +to overfitting to training samples. In this paper, we propose Face2Diffusion +(F2D) for high-editability face personalization. The core idea behind F2D is +that removing identity-irrelevant information from the training pipeline +prevents the overfitting problem and improves editability of encoded faces. F2D +consists of the following three novel components: 1) Multi-scale identity +encoder provides well-disentangled identity features while keeping the benefits +of multi-scale information, which improves the diversity of camera poses. 2) +Expression guidance disentangles face expressions from identities and improves +the controllability of face expressions. 3) Class-guided denoising +regularization encourages models to learn how faces should be denoised, which +boosts the text-alignment of backgrounds. Extensive experiments on the +FaceForensics++ dataset and diverse prompts demonstrate our method greatly +improves the trade-off between the identity- and text-fidelity compared to +previous state-of-the-art methods.",cs.CV,['cs.CV'] +PeVL: Pose-Enhanced Vision-Language Model for Fine-Grained Human Action Recognition,Haosong Zhang · Mei Leong · Liyuan Li · Weisi Lin, ,https://ar5iv.labs.arxiv.org/html/2205.11169,,2205.11169.pdf,PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models,"Vision-language pre-training (VLP) has shown impressive performance on a wide +range of cross-modal tasks, where VLP models without reliance on object +detectors are becoming the mainstream due to their superior computation +efficiency and competitive performance. However, the removal of object +detectors also deprives the capability of VLP models in explicit object +modeling, which is essential to various position-sensitive vision-language (VL) +tasks, such as referring expression comprehension and visual commonsense +reasoning. To address the challenge, we introduce PEVL that enhances the +pre-training and prompt tuning of VLP models with explicit object position +modeling. Specifically, PEVL reformulates discretized object positions and +language in a unified language modeling framework, which facilitates explicit +VL alignment during pre-training, and also enables flexible prompt tuning for +various downstream tasks. 
We show that PEVL enables state-of-the-art +performance of detector-free VLP models on position-sensitive tasks such as +referring expression comprehension and phrase grounding, and also improves the +performance on position-insensitive tasks with grounded inputs. We make the +data and code for this paper publicly available at +https://github.com/thunlp/PEVL.",cs.CV,"['cs.CV', 'cs.AI', 'cs.CL']" +HOI-M$^3$: Capture Multiple Humans and Objects Interaction within Contextual Environment,Juze Zhang · Jingyan Zhang · Zining Song · Zhanhe Shi · Chengfeng Zhao · Ye Shi · Jingyi Yu · Lan Xu · Jingya Wang, ,https://arxiv.org/abs/2404.00299,,2404.00299.pdf,HOI-M3:Capture Multiple Humans and Objects Interaction within Contextual Environment,"Humans naturally interact with both others and the surrounding multiple +objects, engaging in various social activities. However, recent advances in +modeling human-object interactions mostly focus on perceiving isolated +individuals and objects, due to fundamental data scarcity. In this paper, we +introduce HOI-M3, a novel large-scale dataset for modeling the interactions of +Multiple huMans and Multiple objects. Notably, it provides accurate 3D tracking +for both humans and objects from dense RGB and object-mounted IMU inputs, +covering 199 sequences and 181M frames of diverse humans and objects under rich +activities. With the unique HOI-M3 dataset, we introduce two novel data-driven +tasks with companion strong baselines: monocular capture and unstructured +generation of multiple human-object interactions. Extensive experiments +demonstrate that our dataset is challenging and worthy of further research +about multiple human-object interactions and behavior analysis. Our HOI-M3 +dataset, corresponding codes, and pre-trained models will be disseminated to +the community for future research.",cs.CV,['cs.CV'] +GraCo: Granularity-Controllable Interactive Segmentation,Yian Zhao · Kehan Li · Zesen Cheng · Pengchong Qiao · Xiawu Zheng · Rongrong Ji · Chang Liu · Li Yuan · Jie Chen, ,https://arxiv.org/abs/2405.00587,,2405.00587.pdf,GraCo: Granularity-Controllable Interactive Segmentation,"Interactive Segmentation (IS) segments specific objects or parts in the image +according to user input. Current IS pipelines fall into two categories: +single-granularity output and multi-granularity output. The latter aims to +alleviate the spatial ambiguity present in the former. However, the +multi-granularity output pipeline suffers from limited interaction flexibility +and produces redundant results. In this work, we introduce +Granularity-Controllable Interactive Segmentation (GraCo), a novel approach +that allows precise control of prediction granularity by introducing additional +parameters to input. This enhances the customization of the interactive system +and eliminates redundancy while resolving ambiguity. Nevertheless, the +exorbitant cost of annotating multi-granularity masks and the lack of available +datasets with granularity annotations make it difficult for models to acquire +the necessary guidance to control output granularity. To address this problem, +we design an any-granularity mask generator that exploits the semantic property +of the pre-trained IS model to automatically generate abundant mask-granularity +pairs without requiring additional manual annotation. Based on these pairs, we +propose a granularity-controllable learning strategy that efficiently imparts +the granularity controllability to the IS model. 
Extensive experiments on +intricate scenarios at object and part levels demonstrate that our GraCo has +significant advantages over previous methods. This highlights the potential of +GraCo to be a flexible annotation tool, capable of adapting to diverse +segmentation scenarios. The project page: https://zhao-yian.github.io/GraCo.",cs.CV,['cs.CV'] +Deep Equilibrium Diffusion Restoration with Parallel Sampling,Jiezhang Cao · Yue Shi · Kai Zhang · Yulun Zhang · Radu Timofte · Luc Van Gool, ,https://arxiv.org/abs/2311.11600,,2311.11600.pdf,Deep Equilibrium Diffusion Restoration with Parallel Sampling,"Diffusion model-based image restoration (IR) aims to use diffusion models to +recover high-quality (HQ) images from degraded images, achieving promising +performance. Due to the inherent property of diffusion models, most existing +methods need long serial sampling chains to restore HQ images step-by-step, +resulting in expensive sampling time and high computation costs. Moreover, such +long sampling chains hinder understanding the relationship between inputs and +restoration results since it is hard to compute the gradients in the whole +chains. In this work, we aim to rethink the diffusion model-based IR models +through a different perspective, i.e., a deep equilibrium (DEQ) fixed point +system, called DeqIR. Specifically, we derive an analytical solution by +modeling the entire sampling chain in these IR models as a joint multivariate +fixed point system. Based on the analytical solution, we can conduct parallel +sampling and restore HQ images without training. Furthermore, we compute fast +gradients via DEQ inversion and found that initialization optimization can +boost image quality and control the generation direction. Extensive experiments +on benchmarks demonstrate the effectiveness of our method on typical IR tasks +and real-world settings.",cs.CV,['cs.CV'] +Puff-Net: Efficient Style Transfer with Pure Content and Style Feature Fusion Network,Sizhe Zheng · Pan Gao · Peng Zhou · Jie Qin, ,https://arxiv.org/abs/2405.19775,,2405.19775.pdf,Puff-Net: Efficient Style Transfer with Pure Content and Style Feature Fusion Network,"Style transfer aims to render an image with the artistic features of a style +image, while maintaining the original structure. Various methods have been put +forward for this task, but some challenges still exist. For instance, it is +difficult for CNN-based methods to handle global information and long-range +dependencies between input images, for which transformer-based methods have +been proposed. Although transformers can better model the relationship between +content and style images, they require high-cost hardware and time-consuming +inference. To address these issues, we design a novel transformer model that +includes only the encoder, thus significantly reducing the computational cost. +In addition, we also find that existing style transfer methods may lead to +images under-stylied or missing content. In order to achieve better +stylization, we design a content feature extractor and a style feature +extractor, based on which pure content and style images can be fed to the +transformer. Finally, we propose a novel network termed Puff-Net, i.e., pure +content and style feature fusion network. 
Through qualitative and quantitative +experiments, we demonstrate the advantages of our model compared to +state-of-the-art ones in the literature.",cs.CV,['cs.CV'] +Tactile-Augmented Radiance Fields,Yiming Dou · Fengyu Yang · Yi Liu · Antonio Loquercio · Andrew Owens, ,https://arxiv.org/abs/2405.04534,,2405.04534.pdf,Tactile-Augmented Radiance Fields,"We present a scene representation, which we call a tactile-augmented radiance +field (TaRF), that brings vision and touch into a shared 3D space. This +representation can be used to estimate the visual and tactile signals for a +given 3D position within a scene. We capture a scene's TaRF from a collection +of photos and sparsely sampled touch probes. Our approach makes use of two +insights: (i) common vision-based touch sensors are built on ordinary cameras +and thus can be registered to images using methods from multi-view geometry, +and (ii) visually and structurally similar regions of a scene share the same +tactile features. We use these insights to register touch signals to a captured +visual scene, and to train a conditional diffusion model that, provided with an +RGB-D image rendered from a neural radiance field, generates its corresponding +tactile signal. To evaluate our approach, we collect a dataset of TaRFs. This +dataset contains more touch samples than previous real-world datasets, and it +provides spatially aligned visual signals for each captured touch signal. We +demonstrate the accuracy of our cross-modal generative model and the utility of +the captured visual-tactile data on several downstream tasks. Project page: +https://dou-yiming.github.io/TaRF",cs.CV,['cs.CV'] +The Mirrored Influence Hypothesis: Efficient Data Influence Estimation by Harnessing Forward Passes,Myeongseob Ko · Feiyang Kang · Weiyan Shi · Ming Jin · Zhou Yu · Ruoxi Jia, ,https://arxiv.org/abs/2402.08922,,2402.08922.pdf,The Mirrored Influence Hypothesis: Efficient Data Influence Estimation by Harnessing Forward Passes,"Large-scale black-box models have become ubiquitous across numerous +applications. Understanding the influence of individual training data sources +on predictions made by these models is crucial for improving their +trustworthiness. Current influence estimation techniques involve computing +gradients for every training point or repeated training on different subsets. +These approaches face obvious computational challenges when scaled up to large +datasets and models. + In this paper, we introduce and explore the Mirrored Influence Hypothesis, +highlighting a reciprocal nature of influence between training and test data. +Specifically, it suggests that evaluating the influence of training data on +test predictions can be reformulated as an equivalent, yet inverse problem: +assessing how the predictions for training samples would be altered if the +model were trained on specific test samples. Through both empirical and +theoretical validations, we demonstrate the wide applicability of our +hypothesis. Inspired by this, we introduce a new method for estimating the +influence of training data, which requires calculating gradients for specific +test samples, paired with a forward pass for each training point. This approach +can capitalize on the common asymmetry in scenarios where the number of test +samples under concurrent examination is much smaller than the scale of the +training dataset, thus gaining a significant improvement in efficiency compared +to existing approaches. 
+ We demonstrate the applicability of our method across a range of scenarios, +including data attribution in diffusion models, data leakage detection, +analysis of memorization, mislabeled data detection, and tracing behavior in +language models. Our code will be made available at +https://github.com/ruoxi-jia-group/Forward-INF.",cs.LG,"['cs.LG', 'stat.ML']" +Logit Standardization in Knowledge Distillation,Shangquan Sun · Wenqi Ren · Jingzhi Li · Rui Wang · Xiaochun Cao,https://sunsean21.github.io/logit-stand-KD.html,https://arxiv.org/abs/2403.01427,,2403.01427.pdf,Logit Standardization in Knowledge Distillation,"Knowledge distillation involves transferring soft labels from a teacher to a +student using a shared temperature-based softmax function. However, the +assumption of a shared temperature between teacher and student implies a +mandatory exact match between their logits in terms of logit range and +variance. This side-effect limits the performance of student, considering the +capacity discrepancy between them and the finding that the innate logit +relations of teacher are sufficient for student to learn. To address this +issue, we propose setting the temperature as the weighted standard deviation of +logit and performing a plug-and-play Z-score pre-process of logit +standardization before applying softmax and Kullback-Leibler divergence. Our +pre-process enables student to focus on essential logit relations from teacher +rather than requiring a magnitude match, and can improve the performance of +existing logit-based distillation methods. We also show a typical case where +the conventional setting of sharing temperature between teacher and student +cannot reliably yield the authentic distillation evaluation; nonetheless, this +challenge is successfully alleviated by our Z-score. We extensively evaluate +our method for various student and teacher models on CIFAR-100 and ImageNet, +showing its significant superiority. The vanilla knowledge distillation powered +by our pre-process can achieve favorable performance against state-of-the-art +methods, and other distillation variants can obtain considerable gain with the +assistance of our pre-process.",cs.CV,['cs.CV'] +Fourier Priors-Guided Diffusion for Zero-Shot Joint Low-Light Enhancement and Deblurring,Xiaoqian Lv · Shengping Zhang · Chenyang Wang · Yichen Zheng · Bineng Zhong · Chongyi Li · Liqiang Nie, ,,https://www.sciencedirect.com/science/article/abs/pii/S0957417424005888,,,,,nan +Rethinking the Spatial Inconsistency in Classifier-Free Diffusion Guidance,Dazhong Shen · Guanglu Song · Zeyue Xue · Fu-Yun Wang · Yu Liu, ,https://arxiv.org/abs/2404.05384,,2404.05384.pdf,Rethinking the Spatial Inconsistency in Classifier-Free Diffusion Guidance,"Classifier-Free Guidance (CFG) has been widely used in text-to-image +diffusion models, where the CFG scale is introduced to control the strength of +text guidance on the whole image space. However, we argue that a global CFG +scale results in spatial inconsistency on varying semantic strengths and +suboptimal image quality. To address this problem, we present a novel approach, +Semantic-aware Classifier-Free Guidance (S-CFG), to customize the guidance +degrees for different semantic units in text-to-image diffusion models. +Specifically, we first design a training-free semantic segmentation method to +partition the latent image into relatively independent semantic regions at each +denoising step. 
In particular, the cross-attention map in the denoising U-net +backbone is renormalized for assigning each patch to the corresponding token, +while the self-attention map is used to complete the semantic regions. Then, to +balance the amplification of diverse semantic units, we adaptively adjust the +CFG scales across different semantic regions to rescale the text guidance +degrees into a uniform level. Finally, extensive experiments demonstrate the +superiority of S-CFG over the original CFG strategy on various text-to-image +diffusion models, without requiring any extra training cost. our codes are +available at https://github.com/SmilesDZgk/S-CFG.",cs.CV,"['cs.CV', 'cs.AI']" +Scalable 3D Registration via Truncated Entry-wise Absolute Residuals,Tianyu Huang · Liangzu Peng · Rene Vidal · Yun-Hui Liu, ,https://arxiv.org/abs/2404.00915,,2404.00915.pdf,Scalable 3D Registration via Truncated Entry-wise Absolute Residuals,"Given an input set of $3$D point pairs, the goal of outlier-robust $3$D +registration is to compute some rotation and translation that align as many +point pairs as possible. This is an important problem in computer vision, for +which many highly accurate approaches have been recently proposed. Despite +their impressive performance, these approaches lack scalability, often +overflowing the $16$GB of memory of a standard laptop to handle roughly +$30,000$ point pairs. In this paper, we propose a $3$D registration approach +that can process more than ten million ($10^7$) point pairs with over $99\%$ +random outliers. Moreover, our method is efficient, entails low memory costs, +and maintains high accuracy at the same time. We call our method TEAR, as it +involves minimizing an outlier-robust loss that computes Truncated Entry-wise +Absolute Residuals. To minimize this loss, we decompose the original +$6$-dimensional problem into two subproblems of dimensions $3$ and $2$, +respectively, solved in succession to global optimality via a customized +branch-and-bound method. While branch-and-bound is often slow and unscalable, +this does not apply to TEAR as we propose novel bounding functions that are +tight and computationally efficient. Experiments on various datasets are +conducted to validate the scalability and efficiency of our method.",cs.CV,"['cs.CV', 'cs.RO']" +Coupled Laplacian Eigenmaps for Locally-Aware 3D Rigid Point Cloud Matching,Matteo Bastico · Etienne Decencière · Laurent Corté · Yannick TILLIER · David Ryckelynck,https://github.com/matteo-bastico/CoupLap,,https://paperswithcode.com/paper/coupled-laplacian-eigenmaps-for-locally-aware,,,,,nan +MatFuse: Controllable Material Generation with Diffusion Models,Giuseppe Vecchio · Renato Sortino · Simone Palazzo · Concetto Spampinato,https://gvecchio.com/matfuse/,https://arxiv.org/abs/2308.11408,,2308.11408.pdf,MatFuse: Controllable Material Generation with Diffusion Models,"Creating high-quality materials in computer graphics is a challenging and +time-consuming task, which requires great expertise. To simplify this process, +we introduce MatFuse, a unified approach that harnesses the generative power of +diffusion models for creation and editing of 3D materials. Our method +integrates multiple sources of conditioning, including color palettes, +sketches, text, and pictures, enhancing creative possibilities and granting +fine-grained control over material synthesis. 
Additionally, MatFuse enables +map-level material editing capabilities through latent manipulation by means of +a multi-encoder compression model which learns a disentangled latent +representation for each map. We demonstrate the effectiveness of MatFuse under +multiple conditioning settings and explore the potential of material editing. +Finally, we assess the quality of the generated materials both quantitatively +in terms of CLIP-IQA and FID scores and qualitatively by conducting a user +study. Source code for training MatFuse and supplemental materials are publicly +available at https://gvecchio.com/matfuse.",cs.CV,"['cs.CV', 'cs.GR']" +Continuous Optical Zooming: A Benchmark for Arbitrary-Scale Image Super-Resolution in Real World,Huiyuan Fu · Fei Peng · Xianwei Li · Yejun Li · Xin Wang · Huadong Ma, ,,https://github.com/Weepingchestnut/Arbitrary-Scale-SR,,,,,nan +DETRs Beat YOLOs on Real-time Object Detection,Yian Zhao · Wenyu Lv · Shangliang Xu · Jinman Wei · Guanzhong Wang · Qingqing Dang · Yi Liu · Jie Chen, ,https://arxiv.org/html/2304.08069v3,,2304.08069v3.pdf,DETRs Beat YOLOs on Real-time Object Detection,"The YOLO series has become the most popular framework for real-time object +detection due to its reasonable trade-off between speed and accuracy. However, +we observe that the speed and accuracy of YOLOs are negatively affected by the +NMS. Recently, end-to-end Transformer-based detectors (DETRs) have provided an +alternative to eliminating NMS. Nevertheless, the high computational cost +limits their practicality and hinders them from fully exploiting the advantage +of excluding NMS. In this paper, we propose the Real-Time DEtection TRansformer +(RT-DETR), the first real-time end-to-end object detector to our best knowledge +that addresses the above dilemma. We build RT-DETR in two steps, drawing on the +advanced DETR: first we focus on maintaining accuracy while improving speed, +followed by maintaining speed while improving accuracy. Specifically, we design +an efficient hybrid encoder to expeditiously process multi-scale features by +decoupling intra-scale interaction and cross-scale fusion to improve speed. +Then, we propose the uncertainty-minimal query selection to provide +high-quality initial queries to the decoder, thereby improving accuracy. In +addition, RT-DETR supports flexible speed tuning by adjusting the number of +decoder layers to adapt to various scenarios without retraining. Our +RT-DETR-R50 / R101 achieves 53.1% / 54.3% AP on COCO and 108 / 74 FPS on T4 +GPU, outperforming previously advanced YOLOs in both speed and accuracy. We +also develop scaled RT-DETRs that outperform the lighter YOLO detectors (S and +M models). Furthermore, RT-DETR-R50 outperforms DINO-R50 by 2.2% AP in accuracy +and about 21 times in FPS. After pre-training with Objects365, RT-DETR-R50 / +R101 achieves 55.3% / 56.2% AP. The project page: +https://zhao-yian.github.io/RTDETR.",cs.CV,['cs.CV'] +Weakly-Supervised Audio-Visual Video Parsing with Prototype-based Pseudo-Labeling,Kranthi Kumar Rachavarapu · Kalyan Ramakrishnan · A. N. Rajagopalan, ,https://arxiv.org/abs/2405.10690,,2405.10690.pdf,CoLeaF: A Contrastive-Collaborative Learning Framework for Weakly Supervised Audio-Visual Video Parsing,"Weakly supervised audio-visual video parsing (AVVP) methods aim to detect +audible-only, visible-only, and audible-visible events using only video-level +labels. Existing approaches tackle this by leveraging unimodal and cross-modal +contexts. 
However, we argue that while cross-modal learning is beneficial for +detecting audible-visible events, in the weakly supervised scenario, it +negatively impacts unaligned audible or visible events by introducing +irrelevant modality information. In this paper, we propose CoLeaF, a novel +learning framework that optimizes the integration of cross-modal context in the +embedding space such that the network explicitly learns to combine cross-modal +information for audible-visible events while filtering them out for unaligned +events. Additionally, as videos often involve complex class relationships, +modelling them improves performance. However, this introduces extra +computational costs into the network. Our framework is designed to leverage +cross-class relationships during training without incurring additional +computations at inference. Furthermore, we propose new metrics to better +evaluate a method's capabilities in performing AVVP. Our extensive experiments +demonstrate that CoLeaF significantly improves the state-of-the-art results by +an average of 1.9% and 2.4% F-score on the LLP and UnAV-100 datasets, +respectively.",cs.CV,['cs.CV'] +Estimating Noisy Class Posterior with Part-level Labels for Noisy Label Learning,Rui Zhao · Bin Shi · Jianfei Ruan · Tianze Pan · Bo Dong,https://github.com/RyanZhaoIc/PLM.git,https://arxiv.org/abs/2405.05714,,2405.05714.pdf,Estimating Noisy Class Posterior with Part-level Labels for Noisy Label Learning,"In noisy label learning, estimating noisy class posteriors plays a +fundamental role for developing consistent classifiers, as it forms the basis +for estimating clean class posteriors and the transition matrix. Existing +methods typically learn noisy class posteriors by training a classification +model with noisy labels. However, when labels are incorrect, these models may +be misled to overemphasize the feature parts that do not reflect the instance +characteristics, resulting in significant errors in estimating noisy class +posteriors. To address this issue, this paper proposes to augment the +supervised information with part-level labels, encouraging the model to focus +on and integrate richer information from various parts. Specifically, our +method first partitions features into distinct parts by cropping instances, +yielding part-level labels associated with these various parts. Subsequently, +we introduce a novel single-to-multiple transition matrix to model the +relationship between the noisy and part-level labels, which incorporates +part-level labels into a classifier-consistent framework. Utilizing this +framework with part-level labels, we can learn the noisy class posteriors more +precisely by guiding the model to integrate information from various parts, +ultimately improving the classification performance. Our method is +theoretically sound, while experiments show that it is empirically effective in +synthetic and real-world noisy benchmarks.",cs.CV,"['cs.CV', 'cs.LG']" +CA-Jaccard: Camera-aware Jaccard Distance for Person Re-identification,Yiyu Chen · Zheyi Fan · Zhaoru Chen · Yixuan Zhu, ,https://arxiv.org/abs/2311.10605,,2311.10605.pdf,CA-Jaccard: Camera-aware Jaccard Distance for Person Re-identification,"Person re-identification (re-ID) is a challenging task that aims to learn +discriminative features for person retrieval. In person re-ID, Jaccard distance +is a widely used distance metric, especially in re-ranking and clustering +scenarios. 
However, we discover that camera variation has a significant +negative impact on the reliability of Jaccard distance. In particular, Jaccard +distance calculates the distance based on the overlap of relevant neighbors. +Due to camera variation, intra-camera samples dominate the relevant neighbors, +which reduces the reliability of the neighbors by introducing intra-camera +negative samples and excluding inter-camera positive samples. To overcome this +problem, we propose a novel camera-aware Jaccard (CA-Jaccard) distance that +leverages camera information to enhance the reliability of Jaccard distance. +Specifically, we design camera-aware k-reciprocal nearest neighbors (CKRNNs) to +find k-reciprocal nearest neighbors on the intra-camera and inter-camera +ranking lists, which improves the reliability of relevant neighbors and +guarantees the contribution of inter-camera samples in the overlap. Moreover, +we propose a camera-aware local query expansion (CLQE) to mine reliable samples +in relevant neighbors by exploiting camera variation as a strong constraint and +assign these samples higher weights in overlap, further improving the +reliability. Our CA-Jaccard distance is simple yet effective and can serve as a +general distance metric for person re-ID methods with high reliability and low +computational cost. Extensive experiments demonstrate the effectiveness of our +method.",cs.CV,['cs.CV'] +SatSynth: Augmenting Image-Mask Pairs through Diffusion Models for Aerial Semantic Segmentation,Aysim Toker · Marvin Eisenberger · Daniel Cremers · Laura Leal-Taixe, ,https://arxiv.org/abs/2403.16605,,2403.16605.pdf,SatSynth: Augmenting Image-Mask Pairs through Diffusion Models for Aerial Semantic Segmentation,"In recent years, semantic segmentation has become a pivotal tool in +processing and interpreting satellite imagery. Yet, a prevalent limitation of +supervised learning techniques remains the need for extensive manual +annotations by experts. In this work, we explore the potential of generative +image diffusion to address the scarcity of annotated data in earth observation +tasks. The main idea is to learn the joint data manifold of images and labels, +leveraging recent advancements in denoising diffusion probabilistic models. To +the best of our knowledge, we are the first to generate both images and +corresponding masks for satellite segmentation. We find that the obtained pairs +not only display high quality in fine-scale features but also ensure a wide +sampling diversity. Both aspects are crucial for earth observation data, where +semantic classes can vary severely in scale and occurrence frequency. We employ +the novel data instances for downstream segmentation, as a form of data +augmentation. In our experiments, we provide comparisons to prior works based +on discriminative diffusion models or GANs. We demonstrate that integrating +generated samples yields significant quantitative improvements for satellite +semantic segmentation -- both compared to baselines and when training only on +the original data.",cs.CV,['cs.CV'] +EASE-DETR: Easing the Competition among Object Queries,Yulu Gao · Yifan Sun · Xudong Ding · Chuyang Zhao · Si Liu, ,https://arxiv.org/abs/2310.08854,,2310.08854.pdf,Rank-DETR for High Quality Object Detection,"Modern detection transformers (DETRs) use a set of object queries to predict +a list of bounding boxes, sort them by their classification confidence scores, +and select the top-ranked predictions as the final detection results for the +given input image. 
A highly performant object detector requires accurate +ranking for the bounding box predictions. For DETR-based detectors, the +top-ranked bounding boxes suffer from less accurate localization quality due to +the misalignment between classification scores and localization accuracy, thus +impeding the construction of high-quality detectors. In this work, we introduce +a simple and highly performant DETR-based object detector by proposing a series +of rank-oriented designs, combinedly called Rank-DETR. Our key contributions +include: (i) a rank-oriented architecture design that can prompt positive +predictions and suppress the negative ones to ensure lower false positive +rates, as well as (ii) a rank-oriented loss function and matching cost design +that prioritizes predictions of more accurate localization accuracy during +ranking to boost the AP under high IoU thresholds. We apply our method to +improve the recent SOTA methods (e.g., H-DETR and DINO-DETR) and report strong +COCO object detection results when using different backbones such as +ResNet-$50$, Swin-T, and Swin-L, demonstrating the effectiveness of our +approach. Code is available at \url{https://github.com/LeapLabTHU/Rank-DETR}.",cs.CV,"['cs.CV', 'cs.LG']" +Summarize the Past to Predict the Future: Natural Language Descriptions of Context Boost Multimodal Object Interaction Anticipation,Razvan Pasca · Alexey Gavryushin · Muhammad Hamza · Yen-Ling Kuo · Kaichun Mo · Luc Van Gool · Otmar Hilliges · Xi Wang, ,,https://dblp.org/rec/journals/corr/abs-2301-09209,,,,,nan +LaMPilot: An Open Benchmark Dataset for Autonomous Driving with Language Model Programs,Yunsheng Ma · Can Cui · Xu Cao · Wenqian Ye · Peiran Liu · Juanwu Lu · Amr Abdelraouf · Rohit Gupta · Kyungtae Han · Aniket Bera · James Rehg · Ziran Wang, ,https://arxiv.org/abs/2312.04372v2,,2312.04372v2.pdf,LaMPilot: An Open Benchmark Dataset for Autonomous Driving with Language Model Programs,"Autonomous driving (AD) has made significant strides in recent years. +However, existing frameworks struggle to interpret and execute spontaneous user +instructions, such as ""overtake the car ahead."" Large Language Models (LLMs) +have demonstrated impressive reasoning capabilities showing potential to bridge +this gap. In this paper, we present LaMPilot, a novel framework that integrates +LLMs into AD systems, enabling them to follow user instructions by generating +code that leverages established functional primitives. We also introduce +LaMPilot-Bench, the first benchmark dataset specifically designed to +quantitatively evaluate the efficacy of language model programs in AD. Adopting +the LaMPilot framework, we conduct extensive experiments to assess the +performance of off-the-shelf LLMs on LaMPilot-Bench. Our results demonstrate +the potential of LLMs in handling diverse driving scenarios and following user +instructions in driving. To facilitate further research in this area, we +release our code and data at https://github.com/PurdueDigitalTwin/LaMPilot.",cs.CL,"['cs.CL', 'cs.AI']" +C3: High-performance and low-complexity neural compression from a single image or video,Hyunjik Kim · Matthias Bauer · Lucas Theis · Jonathan Richard Schwarz · Emilien Dupont, ,https://arxiv.org/abs/2312.02753,,2312.02753.pdf,C3: High-performance and low-complexity neural compression from a single image or video,"Most neural compression models are trained on large datasets of images or +videos in order to generalize to unseen data. 
Such generalization typically +requires large and expressive architectures with a high decoding complexity. +Here we introduce C3, a neural compression method with strong rate-distortion +(RD) performance that instead overfits a small model to each image or video +separately. The resulting decoding complexity of C3 can be an order of +magnitude lower than neural baselines with similar RD performance. C3 builds on +COOL-CHIC (Ladune et al.) and makes several simple and effective improvements +for images. We further develop new methodology to apply C3 to videos. On the +CLIC2020 image benchmark, we match the RD performance of VTM, the reference +implementation of the H.266 codec, with less than 3k MACs/pixel for decoding. +On the UVG video benchmark, we match the RD performance of the Video +Compression Transformer (Mentzer et al.), a well-established neural video +codec, with less than 5k MACs/pixel for decoding.",eess.IV,"['eess.IV', 'cs.CV', 'cs.LG', 'stat.ML']" +Quantifying Uncertainty in Motion Prediction with Variational Bayesian Mixture,Juanwu Lu · Can Cui · Yunsheng Ma · Aniket Bera · Ziran Wang, ,https://arxiv.org/abs/2404.03789,,2404.03789.pdf,Quantifying Uncertainty in Motion Prediction with Variational Bayesian Mixture,"Safety and robustness are crucial factors in developing trustworthy +autonomous vehicles. One essential aspect of addressing these factors is to +equip vehicles with the capability to predict future trajectories for all +moving objects in the surroundings and quantify prediction uncertainties. In +this paper, we propose the Sequential Neural Variational Agent (SeNeVA), a +generative model that describes the distribution of future trajectories for a +single moving object. Our approach can distinguish Out-of-Distribution data +while quantifying uncertainty and achieving competitive performance compared to +state-of-the-art methods on the Argoverse 2 and INTERACTION datasets. +Specifically, a 0.446 meters minimum Final Displacement Error, a 0.203 meters +minimum Average Displacement Error, and a 5.35% Miss Rate are achieved on the +INTERACTION test set. Extensive qualitative and quantitative analysis is also +provided to evaluate the proposed model. Our open-source code is available at +https://github.com/PurdueDigitalTwin/seneva.",cs.CV,"['cs.CV', 'cs.AI']" +Multi-criteria Token Fusion with One-step-ahead Attention for Efficient Vision Transformers,Sanghyeok Lee · Joonmyung Choi · Hyunwoo J. Kim,https://github.com/mlvlab/MCTF,https://arxiv.org/abs/2403.10030,,2403.10030.pdf,Multi-criteria Token Fusion with One-step-ahead Attention for Efficient Vision Transformers,"Vision Transformer (ViT) has emerged as a prominent backbone for computer +vision. For more efficient ViTs, recent works lessen the quadratic cost of the +self-attention layer by pruning or fusing the redundant tokens. However, these +works faced the speed-accuracy trade-off caused by the loss of information. +Here, we argue that token fusion needs to consider diverse relations between +tokens to minimize information loss. In this paper, we propose a Multi-criteria +Token Fusion (MCTF), that gradually fuses the tokens based on multi-criteria +(e.g., similarity, informativeness, and size of fused tokens). Further, we +utilize the one-step-ahead attention, which is the improved approach to capture +the informativeness of the tokens. By training the model equipped with MCTF +using a token reduction consistency, we achieve the best speed-accuracy +trade-off in the image classification (ImageNet1K). 
Experimental results prove +that MCTF consistently surpasses the previous reduction methods with and +without training. Specifically, DeiT-T and DeiT-S with MCTF reduce FLOPs by +about 44% while improving the performance (+0.5%, and +0.3%) over the base +model, respectively. We also demonstrate the applicability of MCTF in various +Vision Transformers (e.g., T2T-ViT, LV-ViT), achieving at least 31% speedup +without performance degradation. Code is available at +https://github.com/mlvlab/MCTF.",cs.CV,['cs.CV'] +Fooling Polarization-based Vision using Locally Controllable Polarizing Projection,Zhuoxiao Li · Zhihang Zhong · Shohei Nobuhara · Ko Nishino · Yinqiang Zheng, ,,https://paperswithcode.com/search?q=author:Ko+Nishino,,,,,nan +MuGE: Multiple Granularity Edge Detection,Caixia Zhou · Yaping Huang · Mengyang Pu · Qingji Guan · Ruoxi Deng · Haibin Ling, ,,https://www.semanticscholar.org/paper/Practical-Edge-Detection-via-Robust-Collaborative-Fu-Guo/1b7f58d62ac5bcb292da96863482ade8348c9534,,,,,nan +Unsupervised Learning of Category-Level 3D Pose from Object-Centric Videos,Leonhard Sommer · Artur Jesslen · Eddy Ilg · Adam Kortylewski, ,https://arxiv.org/abs/2404.05626,,2404.05626.pdf,Learning a Category-level Object Pose Estimator without Pose Annotations,"3D object pose estimation is a challenging task. Previous works always +require thousands of object images with annotated poses for learning the 3D +pose correspondence, which is laborious and time-consuming for labeling. In +this paper, we propose to learn a category-level 3D object pose estimator +without pose annotations. Instead of using manually annotated images, we +leverage diffusion models (e.g., Zero-1-to-3) to generate a set of images under +controlled pose differences and propose to learn our object pose estimator with +those images. Directly using the original diffusion model leads to images with +noisy poses and artifacts. To tackle this issue, firstly, we exploit an image +encoder, which is learned from a specially designed contrastive pose learning, +to filter the unreasonable details and extract image feature maps. +Additionally, we propose a novel learning strategy that allows the model to +learn object poses from those generated image sets without knowing the +alignment of their canonical poses. Experimental results show that our method +has the capability of category-level object pose estimation from a single shot +setting (as pose definition), while significantly outperforming other +state-of-the-art methods on the few-shot category-level object pose estimation +benchmarks.",cs.CV,['cs.CV'] +Long-Tailed Anomaly Detection with Learnable Class Names,Chih-Hui Ho · Kuan-Chuan Peng · Nuno Vasconcelos,http://www.svcl.ucsd.edu/projects/ltad/,https://arxiv.org/abs/2403.20236,,,Long-Tailed Anomaly Detection with Learnable Class Names,"Anomaly detection (AD) aims to identify defective images and localize their +defects (if any). Ideally, AD models should be able to detect defects over many +image classes; without relying on hard-coded class names that can be +uninformative or inconsistent across datasets; learn without anomaly +supervision; and be robust to the long-tailed distributions of real-world +applications. To address these challenges, we formulate the problem of +long-tailed AD by introducing several datasets with different levels of class +imbalance and metrics for performance evaluation. 
We then propose a novel +method, LTAD, to detect defects from multiple and long-tailed classes, without +relying on dataset class names. LTAD combines AD by reconstruction and semantic +AD modules. AD by reconstruction is implemented with a transformer-based +reconstruction module. Semantic AD is implemented with a binary classifier, +which relies on learned pseudo class names and a pretrained foundation model. +These modules are learned over two phases. Phase 1 learns the pseudo-class +names and a variational autoencoder (VAE) for feature synthesis that augments +the training data to combat long-tails. Phase 2 then learns the parameters of +the reconstruction and classification modules of LTAD. Extensive experiments +using the proposed long-tailed datasets show that LTAD substantially +outperforms the state-of-the-art methods for most forms of dataset imbalance. +The long-tailed dataset split is available at +https://zenodo.org/records/10854201 .",cs.CV,['cs.CV'] +DiffusionRegPose: Enhancing Multi-Person Pose Estimation using a Diffusion-Based End-to-End Regression Approach,Dayi Tan · Hansheng Chen · Wei Tian · Lu Xiong, ,https://arxiv.org/abs/2401.04921,,2401.04921.pdf,Diffusion-based Pose Refinement and Muti-hypothesis Generation for 3D Human Pose Estimaiton,"Previous probabilistic models for 3D Human Pose Estimation (3DHPE) aimed to +enhance pose accuracy by generating multiple hypotheses. However, most of the +hypotheses generated deviate substantially from the true pose. Compared to +deterministic models, the excessive uncertainty in probabilistic models leads +to weaker performance in single-hypothesis prediction. To address these two +challenges, we propose a diffusion-based refinement framework called DRPose, +which refines the output of deterministic models by reverse diffusion and +achieves more suitable multi-hypothesis prediction for the current pose +benchmark by multi-step refinement with multiple noises. To this end, we +propose a Scalable Graph Convolution Transformer (SGCT) and a Pose Refinement +Module (PRM) for denoising and refining. Extensive experiments on Human3.6M and +MPI-INF-3DHP datasets demonstrate that our method achieves state-of-the-art +performance on both single and multi-hypothesis 3DHPE. Code is available at +https://github.com/KHB1698/DRPose.",cs.CV,['cs.CV'] +Style Blind Domain Generalized Semantic Segmentation via Covariance Alignment and Semantic Consistence Contrastive Learning,Woo-Jin Ahn · Geun-Yeong Yang · Hyunduck Choi · Myo-Taeg Lim,https://github.com/root0yang/BlindNet,https://arxiv.org/abs/2403.06122,,2403.06122.pdf,Style Blind Domain Generalized Semantic Segmentation via Covariance Alignment and Semantic Consistence Contrastive Learning,"Deep learning models for semantic segmentation often experience performance +degradation when deployed to unseen target domains unidentified during the +training phase. This is mainly due to variations in image texture (\ie style) +from different data sources. To tackle this challenge, existing domain +generalized semantic segmentation (DGSS) methods attempt to remove style +variations from the feature. However, these approaches struggle with the +entanglement of style and content, which may lead to the unintentional removal +of crucial content information, causing performance degradation. This study +addresses this limitation by proposing BlindNet, a novel DGSS approach that +blinds the style without external modules or datasets. 
The main idea behind our +proposed approach is to alleviate the effect of style in the encoder whilst +facilitating robust segmentation in the decoder. To achieve this, BlindNet +comprises two key components: covariance alignment and semantic consistency +contrastive learning. Specifically, the covariance alignment trains the encoder +to uniformly recognize various styles and preserve the content information of +the feature, rather than removing the style-sensitive factor. Meanwhile, +semantic consistency contrastive learning enables the decoder to construct +discriminative class embedding space and disentangles features that are +vulnerable to misclassification. Through extensive experiments, our approach +outperforms existing DGSS methods, exhibiting robustness and superior +performance for semantic segmentation on unseen target domains.",cs.CV,['cs.CV'] +Deep-TROJ: An Inference Stage Trojan Insertion Algorithm through Efficient Weight Replacement Attack,Sabbir Ahmed · RANYANG ZHOU · Shaahin Angizi · Adnan Rakin Rakin, ,,,,,,,nan +Robust Image Denoising through Adversarial Frequency Mixup,Donghun Ryou · Inju Ha · Hyewon Yoo · Dongwan Kim · Bohyung Han, ,https://arxiv.org/abs/2306.16050,,2306.16050.pdf,Evaluating Similitude and Robustness of Deep Image Denoising Models via Adversarial Attack,"Deep neural networks (DNNs) have shown superior performance comparing to +traditional image denoising algorithms. However, DNNs are inevitably vulnerable +while facing adversarial attacks. In this paper, we propose an adversarial +attack method named denoising-PGD which can successfully attack all the current +deep denoising models while keep the noise distribution almost unchanged. We +surprisingly find that the current mainstream non-blind denoising models +(DnCNN, FFDNet, ECNDNet, BRDNet), blind denoising models (DnCNN-B, Noise2Noise, +RDDCNN-B, FAN), plug-and-play (DPIR, CurvPnP) and unfolding denoising models +(DeamNet) almost share the same adversarial sample set on both grayscale and +color images, respectively. Shared adversarial sample set indicates that all +these models are similar in term of local behaviors at the neighborhood of all +the test samples. Thus, we further propose an indicator to measure the local +similarity of models, called robustness similitude. Non-blind denoising models +are found to have high robustness similitude across each other, while +hybrid-driven models are also found to have high robustness similitude with +pure data-driven non-blind denoising models. According to our robustness +assessment, data-driven non-blind denoising models are the most robust. We use +adversarial training to complement the vulnerability to adversarial attacks. +Moreover, the model-driven image denoising BM3D shows resistance on adversarial +attacks.",cs.CV,"['cs.CV', 'cs.LG', 'eess.IV']" +Complementing Event Streams and RGB Frames for Hand Mesh Reconstruction,Jianping Jiang · xinyu zhou · Bingxuan Wang · Xiaoming Deng · Chao Xu · Boxin Shi, ,https://arxiv.org/abs/2403.07346v1,,2403.07346v1.pdf,Complementing Event Streams and RGB Frames for Hand Mesh Reconstruction,"Reliable hand mesh reconstruction (HMR) from commonly-used color and depth +sensors is challenging especially under scenarios with varied illuminations and +fast motions. Event camera is a highly promising alternative for its high +dynamic range and dense temporal resolution properties, but it lacks key +texture appearance for hand mesh reconstruction. 
In this paper, we propose +EvRGBHand -- the first approach for 3D hand mesh reconstruction with an event +camera and an RGB camera compensating for each other. By fusing two modalities +of data across time, space, and information dimensions,EvRGBHand can tackle +overexposure and motion blur issues in RGB-based HMR and foreground scarcity +and background overflow issues in event-based HMR. We further propose +EvRGBDegrader, which allows our model to generalize effectively in challenging +scenes, even when trained solely on standard scenes, thus reducing data +acquisition costs. Experiments on real-world data demonstrate that EvRGBHand +can effectively solve the challenging issues when using either type of camera +alone via retaining the merits of both, and shows the potential of +generalization to outdoor scenes and another type of event camera.",cs.CV,['cs.CV'] +Friendly Sharpness-Aware Minimization,Tao Li · Pan Zhou · Zhengbao He · Xinwen Cheng · Xiaolin Huang, ,https://arxiv.org/abs/2403.12350,,2403.12350.pdf,Friendly Sharpness-Aware Minimization,"Sharpness-Aware Minimization (SAM) has been instrumental in improving deep +neural network training by minimizing both training loss and loss sharpness. +Despite the practical success, the mechanisms behind SAM's generalization +enhancements remain elusive, limiting its progress in deep learning +optimization. In this work, we investigate SAM's core components for +generalization improvement and introduce ""Friendly-SAM"" (F-SAM) to further +enhance SAM's generalization. Our investigation reveals the key role of +batch-specific stochastic gradient noise within the adversarial perturbation, +i.e., the current minibatch gradient, which significantly influences SAM's +generalization performance. By decomposing the adversarial perturbation in SAM +into full gradient and stochastic gradient noise components, we discover that +relying solely on the full gradient component degrades generalization while +excluding it leads to improved performance. The possible reason lies in the +full gradient component's increase in sharpness loss for the entire dataset, +creating inconsistencies with the subsequent sharpness minimization step solely +on the current minibatch data. Inspired by these insights, F-SAM aims to +mitigate the negative effects of the full gradient component. It removes the +full gradient estimated by an exponentially moving average (EMA) of historical +stochastic gradients, and then leverages stochastic gradient noise for improved +generalization. Moreover, we provide theoretical validation for the EMA +approximation and prove the convergence of F-SAM on non-convex problems. +Extensive experiments demonstrate the superior generalization performance and +robustness of F-SAM over vanilla SAM. Code is available at +https://github.com/nblt/F-SAM.",cs.LG,['cs.LG'] +Efficient Hyperparameter Optimization with Adaptive Fidelity Identification,Jiantong Jiang · Zeyi Wen · Atif Mansoor · Ajmal Mian, ,https://arxiv.org/html/2405.15605v2,,2405.15605v2.pdf,Fast-PGM: Fast Probabilistic Graphical Model Learning and Inference,"Probabilistic graphical models (PGMs) serve as a powerful framework for +modeling complex systems with uncertainty and extracting valuable insights from +data. However, users face challenges when applying PGMs to their problems in +terms of efficiency and usability. This paper presents Fast-PGM, an efficient +and open-source library for PGM learning and inference. 
Fast-PGM supports +comprehensive tasks on PGMs, including structure and parameter learning, as +well as exact and approximate inference, and enhances efficiency of the tasks +through computational and memory optimizations and parallelization techniques. +Concurrently, Fast-PGM furnishes developers with flexible building blocks, +furnishes learners with detailed documentation, and affords non-experts +user-friendly interfaces, thereby ameliorating the usability of PGMs to users +across a spectrum of expertise levels. The source code of Fast-PGM is available +at https://github.com/jjiantong/FastPGM.",cs.LG,['cs.LG'] +Exploring Pose-Aware Human-Object Interaction via Hybrid Learning,EASTMAN Z Y WU · Yali Li · Yuan Wang · Shengjin Wang, ,https://arxiv.org/abs/2403.07246,,2403.07246.pdf,Towards Zero-shot Human-Object Interaction Detection via Vision-Language Integration,"Human-object interaction (HOI) detection aims to locate human-object pairs +and identify their interaction categories in images. Most existing methods +primarily focus on supervised learning, which relies on extensive manual HOI +annotations. In this paper, we propose a novel framework, termed Knowledge +Integration to HOI (KI2HOI), that effectively integrates the knowledge of +visual-language model to improve zero-shot HOI detection. Specifically, the +verb feature learning module is designed based on visual semantics, by +employing the verb extraction decoder to convert corresponding verb queries +into interaction-specific category representations. We develop an effective +additive self-attention mechanism to generate more comprehensive visual +representations. Moreover, the innovative interaction representation decoder +effectively extracts informative regions by integrating spatial and visual +feature information through a cross-attention mechanism. To deal with zero-shot +learning in low-data, we leverage a priori knowledge from the CLIP text encoder +to initialize the linear classifier for enhanced interaction understanding. +Extensive experiments conducted on the mainstream HICO-DET and V-COCO datasets +demonstrate that our model outperforms the previous methods in various +zero-shot and full-supervised settings.",cs.CV,['cs.CV'] +HPL-ESS: Hybrid Pseudo-Labeling for Unsupervised Event-based Semantic Segmentation,Linglin Jing · Yiming Ding · Yunpeng Gao · Zhigang Wang · Xu Yan · Dong Wang · Gerald Schaefer · Hui Fang · Bin Zhao · Xuelong Li, ,https://arxiv.org/abs/2403.16788,,2403.16788.pdf,HPL-ESS: Hybrid Pseudo-Labeling for Unsupervised Event-based Semantic Segmentation,"Event-based semantic segmentation has gained popularity due to its capability +to deal with scenarios under high-speed motion and extreme lighting conditions, +which cannot be addressed by conventional RGB cameras. Since it is hard to +annotate event data, previous approaches rely on event-to-image reconstruction +to obtain pseudo labels for training. However, this will inevitably introduce +noise, and learning from noisy pseudo labels, especially when generated from a +single source, may reinforce the errors. This drawback is also called +confirmation bias in pseudo-labeling. In this paper, we propose a novel hybrid +pseudo-labeling framework for unsupervised event-based semantic segmentation, +HPL-ESS, to alleviate the influence of noisy pseudo labels. In particular, we +first employ a plain unsupervised domain adaptation framework as our baseline, +which can generate a set of pseudo labels through self-training. 
Then, we +incorporate offline event-to-image reconstruction into the framework, and +obtain another set of pseudo labels by predicting segmentation maps on the +reconstructed images. A noisy label learning strategy is designed to mix the +two sets of pseudo labels and enhance the quality. Moreover, we propose a soft +prototypical alignment module to further improve the consistency of target +domain features. Extensive experiments show that our proposed method +outperforms existing state-of-the-art methods by a large margin on the +DSEC-Semantic dataset (+5.88% accuracy, +10.32% mIoU), which even surpasses +several supervised methods.",cs.CV,['cs.CV'] +Text-to-3D Generation with Bidirectional Diffusion using both 3D and 2D priors,Lihe Ding · Shaocong Dong · Zhanpeng Huang · Zibin Wang · Yiyuan Zhang · Kaixiong Gong · Dan Xu · Tianfan Xue, ,https://arxiv.org/abs/2312.04963,,2312.04963.pdf,Text-to-3D Generation with Bidirectional Diffusion using both 2D and 3D priors,"Most 3D generation research focuses on up-projecting 2D foundation models +into the 3D space, either by minimizing 2D Score Distillation Sampling (SDS) +loss or fine-tuning on multi-view datasets. Without explicit 3D priors, these +methods often lead to geometric anomalies and multi-view inconsistency. +Recently, researchers have attempted to improve the genuineness of 3D objects +by directly training on 3D datasets, albeit at the cost of low-quality texture +generation due to the limited texture diversity in 3D datasets. To harness the +advantages of both approaches, we propose Bidirectional Diffusion(BiDiff), a +unified framework that incorporates both a 3D and a 2D diffusion process, to +preserve both 3D fidelity and 2D texture richness, respectively. Moreover, as a +simple combination may yield inconsistent generation results, we further bridge +them with novel bidirectional guidance. In addition, our method can be used as +an initialization of optimization-based models to further improve the quality +of 3D model and efficiency of optimization, reducing the generation process +from 3.4 hours to 20 minutes. Experimental results have shown that our model +achieves high-quality, diverse, and scalable 3D generation. 
Project website: +https://bidiff.github.io/.",cs.CV,"['cs.CV', 'cs.AI']" +Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives,Kristen Grauman · Andrew Westbury · Lorenzo Torresani · Kris Kitani · Jitendra Malik · Triantafyllos Afouras · Kumar Ashutosh · Vijay Baiyya · Siddhant Bansal · Bikram Boote · Eugene Byrne · Zachary Chavis · Joya Chen · Feng Cheng · Fu-Jen Chu · Sean Crane · Avijit Dasgupta · Jing Dong · Maria Escobar · Cristhian David Forigua Diaz · Abrham Gebreselasie · Sanjay Haresh · Jing Huang · Md Mohaiminul Islam · Suyog Jain · Rawal Khirodkar · Devansh Kukreja · Kevin Liang · Jia-Wei Liu · Sagnik Majumder · Yongsen Mao · Miguel Martin · Effrosyni Mavroudi · Tushar Nagarajan · Francesco Ragusa · Santhosh Kumar Ramakrishnan · Luigi Seminara · Arjun Somayazulu · Yale Song · Shan Su · Zihui Xue · Edward Zhang · Jinxu Zhang · Angela Castillo · Changan Chen · Fu Xinzhu · Ryosuke Furuta · Cristina González · Gupta · Jiabo Hu · Yifei Huang · Yiming Huang · Weslie Khoo · Anush Kumar · Robert Kuo · Sach Lakhavani · Miao Liu · Mi Luo · Zhengyi Luo · Brighid Meredith · Austin Miller · Oluwatumininu Oguntola · Xiaqing Pan · Penny Peng · Shraman Pramanick · Merey Ramazanova · Fiona Ryan · Wei Shan · Kiran Somasundaram · Chenan Song · Audrey Southerland · Masatoshi Tateno · Huiyu Wang · Yuchen Wang · Takuma Yagi · Mingfei Yan · Xitong Yang · Zecheng Yu · Shengxin Zha · Chen Zhao · Ziwei Zhao · Zhifan Zhu · Jeff Zhuo · Pablo ARBELAEZ · Gedas Bertasius · Dima Damen · Jakob Engel · Giovanni Maria Farinella · Antonino Furnari · Bernard Ghanem · Judy Hoffman · C.V. Jawahar · Richard Newcombe · Hyun Soo Park · James Rehg · Yoichi Sato · Manolis Savva · Jianbo Shi · Mike Zheng Shou · Michael Wray,https://ego-exo4d-data.org,https://arxiv.org/abs/2311.18259,,2311.18259.pdf,Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives,"We present Ego-Exo4D, a diverse, large-scale multimodal multiview video +dataset and benchmark challenge. Ego-Exo4D centers around +simultaneously-captured egocentric and exocentric video of skilled human +activities (e.g., sports, music, dance, bike repair). 740 participants from 13 +cities worldwide performed these activities in 123 different natural scene +contexts, yielding long-form captures from 1 to 42 minutes each and 1,286 hours +of video combined. The multimodal nature of the dataset is unprecedented: the +video is accompanied by multichannel audio, eye gaze, 3D point clouds, camera +poses, IMU, and multiple paired language descriptions -- including a novel +""expert commentary"" done by coaches and teachers and tailored to the +skilled-activity domain. To push the frontier of first-person video +understanding of skilled human activity, we also present a suite of benchmark +tasks and their annotations, including fine-grained activity understanding, +proficiency estimation, cross-view translation, and 3D hand/body pose. All +resources are open sourced to fuel new research in the community. 
Project page: +http://ego-exo4d-data.org/",cs.CV,"['cs.CV', 'cs.AI']" +Control4D: Efficient 4D Portrait Editing with Text,Ruizhi Shao · Jingxiang Sun · Cheng Peng · Zerong Zheng · Boyao ZHOU · Hongwen Zhang · Yebin Liu,https://control4darxiv.github.io,https://arxiv.org/abs/2405.17405,,2405.17405.pdf,Human4DiT: Free-view Human Video Generation with 4D Diffusion Transformer,"We present a novel approach for generating high-quality, spatio-temporally +coherent human videos from a single image under arbitrary viewpoints. Our +framework combines the strengths of U-Nets for accurate condition injection and +diffusion transformers for capturing global correlations across viewpoints and +time. The core is a cascaded 4D transformer architecture that factorizes +attention across views, time, and spatial dimensions, enabling efficient +modeling of the 4D space. Precise conditioning is achieved by injecting human +identity, camera parameters, and temporal signals into the respective +transformers. To train this model, we curate a multi-dimensional dataset +spanning images, videos, multi-view data and 3D/4D scans, along with a +multi-dimensional training strategy. Our approach overcomes the limitations of +previous methods based on GAN or UNet-based diffusion models, which struggle +with complex motions and viewpoint changes. Through extensive experiments, we +demonstrate our method's ability to synthesize realistic, coherent and +free-view human videos, paving the way for advanced multimedia applications in +areas such as virtual reality and animation. Our project website is +https://human4dit.github.io.",cs.CV,['cs.CV'] +Building Bridges across Spatial and Temporal Resolutions: Reference-Based Super-Resolution via Change Priors and Conditional Diffusion Model,Runmin Dong · Shuai Yuan · Bin Luo · Mengxuan Chen · Jinxiao Zhang · Lixian Zhang · Weijia Li · Juepeng Zheng · Haohuan Fu, ,https://arxiv.org/abs/2403.17460,,2403.17460.pdf,Building Bridges across Spatial and Temporal Resolutions: Reference-Based Super-Resolution via Change Priors and Conditional Diffusion Model,"Reference-based super-resolution (RefSR) has the potential to build bridges +across spatial and temporal resolutions of remote sensing images. However, +existing RefSR methods are limited by the faithfulness of content +reconstruction and the effectiveness of texture transfer in large scaling +factors. Conditional diffusion models have opened up new opportunities for +generating realistic high-resolution images, but effectively utilizing +reference images within these models remains an area for further exploration. +Furthermore, content fidelity is difficult to guarantee in areas without +relevant reference information. To solve these issues, we propose a +change-aware diffusion model named Ref-Diff for RefSR, using the land cover +change priors to guide the denoising process explicitly. Specifically, we +inject the priors into the denoising model to improve the utilization of +reference information in unchanged areas and regulate the reconstruction of +semantically relevant content in changed areas. With this powerful guidance, we +decouple the semantics-guided denoising and reference texture-guided denoising +processes to improve the model performance. Extensive experiments demonstrate +the superior effectiveness and robustness of the proposed method compared with +state-of-the-art RefSR methods in both quantitative and qualitative +evaluations. 
The code and data are available at +https://github.com/dongrunmin/RefDiff.",eess.IV,"['eess.IV', 'cs.CV']" +MaskCLR: Attention-Guided Contrastive Learning for Robust Action Representation Learning,Mohamed Abdelfattah · Mariam Hassan · Alex Alahi, ,https://arxiv.org/abs/2312.04819,,2312.04819.pdf,Attention-Guided Contrastive Role Representations for Multi-Agent Reinforcement Learning,"Real-world multi-agent tasks usually involve dynamic team composition with +the emergence of roles, which should also be a key to efficient cooperation in +multi-agent reinforcement learning (MARL). Drawing inspiration from the +correlation between roles and agent's behavior patterns, we propose a novel +framework of **A**ttention-guided **CO**ntrastive **R**ole representation +learning for **M**ARL (**ACORM**) to promote behavior heterogeneity, knowledge +transfer, and skillful coordination across agents. First, we introduce mutual +information maximization to formalize role representation learning, derive a +contrastive learning objective, and concisely approximate the distribution of +negative pairs. Second, we leverage an attention mechanism to prompt the global +state to attend to learned role representations in value decomposition, +implicitly guiding agent coordination in a skillful role space to yield more +expressive credit assignment. Experiments on challenging StarCraft II +micromanagement and Google research football tasks demonstrate the +state-of-the-art performance of our method and its advantages over existing +approaches. Our code is available at +[https://github.com/NJU-RL/ACORM](https://github.com/NJU-RL/ACORM).",cs.MA,['cs.MA'] +"Unifying Correspondence, Pose and NeRF for Pose-Free Novel View Synthesis from Stereo Pairs",Sunghwan Hong · Jaewoo Jung · Heeseong Shin · Jiaolong Yang · Chong Luo · Seungryong Kim,https://ku-cvlab.github.io/CoPoNeRF/,https://arxiv.org/abs/2312.07246,,2312.07246.pdf,"Unifying Correspondence, Pose and NeRF for Pose-Free Novel View Synthesis from Stereo Pairs","This work delves into the task of pose-free novel view synthesis from stereo +pairs, a challenging and pioneering task in 3D vision. Our innovative +framework, unlike any before, seamlessly integrates 2D correspondence matching, +camera pose estimation, and NeRF rendering, fostering a synergistic enhancement +of these tasks. We achieve this through designing an architecture that utilizes +a shared representation, which serves as a foundation for enhanced 3D geometry +understanding. Capitalizing on the inherent interplay between the tasks, our +unified framework is trained end-to-end with the proposed training strategy to +improve overall model accuracy. Through extensive evaluations across diverse +indoor and outdoor scenes from two real-world datasets, we demonstrate that our +approach achieves substantial improvement over previous methodologies, +especially in scenarios characterized by extreme viewpoint changes and the +absence of accurate camera poses.",cs.CV,['cs.CV'] +Neural Markov Random Field for Stereo Matching,Tongfan Guan · Chen Wang · Yun-Hui Liu,https://github.com/aeolusguan/NMRF,https://arxiv.org/abs/2403.11193,,2403.11193.pdf,Neural Markov Random Field for Stereo Matching,"Stereo matching is a core task for many computer vision and robotics +applications. Despite their dominance in traditional stereo methods, the +hand-crafted Markov Random Field (MRF) models lack sufficient modeling accuracy +compared to end-to-end deep models. 
While deep learning representations have +greatly improved the unary terms of the MRF models, the overall accuracy is +still severely limited by the hand-crafted pairwise terms and message passing. +To address these issues, we propose a neural MRF model, where both potential +functions and message passing are designed using data-driven neural networks. +Our fully data-driven model is built on the foundation of variational inference +theory, to prevent convergence issues and retain stereo MRF's graph inductive +bias. To make the inference tractable and scale well to high-resolution images, +we also propose a Disparity Proposal Network (DPN) to adaptively prune the +search space of disparity. The proposed approach ranks $1^{st}$ on both KITTI +2012 and 2015 leaderboards among all published methods while running faster +than 100 ms. This approach significantly outperforms prior global methods, +e.g., lowering D1 metric by more than 50% on KITTI 2015. In addition, our +method exhibits strong cross-domain generalization and can recover sharp edges. +The codes at https://github.com/aeolusguan/NMRF",cs.CV,['cs.CV'] +Self-supervised debiasing using low rank regularization,Geon Yeong Park · Chanyong Jung · Sangmin Lee · Jong Chul Ye · Sang Wan Lee, ,,https://bispl.weebly.com/bispl-news/four-papers-got-accepted-for-cvpr-2024,,,,,nan +Language-driven Object Fusion into Neural Radiance Fields with Pose-Conditioned Dataset Updates,Ka Chun SHUM · Jaeyeon Kim · Binh-Son Hua · Thanh Nguyen · Sai-Kit Yeung,https://github.com/kcshum/pose-conditioned-NeRF-object-fusion,https://arxiv.org/abs/2309.11281,,2309.11281.pdf,Language-driven Object Fusion into Neural Radiance Fields with Pose-Conditioned Dataset Updates,"Neural radiance field is an emerging rendering method that generates +high-quality multi-view consistent images from a neural scene representation +and volume rendering. Although neural radiance field-based techniques are +robust for scene reconstruction, their ability to add or remove objects remains +limited. This paper proposes a new language-driven approach for object +manipulation with neural radiance fields through dataset updates. Specifically, +to insert a new foreground object represented by a set of multi-view images +into a background radiance field, we use a text-to-image diffusion model to +learn and generate combined images that fuse the object of interest into the +given background across views. These combined images are then used for refining +the background radiance field so that we can render view-consistent images +containing both the object and the background. To ensure view consistency, we +propose a dataset updates strategy that prioritizes radiance field training +with camera views close to the already-trained views prior to propagating the +training to remaining views. We show that under the same dataset updates +strategy, we can easily adapt our method for object insertion using data from +text-to-3D models as well as object removal. 
Experimental results show that our +method generates photorealistic images of the edited scenes, and outperforms +state-of-the-art methods in 3D reconstruction and neural radiance field +blending.",cs.CV,['cs.CV'] +Frozen Feature Augmentation for Few-Shot Image Classification,Andreas Bär · Neil Houlsby · Mostafa Dehghani · Manoj Kumar,https://frozen-feature-augmentation.github.io/,https://arxiv.org/abs/2403.10519,,2403.10519.pdf,Frozen Feature Augmentation for Few-Shot Image Classification,"Training a linear classifier or lightweight model on top of pretrained vision +model outputs, so-called 'frozen features', leads to impressive performance on +a number of downstream few-shot tasks. Currently, frozen features are not +modified during training. On the other hand, when networks are trained directly +on images, data augmentation is a standard recipe that improves performance +with no substantial overhead. In this paper, we conduct an extensive pilot +study on few-shot image classification that explores applying data +augmentations in the frozen feature space, dubbed 'frozen feature augmentation +(FroFA)', covering twenty augmentations in total. Our study demonstrates that +adopting a deceptively simple pointwise FroFA, such as brightness, can improve +few-shot performance consistently across three network architectures, three +large pretraining datasets, and eight transfer datasets.",cs.CV,['cs.CV'] +VSCode: General Visual Salient and Camouflaged Object Detection with 2D Prompt Learning,Ziyang Luo · Nian Liu · Wangbo Zhao · Xuguang Yang · Dingwen Zhang · Deng-Ping Fan · Fahad Shahbaz Khan · Junwei Han, ,https://arxiv.org/abs/2311.15011,,2311.15011.pdf,VSCode: General Visual Salient and Camouflaged Object Detection with 2D Prompt Learning,"Salient object detection (SOD) and camouflaged object detection (COD) are +related yet distinct binary mapping tasks. These tasks involve multiple +modalities, sharing commonalities and unique cues. Existing research often +employs intricate task-specific specialist models, potentially leading to +redundancy and suboptimal results. We introduce VSCode, a generalist model with +novel 2D prompt learning, to jointly address four SOD tasks and three COD +tasks. We utilize VST as the foundation model and introduce 2D prompts within +the encoder-decoder architecture to learn domain and task-specific knowledge on +two separate dimensions. A prompt discrimination loss helps disentangle +peculiarities to benefit model optimization. VSCode outperforms +state-of-the-art methods across six tasks on 26 datasets and exhibits zero-shot +generalization to unseen tasks by combining 2D prompts, such as RGB-D COD. +Source code has been available at https://github.com/Sssssuperior/VSCode.",cs.CV,['cs.CV'] +GLiDR: Topologically Regularized Graph Generative Network for Sparse LiDAR Point Clouds,Prashant Kumar · Kshitij Madhav Bhat · Vedang Bhupesh Shenvi Nadkarni · Prem Kalra, ,https://arxiv.org/abs/2312.00068,,2312.00068.pdf,GLiDR: Topologically Regularized Graph Generative Network for Sparse LiDAR Point Clouds,"Sparse LiDAR point clouds cause severe loss of detail of static structures +and reduce the density of static points available for navigation. Reduced +density can be detrimental to navigation under several scenarios. We observe +that despite high sparsity, in most cases, the global topology of LiDAR +outlining the static structures can be inferred. 
We utilize this property to +obtain a backbone skeleton of a LiDAR scan in the form of a single connected +component that is a proxy to its global topology. We utilize the backbone to +augment new points along static structures to overcome sparsity. Newly +introduced points could correspond to existing static structures or to static +points that were earlier obstructed by dynamic objects. To the best of our +knowledge, we are the first to use such a strategy for sparse LiDAR point +clouds. Existing solutions close to our approach fail to identify and preserve +the global static LiDAR topology and generate sub-optimal points. We propose +GLiDR, a Graph Generative network that is topologically regularized using +0-dimensional Persistent Homology ($\mathcal{PH}$) constraints. This enables +GLiDR to introduce newer static points along a topologically consistent global +static LiDAR backbone. GLiDR generates precise static points using $32\times$ +sparser dynamic scans and performs better than the baselines across three +datasets. GLiDR generates a valuable byproduct - an accurate binary +segmentation mask of static and dynamic objects that are helpful for navigation +planning and safety in constrained environments. The newly introduced static +points allow GLiDR to outperform LiDAR-based navigation using SLAM in several +settings. Source code is available at https://kshitijbhat.github.io/glidr",cs.RO,"['cs.RO', 'cs.CV']" +The STVchrono Dataset: Towards Continuous Change Recognition in Time,Yanjun Sun · Yue Qiu · Mariia Khan · Fumiya Matsuzawa · Kenji Iwata, ,,https://www.youtube.com/watch?v=44o-Xl60ipI,,,,,nan +NECA: Neural Customizable Human Avatar,Junjin Xiao · Qing Zhang · Zhan Xu · Wei-Shi Zheng,https://github.com/iSEE-Laboratory/NECA,https://arxiv.org/abs/2403.10335,,2403.10335.pdf,NECA: Neural Customizable Human Avatar,"Human avatar has become a novel type of 3D asset with various applications. +Ideally, a human avatar should be fully customizable to accommodate different +settings and environments. In this work, we introduce NECA, an approach capable +of learning versatile human representation from monocular or sparse-view +videos, enabling granular customization across aspects such as pose, shadow, +shape, lighting and texture. The core of our approach is to represent humans in +complementary dual spaces and predict disentangled neural fields of geometry, +albedo, shadow, as well as an external lighting, from which we are able to +derive realistic rendering with high-frequency details via volumetric +rendering. Extensive experiments demonstrate the advantage of our method over +the state-of-the-art methods in photorealistic rendering, as well as various +editing tasks such as novel pose synthesis and relighting. The code is +available at https://github.com/iSEE-Laboratory/NECA.",cs.CV,['cs.CV'] +Continual Segmentation with Disentangled Objectness Learning and Class Recognition,Yizheng Gong · Siyue Yu · Xiaoyang Wang · Jimin Xiao, ,https://arxiv.org/abs/2403.03477,,2403.03477.pdf,Continual Segmentation with Disentangled Objectness Learning and Class Recognition,"Most continual segmentation methods tackle the problem as a per-pixel +classification task. However, such a paradigm is very challenging, and we find +query-based segmenters with built-in objectness have inherent advantages +compared with per-pixel ones, as objectness has strong transfer ability and +forgetting resistance. 
Based on these findings, we propose CoMasTRe by +disentangling continual segmentation into two stages: forgetting-resistant +continual objectness learning and well-researched continual classification. +CoMasTRe uses a two-stage segmenter learning class-agnostic mask proposals at +the first stage and leaving recognition to the second stage. During continual +learning, a simple but effective distillation is adopted to strengthen +objectness. To further mitigate the forgetting of old classes, we design a +multi-label class distillation strategy suited for segmentation. We assess the +effectiveness of CoMasTRe on PASCAL VOC and ADE20K. Extensive experiments show +that our method outperforms per-pixel and query-based methods on both datasets. +Code will be available at https://github.com/jordangong/CoMasTRe.",cs.CV,['cs.CV'] +Text2Loc: 3D Point Cloud Localization from Natural Language,Yan Xia · Letian Shi · Zifeng Ding · João F. Henriques · Daniel Cremers, ,https://arxiv.org/abs/2311.15977,,2311.15977.pdf,Text2Loc: 3D Point Cloud Localization from Natural Language,"We tackle the problem of 3D point cloud localization based on a few natural +linguistic descriptions and introduce a novel neural network, Text2Loc, that +fully interprets the semantic relationship between points and text. Text2Loc +follows a coarse-to-fine localization pipeline: text-submap global place +recognition, followed by fine localization. In global place recognition, +relational dynamics among each textual hint are captured in a hierarchical +transformer with max-pooling (HTM), whereas a balance between positive and +negative pairs is maintained using text-submap contrastive learning. Moreover, +we propose a novel matching-free fine localization method to further refine the +location predictions, which completely removes the need for complicated +text-instance matching and is lighter, faster, and more accurate than previous +methods. Extensive experiments show that Text2Loc improves the localization +accuracy by up to $2\times$ over the state-of-the-art on the KITTI360Pose +dataset. Our project page is publicly available at +\url{https://yan-xia.github.io/projects/text2loc/}.",cs.CV,['cs.CV'] +OVER-NAV: Elevating Iterative Vision-and-Language Navigation with Open-Vocabulary Detection and StructurEd Representation,Ganlong Zhao · Guanbin Li · Weikai Chen · Yizhou Yu, ,https://arxiv.org/abs/2403.17334,,2403.17334.pdf,OVER-NAV: Elevating Iterative Vision-and-Language Navigation with Open-Vocabulary Detection and StructurEd Representation,"Recent advances in Iterative Vision-and-Language Navigation (IVLN) introduce +a more meaningful and practical paradigm of VLN by maintaining the agent's +memory across tours of scenes. Although the long-term memory aligns better with +the persistent nature of the VLN task, it poses more challenges on how to +utilize the highly unstructured navigation memory with extremely sparse +supervision. Towards this end, we propose OVER-NAV, which aims to go over and +beyond the current arts of IVLN techniques. In particular, we propose to +incorporate LLMs and open-vocabulary detectors to distill key information and +establish correspondence between multi-modal signals. Such a mechanism +introduces reliable cross-modal supervision and enables on-the-fly +generalization to unseen scenes without the need of extra annotation and +re-training. 
To fully exploit the interpreted navigation data, we further +introduce a structured representation, coded Omnigraph, to effectively +integrate multi-modal information along the tour. Accompanied with a novel +omnigraph fusion mechanism, OVER-NAV is able to extract the most relevant +knowledge from omnigraph for a more accurate navigating action. In addition, +OVER-NAV seamlessly supports both discrete and continuous environments under a +unified framework. We demonstrate the superiority of OVER-NAV in extensive +experiments.",cs.CV,['cs.CV'] +Make-Your-Anchor: A Diffusion-based 2D Avatar Generation Framework,Ziyao Huang · Fan Tang · Yong Zhang · Xiaodong Cun · Juan Cao · Jintao Li · Tong-yee Lee,https://github.com/ICTMCG/Make-Your-Anchor,https://arxiv.org/abs/2403.16510,,2403.16510.pdf,Make-Your-Anchor: A Diffusion-based 2D Avatar Generation Framework,"Despite the remarkable process of talking-head-based avatar-creating +solutions, directly generating anchor-style videos with full-body motions +remains challenging. In this study, we propose Make-Your-Anchor, a novel system +necessitating only a one-minute video clip of an individual for training, +subsequently enabling the automatic generation of anchor-style videos with +precise torso and hand movements. Specifically, we finetune a proposed +structure-guided diffusion model on input video to render 3D mesh conditions +into human appearances. We adopt a two-stage training strategy for the +diffusion model, effectively binding movements with specific appearances. To +produce arbitrary long temporal video, we extend the 2D U-Net in the frame-wise +diffusion model to a 3D style without additional training cost, and a simple +yet effective batch-overlapped temporal denoising module is proposed to bypass +the constraints on video length during inference. Finally, a novel +identity-specific face enhancement module is introduced to improve the visual +quality of facial regions in the output videos. Comparative experiments +demonstrate the effectiveness and superiority of the system in terms of visual +quality, temporal coherence, and identity preservation, outperforming SOTA +diffusion/non-diffusion methods. Project page: +\url{https://github.com/ICTMCG/Make-Your-Anchor}.",cs.CV,['cs.CV'] +SDSTrack: Self-Distillation Symmetric Adapter Learning for Multi-Modal Visual Object Tracking,Xiaojun Hou · Jiazheng Xing · Yijie Qian · Yaowei Guo · Shuo Xin · Junhao Chen · Kai Tang · Mengmeng Wang · Zhengkai Jiang · Liang Liu · Yong Liu,https://github.com/hoqolo/SDSTrack,https://arxiv.org/abs/2403.16002,,2403.16002.pdf,SDSTrack: Self-Distillation Symmetric Adapter Learning for Multi-Modal Visual Object Tracking,"Multimodal Visual Object Tracking (VOT) has recently gained significant +attention due to its robustness. Early research focused on fully fine-tuning +RGB-based trackers, which was inefficient and lacked generalized representation +due to the scarcity of multimodal data. Therefore, recent studies have utilized +prompt tuning to transfer pre-trained RGB-based trackers to multimodal data. +However, the modality gap limits pre-trained knowledge recall, and the +dominance of the RGB modality persists, preventing the full utilization of +information from other modalities. To address these issues, we propose a novel +symmetric multimodal tracking framework called SDSTrack. 
We introduce +lightweight adaptation for efficient fine-tuning, which directly transfers the +feature extraction ability from RGB to other domains with a small number of +trainable parameters and integrates multimodal features in a balanced, +symmetric manner. Furthermore, we design a complementary masked patch +distillation strategy to enhance the robustness of trackers in complex +environments, such as extreme weather, poor imaging, and sensor failure. +Extensive experiments demonstrate that SDSTrack outperforms state-of-the-art +methods in various multimodal tracking scenarios, including RGB+Depth, +RGB+Thermal, and RGB+Event tracking, and exhibits impressive results in extreme +conditions. Our source code is available at https://github.com/hoqolo/SDSTrack.",cs.CV,['cs.CV'] +"Photo-SLAM: Real-time Simultaneous Localization and Photorealistic Mapping for Monocular, Stereo, and RGB-D Cameras",Huajian Huang · Longwei Li · Hui Cheng · Sai-Kit Yeung, ,https://arxiv.org/abs/2311.16728,,2311.16728.pdf,"Photo-SLAM: Real-time Simultaneous Localization and Photorealistic Mapping for Monocular, Stereo, and RGB-D Cameras","The integration of neural rendering and the SLAM system recently showed +promising results in joint localization and photorealistic view reconstruction. +However, existing methods, fully relying on implicit representations, are so +resource-hungry that they cannot run on portable devices, which deviates from +the original intention of SLAM. In this paper, we present Photo-SLAM, a novel +SLAM framework with a hyper primitives map. Specifically, we simultaneously +exploit explicit geometric features for localization and learn implicit +photometric features to represent the texture information of the observed +environment. In addition to actively densifying hyper primitives based on +geometric features, we further introduce a Gaussian-Pyramid-based training +method to progressively learn multi-level features, enhancing photorealistic +mapping performance. The extensive experiments with monocular, stereo, and +RGB-D datasets prove that our proposed system Photo-SLAM significantly +outperforms current state-of-the-art SLAM systems for online photorealistic +mapping, e.g., PSNR is 30% higher and rendering speed is hundreds of times +faster in the Replica dataset. Moreover, the Photo-SLAM can run at real-time +speed using an embedded platform such as Jetson AGX Orin, showing the potential +of robotics applications.",cs.CV,['cs.CV'] +Hourglass Tokenizer for Efficient Transformer-Based 3D Human Pose Estimation,Wenhao Li · Mengyuan Liu · Hong Liu · Pichao Wang · Jialun Cai · Nicu Sebe,https://github.com/NationalGAILab/HoT,,https://paperswithcode.com/paper/hourglass-tokenizer-for-efficient-transformer,,,,,nan +Learning Background Prompts to Discover Implicit Knowledge for Open Vocabulary Object Detection,Jiaming Li · Jiacheng Zhang · Jichang Li · Ge Li · Si Liu · Liang Lin · Guanbin Li, ,https://arxiv.org/abs/2404.09216,,2404.09216.pdf,DetCLIPv3: Towards Versatile Generative Open-vocabulary Object Detection,"Existing open-vocabulary object detectors typically require a predefined set +of categories from users, significantly confining their application scenarios. +In this paper, we introduce DetCLIPv3, a high-performing detector that excels +not only at both open-vocabulary object detection, but also generating +hierarchical labels for detected objects. DetCLIPv3 is characterized by three +core designs: 1. 
Versatile model architecture: we derive a robust open-set +detection framework which is further empowered with generation ability via the +integration of a caption head. 2. High information density data: we develop an +auto-annotation pipeline leveraging visual large language model to refine +captions for large-scale image-text pairs, providing rich, multi-granular +object labels to enhance the training. 3. Efficient training strategy: we +employ a pre-training stage with low-resolution inputs that enables the object +captioner to efficiently learn a broad spectrum of visual concepts from +extensive image-text paired data. This is followed by a fine-tuning stage that +leverages a small number of high-resolution samples to further enhance +detection performance. With these effective designs, DetCLIPv3 demonstrates +superior open-vocabulary detection performance, \eg, our Swin-T backbone model +achieves a notable 47.0 zero-shot fixed AP on the LVIS minival benchmark, +outperforming GLIPv2, GroundingDINO, and DetCLIPv2 by 18.0/19.6/6.6 AP, +respectively. DetCLIPv3 also achieves a state-of-the-art 19.7 AP in dense +captioning task on VG dataset, showcasing its strong generative capability.",cs.CV,['cs.CV'] +3D Feature Tracking via Event Camera,Siqi Li · Zhou Zhikuan · Zhou Xue · Yipeng Li · Shaoyi Du · Yue Gao, ,https://cvpr.thecvf.com/Conferences/2023/AuthorQAEventCameras,,,,,,nan +Multiagent Multitraversal Multimodal Self-Driving: Open MARS Dataset,Yiming Li · Zhiheng Li · Nuo Chen · Moonjun Gong · Zonglin Lyu · Zehong Wang · Peili Jiang · Chen Feng, ,https://ar5iv.labs.arxiv.org/html/2202.08449,,2202.08449.pdf,V2X-Sim: Multi-Agent Collaborative Perception Dataset and Benchmark for Autonomous Driving,"Vehicle-to-everything (V2X) communication techniques enable the collaboration +between vehicles and many other entities in the neighboring environment, which +could fundamentally improve the perception system for autonomous driving. +However, the lack of a public dataset significantly restricts the research +progress of collaborative perception. To fill this gap, we present V2X-Sim, a +comprehensive simulated multi-agent perception dataset for V2X-aided autonomous +driving. V2X-Sim provides: (1) \hl{multi-agent} sensor recordings from the +road-side unit (RSU) and multiple vehicles that enable collaborative +perception, (2) multi-modality sensor streams that facilitate multi-modality +perception, and (3) diverse ground truths that support various perception +tasks. Meanwhile, we build an open-source testbed and provide a benchmark for +the state-of-the-art collaborative perception algorithms on three tasks, +including detection, tracking and segmentation. V2X-Sim seeks to stimulate +collaborative perception research for autonomous driving before realistic +datasets become widely available. Our dataset and code are available at +\url{https://ai4ce.github.io/V2X-Sim/}.",cs.CV,['cs.CV'] +Taming Stable Diffusion for Text to 360$^{\circ}$ Panorama Image Generation,Cheng Zhang · Qianyi Wu · Camilo Cruz Gambardella · Xiaoshui Huang · Dinh Phung · Wanli Ouyang · Jianfei Cai, ,https://arxiv.org/abs/2404.07949,,2404.07949.pdf,Taming Stable Diffusion for Text to 360° Panorama Image Generation,"Generative models, e.g., Stable Diffusion, have enabled the creation of +photorealistic images from text prompts. Yet, the generation of 360-degree +panorama images from text remains a challenge, particularly due to the dearth +of paired text-panorama data and the domain gap between panorama and +perspective images. 
In this paper, we introduce a novel dual-branch diffusion +model named PanFusion to generate a 360-degree image from a text prompt. We +leverage the stable diffusion model as one branch to provide prior knowledge in +natural image generation and register it to another panorama branch for +holistic image generation. We propose a unique cross-attention mechanism with +projection awareness to minimize distortion during the collaborative denoising +process. Our experiments validate that PanFusion surpasses existing methods +and, thanks to its dual-branch structure, can integrate additional constraints +like room layout for customized panorama outputs. Code is available at +https://chengzhag.github.io/publication/panfusion.",cs.CV,['cs.CV'] +Frequency-aware Event-based Video Deblurring for Real-World Motion Blur,Taewoo Kim · Hoonhee Cho · Kuk-Jin Yoon, ,https://arxiv.org/abs/2404.12168,,,Real-World Efficient Blind Motion Deblurring via Blur Pixel Discretization,"As recent advances in mobile camera technology have enabled the capability to +capture high-resolution images, such as 4K images, the demand for an efficient +deblurring model handling large motion has increased. In this paper, we +discover that the image residual errors, i.e., blur-sharp pixel differences, +can be grouped into some categories according to their motion blur type and how +complex their neighboring pixels are. Inspired by this, we decompose the +deblurring (regression) task into blur pixel discretization (pixel-level blur +classification) and discrete-to-continuous conversion (regression with blur +class map) tasks. Specifically, we generate the discretized image residual +errors by identifying the blur pixels and then transform them to a continuous +form, which is computationally more efficient than naively solving the original +regression problem with continuous values. Here, we found that the +discretization result, i.e., blur segmentation map, remarkably exhibits visual +similarity with the image residual errors. As a result, our efficient model +shows comparable performance to state-of-the-art methods in realistic +benchmarks, while our method is up to 10 times computationally more efficient.",cs.CV,"['cs.CV', 'cs.AI']" +Snapshot Lidar: Fourier embedding of amplitude and phase for single-image depth reconstruction,Sarah Friday · Yunzi Shi · Yaswanth Kumar Cherivirala · Vishwanath Saragadam · Adithya Pediredla, ,https://arxiv.org/abs/2311.10950,,2311.10950.pdf,Single-shot Phase Retrieval from a Fractional Fourier Transform Perspective,"The realm of classical phase retrieval concerns itself with the arduous task +of recovering a signal from its Fourier magnitude measurements, which are +fraught with inherent ambiguities. A single-exposure intensity measurement is +commonly deemed insufficient for the reconstruction of the primal signal, given +that the absent phase component is imperative for the inverse transformation. +In this work, we present a novel single-shot phase retrieval paradigm from a +fractional Fourier transform (FrFT) perspective, which involves integrating the +FrFT-based physical measurement model within a self-supervised reconstruction +scheme. Specifically, the proposed FrFT-based measurement model addresses the +aliasing artifacts problem in the numerical calculation of Fresnel diffraction, +featuring adaptability to both short-distance and long-distance propagation +scenarios. 
Moreover, the intensity measurement in the FrFT domain proves highly +effective in alleviating the ambiguities of phase retrieval and relaxing the +previous conditions on oversampled or multiple measurements in the Fourier +domain. Furthermore, the proposed self-supervised reconstruction approach +harnesses the fast discrete algorithm of FrFT alongside untrained neural +network priors, thereby attaining preeminent results. Through numerical +simulations, we demonstrate that both amplitude and phase objects can be +effectively retrieved from a single-shot intensity measurement using the +proposed approach and provide a promising technique for support-free coherent +diffraction imaging.",cs.CV,"['cs.CV', 'physics.optics']" +ODCR: Orthogonal Decoupling Contrastive Regularization for Unpaired Image Dehazing,Zhongze Wang · Haitao Zhao · Jingchao Peng · Lujian Yao · Kaijie Zhao, ,https://arxiv.org/abs/2404.17825,,2404.17825.pdf,ODCR: Orthogonal Decoupling Contrastive Regularization for Unpaired Image Dehazing,"Unpaired image dehazing (UID) holds significant research importance due to +the challenges in acquiring haze/clear image pairs with identical backgrounds. +This paper proposes a novel method for UID named Orthogonal Decoupling +Contrastive Regularization (ODCR). Our method is grounded in the assumption +that an image consists of both haze-related features, which influence the +degree of haze, and haze-unrelated features, such as texture and semantic +information. ODCR aims to ensure that the haze-related features of the dehazing +result closely resemble those of the clear image, while the haze-unrelated +features align with the input hazy image. To accomplish the motivation, +Orthogonal MLPs optimized geometrically on the Stiefel manifold are proposed, +which can project image features into an orthogonal space, thereby reducing the +relevance between different features. Furthermore, a task-driven Depth-wise +Feature Classifier (DWFC) is proposed, which assigns weights to the orthogonal +features based on the contribution of each channel's feature in predicting +whether the feature source is hazy or clear in a self-supervised fashion. +Finally, a Weighted PatchNCE (WPNCE) loss is introduced to achieve the pulling +of haze-related features in the output image toward those of clear images, +while bringing haze-unrelated features close to those of the hazy input. +Extensive experiments demonstrate the superior performance of our ODCR method +on UID.",cs.CV,['cs.CV'] +MaxQ: Multi-Axis Query for N:M Sparsity Network,Jingyang Xiang · Siqi Li · Junhao Chen · Zhuangzhi Chen · Tianxin Huang · Linpeng Peng · Yong Liu,https://github.com/JingyangXiang/MaxQ,https://arxiv.org/abs/2312.07061,,2312.07061.pdf,MaxQ: Multi-Axis Query for N:M Sparsity Network,"N:M sparsity has received increasing attention due to its remarkable +performance and latency trade-off compared with structured and unstructured +sparsity. However, existing N:M sparsity methods do not differentiate the +relative importance of weights among blocks and leave important weights +underappreciated. Besides, they directly apply N:M sparsity to the whole +network, which will cause severe information loss. Thus, they are still +sub-optimal. In this paper, we propose an efficient and effective Multi-Axis +Query methodology, dubbed as MaxQ, to rectify these problems. During the +training, MaxQ employs a dynamic approach to generate soft N:M masks, +considering the weight importance across multiple axes. 
This method enhances +the weights with more importance and ensures more effective updates. Meanwhile, +a sparsity strategy that gradually increases the percentage of N:M weight +blocks is applied, which allows the network to heal from the pruning-induced +damage progressively. During the runtime, the N:M soft masks can be precomputed +as constants and folded into weights without causing any distortion to the +sparse pattern and incurring additional computational overhead. Comprehensive +experiments demonstrate that MaxQ achieves consistent improvements across +diverse CNN architectures in various computer vision tasks, including image +classification, object detection and instance segmentation. For ResNet50 with +1:16 sparse pattern, MaxQ can achieve 74.6\% top-1 accuracy on ImageNet and +improve by over 2.8\% over the state-of-the-art. Codes and checkpoints are +available at \url{https://github.com/JingyangXiang/MaxQ}.",cs.CV,['cs.CV'] +Decomposing Disease Descriptions for Enhanced Pathology Detection: A Multi-Aspect Vision-Language Pre-training Framework,Vu Minh Hieu Phan · Yutong Xie · Yuankai Qi · Lingqiao Liu · Liyang Liu · Bowen Zhang · Zhibin Liao · Qi Wu · Minh-Son To · Johan Verjans, ,https://arxiv.org/abs/2403.07636v2,,2403.07636v2.pdf,Decomposing Disease Descriptions for Enhanced Pathology Detection: A Multi-Aspect Vision-Language Pre-training Framework,"Medical vision language pre-training (VLP) has emerged as a frontier of +research, enabling zero-shot pathological recognition by comparing the query +image with the textual descriptions for each disease. Due to the complex +semantics of biomedical texts, current methods struggle to align medical images +with key pathological findings in unstructured reports. This leads to the +misalignment with the target disease's textual representation. In this paper, +we introduce a novel VLP framework designed to dissect disease descriptions +into their fundamental aspects, leveraging prior knowledge about the visual +manifestations of pathologies. This is achieved by consulting a large language +model and medical experts. Integrating a Transformer module, our approach +aligns an input image with the diverse elements of a disease, generating +aspect-centric image representations. By consolidating the matches from each +aspect, we improve the compatibility between an image and its associated +disease. Additionally, capitalizing on the aspect-oriented representations, we +present a dual-head Transformer tailored to process known and unknown diseases, +optimizing the comprehensive detection efficacy. Conducting experiments on +seven downstream datasets, ours improves the accuracy of recent methods by up +to 8.56% and 17.0% for seen and unseen categories, respectively. Our code is +released at https://github.com/HieuPhan33/MAVL.",cs.CV,['cs.CV'] +EAGLE: Eigen Aggregation Learning for Object-Centric Unsupervised Semantic Segmentation,Chanyoung Kim · Woojung Han · Dayun Ju · Seong Jae Hwang,https://micv-yonsei.github.io/eagle2024/,https://arxiv.org/abs/2403.01482,,2403.01482.pdf,EAGLE: Eigen Aggregation Learning for Object-Centric Unsupervised Semantic Segmentation,"Semantic segmentation has innately relied on extensive pixel-level annotated +data, leading to the emergence of unsupervised methodologies. Among them, +leveraging self-supervised Vision Transformers for unsupervised semantic +segmentation (USS) has been making steady progress with expressive deep +features. 
Yet, for semantically segmenting images with complex objects, a +predominant challenge remains: the lack of explicit object-level semantic +encoding in patch-level features. This technical limitation often leads to +inadequate segmentation of complex objects with diverse structures. To address +this gap, we present a novel approach, EAGLE, which emphasizes object-centric +representation learning for unsupervised semantic segmentation. Specifically, +we introduce EiCue, a spectral technique providing semantic and structural cues +through an eigenbasis derived from the semantic similarity matrix of deep image +features and color affinity from an image. Further, by incorporating our +object-centric contrastive loss with EiCue, we guide our model to learn +object-level representations with intra- and inter-image object-feature +consistency, thereby enhancing semantic accuracy. Extensive experiments on +COCO-Stuff, Cityscapes, and Potsdam-3 datasets demonstrate the state-of-the-art +USS results of EAGLE with accurate and consistent semantic segmentation across +complex scenes.",cs.CV,['cs.CV'] +StableVITON: Learning Semantic Correspondence with Latent Diffusion Model for Virtual Try-On,Jeongho Kim · Gyojung Gu · Minho Park · Sunghyun Park · Jaegul Choo,https://rlawjdghek.github.io/StableVITON/,https://arxiv.org/abs/2312.01725,,2312.01725.pdf,StableVITON: Learning Semantic Correspondence with Latent Diffusion Model for Virtual Try-On,"Given a clothing image and a person image, an image-based virtual try-on aims +to generate a customized image that appears natural and accurately reflects the +characteristics of the clothing image. In this work, we aim to expand the +applicability of the pre-trained diffusion model so that it can be utilized +independently for the virtual try-on task.The main challenge is to preserve the +clothing details while effectively utilizing the robust generative capability +of the pre-trained model. In order to tackle these issues, we propose +StableVITON, learning the semantic correspondence between the clothing and the +human body within the latent space of the pre-trained diffusion model in an +end-to-end manner. Our proposed zero cross-attention blocks not only preserve +the clothing details by learning the semantic correspondence but also generate +high-fidelity images by utilizing the inherent knowledge of the pre-trained +model in the warping process. Through our proposed novel attention total +variation loss and applying augmentation, we achieve the sharp attention map, +resulting in a more precise representation of clothing details. StableVITON +outperforms the baselines in qualitative and quantitative evaluation, showing +promising quality in arbitrary person images. Our code is available at +https://github.com/rlawjdghek/StableVITON.",cs.CV,['cs.CV'] +Towards Robust 3D Object Detection with LiDAR and 4D Radar Fusion in Various Weather Conditions,Yujeong Chae · Hyeonseong Kim · Kuk-Jin Yoon, ,https://arxiv.org/abs/2310.00944,,2310.00944.pdf,Towards Robust 3D Object Detection In Rainy Conditions,"LiDAR sensors are used in autonomous driving applications to accurately +perceive the environment. However, they are affected by adverse weather +conditions such as snow, fog, and rain. These everyday phenomena introduce +unwanted noise into the measurements, severely degrading the performance of +LiDAR-based perception systems. In this work, we propose a framework for +improving the robustness of LiDAR-based 3D object detectors against road spray. 
+Our approach uses a state-of-the-art adverse weather detection network to +filter out spray from the LiDAR point cloud, which is then used as input for +the object detector. In this way, the detected objects are less affected by the +adverse weather in the scene, resulting in a more accurate perception of the +environment. In addition to adverse weather filtering, we explore the use of +radar targets to further filter false positive detections. Tests on real-world +data show that our approach improves the robustness to road spray of several +popular 3D object detectors.",cs.CV,"['cs.CV', 'cs.LG']" +ConCon-Chi: Concept-Context Chimera Benchmark for Personalized Vision-Language Tasks,Andrea Rosasco · Stefano Berti · Giulia Pasquale · Damiano Malafronte · Shogo Sato · Hiroyuki Segawa · Tetsugo Inada · Lorenzo Natale, ,,https://paperswithcode.com/paper/open-ended-vqa-benchmarking-of-vision,,,,,nan +Honeybee: Locality-enhanced Projector for Multimodal LLM,Junbum Cha · Woo-Young Kang · Jonghwan Mun · Byungseok Roh, ,https://arxiv.org/abs/2312.06742,,2312.06742.pdf,Honeybee: Locality-enhanced Projector for Multimodal LLM,"In Multimodal Large Language Models (MLLMs), a visual projector plays a +crucial role in bridging pre-trained vision encoders with LLMs, enabling +profound visual understanding while harnessing the LLMs' robust capabilities. +Despite the importance of the visual projector, it has been relatively less +explored. In this study, we first identify two essential projector properties: +(i) flexibility in managing the number of visual tokens, crucial for MLLMs' +overall efficiency, and (ii) preservation of local context from visual +features, vital for spatial understanding. Based on these findings, we propose +a novel projector design that is both flexible and locality-enhanced, +effectively satisfying the two desirable properties. Additionally, we present +comprehensive strategies to effectively utilize multiple and multifaceted +instruction datasets. Through extensive experiments, we examine the impact of +individual design choices. Finally, our proposed MLLM, Honeybee, remarkably +outperforms previous state-of-the-art methods across various benchmarks, +including MME, MMBench, SEED-Bench, and LLaVA-Bench, achieving significantly +higher efficiency. Code and models are available at +https://github.com/kakaobrain/honeybee.",cs.CV,"['cs.CV', 'cs.AI', 'cs.CL', 'cs.LG']" +AZ-NAS: Assembling Zero-Cost Proxies for Network Architecture Search,Junghyup Lee · Bumsub Ham, ,https://arxiv.org/abs/2403.19232,,2403.19232.pdf,AZ-NAS: Assembling Zero-Cost Proxies for Network Architecture Search,"Training-free network architecture search (NAS) aims to discover +high-performing networks with zero-cost proxies, capturing network +characteristics related to the final performance. However, network rankings +estimated by previous training-free NAS methods have shown weak correlations +with the performance. To address this issue, we propose AZ-NAS, a novel +approach that leverages the ensemble of various zero-cost proxies to enhance +the correlation between a predicted ranking of networks and the ground truth +substantially in terms of the performance. To achieve this, we introduce four +novel zero-cost proxies that are complementary to each other, analyzing +distinct traits of architectures in the views of expressivity, progressivity, +trainability, and complexity. The proxy scores can be obtained simultaneously +within a single forward and backward pass, making an overall NAS process highly +efficient. 
In order to integrate the rankings predicted by our proxies +effectively, we introduce a non-linear ranking aggregation method that +highlights the networks highly-ranked consistently across all the proxies. +Experimental results conclusively demonstrate the efficacy and efficiency of +AZ-NAS, outperforming state-of-the-art methods on standard benchmarks, all +while maintaining a reasonable runtime cost.",cs.CV,"['cs.CV', 'cs.LG']" +Selective-Stereo: Adaptive Frequency Information Selection for Stereo Matching,Xianqi Wang · Gangwei Xu · Hao Jia · Xin Yang,https://github.com/Windsrain/Selective-Stereo,https://arxiv.org/abs/2403.00486,,2403.00486.pdf,Selective-Stereo: Adaptive Frequency Information Selection for Stereo Matching,"Stereo matching methods based on iterative optimization, like RAFT-Stereo and +IGEV-Stereo, have evolved into a cornerstone in the field of stereo matching. +However, these methods struggle to simultaneously capture high-frequency +information in edges and low-frequency information in smooth regions due to the +fixed receptive field. As a result, they tend to lose details, blur edges, and +produce false matches in textureless areas. In this paper, we propose Selective +Recurrent Unit (SRU), a novel iterative update operator for stereo matching. +The SRU module can adaptively fuse hidden disparity information at multiple +frequencies for edge and smooth regions. To perform adaptive fusion, we +introduce a new Contextual Spatial Attention (CSA) module to generate attention +maps as fusion weights. The SRU empowers the network to aggregate hidden +disparity information across multiple frequencies, mitigating the risk of vital +hidden disparity information loss during iterative processes. To verify SRU's +universality, we apply it to representative iterative stereo matching methods, +collectively referred to as Selective-Stereo. Our Selective-Stereo ranks +$1^{st}$ on KITTI 2012, KITTI 2015, ETH3D, and Middlebury leaderboards among +all published methods. Code is available at +https://github.com/Windsrain/Selective-Stereo.",cs.CV,['cs.CV'] +Learning the 3D Fauna of the Web,Zizhang Li · Dor Litvak · Ruining Li · Yunzhi Zhang · Tomas Jakab · Christian Rupprecht · Shangzhe Wu · Andrea Vedaldi · Jiajun Wu, ,https://arxiv.org/abs/2401.02400,,2401.02400.pdf,Learning the 3D Fauna of the Web,"Learning 3D models of all animals on the Earth requires massively scaling up +existing solutions. With this ultimate goal in mind, we develop 3D-Fauna, an +approach that learns a pan-category deformable 3D animal model for more than +100 animal species jointly. One crucial bottleneck of modeling animals is the +limited availability of training data, which we overcome by simply learning +from 2D Internet images. We show that prior category-specific attempts fail to +generalize to rare species with limited training images. We address this +challenge by introducing the Semantic Bank of Skinned Models (SBSM), which +automatically discovers a small set of base animal shapes by combining +geometric inductive priors with semantic knowledge implicitly captured by an +off-the-shelf self-supervised feature extractor. To train such a model, we also +contribute a new large-scale dataset of diverse animal species. 
At inference +time, given a single image of any quadruped animal, our model reconstructs an +articulated 3D mesh in a feed-forward fashion within seconds.",cs.CV,['cs.CV'] +LORS: Low-rank Residual Structure for Parameter-Efficient Network Stacking,Jialin Li · Qiang Nie · Weifu Fu · Yuhuan Lin · Guangpin Tao · Yong Liu · Chengjie Wang, ,https://arxiv.org/abs/2403.04303,,2403.04303.pdf,LORS: Low-rank Residual Structure for Parameter-Efficient Network Stacking,"Deep learning models, particularly those based on transformers, often employ +numerous stacked structures, which possess identical architectures and perform +similar functions. While effective, this stacking paradigm leads to a +substantial increase in the number of parameters, posing challenges for +practical applications. In today's landscape of increasingly large models, +stacking depth can even reach dozens, further exacerbating this issue. To +mitigate this problem, we introduce LORS (LOw-rank Residual Structure). LORS +allows stacked modules to share the majority of parameters, requiring a much +smaller number of unique ones per module to match or even surpass the +performance of using entirely distinct ones, thereby significantly reducing +parameter usage. We validate our method by applying it to the stacked decoders +of a query-based object detector, and conduct extensive experiments on the +widely used MS COCO dataset. Experimental results demonstrate the effectiveness +of our method, as even with a 70\% reduction in the parameters of the decoder, +our method still enables the model to achieve comparable or",cs.CV,['cs.CV'] +VP3D: Unleashing 2D Visual Prompt for Text-to-3D Generation,Yang Chen · Yingwei Pan · haibo yang · Ting Yao · Tao Mei,https://vp3d-cvpr24.github.io/,https://arxiv.org/abs/2403.17001,,2403.17001.pdf,VP3D: Unleashing 2D Visual Prompt for Text-to-3D Generation,"Recent innovations on text-to-3D generation have featured Score Distillation +Sampling (SDS), which enables the zero-shot learning of implicit 3D models +(NeRF) by directly distilling prior knowledge from 2D diffusion models. +However, current SDS-based models still struggle with intricate text prompts +and commonly result in distorted 3D models with unrealistic textures or +cross-view inconsistency issues. In this work, we introduce a novel Visual +Prompt-guided text-to-3D diffusion model (VP3D) that explicitly unleashes the +visual appearance knowledge in 2D visual prompt to boost text-to-3D generation. +Instead of solely supervising SDS with text prompt, VP3D first capitalizes on +2D diffusion model to generate a high-quality image from input text, which +subsequently acts as visual prompt to strengthen SDS optimization with explicit +visual appearance. Meanwhile, we couple the SDS optimization with additional +differentiable reward function that encourages rendering images of 3D models to +better visually align with 2D visual prompt and semantically match with text +prompt. Through extensive experiments, we show that the 2D Visual Prompt in our +VP3D significantly eases the learning of visual appearance of 3D models and +thus leads to higher visual fidelity with more detailed textures. It is also +appealing in view that when replacing the self-generating visual prompt with a +given reference image, VP3D is able to trigger a new task of stylized +text-to-3D generation. 
Our project page is available at +https://vp3d-cvpr24.github.io.",cs.CV,"['cs.CV', 'cs.MM']" +Vlogger: Make Your Dream A Vlog,Shaobin Zhuang · Kunchang Li · Xinyuan Chen · Yaohui Wang · Ziwei Liu · Yu Qiao · Yali Wang,https://github.com/zhuangshaobin/Vlogger,https://arxiv.org/abs/2401.09414,,2401.09414.pdf,Vlogger: Make Your Dream A Vlog,"In this work, we present Vlogger, a generic AI system for generating a +minute-level video blog (i.e., vlog) of user descriptions. Different from short +videos with a few seconds, vlog often contains a complex storyline with +diversified scenes, which is challenging for most existing video generation +approaches. To break through this bottleneck, our Vlogger smartly leverages +Large Language Model (LLM) as Director and decomposes a long video generation +task of vlog into four key stages, where we invoke various foundation models to +play the critical roles of vlog professionals, including (1) Script, (2) Actor, +(3) ShowMaker, and (4) Voicer. With such a design of mimicking human beings, +our Vlogger can generate vlogs through explainable cooperation of top-down +planning and bottom-up shooting. Moreover, we introduce a novel video diffusion +model, ShowMaker, which serves as a videographer in our Vlogger for generating +the video snippet of each shooting scene. By incorporating Script and Actor +attentively as textual and visual prompts, it can effectively enhance +spatial-temporal coherence in the snippet. Besides, we design a concise mixed +training paradigm for ShowMaker, boosting its capacity for both T2V generation +and prediction. Finally, the extensive experiments show that our method +achieves state-of-the-art performance on zero-shot T2V generation and +prediction tasks. More importantly, Vlogger can generate over 5-minute vlogs +from open-world descriptions, without loss of video coherence on script and +actor. The code and model is all available at +https://github.com/zhuangshaobin/Vlogger.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG', 'cs.MM']" +KP-RED: Exploiting Semantic Keypoints for Joint 3D Shape Retrieval and Deformation,Ruida Zhang · Chenyangguang Zhang · Yan Di · Fabian Manhardt · Xingyu Liu · Federico Tombari · Xiangyang Ji, ,https://arxiv.org/abs/2403.10099,,2403.10099.pdf,KP-RED: Exploiting Semantic Keypoints for Joint 3D Shape Retrieval and Deformation,"In this paper, we present KP-RED, a unified KeyPoint-driven REtrieval and +Deformation framework that takes object scans as input and jointly retrieves +and deforms the most geometrically similar CAD models from a pre-processed +database to tightly match the target. Unlike existing dense matching based +methods that typically struggle with noisy partial scans, we propose to +leverage category-consistent sparse keypoints to naturally handle both full and +partial object scans. Specifically, we first employ a lightweight retrieval +module to establish a keypoint-based embedding space, measuring the similarity +among objects by dynamically aggregating deformation-aware local-global +features around extracted keypoints. Objects that are close in the embedding +space are considered similar in geometry. Then we introduce the neural +cage-based deformation module that estimates the influence vector of each +keypoint upon cage vertices inside its local support region to control the +deformation of the retrieved shape. Extensive experiments on the synthetic +dataset PartNet and the real-world dataset Scan2CAD demonstrate that KP-RED +surpasses existing state-of-the-art approaches by a large margin. 
Codes and +trained models will be released in https://github.com/lolrudy/KP-RED.",cs.CV,['cs.CV'] +AssistGUI: Task-Oriented PC Graphical User Interface Automation,Difei Gao · Lei Ji · Zechen Bai · Mingyu Ouyang · Peiran Li · Dongxing Mao · Qin WU · Weichen Zhang · Peiyi Wang · Xiangwu Guo · Hengxu Wang · Luowei Zhou · Mike Zheng Shou, ,https://arxiv.org/abs/2312.13108,,2312.13108.pdf,ASSISTGUI: Task-Oriented Desktop Graphical User Interface Automation,"Graphical User Interface (GUI) automation holds significant promise for +assisting users with complex tasks, thereby boosting human productivity. +Existing works leveraging Large Language Model (LLM) or LLM-based AI agents +have shown capabilities in automating tasks on Android and Web platforms. +However, these tasks are primarily aimed at simple device usage and +entertainment operations. This paper presents a novel benchmark, AssistGUI, to +evaluate whether models are capable of manipulating the mouse and keyboard on +the Windows platform in response to user-requested tasks. We carefully +collected a set of 100 tasks from nine widely-used software applications, such +as, After Effects and MS Word, each accompanied by the necessary project files +for better evaluation. Moreover, we propose an advanced Actor-Critic Embodied +Agent framework, which incorporates a sophisticated GUI parser driven by an +LLM-agent and an enhanced reasoning mechanism adept at handling lengthy +procedural tasks. Our experimental results reveal that our GUI Parser and +Reasoning mechanism outshine existing methods in performance. Nevertheless, the +potential remains substantial, with the best model attaining only a 46% success +rate on our benchmark. We conclude with a thorough analysis of the current +methods' limitations, setting the stage for future breakthroughs in this +domain.",cs.CV,['cs.CV'] +MOHO: Learning Single-view Hand-held Object Reconstruction with Multi-view Occlusion-Aware Supervision,Chenyangguang Zhang · Guanlong Jiao · Yan Di · Gu Wang · Ziqin Huang · Ruida Zhang · Fabian Manhardt · Bowen Fu · Federico Tombari · Xiangyang Ji, ,https://arxiv.org/abs/2310.11696,,2310.11696.pdf,MOHO: Learning Single-view Hand-held Object Reconstruction with Multi-view Occlusion-Aware Supervision,"Previous works concerning single-view hand-held object reconstruction +typically rely on supervision from 3D ground-truth models, which are hard to +collect in real world. In contrast, readily accessible hand-object videos offer +a promising training data source, but they only give heavily occluded object +observations. In this paper, we present a novel synthetic-to-real framework to +exploit Multi-view Occlusion-aware supervision from hand-object videos for +Hand-held Object reconstruction (MOHO) from a single image, tackling two +predominant challenges in such setting: hand-induced occlusion and object's +self-occlusion. First, in the synthetic pre-training stage, we render a +large-scaled synthetic dataset SOMVideo with hand-object images and multi-view +occlusion-free supervisions, adopted to address hand-induced occlusion in both +2D and 3D spaces. Second, in the real-world finetuning stage, MOHO leverages +the amodal-mask-weighted geometric supervision to mitigate the unfaithful +guidance caused by the hand-occluded supervising views in real world. Moreover, +domain-consistent occlusion-aware features are amalgamated in MOHO to resist +object's self-occlusion for inferring the complete object shape. 
Extensive +experiments on HO3D and DexYCB datasets demonstrate 2D-supervised MOHO gains +superior results against 3D-supervised methods by a large margin.",cs.CV,['cs.CV'] +Text-Guided 3D Face Synthesis - From Generation to Editing,Yunjie Wu · Yapeng Meng · Zhipeng Hu · Lincheng Li · Haoqian Wu · Kun Zhou · Weiwei Xu · Xin Yu, ,https://arxiv.org/abs/2312.00375,,2312.00375.pdf,Text-Guided 3D Face Synthesis -- From Generation to Editing,"Text-guided 3D face synthesis has achieved remarkable results by leveraging +text-to-image (T2I) diffusion models. However, most existing works focus solely +on the direct generation, ignoring the editing, restricting them from +synthesizing customized 3D faces through iterative adjustments. In this paper, +we propose a unified text-guided framework from face generation to editing. In +the generation stage, we propose a geometry-texture decoupled generation to +mitigate the loss of geometric details caused by coupling. Besides, decoupling +enables us to utilize the generated geometry as a condition for texture +generation, yielding highly geometry-texture aligned results. We further employ +a fine-tuned texture diffusion model to enhance texture quality in both RGB and +YUV space. In the editing stage, we first employ a pre-trained diffusion model +to update facial geometry or texture based on the texts. To enable sequential +editing, we introduce a UV domain consistency preservation regularization, +preventing unintentional changes to irrelevant facial attributes. Besides, we +propose a self-guided consistency weight strategy to improve editing efficacy +while preserving consistency. Through comprehensive experiments, we showcase +our method's superiority in face synthesis. Project page: +https://faceg2e.github.io/.",cs.CV,['cs.CV'] +Characteristics Matching Based Hash Codes Generation for Efficient Fine-grained Image Retrieval,Zhen-Duo Chen · Li-Jun Zhao · Zi-Chao Zhang · Xin Luo · Xin-Shun Xu, ,https://arxiv.org/abs/2311.06067,,2311.06067.pdf,Attributes Grouping and Mining Hashing for Fine-Grained Image Retrieval,"In recent years, hashing methods have been popular in the large-scale media +search for low storage and strong representation capabilities. To describe +objects with similar overall appearance but subtle differences, more and more +studies focus on hashing-based fine-grained image retrieval. Existing hashing +networks usually generate both local and global features through attention +guidance on the same deep activation tensor, which limits the diversity of +feature representations. To handle this limitation, we substitute convolutional +descriptors for attention-guided features and propose an Attributes Grouping +and Mining Hashing (AGMH), which groups and embeds the category-specific visual +attributes in multiple descriptors to generate a comprehensive feature +representation for efficient fine-grained image retrieval. Specifically, an +Attention Dispersion Loss (ADL) is designed to force the descriptors to attend +to various local regions and capture diverse subtle details. Moreover, we +propose a Stepwise Interactive External Attention (SIEA) to mine critical +attributes in each descriptor and construct correlations between fine-grained +attributes and objects. The attention mechanism is dedicated to learning +discrete attributes, which will not cost additional computations in hash codes +generation. Finally, the compact binary codes are learned by preserving +pairwise similarities. 
Experimental results demonstrate that AGMH consistently +yields the best performance against state-of-the-art methods on fine-grained +benchmark datasets.",cs.IR,"['cs.IR', 'cs.AI', 'cs.CV']" +VOODOO 3D: VOlumetric pOrtrait Disentanglement fOr Online 3D head reenactment,Phong Tran · Egor Zakharov · Long Nhat Ho · Anh Tran · Liwen Hu · Hao Li, ,https://arxiv.org/abs/2312.04651,,2312.04651.pdf,VOODOO 3D: Volumetric Portrait Disentanglement for One-Shot 3D Head Reenactment,"We present a 3D-aware one-shot head reenactment method based on a fully +volumetric neural disentanglement framework for source appearance and driver +expressions. Our method is real-time and produces high-fidelity and +view-consistent output, suitable for 3D teleconferencing systems based on +holographic displays. Existing cutting-edge 3D-aware reenactment methods often +use neural radiance fields or 3D meshes to produce view-consistent appearance +encoding, but, at the same time, they rely on linear face models, such as 3DMM, +to achieve its disentanglement with facial expressions. As a result, their +reenactment results often exhibit identity leakage from the driver or have +unnatural expressions. To address these problems, we propose a neural +self-supervised disentanglement approach that lifts both the source image and +driver video frame into a shared 3D volumetric representation based on +tri-planes. This representation can then be freely manipulated with expression +tri-planes extracted from the driving images and rendered from an arbitrary +view using neural radiance fields. We achieve this disentanglement via +self-supervised learning on a large in-the-wild video dataset. We further +introduce a highly effective fine-tuning approach to improve the +generalizability of the 3D lifting using the same real-world data. We +demonstrate state-of-the-art performance on a wide range of datasets, and also +showcase high-quality 3D-aware head reenactment on highly challenging and +diverse subjects, including non-frontal head poses and complex expressions for +both source and driver.",cs.CV,['cs.CV'] +Telling Left from Right: Identifying Geometry-Aware Semantic Correspondence,Junyi Zhang · Charles Herrmann · Junhwa Hur · Eric Chen · Varun Jampani · Deqing Sun · Ming-Hsuan Yang,telling-left-from-right.github.io,https://arxiv.org/abs/2311.17034,,2311.17034.pdf,Telling Left from Right: Identifying Geometry-Aware Semantic Correspondence,"While pre-trained large-scale vision models have shown significant promise +for semantic correspondence, their features often struggle to grasp the +geometry and orientation of instances. This paper identifies the importance of +being geometry-aware for semantic correspondence and reveals a limitation of +the features of current foundation models under simple post-processing. We show +that incorporating this information can markedly enhance semantic +correspondence performance with simple but effective solutions in both +zero-shot and supervised settings. We also construct a new challenging +benchmark for semantic correspondence built from an existing animal pose +estimation dataset, for both pre-training validating models. Our method +achieves a PCK@0.10 score of 65.4 (zero-shot) and 85.6 (supervised) on the +challenging SPair-71k dataset, outperforming the state of the art by 5.5p and +11.0p absolute gains, respectively. 
Our code and datasets are publicly +available at: https://telling-left-from-right.github.io/.",cs.CV,['cs.CV'] +Federated Generalized Category Discovery,Nan Pu · Wenjing Li · Xinyuan Ji · Yalan Qin · Nicu Sebe · Zhun Zhong, ,https://arxiv.org/abs/2403.07369,,2403.07369.pdf,Textual Knowledge Matters: Cross-Modality Co-Teaching for Generalized Visual Class Discovery,"In this paper, we study the problem of Generalized Category Discovery (GCD), +which aims to cluster unlabeled data from both known and unknown categories +using the knowledge of labeled data from known categories. Current GCD methods +rely on only visual cues, which however neglect the multi-modality perceptive +nature of human cognitive processes in discovering novel visual categories. To +address this, we propose a two-phase TextGCD framework to accomplish +multi-modality GCD by exploiting powerful Visual-Language Models. TextGCD +mainly includes a retrieval-based text generation (RTG) phase and a +cross-modality co-teaching (CCT) phase. First, RTG constructs a visual lexicon +using category tags from diverse datasets and attributes from Large Language +Models, generating descriptive texts for images in a retrieval manner. Second, +CCT leverages disparities between textual and visual modalities to foster +mutual learning, thereby enhancing visual GCD. In addition, we design an +adaptive class aligning strategy to ensure the alignment of category +perceptions between modalities as well as a soft-voting mechanism to integrate +multi-modality cues. Experiments on eight datasets show the large superiority +of our approach over state-of-the-art methods. Notably, our approach +outperforms the best competitor, by 7.7% and 10.8% in All accuracy on +ImageNet-1k and CUB, respectively.",cs.CV,['cs.CV'] +LidaRF: Delving into Lidar for Neural Radiance Field on Street Scenes,Shanlin Sun · Bingbing Zhuang · Ziyu Jiang · Buyu Liu · Xiaohui Xie · Manmohan Chandraker, ,https://arxiv.org/abs/2405.00900,,2405.00900.pdf,LidaRF: Delving into Lidar for Neural Radiance Field on Street Scenes,"Photorealistic simulation plays a crucial role in applications such as +autonomous driving, where advances in neural radiance fields (NeRFs) may allow +better scalability through the automatic creation of digital 3D assets. +However, reconstruction quality suffers on street scenes due to largely +collinear camera motions and sparser samplings at higher speeds. On the other +hand, the application often demands rendering from camera views that deviate +from the inputs to accurately simulate behaviors like lane changes. In this +paper, we propose several insights that allow a better utilization of Lidar +data to improve NeRF quality on street scenes. First, our framework learns a +geometric scene representation from Lidar, which is fused with the implicit +grid-based representation for radiance decoding, thereby supplying stronger +geometric information offered by explicit point cloud. Second, we put forth a +robust occlusion-aware depth supervision scheme, which allows utilizing +densified Lidar points by accumulation. Third, we generate augmented training +views from Lidar points for further improvement. 
Our insights translate to +largely improved novel view synthesis under real driving scenes.",cs.CV,['cs.CV'] +Learning Occupancy for Monocular 3D Object Detection,Liang Peng · Junkai Xu · Haoran Cheng · Zheng Yang · Xiaopei Wu · Wei Qian · Wenxiao Wang · Boxi Wu · Deng Cai, ,https://arxiv.org/abs/2308.09421,,2308.09421.pdf,MonoNeRD: NeRF-like Representations for Monocular 3D Object Detection,"In the field of monocular 3D detection, it is common practice to utilize +scene geometric clues to enhance the detector's performance. However, many +existing works adopt these clues explicitly such as estimating a depth map and +back-projecting it into 3D space. This explicit methodology induces sparsity in +3D representations due to the increased dimensionality from 2D to 3D, and leads +to substantial information loss, especially for distant and occluded objects. +To alleviate this issue, we propose MonoNeRD, a novel detection framework that +can infer dense 3D geometry and occupancy. Specifically, we model scenes with +Signed Distance Functions (SDF), facilitating the production of dense 3D +representations. We treat these representations as Neural Radiance Fields +(NeRF) and then employ volume rendering to recover RGB images and depth maps. +To the best of our knowledge, this work is the first to introduce volume +rendering for M3D, and demonstrates the potential of implicit reconstruction +for image-based 3D perception. Extensive experiments conducted on the KITTI-3D +benchmark and Waymo Open Dataset demonstrate the effectiveness of MonoNeRD. +Codes are available at https://github.com/cskkxjk/MonoNeRD.",cs.CV,['cs.CV'] +CaDeT: a Causal Disentanglement Approach for Robust Trajectory Prediction in Autonomous Driving,Mozhgan Pourkeshavarz · Junrui Zhang · Amir Rasouli, ,https://arxiv.org/abs/2404.12538,,2404.12538.pdf,TrACT: A Training Dynamics Aware Contrastive Learning Framework for Long-tail Trajectory Prediction,"As a safety critical task, autonomous driving requires accurate predictions +of road users' future trajectories for safe motion planning, particularly under +challenging conditions. Yet, many recent deep learning methods suffer from a +degraded performance on the challenging scenarios, mainly because these +scenarios appear less frequently in the training data. To address such a +long-tail issue, existing methods force challenging scenarios closer together +in the feature space during training to trigger information sharing among them +for more robust learning. These methods, however, primarily rely on the motion +patterns to characterize scenarios, omitting more informative contextual +information, such as interactions and scene layout. We argue that exploiting +such information not only improves prediction accuracy but also scene +compliance of the generated trajectories. In this paper, we propose to +incorporate richer training dynamics information into a prototypical +contrastive learning framework. More specifically, we propose a two-stage +process. First, we generate rich contextual features using a baseline +encoder-decoder framework. These features are split into clusters based on the +model's output errors, using the training dynamics information, and a prototype +is computed within each cluster. Second, we retrain the model using the +prototypes in a contrastive learning framework. 
We conduct empirical +evaluations of our approach using two large-scale naturalistic datasets and +show that our method achieves state-of-the-art performance by improving +accuracy and scene compliance on the long-tail samples. Furthermore, we perform +experiments on a subset of the clusters to highlight the additional benefit of +our approach in reducing training bias.",cs.CV,"['cs.CV', 'cs.LG']" +Towards HDR and HFR Video from Rolling-Mixed-Bit Spikings,Yakun Chang · Yeliduosi Xiaokaiti · Yujia Liu · Bin Fan · Zhaojun Huang · Tiejun Huang · Boxin Shi, ,https://arxiv.org/abs/2405.00244,,,Towards Real-World HDR Video Reconstruction: A Large-Scale Benchmark Dataset and A Two-Stage Alignment Network,"As an important and practical way to obtain high dynamic range (HDR) video, +HDR video reconstruction from sequences with alternating exposures is still +less explored, mainly due to the lack of large-scale real-world datasets. +Existing methods are mostly trained on synthetic datasets, which perform poorly +in real scenes. In this work, to facilitate the development of real-world HDR +video reconstruction, we present Real-HDRV, a large-scale real-world benchmark +dataset for HDR video reconstruction, featuring various scenes, diverse motion +patterns, and high-quality labels. Specifically, our dataset contains 500 +LDRs-HDRs video pairs, comprising about 28,000 LDR frames and 4,000 HDR labels, +covering daytime, nighttime, indoor, and outdoor scenes. To our best knowledge, +our dataset is the largest real-world HDR video reconstruction dataset. +Correspondingly, we propose an end-to-end network for HDR video reconstruction, +where a novel two-stage strategy is designed to perform alignment sequentially. +Specifically, the first stage performs global alignment with the adaptively +estimated global offsets, reducing the difficulty of subsequent alignment. The +second stage implicitly performs local alignment in a coarse-to-fine manner at +the feature level using the adaptive separable convolution. Extensive +experiments demonstrate that: (1) models trained on our dataset can achieve +better performance on real scenes than those trained on synthetic datasets; (2) +our method outperforms previous state-of-the-art methods. Our dataset is +available at https://github.com/yungsyu99/Real-HDRV.",cs.CV,['cs.CV'] +Once for Both: Single Stage of Importance and Sparsity Search for Vision Transformer Compression,Hancheng Ye · Chong Yu · Peng Ye · Renqiu Xia · Bo Zhang · Yansong Tang · Jiwen Lu · Tao Chen, ,https://arxiv.org/abs/2403.15835,,2403.15835.pdf,Once for Both: Single Stage of Importance and Sparsity Search for Vision Transformer Compression,"Recent Vision Transformer Compression (VTC) works mainly follow a two-stage +scheme, where the importance score of each model unit is first evaluated or +preset in each submodule, followed by the sparsity score evaluation according +to the target sparsity constraint. Such a separate evaluation process induces +the gap between importance and sparsity score distributions, thus causing high +search costs for VTC. In this work, for the first time, we investigate how to +integrate the evaluations of importance and sparsity scores into a single +stage, searching the optimal subnets in an efficient manner. Specifically, we +present OFB, a cost-efficient approach that simultaneously evaluates both +importance and sparsity scores, termed Once for Both (OFB), for VTC. 
First, a +bi-mask scheme is developed by entangling the importance score and the +differentiable sparsity score to jointly determine the pruning potential +(prunability) of each unit. Such a bi-mask search strategy is further used +together with a proposed adaptive one-hot loss to realize the +progressive-and-efficient search for the most important subnet. Finally, +Progressive Masked Image Modeling (PMIM) is proposed to regularize the feature +space to be more representative during the search process, which may be +degraded by the dimension reduction. Extensive experiments demonstrate that OFB +can achieve superior compression performance over state-of-the-art +searching-based and pruning-based methods under various Vision Transformer +architectures, meanwhile promoting search efficiency significantly, e.g., +costing one GPU search day for the compression of DeiT-S on ImageNet-1K.",cs.CV,['cs.CV'] +GaussianAvatar: Towards Realistic Human Avatar Modeling from a Single Video via Animatable 3D Gaussians,Liangxiao Hu · Hongwen Zhang · Yuxiang Zhang · Boyao ZHOU · Boning Liu · Shengping Zhang · Liqiang Nie,https://huliangxiao.github.io/GaussianAvatar,https://arxiv.org/abs/2312.02134,,2312.02134.pdf,GaussianAvatar: Towards Realistic Human Avatar Modeling from a Single Video via Animatable 3D Gaussians,"We present GaussianAvatar, an efficient approach to creating realistic human +avatars with dynamic 3D appearances from a single video. We start by +introducing animatable 3D Gaussians to explicitly represent humans in various +poses and clothing styles. Such an explicit and animatable representation can +fuse 3D appearances more efficiently and consistently from 2D observations. Our +representation is further augmented with dynamic properties to support +pose-dependent appearance modeling, where a dynamic appearance network along +with an optimizable feature tensor is designed to learn the +motion-to-appearance mapping. Moreover, by leveraging the differentiable motion +condition, our method enables a joint optimization of motions and appearances +during avatar modeling, which helps to tackle the long-standing issue of +inaccurate motion estimation in monocular settings. The efficacy of +GaussianAvatar is validated on both the public dataset and our collected +dataset, demonstrating its superior performances in terms of appearance quality +and rendering efficiency.",cs.CV,['cs.CV'] +OneTracker: Unifying Visual Object Tracking with Foundation Models and Efficient Tuning,Lingyi Hong · Shilin Yan · Renrui Zhang · Wanyun Li · Xinyu Zhou · Pinxue Guo · Kaixun Jiang · Yiting Cheng · Jinglun Li · Zhaoyu Chen · Wenqiang Zhang, ,https://arxiv.org/abs/2403.09634,,2403.09634.pdf,OneTracker: Unifying Visual Object Tracking with Foundation Models and Efficient Tuning,"Visual object tracking aims to localize the target object of each frame based +on its initial appearance in the first frame. Depending on the input modility, +tracking tasks can be divided into RGB tracking and RGB+X (e.g. RGB+N, and +RGB+D) tracking. Despite the different input modalities, the core aspect of +tracking is the temporal matching. Based on this common ground, we present a +general framework to unify various tracking tasks, termed as OneTracker. +OneTracker first performs a large-scale pre-training on a RGB tracker called +Foundation Tracker. This pretraining phase equips the Foundation Tracker with a +stable ability to estimate the location of the target object. 
Then we regard +other modality information as prompt and build Prompt Tracker upon Foundation +Tracker. Through freezing the Foundation Tracker and only adjusting some +additional trainable parameters, Prompt Tracker inhibits the strong +localization ability from Foundation Tracker and achieves parameter-efficient +finetuning on downstream RGB+X tracking tasks. To evaluate the effectiveness of +our general framework OneTracker, which is consisted of Foundation Tracker and +Prompt Tracker, we conduct extensive experiments on 6 popular tracking tasks +across 11 benchmarks and our OneTracker outperforms other models and achieves +state-of-the-art performance.",cs.CV,['cs.CV'] +TTA-EVF: Test-Time Adaptation for Event-based Video Frame Interpolation via Reliable Pixel and Sample Estimation,Hoonhee Cho · Taewoo Kim · Yuhwan Jeong · Kuk-Jin Yoon, ,https://arxiv.org/abs/2404.18156,,,Event-based Video Frame Interpolation with Edge Guided Motion Refinement,"Video frame interpolation, the process of synthesizing intermediate frames +between sequential video frames, has made remarkable progress with the use of +event cameras. These sensors, with microsecond-level temporal resolution, fill +information gaps between frames by providing precise motion cues. However, +contemporary Event-Based Video Frame Interpolation (E-VFI) techniques often +neglect the fact that event data primarily supply high-confidence features at +scene edges during multi-modal feature fusion, thereby diminishing the role of +event signals in optical flow (OF) estimation and warping refinement. To +address this overlooked aspect, we introduce an end-to-end E-VFI learning +method (referred to as EGMR) to efficiently utilize edge features from event +signals for motion flow and warping enhancement. Our method incorporates an +Edge Guided Attentive (EGA) module, which rectifies estimated video motion +through attentive aggregation based on the local correlation of multi-modal +features in a coarse-to-fine strategy. Moreover, given that event data can +provide accurate visual references at scene edges between consecutive frames, +we introduce a learned visibility map derived from event data to adaptively +mitigate the occlusion problem in the warping refinement process. Extensive +experiments on both synthetic and real datasets show the effectiveness of the +proposed approach, demonstrating its potential for higher quality video frame +interpolation.",cs.CV,['cs.CV'] +GigaTraj: Predicting Long-term Trajectories of Hundreds of Pedestrians in Gigapixel Complex Scenes,Haozhe Lin · Chunyu Wei · Li He · Yuchen Guo · Yuchy Zhao · Shanglong Li · Lu Fang, ,https://arxiv.org/abs/2402.19002,,2402.19002.pdf,GoalNet: Goal Areas Oriented Pedestrian Trajectory Prediction,"Predicting the future trajectories of pedestrians on the road is an important +task for autonomous driving. The pedestrian trajectory prediction is affected +by scene paths, pedestrian's intentions and decision-making, which is a +multi-modal problem. Most recent studies use past trajectories to predict a +variety of potential future trajectory distributions, which do not account for +the scene context and pedestrian targets. Instead of predicting the future +trajectory directly, we propose to use scene context and observed trajectory to +predict the goal points first, and then reuse the goal points to predict the +future trajectories. 
By leveraging the information from scene context and +observed trajectory, the uncertainty can be limited to a few target areas, +which represent the ""goals"" of the pedestrians. In this paper, we propose +GoalNet, a new trajectory prediction neural network based on the goal areas of +a pedestrian. Our network can predict both pedestrian's trajectories and +bounding boxes. The overall model is efficient and modular, and its outputs can +be changed according to the usage scenario. Experimental results show that +GoalNet significantly improves the previous state-of-the-art performance by +48.7% on the JAAD and 40.8% on the PIE dataset.",cs.CV,"['cs.CV', 'cs.AI']" +Discovering Syntactic Interaction Clues for Human-Object Interaction Detection,Jinguo Luo · Weihong Ren · Weibo Jiang · Xi'ai Chen · Qiang Wang · Zhi Han · Honghai LIU, ,,https://www.youtube.com/watch?v=YxKgZAoqzpY,,,,,nan +Exploring Vision Transformers for 3D Human Motion-Language Models with Motion Patches,Qing Yu · Mikihiro Tanaka · Kent Fujiwara,https://yu1ut.com/MotionPatches-HP/,https://arxiv.org/abs/2405.04771,,2405.04771.pdf,Exploring Vision Transformers for 3D Human Motion-Language Models with Motion Patches,"To build a cross-modal latent space between 3D human motion and language, +acquiring large-scale and high-quality human motion data is crucial. However, +unlike the abundance of image data, the scarcity of motion data has limited the +performance of existing motion-language models. To counter this, we introduce +""motion patches"", a new representation of motion sequences, and propose using +Vision Transformers (ViT) as motion encoders via transfer learning, aiming to +extract useful knowledge from the image domain and apply it to the motion +domain. These motion patches, created by dividing and sorting skeleton joints +based on body parts in motion sequences, are robust to varying skeleton +structures, and can be regarded as color image patches in ViT. We find that +transfer learning with pre-trained weights of ViT obtained through training +with 2D image data can boost the performance of motion analysis, presenting a +promising direction for addressing the issue of limited motion data. Our +extensive experiments show that the proposed motion patches, used jointly with +ViT, achieve state-of-the-art performance in the benchmarks of text-to-motion +retrieval, and other novel challenging tasks, such as cross-skeleton +recognition, zero-shot motion classification, and human interaction +recognition, which are currently impeded by the lack of data.",cs.CV,['cs.CV'] +ViLa-MIL: Dual-scale Vision-Language Multiple Instance Learning for Whole Slide Image Classification,Jiangbo Shi · Chen Li · Tieliang Gong · Yefeng Zheng · Huazhu Fu, ,https://arxiv.org/abs/2312.01099,,2312.01099.pdf,Rethinking Multiple Instance Learning for Whole Slide Image Classification: A Bag-Level Classifier is a Good Instance-Level Teacher,"Multiple Instance Learning (MIL) has demonstrated promise in Whole Slide +Image (WSI) classification. However, a major challenge persists due to the high +computational cost associated with processing these gigapixel images. Existing +methods generally adopt a two-stage approach, comprising a non-learnable +feature embedding stage and a classifier training stage. Though it can greatly +reduce the memory consumption by using a fixed feature embedder pre-trained on +other domains, such scheme also results in a disparity between the two stages, +leading to suboptimal classification accuracy. 
To address this issue, we +propose that a bag-level classifier can be a good instance-level teacher. Based +on this idea, we design Iteratively Coupled Multiple Instance Learning (ICMIL) +to couple the embedder and the bag classifier at a low cost. ICMIL initially +fix the patch embedder to train the bag classifier, followed by fixing the bag +classifier to fine-tune the patch embedder. The refined embedder can then +generate better representations in return, leading to a more accurate +classifier for the next iteration. To realize more flexible and more effective +embedder fine-tuning, we also introduce a teacher-student framework to +efficiently distill the category knowledge in the bag classifier to help the +instance-level embedder fine-tuning. Thorough experiments were conducted on +four distinct datasets to validate the effectiveness of ICMIL. The experimental +results consistently demonstrate that our method significantly improves the +performance of existing MIL backbones, achieving state-of-the-art results. The +code is available at: https://github.com/Dootmaan/ICMIL/tree/confidence_based",cs.CV,['cs.CV'] +Neural Visibility Field for Uncertainty-Driven Active Mapping,Shangjie Xue · Jesse Dill · Pranay Mathur · Frank Dellaert · Panagiotis Tsiotras · Danfei Xu, ,http://export.arxiv.org/abs/2308.16246,,2308.16246.pdf,Active Neural Mapping,"We address the problem of active mapping with a continually-learned neural +scene representation, namely Active Neural Mapping. The key lies in actively +finding the target space to be explored with efficient agent movement, thus +minimizing the map uncertainty on-the-fly within a previously unseen +environment. In this paper, we examine the weight space of the +continually-learned neural field, and show empirically that the neural +variability, the prediction robustness against random weight perturbation, can +be directly utilized to measure the instant uncertainty of the neural map. +Together with the continuous geometric information inherited in the neural map, +the agent can be guided to find a traversable path to gradually gain knowledge +of the environment. We present for the first time an active mapping system with +a coordinate-based implicit neural representation for online scene +reconstruction. Experiments in the visually-realistic Gibson and Matterport3D +environment demonstrate the efficacy of the proposed method.",cs.CV,['cs.CV'] +Enhancing 3D Object Detection with 2D Detection-Guided Query Anchors,Haoxuanye Ji · Pengpeng Liang · Erkang Cheng, ,https://arxiv.org/abs/2403.06093,,2403.06093.pdf,Enhancing 3D Object Detection with 2D Detection-Guided Query Anchors,"Multi-camera-based 3D object detection has made notable progress in the past +several years. However, we observe that there are cases (e.g. faraway regions) +in which popular 2D object detectors are more reliable than state-of-the-art 3D +detectors. In this paper, to improve the performance of query-based 3D object +detectors, we present a novel query generating approach termed QAF2D, which +infers 3D query anchors from 2D detection results. A 2D bounding box of an +object in an image is lifted to a set of 3D anchors by associating each sampled +point within the box with depth, yaw angle, and size candidates. Then, the +validity of each 3D anchor is verified by comparing its projection in the image +with its corresponding 2D box, and only valid anchors are kept and used to +construct queries. 
The class information of the 2D bounding box associated with +each query is also utilized to match the predicted boxes with ground truth for +the set-based loss. The image feature extraction backbone is shared between the +3D detector and 2D detector by adding a small number of prompt parameters. We +integrate QAF2D into three popular query-based 3D object detectors and carry +out comprehensive evaluations on the nuScenes dataset. The largest improvement +that QAF2D can bring about on the nuScenes validation subset is $2.3\%$ NDS and +$2.7\%$ mAP. Code is available at https://github.com/nullmax-vision/QAF2D.",cs.CV,['cs.CV'] +Resolution Limit of Single-Photon LIDAR,Stanley H. Chan · Hashan K Weerasooriya · Weijian Zhang · Pamela Abshire · Istvan Gyongy · Robert Henderson, ,https://arxiv.org/abs/2403.17719,,2403.17719.pdf,Resolution Limit of Single-Photon LiDAR,"Single-photon Light Detection and Ranging (LiDAR) systems are often equipped +with an array of detectors for improved spatial resolution and sensing speed. +However, given a fixed amount of flux produced by the laser transmitter across +the scene, the per-pixel Signal-to-Noise Ratio (SNR) will decrease when more +pixels are packed in a unit space. This presents a fundamental trade-off +between the spatial resolution of the sensor array and the SNR received at each +pixel. Theoretical characterization of this fundamental limit is explored. By +deriving the photon arrival statistics and introducing a series of new +approximation techniques, the Mean Squared Error (MSE) of the +maximum-likelihood estimator of the time delay is derived. The theoretical +predictions align well with simulations and real data.",eess.SP,"['eess.SP', 'cs.CV']" +VA3: Virtually Assured Amplification Attack on Probabilistic Copyright Protection for Text-to-Image Generative Models,Xiang Li · Qianli Shen · Kenji Kawaguchi, ,https://arxiv.org/abs/2312.00057,,2312.00057.pdf,VA3: Virtually Assured Amplification Attack on Probabilistic Copyright Protection for Text-to-Image Generative Models,"The booming use of text-to-image generative models has raised concerns about +their high risk of producing copyright-infringing content. While probabilistic +copyright protection methods provide a probabilistic guarantee against such +infringement, in this paper, we introduce Virtually Assured Amplification +Attack (VA3), a novel online attack framework that exposes the vulnerabilities +of these protection mechanisms. The proposed framework significantly amplifies +the probability of generating infringing content on the sustained interactions +with generative models and a non-trivial lower-bound on the success probability +of each engagement. Our theoretical and experimental results demonstrate the +effectiveness of our approach under various scenarios. These findings highlight +the potential risk of implementing probabilistic copyright protection in +practical applications of text-to-image generative models. Code is available at +https://github.com/South7X/VA3.",cs.CR,"['cs.CR', 'cs.AI', 'cs.CV', 'cs.MM']" +Atlantis: Enabling Underwater Depth Estimation with Stable Diffusion,Fan Zhang · Shaodi You · Yu Li · Ying Fu,https://github.com/zkawfanx/Atlantis,https://arxiv.org/abs/2312.12471,,2312.12471.pdf,Atlantis: Enabling Underwater Depth Estimation with Stable Diffusion,"Monocular depth estimation has experienced significant progress on +terrestrial images in recent years, largely due to deep learning advancements. 
+However, it remains inadequate for underwater scenes, primarily because of data +scarcity. Given the inherent challenges of light attenuation and backscattering +in water, acquiring clear underwater images or precise depth information is +notably difficult and costly. Consequently, learning-based approaches often +rely on synthetic data or turn to unsupervised or self-supervised methods to +mitigate this lack of data. Nonetheless, the performance of these methods is +often constrained by the domain gap and looser constraints. In this paper, we +propose a novel pipeline for generating photorealistic underwater images using +accurate terrestrial depth data. This approach facilitates the training of +supervised models for underwater depth estimation, effectively reducing the +performance disparity between terrestrial and underwater environments. Contrary +to prior synthetic datasets that merely apply style transfer to terrestrial +images without altering the scene content, our approach uniquely creates +vibrant, non-existent underwater scenes by leveraging terrestrial depth data +through the innovative Stable Diffusion model. Specifically, we introduce a +unique Depth2Underwater ControlNet, trained on specially prepared \{Underwater, +Depth, Text\} data triplets, for this generation task. Our newly developed +dataset enables terrestrial depth estimation models to achieve considerable +improvements, both quantitatively and qualitatively, on unseen underwater +images, surpassing their terrestrial pre-trained counterparts. Moreover, the +enhanced depth accuracy for underwater scenes also aids underwater image +restoration techniques that rely on depth maps, further demonstrating our +dataset's utility. The dataset will be available at +https://github.com/zkawfanx/Atlantis.",cs.CV,['cs.CV'] +ECLIPSE: A Resource-Efficient Text-to-Image Prior for Image Generations,Maitreya Patel · Changhoon Kim · Sheng Cheng · Chitta Baral · 'YZ' Yezhou Yang, ,https://arxiv.org/abs/2312.04655,,2312.04655.pdf,ECLIPSE: A Resource-Efficient Text-to-Image Prior for Image Generations,"Text-to-image (T2I) diffusion models, notably the unCLIP models (e.g., +DALL-E-2), achieve state-of-the-art (SOTA) performance on various compositional +T2I benchmarks, at the cost of significant computational resources. The unCLIP +stack comprises T2I prior and diffusion image decoder. The T2I prior model +alone adds a billion parameters compared to the Latent Diffusion Models, which +increases the computational and high-quality data requirements. We introduce +ECLIPSE, a novel contrastive learning method that is both parameter and +data-efficient. ECLIPSE leverages pre-trained vision-language models (e.g., +CLIP) to distill the knowledge into the prior model. We demonstrate that the +ECLIPSE trained prior, with only 3.3% of the parameters and trained on a mere +2.8% of the data, surpasses the baseline T2I priors with an average of 71.6% +preference score under resource-limited setting. It also attains performance on +par with SOTA big models, achieving an average of 63.36% preference score in +terms of the ability to follow the text compositions. 
Extensive experiments on
+two unCLIP diffusion image decoders, Karlo and Kandinsky, affirm that ECLIPSE
+priors consistently deliver high performance while significantly reducing
+resource dependency.",cs.CV,['cs.CV']
+Hallucination Augmented Contrastive Learning for Multimodal Large Language Model,Chaoya Jiang · Haiyang Xu · Mengfan Dong · Jiaxing Chen · Wei Ye · Ming Yan · Qinghao Ye · Ji Zhang · Fei Huang · Shikun Zhang, ,https://arxiv.org/abs/2312.06968,,2312.06968.pdf,Hallucination Augmented Contrastive Learning for Multimodal Large Language Model,"Multi-modal large language models (MLLMs) have been shown to efficiently
+integrate natural language with visual information to handle multi-modal tasks.
+However, MLLMs still face a fundamental limitation of hallucinations, where
+they tend to generate erroneous or fabricated information. In this paper, we
+address hallucinations in MLLMs from a novel perspective of representation
+learning. We first analyzed the representation distribution of textual and
+visual tokens in MLLM, revealing two important findings: 1) there is a
+significant gap between textual and visual representations, indicating
+unsatisfactory cross-modal representation alignment; 2) representations of
+texts that contain and do not contain hallucinations are entangled, making it
+challenging to distinguish them. These two observations inspire us with a
+simple yet effective method to mitigate hallucinations. Specifically, we
+introduce contrastive learning into MLLMs and use text with hallucination as
+hard negative examples, naturally bringing representations of non-hallucinative
+text and visual samples closer while pushing away representations of
+non-hallucinating and hallucinative text. We evaluate our method quantitatively
+and qualitatively, showing its effectiveness in reducing hallucination
+occurrences and improving performance across multiple benchmarks. On the
+MMhal-Bench benchmark, our method obtains a 34.66% /29.5% improvement over the
+baseline MiniGPT-4/LLaVA. Our code is available on
+https://github.com/X-PLUG/mPLUG-HalOwl/tree/main/hacl.",cs.CV,['cs.CV']
+Active Domain Adaptation with False Negative Prediction for Object Detection,Yuzuru Nakamura · Yasunori Ishii · Takayoshi Yamashita, ,https://arxiv.org/abs/2307.07944,,2307.07944.pdf,"Revisiting Domain-Adaptive 3D Object Detection by Reliable, Diverse and Class-balanced Pseudo-Labeling","Unsupervised domain adaptation (DA) with the aid of pseudo labeling
+techniques has emerged as a crucial approach for domain-adaptive 3D object
+detection. While effective, existing DA methods suffer from a substantial drop
+in performance when applied to a multi-class training setting, due to the
+co-existence of low-quality pseudo labels and class imbalance issues. In this
+paper, we address this challenge by proposing a novel ReDB framework tailored
+for learning to detect all classes at once. Our approach produces Reliable,
+Diverse, and class-Balanced pseudo 3D boxes to iteratively guide the
+self-training on a distributionally different target domain. To alleviate
+disruptions caused by the environmental discrepancy (e.g., beam numbers), the
+proposed cross-domain examination (CDE) assesses the correctness of pseudo
+labels by copy-pasting target instances into a source environment and measuring
+the prediction consistency.
To reduce computational overhead and mitigate the +object shift (e.g., scales and point densities), we design an overlapped boxes +counting (OBC) metric that allows to uniformly downsample pseudo-labeled +objects across different geometric characteristics. To confront the issue of +inter-class imbalance, we progressively augment the target point clouds with a +class-balanced set of pseudo-labeled target instances and source objects, which +boosts recognition accuracies on both frequently appearing and rare classes. +Experimental results on three benchmark datasets using both voxel-based (i.e., +SECOND) and point-based 3D detectors (i.e., PointRCNN) demonstrate that our +proposed ReDB approach outperforms existing 3D domain adaptation methods by a +large margin, improving 23.15% mAP on the nuScenes $\rightarrow$ KITTI task. +The code is available at https://github.com/zhuoxiao-chen/ReDB-DA-3Ddet.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +Material Palette: Extraction of Materials from a Single Image,Ivan Lopes · Fabio Pizzati · Raoul de Charette,https://astra-vision.github.io/MaterialPalette/,https://arxiv.org/abs/2311.17060v1,,2311.17060v1.pdf,Material Palette: Extraction of Materials from a Single Image,"In this paper, we propose a method to extract physically-based rendering +(PBR) materials from a single real-world image. We do so in two steps: first, +we map regions of the image to material concepts using a diffusion model, which +allows the sampling of texture images resembling each material in the scene. +Second, we benefit from a separate network to decompose the generated textures +into Spatially Varying BRDFs (SVBRDFs), providing us with materials ready to be +used in rendering applications. Our approach builds on existing synthetic +material libraries with SVBRDF ground truth, but also exploits a +diffusion-generated RGB texture dataset to allow generalization to new samples +using unsupervised domain adaptation (UDA). Our contributions are thoroughly +evaluated on synthetic and real-world datasets. We further demonstrate the +applicability of our method for editing 3D scenes with materials estimated from +real photographs. The code and models will be made open-source. Project page: +https://astra-vision.github.io/MaterialPalette/",cs.CV,"['cs.CV', 'cs.GR']" +DiffAssemble: A Unified Graph-Diffusion Model for 2D and 3D Reassembly,Gianluca Scarpellini · Stefano Fiorini · Francesco Giuliari · Pietro Morerio · Alessio Del Bue,https://iit-pavis.github.io/DiffAssemble/,https://arxiv.org/abs/2402.19302,,2402.19302.pdf,DiffAssemble: A Unified Graph-Diffusion Model for 2D and 3D Reassembly,"Reassembly tasks play a fundamental role in many fields and multiple +approaches exist to solve specific reassembly problems. In this context, we +posit that a general unified model can effectively address them all, +irrespective of the input data type (images, 3D, etc.). We introduce +DiffAssemble, a Graph Neural Network (GNN)-based architecture that learns to +solve reassembly tasks using a diffusion model formulation. Our method treats +the elements of a set, whether pieces of 2D patch or 3D object fragments, as +nodes of a spatial graph. Training is performed by introducing noise into the +position and rotation of the elements and iteratively denoising them to +reconstruct the coherent initial pose. DiffAssemble achieves state-of-the-art +(SOTA) results in most 2D and 3D reassembly tasks and is the first +learning-based approach that solves 2D puzzles for both rotation and +translation. 
Furthermore, we highlight its remarkable reduction in run-time, +performing 11 times faster than the quickest optimization-based method for +puzzle solving. Code available at https://github.com/IIT-PAVIS/DiffAssemble",cs.CV,['cs.CV'] +Situational Awareness Matters in 3D Vision Language Reasoning,Yunze Man · Liang-Yan Gui · Yu-Xiong Wang, ,https://arxiv.org/abs/2401.09340,,2401.09340.pdf,SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding,"3D vision-language grounding, which focuses on aligning language with the 3D +physical environment, stands as a cornerstone in the development of embodied +agents. In comparison to recent advancements in the 2D domain, grounding +language in 3D scenes faces several significant challenges: (i) the inherent +complexity of 3D scenes due to the diverse object configurations, their rich +attributes, and intricate relationships; (ii) the scarcity of paired 3D +vision-language data to support grounded learning; and (iii) the absence of a +unified learning framework to distill knowledge from grounded 3D data. In this +work, we aim to address these three major challenges in 3D vision-language by +examining the potential of systematically upscaling 3D vision-language learning +in indoor environments. We introduce the first million-scale 3D vision-language +dataset, SceneVerse, encompassing about 68K 3D indoor scenes and comprising +2.5M vision-language pairs derived from both human annotations and our scalable +scene-graph-based generation approach. We demonstrate that this scaling allows +for a unified pre-training framework, Grounded Pre-training for Scenes (GPS), +for 3D vision-language learning. Through extensive experiments, we showcase the +effectiveness of GPS by achieving state-of-the-art performance on all existing +3D visual grounding benchmarks. The vast potential of SceneVerse and GPS is +unveiled through zero-shot transfer experiments in the challenging 3D +vision-language tasks. Project website: https://scene-verse.github.io.",cs.CV,"['cs.CV', 'cs.AI', 'cs.CL', 'cs.LG', 'cs.RO']" +PaSCo: Urban 3D Panoptic Scene Completion with Uncertainty Awareness,Anh-Quan Cao · Angela Dai · Raoul de Charette,https://astra-vision.github.io/PaSCo/,https://arxiv.org/abs/2312.02158,,2312.02158.pdf,PaSCo: Urban 3D Panoptic Scene Completion with Uncertainty Awareness,"We propose the task of Panoptic Scene Completion (PSC) which extends the +recently popular Semantic Scene Completion (SSC) task with instance-level +information to produce a richer understanding of the 3D scene. Our PSC proposal +utilizes a hybrid mask-based technique on the non-empty voxels from sparse +multi-scale completions. Whereas the SSC literature overlooks uncertainty which +is critical for robotics applications, we instead propose an efficient +ensembling to estimate both voxel-wise and instance-wise uncertainties along +PSC. This is achieved by building on a multi-input multi-output (MIMO) +strategy, while improving performance and yielding better uncertainty for +little additional compute. Additionally, we introduce a technique to aggregate +permutation-invariant mask predictions. Our experiments demonstrate that our +method surpasses all baselines in both Panoptic Scene Completion and +uncertainty estimation on three large-scale autonomous driving datasets. 
Our +code and data are available at https://astra-vision.github.io/PaSCo .",cs.CV,"['cs.CV', 'cs.AI']" +DreamAvatar: Text-and-Shape Guided 3D Human Avatar Generation via Diffusion Models,Yukang Cao · Yan-Pei Cao · Kai Han · Ying Shan · Kwan-Yee K. Wong,https://yukangcao.github.io/DreamAvatar/,https://arxiv.org/html/2402.17292v1,,2402.17292v1.pdf,DivAvatar: Diverse 3D Avatar Generation with a Single Prompt,"Text-to-Avatar generation has recently made significant strides due to +advancements in diffusion models. However, most existing work remains +constrained by limited diversity, producing avatars with subtle differences in +appearance for a given text prompt. We design DivAvatar, a novel framework that +generates diverse avatars, empowering 3D creatives with a multitude of distinct +and richly varied 3D avatars from a single text prompt. Different from most +existing work that exploits scene-specific 3D representations such as NeRF, +DivAvatar finetunes a 3D generative model (i.e., EVA3D), allowing diverse +avatar generation from simply noise sampling in inference time. DivAvatar has +two key designs that help achieve generation diversity and visual quality. The +first is a noise sampling technique during training phase which is critical in +generating diverse appearances. The second is a semantic-aware zoom mechanism +and a novel depth loss, the former producing appearances of high textual +fidelity by separate fine-tuning of specific body parts and the latter +improving geometry quality greatly by smoothing the generated mesh in the +features space. Extensive experiments show that DivAvatar is highly versatile +in generating avatars of diverse appearances.",cs.CV,['cs.CV'] +Mirasol3B: A Multimodal Autoregressive Model for Time-Aligned and Contextual Modalities,AJ Piergiovanni · Isaac Noble · Dahun Kim · Michael Ryoo · Victor Gomes · Anelia Angelova, ,https://arxiv.org/abs/2311.05698,,2311.05698.pdf,Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities,"One of the main challenges of multimodal learning is the need to combine +heterogeneous modalities (e.g., video, audio, text). For example, video and +audio are obtained at much higher rates than text and are roughly aligned in +time. They are often not synchronized with text, which comes as a global +context, e.g., a title, or a description. Furthermore, video and audio inputs +are of much larger volumes, and grow as the video length increases, which +naturally requires more compute dedicated to these modalities and makes +modeling of long-range dependencies harder. + We here decouple the multimodal modeling, dividing it into separate, focused +autoregressive models, processing the inputs according to the characteristics +of the modalities. We propose a multimodal model, called Mirasol3B, consisting +of an autoregressive component for the time-synchronized modalities (audio and +video), and an autoregressive component for the context modalities which are +not necessarily aligned in time but are still sequential. To address the +long-sequences of the video-audio inputs, we propose to further partition the +video and audio sequences in consecutive snippets and autoregressively process +their representations. To that end, we propose a Combiner mechanism, which +models the audio-video information jointly within a timeframe. 
The Combiner +learns to extract audio and video features from raw spatio-temporal signals, +and then learns to fuse these features producing compact but expressive +representations per snippet. + Our approach achieves the state-of-the-art on well established multimodal +benchmarks, outperforming much larger models. It effectively addresses the high +computational demand of media inputs by both learning compact representations, +controlling the sequence length of the audio-video feature representations, and +modeling their dependencies in time.",cs.CV,['cs.CV'] +Discontinuity-preserving Normal Integration with Auxiliary Edges,Hyomin Kim · Yucheol Jung · Seungyong Lee, ,https://arxiv.org/abs/2404.03138,,2404.03138.pdf,Discontinuity-preserving Normal Integration with Auxiliary Edges,"Many surface reconstruction methods incorporate normal integration, which is +a process to obtain a depth map from surface gradients. In this process, the +input may represent a surface with discontinuities, e.g., due to +self-occlusion. To reconstruct an accurate depth map from the input normal map, +hidden surface gradients occurring from the jumps must be handled. To model +these jumps correctly, we design a novel discretization scheme for the domain +of normal integration. Our key idea is to introduce auxiliary edges, which +bridge between piecewise-smooth patches in the domain so that the magnitude of +hidden jumps can be explicitly expressed. Using the auxiliary edges, we design +a novel algorithm to optimize the discontinuity and the depth map from the +input normal map. Our method optimizes discontinuities by using a combination +of iterative re-weighted least squares and iterative filtering of the jump +magnitudes on auxiliary edges to provide strong sparsity regularization. +Compared to previous discontinuity-preserving normal integration methods, which +model the magnitudes of jumps only implicitly, our method reconstructs subtle +discontinuities accurately thanks to our explicit representation of jumps +allowing for strong sparsity regularization.",cs.CV,"['cs.CV', 'cs.GR', 'I.4.5']" +Depth Information Assisted Collaborative Mutual Promotion Network for Single Image Dehazing,Yafei Zhang · Shen Zhou · Huafeng Li, ,https://arxiv.org/abs/2403.01105,,2403.01105.pdf,Depth Information Assisted Collaborative Mutual Promotion Network for Single Image Dehazing,"Recovering a clear image from a single hazy image is an open inverse problem. +Although significant research progress has been made, most existing methods +ignore the effect that downstream tasks play in promoting upstream dehazing. +From the perspective of the haze generation mechanism, there is a potential +relationship between the depth information of the scene and the hazy image. +Based on this, we propose a dual-task collaborative mutual promotion framework +to achieve the dehazing of a single image. This framework integrates depth +estimation and dehazing by a dual-task interaction mechanism and achieves +mutual enhancement of their performance. To realize the joint optimization of +the two tasks, an alternative implementation mechanism with the difference +perception is developed. On the one hand, the difference perception between the +depth maps of the dehazing result and the ideal image is proposed to promote +the dehazing network to pay attention to the non-ideal areas of the dehazing. 
On the other hand, by improving the depth estimation performance in the
+difficult-to-recover areas of the hazy image, the dehazing network can
+explicitly use the depth information of the hazy image to assist the clear
+image recovery. To promote the depth estimation, we propose to use the
+difference between the dehazed image and the ground truth to guide the depth
+estimation network to focus on the dehazed unideal areas. It allows dehazing
+and depth estimation to leverage their strengths in a mutually reinforcing
+manner. Experimental results show that the proposed method can achieve better
+performance than that of the state-of-the-art approaches.",cs.CV,['cs.CV']
+ExtDM: Distribution Extrapolation Diffusion Model for Video Prediction,Zhicheng Zhang · Junyao Hu · Wentao Cheng · Danda Paudel · Jufeng Yang,https://zzcheng.top/ExtDM/,,https://junyaohu.github.io/publication/,,,,,nan
+VoCo: A Simple-yet-Effective Volume Contrastive Learning Framework for 3D Medical Image Analysis,Linshan Wu · Jia-Xin Zhuang · Hao Chen, ,https://arxiv.org/abs/2402.17300v1,,2402.17300v1.pdf,VoCo: A Simple-yet-Effective Volume Contrastive Learning Framework for 3D Medical Image Analysis,"Self-Supervised Learning (SSL) has demonstrated promising results in 3D
+medical image analysis. However, the lack of high-level semantics in
+pre-training still heavily hinders the performance of downstream tasks. We
+observe that 3D medical images contain relatively consistent contextual
+position information, i.e., consistent geometric relations between different
+organs, which leads to a potential way for us to learn consistent semantic
+representations in pre-training. In this paper, we propose a
+simple-yet-effective Volume Contrast (VoCo) framework to leverage the
+contextual position priors for pre-training. Specifically, we first generate a
+group of base crops from different regions while enforcing feature discrepancy
+among them, where we employ them as class assignments of different regions.
+Then, we randomly crop sub-volumes and predict them belonging to which class
+(located at which region) by contrasting their similarity to different base
+crops, which can be seen as predicting contextual positions of different
+sub-volumes. Through this pretext task, VoCo implicitly encodes the contextual
+position priors into model representations without the guidance of annotations,
+enabling us to effectively improve the performance of downstream tasks that
+require high-level semantics. Extensive experimental results on six downstream
+tasks demonstrate the superior effectiveness of VoCo. Code will be available at
+https://github.com/Luffy03/VoCo.",eess.IV,['eess.IV']
+JoAPR: Cleaning the Lens of Prompt Learning for Vision-Language Models,YUNCHENG GUO · Xiaodong Gu, ,https://arxiv.org/abs/2312.01564,,2312.01564.pdf,APoLLo: Unified Adapter and Prompt Learning for Vision Language Models,"The choice of input text prompt plays a critical role in the performance of
+Vision-Language Pretrained (VLP) models such as CLIP. We present APoLLo, a
+unified multi-modal approach that combines Adapter and Prompt learning for
+Vision-Language models. Our method is designed to substantially improve the
+generalization capabilities of VLP models when they are fine-tuned in a
+few-shot setting. We introduce trainable cross-attention-based adapter layers
+in conjunction with vision and language encoders to strengthen the alignment
+between the two modalities.
We enforce consistency between the respective +encoder branches (receiving augmented inputs) to prevent overfitting in +downstream tasks. Our method is evaluated on three representative tasks: +generalization to novel classes, cross-dataset evaluation, and unseen domain +shifts. In practice, APoLLo achieves a relative gain up to 6.03% over MaPLe +(SOTA) on novel classes for 10 diverse image recognition datasets.",cs.LG,"['cs.LG', 'cs.AI', 'cs.CL', 'cs.CV']" +F$^3$Loc: Fusion and Filtering for Floorplan Localization,Changan Chen · Rui Wang · Christoph Vogel · Marc Pollefeys, ,https://arxiv.org/abs/2403.03370,,2403.03370.pdf,F$^3$Loc: Fusion and Filtering for Floorplan Localization,"In this paper we propose an efficient data-driven solution to +self-localization within a floorplan. Floorplan data is readily available, +long-term persistent and inherently robust to changes in the visual appearance. +Our method does not require retraining per map and location or demand a large +database of images of the area of interest. We propose a novel probabilistic +model consisting of an observation and a novel temporal filtering module. +Operating internally with an efficient ray-based representation, the +observation module consists of a single and a multiview module to predict +horizontal depth from images and fuses their results to benefit from advantages +offered by either methodology. Our method operates on conventional consumer +hardware and overcomes a common limitation of competing methods that often +demand upright images. Our full system meets real-time requirements, while +outperforming the state-of-the-art by a significant margin.",cs.CV,"['cs.CV', 'cs.RO']" +Open3DIS: Open-Vocabulary 3D Instance Segmentation with 2D Mask Guidance,Phuc Nguyen · Tuan Duc Ngo · Evangelos Kalogerakis · Chuang Gan · Anh Tran · Cuong Pham · Khoi Nguyen,https://open3dis.github.io/,https://arxiv.org/abs/2312.10671,,2312.10671.pdf,Open3DIS: Open-Vocabulary 3D Instance Segmentation with 2D Mask Guidance,"We introduce Open3DIS, a novel solution designed to tackle the problem of +Open-Vocabulary Instance Segmentation within 3D scenes. Objects within 3D +environments exhibit diverse shapes, scales, and colors, making precise +instance-level identification a challenging task. Recent advancements in +Open-Vocabulary scene understanding have made significant strides in this area +by employing class-agnostic 3D instance proposal networks for object +localization and learning queryable features for each 3D mask. While these +methods produce high-quality instance proposals, they struggle with identifying +small-scale and geometrically ambiguous objects. The key idea of our method is +a new module that aggregates 2D instance masks across frames and maps them to +geometrically coherent point cloud regions as high-quality object proposals +addressing the above limitations. These are then combined with 3D +class-agnostic instance proposals to include a wide range of objects in the +real world. 
To validate our approach, we conducted experiments on three +prominent datasets, including ScanNet200, S3DIS, and Replica, demonstrating +significant performance gains in segmenting objects with diverse categories +over the state-of-the-art approaches.",cs.CV,['cs.CV'] +Binarized Low-light Raw Video Enhancement,Gengchen Zhang · Yulun Zhang · Xin Yuan · Ying Fu, ,https://arxiv.org/abs/2403.19944,,2403.19944.pdf,Binarized Low-light Raw Video Enhancement,"Recently, deep neural networks have achieved excellent performance on +low-light raw video enhancement. However, they often come with high +computational complexity and large memory costs, which hinder their +applications on resource-limited devices. In this paper, we explore the +feasibility of applying the extremely compact binary neural network (BNN) to +low-light raw video enhancement. Nevertheless, there are two main issues with +binarizing video enhancement models. One is how to fuse the temporal +information to improve low-light denoising without complex modules. The other +is how to narrow the performance gap between binary convolutions with the full +precision ones. To address the first issue, we introduce a spatial-temporal +shift operation, which is easy-to-binarize and effective. The temporal shift +efficiently aggregates the features of neighbor frames and the spatial shift +handles the misalignment caused by the large motion in videos. For the second +issue, we present a distribution-aware binary convolution, which captures the +distribution characteristics of real-valued input and incorporates them into +plain binary convolutions to alleviate the degradation in performance. +Extensive quantitative and qualitative experiments have shown our +high-efficiency binarized low-light raw video enhancement method can attain a +promising performance.",cs.CV,"['cs.CV', 'eess.IV']" +Generating Non-Stationary Textures using Self-Rectification,Yang Zhou · Rongjun Xiao · Dani Lischinski · Daniel Cohen-Or · Hui Huang,https://vcc.tech/research/2024/TexRec,https://arxiv.org/abs/2401.02847,,2401.02847.pdf,Generating Non-Stationary Textures using Self-Rectification,"This paper addresses the challenge of example-based non-stationary texture +synthesis. We introduce a novel twostep approach wherein users first modify a +reference texture using standard image editing tools, yielding an initial rough +target for the synthesis. Subsequently, our proposed method, termed +""self-rectification"", automatically refines this target into a coherent, +seamless texture, while faithfully preserving the distinct visual +characteristics of the reference exemplar. Our method leverages a pre-trained +diffusion network, and uses self-attention mechanisms, to gradually align the +synthesized texture with the reference, ensuring the retention of the +structures in the provided target. Through experimental validation, our +approach exhibits exceptional proficiency in handling non-stationary textures, +demonstrating significant advancements in texture synthesis when compared to +existing state-of-the-art techniques. 
Code is available at +https://github.com/xiaorongjun000/Self-Rectification",cs.CV,"['cs.CV', 'cs.GR', 'cs.LG']" +SurMo: Surface-based 4D Motion Modeling for Dynamic Human Rendering,Tao Hu · Fangzhou Hong · Ziwei Liu, ,https://arxiv.org/abs/2404.01225,,2404.01225.pdf,SurMo: Surface-based 4D Motion Modeling for Dynamic Human Rendering,"Dynamic human rendering from video sequences has achieved remarkable progress +by formulating the rendering as a mapping from static poses to human images. +However, existing methods focus on the human appearance reconstruction of every +single frame while the temporal motion relations are not fully explored. In +this paper, we propose a new 4D motion modeling paradigm, SurMo, that jointly +models the temporal dynamics and human appearances in a unified framework with +three key designs: 1) Surface-based motion encoding that models 4D human +motions with an efficient compact surface-based triplane. It encodes both +spatial and temporal motion relations on the dense surface manifold of a +statistical body template, which inherits body topology priors for +generalizable novel view synthesis with sparse training observations. 2) +Physical motion decoding that is designed to encourage physical motion learning +by decoding the motion triplane features at timestep t to predict both spatial +derivatives and temporal derivatives at the next timestep t+1 in the training +stage. 3) 4D appearance decoding that renders the motion triplanes into images +by an efficient volumetric surface-conditioned renderer that focuses on the +rendering of body surfaces with motion learning conditioning. Extensive +experiments validate the state-of-the-art performance of our new paradigm and +illustrate the expressiveness of surface-based motion triplanes for rendering +high-fidelity view-consistent humans with fast motions and even +motion-dependent shadows. Our project page is at: +https://taohuumd.github.io/projects/SurMo/",cs.CV,['cs.CV'] +MultiDiff: Consistent Novel View Synthesis from a Single Image,Norman Müller · Katja Schwarz · Katja Schwarz · Barbara Roessle · Lorenzo Porzi · Samuel Rota Bulò · Matthias Nießner · Peter Kontschieder, ,,https://sirwyver.github.io/publications/,,,,,nan +Vector Graphics Generation via Mutually Impulsed Dual-domain Diffusion,Zhongyin Zhao · Ye Chen · Zhangli Hu · Xuanhong Chen · Bingbing Ni, ,https://arxiv.org/abs/2312.10540,,,VecFusion: Vector Font Generation with Diffusion,"We present VecFusion, a new neural architecture that can generate vector +fonts with varying topological structures and precise control point positions. +Our approach is a cascaded diffusion model which consists of a raster diffusion +model followed by a vector diffusion model. The raster model generates +low-resolution, rasterized fonts with auxiliary control point information, +capturing the global style and shape of the font, while the vector model +synthesizes vector fonts conditioned on the low-resolution raster fonts from +the first stage. To synthesize long and complex curves, our vector diffusion +model uses a transformer architecture and a novel vector representation that +enables the modeling of diverse vector geometry and the precise prediction of +control points. 
Our experiments show that, in contrast to previous generative +models for vector graphics, our new cascaded vector diffusion model generates +higher quality vector fonts, with complex structures and diverse styles.",cs.CV,"['cs.CV', 'cs.GR']" +Equivariant plug-and-play image reconstruction,Matthieu Terris · Thomas Moreau · Nelly Pustelnik · Julián Tachella, ,https://arxiv.org/html/2312.01831v2,,2312.01831v2.pdf,Equivariant plug-and-play image reconstruction,"Plug-and-play algorithms constitute a popular framework for solving inverse +imaging problems that rely on the implicit definition of an image prior via a +denoiser. These algorithms can leverage powerful pre-trained denoisers to solve +a wide range of imaging tasks, circumventing the necessity to train models on a +per-task basis. Unfortunately, plug-and-play methods often show unstable +behaviors, hampering their promise of versatility and leading to suboptimal +quality of reconstructed images. In this work, we show that enforcing +equivariance to certain groups of transformations (rotations, reflections, +and/or translations) on the denoiser strongly improves the stability of the +algorithm as well as its reconstruction quality. We provide a theoretical +analysis that illustrates the role of equivariance on better performance and +stability. We present a simple algorithm that enforces equivariance on any +existing denoiser by simply applying a random transformation to the input of +the denoiser and the inverse transformation to the output at each iteration of +the algorithm. Experiments on multiple imaging modalities and denoising +networks show that the equivariant plug-and-play algorithm improves both the +reconstruction performance and the stability compared to their non-equivariant +counterparts.",eess.IV,"['eess.IV', 'cs.CV']" +"SplaTAM: Splat, Track & Map 3D Gaussians for Dense RGB-D SLAM",Nikhil Keetha · Jay Karhade · Krishna Murthy Jatavallabhula · Gengshan Yang · Sebastian Scherer · Deva Ramanan · Jonathon Luiten,https://spla-tam.github.io/,https://arxiv.org/abs/2312.02126,,2312.02126.pdf,"SplaTAM: Splat, Track & Map 3D Gaussians for Dense RGB-D SLAM","Dense simultaneous localization and mapping (SLAM) is crucial for robotics +and augmented reality applications. However, current methods are often hampered +by the non-volumetric or implicit way they represent a scene. This work +introduces SplaTAM, an approach that, for the first time, leverages explicit +volumetric representations, i.e., 3D Gaussians, to enable high-fidelity +reconstruction from a single unposed RGB-D camera, surpassing the capabilities +of existing methods. SplaTAM employs a simple online tracking and mapping +system tailored to the underlying Gaussian representation. It utilizes a +silhouette mask to elegantly capture the presence of scene density. This +combination enables several benefits over prior representations, including fast +rendering and dense optimization, quickly determining if areas have been +previously mapped, and structured map expansion by adding more Gaussians. 
+Extensive experiments show that SplaTAM achieves up to 2x superior performance +in camera pose estimation, map construction, and novel-view synthesis over +existing methods, paving the way for more immersive high-fidelity SLAM +applications.",cs.CV,"['cs.CV', 'cs.AI', 'cs.RO']" +Motion-adaptive Separable Collaborative Filters for Blind Motion Deblurring,Chengxu Liu · Xuan Wang · Xiangyu Xu · Ruhao Tian · Shuai Li · Xueming Qian · Ming-Hsuan Yang, ,https://arxiv.org/abs/2404.13153,,2404.13153.pdf,Motion-adaptive Separable Collaborative Filters for Blind Motion Deblurring,"Eliminating image blur produced by various kinds of motion has been a +challenging problem. Dominant approaches rely heavily on model capacity to +remove blurring by reconstructing residual from blurry observation in feature +space. These practices not only prevent the capture of spatially variable +motion in the real world but also ignore the tailored handling of various +motions in image space. In this paper, we propose a novel real-world deblurring +filtering model called the Motion-adaptive Separable Collaborative (MISC) +Filter. In particular, we use a motion estimation network to capture motion +information from neighborhoods, thereby adaptively estimating spatially-variant +motion flow, mask, kernels, weights, and offsets to obtain the MISC Filter. The +MISC Filter first aligns the motion-induced blurring patterns to the motion +middle along the predicted flow direction, and then collaboratively filters the +aligned image through the predicted kernels, weights, and offsets to generate +the output. This design can handle more generalized and complex motion in a +spatially differentiated manner. Furthermore, we analyze the relationships +between the motion estimation network and the residual reconstruction network. +Extensive experiments on four widely used benchmarks demonstrate that our +method provides an effective solution for real-world motion blur removal and +achieves state-of-the-art performance. Code is available at +https://github.com/ChengxuLiu/MISCFilter",eess.IV,"['eess.IV', 'cs.CV']" +BoQ: A Place is Worth a Bag of Learnable Queries,Amar Ali-bey · Brahim Chaib-draa · Philippe Giguère, ,https://arxiv.org/abs/2405.07364,,2405.07364.pdf,BoQ: A Place is Worth a Bag of Learnable Queries,"In visual place recognition, accurately identifying and matching images of +locations under varying environmental conditions and viewpoints remains a +significant challenge. In this paper, we introduce a new technique, called +Bag-of-Queries (BoQ), which learns a set of global queries designed to capture +universal place-specific attributes. Unlike existing methods that employ +self-attention and generate the queries directly from the input features, BoQ +employs distinct learnable global queries, which probe the input features via +cross-attention, ensuring consistent information aggregation. In addition, our +technique provides an interpretable attention mechanism and integrates with +both CNN and Vision Transformer backbones. The performance of BoQ is +demonstrated through extensive experiments on 14 large-scale benchmarks. It +consistently outperforms current state-of-the-art techniques including NetVLAD, +MixVPR and EigenPlaces. Moreover, as a global retrieval technique (one-stage), +BoQ surpasses two-stage retrieval methods, such as Patch-NetVLAD, TransVPR and +R2Former, all while being orders of magnitude faster and more efficient. 
The +code and model weights are publicly available at +https://github.com/amaralibey/Bag-of-Queries.",cs.CV,['cs.CV'] +Deformable One-shot Face Stylization via DINO Semantic Guidance,Yang Zhou · Zichong Chen · Hui Huang,https://vcc.tech/research/2024/DoesFS,https://arxiv.org/abs/2403.00459,,2403.00459.pdf,Deformable One-shot Face Stylization via DINO Semantic Guidance,"This paper addresses the complex issue of one-shot face stylization, focusing +on the simultaneous consideration of appearance and structure, where previous +methods have fallen short. We explore deformation-aware face stylization that +diverges from traditional single-image style reference, opting for a real-style +image pair instead. The cornerstone of our method is the utilization of a +self-supervised vision transformer, specifically DINO-ViT, to establish a +robust and consistent facial structure representation across both real and +style domains. Our stylization process begins by adapting the StyleGAN +generator to be deformation-aware through the integration of spatial +transformers (STN). We then introduce two innovative constraints for generator +fine-tuning under the guidance of DINO semantics: i) a directional deformation +loss that regulates directional vectors in DINO space, and ii) a relative +structural consistency constraint based on DINO token self-similarities, +ensuring diverse generation. Additionally, style-mixing is employed to align +the color generation with the reference, minimizing inconsistent +correspondences. This framework delivers enhanced deformability for general +one-shot face stylization, achieving notable efficiency with a fine-tuning +duration of approximately 10 minutes. Extensive qualitative and quantitative +comparisons demonstrate our superiority over state-of-the-art one-shot face +stylization methods. Code is available at https://github.com/zichongc/DoesFS",cs.CV,['cs.CV'] +Direct2.5: Diverse Text-to-3D Generation via Multi-view 2.5D Diffusion,Yuanxun Lu · Jingyang Zhang · Shiwei Li · Tian Fang · David McKinnon · Yanghai Tsin · Long Quan · Xun Cao · Yao Yao,https://nju-3dv.github.io/projects/direct25/,https://arxiv.org/abs/2311.15980,,2311.15980.pdf,Direct2.5: Diverse Text-to-3D Generation via Multi-view 2.5D Diffusion,"Recent advances in generative AI have unveiled significant potential for the +creation of 3D content. However, current methods either apply a pre-trained 2D +diffusion model with the time-consuming score distillation sampling (SDS), or a +direct 3D diffusion model trained on limited 3D data losing generation +diversity. In this work, we approach the problem by employing a multi-view 2.5D +diffusion fine-tuned from a pre-trained 2D diffusion model. The multi-view 2.5D +diffusion directly models the structural distribution of 3D data, while still +maintaining the strong generalization ability of the original 2D diffusion +model, filling the gap between 2D diffusion-based and direct 3D diffusion-based +methods for 3D content generation. During inference, multi-view normal maps are +generated using the 2.5D diffusion, and a novel differentiable rasterization +scheme is introduced to fuse the almost consistent multi-view normal maps into +a consistent 3D model. We further design a normal-conditioned multi-view image +generation module for fast appearance generation given the 3D geometry. Our +method is a one-pass diffusion process and does not require any SDS +optimization as post-processing. 
We demonstrate through extensive experiments +that, our direct 2.5D generation with the specially-designed fusion scheme can +achieve diverse, mode-seeking-free, and high-fidelity 3D content generation in +only 10 seconds. Project page: https://nju-3dv.github.io/projects/direct25.",cs.CV,['cs.CV'] +Learning CNN on ViT: A Hybrid Model to Explicitly Class-specific Boundaries for Domain Adaptation,Ba Hung Ngo · Nhat-Tuong Do-Tran · Tuan-Ngoc Nguyen · Hae-Gon Jeon · Tae Jong Choi,https://dotrannhattuong.github.io/ECB/website/,https://arxiv.org/abs/2403.18360,,2403.18360.pdf,Learning CNN on ViT: A Hybrid Model to Explicitly Class-specific Boundaries for Domain Adaptation,"Most domain adaptation (DA) methods are based on either a convolutional +neural networks (CNNs) or a vision transformers (ViTs). They align the +distribution differences between domains as encoders without considering their +unique characteristics. For instance, ViT excels in accuracy due to its +superior ability to capture global representations, while CNN has an advantage +in capturing local representations. This fact has led us to design a hybrid +method to fully take advantage of both ViT and CNN, called Explicitly +Class-specific Boundaries (ECB). ECB learns CNN on ViT to combine their +distinct strengths. In particular, we leverage ViT's properties to explicitly +find class-specific decision boundaries by maximizing the discrepancy between +the outputs of the two classifiers to detect target samples far from the source +support. In contrast, the CNN encoder clusters target features based on the +previously defined class-specific boundaries by minimizing the discrepancy +between the probabilities of the two classifiers. Finally, ViT and CNN mutually +exchange knowledge to improve the quality of pseudo labels and reduce the +knowledge discrepancies of these models. Compared to conventional DA methods, +our ECB achieves superior performance, which verifies its effectiveness in this +hybrid model. The project website can be found +https://dotrannhattuong.github.io/ECB/website.",cs.CV,['cs.CV'] +Versatile Navigation under Partial Observability via Value-Guided Diffusion Policy,Gengyu Zhang · Hao Tang · Yan Yan, ,https://arxiv.org/abs/2404.02176,,2404.02176.pdf,Versatile Navigation under Partial Observability via Value-guided Diffusion Policy,"Route planning for navigation under partial observability plays a crucial +role in modern robotics and autonomous driving. Existing route planning +approaches can be categorized into two main classes: traditional autoregressive +and diffusion-based methods. The former often fails due to its myopic nature, +while the latter either assumes full observability or struggles to adapt to +unfamiliar scenarios, due to strong couplings with behavior cloning from +experts. To address these deficiencies, we propose a versatile diffusion-based +approach for both 2D and 3D route planning under partial observability. +Specifically, our value-guided diffusion policy first generates plans to +predict actions across various timesteps, providing ample foresight to the +planning. It then employs a differentiable planner with state estimations to +derive a value function, directing the agent's exploration and goal-seeking +behaviors without seeking experts while explicitly addressing partial +observability. During inference, our policy is further enhanced by a +best-plan-selection strategy, substantially boosting the planning success rate. 
+Moreover, we propose projecting point clouds, derived from RGB-D inputs, onto +2D grid-based bird-eye-view maps via semantic segmentation, generalizing to 3D +environments. This simple yet effective adaption enables zero-shot transfer +from 2D-trained policy to 3D, cutting across the laborious training for 3D +policy, and thus certifying our versatility. Experimental results demonstrate +our superior performance, particularly in navigating situations beyond expert +demonstrations, surpassing state-of-the-art autoregressive and diffusion-based +baselines for both 2D and 3D scenarios.",cs.RO,"['cs.RO', 'cs.AI']" +PerAda: Parameter-Efficient Federated Learning Personalization with Generalization Guarantees,Chulin Xie · De-An Huang · Wenda Chu · Daguang Xu · Chaowei Xiao · Bo Li · Anima Anandkumar, ,https://arxiv.org/abs/2405.09771,,2405.09771.pdf,Harmonizing Generalization and Personalization in Federated Prompt Learning,"Federated Prompt Learning (FPL) incorporates large pre-trained +Vision-Language models (VLM) into federated learning through prompt tuning. The +transferable representations and remarkable generalization capacity of VLM make +them highly compatible with the integration of federated learning. Addressing +data heterogeneity in federated learning requires personalization, but +excessive focus on it across clients could compromise the model's ability to +generalize effectively. To preserve the impressive generalization capability of +VLM, it is crucial to strike a balance between personalization and +generalization in FPL. To tackle this challenge, we proposed Federated Prompt +Learning with CLIP Generalization and low-rank Personalization (FedPGP), which +employs pre-trained CLIP to provide knowledge-guidance on the global prompt for +improved generalization and incorporates a low-rank adaptation term to +personalize the global prompt. Further, FedPGP integrates a prompt-wise +contrastive loss to achieve knowledge guidance and personalized adaptation +simultaneously, enabling a harmonious balance between personalization and +generalization in FPL. We conduct extensive experiments on various datasets to +explore base-to-novel generalization in both category-level and domain-level +scenarios with heterogeneous data, showing the superiority of FedPGP in +balancing generalization and personalization.",cs.LG,['cs.LG'] +Bi-Causal: Group Activity Recognition via Bidirectional Causality,Youliang Zhang · Wenxuan Liu · danni xu · Zhuo Zhou · Zheng Wang, ,https://arxiv.org/html/2312.00404v1,,2312.00404v1.pdf,A Causality-Aware Pattern Mining Scheme for Group Activity Recognition in a Pervasive Sensor Space,"Human activity recognition (HAR) is a key challenge in pervasive computing +and its solutions have been presented based on various disciplines. +Specifically, for HAR in a smart space without privacy and accessibility +issues, data streams generated by deployed pervasive sensors are leveraged. In +this paper, we focus on a group activity by which a group of users perform a +collaborative task without user identification and propose an efficient group +activity recognition scheme which extracts causality patterns from pervasive +sensor event sequences generated by a group of users to support as good +recognition accuracy as the state-of-the-art graphical model. To filter out +irrelevant noise events from a given data stream, a set of rules is leveraged +to highlight causally related events. 
Then, a pattern-tree algorithm extracts +frequent causal patterns by means of a growing tree structure. Based on the +extracted patterns, a weighted sum-based pattern matching algorithm computes +the likelihoods of stored group activities to the given test event sequence by +means of matched event pattern counts for group activity recognition. We +evaluate the proposed scheme using the data collected from our testbed and +CASAS datasets where users perform their tasks on a daily basis and validate +its effectiveness in a real environment. Experiment results show that the +proposed scheme performs higher recognition accuracy and with a small amount of +runtime overhead than the existing schemes.",cs.LG,"['cs.LG', 'cs.DB']" +MetaCloak: Preventing Unauthorized Subject-driven Text-to-image Diffusion-based Synthesis via Meta-learning,Yixin Liu · Chenrui Fan · Yutong Dai · Xun Chen · Pan Zhou · Lichao Sun, ,https://arxiv.org/abs/2311.13127v3,,2311.13127v3.pdf,MetaCloak: Preventing Unauthorized Subject-driven Text-to-image Diffusion-based Synthesis via Meta-learning,"Text-to-image diffusion models allow seamless generation of personalized +images from scant reference photos. Yet, these tools, in the wrong hands, can +fabricate misleading or harmful content, endangering individuals. To address +this problem, existing poisoning-based approaches perturb user images in an +imperceptible way to render them ""unlearnable"" from malicious uses. We identify +two limitations of these defending approaches: i) sub-optimal due to the +hand-crafted heuristics for solving the intractable bilevel optimization and +ii) lack of robustness against simple data transformations like Gaussian +filtering. To solve these challenges, we propose MetaCloak, which solves the +bi-level poisoning problem with a meta-learning framework with an additional +transformation sampling process to craft transferable and robust perturbation. +Specifically, we employ a pool of surrogate diffusion models to craft +transferable and model-agnostic perturbation. Furthermore, by incorporating an +additional transformation process, we design a simple denoising-error +maximization loss that is sufficient for causing transformation-robust semantic +distortion and degradation in a personalized generation. Extensive experiments +on the VGGFace2 and CelebA-HQ datasets show that MetaCloak outperforms existing +approaches. Notably, MetaCloak can successfully fool online training services +like Replicate, in a black-box manner, demonstrating the effectiveness of +MetaCloak in real-world scenarios. Our code is available at +https://github.com/liuyixin-louis/MetaCloak.",cs.CV,"['cs.CV', 'cs.AI', 'cs.CR']" +BIVDiff: A Training-free Framework for General-Purpose Video Synthesis via Bridging Image and Video Diffusion Models,Fengyuan Shi · Jiaxi Gu · Hang Xu · Songcen Xu · Wei Zhang · Limin Wang,https://github.com/MCG-NJU/BIVDiff,https://arxiv.org/abs/2312.02813,,2312.02813.pdf,BIVDiff: A Training-Free Framework for General-Purpose Video Synthesis via Bridging Image and Video Diffusion Models,"Diffusion models have made tremendous progress in text-driven image and video +generation. Now text-to-image foundation models are widely applied to various +downstream image synthesis tasks, such as controllable image generation and +image editing, while downstream video synthesis tasks are less explored for +several reasons. First, it requires huge memory and computation overhead to +train a video generation foundation model. 
Even with video foundation models, +additional costly training is still required for downstream video synthesis +tasks. Second, although some works extend image diffusion models into videos in +a training-free manner, temporal consistency cannot be well preserved. Finally, +these adaption methods are specifically designed for one task and fail to +generalize to different tasks. To mitigate these issues, we propose a +training-free general-purpose video synthesis framework, coined as {\bf +BIVDiff}, via bridging specific image diffusion models and general +text-to-video foundation diffusion models. Specifically, we first use a +specific image diffusion model (e.g., ControlNet and Instruct Pix2Pix) for +frame-wise video generation, then perform Mixed Inversion on the generated +video, and finally input the inverted latents into the video diffusion models +(e.g., VidRD and ZeroScope) for temporal smoothing. This decoupled framework +enables flexible image model selection for different purposes with strong task +generalization and high efficiency. To validate the effectiveness and general +use of BIVDiff, we perform a wide range of video synthesis tasks, including +controllable video generation, video editing, video inpainting, and +outpainting.",cs.CV,"['cs.CV', 'cs.AI']" +A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames,Pinelopi Papalampidi · Skanda Koppula · Shreya Pathak · Justin Chiu · Joseph Heyward · Viorica Patraucean · Jiajun Shen · Antoine Miech · Andrew Zisserman · Aida Nematzadeh, ,https://arxiv.org/abs/2312.07395,,2312.07395.pdf,A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames,"Understanding long, real-world videos requires modeling of long-range visual +dependencies. To this end, we explore video-first architectures, building on +the common paradigm of transferring large-scale, image--text models to video +via shallow temporal fusion. However, we expose two limitations to the +approach: (1) decreased spatial capabilities, likely due to poor +video--language alignment in standard video datasets, and (2) higher memory +consumption, bottlenecking the number of frames that can be processed. To +mitigate the memory bottleneck, we systematically analyze the memory/accuracy +trade-off of various efficient methods: factorized attention, +parameter-efficient image-to-video adaptation, input masking, and +multi-resolution patchification. Surprisingly, simply masking large portions of +the video (up to 75%) during contrastive pre-training proves to be one of the +most robust ways to scale encoders to videos up to 4.3 minutes at 1 FPS. 
Our +simple approach for training long video-to-text models, which scales to 1B +parameters, does not add new architectural complexity and is able to outperform +the popular paradigm of using much larger LLMs as an information aggregator +over segment-based information on benchmarks with long-range temporal +dependencies (YouCook2, EgoSchema).",cs.CV,"['cs.CV', 'cs.CL']" +DaReNeRF: Direction-aware Representation for Dynamic Scenes,Ange Lou · Benjamin Planche · Zhongpai Gao · Yamin Li · Tianyu Luan · Hao Ding · Terrence Chen · Jack Noble · Ziyan Wu, ,https://arxiv.org/abs/2403.02265v1,,2403.02265v1.pdf,DaReNeRF: Direction-aware Representation for Dynamic Scenes,"Addressing the intricate challenge of modeling and re-rendering dynamic +scenes, most recent approaches have sought to simplify these complexities using +plane-based explicit representations, overcoming the slow training time issues +associated with methods like Neural Radiance Fields (NeRF) and implicit +representations. However, the straightforward decomposition of 4D dynamic +scenes into multiple 2D plane-based representations proves insufficient for +re-rendering high-fidelity scenes with complex motions. In response, we present +a novel direction-aware representation (DaRe) approach that captures scene +dynamics from six different directions. This learned representation undergoes +an inverse dual-tree complex wavelet transformation (DTCWT) to recover +plane-based information. DaReNeRF computes features for each space-time point +by fusing vectors from these recovered planes. Combining DaReNeRF with a tiny +MLP for color regression and leveraging volume rendering in training yield +state-of-the-art performance in novel view synthesis for complex dynamic +scenes. Notably, to address redundancy introduced by the six real and six +imaginary direction-aware wavelet coefficients, we introduce a trainable +masking approach, mitigating storage issues without significant performance +decline. Moreover, DaReNeRF maintains a 2x reduction in training time compared +to prior art while delivering superior performance.",cs.CV,"['cs.CV', 'cs.GR']" +Learning without Exact Guidance: Updating Large-scale High-resolution Land Cover Maps from Low-resolution Historical Labels,Zhuohong Li · Wei He · Jiepan Li · Fangxiao Lu · Hongyan Zhang, ,https://arxiv.org/abs/2403.02746,,2403.02746.pdf,Learning without Exact Guidance: Updating Large-scale High-resolution Land Cover Maps from Low-resolution Historical Labels,"Large-scale high-resolution (HR) land-cover mapping is a vital task to survey +the Earth's surface and resolve many challenges facing humanity. However, it is +still a non-trivial task hindered by complex ground details, various landforms, +and the scarcity of accurate training labels over a wide-span geographic area. +In this paper, we propose an efficient, weakly supervised framework +(Paraformer) to guide large-scale HR land-cover mapping with easy-access +historical land-cover data of low resolution (LR). Specifically, existing +land-cover mapping approaches reveal the dominance of CNNs in preserving local +ground details but still suffer from insufficient global modeling in various +landforms. Therefore, we design a parallel CNN-Transformer feature extractor in +Paraformer, consisting of a downsampling-free CNN branch and a Transformer +branch, to jointly capture local and global contextual information. 
Besides, +facing the spatial mismatch of training data, a pseudo-label-assisted training +(PLAT) module is adopted to reasonably refine LR labels for weakly supervised +semantic segmentation of HR images. Experiments on two large-scale datasets +demonstrate the superiority of Paraformer over other state-of-the-art methods +for automatically updating HR land-cover maps from LR historical labels.",cs.CV,"['cs.CV', 'cs.LG']" +SG-PGM: Partial Graph Matching Network with Semantic Geometric Fusion for 3D Scene Graph Alignment and Its Downstream Tasks,Yaxu Xie · Alain Pagani · Didier Stricker, ,https://arxiv.org/abs/2403.19474,,2403.19474.pdf,SG-PGM: Partial Graph Matching Network with Semantic Geometric Fusion for 3D Scene Graph Alignment and Its Downstream Tasks,"Scene graphs have been recently introduced into 3D spatial understanding as a +comprehensive representation of the scene. The alignment between 3D scene +graphs is the first step of many downstream tasks such as scene graph aided +point cloud registration, mosaicking, overlap checking, and robot navigation. +In this work, we treat 3D scene graph alignment as a partial graph-matching +problem and propose to solve it with a graph neural network. We reuse the +geometric features learned by a point cloud registration method and associate +the clustered point-level geometric features with the node-level semantic +feature via our designed feature fusion module. Partial matching is enabled by +using a learnable method to select the top-k similar node pairs. Subsequent +downstream tasks such as point cloud registration are achieved by running a +pre-trained registration network within the matched regions. We further propose +a point-matching rescoring method, that uses the node-wise alignment of the 3D +scene graph to reweight the matching candidates from a pre-trained point cloud +registration method. It reduces the false point correspondences estimated +especially in low-overlapping cases. Experiments show that our method improves +the alignment accuracy by 10~20% in low-overlap and random transformation +scenarios and outperforms the existing work in multiple downstream tasks.",cs.CV,"['cs.CV', 'cs.RO']" +Frequency-Adaptive Dilated Convolution for Semantic Segmentation,Linwei Chen · Lin Gu · Dezhi Zheng · Ying Fu,https://github.com/Linwei-Chen/FADC,https://arxiv.org/abs/2403.05369,,2403.05369.pdf,Frequency-Adaptive Dilated Convolution for Semantic Segmentation,"Dilated convolution, which expands the receptive field by inserting gaps +between its consecutive elements, is widely employed in computer vision. In +this study, we propose three strategies to improve individual phases of dilated +convolution from the view of spectrum analysis. Departing from the conventional +practice of fixing a global dilation rate as a hyperparameter, we introduce +Frequency-Adaptive Dilated Convolution (FADC), which dynamically adjusts +dilation rates spatially based on local frequency components. Subsequently, we +design two plug-in modules to directly enhance effective bandwidth and +receptive field size. The Adaptive Kernel (AdaKern) module decomposes +convolution weights into low-frequency and high-frequency components, +dynamically adjusting the ratio between these components on a per-channel +basis. By increasing the high-frequency part of convolution weights, AdaKern +captures more high-frequency components, thereby improving effective bandwidth. 
+The Frequency Selection (FreqSelect) module optimally balances high- and +low-frequency components in feature representations through spatially variant +reweighting. It suppresses high frequencies in the background to encourage FADC +to learn a larger dilation, thereby increasing the receptive field for an +expanded scope. Extensive experiments on segmentation and object detection +consistently validate the efficacy of our approach. The code is publicly +available at https://github.com/Linwei-Chen/FADC.",cs.CV,['cs.CV'] +Distilled Datamodel with Reverse Gradient Matching,Jingwen Ye · Ruonan Yu · Songhua Liu · Xinchao Wang, ,https://arxiv.org/abs/2404.14006,,2404.14006.pdf,Distilled Datamodel with Reverse Gradient Matching,"The proliferation of large-scale AI models trained on extensive datasets has +revolutionized machine learning. With these models taking on increasingly +central roles in various applications, the need to understand their behavior +and enhance interpretability has become paramount. To investigate the impact of +changes in training data on a pre-trained model, a common approach is +leave-one-out retraining. This entails systematically altering the training +dataset by removing specific samples to observe resulting changes within the +model. However, retraining the model for each altered dataset presents a +significant computational challenge, given the need to perform this operation +for every dataset variation. In this paper, we introduce an efficient framework +for assessing data impact, comprising offline training and online evaluation +stages. During the offline training phase, we approximate the influence of +training data on the target model through a distilled synset, formulated as a +reversed gradient matching problem. For online evaluation, we expedite the +leave-one-out process using the synset, which is then utilized to compute the +attribution matrix based on the evaluation objective. Experimental evaluations, +including training data attribution and assessments of data quality, +demonstrate that our proposed method achieves comparable model behavior +evaluation while significantly speeding up the process compared to the direct +retraining method.",cs.LG,"['cs.LG', 'cs.CV']" +Memory-based Adapters for Online 3D Scene Perception,Xiuwei Xu · Chong Xia · Ziwei Wang · Linqing Zhao · Linqing Zhao · Yueqi Duan · Jie Zhou · Jiwen Lu, ,https://arxiv.org/abs/2403.06974,,2403.06974.pdf,Memory-based Adapters for Online 3D Scene Perception,"In this paper, we propose a new framework for online 3D scene perception. +Conventional 3D scene perception methods are offline, i.e., take an already +reconstructed 3D scene geometry as input, which is not applicable in robotic +applications where the input data is streaming RGB-D videos rather than a +complete 3D scene reconstructed from pre-collected RGB-D videos. To deal with +online 3D scene perception tasks where data collection and perception should be +performed simultaneously, the model should be able to process 3D scenes frame +by frame and make use of the temporal information. To this end, we propose an +adapter-based plug-and-play module for the backbone of 3D scene perception +model, which constructs memory to cache and aggregate the extracted RGB-D +features to empower offline models with temporal learning ability. +Specifically, we propose a queued memory mechanism to cache the supporting +point cloud and image features. 
Then we devise aggregation modules which +directly perform on the memory and pass temporal information to current frame. +We further propose 3D-to-2D adapter to enhance image features with strong +global context. Our adapters can be easily inserted into mainstream offline +architectures of different tasks and significantly boost their performance on +online tasks. Extensive experiments on ScanNet and SceneNN datasets demonstrate +our approach achieves leading performance on three 3D scene perception tasks +compared with state-of-the-art online methods by simply finetuning existing +offline models, without any model and task-specific designs. +\href{https://xuxw98.github.io/Online3D/}{Project page}.",cs.CV,['cs.CV'] +Ungeneralizable Examples,Jingwen Ye · Xinchao Wang, ,https://arxiv.org/abs/2404.14016,,2404.14016.pdf,Ungeneralizable Examples,"The training of contemporary deep learning models heavily relies on publicly +available data, posing a risk of unauthorized access to online data and raising +concerns about data privacy. Current approaches to creating unlearnable data +involve incorporating small, specially designed noises, but these methods +strictly limit data usability, overlooking its potential usage in authorized +scenarios. In this paper, we extend the concept of unlearnable data to +conditional data learnability and introduce \textbf{U}n\textbf{G}eneralizable +\textbf{E}xamples (UGEs). UGEs exhibit learnability for authorized users while +maintaining unlearnability for potential hackers. The protector defines the +authorized network and optimizes UGEs to match the gradients of the original +data and its ungeneralizable version, ensuring learnability. To prevent +unauthorized learning, UGEs are trained by maximizing a designated distance +loss in a common feature space. Additionally, to further safeguard the +authorized side from potential attacks, we introduce additional undistillation +optimization. Experimental results on multiple datasets and various networks +demonstrate that the proposed UGEs framework preserves data usability while +reducing training performance on hacker networks, even under different types of +attacks.",cs.LG,"['cs.LG', 'cs.CV']" +ColorPCR: Color Point Cloud Registration with Multi-Stage Geometric-Color Fusion,Juncheng Mu · Lin Bie · Shaoyi Du · Yue Gao, ,,https://www.mdpi.com/2072-4292/16/5/743,,,,,nan +IIRP-Net: Iterative Inference Residual Pyramid Network for Enhanced Image Registration,Tai Ma · zhangsuwei · Jiafeng Li · Ying Wen, ,https://arxiv.org/html/2312.13396v1,,2312.13396v1.pdf,EPNet: An Efficient Pyramid Network for Enhanced Single-Image Super-Resolution with Reduced Computational Requirements,"Single-image super-resolution (SISR) has seen significant advancements +through the integration of deep learning. However, the substantial +computational and memory requirements of existing methods often limit their +practical application. This paper introduces a new Efficient Pyramid Network +(EPNet) that harmoniously merges an Edge Split Pyramid Module (ESPM) with a +Panoramic Feature Extraction Module (PFEM) to overcome the limitations of +existing methods, particularly in terms of computational efficiency. The ESPM +applies a pyramid-based channel separation strategy, boosting feature +extraction while maintaining computational efficiency. The PFEM, a novel fusion +of CNN and Transformer structures, enables the concurrent extraction of local +and global features, thereby providing a panoramic view of the image landscape. 
+Our architecture integrates the PFEM in a manner that facilitates the +streamlined exchange of feature information and allows for the further +refinement of image texture details. Experimental results indicate that our +model outperforms existing state-of-the-art methods in image resolution +quality, while considerably decreasing computational and memory costs. This +research contributes to the ongoing evolution of efficient and practical SISR +methodologies, bearing broader implications for the field of computer vision.",cs.CV,['cs.CV'] +Towards Efficient Replay in Federated Incremental Learning,Yichen Li · Qunwei Li · Haozhao Wang · Ruixuan Li · Wenliang Zhong · Guannan Zhang, ,https://arxiv.org/abs/2403.05890,,2403.05890.pdf,Towards Efficient Replay in Federated Incremental Learning,"In Federated Learning (FL), the data in each client is typically assumed +fixed or static. However, data often comes in an incremental manner in +real-world applications, where the data domain may increase dynamically. In +this work, we study catastrophic forgetting with data heterogeneity in +Federated Incremental Learning (FIL) scenarios where edge clients may lack +enough storage space to retain full data. We propose to employ a simple, +generic framework for FIL named Re-Fed, which can coordinate each client to +cache important samples for replay. More specifically, when a new task arrives, +each client first caches selected previous samples based on their global and +local importance. Then, the client trains the local model with both the cached +samples and the samples from the new task. Theoretically, we analyze the +ability of Re-Fed to discover important samples for replay thus alleviating the +catastrophic forgetting problem. Moreover, we empirically show that Re-Fed +achieves competitive performance compared to state-of-the-art methods.",cs.LG,"['cs.LG', 'cs.DC']" +Disentangled Pre-training for Human-Object Interaction Detection,Zhuolong Li · Xingao Li · Changxing Ding · Xiangmin Xu,https://github.com/xingaoli/DP-HOI,https://arxiv.org/abs/2404.01725,,2404.01725.pdf,Disentangled Pre-training for Human-Object Interaction Detection,"Detecting human-object interaction (HOI) has long been limited by the amount +of supervised data available. Recent approaches address this issue by +pre-training according to pseudo-labels, which align object regions with HOI +triplets parsed from image captions. However, pseudo-labeling is tricky and +noisy, making HOI pre-training a complex process. Therefore, we propose an +efficient disentangled pre-training method for HOI detection (DP-HOI) to +address this problem. First, DP-HOI utilizes object detection and action +recognition datasets to pre-train the detection and interaction decoder layers, +respectively. Then, we arrange these decoder layers so that the pre-training +architecture is consistent with the downstream HOI detection task. This +facilitates efficient knowledge transfer. Specifically, the detection decoder +identifies reliable human instances in each action recognition dataset image, +generates one corresponding query, and feeds it into the interaction decoder +for verb classification. Next, we combine the human instance verb predictions +in the same image and impose image-level supervision. The DP-HOI structure can +be easily adapted to the HOI detection task, enabling effective model parameter +initialization. Therefore, it significantly enhances the performance of +existing HOI detection models on a broad range of rare categories. 
The code and +pre-trained weight are available at https://github.com/xingaoli/DP-HOI.",cs.CV,['cs.CV'] +RegionGPT: Towards Region Understanding Vision Language Model,Qiushan Guo · Shalini De Mello · Danny Yin · Wonmin Byeon · Ka Chun Cheung · Yizhou Yu · Ping Luo · Sifei Liu,https://guoqiushan.github.io/regiongpt.github.io/,https://arxiv.org/abs/2403.02330v1,,2403.02330v1.pdf,RegionGPT: Towards Region Understanding Vision Language Model,"Vision language models (VLMs) have experienced rapid advancements through the +integration of large language models (LLMs) with image-text pairs, yet they +struggle with detailed regional visual understanding due to limited spatial +awareness of the vision encoder, and the use of coarse-grained training data +that lacks detailed, region-specific captions. To address this, we introduce +RegionGPT (short as RGPT), a novel framework designed for complex region-level +captioning and understanding. RGPT enhances the spatial awareness of regional +representation with simple yet effective modifications to existing visual +encoders in VLMs. We further improve performance on tasks requiring a specific +output scope by integrating task-guided instruction prompts during both +training and inference phases, while maintaining the model's versatility for +general-purpose tasks. Additionally, we develop an automated region caption +data generation pipeline, enriching the training set with detailed region-level +captions. We demonstrate that a universal RGPT model can be effectively applied +and significantly enhancing performance across a range of region-level tasks, +including but not limited to complex region descriptions, reasoning, object +classification, and referring expressions comprehension.",cs.CV,['cs.CV'] +Turb-Seg-Res: A Segment-then-Restore Pipeline for Dynamic Videos with Atmospheric Turbulence,Ripon Saha · Dehao Qin · Nianyi Li · Jinwei Ye · Suren Jayasuriya, ,https://arxiv.org/abs/2404.13605,,2404.13605.pdf,Turb-Seg-Res: A Segment-then-Restore Pipeline for Dynamic Videos with Atmospheric Turbulence,"Tackling image degradation due to atmospheric turbulence, particularly in +dynamic environment, remains a challenge for long-range imaging systems. +Existing techniques have been primarily designed for static scenes or scenes +with small motion. This paper presents the first segment-then-restore pipeline +for restoring the videos of dynamic scenes in turbulent environment. We +leverage mean optical flow with an unsupervised motion segmentation method to +separate dynamic and static scene components prior to restoration. After camera +shake compensation and segmentation, we introduce foreground/background +enhancement leveraging the statistics of turbulence strength and a transformer +model trained on a novel noise-based procedural turbulence generator for fast +dataset augmentation. Benchmarked against existing restoration methods, our +approach restores most of the geometric distortion and enhances sharpness for +videos. 
We make our code, simulator, and data publicly available to advance the +field of video restoration from turbulence: riponcs.github.io/TurbSegRes",cs.CV,"['cs.CV', 'eess.IV']" +Spin-UP: Spin Light for Natural Light Uncalibrated Photometric Stereo,Zongrui Li · Zhan Lu · Haojie Yan · Boxin Shi · Gang Pan · Qian Zheng · Xudong Jiang, ,https://arxiv.org/abs/2404.01612,,2404.01612.pdf,Spin-UP: Spin Light for Natural Light Uncalibrated Photometric Stereo,"Natural Light Uncalibrated Photometric Stereo (NaUPS) relieves the strict +environment and light assumptions in classical Uncalibrated Photometric Stereo +(UPS) methods. However, due to the intrinsic ill-posedness and high-dimensional +ambiguities, addressing NaUPS is still an open question. Existing works impose +strong assumptions on the environment lights and objects' material, restricting +the effectiveness in more general scenarios. Alternatively, some methods +leverage supervised learning with intricate models while lacking +interpretability, resulting in a biased estimation. In this work, we proposed +Spin Light Uncalibrated Photometric Stereo (Spin-UP), an unsupervised method to +tackle NaUPS in various environment lights and objects. The proposed method +uses a novel setup that captures the object's images on a rotatable platform, +which mitigates NaUPS's ill-posedness by reducing unknowns and provides +reliable priors to alleviate NaUPS's ambiguities. Leveraging neural inverse +rendering and the proposed training strategies, Spin-UP recovers surface +normals, environment light, and isotropic reflectance under complex natural +light with low computational cost. Experiments have shown that Spin-UP +outperforms other supervised / unsupervised NaUPS methods and achieves +state-of-the-art performance on synthetic and real-world datasets. Codes and +data are available at https://github.com/LMozart/CVPR2024-SpinUP.",cs.CV,['cs.CV'] +Unsupervised Video Domain Adaptation with Masked Pre-Training and Collaborative Self-Training,Arun Reddy · William Paul · Corban Rivera · Ketul Shah · Celso M. de Melo · Rama Chellappa, ,https://arxiv.org/abs/2312.02914,,2312.02914.pdf,Unsupervised Video Domain Adaptation with Masked Pre-Training and Collaborative Self-Training,"In this work, we tackle the problem of unsupervised domain adaptation (UDA) +for video action recognition. Our approach, which we call UNITE, uses an image +teacher model to adapt a video student model to the target domain. UNITE first +employs self-supervised pre-training to promote discriminative feature learning +on target domain videos using a teacher-guided masked distillation objective. +We then perform self-training on masked target data, using the video student +model and image teacher model together to generate improved pseudolabels for +unlabeled target videos. Our self-training process successfully leverages the +strengths of both models to achieve strong transfer performance across domains. +We evaluate our approach on multiple video domain adaptation benchmarks and +observe significant improvements upon previously reported results.",cs.CV,"['cs.CV', 'cs.LG']" +Would Deep Generative Models Amplify Bias in Future Models?,Tianwei Chen · Yusuke Hirota · Mayu Otani · Noa Garcia · Yuta Nakashima, ,https://arxiv.org/abs/2404.03242,,2404.03242.pdf,Would Deep Generative Models Amplify Bias in Future Models?,"We investigate the impact of deep generative models on potential social +biases in upcoming computer vision models. 
As the internet witnesses an +increasing influx of AI-generated images, concerns arise regarding inherent +biases that may accompany them, potentially leading to the dissemination of +harmful content. This paper explores whether a detrimental feedback loop, +resulting in bias amplification, would occur if generated images were used as +the training data for future models. We conduct simulations by progressively +substituting original images in COCO and CC3M datasets with images generated +through Stable Diffusion. The modified datasets are used to train OpenCLIP and +image captioning models, which we evaluate in terms of quality and bias. +Contrary to expectations, our findings indicate that introducing generated +images during training does not uniformly amplify bias. Instead, instances of +bias mitigation across specific tasks are observed. We further explore the +factors that may influence these phenomena, such as artifacts in image +generation (e.g., blurry faces) or pre-existing biases in the original +datasets.",cs.CV,['cs.CV'] +Prompt-Free Diffusion: Taking “Text” out of Text-to-Image Diffusion Models,Xingqian Xu · Jiayi Guo · Zhangyang Wang · Gao Huang · Irfan Essa · Humphrey Shi, ,,https://openreview.net/forum?id=QL3Zuth6E7,,,,,nan +Smooth Diffusion: Crafting Smooth Latent Spaces in Diffusion Models,Jiayi Guo · Xingqian Xu · Yifan Pu · Zanlin Ni · Chaofei Wang · Manushree Vasu · Shiji Song · Gao Huang · Humphrey Shi,https://shi-labs.github.io/Smooth-Diffusion/,https://arxiv.org/abs/2312.04410,,2312.04410.pdf,Smooth Diffusion: Crafting Smooth Latent Spaces in Diffusion Models,"Recently, diffusion models have made remarkable progress in text-to-image +(T2I) generation, synthesizing images with high fidelity and diverse contents. +Despite this advancement, latent space smoothness within diffusion models +remains largely unexplored. Smooth latent spaces ensure that a perturbation on +an input latent corresponds to a steady change in the output image. This +property proves beneficial in downstream tasks, including image interpolation, +inversion, and editing. In this work, we expose the non-smoothness of diffusion +latent spaces by observing noticeable visual fluctuations resulting from minor +latent variations. To tackle this issue, we propose Smooth Diffusion, a new +category of diffusion models that can be simultaneously high-performing and +smooth. Specifically, we introduce Step-wise Variation Regularization to +enforce the proportion between the variations of an arbitrary input latent and +that of the output image is a constant at any diffusion training step. In +addition, we devise an interpolation standard deviation (ISTD) metric to +effectively assess the latent space smoothness of a diffusion model. Extensive +quantitative and qualitative experiments demonstrate that Smooth Diffusion +stands out as a more desirable solution not only in T2I generation but also +across various downstream tasks. Smooth Diffusion is implemented as a +plug-and-play Smooth-LoRA to work with various community models. 
Code is +available at https://github.com/SHI-Labs/Smooth-Diffusion.",cs.CV,['cs.CV'] +PAIR Diffusion: A Comprehensive Multimodal Object-Level Image Editor,Vidit Goel · Elia Peruzzo · Yifan Jiang · Dejia Xu · Xingqian Xu · Nicu Sebe · Trevor Darrell · Zhangyang Wang · Humphrey Shi,https://vidit98.github.io/publication/conference-paper/pair_diff.html,,https://openreview.net/forum?id=cI5j8tEPNU,,,,,nan +Large Language Models are Good Prompt Learners for Low-Shot Image Classification,Zhaoheng Zheng · Jingmin Wei · Xuefeng Hu · Haidong Zhu · Ram Nevatia, ,https://arxiv.org/abs/2312.04076,,2312.04076.pdf,Large Language Models are Good Prompt Learners for Low-Shot Image Classification,"Low-shot image classification, where training images are limited or +inaccessible, has benefited from recent progress on pre-trained vision-language +(VL) models with strong generalizability, e.g. CLIP. Prompt learning methods +built with VL models generate text features from the class names that only have +confined class-specific information. Large Language Models (LLMs), with their +vast encyclopedic knowledge, emerge as the complement. Thus, in this paper, we +discuss the integration of LLMs to enhance pre-trained VL models, specifically +on low-shot classification. However, the domain gap between language and vision +blocks the direct application of LLMs. Thus, we propose LLaMP, Large Language +Models as Prompt learners, that produces adaptive prompts for the CLIP text +encoder, establishing it as the connecting bridge. Experiments show that, +compared with other state-of-the-art prompt learning methods, LLaMP yields +better performance on both zero-shot generalization and few-shot image +classification, over a spectrum of 11 datasets. Code will be made available at: +https://github.com/zhaohengz/LLaMP.",cs.CV,['cs.CV'] +SpikeNeRF: Learning Neural Radiance Fields from Continuous Spike Stream,Lin Zhu · Kangmin Jia · Yifan Zhao · Yunshan Qi · Lizhi Wang · Hua Huang, ,https://arxiv.org/abs/2403.11222,,2403.11222.pdf,SpikeNeRF: Learning Neural Radiance Fields from Continuous Spike Stream,"Spike cameras, leveraging spike-based integration sampling and high temporal +resolution, offer distinct advantages over standard cameras. However, existing +approaches reliant on spike cameras often assume optimal illumination, a +condition frequently unmet in real-world scenarios. To address this, we +introduce SpikeNeRF, the first work that derives a NeRF-based volumetric scene +representation from spike camera data. Our approach leverages NeRF's multi-view +consistency to establish robust self-supervision, effectively eliminating +erroneous measurements and uncovering coherent structures within exceedingly +noisy input amidst diverse real-world illumination scenarios. The framework +comprises two core elements: a spike generation model incorporating an +integrate-and-fire neuron layer and parameters accounting for non-idealities, +such as threshold variation, and a spike rendering loss capable of generalizing +across varying illumination conditions. We describe how to effectively optimize +neural radiance fields to render photorealistic novel views from the novel +continuous spike stream, demonstrating advantages over other vision sensors in +certain scenes. Empirical evaluations conducted on both real and novel +realistically simulated sequences affirm the efficacy of our methodology. 
The +dataset and source code are released at +https://github.com/BIT-Vision/SpikeNeRF.",cs.CV,['cs.CV'] +UVEB: A Large-scale Benchmark and Baseline Towards Real-World Underwater Video Enhancement,yaofeng xie · Lingwei Kong · Kai Chen · Zheng Ziqiang · Xiao Yu · Zhibin Yu · Bing Zheng,https://github.com/yzbouc/UVEB,https://arxiv.org/abs/2404.14542,,2404.14542.pdf,UVEB: A Large-scale Benchmark and Baseline Towards Real-World Underwater Video Enhancement,"Learning-based underwater image enhancement (UIE) methods have made great +progress. However, the lack of large-scale and high-quality paired training +samples has become the main bottleneck hindering the development of UIE. The +inter-frame information in underwater videos can accelerate or optimize the UIE +process. Thus, we constructed the first large-scale high-resolution underwater +video enhancement benchmark (UVEB) to promote the development of underwater +vision.It contains 1,308 pairs of video sequences and more than 453,000 +high-resolution with 38\% Ultra-High-Definition (UHD) 4K frame pairs. UVEB +comes from multiple countries, containing various scenes and video degradation +types to adapt to diverse and complex underwater environments. We also propose +the first supervised underwater video enhancement method, UVE-Net. UVE-Net +converts the current frame information into convolutional kernels and passes +them to adjacent frames for efficient inter-frame information exchange. By +fully utilizing the redundant degraded information of underwater videos, +UVE-Net completes video enhancement better. Experiments show the effective +network design and good performance of UVE-Net.",cs.CV,"['cs.CV', 'I.4']" +Single-View Scene Point Cloud Human Grasp Generation,Yan-Kang Wang · Chengyi Xing · Yi-Lin Wei · Xiao-Ming Wu · Wei-Shi Zheng, ,https://arxiv.org/abs/2404.15815,,2404.15815.pdf,Single-View Scene Point Cloud Human Grasp Generation,"In this work, we explore a novel task of generating human grasps based on +single-view scene point clouds, which more accurately mirrors the typical +real-world situation of observing objects from a single viewpoint. Due to the +incompleteness of object point clouds and the presence of numerous scene +points, the generated hand is prone to penetrating into the invisible parts of +the object and the model is easily affected by scene points. Thus, we introduce +S2HGrasp, a framework composed of two key modules: the Global Perception module +that globally perceives partial object point clouds, and the DiffuGrasp module +designed to generate high-quality human grasps based on complex inputs that +include scene points. Additionally, we introduce S2HGD dataset, which comprises +approximately 99,000 single-object single-view scene point clouds of 1,668 +unique objects, each annotated with one human grasp. Our extensive experiments +demonstrate that S2HGrasp can not only generate natural human grasps regardless +of scene points, but also effectively prevent penetration between the hand and +invisible parts of the object. Moreover, our model showcases strong +generalization capability when applied to unseen objects. 
Our code and dataset +are available at https://github.com/iSEE-Laboratory/S2HGrasp.",cs.CV,['cs.CV'] +MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception,Yiran Qin · Enshen Zhou · Qichang Liu · Zhenfei Yin · Lu Sheng · Ruimao Zhang · Yu Qiao · Jing Shao,https://iranqin.github.io/MP5.github.io/,https://arxiv.org/abs/2312.07472,,2312.07472.pdf,MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception,"It is a long-lasting goal to design an embodied system that can solve +long-horizon open-world tasks in human-like ways. However, existing approaches +usually struggle with compound difficulties caused by the logic-aware +decomposition and context-aware execution of these tasks. To this end, we +introduce MP5, an open-ended multimodal embodied system built upon the +challenging Minecraft simulator, which can decompose feasible sub-objectives, +design sophisticated situation-aware plans, and perform embodied action +control, with frequent communication with a goal-conditioned active perception +scheme. Specifically, MP5 is developed on top of recent advances in Multimodal +Large Language Models (MLLMs), and the system is modulated into functional +modules that can be scheduled and collaborated to ultimately solve pre-defined +context- and process-dependent tasks. Extensive experiments prove that MP5 can +achieve a 22% success rate on difficult process-dependent tasks and a 91% +success rate on tasks that heavily depend on the context. Moreover, MP5 +exhibits a remarkable ability to address many open-ended tasks that are +entirely novel.",cs.CV,['cs.CV'] +Scaling Up Video Summarization Pretraining with Large Language Models,Dawit Argaw Argaw · Seunghyun Yoon · Fabian Caba Heilbron · Hanieh Deilamsalehy · Trung Bui · Zhaowen Wang · Franck Dernoncourt · Joon Chung, ,https://arxiv.org/abs/2404.03398,,2404.03398.pdf,Scaling Up Video Summarization Pretraining with Large Language Models,"Long-form video content constitutes a significant portion of internet +traffic, making automated video summarization an essential research problem. +However, existing video summarization datasets are notably limited in their +size, constraining the effectiveness of state-of-the-art methods for +generalization. Our work aims to overcome this limitation by capitalizing on +the abundance of long-form videos with dense speech-to-video alignment and the +remarkable capabilities of recent large language models (LLMs) in summarizing +long text. We introduce an automated and scalable pipeline for generating a +large-scale video summarization dataset using LLMs as Oracle summarizers. By +leveraging the generated dataset, we analyze the limitations of existing +approaches and propose a new video summarization model that effectively +addresses them. To facilitate further research in the field, our work also +presents a new benchmark dataset that contains 1200 long videos each with +high-quality summaries annotated by professionals. 
Extensive experiments +clearly indicate that our proposed approach sets a new state-of-the-art in +video summarization across several benchmarks.",cs.CV,['cs.CV'] +CARZero: Cross-Attention Alignment for Radiology Zero-Shot Classification,Haoran Lai · Qingsong Yao · Zihang Jiang · Rongsheng Wang · Zhiyang He · Xiaodong Tao · S Kevin Zhou, ,https://arxiv.org/abs/2402.17417,,2402.17417.pdf,CARZero: Cross-Attention Alignment for Radiology Zero-Shot Classification,"The advancement of Zero-Shot Learning in the medical domain has been driven +forward by using pre-trained models on large-scale image-text pairs, focusing +on image-text alignment. However, existing methods primarily rely on cosine +similarity for alignment, which may not fully capture the complex relationship +between medical images and reports. To address this gap, we introduce a novel +approach called Cross-Attention Alignment for Radiology Zero-Shot +Classification (CARZero). Our approach innovatively leverages cross-attention +mechanisms to process image and report features, creating a Similarity +Representation that more accurately reflects the intricate relationships in +medical semantics. This representation is then linearly projected to form an +image-text similarity matrix for cross-modality alignment. Additionally, +recognizing the pivotal role of prompt selection in zero-shot learning, CARZero +incorporates a Large Language Model-based prompt alignment strategy. This +strategy standardizes diverse diagnostic expressions into a unified format for +both training and inference phases, overcoming the challenges of manual prompt +design. Our approach is simple yet effective, demonstrating state-of-the-art +performance in zero-shot classification on five official chest radiograph +diagnostic test sets, including remarkable results on datasets with long-tail +distributions of rare diseases. This achievement is attributed to our new +image-text alignment strategy, which effectively addresses the complex +relationship between medical images and reports. Code and models are available +at https://github.com/laihaoran/CARZero.",cs.CV,['cs.CV'] +LiDAR-Net: A Real-scanned 3D Point Cloud Dataset for Indoor Scenes,Yanwen Guo · Yuanqi Li · Dayong Ren · Xiaohong Zhang · Jiawei Li · Liang Pu · Changfeng Ma · xiaoyu zhan · Jie Guo · Mingqiang Wei · Yan Zhang · Piaopiao Yu · Shuangyu Yang · Donghao Ji · Huisheng Ye · Hao Sun · Yansong Liu · Yinuo Chen · Jiaqi Zhu · Hongyu Liu, ,https://arxiv.org/html/2309.13596v2,,2309.13596v2.pdf,Advancements in 3D Lane Detection Using LiDAR Point Clouds: From Data Collection to Model Development,"Advanced Driver-Assistance Systems (ADAS) have successfully integrated +learning-based techniques into vehicle perception and decision-making. However, +their application in 3D lane detection for effective driving environment +perception is hindered by the lack of comprehensive LiDAR datasets. The sparse +nature of LiDAR point cloud data prevents an efficient manual annotation +process. To solve this problem, we present LiSV-3DLane, a large-scale 3D lane +dataset that comprises 20k frames of surround-view LiDAR point clouds with +enriched semantic annotation. Unlike existing datasets confined to a frontal +perspective, LiSV-3DLane provides a full 360-degree spatial panorama around the +ego vehicle, capturing complex lane patterns in both urban and highway +environments. 
We leverage the geometric traits of lane lines and the intrinsic +spatial attributes of LiDAR data to design a simple yet effective automatic +annotation pipeline for generating finer lane labels. To propel future +research, we propose a novel LiDAR-based 3D lane detection model, LiLaDet, +incorporating the spatial geometry learning of the LiDAR point cloud into +Bird's Eye View (BEV) based lane identification. Experimental results indicate +that LiLaDet outperforms existing camera- and LiDAR-based approaches in the 3D +lane detection task on the K-Lane dataset and our LiSV-3DLane.",cs.CV,['cs.CV'] +Rethinking the Up-Sampling Operations in CNN-based Generative Network for Generalizable Deepfake Detection,Chuangchuang Tan · Huan Liu · Yao Zhao · Shikui Wei · Guanghua Gu · Ping Liu · Yunchao Wei,https://github.com/chuangchuangtan/NPR-DeepfakeDetection,https://arxiv.org/abs/2312.10461,,2312.10461.pdf,Rethinking the Up-Sampling Operations in CNN-based Generative Network for Generalizable Deepfake Detection,"Recently, the proliferation of highly realistic synthetic images, facilitated +through a variety of GANs and Diffusions, has significantly heightened the +susceptibility to misuse. While the primary focus of deepfake detection has +traditionally centered on the design of detection algorithms, an investigative +inquiry into the generator architectures has remained conspicuously absent in +recent years. This paper contributes to this lacuna by rethinking the +architectures of CNN-based generators, thereby establishing a generalized +representation of synthetic artifacts. Our findings illuminate that the +up-sampling operator can, beyond frequency-based artifacts, produce generalized +forgery artifacts. In particular, the local interdependence among image pixels +caused by upsampling operators is significantly demonstrated in synthetic +images generated by GAN or diffusion. Building upon this observation, we +introduce the concept of Neighboring Pixel Relationships(NPR) as a means to +capture and characterize the generalized structural artifacts stemming from +up-sampling operations. A comprehensive analysis is conducted on an open-world +dataset, comprising samples generated by \tft{28 distinct generative models}. +This analysis culminates in the establishment of a novel state-of-the-art +performance, showcasing a remarkable \tft{11.6\%} improvement over existing +methods. The code is available at +https://github.com/chuangchuangtan/NPR-DeepfakeDetection.",cs.CV,['cs.CV'] +SDDGR: Stable Diffusion-based Deep Generative Replay for Class Incremental Object Detection,JUNSU KIM · Hoseong Cho · Jihyeon Kim · Yihalem Tiruneh · Seungryul Baek, ,https://arxiv.org/abs/2402.17323,,2402.17323.pdf,SDDGR: Stable Diffusion-based Deep Generative Replay for Class Incremental Object Detection,"In the field of class incremental learning (CIL), generative replay has +become increasingly prominent as a method to mitigate the catastrophic +forgetting, alongside the continuous improvements in generative models. +However, its application in class incremental object detection (CIOD) has been +significantly limited, primarily due to the complexities of scenes involving +multiple labels. In this paper, we propose a novel approach called stable +diffusion deep generative replay (SDDGR) for CIOD. Our method utilizes a +diffusion-based generative model with pre-trained text-to-diffusion networks to +generate realistic and diverse synthetic images. 
SDDGR incorporates an +iterative refinement strategy to produce high-quality images encompassing old +classes. Additionally, we adopt an L2 knowledge distillation technique to +improve the retention of prior knowledge in synthetic images. Furthermore, our +approach includes pseudo-labeling for old objects within new task images, +preventing misclassification as background elements. Extensive experiments on +the COCO 2017 dataset demonstrate that SDDGR significantly outperforms existing +algorithms, achieving a new state-of-the-art in various CIOD scenarios. The +source code will be made available to the public.",cs.CV,['cs.CV'] +Mean-Shift Feature Transformer,Takumi Kobayashi, ,https://arxiv.org/abs/2404.11062,,2404.11062.pdf,Generation of a precise time scale assisted by a near-continuously operating optical lattice clock,"We report on a reduced time variation of a time scale with respect to +Coordinated Universal Time (UTC) by steering a hydrogen-maser-based time scale +with a near-continuously operating optical lattice clock. The time scale is +generated in a post-processing analysis for 230 days with a hydrogen maser with +its fractional frequency stability limited by a flicker floor of +$2\times10^{-15}$ and an Yb optical lattice clock operated with an uptime of +81.6 $\%$. During the 230-day period, the root mean square time variation of +our time scale with respect to UTC is 0.52 ns, which is a better performance +compared with those of time scales steered by microwave fountain clocks that +exhibit root mean square variations from 0.99 ns to 1.6 ns. With the high +uptime achieved by the Yb optical lattice clock, our simulation implies the +potential of generating a state-of-the-art time scale with a time variation of +$<0.1$ ns over a month using a better hydrogen maser reaching the mid +$10^{-16}$ level. This work demonstrates that a use of an optical clock with a +high uptime enhances the stability of a time scale.",physics.atom-ph,['physics.atom-ph'] +TAMM: TriAdapter Multi-Modal Learning for 3D Shape Understanding,Zhihao Zhang · Shengcao Cao · Yu-Xiong Wang, ,https://arxiv.org/abs/2402.18490,,2402.18490.pdf,TAMM: TriAdapter Multi-Modal Learning for 3D Shape Understanding,"The limited scale of current 3D shape datasets hinders the advancements in 3D +shape understanding, and motivates multi-modal learning approaches which +transfer learned knowledge from data-abundant 2D image and language modalities +to 3D shapes. However, even though the image and language representations have +been aligned by cross-modal models like CLIP, we find that the image modality +fails to contribute as much as the language in existing multi-modal 3D +representation learning methods. This is attributed to the domain shift in the +2D images and the distinct focus of each modality. To more effectively leverage +both modalities in the pre-training, we introduce TriAdapter Multi-Modal +Learning (TAMM) -- a novel two-stage learning approach based on three +synergistic adapters. First, our CLIP Image Adapter mitigates the domain gap +between 3D-rendered images and natural images, by adapting the visual +representations of CLIP for synthetic image-text pairs. Subsequently, our Dual +Adapters decouple the 3D shape representation space into two complementary +sub-spaces: one focusing on visual attributes and the other for semantic +understanding, which ensure a more comprehensive and effective multi-modal +pre-training. 
Extensive experiments demonstrate that TAMM consistently enhances +3D representations for a wide range of 3D encoder architectures, pre-training +datasets, and downstream tasks. Notably, we boost the zero-shot classification +accuracy on Objaverse-LVIS from 46.8\% to 50.7\%, and improve the 5-way 10-shot +linear probing classification accuracy on ModelNet40 from 96.1\% to 99.0\%. +Project page: https://alanzhangcs.github.io/tamm-page.",cs.CV,['cs.CV'] +Open-Vocabulary 3D Semantic Segmentation with Foundation Models,Li Jiang · Shaoshuai Shi · Bernt Schiele, ,https://arxiv.org/abs/2306.13631,,2306.13631.pdf,OpenMask3D: Open-Vocabulary 3D Instance Segmentation,"We introduce the task of open-vocabulary 3D instance segmentation. Current +approaches for 3D instance segmentation can typically only recognize object +categories from a pre-defined closed set of classes that are annotated in the +training datasets. This results in important limitations for real-world +applications where one might need to perform tasks guided by novel, +open-vocabulary queries related to a wide variety of objects. Recently, +open-vocabulary 3D scene understanding methods have emerged to address this +problem by learning queryable features for each point in the scene. While such +a representation can be directly employed to perform semantic segmentation, +existing methods cannot separate multiple object instances. In this work, we +address this limitation, and propose OpenMask3D, which is a zero-shot approach +for open-vocabulary 3D instance segmentation. Guided by predicted +class-agnostic 3D instance masks, our model aggregates per-mask features via +multi-view fusion of CLIP-based image embeddings. Experiments and ablation +studies on ScanNet200 and Replica show that OpenMask3D outperforms other +open-vocabulary methods, especially on the long-tail distribution. Qualitative +experiments further showcase OpenMask3D's ability to segment object properties +based on free-form queries describing geometry, affordances, and materials.",cs.CV,['cs.CV'] +Multiplane Prior Guided Few-Shot Aerial Scene Rendering,Zihan Gao · Licheng Jiao · Lingling Li · Xu Liu · Fang Liu · Puhua Chen · Yuwei Guo, ,http://export.arxiv.org/abs/2402.16407,,2402.16407.pdf,CMC: Few-shot Novel View Synthesis via Cross-view Multiplane Consistency,"Neural Radiance Field (NeRF) has shown impressive results in novel view +synthesis, particularly in Virtual Reality (VR) and Augmented Reality (AR), +thanks to its ability to represent scenes continuously. However, when just a +few input view images are available, NeRF tends to overfit the given views and +thus make the estimated depths of pixels share almost the same value. Unlike +previous methods that conduct regularization by introducing complex priors or +additional supervisions, we propose a simple yet effective method that +explicitly builds depth-aware consistency across input views to tackle this +challenge. Our key insight is that by forcing the same spatial points to be +sampled repeatedly in different input views, we are able to strengthen the +interactions between views and therefore alleviate the overfitting problem. To +achieve this, we build the neural networks on layered representations +(\textit{i.e.}, multiplane images), and the sampling point can thus be +resampled on multiple discrete planes. Furthermore, to regularize the unseen +target views, we constrain the rendered colors and depths from different input +views to be the same. 
Although simple, extensive experiments demonstrate that +our proposed method can achieve better synthesis quality over state-of-the-art +methods.",cs.CV,"['cs.CV', 'cs.GR']" +One-step Diffusion with Distribution Matching Distillation,Tianwei Yin · Michaël Gharbi · Michaël Gharbi · Richard Zhang · Eli Shechtman · Fredo Durand · William Freeman · Taesung Park, ,https://arxiv.org/abs/2311.18828,,2311.18828.pdf,One-step Diffusion with Distribution Matching Distillation,"Diffusion models generate high-quality images but require dozens of forward +passes. We introduce Distribution Matching Distillation (DMD), a procedure to +transform a diffusion model into a one-step image generator with minimal impact +on image quality. We enforce the one-step image generator match the diffusion +model at distribution level, by minimizing an approximate KL divergence whose +gradient can be expressed as the difference between 2 score functions, one of +the target distribution and the other of the synthetic distribution being +produced by our one-step generator. The score functions are parameterized as +two diffusion models trained separately on each distribution. Combined with a +simple regression loss matching the large-scale structure of the multi-step +diffusion outputs, our method outperforms all published few-step diffusion +approaches, reaching 2.62 FID on ImageNet 64x64 and 11.49 FID on zero-shot +COCO-30k, comparable to Stable Diffusion but orders of magnitude faster. +Utilizing FP16 inference, our model generates images at 20 FPS on modern +hardware.",cs.CV,['cs.CV'] +Towards 3D Vision with Low-Cost Single-Photon Cameras,Fangzhou Mu · Carter Sifferman · Sacha Jungerman · Yiquan Li · Zhiyue Han · Michael Gleicher · Mohit Gupta · Yin Li,https://cpsiff.github.io/towards_3d_vision/,https://arxiv.org/abs/2403.17801,,2403.17801.pdf,Towards 3D Vision with Low-Cost Single-Photon Cameras,"We present a method for reconstructing 3D shape of arbitrary Lambertian +objects based on measurements by miniature, energy-efficient, low-cost +single-photon cameras. These cameras, operating as time resolved image sensors, +illuminate the scene with a very fast pulse of diffuse light and record the +shape of that pulse as it returns back from the scene at a high temporal +resolution. We propose to model this image formation process, account for its +non-idealities, and adapt neural rendering to reconstruct 3D geometry from a +set of spatially distributed sensors with known poses. We show that our +approach can successfully recover complex 3D shapes from simulated data. We +further demonstrate 3D object reconstruction from real-world captures, +utilizing measurements from a commodity proximity sensor. 
Our work draws a +connection between image-based modeling and active range scanning and is a step +towards 3D vision with single-photon cameras.",cs.CV,"['cs.CV', 'eess.IV']" +RepKPU: Point Cloud Upsampling with Kernel Point Representation and Deformation,Yi Rong · Haoran Zhou · Kang Xia · Cheng Mei · Jiahao Wang · Tong Lu, ,,https://www.mdpi.com/2072-4292/16/3/450,,,,,nan +UniGarmentManip: A Unified Framework for Category-Level Garment Manipulation via Dense Visual Correspondence,Ruihai Wu · Haoran Lu · Yiyan Wang · Yubo Wang · Hao Dong, ,https://arxiv.org/abs/2405.06903,,2405.06903.pdf,UniGarmentManip: A Unified Framework for Category-Level Garment Manipulation via Dense Visual Correspondence,"Garment manipulation (e.g., unfolding, folding and hanging clothes) is +essential for future robots to accomplish home-assistant tasks, while highly +challenging due to the diversity of garment configurations, geometries and +deformations. Although able to manipulate similar shaped garments in a certain +task, previous works mostly have to design different policies for different +tasks, could not generalize to garments with diverse geometries, and often rely +heavily on human-annotated data. In this paper, we leverage the property that, +garments in a certain category have similar structures, and then learn the +topological dense (point-level) visual correspondence among garments in the +category level with different deformations in the self-supervised manner. The +topological correspondence can be easily adapted to the functional +correspondence to guide the manipulation policies for various downstream tasks, +within only one or few-shot demonstrations. Experiments over garments in 3 +different categories on 3 representative tasks in diverse scenarios, using one +or two arms, taking one or more steps, inputting flat or messy garments, +demonstrate the effectiveness of our proposed method. Project page: +https://warshallrho.github.io/unigarmentmanip.",cs.CV,['cs.CV'] +Learning Diffusion Texture Priors for Image Restoration,Tian Ye · Sixiang Chen · Wenhao Chai · Zhaohu Xing · Jing Qin · Ge lin · Lei Zhu, ,https://arxiv.org/abs/2312.08606,,2312.08606.pdf,VQCNIR: Clearer Night Image Restoration with Vector-Quantized Codebook,"Night photography often struggles with challenges like low light and +blurring, stemming from dark environments and prolonged exposures. Current +methods either disregard priors and directly fitting end-to-end networks, +leading to inconsistent illumination, or rely on unreliable handcrafted priors +to constrain the network, thereby bringing the greater error to the final +result. We believe in the strength of data-driven high-quality priors and +strive to offer a reliable and consistent prior, circumventing the restrictions +of manual priors. In this paper, we propose Clearer Night Image Restoration +with Vector-Quantized Codebook (VQCNIR) to achieve remarkable and consistent +restoration outcomes on real-world and synthetic benchmarks. To ensure the +faithful restoration of details and illumination, we propose the incorporation +of two essential modules: the Adaptive Illumination Enhancement Module (AIEM) +and the Deformable Bi-directional Cross-Attention (DBCA) module. The AIEM +leverages the inter-channel correlation of features to dynamically maintain +illumination consistency between degraded features and high-quality codebook +features. 
Meanwhile, the DBCA module effectively integrates texture and +structural information through bi-directional cross-attention and deformable +convolution, resulting in enhanced fine-grained detail and structural fidelity +across parallel decoders. Extensive experiments validate the remarkable +benefits of VQCNIR in enhancing image quality under low-light conditions, +showcasing its state-of-the-art performance on both synthetic and real-world +datasets. The code is available at https://github.com/AlexZou14/VQCNIR.",cs.CV,['cs.CV'] +Move Anything with Layered Scene Diffusion,Jiawei Ren · Mengmeng Xu · Jui-Chieh Wu · Ziwei Liu · Tao Xiang · Antoine Toisoul, ,https://arxiv.org/abs/2404.07178,,2404.07178.pdf,Move Anything with Layered Scene Diffusion,"Diffusion models generate images with an unprecedented level of quality, but +how can we freely rearrange image layouts? Recent works generate controllable +scenes via learning spatially disentangled latent codes, but these methods do +not apply to diffusion models due to their fixed forward process. In this work, +we propose SceneDiffusion to optimize a layered scene representation during the +diffusion sampling process. Our key insight is that spatial disentanglement can +be obtained by jointly denoising scene renderings at different spatial layouts. +Our generated scenes support a wide range of spatial editing operations, +including moving, resizing, cloning, and layer-wise appearance editing +operations, including object restyling and replacing. Moreover, a scene can be +generated conditioned on a reference image, thus enabling object moving for +in-the-wild images. Notably, this approach is training-free, compatible with +general text-to-image diffusion models, and responsive in less than a second.",cs.CV,['cs.CV'] +MoML: Online Meta Adaptation for 3D Human Motion Prediction,Xiaoning Sun · Huaijiang Sun · Bin Li · Dong Wei · Weiqing Li · Jianfeng Lu, ,https://arxiv.org/abs/2405.02911,,,Multimodal Sense-Informed Prediction of 3D Human Motions,"Predicting future human pose is a fundamental application for machine +intelligence, which drives robots to plan their behavior and paths ahead of +time to seamlessly accomplish human-robot collaboration in real-world 3D +scenarios. Despite encouraging results, existing approaches rarely consider the +effects of the external scene on the motion sequence, leading to pronounced +artifacts and physical implausibilities in the predictions. To address this +limitation, this work introduces a novel multi-modal sense-informed motion +prediction approach, which conditions high-fidelity generation on two modal +information: external 3D scene, and internal human gaze, and is able to +recognize their salience for future human activity. Furthermore, the gaze +information is regarded as the human intention, and combined with both motion +and scene features, we construct a ternary intention-aware attention to +supervise the generation to match where the human wants to reach. Meanwhile, we +introduce semantic coherence-aware attention to explicitly distinguish the +salient point clouds and the underlying ones, to ensure a reasonable +interaction of the generated sequence with the 3D scene. On two real-world +benchmarks, the proposed method achieves state-of-the-art performance both in +3D human pose and trajectory prediction.",cs.CV,['cs.CV'] +Real Acoustic Fields: An Audio-Visual Room Acoustics Dataset and Benchmark,Ziyang Chen · Israel D. 
Gebru · Christian Richardt · Anurag Kumar · William Laney · Andrew Owens · Alexander Richard, ,,https://openreview.net/forum?id=Mk0Uf3zHtU,,,,,nan +Unveiling the Power of Audio-Visual Early Fusion Transformers with Dense Interactions through Masked Modeling,Shentong Mo · Pedro Morgado, ,https://arxiv.org/abs/2312.01017,,2312.01017.pdf,Unveiling the Power of Audio-Visual Early Fusion Transformers with Dense Interactions through Masked Modeling,"Humans possess a remarkable ability to integrate auditory and visual +information, enabling a deeper understanding of the surrounding environment. +This early fusion of audio and visual cues, demonstrated through cognitive +psychology and neuroscience research, offers promising potential for developing +multimodal perception models. However, training early fusion architectures +poses significant challenges, as the increased model expressivity requires +robust learning frameworks to harness their enhanced capabilities. In this +paper, we address this challenge by leveraging the masked reconstruction +framework, previously successful in unimodal settings, to train audio-visual +encoders with early fusion. Additionally, we propose an attention-based fusion +module that captures interactions between local audio and visual +representations, enhancing the model's ability to capture fine-grained +interactions. While effective, this procedure can become computationally +intractable, as the number of local representations increases. Thus, to address +the computational complexity, we propose an alternative procedure that +factorizes the local representations before representing audio-visual +interactions. Extensive evaluations on a variety of datasets demonstrate the +superiority of our approach in audio-event classification, visual sound +localization, sound separation, and audio-visual segmentation. These +contributions enable the efficient training of deeply integrated audio-visual +models and significantly advance the usefulness of early fusion architectures.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG', 'cs.MM', 'cs.SD']" +HumanNeRF-SE: A Simple yet Effective Approach to Animate HumanNeRF with Diverse Poses,Caoyuan Ma · Yu-Lun Liu · Zhixiang Wang · Wu Liu · Xinchen Liu · Zheng Wang, ,https://arxiv.org/abs/2312.02232,,2312.02232.pdf,HumanNeRF-SE: A Simple yet Effective Approach to Animate HumanNeRF with Diverse Poses,"We present HumanNeRF-SE, a simple yet effective method that synthesizes +diverse novel pose images with simple input. Previous HumanNeRF works require a +large number of optimizable parameters to fit the human images. Instead, we +reload these approaches by combining explicit and implicit human +representations to design both generalized rigid deformation and specific +non-rigid deformation. Our key insight is that explicit shape can reduce the +sampling points used to fit implicit representation, and frozen blending +weights from SMPL constructing a generalized rigid deformation can effectively +avoid overfitting and improve pose generalization performance. Our architecture +involving both explicit and implicit representation is simple yet effective. +Experiments demonstrate our model can synthesize images under arbitrary poses +with few-shot input and increase the speed of synthesizing images by 15 times +through a reduction in computational complexity without using any existing +acceleration modules. 
Compared to the state-of-the-art HumanNeRF studies, +HumanNeRF-SE achieves better performance with fewer learnable parameters and +less training time.",cs.CV,['cs.CV'] +GALA: Generating Animatable Layered Assets from a Single Scan,Taeksoo Kim · Byungjun Kim · Shunsuke Saito · Hanbyul Joo, ,https://arxiv.org/abs/2401.12979,,2401.12979.pdf,GALA: Generating Animatable Layered Assets from a Single Scan,"We present GALA, a framework that takes as input a single-layer clothed 3D +human mesh and decomposes it into complete multi-layered 3D assets. The outputs +can then be combined with other assets to create novel clothed human avatars +with any pose. Existing reconstruction approaches often treat clothed humans as +a single-layer of geometry and overlook the inherent compositionality of humans +with hairstyles, clothing, and accessories, thereby limiting the utility of the +meshes for downstream applications. Decomposing a single-layer mesh into +separate layers is a challenging task because it requires the synthesis of +plausible geometry and texture for the severely occluded regions. Moreover, +even with successful decomposition, meshes are not normalized in terms of poses +and body shapes, failing coherent composition with novel identities and poses. +To address these challenges, we propose to leverage the general knowledge of a +pretrained 2D diffusion model as geometry and appearance prior for humans and +other assets. We first separate the input mesh using the 3D surface +segmentation extracted from multi-view 2D segmentations. Then we synthesize the +missing geometry of different layers in both posed and canonical spaces using a +novel pose-guided Score Distillation Sampling (SDS) loss. Once we complete +inpainting high-fidelity 3D geometry, we also apply the same SDS loss to its +texture to obtain the complete appearance including the initially occluded +regions. Through a series of decomposition steps, we obtain multiple layers of +3D assets in a shared canonical space normalized in terms of poses and human +shapes, hence supporting effortless composition to novel identities and +reanimation with novel poses. Our experiments demonstrate the effectiveness of +our approach for decomposition, canonicalization, and composition tasks +compared to existing solutions.",cs.CV,['cs.CV'] +A Vision Check-up for Language Models,Pratyusha Sharma · Tamar Rott Shaham · Manel Baradad · Stephanie Fu · Adrian Rodriguez-Munoz · Shivam Duggal · Phillip Isola · Antonio Torralba, ,https://arxiv.org/abs/2401.01862,,2401.01862.pdf,A Vision Check-up for Language Models,"What does learning to model relationships between strings teach large +language models (LLMs) about the visual world? We systematically evaluate LLMs' +abilities to generate and recognize an assortment of visual concepts of +increasing complexity and then demonstrate how a preliminary visual +representation learning system can be trained using models of text. As language +models lack the ability to consume or output visual information as pixels, we +use code to represent images in our study. Although LLM-generated images do not +look like natural images, results on image generation and the ability of models +to correct these generated images indicate that precise modeling of strings can +teach language models about numerous aspects of the visual world. 
Furthermore, +experiments on self-supervised visual representation learning, utilizing images +generated with text models, highlight the potential to train vision models +capable of making semantic assessments of natural images using just LLMs.",cs.CV,"['cs.CV', 'cs.CL', 'cs.LG']" +Boosting Adversarial Transferability by Block Shuffle and Rotation,Kunyu Wang · he xuanran · Wenxuan Wang · Xiaosen Wang, ,https://arxiv.org/abs/2308.10299,,2308.10299.pdf,Boosting Adversarial Transferability by Block Shuffle and Rotation,"Adversarial examples mislead deep neural networks with imperceptible +perturbations and have brought significant threats to deep learning. An +important aspect is their transferability, which refers to their ability to +deceive other models, thus enabling attacks in the black-box setting. Though +various methods have been proposed to boost transferability, the performance +still falls short compared with white-box attacks. In this work, we observe +that existing input transformation based attacks, one of the mainstream +transfer-based attacks, result in different attention heatmaps on various +models, which might limit the transferability. We also find that breaking the +intrinsic relation of the image can disrupt the attention heatmap of the +original image. Based on this finding, we propose a novel input transformation +based attack called block shuffle and rotation (BSR). Specifically, BSR splits +the input image into several blocks, then randomly shuffles and rotates these +blocks to construct a set of new images for gradient calculation. Empirical +evaluations on the ImageNet dataset demonstrate that BSR could achieve +significantly better transferability than the existing input transformation +based methods under single-model and ensemble-model settings. Combining BSR +with the current input transformation method can further improve the +transferability, which significantly outperforms the state-of-the-art methods. +Code is available at https://github.com/Trustworthy-AI-Group/BSR",cs.CV,"['cs.CV', 'eess.IV']" +Residual Learning in Diffusion Models,Junyu Zhang · Daochang Liu · Eunbyung Park · Shichao Zhang · Chang Xu, ,https://arxiv.org/abs/2308.13712,,2308.13712.pdf,Residual Denoising Diffusion Models,"We propose residual denoising diffusion models (RDDM), a novel dual diffusion +process that decouples the traditional single denoising diffusion process into +residual diffusion and noise diffusion. This dual diffusion framework expands +the denoising-based diffusion models, initially uninterpretable for image +restoration, into a unified and interpretable model for both image generation +and restoration by introducing residuals. Specifically, our residual diffusion +represents directional diffusion from the target image to the degraded input +image and explicitly guides the reverse generation process for image +restoration, while noise diffusion represents random perturbations in the +diffusion process. The residual prioritizes certainty, while the noise +emphasizes diversity, enabling RDDM to effectively unify tasks with varying +certainty or diversity requirements, such as image generation and restoration. +We demonstrate that our sampling process is consistent with that of DDPM and +DDIM through coefficient transformation, and propose a partially +path-independent generation process to better understand the reverse process. 
+Notably, our RDDM enables a generic UNet, trained with only an L1 loss and a +batch size of 1, to compete with state-of-the-art image restoration methods. We +provide code and pre-trained models to encourage further exploration, +application, and development of our innovative framework +(https://github.com/nachifur/RDDM).",cs.CV,"['cs.CV', 'cs.LG']" +"What, How, and When Should Object Detectors Update in Continually Changing Test Domains?",Jayeon Yoo · Dongkwan Lee · Inseop Chung · Donghyun Kim · Nojun Kwak, ,https://arxiv.org/abs/2312.08875,,2312.08875.pdf,"What, How, and When Should Object Detectors Update in Continually Changing Test Domains?","It is a well-known fact that the performance of deep learning models +deteriorates when they encounter a distribution shift at test time. Test-time +adaptation (TTA) algorithms have been proposed to adapt the model online while +inferring test data. However, existing research predominantly focuses on +classification tasks through the optimization of batch normalization layers or +classification heads, but this approach limits its applicability to various +model architectures like Transformers and makes it challenging to apply to +other tasks, such as object detection. In this paper, we propose a novel online +adaption approach for object detection in continually changing test domains, +considering which part of the model to update, how to update it, and when to +perform the update. By introducing architecture-agnostic and lightweight +adaptor modules and only updating these while leaving the pre-trained backbone +unchanged, we can rapidly adapt to new test domains in an efficient way and +prevent catastrophic forgetting. Furthermore, we present a practical and +straightforward class-wise feature aligning method for object detection to +resolve domain shifts. Additionally, we enhance efficiency by determining when +the model is sufficiently adapted or when additional adaptation is needed due +to changes in the test distribution. Our approach surpasses baselines on widely +used benchmarks, achieving improvements of up to 4.9\%p and 7.9\%p in mAP for +COCO $\rightarrow$ COCO-corrupted and SHIFT, respectively, while maintaining +about 20 FPS or higher.",cs.CV,['cs.CV'] +Soften to Defend: Towards Adversarial Robustness via Self-Guided Label Refinement,Daiwei Yu · Zhuorong Li · Lina Wei · Canghong Jin · Yun Zhang · Sixian Chan, ,https://arxiv.org/abs/2403.09101,,2403.09101.pdf,Soften to Defend: Towards Adversarial Robustness via Self-Guided Label Refinement,"Adversarial training (AT) is currently one of the most effective ways to +obtain the robustness of deep neural networks against adversarial attacks. +However, most AT methods suffer from robust overfitting, i.e., a significant +generalization gap in adversarial robustness between the training and testing +curves. In this paper, we first identify a connection between robust +overfitting and the excessive memorization of noisy labels in AT from a view of +gradient norm. As such label noise is mainly caused by a distribution mismatch +and improper label assignments, we are motivated to propose a label refinement +approach for AT. Specifically, our Self-Guided Label Refinement first +self-refines a more accurate and informative label distribution from +over-confident hard labels, and then it calibrates the training by dynamically +incorporating knowledge from self-distilled models into the current model and +thus requiring no external teachers. 
Empirical results demonstrate that our +method can simultaneously boost the standard accuracy and robust performance +across multiple benchmark datasets, attack types, and architectures. In +addition, we also provide a set of analyses from the perspectives of +information theory to dive into our method and suggest the importance of soft +labels for robust generalization.",cs.LG,"['cs.LG', 'cs.CR', 'cs.CV']" +SAOR: Single-View Articulated Object Reconstruction,Mehmet Aygun · Oisin Mac Aodha, ,,https://synthical.com/article/e8c0baeb-d277-4528-b526-8a08fcc46a22,,,,,nan +Infrared Adversarial Car Stickers,Xiaopei Zhu · Yuqiu Liu · Zhanhao Hu · Jianmin Li · Xiaolin Hu, ,https://arxiv.org/abs/2405.09924,,2405.09924.pdf,Infrared Adversarial Car Stickers,"Infrared physical adversarial examples are of great significance for studying +the security of infrared AI systems that are widely used in our lives such as +autonomous driving. Previous infrared physical attacks mainly focused on 2D +infrared pedestrian detection which may not fully manifest its destructiveness +to AI systems. In this work, we propose a physical attack method against +infrared detectors based on 3D modeling, which is applied to a real car. The +goal is to design a set of infrared adversarial stickers to make cars invisible +to infrared detectors at various viewing angles, distances, and scenes. We +build a 3D infrared car model with real infrared characteristics and propose an +infrared adversarial pattern generation method based on 3D mesh shadow. We +propose a 3D control points-based mesh smoothing algorithm and use a set of +smoothness loss functions to enhance the smoothness of adversarial meshes and +facilitate the sticker implementation. Besides, We designed the aluminum +stickers and conducted physical experiments on two real Mercedes-Benz A200L +cars. Our adversarial stickers hid the cars from Faster RCNN, an object +detector, at various viewing angles, distances, and scenes. The attack success +rate (ASR) was 91.49% for real cars. In comparison, the ASRs of random stickers +and no sticker were only 6.21% and 0.66%, respectively. In addition, the ASRs +of the designed stickers against six unseen object detectors such as YOLOv3 and +Deformable DETR were between 73.35%-95.80%, showing good transferability of the +attack performance across detectors.",cs.CV,['cs.CV'] +Effective Video Mirror Detection with Inconsistent Motion Cues,Alex Warren · Ke Xu · Jiaying Lin · Gary Tam · Rynson W.H. Lau, ,,https://cronfa.swan.ac.uk/Record/cronfa65886/Details,,,,,nan +Joint Reconstruction of 3D Human and Object via Contact-Based Refinement Transformer,Hyeongjin Nam · Daniel Jung · Gyeongsik Moon · Kyoung Mu Lee,https://github.com/dqj5182/CONTHO_RELEASE,https://arxiv.org/abs/2404.04819,,2404.04819.pdf,Joint Reconstruction of 3D Human and Object via Contact-Based Refinement Transformer,"Human-object contact serves as a strong cue to understand how humans +physically interact with objects. Nevertheless, it is not widely explored to +utilize human-object contact information for the joint reconstruction of 3D +human and object from a single image. In this work, we present a novel joint 3D +human-object reconstruction method (CONTHO) that effectively exploits contact +information between humans and objects. There are two core designs in our +system: 1) 3D-guided contact estimation and 2) contact-based 3D human and +object refinement. 
First, for accurate human-object contact estimation, CONTHO +initially reconstructs 3D humans and objects and utilizes them as explicit 3D +guidance for contact estimation. Second, to refine the initial reconstructions +of 3D human and object, we propose a novel contact-based refinement Transformer +that effectively aggregates human features and object features based on the +estimated human-object contact. The proposed contact-based refinement prevents +the learning of erroneous correlation between human and object, which enables +accurate 3D reconstruction. As a result, our CONTHO achieves state-of-the-art +performance in both human-object contact estimation and joint reconstruction of +3D human and object. The code is publicly available at +https://github.com/dqj5182/CONTHO_RELEASE.",cs.CV,['cs.CV'] +SingularTrajectory: Universal Trajectory Predictor Using Diffusion Model,Inhwan Bae · Young-Jae Park · Hae-Gon Jeon,https://github.com/InhwanBae/SingularTrajectory,https://arxiv.org/abs/2403.18452v1,,2403.18452v1.pdf,SingularTrajectory: Universal Trajectory Predictor Using Diffusion Model,"There are five types of trajectory prediction tasks: deterministic, +stochastic, domain adaptation, momentary observation, and few-shot. These +associated tasks are defined by various factors, such as the length of input +paths, data split and pre-processing methods. Interestingly, even though they +commonly take sequential coordinates of observations as input and infer future +paths in the same coordinates as output, designing specialized architectures +for each task is still necessary. For the other task, generality issues can +lead to sub-optimal performances. In this paper, we propose SingularTrajectory, +a diffusion-based universal trajectory prediction framework to reduce the +performance gap across the five tasks. The core of SingularTrajectory is to +unify a variety of human dynamics representations on the associated tasks. To +do this, we first build a Singular space to project all types of motion +patterns from each task into one embedding space. We next propose an adaptive +anchor working in the Singular space. Unlike traditional fixed anchor methods +that sometimes yield unacceptable paths, our adaptive anchor enables correct +anchors, which are put into a wrong location, based on a traversability map. +Finally, we adopt a diffusion-based predictor to further enhance the prototype +paths using a cascaded denoising process. Our unified framework ensures the +generality across various benchmark settings such as input modality, and +trajectory lengths. Extensive experiments on five public benchmarks demonstrate +that SingularTrajectory substantially outperforms existing models, highlighting +its effectiveness in estimating general dynamics of human movements. Code is +publicly available at https://github.com/inhwanbae/SingularTrajectory .",cs.CV,"['cs.CV', 'cs.LG', 'cs.RO']" +Gradient Reweighting: Towards Imbalanced Class-Incremental Learning,Jiangpeng He,https://github.com/JiangpengHe/imbalanced_cil,https://arxiv.org/abs/2402.18528,,2402.18528.pdf,Gradient Reweighting: Towards Imbalanced Class-Incremental Learning,"Class-Incremental Learning (CIL) trains a model to continually recognize new +classes from non-stationary data while retaining learned knowledge. 
A major +challenge of CIL arises when applying to real-world data characterized by +non-uniform distribution, which introduces a dual imbalance problem involving +(i) disparities between stored exemplars of old tasks and new class data +(inter-phase imbalance), and (ii) severe class imbalances within each +individual task (intra-phase imbalance). We show that this dual imbalance issue +causes skewed gradient updates with biased weights in FC layers, thus inducing +over/under-fitting and catastrophic forgetting in CIL. Our method addresses it +by reweighting the gradients towards balanced optimization and unbiased +classifier learning. Additionally, we observe imbalanced forgetting where +paradoxically the instance-rich classes suffer higher performance degradation +during CIL due to a larger amount of training data becoming unavailable in +subsequent learning phases. To tackle this, we further introduce a +distribution-aware knowledge distillation loss to mitigate forgetting by +aligning output logits proportionally with the distribution of lost training +data. We validate our method on CIFAR-100, ImageNetSubset, and Food101 across +various evaluation protocols and demonstrate consistent improvements compared +to existing works, showing great potential to apply CIL in real-world scenarios +with enhanced robustness and effectiveness.",cs.CV,['cs.CV'] +OpenEQA: Embodied Question Answering in the Era of Foundation Models,Arjun Majumdar · Anurag Ajay · Xiaohan Zhang · Sriram Yenamandra · Mikael Henaff · Alexander Sax · Sneha Silwal · Paul McVay · Oleksandr Maksymets · Sergio Arnaud · Pranav Putta · Karmesh Yadav · Qiyang Li · Benjamin Newman · Mohit Sharma · Mohit Sharma · Vincent-Pierre Berges · Shiqi Zhang · Pulkit Agrawal · Dhruv Batra · Yonatan Bisk · Mrinal Kalakrishnan · Franziska Meier · Chris Paxton · Aravind Rajeswaran, ,,https://openreview.net/forum?id=7JIW6e1UJX,,,,,nan +Batch Normalization Alleviates the Spectral Bias in Coordinate Networks,Zhicheng Cai · Hao Zhu · Qiu Shen · Xinran Wang · Xun Cao, ,https://arxiv.org/abs/2306.16999,,2306.16999.pdf,Spectral Batch Normalization: Normalization in the Frequency Domain,"Regularization is a set of techniques that are used to improve the +generalization ability of deep neural networks. In this paper, we introduce +spectral batch normalization (SBN), a novel effective method to improve +generalization by normalizing feature maps in the frequency (spectral) domain. +The activations of residual networks without batch normalization (BN) tend to +explode exponentially in the depth of the network at initialization. This leads +to extremely large feature map norms even though the parameters are relatively +small. These explosive dynamics can be very detrimental to learning. BN makes +weight decay regularization on the scaling factors $\gamma, \beta$ +approximately equivalent to an additive penalty on the norm of the feature +maps, which prevents extremely large feature map norms to a certain degree. +However, we show experimentally that, despite the approximate additive penalty +of BN, feature maps in deep neural networks (DNNs) tend to explode at the +beginning of the network and that feature maps of DNNs contain large values +during the whole training. This phenomenon also occurs in a weakened form in +non-residual networks. SBN addresses large feature maps by normalizing them in +the frequency domain. 
In our experiments, we empirically show that SBN prevents +exploding feature maps at initialization and large feature map values during +the training. Moreover, the normalization of feature maps in the frequency +domain leads to more uniform distributed frequency components. This discourages +the DNNs to rely on single frequency components of feature maps. These, +together with other effects of SBN, have a regularizing effect on the training +of residual and non-residual networks. We show experimentally that using SBN in +addition to standard regularization methods improves the performance of DNNs by +a relevant margin, e.g. ResNet50 on ImageNet by 0.71%.",cs.CV,"['cs.CV', 'cs.LG']" +Learning for Transductive Threshold Calibration in Open-World Recognition,Qin ZHANG · DONGSHENG An · Tianjun Xiao · Tong He · Qingming Tang · Ying Nian Wu · Joseph Tighe · Yifan Xing, ,,https://synthical.com/summary/ed7531f5-2d4e-43c1-95e3-15ec48a9b43d,,,,,nan +MatSynth: A Modern PBR Materials Dataset,Giuseppe Vecchio · Valentin Deschaintre,https://gvecchio.com/matsynth/,https://arxiv.org/abs/2401.06056,,2401.06056.pdf,MatSynth: A Modern PBR Materials Dataset,"We introduce MatSynth, a dataset of 4,000+ CC0 ultra-high resolution PBR +materials. Materials are crucial components of virtual relightable assets, +defining the interaction of light at the surface of geometries. Given their +importance, significant research effort was dedicated to their representation, +creation and acquisition. However, in the past 6 years, most research in +material acquisiton or generation relied either on the same unique dataset, or +on company-owned huge library of procedural materials. With this dataset we +propose a significantly larger, more diverse, and higher resolution set of +materials than previously publicly available. We carefully discuss the data +collection process and demonstrate the benefits of this dataset on material +acquisition and generation applications. The complete data further contains +metadata with each material's origin, license, category, tags, creation method +and, when available, descriptions and physical size, as well as 3M+ renderings +of the augmented materials, in 1K, under various environment lightings. The +MatSynth dataset is released through the project page at: +https://www.gvecchio.com/matsynth.",cs.CV,"['cs.CV', 'cs.GR']" +Versatile Medical Image Segmentation Learned from Multi-Source Datasets via Model Self-Disambiguation,Xiaoyang Chen · Hao Zheng · Yuemeng LI · Yuncong Ma · Liang Ma · Hongming Li · Yong Fan, ,https://arxiv.org/abs/2311.10696,,2311.10696.pdf,Versatile Medical Image Segmentation Learned from Multi-Source Datasets via Model Self-Disambiguation,"A versatile medical image segmentation model applicable to images acquired +with diverse equipment and protocols can facilitate model deployment and +maintenance. However, building such a model typically demands a large, diverse, +and fully annotated dataset, which is challenging to obtain due to the +labor-intensive nature of data curation. To address this challenge, we propose +a cost-effective alternative that harnesses multi-source data with only partial +or sparse segmentation labels for training, substantially reducing the cost of +developing a versatile model. We devise strategies for model +self-disambiguation, prior knowledge incorporation, and imbalance mitigation to +tackle challenges associated with inconsistently labeled multi-source data, +including label ambiguity and modality, dataset, and class imbalances. 
+Experimental results on a multi-modal dataset compiled from eight different +sources for abdominal structure segmentation have demonstrated the +effectiveness and superior performance of our method compared to +state-of-the-art alternative approaches. We anticipate that its cost-saving +features, which optimize the utilization of existing annotated data and reduce +annotation efforts for new data, will have a significant impact in the field.",cs.CV,['cs.CV'] +ASAM: Boosting Segment Anything Model with Adversarial Tuning,Bo Li · Haoke Xiao · Lv Tang, ,https://arxiv.org/abs/2405.00256,,2405.00256.pdf,ASAM: Boosting Segment Anything Model with Adversarial Tuning,"In the evolving landscape of computer vision, foundation models have emerged +as pivotal tools, exhibiting exceptional adaptability to a myriad of tasks. +Among these, the Segment Anything Model (SAM) by Meta AI has distinguished +itself in image segmentation. However, SAM, like its counterparts, encounters +limitations in specific niche applications, prompting a quest for enhancement +strategies that do not compromise its inherent capabilities. This paper +introduces ASAM, a novel methodology that amplifies SAM's performance through +adversarial tuning. We harness the potential of natural adversarial examples, +inspired by their successful implementation in natural language processing. By +utilizing a stable diffusion model, we augment a subset (1%) of the SA-1B +dataset, generating adversarial instances that are more representative of +natural variations rather than conventional imperceptible perturbations. Our +approach maintains the photorealism of adversarial examples and ensures +alignment with original mask annotations, thereby preserving the integrity of +the segmentation task. The fine-tuned ASAM demonstrates significant +improvements across a diverse range of segmentation tasks without necessitating +additional data or architectural modifications. The results of our extensive +evaluations confirm that ASAM establishes new benchmarks in segmentation tasks, +thereby contributing to the advancement of foundational models in computer +vision. Our project page is in https://asam2024.github.io/.",cs.CV,['cs.CV'] +FreeDrag: Feature Dragging for Reliable Point-based Image Editing,Pengyang Ling · Lin Chen · Pan Zhang · Huaian Chen · Yi Jin · Jinjin Zheng, ,https://arxiv.org/abs/2307.04684,,2307.04684.pdf,FreeDrag: Feature Dragging for Reliable Point-based Image Editing,"To serve the intricate and varied demands of image editing, precise and +flexible manipulation in image content is indispensable. Recently, Drag-based +editing methods have gained impressive performance. However, these methods +predominantly center on point dragging, resulting in two noteworthy drawbacks, +namely ""miss tracking"", where difficulties arise in accurately tracking the +predetermined handle points, and ""ambiguous tracking"", where tracked points are +potentially positioned in wrong regions that closely resemble the handle +points. To address the above issues, we propose FreeDrag, a feature dragging +methodology designed to free the burden on point tracking. The FreeDrag +incorporates two key designs, i.e., template feature via adaptive updating and +line search with backtracking, the former improves the stability against +drastic content change by elaborately controls feature updating scale after +each dragging, while the latter alleviates the misguidance from similar points +by actively restricting the search area in a line. 
These two technologies +together contribute to a more stable semantic dragging with higher efficiency. +Comprehensive experimental results substantiate that our approach significantly +outperforms pre-existing methodologies, offering reliable point-based editing +even in various complex scenarios.",cs.CV,"['cs.CV', 'cs.HC', 'cs.LG']" +ICON: Incremental CONfidence for Joint Pose and Radiance Field Optimization,Weiyao Wang · Pierre Gleize · Hao Tang · Xingyu Chen · Kevin Liang · Matt Feiszli, ,https://arxiv.org/abs/2401.08937,,2401.08937.pdf,ICON: Incremental CONfidence for Joint Pose and Radiance Field Optimization,"Neural Radiance Fields (NeRF) exhibit remarkable performance for Novel View +Synthesis (NVS) given a set of 2D images. However, NeRF training requires +accurate camera pose for each input view, typically obtained by +Structure-from-Motion (SfM) pipelines. Recent works have attempted to relax +this constraint, but they still often rely on decent initial poses which they +can refine. Here we aim at removing the requirement for pose initialization. We +present Incremental CONfidence (ICON), an optimization procedure for training +NeRFs from 2D video frames. ICON only assumes smooth camera motion to estimate +initial guess for poses. Further, ICON introduces ``confidence"": an adaptive +measure of model quality used to dynamically reweight gradients. ICON relies on +high-confidence poses to learn NeRF, and high-confidence 3D structure (as +encoded by NeRF) to learn poses. We show that ICON, without prior pose +initialization, achieves superior performance in both CO3D and HO3D versus +methods which use SfM pose.",cs.CV,['cs.CV'] +StyLitGAN: Image-based Relighting via Latent Control,Anand Bhattad · James Soole · David Forsyth, ,https://ar5iv.labs.arxiv.org/html/2306.00987,,2306.00987.pdf,"StyleGAN knows Normal, Depth, Albedo, and More","Intrinsic images, in the original sense, are image-like maps of scene +properties like depth, normal, albedo or shading. This paper demonstrates that +StyleGAN can easily be induced to produce intrinsic images. The procedure is +straightforward. We show that, if StyleGAN produces $G({w})$ from latents +${w}$, then for each type of intrinsic image, there is a fixed offset ${d}_c$ +so that $G({w}+{d}_c)$ is that type of intrinsic image for $G({w})$. Here +${d}_c$ is {\em independent of ${w}$}. The StyleGAN we used was pretrained by +others, so this property is not some accident of our training regime. We show +that there are image transformations StyleGAN will {\em not} produce in this +fashion, so StyleGAN is not a generic image regression engine. + It is conceptually exciting that an image generator should ``know'' and +represent intrinsic images. There may also be practical advantages to using a +generative model to produce intrinsic images. The intrinsic images obtained +from StyleGAN compare well both qualitatively and quantitatively with those +obtained by using SOTA image regression techniques; but StyleGAN's intrinsic +images are robust to relighting effects, unlike SOTA methods.",cs.CV,"['cs.CV', 'cs.GR', 'cs.LG']" +Single Mesh Diffusion Models with Field Latents for Texture Generation,Thomas W. Mitchel · Carlos Esteves · Ameesh Makadia,https://single-mesh-diffusion.github.io/,https://arxiv.org/abs/2312.09250,,2312.09250.pdf,Single Mesh Diffusion Models with Field Latents for Texture Generation,"We introduce a framework for intrinsic latent diffusion models operating +directly on the surfaces of 3D shapes, with the goal of synthesizing +high-quality textures. Our approach is underpinned by two contributions: field +latents, a latent representation encoding textures as discrete vector fields on +the mesh vertices, and field latent diffusion models, which learn to denoise a +diffusion process in the learned latent space on the surface. We consider a +single-textured-mesh paradigm, where our models are trained to generate +variations of a given texture on a mesh. We show the synthesized textures are +of superior fidelity compared those from existing single-textured-mesh +generative models. Our models can also be adapted for user-controlled editing +tasks such as inpainting and label-guided generation. The efficacy of our +approach is due in part to the equivariance of our proposed framework under +isometries, allowing our models to seamlessly reproduce details across locally +similar regions and opening the door to a notion of generative texture +transfer.",cs.CV,"['cs.CV', 'cs.GR', 'cs.LG']" +Label-Efficient Group Robustness via Out-of-Distribution Concept Curation,Yiwei Yang · Anthony Liu · Robert Wolfe · Aylin Caliskan · Bill Howe, ,https://arxiv.org/abs/2403.06392,,2403.06392.pdf,Towards Robust Out-of-Distribution Generalization Bounds via Sharpness,"Generalizing to out-of-distribution (OOD) data or unseen domain, termed OOD +generalization, still lacks appropriate theoretical guarantees. Canonical OOD +bounds focus on different distance measurements between source and target +domains but fail to consider the optimization property of the learned model. As +empirically shown in recent work, the sharpness of learned minima influences +OOD generalization. To bridge this gap between optimization and OOD +generalization, we study the effect of sharpness on how a model tolerates data +change in domain shift which is usually captured by ""robustness"" in +generalization. In this paper, we give a rigorous connection between sharpness +and robustness, which gives better OOD guarantees for robust algorithms. It +also provides a theoretical backing for ""flat minima leads to better OOD +generalization"". Overall, we propose a sharpness-based OOD generalization bound +by taking robustness into consideration, resulting in a tighter bound than +non-robust guarantees. Our findings are supported by the experiments on a ridge +regression model, as well as the experiments on deep learning classification +tasks.",cs.LG,['cs.LG'] +EventPS: Real-Time Photometric Stereo Using an Event Camera,Bohan Yu · Jieji Ren · Jin Han · Feishi Wang · Jinxiu Liang · Boxin Shi, ,https://arxiv.org/abs/2312.11911,,2312.11911.pdf,"EVI-SAM: Robust, Real-time, Tightly-coupled Event-Visual-Inertial State Estimation and 3D Dense Mapping","Event cameras are bio-inspired, motion-activated sensors that demonstrate +substantial potential in handling challenging situations, such as motion blur +and high-dynamic range. In this paper, we proposed EVI-SAM to tackle the +problem of 6 DoF pose tracking and 3D reconstruction using monocular event +camera. A novel event-based hybrid tracking framework is designed to estimate +the pose, leveraging the robustness of feature matching and the precision of +direct alignment.
Specifically, we develop an event-based 2D-2D alignment to +construct the photometric constraint, and tightly integrate it with the +event-based reprojection constraint. The mapping module recovers the dense and +colorful depth of the scene through the image-guided event-based mapping +method. Subsequently, the appearance, texture, and surface mesh of the 3D scene +can be reconstructed by fusing the dense depth map from multiple viewpoints +using truncated signed distance function (TSDF) fusion. To the best of our +knowledge, this is the first non-learning work to realize event-based dense +mapping. Numerical evaluations are performed on both publicly available and +self-collected datasets, which qualitatively and quantitatively demonstrate the +superior performance of our method. Our EVI-SAM effectively balances accuracy +and robustness while maintaining computational efficiency, showcasing superior +pose tracking and dense mapping performance in challenging scenarios. Video +Demo: https://youtu.be/Nn40U4e5Si8.",cs.CV,"['cs.CV', 'cs.RO']" +Towards Understanding and Improving Adversarial Robustness of Vision Transformers,Samyak Jain · Tanima Dutta, ,https://arxiv.org/html/2208.09602v2,,,Exploring Adversarial Robustness of Vision Transformers in the Spectral Perspective,"The Vision Transformer has emerged as a powerful tool for image +classification tasks, surpassing the performance of convolutional neural +networks (CNNs). Recently, many researchers have attempted to understand the +robustness of Transformers against adversarial attacks. However, previous +researches have focused solely on perturbations in the spatial domain. This +paper proposes an additional perspective that explores the adversarial +robustness of Transformers against frequency-selective perturbations in the +spectral domain. To facilitate comparison between these two domains, an attack +framework is formulated as a flexible tool for implementing attacks on images +in the spatial and spectral domains. The experiments reveal that Transformers +rely more on phase and low frequency information, which can render them more +vulnerable to frequency-selective attacks than CNNs. This work offers new +insights into the properties and adversarial robustness of Transformers.",cs.CV,['cs.CV'] +On Train-Test Class Overlap and Detection for Image Retrieval,Chull Hwan Song · Jooyoung Yoon · Taebaek Hwang · Shunghyun Choi · Yeong Hyeon Gu · Yannis Avrithis, ,https://arxiv.org/abs/2404.01524,,2404.01524.pdf,On Train-Test Class Overlap and Detection for Image Retrieval,"How important is it for training and evaluation sets to not have class +overlap in image retrieval? We revisit Google Landmarks v2 clean, the most +popular training set, by identifying and removing class overlap with Revisited +Oxford and Paris [34], the most popular evaluation set. By comparing the +original and the new RGLDv2-clean on a benchmark of reproduced state-of-the-art +methods, our findings are striking. Not only is there a dramatic drop in +performance, but it is inconsistent across methods, changing the ranking.What +does it take to focus on objects or interest and ignore background clutter when +indexing? Do we need to train an object detector and the representation +separately? Do we need location supervision? We introduce Single-stage +Detect-to-Retrieve (CiDeR), an end-to-end, single-stage pipeline to detect +objects of interest and extract a global image representation. 
We outperform +previous state-of-the-art on both existing training sets and the new +RGLDv2-clean. Our dataset is available at +https://github.com/dealicious-inc/RGLDv2-clean.",cs.CV,"['cs.CV', 'cs.AI']" +Semantic Line Combination Detector,JINWON KO · Dongkwon Jin · Chang-Su Kim, ,https://arxiv.org/abs/2404.18399,,2404.18399.pdf,Semantic Line Combination Detector,"A novel algorithm, called semantic line combination detector (SLCD), to find +an optimal combination of semantic lines is proposed in this paper. It +processes all lines in each line combination at once to assess the overall +harmony of the lines. First, we generate various line combinations from +reliable lines. Second, we estimate the score of each line combination and +determine the best one. Experimental results demonstrate that the proposed SLCD +outperforms existing semantic line detectors on various datasets. Moreover, it +is shown that SLCD can be applied effectively to three vision tasks of +vanishing point detection, symmetry axis detection, and composition-based image +retrieval. Our codes are available at https://github.com/Jinwon-Ko/SLCD.",cs.CV,['cs.CV'] +Robust Noisy Correspondence Learning with Equivariant Similarity Consistency,Yuchen Yang · Erkun Yang · Likai Wang · Cheng Deng, ,,https://dl.acm.org/doi/10.1145/3662732,,,,,nan +Event-based Structure-from-Orbit,Ethan Elms · Yasir Latif · Tae Ha Park · Tat-Jun Chin, ,https://arxiv.org/abs/2405.06216,,2405.06216.pdf,Event-based Structure-from-Orbit,"Event sensors offer high temporal resolution visual sensing, which makes them +ideal for perceiving fast visual phenomena without suffering from motion blur. +Certain applications in robotics and vision-based navigation require 3D +perception of an object undergoing circular or spinning motion in front of a +static camera, such as recovering the angular velocity and shape of the object. +The setting is equivalent to observing a static object with an orbiting camera. +In this paper, we propose event-based structure-from-orbit (eSfO), where the +aim is to simultaneously reconstruct the 3D structure of a fast spinning object +observed from a static event camera, and recover the equivalent orbital motion +of the camera. Our contributions are threefold: since state-of-the-art event +feature trackers cannot handle periodic self-occlusion due to the spinning +motion, we develop a novel event feature tracker based on spatio-temporal +clustering and data association that can better track the helical trajectories +of valid features in the event data. The feature tracks are then fed to our +novel factor graph-based structure-from-orbit back-end that calculates the +orbital motion parameters (e.g., spin rate, relative rotational axis) that +minimize the reprojection error. For evaluation, we produce a new event dataset +of objects under spinning motion. Comparisons against ground truth indicate the +efficacy of eSfO.",cs.CV,['cs.CV'] +HandBooster: Boosting 3D Hand-Mesh Reconstruction by Conditional Synthesis and Sampling of Hand-Object Interactions,Hao Xu · Li Haipeng · Yinqiao Wang · Shuaicheng Liu · Chi-Wing Fu, ,https://arxiv.org/abs/2403.18575,,2403.18575.pdf,HandBooster: Boosting 3D Hand-Mesh Reconstruction by Conditional Synthesis and Sampling of Hand-Object Interactions,"Reconstructing 3D hand mesh robustly from a single image is very challenging, +due to the lack of diversity in existing real-world datasets. While data +synthesis helps relieve the issue, the syn-to-real gap still hinders its usage. 
+In this work, we present HandBooster, a new approach to uplift the data +diversity and boost the 3D hand-mesh reconstruction performance by training a +conditional generative space on hand-object interactions and purposely sampling +the space to synthesize effective data samples. First, we construct versatile +content-aware conditions to guide a diffusion model to produce realistic images +with diverse hand appearances, poses, views, and backgrounds; favorably, +accurate 3D annotations are obtained for free. Then, we design a novel +condition creator based on our similarity-aware distribution sampling +strategies to deliberately find novel and realistic interaction poses that are +distinctive from the training set. Equipped with our method, several baselines +can be significantly improved beyond the SOTA on the HO3D and DexYCB +benchmarks. Our code will be released on +https://github.com/hxwork/HandBooster_Pytorch.",cs.CV,['cs.CV'] +Customization Assistant for Text-to-image Generation,Yufan Zhou · Ruiyi Zhang · Jiuxiang Gu · Tong Sun, ,https://arxiv.org/abs/2312.03045,,2312.03045.pdf,Customization Assistant for Text-to-image Generation,"Customizing pre-trained text-to-image generation model has attracted massive +research interest recently, due to its huge potential in real-world +applications. Although existing methods are able to generate creative content +for a novel concept contained in single user-input image, their capability are +still far from perfection. Specifically, most existing methods require +fine-tuning the generative model on testing images. Some existing methods do +not require fine-tuning, while their performance are unsatisfactory. +Furthermore, the interaction between users and models are still limited to +directive and descriptive prompts such as instructions and captions. In this +work, we build a customization assistant based on pre-trained large language +model and diffusion model, which can not only perform customized generation in +a tuning-free manner, but also enable more user-friendly interactions: users +can chat with the assistant and input either ambiguous text or clear +instruction. Specifically, we propose a new framework consists of a new model +design and a novel training strategy. The resulting assistant can perform +customized generation in 2-5 seconds without any test time fine-tuning. +Extensive experiments are conducted, competitive results have been obtained +across different domains, illustrating the effectiveness of the proposed +method.",cs.CV,['cs.CV'] +Multi-Modal Proxy Learning Towards Personalized Visual Multiple Clustering,Jiawei Yao · Qi Qian · Juhua Hu,https://github.com/Alexander-Yao/Multi-MaP,https://arxiv.org/abs/2404.15655,,2404.15655.pdf,Multi-Modal Proxy Learning Towards Personalized Visual Multiple Clustering,"Multiple clustering has gained significant attention in recent years due to +its potential to reveal multiple hidden structures of data from different +perspectives. The advent of deep multiple clustering techniques has notably +advanced the performance by uncovering complex patterns and relationships +within large datasets. However, a major challenge arises as users often do not +need all the clusterings that algorithms generate, and figuring out the one +needed requires a substantial understanding of each clustering result. 
+Traditionally, aligning a user's brief keyword of interest with the +corresponding vision components was challenging, but the emergence of +multi-modal and large language models (LLMs) has begun to bridge this gap. In +response, given unlabeled target visual data, we propose Multi-MaP, a novel +method employing a multi-modal proxy learning process. It leverages CLIP +encoders to extract coherent text and image embeddings, with GPT-4 integrating +users' interests to formulate effective textual contexts. Moreover, reference +word constraint and concept-level constraint are designed to learn the optimal +text proxy according to the user's interest. Multi-MaP not only adeptly +captures a user's interest via a keyword but also facilitates identifying +relevant clusterings. Our extensive experiments show that Multi-MaP +consistently outperforms state-of-the-art methods in all benchmark +multi-clustering vision tasks. Our code is available at +https://github.com/Alexander-Yao/Multi-MaP.",cs.CV,['cs.CV'] +Anchor-based Robust Finetuning of Vision-Language Models,Jinwei Han · Zhiwen Lin · Zhongyisun Sun · Yingguo Gao · Ke Yan · Shouhong Ding · Yuan Gao · Gui-Song Xia,https://github.com/LixDemon/ARF,https://arxiv.org/abs/2404.06244,,2404.06244.pdf,Anchor-based Robust Finetuning of Vision-Language Models,"We aim at finetuning a vision-language model without hurting its +out-of-distribution (OOD) generalization. We address two types of OOD +generalization, i.e., i) domain shift such as natural to sketch images, and ii) +zero-shot capability to recognize the category that was not contained in the +finetune data. Arguably, the diminished OOD generalization after finetuning +stems from the excessively simplified finetuning target, which only provides +the class information, such as ``a photo of a [CLASS]''. This is distinct from +the process in that CLIP was pretrained, where there is abundant text +supervision with rich semantic information. Therefore, we propose to compensate +for the finetune process using auxiliary supervision with rich semantic +information, which acts as anchors to preserve the OOD generalization. +Specifically, two types of anchors are elaborated in our method, including i) +text-compensated anchor which uses the images from the finetune set but +enriches the text supervision from a pretrained captioner, ii) image-text-pair +anchor which is retrieved from the dataset similar to pretraining data of CLIP +according to the downstream task, associating with the original CLIP text with +rich semantics. Those anchors are utilized as auxiliary semantic information to +maintain the original feature space of CLIP, thereby preserving the OOD +generalization capabilities. Comprehensive experiments demonstrate that our +method achieves in-distribution performance akin to conventional finetuning +while attaining new state-of-the-art results on domain shift and zero-shot +learning benchmarks.",cs.CV,['cs.CV'] +LEAD: Exploring Logit Space Evolution for Model Selection,Zixuan Hu · Xiaotong Li · SHIXIANG TANG · Jun Liu · Yichun Hu · Ling-Yu Duan, ,https://arxiv.org/abs/2308.15074,,2308.15074.pdf,Exploring Model Transferability through the Lens of Potential Energy,"Transfer learning has become crucial in computer vision tasks due to the vast +availability of pre-trained deep learning models. However, selecting the +optimal pre-trained model from a diverse pool for a specific downstream task +remains a challenge. 
Existing methods for measuring the transferability of +pre-trained models rely on statistical correlations between encoded static +features and task labels, but they overlook the impact of underlying +representation dynamics during fine-tuning, leading to unreliable results, +especially for self-supervised models. In this paper, we present an insightful +physics-inspired approach named PED to address these challenges. We reframe the +challenge of model selection through the lens of potential energy and directly +model the interaction forces that influence fine-tuning dynamics. By capturing +the motion of dynamic representations to decline the potential energy within a +force-driven physical model, we can acquire an enhanced and more stable +observation for estimating transferability. The experimental results on 10 +downstream tasks and 12 self-supervised models demonstrate that our approach +can seamlessly integrate into existing ranking techniques and enhance their +performances, revealing its effectiveness for the model selection task and its +potential for understanding the mechanism in transfer learning. Code will be +available at https://github.com/lixiaotong97/PED.",cs.CV,"['cs.CV', 'cs.LG']" +Instruct-ReID: A Multi-purpose Person Re-identification Task with Instructions,Weizhen He · Yiheng Deng · SHIXIANG TANG · Qihao CHEN · Qingsong Xie · Yizhou Wang · Lei Bai · Feng Zhu · Rui Zhao · Wanli Ouyang · Donglian Qi · Yunfeng Yan, ,https://arxiv.org/abs/2306.07520,,2306.07520.pdf,Instruct-ReID: A Multi-purpose Person Re-identification Task with Instructions,"Human intelligence can retrieve any person according to both visual and +language descriptions. However, the current computer vision community studies +specific person re-identification (ReID) tasks in different scenarios +separately, which limits the applications in the real world. This paper strives +to resolve this problem by proposing a new instruct-ReID task that requires the +model to retrieve images according to the given image or language instructions. +Our instruct-ReID is a more general ReID setting, where existing 6 ReID tasks +can be viewed as special cases by designing different instructions. We propose +a large-scale OmniReID benchmark and an adaptive triplet loss as a baseline +method to facilitate research in this new setting. Experimental results show +that the proposed multi-purpose ReID model, trained on our OmniReID benchmark +without fine-tuning, can improve +0.5%, +0.6%, +7.7% mAP on Market1501, MSMT17, +CUHK03 for traditional ReID, +6.4%, +7.1%, +11.2% mAP on PRCC, VC-Clothes, LTCC +for clothes-changing ReID, +11.7% mAP on COCAS+ real2 for clothes template +based clothes-changing ReID when using only RGB images, +24.9% mAP on COCAS+ +real2 for our newly defined language-instructed ReID, +4.3% on LLCM for +visible-infrared ReID, +2.6% on CUHK-PEDES for text-to-image ReID. The +datasets, the model, and code will be available at +https://github.com/hwz-zju/Instruct-ReID.",cs.CV,['cs.CV'] +Learning Disentangled Identifiers for Action-Customized Text-to-Image Generation,Siteng Huang · Biao Gong · Yutong Feng · Xi Chen · Yuqian Fu · Yu Liu · Donglin Wang, ,https://arxiv.org/abs/2311.15841,,2311.15841.pdf,Learning Disentangled Identifiers for Action-Customized Text-to-Image Generation,"This study focuses on a novel task in text-to-image (T2I) generation, namely +action customization. The objective of this task is to learn the co-existing +action from limited data and generalize it to unseen humans or even animals. 
+Experimental results show that existing subject-driven customization methods +fail to learn the representative characteristics of actions and struggle in +decoupling actions from context features, including appearance. To overcome the +preference for low-level features and the entanglement of high-level features, +we propose an inversion-based method Action-Disentangled Identifier (ADI) to +learn action-specific identifiers from the exemplar images. ADI first expands +the semantic conditioning space by introducing layer-wise identifier tokens, +thereby increasing the representational richness while distributing the +inversion across different features. Then, to block the inversion of +action-agnostic features, ADI extracts the gradient invariance from the +constructed sample triples and masks the updates of irrelevant channels. To +comprehensively evaluate the task, we present an ActionBench that includes a +variety of actions, each accompanied by meticulously selected samples. Both +quantitative and qualitative results show that our ADI outperforms existing +baselines in action-customized T2I generation. Our project page is at +https://adi-t2i.github.io/ADI.",cs.CV,['cs.CV'] +ArGue: Attribute-Guided Prompt Tuning for Vision-Language Models,Xinyu Tian · Shu Zou · Zhaoyuan Yang · Jing Zhang, ,https://arxiv.org/abs/2311.16494,,2311.16494.pdf,ArGue: Attribute-Guided Prompt Tuning for Vision-Language Models,"Although soft prompt tuning is effective in efficiently adapting +Vision-Language (V&L) models for downstream tasks, it shows limitations in +dealing with distribution shifts. We address this issue with Attribute-Guided +Prompt Tuning (ArGue), making three key contributions. 1) In contrast to the +conventional approach of directly appending soft prompts preceding class names, +we align the model with primitive visual attributes generated by Large Language +Models (LLMs). We posit that a model's ability to express high confidence in +these attributes signifies its capacity to discern the correct class +rationales. 2) We introduce attribute sampling to eliminate disadvantageous +attributes, thus only semantically meaningful attributes are preserved. 3) We +propose negative prompting, explicitly enumerating class-agnostic attributes to +activate spurious correlations and encourage the model to generate highly +orthogonal probability distributions in relation to these negative features. In +experiments, our method significantly outperforms current state-of-the-art +prompt tuning methods on both novel class prediction and out-of-distribution +generalization tasks.",cs.CV,['cs.CV'] +Narrative Action Evaluation with Prompt-Guided Multimodal Interaction,Shiyi Zhang · Sule Bai · Guangyi Chen · Lei Chen · Jiwen Lu · Junle Wang · Yansong Tang,https://github.com/shiyi-zh0408/NAE_CVPR2024,https://arxiv.org/abs/2404.14471,,2404.14471.pdf,Narrative Action Evaluation with Prompt-Guided Multimodal Interaction,"In this paper, we investigate a new problem called narrative action +evaluation (NAE). NAE aims to generate professional commentary that evaluates +the execution of an action. Unlike traditional tasks such as score-based action +quality assessment and video captioning involving superficial sentences, NAE +focuses on creating detailed narratives in natural language. These narratives +provide intricate descriptions of actions along with objective evaluations. NAE +is a more challenging task because it requires both narrative flexibility and +evaluation rigor. 
One existing possible solution is to use multi-task learning, +where narrative language and evaluative information are predicted separately. +However, this approach results in reduced performance for individual tasks +because of variations between tasks and differences in modality between +language information and evaluation information. To address this, we propose a +prompt-guided multimodal interaction framework. This framework utilizes a pair +of transformers to facilitate the interaction between different modalities of +information. It also uses prompts to transform the score regression task into a +video-text matching task, thus enabling task interactivity. To support further +research in this field, we re-annotate the MTL-AQA and FineGym datasets with +high-quality and comprehensive action narration. Additionally, we establish +benchmarks for NAE. Extensive experiment results prove that our method +outperforms separate learning methods and naive multi-task learning methods. +Data and code are released at https://github.com/shiyi-zh0408/NAE_CVPR2024.",cs.CV,['cs.CV'] +Improved Implicit Neural Representation with Fourier Reparameterized Training,Kexuan Shi · Xingyu Zhou · Shuhang Gu, ,https://arxiv.org/abs/2401.07402,,2401.07402.pdf,Improved Implicit Neural Representation with Fourier Bases Reparameterized Training,"Implicit Neural Representation (INR) as a mighty representation paradigm has +achieved success in various computer vision tasks recently. Due to the +low-frequency bias issue of vanilla multi-layer perceptron (MLP), existing +methods have investigated advanced techniques, such as positional encoding and +periodic activation function, to improve the accuracy of INR. In this paper, we +connect the network training bias with the reparameterization technique and +theoretically prove that weight reparameterization could provide us a chance to +alleviate the spectral bias of MLP. Based on our theoretical analysis, we +propose a Fourier reparameterization method which learns coefficient matrix of +fixed Fourier bases to compose the weights of MLP. We evaluate the proposed +Fourier reparameterization method on different INR tasks with various MLP +architectures, including vanilla MLP, MLP with positional encoding and MLP with +advanced activation function, etc. The superiority approximation results on +different MLP architectures clearly validate the advantage of our proposed +method. Armed with our Fourier reparameterization method, better INR with more +textures and less artifacts can be learned from the training data.",cs.CV,['cs.CV'] +"Probabilistic Speech-Driven 3D Facial Motion Synthesis: New Benchmarks, Methods, and Applications",Karren Yang · Anurag Ranjan · Jen-Hao Rick Chang · Raviteja Vemulapalli · Oncel Tuzel, ,https://arxiv.org/abs/2311.18168,,2311.18168.pdf,"Probabilistic Speech-Driven 3D Facial Motion Synthesis: New Benchmarks, Methods, and Applications","We consider the task of animating 3D facial geometry from speech signal. +Existing works are primarily deterministic, focusing on learning a one-to-one +mapping from speech signal to 3D face meshes on small datasets with limited +speakers. While these models can achieve high-quality lip articulation for +speakers in the training set, they are unable to capture the full and diverse +distribution of 3D facial motions that accompany speech in the real world. 
+Importantly, the relationship between speech and facial motion is one-to-many, +containing both inter-speaker and intra-speaker variations and necessitating a +probabilistic approach. In this paper, we identify and address key challenges +that have so far limited the development of probabilistic models: lack of +datasets and metrics that are suitable for training and evaluating them, as +well as the difficulty of designing a model that generates diverse results +while remaining faithful to a strong conditioning signal as speech. We first +propose large-scale benchmark datasets and metrics suitable for probabilistic +modeling. Then, we demonstrate a probabilistic model that achieves both +diversity and fidelity to speech, outperforming other methods across the +proposed benchmarks. Finally, we showcase useful applications of probabilistic +models trained on these large-scale datasets: we can generate diverse +speech-driven 3D facial motion that matches unseen speaker styles extracted +from reference clips; and our synthetic meshes can be used to improve the +performance of downstream audio-visual models.",cs.CV,"['cs.CV', 'cs.LG', 'eess.AS']" +EFormer: Enhanced Transformer towards Semantic-Contour Features of Foreground for Portraits Matting,Zitao Wang · Qiguang Miao · Yue Xi · Peipei Zhao, ,https://arxiv.org/abs/2308.12831,,2308.12831.pdf,EFormer: Enhanced Transformer towards Semantic-Contour Features of Foreground for Portraits Matting,"The portrait matting task aims to extract an alpha matte with complete +semantics and finely-detailed contours. In comparison to CNN-based approaches, +transformers with self-attention module have a better capacity to capture +long-range dependencies and low-frequency semantic information of a portrait. +However, the recent research shows that self-attention mechanism struggles with +modeling high-frequency contour information and capturing fine contour details, +which can lead to bias while predicting the portrait's contours. To deal with +this issue, we propose EFormer to enhance the model's attention towards both of +the low-frequency semantic and high-frequency contour features. For the +high-frequency contours, our research demonstrates that cross-attention module +between different resolutions can guide our model to allocate attention +appropriately to these contour regions. Supported on this, we can successfully +extract the high-frequency detail information around the portrait's contours, +which are previously ignored by self-attention. Based on cross-attention +module, we further build a semantic and contour detector (SCD) to accurately +capture both of the low-frequency semantic and high-frequency contour features. +And we design contour-edge extraction branch and semantic extraction branch to +extract refined high-frequency contour features and complete low-frequency +semantic information, respectively. Finally, we fuse the two kinds of features +and leverage segmentation head to generate a predicted portrait matte. 
+Experiments on VideoMatte240K (JPEG SD Format) and Adobe Image Matting (AIM) +datasets demonstrate that EFormer outperforms previous portrait matte methods.",cs.CV,['cs.CV'] +Frozen CLIP: A Strong Backbone for Weakly Supervised Semantic Segmentation,Bingfeng Zhang · Siyue Yu · Yunchao Wei · Yao Zhao · Jimin Xiao, ,https://arxiv.org/html/2405.14294v1,,2405.14294v1.pdf,Tuning-free Universally-Supervised Semantic Segmentation,"This work presents a tuning-free semantic segmentation framework based on +classifying SAM masks by CLIP, which is universally applicable to various types +of supervision. Initially, we utilize CLIP's zero-shot classification ability +to generate pseudo-labels or perform open-vocabulary segmentation. However, the +misalignment between mask and CLIP text embeddings leads to suboptimal results. +To address this issue, we propose discrimination-bias aligned CLIP to closely +align mask and text embedding, offering an overhead-free performance gain. We +then construct a global-local consistent classifier to classify SAM masks, +which reveals the intrinsic structure of high-quality embeddings produced by +DBA-CLIP and demonstrates robustness against noisy pseudo-labels. Extensive +experiments validate the efficiency and effectiveness of our method, and we +achieve state-of-the-art (SOTA) or competitive performance across various +datasets and supervision types.",cs.CV,['cs.CV'] +A Simple Baseline for Efficient Hand Mesh Reconstruction,zhishan zhou · shihao zhou · Zhi Lv · minqiang zou · Yao Tang · Jiajun Liang,https://simplehand.github.io/,https://arxiv.org/abs/2403.01813,,2403.01813.pdf,A Simple Baseline for Efficient Hand Mesh Reconstruction,"3D hand pose estimation has found broad application in areas such as gesture +recognition and human-machine interaction tasks. As performance improves, the +complexity of the systems also increases, which can limit the comparative +analysis and practical implementation of these methods. In this paper, we +propose a simple yet effective baseline that not only surpasses +state-of-the-art (SOTA) methods but also demonstrates computational efficiency. +To establish this baseline, we abstract existing work into two components: a +token generator and a mesh regressor, and then examine their core structures. A +core structure, in this context, is one that fulfills intrinsic functions, +brings about significant improvements, and achieves excellent performance +without unnecessary complexities. Our proposed approach is decoupled from any +modifications to the backbone, making it adaptable to any modern models. Our +method outperforms existing solutions, achieving state-of-the-art (SOTA) +results across multiple datasets. On the FreiHAND dataset, our approach +produced a PA-MPJPE of 5.7mm and a PA-MPVPE of 6.0mm. Similarly, on the Dexycb +dataset, we observed a PA-MPJPE of 5.5mm and a PA-MPVPE of 5.0mm. As for +performance speed, our method reached up to 33 frames per second (fps) when +using HRNet and up to 70 fps when employing FastViT-MA36",cs.CV,['cs.CV'] +Score-Guided Diffusion for 3D Human Recovery,Anastasis Stathopoulos · Ligong Han · Dimitris N. Metaxas,https://statho.github.io/ScoreHMR/,http://export.arxiv.org/abs/2403.09623,,2403.09623.pdf,Score-Guided Diffusion for 3D Human Recovery,"We present Score-Guided Human Mesh Recovery (ScoreHMR), an approach for +solving inverse problems for 3D human pose and shape reconstruction. 
These +inverse problems involve fitting a human body model to image observations, +traditionally solved through optimization techniques. ScoreHMR mimics model +fitting approaches, but alignment with the image observation is achieved +through score guidance in the latent space of a diffusion model. The diffusion +model is trained to capture the conditional distribution of the human model +parameters given an input image. By guiding its denoising process with a +task-specific score, ScoreHMR effectively solves inverse problems for various +applications without the need for retraining the task-agnostic diffusion model. +We evaluate our approach on three settings/applications. These are: (i) +single-frame model fitting; (ii) reconstruction from multiple uncalibrated +views; (iii) reconstructing humans in video sequences. ScoreHMR consistently +outperforms all optimization baselines on popular benchmarks across all +settings. We make our code and models available at the +https://statho.github.io/ScoreHMR.",cs.CV,['cs.CV'] +Diversified and Personalized Multi-rater Medical Image Segmentation,Yicheng Wu · Xiangde Luo · Zhe Xu · Xiaoqing Guo · Lie Ju · Zongyuan Ge · Wenjun Liao · Jianfei Cai,https://github.com/ycwu1997/D-Persona,https://arxiv.org/abs/2403.13417,,2403.13417.pdf,Diversified and Personalized Multi-rater Medical Image Segmentation,"Annotation ambiguity due to inherent data uncertainties such as blurred +boundaries in medical scans and different observer expertise and preferences +has become a major obstacle for training deep-learning based medical image +segmentation models. To address it, the common practice is to gather multiple +annotations from different experts, leading to the setting of multi-rater +medical image segmentation. Existing works aim to either merge different +annotations into the ""groundtruth"" that is often unattainable in numerous +medical contexts, or generate diverse results, or produce personalized results +corresponding to individual expert raters. Here, we bring up a more ambitious +goal for multi-rater medical image segmentation, i.e., obtaining both +diversified and personalized results. Specifically, we propose a two-stage +framework named D-Persona (first Diversification and then Personalization). In +Stage I, we exploit multiple given annotations to train a Probabilistic U-Net +model, with a bound-constrained loss to improve the prediction diversity. In +this way, a common latent space is constructed in Stage I, where different +latent codes denote diversified expert opinions. Then, in Stage II, we design +multiple attention-based projection heads to adaptively query the corresponding +expert prompts from the shared latent space, and then perform the personalized +medical image segmentation. We evaluated the proposed model on our in-house +Nasopharyngeal Carcinoma dataset and the public lung nodule dataset (i.e., +LIDC-IDRI). Extensive experiments demonstrated our D-Persona can provide +diversified and personalized results at the same time, achieving new SOTA +performance for multi-rater medical image segmentation. 
Our code will be +released at https://github.com/ycwu1997/D-Persona.",cs.CV,['cs.CV'] +AnyDoor: Zero-shot Object-level Image Customization,Xi Chen · Lianghua Huang · Yu Liu · Yujun Shen · Deli Zhao · Hengshuang Zhao, ,https://arxiv.org/abs/2307.09481,,2307.09481.pdf,AnyDoor: Zero-shot Object-level Image Customization,"This work presents AnyDoor, a diffusion-based image generator with the power +to teleport target objects to new scenes at user-specified locations in a +harmonious way. Instead of tuning parameters for each object, our model is +trained only once and effortlessly generalizes to diverse object-scene +combinations at the inference stage. Such a challenging zero-shot setting +requires an adequate characterization of a certain object. To this end, we +complement the commonly used identity feature with detail features, which are +carefully designed to maintain texture details yet allow versatile local +variations (e.g., lighting, orientation, posture, etc.), supporting the object +in favorably blending with different surroundings. We further propose to borrow +knowledge from video datasets, where we can observe various forms (i.e., along +the time axis) of a single object, leading to stronger model generalizability +and robustness. Extensive experiments demonstrate the superiority of our +approach over existing alternatives as well as its great potential in +real-world applications, such as virtual try-on and object moving. Project page +is https://damo-vilab.github.io/AnyDoor-Page/.",cs.CV,['cs.CV'] +Is Vanilla MLP in Neural Radiance Field Enough for Few-shot View Synthesis?,Hanxin Zhu · Tianyu He · Xin Li · Bingchen Li · Zhibo Chen, ,https://arxiv.org/abs/2403.06092,,2403.06092.pdf,Is Vanilla MLP in Neural Radiance Field Enough for Few-shot View Synthesis?,"Neural Radiance Field (NeRF) has achieved superior performance for novel view +synthesis by modeling the scene with a Multi-Layer Perception (MLP) and a +volume rendering procedure, however, when fewer known views are given (i.e., +few-shot view synthesis), the model is prone to overfit the given views. To +handle this issue, previous efforts have been made towards leveraging learned +priors or introducing additional regularizations. In contrast, in this paper, +we for the first time provide an orthogonal method from the perspective of +network structure. Given the observation that trivially reducing the number of +model parameters alleviates the overfitting issue, but at the cost of missing +details, we propose the multi-input MLP (mi-MLP) that incorporates the inputs +(i.e., location and viewing direction) of the vanilla MLP into each layer to +prevent the overfitting issue without harming detailed synthesis. To further +reduce the artifacts, we propose to model colors and volume density separately +and present two regularization terms. Extensive experiments on multiple +datasets demonstrate that: 1) although the proposed mi-MLP is easy to +implement, it is surprisingly effective as it boosts the PSNR of the baseline +from $14.73$ to $24.23$. 2) the overall framework achieves state-of-the-art +results on a wide range of benchmarks. 
We will release the code upon +publication.",cs.CV,['cs.CV'] +PIGEON: Predicting Image Geolocations,Lukas Haas · Michal Skreta · Silas Alberti · Chelsea Finn,https://lukashaas.github.io/PIGEON-CVPR24/,,https://huggingface.co/papers/2307.05845,,,,,nan +Nearest Is Not Dearest: Towards Practical Defense against Quantization-conditioned Backdoor Attacks,Boheng Li · Yishuo Cai · Haowei Li · Feng Xue · Zhifeng Li · Yiming Li, ,https://arxiv.org/abs/2405.12725,,2405.12725.pdf,Nearest is Not Dearest: Towards Practical Defense against Quantization-conditioned Backdoor Attacks,"Model quantization is widely used to compress and accelerate deep neural +networks. However, recent studies have revealed the feasibility of weaponizing +model quantization via implanting quantization-conditioned backdoors (QCBs). +These special backdoors stay dormant on released full-precision models but will +come into effect after standard quantization. Due to the peculiarity of QCBs, +existing defenses have minor effects on reducing their threats or are even +infeasible. In this paper, we conduct the first in-depth analysis of QCBs. We +reveal that the activation of existing QCBs primarily stems from the nearest +rounding operation and is closely related to the norms of neuron-wise +truncation errors (i.e., the difference between the continuous full-precision +weights and its quantized version). Motivated by these insights, we propose +Error-guided Flipped Rounding with Activation Preservation (EFRAP), an +effective and practical defense against QCBs. Specifically, EFRAP learns a +non-nearest rounding strategy with neuron-wise error norm and layer-wise +activation preservation guidance, flipping the rounding strategies of neurons +crucial for backdoor effects but with minimal impact on clean accuracy. +Extensive evaluations on benchmark datasets demonstrate that our EFRAP can +defeat state-of-the-art QCB attacks under various settings. Code is available +at https://github.com/AntigoneRandy/QuantBackdoor_EFRAP.",cs.CR,"['cs.CR', 'cs.CV']" +Interactive3D: Create What You Want by Interactive 3D Generation,Shaocong Dong · Lihe Ding · Zhanpeng Huang · Zibin Wang · Tianfan Xue · Dan Xu, ,https://arxiv.org/abs/2404.16510,,2404.16510.pdf,Interactive3D: Create What You Want by Interactive 3D Generation,"3D object generation has undergone significant advancements, yielding +high-quality results. However, fall short of achieving precise user control, +often yielding results that do not align with user expectations, thus limiting +their applicability. User-envisioning 3D object generation faces significant +challenges in realizing its concepts using current generative models due to +limited interaction capabilities. Existing methods mainly offer two approaches: +(i) interpreting textual instructions with constrained controllability, or (ii) +reconstructing 3D objects from 2D images. Both of them limit customization to +the confines of the 2D reference and potentially introduce undesirable +artifacts during the 3D lifting process, restricting the scope for direct and +versatile 3D modifications. In this work, we introduce Interactive3D, an +innovative framework for interactive 3D generation that grants users precise +control over the generative process through extensive 3D interaction +capabilities. Interactive3D is constructed in two cascading stages, utilizing +distinct 3D representations. 
The first stage employs Gaussian Splatting for +direct user interaction, allowing modifications and guidance of the generative +direction at any intermediate step through (i) Adding and Removing components, +(ii) Deformable and Rigid Dragging, (iii) Geometric Transformations, and (iv) +Semantic Editing. Subsequently, the Gaussian splats are transformed into +InstantNGP. We introduce a novel (v) Interactive Hash Refinement module to +further add details and extract the geometry in the second stage. Our +experiments demonstrate that Interactive3D markedly improves the +controllability and quality of 3D generation. Our project webpage is available +at \url{https://interactive-3d.github.io/}.",cs.GR,"['cs.GR', 'cs.CV']" +Visual Concept Connectome (VCC): Open World Concept Discovery and their Interlayer Connections in Deep Models,Matthew Kowal · Richard P. Wildes · Kosta Derpanis,https://yorkucvil.github.io/VCC/,https://arxiv.org/abs/2404.02233,,2404.02233.pdf,Visual Concept Connectome (VCC): Open World Concept Discovery and their Interlayer Connections in Deep Models,"Understanding what deep network models capture in their learned +representations is a fundamental challenge in computer vision. We present a new +methodology to understanding such vision models, the Visual Concept Connectome +(VCC), which discovers human interpretable concepts and their interlayer +connections in a fully unsupervised manner. Our approach simultaneously reveals +fine-grained concepts at a layer, connection weightings across all layers and +is amendable to global analysis of network structure (e.g., branching pattern +of hierarchical concept assemblies). Previous work yielded ways to extract +interpretable concepts from single layers and examine their impact on +classification, but did not afford multilayer concept analysis across an entire +network architecture. Quantitative and qualitative empirical results show the +effectiveness of VCCs in the domain of image classification. Also, we leverage +VCCs for the application of failure mode debugging to reveal where mistakes +arise in deep networks.",cs.CV,['cs.CV'] +GP-NeRF: Generalized Perception NeRF for Context-Aware 3D Scene Understanding,Hao Li · Dingwen Zhang · Yalun Dai · Nian Liu · Lechao Cheng · Li Jingfeng · Jingdong Wang · Junwei Han, ,https://arxiv.org/abs/2311.11863,,2311.11863.pdf,GP-NeRF: Generalized Perception NeRF for Context-Aware 3D Scene Understanding,"Applying NeRF to downstream perception tasks for scene understanding and +representation is becoming increasingly popular. Most existing methods treat +semantic prediction as an additional rendering task, \textit{i.e.}, the ""label +rendering"" task, to build semantic NeRFs. However, by rendering +semantic/instance labels per pixel without considering the contextual +information of the rendered image, these methods usually suffer from unclear +boundary segmentation and abnormal segmentation of pixels within an object. To +solve this problem, we propose Generalized Perception NeRF (GP-NeRF), a novel +pipeline that makes the widely used segmentation model and NeRF work compatibly +under a unified framework, for facilitating context-aware 3D scene perception. +To accomplish this goal, we introduce transformers to aggregate radiance as +well as semantic embedding fields jointly for novel views and facilitate the +joint volumetric rendering of both fields. 
In addition, we propose two +self-distillation mechanisms, i.e., the Semantic Distill Loss and the +Depth-Guided Semantic Distill Loss, to enhance the discrimination and quality +of the semantic field and the maintenance of geometric consistency. In +evaluation, we conduct experimental comparisons under two perception tasks +(\textit{i.e.} semantic and instance segmentation) using both synthetic and +real-world datasets. Notably, our method outperforms SOTA approaches by 6.94\%, +11.76\%, and 8.47\% on generalized semantic segmentation, finetuning semantic +segmentation, and instance segmentation, respectively.",cs.CV,['cs.CV'] +Evidential Active Recognition: Intelligent and Prudent Open-World Embodied Perception,Lei Fan · Mingfu Liang · Yunxuan Li · Gang Hua · Ying Wu, ,https://arxiv.org/abs/2311.13793,,2311.13793.pdf,Evidential Active Recognition: Intelligent and Prudent Open-World Embodied Perception,"Active recognition enables robots to intelligently explore novel +observations, thereby acquiring more information while circumventing undesired +viewing conditions. Recent approaches favor learning policies from simulated or +collected data, wherein appropriate actions are more frequently selected when +the recognition is accurate. However, most recognition modules are developed +under the closed-world assumption, which makes them ill-equipped to handle +unexpected inputs, such as the absence of the target object in the current +observation. To address this issue, we propose treating active recognition as a +sequential evidence-gathering process, providing by-step uncertainty +quantification and reliable prediction under the evidence combination theory. +Additionally, the reward function developed in this paper effectively +characterizes the merit of actions when operating in open-world environments. +To evaluate the performance, we collect a dataset from an indoor simulator, +encompassing various recognition challenges such as distance, occlusion levels, +and visibility. Through a series of experiments on recognition and robustness +analysis, we demonstrate the necessity of introducing uncertainties to active +recognition and the superior performance of the proposed method.",cs.CV,"['cs.CV', 'cs.RO']" +CURSOR: Scalable Mixed-Order Hypergraph Matching with CUR Decomposition,Qixuan Zheng · Ming Zhang · Hong Yan, ,https://arxiv.org/abs/2402.16594,,2402.16594.pdf,CURSOR: Scalable Mixed-Order Hypergraph Matching with CUR Decomposition,"To achieve greater accuracy, hypergraph matching algorithms require +exponential increases in computational resources. Recent kd-tree-based +approximate nearest neighbor (ANN) methods, despite the sparsity of their +compatibility tensor, still require exhaustive calculations for large-scale +graph matching. This work utilizes CUR tensor decomposition and introduces a +novel cascaded second and third-order hypergraph matching framework (CURSOR) +for efficient hypergraph matching. A CUR-based second-order graph matching +algorithm is used to provide a rough match, and then the core of CURSOR, a +fiber-CUR-based tensor generation method, directly calculates entries of the +compatibility tensor by leveraging the initial second-order match result. This +significantly decreases the time complexity and tensor density. A probability +relaxation labeling (PRL)-based matching algorithm, especially suitable for +sparse tensors, is developed. 
Experiment results on large-scale synthetic +datasets and widely-adopted benchmark sets demonstrate the superiority of +CURSOR over existing methods. The tensor generation method in CURSOR can be +integrated seamlessly into existing hypergraph matching methods to improve +their performance and lower their computational costs.",cs.CV,['cs.CV'] +Total Selfie: Generating Full-Body Selfies,Bowei Chen · Brian Curless · Ira Kemelmacher-Shlizerman · Steve Seitz, ,https://arxiv.org/abs/2308.14740,,2308.14740.pdf,Total Selfie: Generating Full-Body Selfies,"We present a method to generate full-body selfies from photographs originally +taken at arms length. Because self-captured photos are typically taken close +up, they have limited field of view and exaggerated perspective that distorts +facial shapes. We instead seek to generate the photo some one else would take +of you from a few feet away. Our approach takes as input four selfies of your +face and body, a background image, and generates a full-body selfie in a +desired target pose. We introduce a novel diffusion-based approach to combine +all of this information into high-quality, well-composed photos of you with the +desired pose and background.",cs.CV,"['cs.CV', 'cs.GR', 'cs.LG']" +Novel View Synthesis with View-Dependent Effects from a Single Image,Juan Luis Gonzalez Bello · Munchurl Kim,https://kaist-viclab.github.io/monovde-site/,https://arxiv.org/abs/2312.08071v1,,2312.08071v1.pdf,Novel View Synthesis with View-Dependent Effects from a Single Image,"In this paper, we firstly consider view-dependent effects into single +image-based novel view synthesis (NVS) problems. For this, we propose to +exploit the camera motion priors in NVS to model view-dependent appearance or +effects (VDE) as the negative disparity in the scene. By recognizing +specularities ""follow"" the camera motion, we infuse VDEs into the input images +by aggregating input pixel colors along the negative depth region of the +epipolar lines. Also, we propose a `relaxed volumetric rendering' approximation +that allows computing the densities in a single pass, improving efficiency for +NVS from single images. Our method can learn single-image NVS from image +sequences only, which is a completely self-supervised learning method, for the +first time requiring neither depth nor camera pose annotations. We present +extensive experiment results and show that our proposed method can learn NVS +with VDEs, outperforming the SOTA single-view NVS methods on the RealEstate10k +and MannequinChallenge datasets.",cs.CV,"['cs.CV', 'eess.IV']" +An Asymmetric Augmented Self-Supervised Learning Method for Unsupervised Fine-Grained Image Hashing,Feiran Hu · Chenlin Zhang · Jiangliang GUO · Xiu-Shen Wei · Lin Zhao · Anqi Xu · Lingyan Gao, ,,https://link.springer.com/article/10.1007/s11263-024-02009-7,,,,,nan +TRINS: Towards Multimodal Language Models That Can Read,Ruiyi Zhang · Yanzhe Zhang · Jian Chen · Yufan Zhou · Jiuxiang Gu · Changyou Chen · Tong Sun, ,https://arxiv.org/html/2401.10005v1,,2401.10005v1.pdf,Advancing Large Multi-modal Models with Explicit Chain-of-Reasoning and Visual Question Generation,"The increasing demand for intelligent systems capable of interpreting and +reasoning about visual content requires the development of Large Multi-Modal +Models (LMMs) that are not only accurate but also have explicit reasoning +capabilities. This paper presents a novel approach to imbue an LMM with the +ability to conduct explicit reasoning based on visual content and textual +instructions. 
We introduce a system that can ask a question to acquire +necessary knowledge, thereby enhancing the robustness and explicability of the +reasoning process. Our method comprises the development of a novel dataset +generated by a Large Language Model (LLM), designed to promote chain-of-thought +reasoning combined with a question-asking mechanism. We designed an LMM, which +has high capabilities on region awareness to address the intricate requirements +of image-text alignment. The model undergoes a three-stage training phase, +starting with large-scale image-text alignment using a large-scale datasets, +followed by instruction tuning, and fine-tuning with a focus on +chain-of-thought reasoning. The results demonstrate a stride toward a more +robust, accurate, and interpretable LMM, capable of reasoning explicitly and +seeking information proactively when confronted with ambiguous visual input.",cs.CV,"['cs.CV', 'cs.CL']" +DiffusionGAN3D: Boosting Text-guided 3D Generation and Domain Adaptation by Combining 3D GANs and Diffusion Priors,Biwen Lei · Kai Yu · Mengyang Feng · Miaomiao Cui · Xuansong Xie, ,https://arxiv.org/abs/2312.16837,,2312.16837.pdf,DiffusionGAN3D: Boosting Text-guided 3D Generation and Domain Adaptation by Combining 3D GANs and Diffusion Priors,"Text-guided domain adaptation and generation of 3D-aware portraits find many +applications in various fields. However, due to the lack of training data and +the challenges in handling the high variety of geometry and appearance, the +existing methods for these tasks suffer from issues like inflexibility, +instability, and low fidelity. In this paper, we propose a novel framework +DiffusionGAN3D, which boosts text-guided 3D domain adaptation and generation by +combining 3D GANs and diffusion priors. Specifically, we integrate the +pre-trained 3D generative models (e.g., EG3D) and text-to-image diffusion +models. The former provides a strong foundation for stable and high-quality +avatar generation from text. And the diffusion models in turn offer powerful +priors and guide the 3D generator finetuning with informative direction to +achieve flexible and efficient text-guided domain adaptation. To enhance the +diversity in domain adaptation and the generation capability in text-to-avatar, +we introduce the relative distance loss and case-specific learnable triplane +respectively. Besides, we design a progressive texture refinement module to +improve the texture quality for both tasks above. Extensive experiments +demonstrate that the proposed framework achieves excellent results in both +domain adaptation and text-to-avatar tasks, outperforming existing methods in +terms of generation quality and efficiency. The project homepage is at +https://younglbw.github.io/DiffusionGAN3D-homepage/.",cs.CV,['cs.CV'] +Taming the Tail in Class-Conditional GANs: Knowledge Sharing via Unconditional Training at Lower Resolutions,Saeed Khorram · Mingqi Jiang · Mohamad Shahbazi · Mohamad Hosein Danesh · Li Fuxin, ,https://arxiv.org/abs/2402.17065,,2402.17065.pdf,Taming the Tail in Class-Conditional GANs: Knowledge Sharing via Unconditional Training at Lower Resolutions,"Despite the extensive research on training generative adversarial networks +(GANs) with limited training data, learning to generate images from long-tailed +training distributions remains fairly unexplored. 
In the presence of imbalanced +multi-class training data, GANs tend to favor classes with more samples, +leading to the generation of low-quality and less diverse samples in tail +classes. In this study, we aim to improve the training of class-conditional +GANs with long-tailed data. We propose a straightforward yet effective method +for knowledge sharing, allowing tail classes to borrow from the rich +information from classes with more abundant training data. More concretely, we +propose modifications to existing class-conditional GAN architectures to ensure +that the lower-resolution layers of the generator are trained entirely +unconditionally while reserving class-conditional generation for the +higher-resolution layers. Experiments on several long-tail benchmarks and GAN +architectures demonstrate a significant improvement over existing methods in +both the diversity and fidelity of the generated images. The code is available +at https://github.com/khorrams/utlo.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +AVFF: Audio-Visual Feature Fusion for Video Deepfake Detection,Trevine Oorloff · Surya Koppisetti · Nicolo Bonettini · Divyaraj Solanki · Ben Colman · Yaser Yacoob · Ali Shahriyari · Gaurav Bharaj, ,https://arxiv.org/abs/2310.03827,,2310.03827.pdf,Integrating Audio-Visual Features for Multimodal Deepfake Detection,"Deepfakes are AI-generated media in which an image or video has been +digitally modified. The advancements made in deepfake technology have led to +privacy and security issues. Most deepfake detection techniques rely on the +detection of a single modality. Existing methods for audio-visual detection do +not always surpass that of the analysis based on single modalities. Therefore, +this paper proposes an audio-visual-based method for deepfake detection, which +integrates fine-grained deepfake identification with binary classification. We +categorize the samples into four types by combining labels specific to each +single modality. This method enhances the detection under intra-domain and +cross-domain testing.",cs.CV,['cs.CV'] +Masked AutoDecoder is Effective Multi-Task Vision Generalist,Han Qiu · Jiaxing Huang · Peng Gao · Lewei Lu · Xiaoqin Zhang · Shijian Lu, ,https://arxiv.org/abs/2403.07692,,2403.07692.pdf,Masked AutoDecoder is Effective Multi-Task Vision Generalist,"Inspired by the success of general-purpose models in NLP, recent studies +attempt to unify different vision tasks in the same sequence format and employ +autoregressive Transformers for sequence prediction. They apply uni-directional +attention to capture sequential dependencies and generate task sequences +recursively. However, such autoregressive Transformers may not fit vision tasks +well, as vision task sequences usually lack the sequential dependencies +typically observed in natural languages. In this work, we design Masked +AutoDecoder~(MAD), an effective multi-task vision generalist. MAD consists of +two core designs. First, we develop a parallel decoding framework that +introduces bi-directional attention to capture contextual dependencies +comprehensively and decode vision task sequences in parallel. Second, we design +a masked sequence modeling approach that learns rich task contexts by masking +and reconstructing task sequences. In this way, MAD handles all the tasks by a +single network branch and a simple cross-entropy loss with minimal +task-specific designs. Extensive experiments demonstrate the great potential of +MAD as a new paradigm for unifying various vision tasks. 
MAD achieves superior +performance and inference efficiency compared to autoregressive counterparts +while obtaining competitive accuracy with task-specific models. Code will be +released.",cs.CV,['cs.CV'] +HiPose: Hierarchical Binary Surface Encoding and Correspondence Pruning for RGB-D 6DoF Object Pose Estimation,Yongliang Lin · Yongzhi Su · Praveen Nathan · Sandeep Inuganti · Yan Di · Martin Sundermeyer · Fabian Manhardt · Didier Stricker · Jason Rambach · Yu Zhang, ,https://arxiv.org/abs/2311.12588,,2311.12588.pdf,HiPose: Hierarchical Binary Surface Encoding and Correspondence Pruning for RGB-D 6DoF Object Pose Estimation,"In this work, we present a novel dense-correspondence method for 6DoF object +pose estimation from a single RGB-D image. While many existing data-driven +methods achieve impressive performance, they tend to be time-consuming due to +their reliance on rendering-based refinement approaches. To circumvent this +limitation, we present HiPose, which establishes 3D-3D correspondences in a +coarse-to-fine manner with a hierarchical binary surface encoding. Unlike +previous dense-correspondence methods, we estimate the correspondence surface +by employing point-to-surface matching and iteratively constricting the surface +until it becomes a correspondence point while gradually removing outliers. +Extensive experiments on public benchmarks LM-O, YCB-V, and T-Less demonstrate +that our method surpasses all refinement-free methods and is even on par with +expensive refinement-based approaches. Crucially, our approach is +computationally efficient and enables real-time critical applications with high +accuracy requirements.",cs.CV,['cs.CV'] +Diffusion Handles: Enabling 3D Edits for Diffusion Models by Lifting Activations to 3D,Karran Pandey · Paul Guerrero · Matheus Gadelha · Yannick Hold-Geoffroy · Karan Singh · Niloy J. Mitra,https://diffusionhandles.github.io/,https://arxiv.org/abs/2312.02190,,2312.02190.pdf,Diffusion Handles: Enabling 3D Edits for Diffusion Models by Lifting Activations to 3D,"Diffusion Handles is a novel approach to enabling 3D object edits on +diffusion images. We accomplish these edits using existing pre-trained +diffusion models, and 2D image depth estimation, without any fine-tuning or 3D +object retrieval. The edited results remain plausible, photo-real, and preserve +object identity. Diffusion Handles address a critically missing facet of +generative image based creative design, and significantly advance the +state-of-the-art in generative image editing. Our key insight is to lift +diffusion activations for an object to 3D using a proxy depth, 3D-transform the +depth and associated activations, and project them back to image space. The +diffusion process applied to the manipulated activations with identity control, +produces plausible edited images showing complex 3D occlusion and lighting +effects. We evaluate Diffusion Handles: quantitatively, on a large synthetic +data benchmark; and qualitatively by a user study, showing our output to be +more plausible, and better than prior art at both, 3D editing and identity +control. 
Project Webpage: https://diffusionhandles.github.io/",cs.CV,"['cs.CV', 'cs.GR']" +CAM Back Again: Large Kernel CNNs from a Weakly Supervised Object Localization Perspective,Shunsuke Yasuki · Masato Taki, ,,https://github.com/snskysk/CAM-Back-Again,,,,,nan +A Closer Look at the Few-Shot Adaptation of Large Vision-Language Models,Julio Silva-Rodríguez · Sina Hajimiri · Ismail Ben Ayed · Jose Dolz,https://jusiro.github.io/projects/clap,https://arxiv.org/abs/2312.12730,,2312.12730.pdf,A Closer Look at the Few-Shot Adaptation of Large Vision-Language Models,"Efficient transfer learning (ETL) is receiving increasing attention to adapt +large pre-trained language-vision models on downstream tasks with a few labeled +samples. While significant progress has been made, we reveal that +state-of-the-art ETL approaches exhibit strong performance only in +narrowly-defined experimental setups, and with a careful adjustment of +hyperparameters based on a large corpus of labeled samples. In particular, we +make two interesting, and surprising empirical observations. First, to +outperform a simple Linear Probing baseline, these methods require to optimize +their hyper-parameters on each target task. And second, they typically +underperform -- sometimes dramatically -- standard zero-shot predictions in the +presence of distributional drifts. Motivated by the unrealistic assumptions +made in the existing literature, i.e., access to a large validation set and +case-specific grid-search for optimal hyperparameters, we propose a novel +approach that meets the requirements of real-world scenarios. More concretely, +we introduce a CLass-Adaptive linear Probe (CLAP) objective, whose balancing +term is optimized via an adaptation of the general Augmented Lagrangian method +tailored to this context. We comprehensively evaluate CLAP on a broad span of +datasets and scenarios, demonstrating that it consistently outperforms SoTA +approaches, while yet being a much more efficient alternative.",cs.CV,['cs.CV'] +DiaLoc: An Iterative Approach to Embodied Dialog Localization,Chao Zhang · Mohan Li · Ignas Budvytis · Stephan Liwicki, ,https://arxiv.org/abs/2403.06846,,2403.06846.pdf,DiaLoc: An Iterative Approach to Embodied Dialog Localization,"Multimodal learning has advanced the performance for many vision-language +tasks. However, most existing works in embodied dialog research focus on +navigation and leave the localization task understudied. The few existing +dialog-based localization approaches assume the availability of entire dialog +prior to localizaiton, which is impractical for deployed dialog-based +localization. In this paper, we propose DiaLoc, a new dialog-based localization +framework which aligns with a real human operator behavior. Specifically, we +produce an iterative refinement of location predictions which can visualize +current pose believes after each dialog turn. DiaLoc effectively utilizes the +multimodal data for multi-shot localization, where a fusion encoder fuses +vision and dialog information iteratively. We achieve state-of-the-art results +on embodied dialog-based localization task, in single-shot (+7.08% in +Acc5@valUnseen) and multi-shot settings (+10.85% in Acc5@valUnseen). DiaLoc +narrows the gap between simulation and real-world applications, opening doors +for future research on collaborative localization and navigation.",cs.CV,['cs.CV'] +De-Diffusion Makes Text a Strong Cross-Modal Interface,Chen Wei · Chenxi Liu · Siyuan Qiao · Zhishuai Zhang · Alan L. 
Yuille · Jiahui Yu, ,https://arxiv.org/abs/2311.00618,,2311.00618.pdf,De-Diffusion Makes Text a Strong Cross-Modal Interface,"We demonstrate text as a strong cross-modal interface. Rather than relying on +deep embeddings to connect image and language as the interface representation, +our approach represents an image as text, from which we enjoy the +interpretability and flexibility inherent to natural language. We employ an +autoencoder that uses a pre-trained text-to-image diffusion model for decoding. +The encoder is trained to transform an input image into text, which is then fed +into the fixed text-to-image diffusion decoder to reconstruct the original +input -- a process we term De-Diffusion. Experiments validate both the +precision and comprehensiveness of De-Diffusion text representing images, such +that it can be readily ingested by off-the-shelf text-to-image tools and LLMs +for diverse multi-modal tasks. For example, a single De-Diffusion model can +generalize to provide transferable prompts for different text-to-image tools, +and also achieves a new state of the art on open-ended vision-language tasks by +simply prompting large language models with few-shot examples.",cs.CV,['cs.CV'] +MMM: Generative Masked Motion Model,Ekkasit Pinyoanuntapong · Pu Wang · Minwoo Lee · Chen Chen, ,https://arxiv.org/abs/2312.03596,,2312.03596.pdf,MMM: Generative Masked Motion Model,"Recent advances in text-to-motion generation using diffusion and +autoregressive models have shown promising results. However, these models often +suffer from a trade-off between real-time performance, high fidelity, and +motion editability. To address this gap, we introduce MMM, a novel yet simple +motion generation paradigm based on Masked Motion Model. MMM consists of two +key components: (1) a motion tokenizer that transforms 3D human motion into a +sequence of discrete tokens in latent space, and (2) a conditional masked +motion transformer that learns to predict randomly masked motion tokens, +conditioned on the pre-computed text tokens. By attending to motion and text +tokens in all directions, MMM explicitly captures inherent dependency among +motion tokens and semantic mapping between motion and text tokens. During +inference, this allows parallel and iterative decoding of multiple motion +tokens that are highly consistent with fine-grained text descriptions, +therefore simultaneously achieving high-fidelity and high-speed motion +generation. In addition, MMM has innate motion editability. By simply placing +mask tokens in the place that needs editing, MMM automatically fills the gaps +while guaranteeing smooth transitions between editing and non-editing parts. +Extensive experiments on the HumanML3D and KIT-ML datasets demonstrate that MMM +surpasses current leading methods in generating high-quality motion (evidenced +by superior FID scores of 0.08 and 0.429), while offering advanced editing +features such as body-part modification, motion in-betweening, and the +synthesis of long motion sequences. In addition, MMM is two orders of magnitude +faster on a single mid-range GPU than editable motion diffusion models. 
Our +project page is available at \url{https://exitudio.github.io/MMM-page}.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +RMem: Restricted Memory Banks Improve Video Object Segmentation,Junbao Zhou · Ziqi Pang · Yu-Xiong Wang, ,https://arxiv.org/abs/2403.11529,,2403.11529.pdf,Video Object Segmentation with Dynamic Query Modulation,"Storing intermediate frame segmentations as memory for long-range context +modeling, spatial-temporal memory-based methods have recently showcased +impressive results in semi-supervised video object segmentation (SVOS). +However, these methods face two key limitations: 1) relying on non-local +pixel-level matching to read memory, resulting in noisy retrieved features for +segmentation; 2) segmenting each object independently without interaction. +These shortcomings make the memory-based methods struggle in similar object and +multi-object segmentation. To address these issues, we propose a query +modulation method, termed QMVOS. This method summarizes object features into +dynamic queries and then treats them as dynamic filters for mask prediction, +thereby providing high-level descriptions and object-level perception for the +model. Efficient and effective multi-object interactions are realized through +inter-query attention. Extensive experiments demonstrate that our method can +bring significant improvements to the memory-based SVOS method and achieve +competitive performance on standard SVOS benchmarks. The code is available at +https://github.com/zht8506/QMVOS.",cs.CV,['cs.CV'] +Neural Implicit Morphing of Face Images,Guilherme Schardong · Tiago Novello · Hallison Paz · Iurii Medvedev · Vinícius Silva · Luiz Velho · Nuno Gonçalves,https://schardong.github.io/ifmorph/,https://arxiv.org/abs/2308.13888,,2308.13888.pdf,Neural Implicit Morphing of Face Images,"Face morphing is a problem in computer graphics with numerous artistic and +forensic applications. It is challenging due to variations in pose, lighting, +gender, and ethnicity. This task consists of a warping for feature alignment +and a blending for a seamless transition between the warped images. We propose +to leverage coord-based neural networks to represent such warpings and +blendings of face images. During training, we exploit the smoothness and +flexibility of such networks by combining energy functionals employed in +classical approaches without discretizations. Additionally, our method is +time-dependent, allowing a continuous warping/blending of the images. During +morphing inference, we need both direct and inverse transformations of the +time-dependent warping. The first (second) is responsible for warping the +target (source) image into the source (target) image. Our neural warping stores +those maps in a single network dismissing the need for inverting them. The +results of our experiments indicate that our method is competitive with both +classical and generative models under the lens of image quality and +face-morphing detectors. 
Aesthetically, the resulting images present a seamless +blending of diverse faces not yet usual in the literature.",cs.CV,"['cs.CV', 'cs.LG', 'I.4.8; I.4.10']" +"Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action",Jiasen Lu · Christopher Clark · Sangho Lee · Zichen Zhang · Savya Khosla · Ryan Marten · Derek Hoiem · Aniruddha Kembhavi,https://unified-io-2.allenai.org/,https://arxiv.org/abs/2312.17172,,2312.17172.pdf,"Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action","We present Unified-IO 2, the first autoregressive multimodal model that is +capable of understanding and generating image, text, audio, and action. To +unify different modalities, we tokenize inputs and outputs -- images, text, +audio, action, bounding boxes, etc., into a shared semantic space and then +process them with a single encoder-decoder transformer model. Since training +with such diverse modalities is challenging, we propose various architectural +improvements to stabilize model training. We train our model from scratch on a +large multimodal pre-training corpus from diverse sources with a multimodal +mixture of denoisers objective. To learn an expansive set of skills, such as +following multimodal instructions, we construct and finetune on an ensemble of +120 datasets with prompts and augmentations. With a single unified model, +Unified-IO 2 achieves state-of-the-art performance on the GRIT benchmark and +strong results in more than 35 benchmarks, including image generation and +understanding, natural language understanding, video and audio understanding, +and robotic manipulation. We release all our models to the research community.",cs.CV,"['cs.CV', 'cs.AI', 'cs.CL']" +SUGAR: Pre-training 3D Visual Representation for Robotics,Shizhe Chen · Ricardo Garcia Pinel · Ivan Laptev · Cordelia Schmid, ,https://arxiv.org/abs/2404.01491,,2404.01491.pdf,SUGAR: Pre-training 3D Visual Representations for Robotics,"Learning generalizable visual representations from Internet data has yielded +promising results for robotics. Yet, prevailing approaches focus on +pre-training 2D representations, being sub-optimal to deal with occlusions and +accurately localize objects in complex 3D scenes. Meanwhile, 3D representation +learning has been limited to single-object understanding. To address these +limitations, we introduce a novel 3D pre-training framework for robotics named +SUGAR that captures semantic, geometric and affordance properties of objects +through 3D point clouds. We underscore the importance of cluttered scenes in 3D +representation learning, and automatically construct a multi-object dataset +benefiting from cost-free supervision in simulation. SUGAR employs a versatile +transformer-based model to jointly address five pre-training tasks, namely +cross-modal knowledge distillation for semantic learning, masked point modeling +to understand geometry structures, grasping pose synthesis for object +affordance, 3D instance segmentation and referring expression grounding to +analyze cluttered scenes. We evaluate our learned representation on three +robotic-related tasks, namely, zero-shot 3D object recognition, referring +expression grounding, and language-driven robotic manipulation. 
Experimental +results show that SUGAR's 3D representation outperforms state-of-the-art 2D and +3D representations.",cs.CV,['cs.CV'] +GenN2N: Generative NeRF2NeRF Translation,Xiangyue Liu · Han Xue · Kunming Luo · Ping Tan · Li Yi, ,https://arxiv.org/abs/2404.02788,,2404.02788.pdf,GenN2N: Generative NeRF2NeRF Translation,"We present GenN2N, a unified NeRF-to-NeRF translation framework for various +NeRF translation tasks such as text-driven NeRF editing, colorization, +super-resolution, inpainting, etc. Unlike previous methods designed for +individual translation tasks with task-specific schemes, GenN2N achieves all +these NeRF editing tasks by employing a plug-and-play image-to-image translator +to perform editing in the 2D domain and lifting 2D edits into the 3D NeRF +space. Since the 3D consistency of 2D edits may not be assured, we propose to +model the distribution of the underlying 3D edits through a generative model +that can cover all possible edited NeRFs. To model the distribution of 3D +edited NeRFs from 2D edited images, we carefully design a VAE-GAN that encodes +images while decoding NeRFs. The latent space is trained to align with a +Gaussian distribution and the NeRFs are supervised through an adversarial loss +on its renderings. To ensure the latent code does not depend on 2D viewpoints +but truly reflects the 3D edits, we also regularize the latent code through a +contrastive learning scheme. Extensive experiments on various editing tasks +show GenN2N, as a universal framework, performs as well or better than +task-specific specialists while possessing flexible generative power. More +results on our project page: https://xiangyueliu.github.io/GenN2N/",cs.CV,['cs.CV'] +UniHuman: A Unified Model For Editing Human Images in the Wild,Nannan Li · Qing Liu · Krishna Kumar Singh · Yilin Wang · Jianming Zhang · Bryan A. Plummer · Zhe Lin, ,https://arxiv.org/abs/2312.14985,,2312.14985.pdf,UniHuman: A Unified Model for Editing Human Images in the Wild,"Human image editing includes tasks like changing a person's pose, their +clothing, or editing the image according to a text prompt. However, prior work +often tackles these tasks separately, overlooking the benefit of mutual +reinforcement from learning them jointly. In this paper, we propose UniHuman, a +unified model that addresses multiple facets of human image editing in +real-world settings. To enhance the model's generation quality and +generalization capacity, we leverage guidance from human visual encoders and +introduce a lightweight pose-warping module that can exploit different pose +representations, accommodating unseen textures and patterns. Furthermore, to +bridge the disparity between existing human editing benchmarks with real-world +data, we curated 400K high-quality human image-text pairs for training and +collected 2K human images for out-of-domain testing, both encompassing diverse +clothing styles, backgrounds, and age groups. Experiments on both in-domain and +out-of-domain test sets demonstrate that UniHuman outperforms task-specific +models by a significant margin. In user studies, UniHuman is preferred by the +users in an average of 77% of cases. 
Our project is available at +https://github.com/NannanLi999/UniHuman.",cs.CV,['cs.CV'] +Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers,Tsai-Shien Chen · Aliaksandr Siarohin · Willi Menapace · Ekaterina Deyneka · Hsiang-wei Chao · Byung Jeon · Yuwei Fang · Hsin-Ying Lee · Jian Ren · Ming-Hsuan Yang · Sergey Tulyakov,https://snap-research.github.io/Panda-70M/,https://arxiv.org/abs/2402.19479,,2402.19479.pdf,Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers,"The quality of the data and annotation upper-bounds the quality of a +downstream model. While there exist large text corpora and image-text pairs, +high-quality video-text data is much harder to collect. First of all, manual +labeling is more time-consuming, as it requires an annotator to watch an entire +video. Second, videos have a temporal dimension, consisting of several scenes +stacked together, and showing multiple actions. Accordingly, to establish a +video dataset with high-quality captions, we propose an automatic approach +leveraging multimodal inputs, such as textual video description, subtitles, and +individual video frames. Specifically, we curate 3.8M high-resolution videos +from the publicly available HD-VILA-100M dataset. We then split them into +semantically consistent video clips, and apply multiple cross-modality teacher +models to obtain captions for each video. Next, we finetune a retrieval model +on a small subset where the best caption of each video is manually selected and +then employ the model in the whole dataset to select the best caption as the +annotation. In this way, we get 70M videos paired with high-quality text +captions. We dub the dataset as Panda-70M. We show the value of the proposed +dataset on three downstream tasks: video captioning, video and text retrieval, +and text-driven video generation. The models trained on the proposed data score +substantially better on the majority of metrics across all the tasks.",cs.CV,['cs.CV'] +Personalized Residuals for Concept-Driven Text-to-Image Generation,Cusuh Ham · Matthew Fisher · James Hays · Nicholas Kolkin · Yuchen Liu · Richard Zhang · Tobias Hinz, ,https://arxiv.org/abs/2405.12978,,2405.12978.pdf,Personalized Residuals for Concept-Driven Text-to-Image Generation,"We present personalized residuals and localized attention-guided sampling for +efficient concept-driven generation using text-to-image diffusion models. Our +method first represents concepts by freezing the weights of a pretrained +text-conditioned diffusion model and learning low-rank residuals for a small +subset of the model's layers. The residual-based approach then directly enables +application of our proposed sampling technique, which applies the learned +residuals only in areas where the concept is localized via cross-attention and +applies the original diffusion weights in all other regions. Localized sampling +therefore combines the learned identity of the concept with the existing +generative prior of the underlying diffusion model. 
We show that personalized +residuals effectively capture the identity of a concept in ~3 minutes on a +single GPU without the use of regularization images and with fewer parameters +than previous models, and localized sampling allows using the original model as +strong prior for large parts of the image.",cs.CV,['cs.CV'] +SHiNe: Semantic Hierarchy Nexus for Open-vocabulary Object Detection,Mingxuan Liu · Tyler Hayes · Elisa Ricci · Gabriela Csurka · Riccardo Volpi,https://github.com/naver/shine,https://arxiv.org/abs/2405.10053,,2405.10053.pdf,SHiNe: Semantic Hierarchy Nexus for Open-vocabulary Object Detection,"Open-vocabulary object detection (OvOD) has transformed detection into a +language-guided task, empowering users to freely define their class +vocabularies of interest during inference. However, our initial investigation +indicates that existing OvOD detectors exhibit significant variability when +dealing with vocabularies across various semantic granularities, posing a +concern for real-world deployment. To this end, we introduce Semantic Hierarchy +Nexus (SHiNe), a novel classifier that uses semantic knowledge from class +hierarchies. It runs offline in three steps: i) it retrieves relevant +super-/sub-categories from a hierarchy for each target class; ii) it integrates +these categories into hierarchy-aware sentences; iii) it fuses these sentence +embeddings to generate the nexus classifier vector. Our evaluation on various +detection benchmarks demonstrates that SHiNe enhances robustness across diverse +vocabulary granularities, achieving up to +31.9% mAP50 with ground truth +hierarchies, while retaining improvements using hierarchies generated by large +language models. Moreover, when applied to open-vocabulary classification on +ImageNet-1k, SHiNe improves the CLIP zero-shot baseline by +2.8% accuracy. +SHiNe is training-free and can be seamlessly integrated with any off-the-shelf +OvOD detector, without incurring additional computational overhead during +inference. The code is open source.",cs.CV,['cs.CV'] +Diff-BGM: A Diffusion Model for Video Background Music Generation,Sizhe Li · Yiming Qin · Minghang Zheng · Xin Jin · Yang Liu, ,http://export.arxiv.org/abs/2405.11913,,2405.11913.pdf,Diff-BGM: A Diffusion Model for Video Background Music Generation,"When editing a video, a piece of attractive background music is +indispensable. However, video background music generation tasks face several +challenges, for example, the lack of suitable training datasets, and the +difficulties in flexibly controlling the music generation process and +sequentially aligning the video and music. In this work, we first propose a +high-quality music-video dataset BGM909 with detailed annotation and shot +detection to provide multi-modal information about the video and music. We then +present evaluation metrics to assess music quality, including music diversity +and alignment between music and video with retrieval precision metrics. +Finally, we propose the Diff-BGM framework to automatically generate the +background music for a given video, which uses different signals to control +different aspects of the music during the generation process, i.e., uses +dynamic video features to control music rhythm and semantic features to control +the melody and atmosphere. We propose to align the video and music sequentially +by introducing a segment-aware cross-attention layer. Experiments verify the +effectiveness of our proposed method. 
The code and models are available at +https://github.com/sizhelee/Diff-BGM.",cs.CV,['cs.CV'] +Efficient Detection of Long Consistent Cycles and its Application to Distributed Synchronization,Shaohan Li · Yunpeng Shi · Gilad Lerman, ,,https://www.semanticscholar.org/paper/Fully-distributed-synchronization-on-directed-via-Xia-Li/23d2c7b0150d90992f60c1d8a94d263beacb2bb0,,,,,nan +Learning Discriminative Dynamics with Label Corruption for Noisy Label Detection,Suyeon Kim · Dongha Lee · SeongKu Kang · Sukang Chae · Sanghwan Jang · Hwanjo Yu, ,https://arxiv.org/abs/2405.19902,,2405.19902.pdf,Learning Discriminative Dynamics with Label Corruption for Noisy Label Detection,"Label noise, commonly found in real-world datasets, has a detrimental impact +on a model's generalization. To effectively detect incorrectly labeled +instances, previous works have mostly relied on distinguishable training +signals, such as training loss, as indicators to differentiate between clean +and noisy labels. However, they have limitations in that the training signals +incompletely reveal the model's behavior and are not effectively generalized to +various noise types, resulting in limited detection accuracy. In this paper, we +propose DynaCor framework that distinguishes incorrectly labeled instances from +correctly labeled ones based on the dynamics of the training signals. To cope +with the absence of supervision for clean and noisy labels, DynaCor first +introduces a label corruption strategy that augments the original dataset with +intentionally corrupted labels, enabling indirect simulation of the model's +behavior on noisy labels. Then, DynaCor learns to identify clean and noisy +instances by inducing two clearly distinguishable clusters from the latent +representations of training dynamics. Our comprehensive experiments show that +DynaCor outperforms the state-of-the-art competitors and shows strong +robustness to various noise types and noise rates.",cs.LG,"['cs.LG', 'stat.ML']" +Cooperation Does Matter: Exploring Multi-Order Bilateral Relations for Audio-Visual Segmentation,Qi Yang · Xing Nie · Tong Li · Gaopengfei · Ying Guo · Cheng Zhen · Pengfei Yan · Shiming Xiang, ,https://arxiv.org/abs/2312.06462,,2312.06462.pdf,Cooperation Does Matter: Exploring Multi-Order Bilateral Relations for Audio-Visual Segmentation,"Recently, an audio-visual segmentation (AVS) task has been introduced, aiming +to group pixels with sounding objects within a given video. This task +necessitates a first-ever audio-driven pixel-level understanding of the scene, +posing significant challenges. In this paper, we propose an innovative +audio-visual transformer framework, termed COMBO, an acronym for COoperation of +Multi-order Bilateral relatiOns. For the first time, our framework explores +three types of bilateral entanglements within AVS: pixel entanglement, modality +entanglement, and temporal entanglement. Regarding pixel entanglement, we +employ a Siam-Encoder Module (SEM) that leverages prior knowledge to generate +more precise visual features from the foundational model. For modality +entanglement, we design a Bilateral-Fusion Module (BFM), enabling COMBO to +align corresponding visual and auditory signals bi-directionally. As for +temporal entanglement, we introduce an innovative adaptive inter-frame +consistency loss according to the inherent rules of temporal. 
Comprehensive +experiments and ablation studies on AVSBench-object (84.7 mIoU on S4, 59.2 mIou +on MS3) and AVSBench-semantic (42.1 mIoU on AVSS) datasets demonstrate that +COMBO surpasses previous state-of-the-art methods. Code and more results will +be publicly available at https://yannqi.github.io/AVS-COMBO/.",cs.CV,"['cs.CV', 'cs.AI', 'cs.SD', 'eess.AS']" +EMOPortraits: Emotion-enhanced Multimodal One-shot Head Avatars,Nikita Drobyshev · Antoni Bigata Casademunt · Konstantinos Vougioukas · Zoe Landgraf · Stavros Petridis · Maja Pantic, ,https://arxiv.org/abs/2404.19110,,2404.19110.pdf,EMOPortraits: Emotion-enhanced Multimodal One-shot Head Avatars,"Head avatars animated by visual signals have gained popularity, particularly +in cross-driving synthesis where the driver differs from the animated +character, a challenging but highly practical approach. The recently presented +MegaPortraits model has demonstrated state-of-the-art results in this domain. +We conduct a deep examination and evaluation of this model, with a particular +focus on its latent space for facial expression descriptors, and uncover +several limitations with its ability to express intense face motions. To +address these limitations, we propose substantial changes in both training +pipeline and model architecture, to introduce our EMOPortraits model, where we: + Enhance the model's capability to faithfully support intense, asymmetric face +expressions, setting a new state-of-the-art result in the emotion transfer +task, surpassing previous methods in both metrics and quality. + Incorporate speech-driven mode to our model, achieving top-tier performance +in audio-driven facial animation, making it possible to drive source identity +through diverse modalities, including visual signal, audio, or a blend of both. + We propose a novel multi-view video dataset featuring a wide range of intense +and asymmetric facial expressions, filling the gap with absence of such data in +existing datasets.",cs.CV,['cs.CV'] +NC-TTT: A Noise Constrastive Approach for Test-Time Training,David OSOWIECHI · Gustavo Vargas Hakim · Mehrdad Noori · Milad Cheraghalikhani · Ali Bahri · Moslem Yazdanpanah · Ismail Ben Ayed · Christian Desrosiers, ,https://arxiv.org/abs/2404.08392,,2404.08392.pdf,NC-TTT: A Noise Contrastive Approach for Test-Time Training,"Despite their exceptional performance in vision tasks, deep learning models +often struggle when faced with domain shifts during testing. Test-Time Training +(TTT) methods have recently gained popularity by their ability to enhance the +robustness of models through the addition of an auxiliary objective that is +jointly optimized with the main task. Being strictly unsupervised, this +auxiliary objective is used at test time to adapt the model without any access +to labels. In this work, we propose Noise-Contrastive Test-Time Training +(NC-TTT), a novel unsupervised TTT technique based on the discrimination of +noisy feature maps. By learning to classify noisy views of projected feature +maps, and then adapting the model accordingly on new domains, classification +performance can be recovered by an important margin. Experiments on several +popular test-time adaptation baselines demonstrate the advantages of our method +compared to recent approaches for this task. 
The code can be found +at:https://github.com/GustavoVargasHakim/NCTTT.git",cs.CV,"['cs.CV', 'cs.LG']" +Forecasting of 3D Whole-body Human Poses with Grasping Objects,yan haitao · Qiongjie Cui · Jiexin Xie · Shijie Guo, ,https://arxiv.org/abs/2312.11972,,2312.11972.pdf,Expressive Forecasting of 3D Whole-body Human Motions,"Human motion forecasting, with the goal of estimating future human behavior +over a period of time, is a fundamental task in many real-world applications. +However, existing works typically concentrate on predicting the major joints of +the human body without considering the delicate movements of the human hands. +In practical applications, hand gesture plays an important role in human +communication with the real world, and expresses the primary intention of human +beings. In this work, we are the first to formulate a whole-body human pose +forecasting task, which jointly predicts the future body and hand activities. +Correspondingly, we propose a novel Encoding-Alignment-Interaction (EAI) +framework that aims to predict both coarse (body joints) and fine-grained +(gestures) activities collaboratively, enabling expressive and +cross-facilitated forecasting of 3D whole-body human motions. Specifically, our +model involves two key constituents: cross-context alignment (XCA) and +cross-context interaction (XCI). Considering the heterogeneous information +within the whole-body, XCA aims to align the latent features of various human +components, while XCI focuses on effectively capturing the context interaction +among the human components. We conduct extensive experiments on a +newly-introduced large-scale benchmark and achieve state-of-the-art +performance. The code is public for research purposes at +https://github.com/Dingpx/EAI.",cs.CV,['cs.CV'] +QN-Mixer: A Quasi-Newton MLP-Mixer Model for Sparse-View CT Reconstruction,Ishak Ayad · Nicolas Larue · Mai K. Nguyen, ,https://arxiv.org/abs/2402.17951,,2402.17951.pdf,QN-Mixer: A Quasi-Newton MLP-Mixer Model for Sparse-View CT Reconstruction,"Inverse problems span across diverse fields. In medical contexts, computed +tomography (CT) plays a crucial role in reconstructing a patient's internal +structure, presenting challenges due to artifacts caused by inherently +ill-posed inverse problems. Previous research advanced image quality via +post-processing and deep unrolling algorithms but faces challenges, such as +extended convergence times with ultra-sparse data. Despite enhancements, +resulting images often show significant artifacts, limiting their effectiveness +for real-world diagnostic applications. We aim to explore deep second-order +unrolling algorithms for solving imaging inverse problems, emphasizing their +faster convergence and lower time complexity compared to common first-order +methods like gradient descent. In this paper, we introduce QN-Mixer, an +algorithm based on the quasi-Newton approach. We use learned parameters through +the BFGS algorithm and introduce Incept-Mixer, an efficient neural architecture +that serves as a non-local regularization term, capturing long-range +dependencies within images. To address the computational demands typically +associated with quasi-Newton algorithms that require full Hessian matrix +computations, we present a memory-efficient alternative. Our approach +intelligently downsamples gradient information, significantly reducing +computational requirements while maintaining performance. 
The approach is +validated through experiments on the sparse-view CT problem, involving various +datasets and scanning protocols, and is compared with post-processing and deep +unrolling state-of-the-art approaches. Our method outperforms existing +approaches and achieves state-of-the-art performance in terms of SSIM and PSNR, +all while reducing the number of unrolling iterations required.",eess.IV,"['eess.IV', 'cs.CV']" +Class Tokens Infusion for Weakly Supervised Semantic Segmentation,Sung-Hoon Yoon · Hoyong Kwon · Hyeonseong Kim · Kuk-Jin Yoon, ,http://export.arxiv.org/abs/2308.03005,,2308.03005.pdf,MCTformer+: Multi-Class Token Transformer for Weakly Supervised Semantic Segmentation,"This paper proposes a novel transformer-based framework that aims to enhance +weakly supervised semantic segmentation (WSSS) by generating accurate +class-specific object localization maps as pseudo labels. Building upon the +observation that the attended regions of the one-class token in the standard +vision transformer can contribute to a class-agnostic localization map, we +explore the potential of the transformer model to capture class-specific +attention for class-discriminative object localization by learning multiple +class tokens. We introduce a Multi-Class Token transformer, which incorporates +multiple class tokens to enable class-aware interactions with the patch tokens. +To achieve this, we devise a class-aware training strategy that establishes a +one-to-one correspondence between the output class tokens and the ground-truth +class labels. Moreover, a Contrastive-Class-Token (CCT) module is proposed to +enhance the learning of discriminative class tokens, enabling the model to +better capture the unique characteristics and properties of each class. As a +result, class-discriminative object localization maps can be effectively +generated by leveraging the class-to-patch attentions associated with different +class tokens. To further refine these localization maps, we propose the +utilization of patch-level pairwise affinity derived from the patch-to-patch +transformer attention. Furthermore, the proposed framework seamlessly +complements the Class Activation Mapping (CAM) method, resulting in +significantly improved WSSS performance on the PASCAL VOC 2012 and MS COCO 2014 +datasets. These results underline the importance of the class token for WSSS.",cs.CV,['cs.CV'] +Dual-consistency Model Inversion for Non-exemplar Class Incremental Learning,Zihuan Qiu · Yi Xu · Fanman Meng · Hongliang Li · Linfeng Xu · Qingbo Wu, ,https://ar5iv.labs.arxiv.org/html/2303.10891,,2303.10891.pdf,Non-Exemplar Online Class-incremental Continual Learning via Dual-prototype Self-augment and Refinement,"This paper investigates a new, practical, but challenging problem named +Non-exemplar Online Class-incremental continual Learning (NO-CL), which aims to +preserve the discernibility of base classes without buffering data examples and +efficiently learn novel classes continuously in a single-pass (i.e., online) +data stream. The challenges of this task are mainly two-fold: (1) Both base and +novel classes suffer from severe catastrophic forgetting as no previous samples +are available for replay. (2) As the online data can only be observed once, +there is no way to fully re-train the whole model, e.g., re-calibrate the +decision boundaries via prototype alignment or feature distillation. 
In this +paper, we propose a novel Dual-prototype Self-augment and Refinement method +(DSR) for NO-CL problem, which consists of two strategies: 1) Dual class +prototypes: vanilla and high-dimensional prototypes are exploited to utilize +the pre-trained information and obtain robust quasi-orthogonal representations +rather than example buffers for both privacy preservation and memory reduction. +2) Self-augment and refinement: Instead of updating the whole network, we +optimize high-dimensional prototypes alternatively with the extra projection +module based on self-augment vanilla prototypes, through a bi-level +optimization problem. Extensive experiments demonstrate the effectiveness and +superiority of the proposed DSR in NO-CL.",cs.CV,['cs.CV'] +MoDE: CLIP Data Experts via Clustering,Jiawei Ma · Po-Yao Huang · Saining Xie · Shang-Wen Li · Luke Zettlemoyer · Shih-Fu Chang · Wen-tau Yih · Hu Xu,https://github.com/facebookresearch/MetaCLIP/tree/main/mode,https://arxiv.org/abs/2404.16030,,2404.16030.pdf,MoDE: CLIP Data Experts via Clustering,"The success of contrastive language-image pretraining (CLIP) relies on the +supervision from the pairing between images and captions, which tends to be +noisy in web-crawled data. We present Mixture of Data Experts (MoDE) and learn +a system of CLIP data experts via clustering. Each data expert is trained on +one data cluster, being less sensitive to false negative noises in other +clusters. At inference time, we ensemble their outputs by applying weights +determined through the correlation between task metadata and cluster +conditions. To estimate the correlation precisely, the samples in one cluster +should be semantically similar, but the number of data experts should still be +reasonable for training and inference. As such, we consider the ontology in +human language and propose to use fine-grained cluster centers to represent +each data expert at a coarse-grained level. Experimental studies show that four +CLIP data experts on ViT-B/16 outperform the ViT-L/14 by OpenAI CLIP and +OpenCLIP on zero-shot image classification but with less ($<$35\%) training +cost. Meanwhile, MoDE can train all data expert asynchronously and can flexibly +include new data experts. The code is available at +https://github.com/facebookresearch/MetaCLIP/tree/main/mode.",cs.CV,"['cs.CV', 'cs.AI', 'cs.CL', 'cs.LG']" +FSC: Few-point Shape Completion,Xianzu Wu · Xianfeng Wu · Tianyu Luan · Yajing Bai · Zhongyuan Lai · Junsong Yuan, ,https://arxiv.org/abs/2403.07359v4,,2403.07359v4.pdf,FSC: Few-point Shape Completion,"While previous studies have demonstrated successful 3D object shape +completion with a sufficient number of points, they often fail in scenarios +when a few points, e.g. tens of points, are observed. Surprisingly, via entropy +analysis, we find that even a few points, e.g. 64 points, could retain +substantial information to help recover the 3D shape of the object. To address +the challenge of shape completion with very sparse point clouds, we then +propose Few-point Shape Completion (FSC) model, which contains a novel +dual-branch feature extractor for handling extremely sparse inputs, coupled +with an extensive branch for maximal point utilization with a saliency branch +for dynamic importance assignment. This model is further bolstered by a +two-stage revision network that refines both the extracted features and the +decoder output, enhancing the detail and authenticity of the completed point +cloud. 
Our experiments demonstrate the feasibility of recovering 3D shapes from +a few points. The proposed Few-point Shape Completion (FSC) model outperforms +previous methods on both few-point inputs and many-point inputs, and shows good +generalizability to different object categories.",cs.CV,['cs.CV'] +Equivariant Multi-Modality Image Fusion,Zixiang Zhao · Haowen Bai · Jiangshe Zhang · Yulun Zhang · Kai Zhang · Shuang Xu · Dongdong Chen · Radu Timofte · Luc Van Gool, ,https://arxiv.org/abs/2402.02235,,2402.02235.pdf,Image Fusion via Vision-Language Model,"Image fusion integrates essential information from multiple source images +into a single composite, emphasizing the highlighting structure and textures, +and refining imperfect areas. Existing methods predominantly focus on +pixel-level and semantic visual features for recognition. However, they +insufficiently explore the deeper semantic information at a text-level beyond +vision. Therefore, we introduce a novel fusion paradigm named image Fusion via +vIsion-Language Model (FILM), for the first time, utilizing explicit textual +information in different source images to guide image fusion. In FILM, input +images are firstly processed to generate semantic prompts, which are then fed +into ChatGPT to obtain rich textual descriptions. These descriptions are fused +in the textual domain and guide the extraction of crucial visual features from +the source images through cross-attention, resulting in a deeper level of +contextual understanding directed by textual semantic information. The final +fused image is created by vision feature decoder. This paradigm achieves +satisfactory results in four image fusion tasks: infrared-visible, medical, +multi-exposure, and multi-focus image fusion. We also propose a vision-language +dataset containing ChatGPT-based paragraph descriptions for the ten image +fusion datasets in four fusion tasks, facilitating future research in +vision-language model-based image fusion. Code and dataset will be released.",cs.CV,['cs.CV'] +High-Quality Facial Geometry and Appearance Capture at Home,Yuxuan Han · Junfeng Lyu · Feng Xu,https://yxuhan.github.io/CoRA/index.html,https://arxiv.org/abs/2312.03442,,2312.03442.pdf,High-Quality Facial Geometry and Appearance Capture at Home,"Facial geometry and appearance capture have demonstrated tremendous success +in 3D scanning real humans in studios. Recent works propose to democratize this +technique while keeping the results high quality. However, they are still +inconvenient for daily usage. In addition, they focus on an easier problem of +only capturing facial skin. This paper proposes a novel method for high-quality +face capture, featuring an easy-to-use system and the capability to model the +complete face with skin, mouth interior, hair, and eyes. We reconstruct facial +geometry and appearance from a single co-located smartphone flashlight sequence +captured in a dim room where the flashlight is the dominant light source (e.g. +rooms with curtains or at night). To model the complete face, we propose a +novel hybrid representation to effectively model both eyes and other facial +regions, along with novel techniques to learn it from images. We apply a +combined lighting model to compactly represent real illuminations and exploit a +morphable face albedo model as a reflectance prior to disentangle diffuse and +specular. 
Experiments show that our method can capture high-quality 3D +relightable scans.",cs.CV,['cs.CV'] +Multi-Object Tracking in the Dark,Xinzhe Wang · Kang Ma · Qiankun Liu · Yunhao Zou · Ying Fu, ,https://arxiv.org/abs/2405.06600,,2405.06600.pdf,Multi-Object Tracking in the Dark,"Low-light scenes are prevalent in real-world applications (e.g. autonomous +driving and surveillance at night). Recently, multi-object tracking in various +practical use cases have received much attention, but multi-object tracking in +dark scenes is rarely considered. In this paper, we focus on multi-object +tracking in dark scenes. To address the lack of datasets, we first build a +Low-light Multi-Object Tracking (LMOT) dataset. LMOT provides well-aligned +low-light video pairs captured by our dual-camera system, and high-quality +multi-object tracking annotations for all videos. Then, we propose a low-light +multi-object tracking method, termed as LTrack. We introduce the adaptive +low-pass downsample module to enhance low-frequency components of images +outside the sensor noises. The degradation suppression learning strategy +enables the model to learn invariant information under noise disturbance and +image quality degradation. These components improve the robustness of +multi-object tracking in dark scenes. We conducted a comprehensive analysis of +our LMOT dataset and proposed LTrack. Experimental results demonstrate the +superiority of the proposed method and its competitiveness in real night +low-light scenes. Dataset and Code: https: //github.com/ying-fu/LMOT",cs.CV,['cs.CV'] +VideoCon: Robust Video-Language Alignment via Contrast Captions,Hritik Bansal · Yonatan Bitton · Idan Szpektor · Kai-Wei Chang · Aditya Grover, ,https://arxiv.org/abs/2311.10111,,2311.10111.pdf,VideoCon: Robust Video-Language Alignment via Contrast Captions,"Despite being (pre)trained on a massive amount of data, state-of-the-art +video-language alignment models are not robust to semantically-plausible +contrastive changes in the video captions. Our work addresses this by +identifying a broad spectrum of contrast misalignments, such as replacing +entities, actions, and flipping event order, which alignment models should be +robust against. To this end, we introduce the VideoCon, a video-language +alignment dataset constructed by a large language model that generates +plausible contrast video captions and explanations for differences between +original and contrast video captions. Then, a generative video-language model +is finetuned with VideoCon to assess video-language entailment and generate +explanations. Our VideoCon-based alignment model significantly outperforms +current models. It exhibits a 12-point increase in AUC for the video-language +alignment task on human-generated contrast captions. Finally, our model sets +new state of the art zero-shot performance in temporally-extensive +video-language tasks such as text-to-video retrieval (SSv2-Temporal) and video +question answering (ATP-Hard). Moreover, our model shows superior performance +on novel videos and human-crafted captions and explanations. 
Our code and data +are available at https://github.com/Hritikbansal/videocon.",cs.CV,"['cs.CV', 'cs.AI', 'cs.CL', 'cs.LG']" +Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding,Sicong Leng · Hang Zhang · Guanzheng Chen · Xin Li · Shijian Lu · Chunyan Miao · Lidong Bing, ,https://arxiv.org/abs/2311.16922,,2311.16922.pdf,Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding,"Large Vision-Language Models (LVLMs) have advanced considerably, intertwining +visual recognition and language understanding to generate content that is not +only coherent but also contextually attuned. Despite their success, LVLMs still +suffer from the issue of object hallucinations, where models generate plausible +yet incorrect outputs that include objects that do not exist in the images. To +mitigate this issue, we introduce Visual Contrastive Decoding (VCD), a simple +and training-free method that contrasts output distributions derived from +original and distorted visual inputs. The proposed VCD effectively reduces the +over-reliance on statistical bias and unimodal priors, two essential causes of +object hallucinations. This adjustment ensures the generated content is closely +grounded to visual inputs, resulting in contextually accurate outputs. Our +experiments show that VCD, without either additional training or the usage of +external tools, significantly mitigates the object hallucination issue across +different LVLM families. Beyond mitigating object hallucinations, VCD also +excels in general LVLM benchmarks, highlighting its wide-ranging applicability.",cs.CV,"['cs.CV', 'cs.AI', 'cs.CL']" +SHINOBI: SHape and Illumination using Neural Object decomposition via BRDF optimization and Inverse rendering from unconstrained Image collections,Andreas Engelhardt · Amit Raj · Mark Boss · Yunzhi Zhang · Abhishek Kar · Yuanzhen Li · Ricardo Martin-Brualla · Jonathan T. Barron · Deqing Sun · Hendrik Lensch · Varun Jampani, ,https://arxiv.org/abs/2401.10171,,2401.10171.pdf,SHINOBI: Shape and Illumination using Neural Object Decomposition via BRDF Optimization In-the-wild,"We present SHINOBI, an end-to-end framework for the reconstruction of shape, +material, and illumination from object images captured with varying lighting, +pose, and background. Inverse rendering of an object based on unconstrained +image collections is a long-standing challenge in computer vision and graphics +and requires a joint optimization over shape, radiance, and pose. We show that +an implicit shape representation based on a multi-resolution hash encoding +enables faster and robust shape reconstruction with joint camera alignment +optimization that outperforms prior work. Further, to enable the editing of +illumination and object reflectance (i.e. material) we jointly optimize BRDF +and illumination together with the object's shape. Our method is class-agnostic +and works on in-the-wild image collections of objects to produce relightable 3D +assets for several use cases such as AR/VR, movies, games, etc. Project page: +https://shinobi.aengelhardt.com Video: +https://www.youtube.com/watch?v=iFENQ6AcYd8&feature=youtu.be",cs.CV,"['cs.CV', 'cs.GR']" +HEAL-SWIN: A Vision Transformer On The Sphere,Oscar Carlsson · Jan E. 
Gerken · Hampus Linander · Heiner Spiess · Fredrik Ohlsson · Christoffer Petersson · Daniel Persson, ,https://arxiv.org/abs/2307.07313,,2307.07313.pdf,HEAL-SWIN: A Vision Transformer On The Sphere,"High-resolution wide-angle fisheye images are becoming more and more +important for robotics applications such as autonomous driving. However, using +ordinary convolutional neural networks or vision transformers on this data is +problematic due to projection and distortion losses introduced when projecting +to a rectangular grid on the plane. We introduce the HEAL-SWIN transformer, +which combines the highly uniform Hierarchical Equal Area iso-Latitude +Pixelation (HEALPix) grid used in astrophysics and cosmology with the +Hierarchical Shifted-Window (SWIN) transformer to yield an efficient and +flexible model capable of training on high-resolution, distortion-free +spherical data. In HEAL-SWIN, the nested structure of the HEALPix grid is used +to perform the patching and windowing operations of the SWIN transformer, +enabling the network to process spherical representations with minimal +computational overhead. We demonstrate the superior performance of our model on +both synthetic and real automotive datasets, as well as a selection of other +image datasets, for semantic segmentation, depth regression and classification +tasks. Our code is publicly available at +https://github.com/JanEGerken/HEAL-SWIN.",cs.CV,"['cs.CV', 'cs.LG']" +BadCLIP: Dual-Embedding Guided Backdoor Attack on Multimodal Contrastive Learning,Siyuan Liang · Mingli Zhu · Aishan Liu · Baoyuan Wu · Xiaochun Cao · Ee-Chien Chang, ,https://arxiv.org/abs/2311.12075,,2311.12075.pdf,BadCLIP: Dual-Embedding Guided Backdoor Attack on Multimodal Contrastive Learning,"Studying backdoor attacks is valuable for model copyright protection and +enhancing defenses. While existing backdoor attacks have successfully infected +multimodal contrastive learning models such as CLIP, they can be easily +countered by specialized backdoor defenses for MCL models. This paper reveals +the threats in this practical scenario that backdoor attacks can remain +effective even after defenses and introduces the \emph{\toolns} attack, which +is resistant to backdoor detection and model fine-tuning defenses. To achieve +this, we draw motivations from the perspective of the Bayesian rule and propose +a dual-embedding guided framework for backdoor attacks. Specifically, we ensure +that visual trigger patterns approximate the textual target semantics in the +embedding space, making it challenging to detect the subtle parameter +variations induced by backdoor learning on such natural trigger patterns. +Additionally, we optimize the visual trigger patterns to align the poisoned +samples with target vision features in order to hinder the backdoor unlearning +through clean fine-tuning. Extensive experiments demonstrate that our attack +significantly outperforms state-of-the-art baselines (+45.3% ASR) in the +presence of SoTA backdoor defenses, rendering these mitigation and detection +strategies virtually ineffective. Furthermore, our approach effectively attacks +some more rigorous scenarios like downstream tasks. 
We believe that this paper +raises awareness regarding the potential threats associated with the practical +application of multimodal contrastive learning and encourages the development +of more robust defense mechanisms.",cs.CV,['cs.CV'] +Flexible Depth Completion for Sparse and Varying Point Densities,Jinhyung Park · Yu-Jhe Li · Kris Kitani, ,https://arxiv.org/abs/2405.09342,,2405.09342.pdf,Progressive Depth Decoupling and Modulating for Flexible Depth Completion,"Image-guided depth completion aims at generating a dense depth map from +sparse LiDAR data and RGB image. Recent methods have shown promising +performance by reformulating it as a classification problem with two sub-tasks: +depth discretization and probability prediction. They divide the depth range +into several discrete depth values as depth categories, serving as priors for +scene depth distributions. However, previous depth discretization methods are +easy to be impacted by depth distribution variations across different scenes, +resulting in suboptimal scene depth distribution priors. To address the above +problem, we propose a progressive depth decoupling and modulating network, +which incrementally decouples the depth range into bins and adaptively +generates multi-scale dense depth maps in multiple stages. Specifically, we +first design a Bins Initializing Module (BIM) to construct the seed bins by +exploring the depth distribution information within a sparse depth map, +adapting variations of depth distribution. Then, we devise an incremental depth +decoupling branch to progressively refine the depth distribution information +from global to local. Meanwhile, an adaptive depth modulating branch is +developed to progressively improve the probability representation from +coarse-grained to fine-grained. And the bi-directional information interactions +are proposed to strengthen the information interaction between those two +branches (sub-tasks) for promoting information complementation in each branch. +Further, we introduce a multi-scale supervision mechanism to learn the depth +distribution information in latent features and enhance the adaptation +capability across different scenes. Experimental results on public datasets +demonstrate that our method outperforms the state-of-the-art methods. The code +will be open-sourced at [this https URL](https://github.com/Cisse-away/PDDM).",cs.CV,['cs.CV'] +Neural Fields as Distributions: Signal Processing Beyond Euclidean Space,Daniel Rebain · Soroosh Yazdani · Kwang Moo Yi · Andrea Tagliasacchi, ,https://arxiv.org/abs/2404.13024,,,BANF: Band-limited Neural Fields for Levels of Detail Reconstruction,"Largely due to their implicit nature, neural fields lack a direct mechanism +for filtering, as Fourier analysis from discrete signal processing is not +directly applicable to these representations. Effective filtering of neural +fields is critical to enable level-of-detail processing in downstream +applications, and support operations that involve sampling the field on regular +grids (e.g. marching cubes). Existing methods that attempt to decompose neural +fields in the frequency domain either resort to heuristics or require extensive +modifications to the neural field architecture. We show that via a simple +modification, one can obtain neural fields that are low-pass filtered, and in +turn show how this can be exploited to obtain a frequency decomposition of the +entire signal. 
We demonstrate the validity of our technique by investigating +level-of-detail reconstruction, and showing how coarser representations can be +computed effectively.",cs.CV,"['cs.CV', 'eess.IV']" +Accelerating Neural Field Training via Soft Mining,Shakiba Kheradmand · Daniel Rebain · Gopal Sharma · Hossam Isack · Abhishek Kar · Andrea Tagliasacchi · Kwang Moo Yi, ,https://arxiv.org/abs/2312.00075,,2312.00075.pdf,Accelerating Neural Field Training via Soft Mining,"We present an approach to accelerate Neural Field training by efficiently +selecting sampling locations. While Neural Fields have recently become popular, +it is often trained by uniformly sampling the training domain, or through +handcrafted heuristics. We show that improved convergence and final training +quality can be achieved by a soft mining technique based on importance +sampling: rather than either considering or ignoring a pixel completely, we +weigh the corresponding loss by a scalar. To implement our idea we use Langevin +Monte-Carlo sampling. We show that by doing so, regions with higher error are +being selected more frequently, leading to more than 2x improvement in +convergence speed. The code and related resources for this study are publicly +available at https://ubc-vision.github.io/nf-soft-mining/.",cs.CV,['cs.CV'] +Promptable Behaviors: Personalizing Multi-Objective Rewards from Human Preferences,Minyoung Hwang · Luca Weihs · Chanwoo Park · Kimin Lee · Aniruddha Kembhavi · Kiana Ehsani, ,https://arxiv.org/abs/2312.09337,,2312.09337.pdf,Promptable Behaviors: Personalizing Multi-Objective Rewards from Human Preferences,"Customizing robotic behaviors to be aligned with diverse human preferences is +an underexplored challenge in the field of embodied AI. In this paper, we +present Promptable Behaviors, a novel framework that facilitates efficient +personalization of robotic agents to diverse human preferences in complex +environments. We use multi-objective reinforcement learning to train a single +policy adaptable to a broad spectrum of preferences. We introduce three +distinct methods to infer human preferences by leveraging different types of +interactions: (1) human demonstrations, (2) preference feedback on trajectory +comparisons, and (3) language instructions. We evaluate the proposed method in +personalized object-goal navigation and flee navigation tasks in ProcTHOR and +RoboTHOR, demonstrating the ability to prompt agent behaviors to satisfy human +preferences in various scenarios. Project page: +https://promptable-behaviors.github.io",cs.CV,"['cs.CV', 'cs.AI', 'cs.RO']" +Step differences in instructional video,Tushar Nagarajan · Lorenzo Torresani, ,https://arxiv.org/abs/2404.16222,,2404.16222.pdf,Step Differences in Instructional Video,"Comparing a user video to a reference how-to video is a key requirement for +AR/VR technology delivering personalized assistance tailored to the user's +progress. However, current approaches for language-based assistance can only +answer questions about a single video. We propose an approach that first +automatically generates large amounts of visual instruction tuning data +involving pairs of videos from HowTo100M by leveraging existing step +annotations and accompanying narrations, and then trains a video-conditioned +language model to jointly reason across multiple raw videos. 
Our model achieves +state-of-the-art performance at identifying differences between video pairs and +ranking videos based on the severity of these differences, and shows promising +ability to perform general reasoning over multiple videos.",cs.CV,['cs.CV'] +LEMON: Learning 3D Human-Object Interaction Relation from 2D Images,Yuhang Yang · Wei Zhai · Hongchen Luo · Yang Cao · Zheng-Jun Zha,https://yyvhang.github.io/LEMON/,https://arxiv.org/abs/2312.08963,,2312.08963.pdf,LEMON: Learning 3D Human-Object Interaction Relation from 2D Images,"Learning 3D human-object interaction relation is pivotal to embodied AI and +interaction modeling. Most existing methods approach the goal by learning to +predict isolated interaction elements, e.g., human contact, object affordance, +and human-object spatial relation, primarily from the perspective of either the +human or the object. Which underexploit certain correlations between the +interaction counterparts (human and object), and struggle to address the +uncertainty in interactions. Actually, objects' functionalities potentially +affect humans' interaction intentions, which reveals what the interaction is. +Meanwhile, the interacting humans and objects exhibit matching geometric +structures, which presents how to interact. In light of this, we propose +harnessing these inherent correlations between interaction counterparts to +mitigate the uncertainty and jointly anticipate the above interaction elements +in 3D space. To achieve this, we present LEMON (LEarning 3D huMan-Object +iNteraction relation), a unified model that mines interaction intentions of the +counterparts and employs curvatures to guide the extraction of geometric +correlations, combining them to anticipate the interaction elements. Besides, +the 3D Interaction Relation dataset (3DIR) is collected to serve as the test +bed for training and evaluation. Extensive experiments demonstrate the +superiority of LEMON over methods estimating each element in isolation.",cs.CV,['cs.CV'] +Physical Property Understanding from Language-Embedded Feature Fields,Albert J. Zhai · Yuan Shen · Emily Y. Chen · Gloria Wang · Xinlei Wang · Sheng Wang · Kaiyu Guan · Shenlong Wang, ,https://arxiv.org/abs/2404.04242,,2404.04242.pdf,Physical Property Understanding from Language-Embedded Feature Fields,"Can computers perceive the physical properties of objects solely through +vision? Research in cognitive science and vision science has shown that humans +excel at identifying materials and estimating their physical properties based +purely on visual appearance. In this paper, we present a novel approach for +dense prediction of the physical properties of objects using a collection of +images. Inspired by how humans reason about physics through vision, we leverage +large language models to propose candidate materials for each object. We then +construct a language-embedded point cloud and estimate the physical properties +of each 3D point using a zero-shot kernel regression approach. Our method is +accurate, annotation-free, and applicable to any object in the open world. 
+Experiments demonstrate the effectiveness of the proposed approach in various +physical property reasoning tasks, such as estimating the mass of common +objects, as well as other properties like friction and hardness.",cs.CV,"['cs.CV', 'cs.AI', 'cs.CL', 'cs.LG']" +ProS: Prompting-to-simulate Generalized knowledge for Universal Cross-Domain Retrieval,Fang Kaipeng · Jingkuan Song · Lianli Gao · Pengpeng Zeng · Zhi-Qi Cheng · Xiyao LI · Heng Tao Shen,https://github.com/fangkaipeng/ProS,https://arxiv.org/abs/2312.12478,,2312.12478.pdf,ProS: Prompting-to-simulate Generalized knowledge for Universal Cross-Domain Retrieval,"The goal of Universal Cross-Domain Retrieval (UCDR) is to achieve robust +performance in generalized test scenarios, wherein data may belong to strictly +unknown domains and categories during training. Recently, pre-trained models +with prompt tuning have shown strong generalization capabilities and attained +noteworthy achievements in various downstream tasks, such as few-shot learning +and video-text retrieval. However, applying them directly to UCDR may not +sufficiently to handle both domain shift (i.e., adapting to unfamiliar domains) +and semantic shift (i.e., transferring to unknown categories). To this end, we +propose \textbf{Pro}mpting-to-\textbf{S}imulate (ProS), the first method to +apply prompt tuning for UCDR. ProS employs a two-step process to simulate +Content-aware Dynamic Prompts (CaDP) which can impact models to produce +generalized features for UCDR. Concretely, in Prompt Units Learning stage, we +introduce two Prompt Units to individually capture domain and semantic +knowledge in a mask-and-align way. Then, in Context-aware Simulator Learning +stage, we train a Content-aware Prompt Simulator under a simulated test +scenarios to produce the corresponding CaDP. Extensive experiments conducted on +three benchmark datasets show that our method achieves new state-of-the-art +performance without bringing excessive parameters. Our method is publicly +available at https://github.com/fangkaipeng/ProS.",cs.CV,['cs.CV'] +CoDi-2: Interleaved and In-Context Any-to-Any Generation,Zineng Tang · Ziyi Yang · MAHMOUD KHADEMI · Yang Liu · Chenguang Zhu · Mohit Bansal, ,https://arxiv.org/abs/2311.18775,,2311.18775.pdf,"CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation","We present CoDi-2, a versatile and interactive Multimodal Large Language +Model (MLLM) that can follow complex multimodal interleaved instructions, +conduct in-context learning (ICL), reason, chat, edit, etc., in an any-to-any +input-output modality paradigm. By aligning modalities with language for both +encoding and generation, CoDi-2 empowers Large Language Models (LLMs) to not +only understand complex modality-interleaved instructions and in-context +examples, but also autoregressively generate grounded and coherent multimodal +outputs in the continuous feature space. To train CoDi-2, we build a +large-scale generation dataset encompassing in-context multimodal instructions +across text, vision, and audio. CoDi-2 demonstrates a wide range of zero-shot +capabilities for multimodal generation, such as in-context learning, reasoning, +and compositionality of any-to-any modality generation through multi-round +interactive conversation. CoDi-2 surpasses previous domain-specific models on +tasks such as subject-driven image generation, vision transformation, and audio +editing. 
CoDi-2 signifies a substantial breakthrough in developing a +comprehensive multimodal foundation model adept at interpreting in-context +language-vision-audio interleaved instructions and producing multimodal +outputs.",cs.CV,"['cs.CV', 'cs.AI', 'cs.CL', 'cs.LG', 'cs.SD', 'eess.AS']" +EmoGen: Emotional Image Content Generation with Text-to-Image Diffusion Models,Jingyuan Yang · Jiawei Feng · Hui Huang, ,https://arxiv.org/abs/2401.04608,,2401.04608.pdf,EmoGen: Emotional Image Content Generation with Text-to-Image Diffusion Models,"Recent years have witnessed remarkable progress in image generation task, +where users can create visually astonishing images with high-quality. However, +existing text-to-image diffusion models are proficient in generating concrete +concepts (dogs) but encounter challenges with more abstract ones (emotions). +Several efforts have been made to modify image emotions with color and style +adjustments, facing limitations in effectively conveying emotions with fixed +image contents. In this work, we introduce Emotional Image Content Generation +(EICG), a new task to generate semantic-clear and emotion-faithful images given +emotion categories. Specifically, we propose an emotion space and construct a +mapping network to align it with the powerful Contrastive Language-Image +Pre-training (CLIP) space, providing a concrete interpretation of abstract +emotions. Attribute loss and emotion confidence are further proposed to ensure +the semantic diversity and emotion fidelity of the generated images. Our method +outperforms the state-of-the-art text-to-image approaches both quantitatively +and qualitatively, where we derive three custom metrics, i.e., emotion +accuracy, semantic clarity and semantic diversity. In addition to generation, +our method can help emotion understanding and inspire emotional art design.",cs.CV,['cs.CV'] +Rapid 3D Model Generation with Intuitive 3D Input,Tianrun Chen · Chaotao Ding · Shangzhan Zhang · Chunan Yu · Ying Zang · Zejian Li · Sida Peng · Lingyun Sun, ,https://ar5iv.labs.arxiv.org/html/2309.13006,,2309.13006.pdf,Deep3DSketch+: Rapid 3D Modeling from Single Free-hand Sketches,"The rapid development of AR/VR brings tremendous demands for 3D content. +While the widely-used Computer-Aided Design (CAD) method requires a +time-consuming and labor-intensive modeling process, sketch-based 3D modeling +offers a potential solution as a natural form of computer-human interaction. +However, the sparsity and ambiguity of sketches make it challenging to generate +high-fidelity content reflecting creators' ideas. Precise drawing from multiple +views or strategic step-by-step drawings is often required to tackle the +challenge but is not friendly to novice users. In this work, we introduce a +novel end-to-end approach, Deep3DSketch+, which performs 3D modeling using only +a single free-hand sketch without inputting multiple sketches or view +information. Specifically, we introduce a lightweight generation network for +efficient inference in real-time and a structural-aware adversarial training +approach with a Stroke Enhancement Module (SEM) to capture the structural +information to facilitate learning of the realistic and fine-detailed shape +structures for high-fidelity performance. 
Extensive experiments demonstrated +the effectiveness of our approach with the state-of-the-art (SOTA) performance +on both synthetic and real datasets.",cs.CV,['cs.CV'] +L-MAGIC: Language Model Assisted Generation of Images with Consistency,zhipeng cai · Matthias Mueller · Reiner Birkl · Diana Wofk · Shao-Yen Tseng · JunDa Cheng · Gabriela Ben Melech Stan · Vasudev Lal · Michael Paulitsch, ,https://arxiv.org/abs/2311.16500,,2311.16500.pdf,LLMGA: Multimodal Large Language Model based Generation Assistant,"In this paper, we introduce a Multimodal Large Language Model-based +Generation Assistant (LLMGA), leveraging the vast reservoir of knowledge and +proficiency in reasoning, comprehension, and response inherent in Large +Language Models (LLMs) to assist users in image generation and editing. +Diverging from existing approaches where Multimodal Large Language Models +(MLLMs) generate fixed-size embeddings to control Stable Diffusion (SD), our +LLMGA provides a detailed language generation prompt for precise control over +SD. This not only augments LLM context understanding but also reduces noise in +generation prompts, yields images with more intricate and precise content, and +elevates the interpretability of the network. To this end, we curate a +comprehensive dataset comprising prompt refinement, similar image generation, +inpainting \& outpainting, and instruction-based editing. Moreover, we propose +a two-stage training scheme. In the first stage, we train the MLLM to grasp the +properties of image generation and editing, enabling it to generate detailed +prompts. In the second stage, we optimize SD to align with the MLLM's +generation prompts. Additionally, we propose a reference-based restoration +network to alleviate texture, brightness, and contrast disparities between +generated and preserved regions during inpainting and outpainting. Extensive +results show that LLMGA has promising generation and editing capabilities and +can enable more flexible and expansive applications in an interactive manner.",cs.CV,['cs.CV'] +Paint-it: Text-to-Texture Synthesis via Deep Convolutional Texture Map Optimization and Physically-Based Rendering,Kim Youwang · Tae-Hyun Oh · Gerard Pons-Moll, ,https://arxiv.org/abs/2312.11360v1,,2312.11360v1.pdf,Paint-it: Text-to-Texture Synthesis via Deep Convolutional Texture Map Optimization and Physically-Based Rendering,"We present Paint-it, a text-driven high-fidelity texture map synthesis method +for 3D meshes via neural re-parameterized texture optimization. Paint-it +synthesizes texture maps from a text description by +synthesis-through-optimization, exploiting the Score-Distillation Sampling +(SDS). We observe that directly applying SDS yields undesirable texture quality +due to its noisy gradients. We reveal the importance of texture +parameterization when using SDS. Specifically, we propose Deep Convolutional +Physically-Based Rendering (DC-PBR) parameterization, which re-parameterizes +the physically-based rendering (PBR) texture maps with randomly initialized +convolution-based neural kernels, instead of a standard pixel-based +parameterization. We show that DC-PBR inherently schedules the optimization +curriculum according to texture frequency and naturally filters out the noisy +signals from SDS. In experiments, Paint-it obtains remarkable quality PBR +texture maps within 15 min., given only a text description. 
We demonstrate the +generalizability and practicality of Paint-it by synthesizing high-quality +texture maps for large-scale mesh datasets and showing test-time applications +such as relighting and material control using a popular graphics engine. +Project page: https://kim-youwang.github.io/paint-it",cs.CV,"['cs.CV', 'cs.AI', 'cs.GR']" +RTMO: Towards High-Performance One-Stage Real-Time Multi-Person Pose Estimation,Peng Lu · Tao Jiang · Yining Li · Xiangtai Li · Kai Chen · Wenming Yang,https://github.com/open-mmlab/mmpose/tree/main/projects/rtmo,https://arxiv.org/abs/2312.07526,,2312.07526.pdf,RTMO: Towards High-Performance One-Stage Real-Time Multi-Person Pose Estimation,"Real-time multi-person pose estimation presents significant challenges in +balancing speed and precision. While two-stage top-down methods slow down as +the number of people in the image increases, existing one-stage methods often +fail to simultaneously deliver high accuracy and real-time performance. This +paper introduces RTMO, a one-stage pose estimation framework that seamlessly +integrates coordinate classification by representing keypoints using dual 1-D +heatmaps within the YOLO architecture, achieving accuracy comparable to +top-down methods while maintaining high speed. We propose a dynamic coordinate +classifier and a tailored loss function for heatmap learning, specifically +designed to address the incompatibilities between coordinate classification and +dense prediction models. RTMO outperforms state-of-the-art one-stage pose +estimators, achieving 1.1% higher AP on COCO while operating about 9 times +faster with the same backbone. Our largest model, RTMO-l, attains 74.8% AP on +COCO val2017 and 141 FPS on a single V100 GPU, demonstrating its efficiency and +accuracy. The code and models are available at +https://github.com/open-mmlab/mmpose/tree/main/projects/rtmo.",cs.CV,['cs.CV'] +Multi-Session SLAM using Wide-Baseline Optical Flow,Lahav Lipson · Jia Deng, ,https://arxiv.org/abs/2404.15263,,2404.15263.pdf,Multi-Session SLAM with Differentiable Wide-Baseline Pose Optimization,"We introduce a new system for Multi-Session SLAM, which tracks camera motion +across multiple disjoint videos under a single global reference. Our approach +couples the prediction of optical flow with solver layers to estimate camera +pose. The backbone is trained end-to-end using a novel differentiable solver +for wide-baseline two-view pose. The full system can connect disjoint +sequences, perform visual odometry, and global optimization. Compared to +existing approaches, our design is accurate and robust to catastrophic +failures. Code is available at github.com/princeton-vl/MultiSlam_DiffPose",cs.CV,['cs.CV'] +Action Scene Graphs for Long-Form Understanding of Egocentric Videos,Ivan Rodin · Antonino Furnari · Kyle Min · Subarna Tripathi · Giovanni Maria Farinella,https://github.com/fpv-iplab/easg,https://arxiv.org/abs/2312.03391,,2312.03391.pdf,Action Scene Graphs for Long-Form Understanding of Egocentric Videos,"We present Egocentric Action Scene Graphs (EASGs), a new representation for +long-form understanding of egocentric videos. EASGs extend standard +manually-annotated representations of egocentric videos, such as verb-noun +action labels, by providing a temporally evolving graph-based description of +the actions performed by the camera wearer, including interacted objects, their +relationships, and how actions unfold in time. 
Through a novel annotation +procedure, we extend the Ego4D dataset by adding manually labeled Egocentric +Action Scene Graphs offering a rich set of annotations designed for long-from +egocentric video understanding. We hence define the EASG generation task and +provide a baseline approach, establishing preliminary benchmarks. Experiments +on two downstream tasks, egocentric action anticipation and egocentric activity +summarization, highlight the effectiveness of EASGs for long-form egocentric +video understanding. We will release the dataset and the code to replicate +experiments and annotations.",cs.CV,['cs.CV'] +Modeling Multimodal Social Interactions: New Challenges and Baselines with Densely Aligned Representations,Sangmin Lee · Bolin Lai · Fiona Ryan · Bikram Boote · James Rehg,https://sangmin-git.github.io/projects/MMSI,https://arxiv.org/abs/2403.02090,,2403.02090.pdf,Modeling Multimodal Social Interactions: New Challenges and Baselines with Densely Aligned Representations,"Understanding social interactions involving both verbal and non-verbal cues +is essential for effectively interpreting social situations. However, most +prior works on multimodal social cues focus predominantly on single-person +behaviors or rely on holistic visual representations that are not aligned to +utterances in multi-party environments. Consequently, they are limited in +modeling the intricate dynamics of multi-party interactions. In this paper, we +introduce three new challenging tasks to model the fine-grained dynamics +between multiple people: speaking target identification, pronoun coreference +resolution, and mentioned player prediction. We contribute extensive data +annotations to curate these new challenges in social deduction game settings. +Furthermore, we propose a novel multimodal baseline that leverages densely +aligned language-visual representations by synchronizing visual features with +their corresponding utterances. This facilitates concurrently capturing verbal +and non-verbal cues pertinent to social reasoning. Experiments demonstrate the +effectiveness of the proposed approach with densely aligned multimodal +representations in modeling fine-grained social interactions. Project website: +https://sangmin-git.github.io/projects/MMSI.",cs.CV,"['cs.CV', 'cs.CL', 'cs.LG']" +Splatter Image: Ultra-Fast Single-View 3D Reconstruction,Stanislaw Szymanowicz · Christian Rupprecht · Andrea Vedaldi, ,https://arxiv.org/abs/2312.13150,,2312.13150.pdf,Splatter Image: Ultra-Fast Single-View 3D Reconstruction,"We introduce the \method, an ultra-efficient approach for monocular 3D object +reconstruction. Splatter Image is based on Gaussian Splatting, which allows +fast and high-quality reconstruction of 3D scenes from multiple images. We +apply Gaussian Splatting to monocular reconstruction by learning a neural +network that, at test time, performs reconstruction in a feed-forward manner, +at 38 FPS. Our main innovation is the surprisingly straightforward design of +this network, which, using 2D operators, maps the input image to one 3D +Gaussian per pixel. The resulting set of Gaussians thus has the form an image, +the Splatter Image. We further extend the method take several images as input +via cross-view attention. Owning to the speed of the renderer (588 FPS), we use +a single GPU for training while generating entire images at each iteration to +optimize perceptual metrics like LPIPS. 
On several synthetic, real, +multi-category and large-scale benchmark datasets, we achieve better results in +terms of PSNR, LPIPS, and other metrics while training and evaluating much +faster than prior works. Code, models, demo and more results are available at +https://szymanowiczs.github.io/splatter-image.",cs.CV,['cs.CV'] +MCD: Diverse Large-Scale Multi-Campus Dataset for Robot Perception,Thien-Minh Nguyen · Shenghai Yuan · Thien Nguyen · Pengyu Yin · Haozhi Cao · Lihua Xie · Maciej Wozniak · Patric Jensfelt · Marko Thiel · Justin Ziegenbein · Noel Blunder, ,https://arxiv.org/abs/2403.11496,,2403.11496.pdf,MCD: Diverse Large-Scale Multi-Campus Dataset for Robot Perception,"Perception plays a crucial role in various robot applications. However, +existing well-annotated datasets are biased towards autonomous driving +scenarios, while unlabelled SLAM datasets are quickly over-fitted, and often +lack environment and domain variations. To expand the frontier of these fields, +we introduce a comprehensive dataset named MCD (Multi-Campus Dataset), +featuring a wide range of sensing modalities, high-accuracy ground truth, and +diverse challenging environments across three Eurasian university campuses. MCD +comprises both CCS (Classical Cylindrical Spinning) and NRE (Non-Repetitive +Epicyclic) lidars, high-quality IMUs (Inertial Measurement Units), cameras, and +UWB (Ultra-WideBand) sensors. Furthermore, in a pioneering effort, we introduce +semantic annotations of 29 classes over 59k sparse NRE lidar scans across three +domains, thus providing a novel challenge to existing semantic segmentation +research upon this largely unexplored lidar modality. Finally, we propose, for +the first time to the best of our knowledge, continuous-time ground truth based +on optimization-based registration of lidar-inertial data on large survey-grade +prior maps, which are also publicly released, each several times the size of +existing ones. We conduct a rigorous evaluation of numerous state-of-the-art +algorithms on MCD, report their performance, and highlight the challenges +awaiting solutions from the research community.",cs.RO,"['cs.RO', 'cs.AI']" +FaceChain-ImagineID: Freely Crafting High-Fidelity Diverse Talking Faces from Disentangled Audio,Chao Xu · Yang Liu · Jiazheng Xing · Weida Wang · Mingze Sun · Jun Dan · Tianxin Huang · Siyuan Li · Zhi-Qi Cheng · Ying Tai · Baigui Sun, ,https://arxiv.org/abs/2403.01901,,2403.01901.pdf,FaceChain-ImagineID: Freely Crafting High-Fidelity Diverse Talking Faces from Disentangled Audio,"In this paper, we abstract the process of people hearing speech, extracting +meaningful cues, and creating various dynamically audio-consistent talking +faces, termed Listening and Imagining, into the task of high-fidelity diverse +talking faces generation from a single audio. Specifically, it involves two +critical challenges: one is to effectively decouple identity, content, and +emotion from entangled audio, and the other is to maintain intra-video +diversity and inter-video consistency. To tackle the issues, we first dig out +the intricate relationships among facial factors and simplify the decoupling +process, tailoring a Progressive Audio Disentanglement for accurate facial +geometry and semantics learning, where each stage incorporates a customized +training module responsible for a specific factor. 
Secondly, to achieve +visually diverse and audio-synchronized animation solely from input audio +within a single model, we introduce the Controllable Coherent Frame generation, +which involves the flexible integration of three trainable adapters with frozen +Latent Diffusion Models (LDMs) to focus on maintaining facial geometry and +semantics, as well as texture and temporal coherence between frames. In this +way, we inherit high-quality diverse generation from LDMs while significantly +improving their controllability at a low training cost. Extensive experiments +demonstrate the flexibility and effectiveness of our method in handling this +paradigm. The codes will be released at +https://github.com/modelscope/facechain.",cs.CV,['cs.CV'] +NeISF: Neural Incident Stokes Field for Geometry and Material Estimation,Chenhao Li · Taishi Ono · Takeshi Uemori · Hajime Mihara · Alexander Gatto · Hajime Nagahara · Yusuke Moriuchi, ,https://arxiv.org/abs/2311.13187v1,,2311.13187v1.pdf,NeISF: Neural Incident Stokes Field for Geometry and Material Estimation,"Multi-view inverse rendering is the problem of estimating the scene +parameters such as shapes, materials, or illuminations from a sequence of +images captured under different viewpoints. Many approaches, however, assume +single light bounce and thus fail to recover challenging scenarios like +inter-reflections. On the other hand, simply extending those methods to +consider multi-bounced light requires more assumptions to alleviate the +ambiguity. To address this problem, we propose Neural Incident Stokes Fields +(NeISF), a multi-view inverse rendering framework that reduces ambiguities +using polarization cues. The primary motivation for using polarization cues is +that it is the accumulation of multi-bounced light, providing rich information +about geometry and material. Based on this knowledge, the proposed incident +Stokes field efficiently models the accumulated polarization effect with the +aid of an original physically-based differentiable polarimetric renderer. +Lastly, experimental results show that our method outperforms the existing +works in synthetic and real scenarios.",cs.CV,['cs.CV'] +PTT: Point-Trajectory Transformer for Efficient Temporal 3D Object Detection,Kuan-Chih Huang · Weijie Lyu · Ming-Hsuan Yang · Yi-Hsuan Tsai, ,https://arxiv.org/abs/2312.08371,,2312.08371.pdf,PTT: Point-Trajectory Transformer for Efficient Temporal 3D Object Detection,"Recent temporal LiDAR-based 3D object detectors achieve promising performance +based on the two-stage proposal-based approach. They generate 3D box candidates +from the first-stage dense detector, followed by different temporal aggregation +methods. However, these approaches require per-frame objects or whole point +clouds, posing challenges related to memory bank utilization. Moreover, point +clouds and trajectory features are combined solely based on concatenation, +which may neglect effective interactions between them. In this paper, we +propose a point-trajectory transformer with long short-term memory for +efficient temporal 3D object detection. To this end, we only utilize point +clouds of current-frame objects and their historical trajectories as input to +minimize the memory bank storage requirement. Furthermore, we introduce modules +to encode trajectory features, focusing on long short-term and future-aware +perspectives, and then effectively aggregate them with point cloud features. 
We +conduct extensive experiments on the large-scale Waymo dataset to demonstrate +that our approach performs well against state-of-the-art methods. Code and +models will be made publicly available at https://github.com/kuanchihhuang/PTT.",cs.CV,['cs.CV'] +SDPose: Tokenized Pose Estimation via Circulation-Guide Self-Distillation,Chen Sichen · Yingyi Zhang · Siming Huang · Ran Yi · Ke Fan · Ruixin Zhang · Peixian Chen · Jun Wang · Shouhong Ding · Lizhuang Ma, ,https://arxiv.org/abs/2404.03518,,2404.03518.pdf,SDPose: Tokenized Pose Estimation via Circulation-Guide Self-Distillation,"Recently, transformer-based methods have achieved state-of-the-art prediction +quality on human pose estimation(HPE). Nonetheless, most of these +top-performing transformer-based models are too computation-consuming and +storage-demanding to deploy on edge computing platforms. Those +transformer-based models that require fewer resources are prone to +under-fitting due to their smaller scale and thus perform notably worse than +their larger counterparts. Given this conundrum, we introduce SDPose, a new +self-distillation method for improving the performance of small +transformer-based models. To mitigate the problem of under-fitting, we design a +transformer module named Multi-Cycled Transformer(MCT) based on multiple-cycled +forwards to more fully exploit the potential of small model parameters. +Further, in order to prevent the additional inference compute-consuming brought +by MCT, we introduce a self-distillation scheme, extracting the knowledge from +the MCT module to a naive forward model. Specifically, on the MSCOCO validation +dataset, SDPose-T obtains 69.7% mAP with 4.4M parameters and 1.8 GFLOPs. +Furthermore, SDPose-S-V2 obtains 73.5% mAP on the MSCOCO validation dataset +with 6.2M parameters and 4.7 GFLOPs, achieving a new state-of-the-art among +predominant tiny neural network methods. Our code is available at +https://github.com/MartyrPenink/SDPose.",cs.CV,['cs.CV'] +Uncertainty-aware Action Decoupling Transformer for Action Anticipation,Hongji Guo · Nakul Agarwal · Shao-Yuan Lo · Kwonjoon Lee · Qiang Ji, ,https://arxiv.org/abs/2309.16397,,2309.16397.pdf,Uncertainty-Aware Decision Transformer for Stochastic Driving Environments,"Offline Reinforcement Learning (RL) has emerged as a promising framework for +learning policies without active interactions, making it especially appealing +for autonomous driving tasks. Recent successes of Transformers inspire casting +offline RL as sequence modeling, which performs well in long-horizon tasks. +However, they are overly optimistic in stochastic environments with incorrect +assumptions that the same goal can be consistently achieved by identical +actions. In this paper, we introduce an UNcertainty-awaRE deciSion Transformer +(UNREST) for planning in stochastic driving environments without introducing +additional transition or complex generative models. Specifically, UNREST +estimates state uncertainties by the conditional mutual information between +transitions and returns, and segments sequences accordingly. Discovering the +`uncertainty accumulation' and `temporal locality' properties of driving +environments, UNREST replaces the global returns in decision transformers with +less uncertain truncated returns, to learn from true outcomes of agent actions +rather than environment transitions. We also dynamically evaluate environmental +uncertainty during inference for cautious planning. 
Extensive experimental +results demonstrate UNREST's superior performance in various driving scenarios +and the power of our uncertainty estimation strategy.",cs.LG,"['cs.LG', 'cs.AI']" +Robust Overfitting Does Matter: Test-Time Adversarial Purification With FGSM,Linyu Tang · Lei Zhang, ,https://arxiv.org/abs/2403.11448,,2403.11448.pdf,Robust Overfitting Does Matter: Test-Time Adversarial Purification With FGSM,"Numerous studies have demonstrated the susceptibility of deep neural networks +(DNNs) to subtle adversarial perturbations, prompting the development of many +advanced adversarial defense methods aimed at mitigating adversarial attacks. +Current defense strategies usually train DNNs for a specific adversarial attack +method and can achieve good robustness in defense against this type of +adversarial attack. Nevertheless, when subjected to evaluations involving +unfamiliar attack modalities, empirical evidence reveals a pronounced +deterioration in the robustness of DNNs. Meanwhile, there is a trade-off +between the classification accuracy of clean examples and adversarial examples. +Most defense methods often sacrifice the accuracy of clean examples in order to +improve the adversarial robustness of DNNs. To alleviate these problems and +enhance the overall robust generalization of DNNs, we propose the Test-Time +Pixel-Level Adversarial Purification (TPAP) method. This approach is based on +the robust overfitting characteristic of DNNs to the fast gradient sign method +(FGSM) on training and test datasets. It utilizes FGSM for adversarial +purification, to process images for purifying unknown adversarial perturbations +from pixels at testing time in a ""counter changes with changelessness"" manner, +thereby enhancing the defense capability of DNNs against various unknown +adversarial attacks. Extensive experimental results show that our method can +effectively improve both overall robust generalization of DNNs, notably over +previous methods.",cs.CV,['cs.CV'] +Compositional Video Understanding with Spatiotemporal Structure-based Transformers,Hoyeoung Yun · Jinwoo Ahn · Minseo Kim · Eun-Sol Kim, ,https://arxiv.org/abs/2401.10831,,2401.10831.pdf,Understanding Video Transformers via Universal Concept Discovery,"This paper studies the problem of concept-based interpretability of +transformer representations for videos. Concretely, we seek to explain the +decision-making process of video transformers based on high-level, +spatiotemporal concepts that are automatically discovered. Prior research on +concept-based interpretability has concentrated solely on image-level tasks. +Comparatively, video models deal with the added temporal dimension, increasing +complexity and posing challenges in identifying dynamic concepts over time. In +this work, we systematically address these challenges by introducing the first +Video Transformer Concept Discovery (VTCD) algorithm. To this end, we propose +an efficient approach for unsupervised identification of units of video +transformer representations - concepts, and ranking their importance to the +output of a model. The resulting concepts are highly interpretable, revealing +spatio-temporal reasoning mechanisms and object-centric representations in +unstructured video models. Performing this analysis jointly over a diverse set +of supervised and self-supervised representations, we discover that some of +these mechanism are universal in video transformers. 
Finally, we show that VTCD +can be used for fine-grained action recognition and video object segmentation.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG', 'cs.RO']" +Correlation-aware Coarse-to-fine MLPs for Deformable Medical Image Registration,Mingyuan Meng · Dagan Feng · Lei Bi · Jinman Kim, ,https://arxiv.org/abs/2311.16707,,2311.16707.pdf,Full-resolution MLPs Empower Medical Dense Prediction,"Dense prediction is a fundamental requirement for many medical vision tasks +such as medical image restoration, registration, and segmentation. The most +popular vision model, Convolutional Neural Networks (CNNs), has reached +bottlenecks due to the intrinsic locality of convolution operations. Recently, +transformers have been widely adopted for dense prediction for their capability +to capture long-range visual dependence. However, due to the high computational +complexity and large memory consumption of self-attention operations, +transformers are usually used at downsampled feature resolutions. Such usage +cannot effectively leverage the tissue-level textural information available +only at the full image resolution. This textural information is crucial for +medical dense prediction as it can differentiate the subtle human anatomy in +medical images. In this study, we hypothesize that Multi-layer Perceptrons +(MLPs) are superior alternatives to transformers in medical dense prediction +where tissue-level details dominate the performance, as MLPs enable long-range +dependence at the full image resolution. To validate our hypothesis, we develop +a full-resolution hierarchical MLP framework that uses MLPs beginning from the +full image resolution. We evaluate this framework with various MLP blocks on a +wide range of medical dense prediction tasks including restoration, +registration, and segmentation. Extensive experiments on six public +well-benchmarked datasets show that, by simply using MLPs at full resolution, +our framework outperforms its CNN and transformer counterparts and achieves +state-of-the-art performance on various medical dense prediction tasks.",eess.IV,"['eess.IV', 'cs.CV']" +Multimodal Representation Learning by Alternating Unimodal Adaptation,Xiaohui Zhang · Xiaohui Zhang · Jaehong Yoon · Mohit Bansal · Huaxiu Yao, ,https://arxiv.org/abs/2311.10707,,2311.10707.pdf,Multimodal Representation Learning by Alternating Unimodal Adaptation,"Multimodal learning, which integrates data from diverse sensory modes, plays +a pivotal role in artificial intelligence. However, existing multimodal +learning methods often struggle with challenges where some modalities appear +more dominant than others during multimodal learning, resulting in suboptimal +performance. To address this challenge, we propose MLA (Multimodal Learning +with Alternating Unimodal Adaptation). MLA reframes the conventional joint +multimodal learning process by transforming it into an alternating unimodal +learning process, thereby minimizing interference between modalities. +Simultaneously, it captures cross-modal interactions through a shared head, +which undergoes continuous optimization across different modalities. This +optimization process is controlled by a gradient modification mechanism to +prevent the shared head from losing previously acquired information. During the +inference phase, MLA utilizes a test-time uncertainty-based model fusion +mechanism to integrate multimodal information. 
Extensive experiments are +conducted on five diverse datasets, encompassing scenarios with complete +modalities and scenarios with missing modalities. These experiments demonstrate +the superiority of MLA over competing prior approaches. Our code is available +at +https://github.com/Cecile-hi/Multimodal-Learning-with-Alternating-Unimodal-Adaptation.",cs.LG,"['cs.LG', 'cs.CV']" +Towards Robust Audiovisual Segmentation in Complex Environments with Quantization-based Semantic Decomposition,Xiang Li · Jinglu Wang · Xiaohao Xu · Xiulian Peng · Rita Singh · Yan Lu · Bhiksha Raj, ,https://arxiv.org/abs/2310.00132,,2310.00132.pdf,QDFormer: Towards Robust Audiovisual Segmentation in Complex Environments with Quantization-based Semantic Decomposition,"Audiovisual segmentation (AVS) is a challenging task that aims to segment +visual objects in videos according to their associated acoustic cues. With +multiple sound sources and background disturbances involved, establishing +robust correspondences between audio and visual contents poses unique +challenges due to (1) complex entanglement across sound sources and (2) +frequent changes in the occurrence of distinct sound events. Assuming sound +events occur independently, the multi-source semantic space can be represented +as the Cartesian product of single-source sub-spaces. We are motivated to +decompose the multi-source audio semantics into single-source semantics for +more effective interactions with visual content. We propose a semantic +decomposition method based on product quantization, where the multi-source +semantics can be decomposed and represented by several disentangled and +noise-suppressed single-source semantics. Furthermore, we introduce a +global-to-local quantization mechanism, which distills knowledge from stable +global (clip-level) features into local (frame-level) ones, to handle frequent +changes in audio semantics. Extensive experiments demonstrate that our +semantically decomposed audio representation significantly improves AVS +performance, e.g., +21.2% mIoU on the challenging AVS-Semantic benchmark with +ResNet50 backbone. https://github.com/lxa9867/QSD.",cs.CV,['cs.CV'] +MuRF: Multi-Baseline Radiance Fields,Haofei Xu · Anpei Chen · Yuedong Chen · Christos Sakaridis · Yulun Zhang · Marc Pollefeys · Andreas Geiger · Fisher Yu,https://haofeixu.github.io/murf/,https://arxiv.org/abs/2312.04565v1,,2312.04565v1.pdf,MuRF: Multi-Baseline Radiance Fields,"We present Multi-Baseline Radiance Fields (MuRF), a general feed-forward +approach to solving sparse view synthesis under multiple different baseline +settings (small and large baselines, and different number of input views). To +render a target novel view, we discretize the 3D space into planes parallel to +the target image plane, and accordingly construct a target view frustum volume. +Such a target volume representation is spatially aligned with the target view, +which effectively aggregates relevant information from the input views for +high-quality rendering. It also facilitates subsequent radiance field +regression with a convolutional network thanks to its axis-aligned nature. The +3D context modeled by the convolutional network enables our method to synthesis +sharper scene structures than prior works. Our MuRF achieves state-of-the-art +performance across multiple different baseline settings and diverse scenarios +ranging from simple objects (DTU) to complex indoor and outdoor scenes +(RealEstate10K and LLFF). 
We also show promising zero-shot generalization +abilities on the Mip-NeRF 360 dataset, demonstrating the general applicability +of MuRF.",cs.CV,['cs.CV'] +Causal Mode Multiplexer: A Novel Framework for Unbiased Multispectral Pedestrian Detection,Taeheon Kim · Sebin Shin · Youngjoon Yu · Hak Gu Kim · Yong Man Ro, ,https://arxiv.org/abs/2403.01300,,2403.01300.pdf,Causal Mode Multiplexer: A Novel Framework for Unbiased Multispectral Pedestrian Detection,"RGBT multispectral pedestrian detection has emerged as a promising solution +for safety-critical applications that require day/night operations. However, +the modality bias problem remains unsolved as multispectral pedestrian +detectors learn the statistical bias in datasets. Specifically, datasets in +multispectral pedestrian detection mainly distribute between ROTO (day) and +RXTO (night) data; the majority of the pedestrian labels statistically co-occur +with their thermal features. As a result, multispectral pedestrian detectors +show poor generalization ability on examples beyond this statistical +correlation, such as ROTX data. To address this problem, we propose a novel +Causal Mode Multiplexer (CMM) framework that effectively learns the causalities +between multispectral inputs and predictions. Moreover, we construct a new +dataset (ROTX-MP) to evaluate modality bias in multispectral pedestrian +detection. ROTX-MP mainly includes ROTX examples not presented in previous +datasets. Extensive experiments demonstrate that our proposed CMM framework +generalizes well on existing datasets (KAIST, CVC-14, FLIR) and the new +ROTX-MP. We will release our new dataset to the public for future research.",cs.CV,['cs.CV'] +GoMVS: Geometrically Consistent Cost Aggregation for Multi-View Stereo,Jiang Wu · Rui Li · Haofei Xu · Wenxun Zhao · Yu Zhu · Jinqiu Sun · Yanning Zhang, ,https://arxiv.org/abs/2404.07992v1,,2404.07992v1.pdf,GoMVS: Geometrically Consistent Cost Aggregation for Multi-View Stereo,"Matching cost aggregation plays a fundamental role in learning-based +multi-view stereo networks. However, directly aggregating adjacent costs can +lead to suboptimal results due to local geometric inconsistency. Related +methods either seek selective aggregation or improve aggregated depth in the 2D +space, both are unable to handle geometric inconsistency in the cost volume +effectively. In this paper, we propose GoMVS to aggregate geometrically +consistent costs, yielding better utilization of adjacent geometries. More +specifically, we correspond and propagate adjacent costs to the reference pixel +by leveraging the local geometric smoothness in conjunction with surface +normals. We achieve this by the geometric consistent propagation (GCP) module. +It computes the correspondence from the adjacent depth hypothesis space to the +reference depth space using surface normals, then uses the correspondence to +propagate adjacent costs to the reference geometry, followed by a convolution +for aggregation. Our method achieves new state-of-the-art performance on DTU, +Tanks & Temple, and ETH3D datasets. Notably, our method ranks 1st on the Tanks +& Temple Advanced benchmark.",cs.CV,['cs.CV'] +Test-Time Adaptation for Depth Completion,Hyoungseob Park · Anjali W Gupta · Alex Wong, ,https://arxiv.org/abs/2402.03312,,2402.03312.pdf,Test-Time Adaptation for Depth Completion,"It is common to observe performance degradation when transferring models +trained on some (source) datasets to target testing data due to a domain gap +between them. 
Existing methods for bridging this gap, such as domain adaptation +(DA), may require the source data on which the model was trained (often not +available), while others, i.e., source-free DA, require many passes through the +testing data. We propose an online test-time adaptation method for depth +completion, the task of inferring a dense depth map from a single image and +associated sparse depth map, that closes the performance gap in a single pass. +We first present a study on how the domain shift in each data modality affects +model performance. Based on our observations that the sparse depth modality +exhibits a much smaller covariate shift than the image, we design an embedding +module trained in the source domain that preserves a mapping from features +encoding only sparse depth to those encoding image and sparse depth. During +test time, sparse depth features are projected using this map as a proxy for +source domain features and are used as guidance to train a set of auxiliary +parameters (i.e., adaptation layer) to align image and sparse depth features +from the target test domain to that of the source domain. We evaluate our +method on indoor and outdoor scenarios and show that it improves over baselines +by an average of 21.1%.",cs.CV,"['cs.CV', 'cs.LG']" +FedSelect: Personalized Federated Learning with Customized Selection of Parameters for Fine-Tuning,Rishub Tamirisa · Chulin Xie · Wenxuan Bao · Andy Zhou · Ron Arel · Aviv Shamsian, ,https://arxiv.org/abs/2404.02478,,2404.02478.pdf,FedSelect: Personalized Federated Learning with Customized Selection of Parameters for Fine-Tuning,"Standard federated learning approaches suffer when client data distributions +have sufficient heterogeneity. Recent methods addressed the client data +heterogeneity issue via personalized federated learning (PFL) - a class of FL +algorithms aiming to personalize learned global knowledge to better suit the +clients' local data distributions. Existing PFL methods usually decouple global +updates in deep neural networks by performing personalization on particular +layers (i.e. classifier heads) and global aggregation for the rest of the +network. However, preselecting network layers for personalization may result in +suboptimal storage of global knowledge. In this work, we propose FedSelect, a +novel PFL algorithm inspired by the iterative subnetwork discovery procedure +used for the Lottery Ticket Hypothesis. FedSelect incrementally expands +subnetworks to personalize client parameters, concurrently conducting global +aggregations on the remaining parameters. This approach enables the +personalization of both client parameters and subnetwork structure during the +training process. Finally, we show that FedSelect outperforms recent +state-of-the-art PFL algorithms under challenging client data heterogeneity +settings and demonstrates robustness to various real-world distributional +shifts. Our code is available at https://github.com/lapisrocks/fedselect.",cs.LG,"['cs.LG', 'cs.AI']" +Mitigating Motion Blur in Neural Radiance Fields with Events and Frames,Marco Cannici · Davide Scaramuzza,https://github.com/uzh-rpg/EvDeblurNeRF,https://arxiv.org/abs/2403.19780,,2403.19780.pdf,Mitigating Motion Blur in Neural Radiance Fields with Events and Frames,"Neural Radiance Fields (NeRFs) have shown great potential in novel view +synthesis. However, they struggle to render sharp images when the data used for +training is affected by motion blur. 
On the other hand, event cameras excel in +dynamic scenes as they measure brightness changes with microsecond resolution +and are thus only marginally affected by blur. Recent methods attempt to +enhance NeRF reconstructions under camera motion by fusing frames and events. +However, they face challenges in recovering accurate color content or constrain +the NeRF to a set of predefined camera poses, harming reconstruction quality in +challenging conditions. This paper proposes a novel formulation addressing +these issues by leveraging both model- and learning-based modules. We +explicitly model the blur formation process, exploiting the event double +integral as an additional model-based prior. Additionally, we model the +event-pixel response using an end-to-end learnable response function, allowing +our method to adapt to non-idealities in the real event-camera sensor. We show, +on synthetic and real data, that the proposed approach outperforms existing +deblur NeRFs that use only frames as well as those that combine frames and +events by +6.13dB and +2.48dB, respectively.",cs.CV,['cs.CV'] +Groupwise Query Specialization and Quality-Aware Multi-Assignment for Transformer-based Visual Relationship Detection,Jongha Kim · Jihwan Park · Jinyoung Park · Jinyoung Kim · Sehyung Kim · Hyunwoo J. Kim,https://github.com/mlvlab/speaq,https://arxiv.org/abs/2403.17709,,2403.17709.pdf,Groupwise Query Specialization and Quality-Aware Multi-Assignment for Transformer-based Visual Relationship Detection,"Visual Relationship Detection (VRD) has seen significant advancements with +Transformer-based architectures recently. However, we identify two key +limitations in a conventional label assignment for training Transformer-based +VRD models, which is a process of mapping a ground-truth (GT) to a prediction. +Under the conventional assignment, an unspecialized query is trained since a +query is expected to detect every relation, which makes it difficult for a +query to specialize in specific relations. Furthermore, a query is also +insufficiently trained since a GT is assigned only to a single prediction, +therefore near-correct or even correct predictions are suppressed by being +assigned no relation as a GT. To address these issues, we propose Groupwise +Query Specialization and Quality-Aware Multi-Assignment (SpeaQ). Groupwise +Query Specialization trains a specialized query by dividing queries and +relations into disjoint groups and directing a query in a specific query group +solely toward relations in the corresponding relation group. Quality-Aware +Multi-Assignment further facilitates the training by assigning a GT to multiple +predictions that are significantly close to a GT in terms of a subject, an +object, and the relation in between. Experimental results and analyses show +that SpeaQ effectively trains specialized queries, which better utilize the +capacity of a model, resulting in consistent performance gains with zero +additional inference cost across multiple VRD models and benchmarks. Code is +available at https://github.com/mlvlab/SpeaQ.",cs.CV,['cs.CV'] +LUWA Dataset: Learning Lithic Use-Wear Analysis on Microscopic Images,Jing Zhang · Irving Fang · Hao Wu · Akshat Kaushik · Alice Rodriguez · Hanwen Zhao · Juexiao Zhang · Zhuo Zheng · Radu Iovita · Chen Feng, ,https://arxiv.org/abs/2403.13171,,2403.13171.pdf,LUWA Dataset: Learning Lithic Use-Wear Analysis on Microscopic Images,"Lithic Use-Wear Analysis (LUWA) using microscopic images is an underexplored +vision-for-science research area. 
It seeks to distinguish the worked material, +which is critical for understanding archaeological artifacts, material +interactions, tool functionalities, and dental records. However, this +challenging task goes beyond the well-studied image classification problem for +common objects. It is affected by many confounders owing to the complex wear +mechanism and microscopic imaging, which makes it difficult even for human +experts to identify the worked material successfully. In this paper, we +investigate the following three questions on this unique vision task for the +first time:(i) How well can state-of-the-art pre-trained models (like DINOv2) +generalize to the rarely seen domain? (ii) How can few-shot learning be +exploited for scarce microscopic images? (iii) How do the ambiguous +magnification and sensing modality influence the classification accuracy? To +study these, we collaborated with archaeologists and built the first +open-source and the largest LUWA dataset containing 23,130 microscopic images +with different magnifications and sensing modalities. Extensive experiments +show that existing pre-trained models notably outperform human experts but +still leave a large gap for improvements. Most importantly, the LUWA dataset +provides an underexplored opportunity for vision and learning communities and +complements existing image classification problems on common objects.",cs.CV,['cs.CV'] +Flow-Guided Online Stereo Rectification for Wide Baseline Stereo,Anush Kumar · Fahim Mannan · Omid Hosseini Jafari · Shile Li · Felix Heide,https://light.princeton.edu/online-stereo-recification/,https://arxiv.org/abs/2309.10314,,2309.10314.pdf,Dive Deeper into Rectifying Homography for Stereo Camera Online Self-Calibration,"Accurate estimation of stereo camera extrinsic parameters is the key to +guarantee the performance of stereo matching algorithms. In prior arts, the +online self-calibration of stereo cameras has commonly been formulated as a +specialized visual odometry problem, without taking into account the principles +of stereo rectification. In this paper, we first delve deeply into the concept +of rectifying homography, which serves as the cornerstone for the development +of our novel stereo camera online self-calibration algorithm, for cases where +only a single pair of images is available. Furthermore, we introduce a simple +yet effective solution for global optimum extrinsic parameter estimation in the +presence of stereo video sequences. Additionally, we emphasize the +impracticality of using three Euler angles and three components in the +translation vectors for performance quantification. Instead, we introduce four +new evaluation metrics to quantify the robustness and accuracy of extrinsic +parameter estimation, applicable to both single-pair and multi-pair cases. +Extensive experiments conducted across indoor and outdoor environments using +various experimental setups validate the effectiveness of our proposed +algorithm. The comprehensive evaluation results demonstrate its superior +performance in comparison to the baseline algorithm. 
Our source code, demo +video, and supplement are publicly available at mias.group/StereoCalibrator.",cs.RO,"['cs.RO', 'cs.CV']" +Diffusion-ES: Gradient-free Planning with Diffusion for Autonomous and Instruction-guided Driving,Brian Yang · Huangyuan Su · Nikolaos Gkanatsios · Tsung-Wei Ke · Ayush Jain · Jeff Schneider · Katerina Fragkiadaki, ,https://arxiv.org/abs/2402.06559,,2402.06559.pdf,Diffusion-ES: Gradient-free Planning with Diffusion for Autonomous Driving and Zero-Shot Instruction Following,"Diffusion models excel at modeling complex and multimodal trajectory +distributions for decision-making and control. Reward-gradient guided denoising +has been recently proposed to generate trajectories that maximize both a +differentiable reward function and the likelihood under the data distribution +captured by a diffusion model. Reward-gradient guided denoising requires a +differentiable reward function fitted to both clean and noised samples, +limiting its applicability as a general trajectory optimizer. In this paper, we +propose DiffusionES, a method that combines gradient-free optimization with +trajectory denoising to optimize black-box non-differentiable objectives while +staying in the data manifold. Diffusion-ES samples trajectories during +evolutionary search from a diffusion model and scores them using a black-box +reward function. It mutates high-scoring trajectories using a truncated +diffusion process that applies a small number of noising and denoising steps, +allowing for much more efficient exploration of the solution space. We show +that DiffusionES achieves state-of-the-art performance on nuPlan, an +established closed-loop planning benchmark for autonomous driving. Diffusion-ES +outperforms existing sampling-based planners, reactive deterministic or +diffusion-based policies, and reward-gradient guidance. Additionally, we show +that unlike prior guidance methods, our method can optimize non-differentiable +language-shaped reward functions generated by few-shot LLM prompting. When +guided by a human teacher that issues instructions to follow, our method can +generate novel, highly complex behaviors, such as aggressive lane weaving, +which are not present in the training data. This allows us to solve the hardest +nuPlan scenarios which are beyond the capabilities of existing trajectory +optimization methods and driving policies.",cs.LG,"['cs.LG', 'cs.AI', 'cs.CL', 'cs.RO']" +Sculpting Holistic 3D Representation in Contrastive Language-Image-3D Pre-training,Yipeng Gao · Zeyu Wang · Wei-Shi Zheng · Cihang Xie · Yuyin Zhou,https://github.com/UCSC-VLAA/MixCon3D,https://arxiv.org/abs/2311.01734,,2311.01734.pdf,Sculpting Holistic 3D Representation in Contrastive Language-Image-3D Pre-training,"Contrastive learning has emerged as a promising paradigm for 3D open-world +understanding, i.e., aligning point cloud representation to image and text +embedding space individually. In this paper, we introduce MixCon3D, a simple +yet effective method aiming to sculpt holistic 3D representation in contrastive +language-image-3D pre-training. In contrast to point cloud only, we develop the +3D object-level representation from complementary perspectives, e.g., +multi-view rendered images with the point cloud. Then, MixCon3D performs +language-3D contrastive learning, comprehensively depicting real-world 3D +objects and bolstering text alignment. 
Additionally, we pioneer the first +thorough investigation of various training recipes for the 3D contrastive +learning paradigm, building a solid baseline with improved performance. +Extensive experiments conducted on three representative benchmarks reveal that +our method significantly improves over the baseline, surpassing the previous +state-of-the-art performance on the challenging 1,156-category Objaverse-LVIS +dataset by 5.7%. The versatility of MixCon3D is showcased in applications such +as text-to-3D retrieval and point cloud captioning, further evidencing its +efficacy in diverse scenarios. The code is available at +https://github.com/UCSC-VLAA/MixCon3D.",cs.CV,['cs.CV'] +Cross-spectral Gated-RGB Stereo Depth Estimation,Samuel Brucker · Stefanie Walz · Mario Bijelic · Felix Heide,https://light.princeton.edu/publication/gatedrccbstereo/,https://arxiv.org/abs/2405.12759,,2405.12759.pdf,Cross-spectral Gated-RGB Stereo Depth Estimation,"Gated cameras flood-illuminate a scene and capture the time-gated impulse +response of a scene. By employing nanosecond-scale gates, existing sensors are +capable of capturing mega-pixel gated images, delivering dense depth improving +on today's LiDAR sensors in spatial resolution and depth precision. Although +gated depth estimation methods deliver a million of depth estimates per frame, +their resolution is still an order below existing RGB imaging methods. In this +work, we combine high-resolution stereo HDR RCCB cameras with gated imaging, +allowing us to exploit depth cues from active gating, multi-view RGB and +multi-view NIR sensing -- multi-view and gated cues across the entire spectrum. +The resulting capture system consists only of low-cost CMOS sensors and +flood-illumination. We propose a novel stereo-depth estimation method that is +capable of exploiting these multi-modal multi-view depth cues, including the +active illumination that is measured by the RCCB camera when removing the +IR-cut filter. The proposed method achieves accurate depth at long ranges, +outperforming the next best existing method by 39% for ranges of 100 to 220m in +MAE on accumulated LiDAR ground-truth. Our code, models and datasets are +available at https://light.princeton.edu/gatedrccbstereo/ .",cs.CV,['cs.CV'] +$\mathcal{Z}^*$: Zero-shot $\underline{S}$tyle $\underline{T}$ransfer via $\underline{A}$ttention $\underline{R}$eweighting,Yingying Deng · Xiangyu He · Fan Tang · Weiming Dong, ,https://arxiv.org/abs/2311.16491,,2311.16491.pdf,$Z^*$: Zero-shot Style Transfer via Attention Rearrangement,"Despite the remarkable progress in image style transfer, formulating style in +the context of art is inherently subjective and challenging. In contrast to +existing learning/tuning methods, this study shows that vanilla diffusion +models can directly extract style information and seamlessly integrate the +generative prior into the content image without retraining. Specifically, we +adopt dual denoising paths to represent content/style references in latent +space and then guide the content image denoising process with style latent +codes. We further reveal that the cross-attention mechanism in latent diffusion +models tends to blend the content and style images, resulting in stylized +outputs that deviate from the original content image. To overcome this +limitation, we introduce a cross-attention rearrangement strategy. 
Through +theoretical analysis and experiments, we demonstrate the effectiveness and +superiority of the diffusion-based $\underline{Z}$ero-shot $\underline{S}$tyle +$\underline{T}$ransfer via $\underline{A}$ttention $\underline{R}$earrangement, +Z-STAR.",cs.CV,['cs.CV'] +CommonCanvas: Open Diffusion Models Trained on Creative-Commons Images,Aaron Gokaslan · A. Feder Cooper · Jasmine Collins · Landan Seguin · Austin Jacobson · Mihir Patel · Jonathan Frankle · Cory Stephenson · Volodymyr Kuleshov, ,https://arxiv.org/abs/2310.16825,,2310.16825.pdf,CommonCanvas: An Open Diffusion Model Trained with Creative-Commons Images,"We assemble a dataset of Creative-Commons-licensed (CC) images, which we use +to train a set of open diffusion models that are qualitatively competitive with +Stable Diffusion 2 (SD2). This task presents two challenges: (1) +high-resolution CC images lack the captions necessary to train text-to-image +generative models; (2) CC images are relatively scarce. In turn, to address +these challenges, we use an intuitive transfer learning technique to produce a +set of high-quality synthetic captions paired with curated CC images. We then +develop a data- and compute-efficient training recipe that requires as little +as 3% of the LAION-2B data needed to train existing SD2 models, but obtains +comparable quality. These results indicate that we have a sufficient number of +CC images (~70 million) for training high-quality models. Our training recipe +also implements a variety of optimizations that achieve ~3X training speed-ups, +enabling rapid model iteration. We leverage this recipe to train several +high-quality text-to-image models, which we dub the CommonCanvas family. Our +largest model achieves comparable performance to SD2 on a human evaluation, +despite being trained on our CC dataset that is significantly smaller than +LAION and using synthetic captions for training. We release our models, data, +and code at +https://github.com/mosaicml/diffusion/blob/main/assets/common-canvas.md",cs.CV,"['cs.CV', 'cs.CY']" +"HOIST-Former: Hand-held Objects Identification, Segmentation, and Tracking in the Wild",Supreeth Narasimhaswamy · Huy Anh Nguyen · Lihan Huang · Minh Hoai, ,https://arxiv.org/abs/2404.13819,,2404.13819.pdf,"HOIST-Former: Hand-held Objects Identification, Segmentation, and Tracking in the Wild","We address the challenging task of identifying, segmenting, and tracking +hand-held objects, which is crucial for applications such as human action +segmentation and performance evaluation. This task is particularly challenging +due to heavy occlusion, rapid motion, and the transitory nature of objects +being hand-held, where an object may be held, released, and subsequently picked +up again. To tackle these challenges, we have developed a novel +transformer-based architecture called HOIST-Former. HOIST-Former is adept at +spatially and temporally segmenting hands and objects by iteratively pooling +features from each other, ensuring that the processes of identification, +segmentation, and tracking of hand-held objects depend on the hands' positions +and their contextual appearance. We further refine HOIST-Former with a contact +loss that focuses on areas where hands are in contact with objects. Moreover, +we also contribute an in-the-wild video dataset called HOIST, which comprises +4,125 videos complete with bounding boxes, segmentation masks, and tracking IDs +for hand-held objects. 
Through experiments on the HOIST dataset and two +additional public datasets, we demonstrate the efficacy of HOIST-Former in +segmenting and tracking hand-held objects.",cs.CV,['cs.CV'] +SocialCounterfactuals: Probing and Mitigating Intersectional Social Biases in Vision-Language Models with Counterfactual Examples,Phillip Howard · Avinash Madasu · Tiep Le · Gustavo Lujan-Moreno · Anahita Bhiwandiwalla · Vasudev Lal, ,https://arxiv.org/abs/2312.00825,,2312.00825.pdf,SocialCounterfactuals: Probing and Mitigating Intersectional Social Biases in Vision-Language Models with Counterfactual Examples,"While vision-language models (VLMs) have achieved remarkable performance +improvements recently, there is growing evidence that these models also posses +harmful biases with respect to social attributes such as gender and race. Prior +studies have primarily focused on probing such bias attributes individually +while ignoring biases associated with intersections between social attributes. +This could be due to the difficulty of collecting an exhaustive set of +image-text pairs for various combinations of social attributes. To address this +challenge, we employ text-to-image diffusion models to produce counterfactual +examples for probing intersectional social biases at scale. Our approach +utilizes Stable Diffusion with cross attention control to produce sets of +counterfactual image-text pairs that are highly similar in their depiction of a +subject (e.g., a given occupation) while differing only in their depiction of +intersectional social attributes (e.g., race & gender). Through our +over-generate-then-filter methodology, we produce SocialCounterfactuals, a +high-quality dataset containing 171k image-text pairs for probing +intersectional biases related to gender, race, and physical characteristics. We +conduct extensive experiments to demonstrate the usefulness of our generated +dataset for probing and mitigating intersectional social biases in +state-of-the-art VLMs.",cs.CV,"['cs.CV', 'cs.AI']" +Accurate Training Data for Occupancy Map Prediction in Automated Driving using Evidence Theory,Jonas Kälble · Sascha Wirges · Maxim Tatarchenko · Eddy Ilg, ,https://arxiv.org/abs/2405.10575,,2405.10575.pdf,Accurate Training Data for Occupancy Map Prediction in Automated Driving Using Evidence Theory,"Automated driving fundamentally requires knowledge about the surrounding +geometry of the scene. Modern approaches use only captured images to predict +occupancy maps that represent the geometry. Training these approaches requires +accurate data that may be acquired with the help of LiDAR scanners. We show +that the techniques used for current benchmarks and training datasets to +convert LiDAR scans into occupancy grid maps yield very low quality, and +subsequently present a novel approach using evidence theory that yields more +accurate reconstructions. We demonstrate that these are superior by a large +margin, both qualitatively and quantitatively, and that we additionally obtain +meaningful uncertainty estimates. When converting the occupancy maps back to +depth estimates and comparing them with the raw LiDAR measurements, our method +yields a MAE improvement of 30% to 52% on nuScenes and 53% on Waymo over other +occupancy ground-truth data. 
Finally, we use the improved occupancy maps to +train a state-of-the-art occupancy prediction method and demonstrate that it +improves the MAE by 25% on nuScenes.",cs.CV,['cs.CV'] +ACT-Diffusion: Efficient Adversarial Consistency Training for One-step Diffusion Models,Fei Kong · Jinhao Duan · Lichao Sun · Hao Cheng · Renjing Xu · Heng Tao Shen · Xiaofeng Zhu · Xiaoshuang Shi · Kaidi Xu, ,https://arxiv.org/abs/2311.14097,,2311.14097.pdf,ACT-Diffusion: Efficient Adversarial Consistency Training for One-step Diffusion Models,"Though diffusion models excel in image generation, their step-by-step +denoising leads to slow generation speeds. Consistency training addresses this +issue with single-step sampling but often produces lower-quality generations +and requires high training costs. In this paper, we show that optimizing +consistency training loss minimizes the Wasserstein distance between target and +generated distributions. As timestep increases, the upper bound accumulates +previous consistency training losses. Therefore, larger batch sizes are needed +to reduce both current and accumulated losses. We propose Adversarial +Consistency Training (ACT), which directly minimizes the Jensen-Shannon (JS) +divergence between distributions at each timestep using a discriminator. +Theoretically, ACT enhances generation quality, and convergence. By +incorporating a discriminator into the consistency training framework, our +method achieves improved FID scores on CIFAR10 and ImageNet 64$\times$64 and +LSUN Cat 256$\times$256 datasets, retains zero-shot image inpainting +capabilities, and uses less than $1/6$ of the original batch size and fewer +than $1/2$ of the model parameters and training steps compared to the baseline +method, this leads to a substantial reduction in resource consumption. Our code +is available:https://github.com/kong13661/ACT",cs.CV,['cs.CV'] +StegoGAN: Leveraging Steganography for Non-Bijective Image-to-Image Translation,Sidi Wu · Yizi Chen · Loic Landrieu · Nicolas Gonthier · Samuel Mermet · Lorenz Hurni · Konrad Schindler, ,https://arxiv.org/abs/2403.20142,,2403.20142.pdf,StegoGAN: Leveraging Steganography for Non-Bijective Image-to-Image Translation,"Most image-to-image translation models postulate that a unique correspondence +exists between the semantic classes of the source and target domains. However, +this assumption does not always hold in real-world scenarios due to divergent +distributions, different class sets, and asymmetrical information +representation. As conventional GANs attempt to generate images that match the +distribution of the target domain, they may hallucinate spurious instances of +classes absent from the source domain, thereby diminishing the usefulness and +reliability of translated images. CycleGAN-based methods are also known to hide +the mismatched information in the generated images to bypass cycle consistency +objectives, a process known as steganography. In response to the challenge of +non-bijective image translation, we introduce StegoGAN, a novel model that +leverages steganography to prevent spurious features in generated images. Our +approach enhances the semantic consistency of the translated images without +requiring additional postprocessing or supervision. Our experimental +evaluations demonstrate that StegoGAN outperforms existing GAN-based models +across various non-bijective image-to-image translation tasks, both +qualitatively and quantitatively. 
Our code and pretrained models are accessible +at https://github.com/sian-wusidi/StegoGAN.",cs.CV,"['cs.CV', 'eess.IV']" +LASA: Instance Reconstruction from Real Scans using A Large-scale Aligned Shape Annotation Dataset,Haolin Liu · Chongjie Ye · Yinyu Nie · Yingfan He · Xiaoguang Han, ,https://arxiv.org/html/2312.12418v1,,2312.12418v1.pdf,LASA: Instance Reconstruction from Real Scans using A Large-scale Aligned Shape Annotation Dataset,"Instance shape reconstruction from a 3D scene involves recovering the full +geometries of multiple objects at the semantic instance level. Many methods +leverage data-driven learning due to the intricacies of scene complexity and +significant indoor occlusions. Training these methods often requires a +large-scale, high-quality dataset with aligned and paired shape annotations +with real-world scans. Existing datasets are either synthetic or misaligned, +restricting the performance of data-driven methods on real data. To this end, +we introduce LASA, a Large-scale Aligned Shape Annotation Dataset comprising +10,412 high-quality CAD annotations aligned with 920 real-world scene scans +from ArkitScenes, created manually by professional artists. On this top, we +propose a novel Diffusion-based Cross-Modal Shape Reconstruction (DisCo) +method. It is empowered by a hybrid feature aggregation design to fuse +multi-modal inputs and recover high-fidelity object geometries. Besides, we +present an Occupancy-Guided 3D Object Detection (OccGOD) method and demonstrate +that our shape annotations provide scene occupancy clues that can further +improve 3D object detection. Supported by LASA, extensive experiments show that +our methods achieve state-of-the-art performance in both instance-level scene +reconstruction and 3D object detection tasks.",cs.CV,['cs.CV'] +Unsupervised Keypoints from Pretrained Diffusion Models,Eric Hedlin · Gopal Sharma · Shweta Mahajan · Xingzhe He · Hossam Isack · Abhishek Kar · Helge Rhodin · Andrea Tagliasacchi · Kwang Moo Yi, ,https://arxiv.org/abs/2312.00065,,2312.00065.pdf,Unsupervised Keypoints from Pretrained Diffusion Models,"Unsupervised learning of keypoints and landmarks has seen significant +progress with the help of modern neural network architectures, but performance +is yet to match the supervised counterpart, making their practicability +questionable. We leverage the emergent knowledge within text-to-image diffusion +models, towards more robust unsupervised keypoints. Our core idea is to find +text embeddings that would cause the generative model to consistently attend to +compact regions in images (i.e. keypoints). To do so, we simply optimize the +text embedding such that the cross-attention maps within the denoising network +are localized as Gaussians with small standard deviations. We validate our +performance on multiple datasets: the CelebA, CUB-200-2011, Tai-Chi-HD, +DeepFashion, and Human3.6m datasets. We achieve significantly improved +accuracy, sometimes even outperforming supervised ones, particularly for data +that is non-aligned and less curated. 
Our code is publicly available and can be +found through our project page: https://ubc-vision.github.io/StableKeypoints/",cs.CV,['cs.CV'] +READ: Retrieval-Enhanced Asymmetric Diffusion for Motion Planning,Takeru Oba · Matthew Walter · Norimichi Ukita,https://obat2343.github.io/READ.github.io/,http://export.arxiv.org/abs/2308.01557,,2308.01557.pdf,Motion Planning Diffusion: Learning and Planning of Robot Motions with Diffusion Models,"Learning priors on trajectory distributions can help accelerate robot motion +planning optimization. Given previously successful plans, learning trajectory +generative models as priors for a new planning problem is highly desirable. +Prior works propose several ways on utilizing this prior to bootstrapping the +motion planning problem. Either sampling the prior for initializations or using +the prior distribution in a maximum-a-posterior formulation for trajectory +optimization. In this work, we propose learning diffusion models as priors. We +then can sample directly from the posterior trajectory distribution conditioned +on task goals, by leveraging the inverse denoising process of diffusion models. +Furthermore, diffusion has been recently shown to effectively encode data +multimodality in high-dimensional settings, which is particularly well-suited +for large trajectory dataset. To demonstrate our method efficacy, we compare +our proposed method - Motion Planning Diffusion - against several baselines in +simulated planar robot and 7-dof robot arm manipulator environments. To assess +the generalization capabilities of our method, we test it in environments with +previously unseen obstacles. Our experiments show that diffusion models are +strong priors to encode high-dimensional trajectory distributions of robot +motions.",cs.RO,"['cs.RO', 'cs.AI', 'cs.LG']" +On the Estimation of Image-matching Uncertainty in Visual Place Recognition,Mubariz Zaffar · Liangliang Nan · Julian F. P. Kooij, ,https://arxiv.org/abs/2404.00546,,2404.00546.pdf,On the Estimation of Image-matching Uncertainty in Visual Place Recognition,"In Visual Place Recognition (VPR) the pose of a query image is estimated by +comparing the image to a map of reference images with known reference poses. As +is typical for image retrieval problems, a feature extractor maps the query and +reference images to a feature space, where a nearest neighbor search is then +performed. However, till recently little attention has been given to +quantifying the confidence that a retrieved reference image is a correct match. +Highly certain but incorrect retrieval can lead to catastrophic failure of +VPR-based localization pipelines. This work compares for the first time the +main approaches for estimating the image-matching uncertainty, including the +traditional retrieval-based uncertainty estimation, more recent data-driven +aleatoric uncertainty estimation, and the compute-intensive geometric +verification. We further formulate a simple baseline method, ``SUE'', which +unlike the other methods considers the freely-available poses of the reference +images in the map. Our experiments reveal that a simple L2-distance between the +query and reference descriptors is already a better estimate of image-matching +uncertainty than current data-driven approaches. SUE outperforms the other +efficient uncertainty estimation methods, and its uncertainty estimates +complement the computationally expensive geometric verification approach. 
+Future works for uncertainty estimation in VPR should consider the baselines +discussed in this work.",cs.CV,['cs.CV'] +GROUNDHOG: Grounding Large Language Models to Holistic Segmentation,Yichi Zhang · Ziqiao Ma · Xiaofeng Gao · Suhaila Shakiah · Qiaozi Gao · Joyce Chai,https://groundhog-mllm.github.io/,https://arxiv.org/abs/2402.16846,,2402.16846.pdf,GROUNDHOG: Grounding Large Language Models to Holistic Segmentation,"Most multimodal large language models (MLLMs) learn language-to-object +grounding through causal language modeling where grounded objects are captured +by bounding boxes as sequences of location tokens. This paradigm lacks +pixel-level representations that are important for fine-grained visual +understanding and diagnosis. In this work, we introduce GROUNDHOG, an MLLM +developed by grounding Large Language Models to holistic segmentation. +GROUNDHOG incorporates a masked feature extractor and converts extracted +features into visual entity tokens for the MLLM backbone, which then connects +groundable phrases to unified grounding masks by retrieving and merging the +entity masks. To train GROUNDHOG, we carefully curated M3G2, a grounded visual +instruction tuning dataset with Multi-Modal Multi-Grained Grounding, by +harvesting a collection of segmentation-grounded datasets with rich +annotations. Our experimental results show that GROUNDHOG achieves superior +performance on various language grounding tasks without task-specific +fine-tuning, and significantly reduces object hallucination. GROUNDHOG also +demonstrates better grounding towards complex forms of visual input and +provides easy-to-understand diagnosis in failure cases.",cs.CV,"['cs.CV', 'cs.AI', 'cs.CL']" +"Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model",Shraman Pramanick · Guangxing Han · Rui Hou · Sayan Nag · Ser-Nam Lim · Nicolas Ballas · Qifan Wang · Rama Chellappa · Amjad Almahairi, ,https://arxiv.org/abs/2312.12423,,2312.12423.pdf,"Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model","The ability of large language models (LLMs) to process visual inputs has +given rise to general-purpose vision systems, unifying various vision-language +(VL) tasks by instruction tuning. However, due to the enormous diversity in +input-output formats in the vision domain, existing general-purpose models fail +to successfully integrate segmentation and multi-image inputs with coarse-level +tasks into a single framework. In this work, we introduce VistaLLM, a powerful +visual system that addresses coarse- and fine-grained VL tasks over single and +multiple input images using a unified framework. VistaLLM utilizes an +instruction-guided image tokenizer that filters global embeddings using task +descriptions to extract compressed and refined features from numerous images. +Moreover, VistaLLM employs a gradient-aware adaptive sampling technique to +represent binary segmentation masks as sequences, significantly improving over +previously used uniform sampling. To bolster the desired capability of +VistaLLM, we curate CoinIt, a comprehensive coarse-to-fine instruction tuning +dataset with 6.8M samples. We also address the lack of multi-image grounding +datasets by introducing a novel task, AttCoSeg (Attribute-level +Co-Segmentation), which boosts the model's reasoning and grounding capability +over multiple input images. 
Extensive experiments on a wide range of V- and VL +tasks demonstrate the effectiveness of VistaLLM by achieving consistent +state-of-the-art performance over strong baselines across all downstream tasks. +Our project page can be found at https://shramanpramanick.github.io/VistaLLM/.",cs.CV,"['cs.CV', 'cs.AI']" +Spectrum AUC Difference (SAUCD): Human Aligned 3D Shape Evaluation,Tianyu Luan · Zhong Li · Lele Chen · Xuan Gong · Lichang Chen · Yi Xu · Junsong Yuan, ,https://arxiv.org/abs/2403.01619,,2403.01619.pdf,Spectrum AUC Difference (SAUCD): Human-aligned 3D Shape Evaluation,"Existing 3D mesh shape evaluation metrics mainly focus on the overall shape +but are usually less sensitive to local details. This makes them inconsistent +with human evaluation, as human perception cares about both overall and +detailed shape. In this paper, we propose an analytic metric named Spectrum +Area Under the Curve Difference (SAUCD) that demonstrates better consistency +with human evaluation. To compare the difference between two shapes, we first +transform the 3D mesh to the spectrum domain using the discrete +Laplace-Beltrami operator and Fourier transform. Then, we calculate the Area +Under the Curve (AUC) difference between the two spectrums, so that each +frequency band that captures either the overall or detailed shape is equitably +considered. Taking human sensitivity across frequency bands into account, we +further extend our metric by learning suitable weights for each frequency band +which better aligns with human perception. To measure the performance of SAUCD, +we build a 3D mesh evaluation dataset called Shape Grading, along with manual +annotations from more than 800 subjects. By measuring the correlation between +our metric and human evaluation, we demonstrate that SAUCD is well aligned with +human evaluation, and outperforms previous 3D mesh metrics.",cs.CV,"['cs.CV', 'cs.GR']" +AllSpark: Reborn Labeled Features from Unlabeled in Transformer for Semi-Supervised Semantic Segmentation,Haonan Wang · Qixiang ZHANG · Yi Li · Xiaomeng Li, ,https://arxiv.org/abs/2403.01818,,2403.01818.pdf,AllSpark: Reborn Labeled Features from Unlabeled in Transformer for Semi-Supervised Semantic Segmentation,"Semi-supervised semantic segmentation (SSSS) has been proposed to alleviate +the burden of time-consuming pixel-level manual labeling, which leverages +limited labeled data along with larger amounts of unlabeled data. Current +state-of-the-art methods train the labeled data with ground truths and +unlabeled data with pseudo labels. However, the two training flows are +separate, which allows labeled data to dominate the training process, resulting +in low-quality pseudo labels and, consequently, sub-optimal results. To +alleviate this issue, we present AllSpark, which reborns the labeled features +from unlabeled ones with the channel-wise cross-attention mechanism. We further +introduce a Semantic Memory along with a Channel Semantic Grouping strategy to +ensure that unlabeled features adequately represent labeled features. The +AllSpark shed new light on the architecture level designs of SSSS rather than +framework level, which avoids increasingly complicated training pipeline +designs. It can also be regarded as a flexible bottleneck module that can be +seamlessly integrated into a general transformer-based segmentation model. The +proposed AllSpark outperforms existing methods across all evaluation protocols +on Pascal, Cityscapes and COCO benchmarks without bells-and-whistles. 
Code and +model weights are available at: https://github.com/xmed-lab/AllSpark.",cs.CV,"['cs.CV', 'cs.AI']" +Real-Time Simulated Avatar from Head-Mounted Sensors,Zhengyi Luo · Jinkun Cao · Rawal Khirodkar · Alexander Winkler · Jing Huang · Kris Kitani · Weipeng Xu, ,https://arxiv.org/abs/2403.06862,,2403.06862.pdf,Real-Time Simulated Avatar from Head-Mounted Sensors,"We present SimXR, a method for controlling a simulated avatar from +information (headset pose and cameras) obtained from AR / VR headsets. Due to +the challenging viewpoint of head-mounted cameras, the human body is often +clipped out of view, making traditional image-based egocentric pose estimation +challenging. On the other hand, headset poses provide valuable information +about overall body motion, but lack fine-grained details about the hands and +feet. To synergize headset poses with cameras, we control a humanoid to track +headset movement while analyzing input images to decide body movement. When +body parts are seen, the movements of hands and feet will be guided by the +images; when unseen, the laws of physics guide the controller to generate +plausible motion. We design an end-to-end method that does not rely on any +intermediate representations and learns to directly map from images and headset +poses to humanoid control signals. To train our method, we also propose a +large-scale synthetic dataset created using camera configurations compatible +with a commercially available VR headset (Quest 2) and show promising results +on real-world captures. To demonstrate the applicability of our framework, we +also test it on an AR headset with a forward-facing camera.",cs.CV,"['cs.CV', 'cs.GR', 'cs.RO']" +Adversarial Backdoor Attack by Naturalistic Data Poisoning on Trajectory Prediction in Autonomous Driving,Mozhgan Pourkeshavarz · Mohammad Sabokrou · Amir Rasouli, ,https://arxiv.org/abs/2306.15755,,2306.15755.pdf,Adversarial Backdoor Attack by Naturalistic Data Poisoning on Trajectory Prediction in Autonomous Driving,"In autonomous driving, behavior prediction is fundamental for safe motion +planning, hence the security and robustness of prediction models against +adversarial attacks are of paramount importance. We propose a novel adversarial +backdoor attack against trajectory prediction models as a means of studying +their potential vulnerabilities. Our attack affects the victim at training time +via naturalistic, hence stealthy, poisoned samples crafted using a novel +two-step approach. First, the triggers are crafted by perturbing the trajectory +of attacking vehicle and then disguised by transforming the scene using a +bi-level optimization technique. The proposed attack does not depend on a +particular model architecture and operates in a black-box manner, thus can be +effective without any knowledge of the victim model. We conduct extensive +empirical studies using state-of-the-art prediction models on two benchmark +datasets using metrics customized for trajectory prediction. We show that the +proposed attack is highly effective, as it can significantly hinder the +performance of prediction models, unnoticeable by the victims, and efficient as +it forces the victim to generate malicious behavior even under constrained +conditions. 
Via ablative studies, we analyze the impact of different attack +design choices followed by an evaluation of existing defence mechanisms against +the proposed attack.",cs.CV,['cs.CV'] +MAPSeg: Unified Unsupervised Domain Adaptation for Heterogeneous Medical Image Segmentation Based on 3D Masked Autoencoding and Pseudo-Labeling,Xuzhe Zhang · Yuhao Wu · Elsa Angelini · Ang Li · Jia Guo · Jerod Rasmussen · Thomas O'Connor · Pathik Wadhwa · Andrea Jackowski · Hai Li · Jonathan Posner · Andrew Laine · YUN WANG · Yun Wang,https://github.com/XuzheZ/MAPSeg,,https://www.researchgate.net/publication/378738417_MAPSeg_Unified_Unsupervised_Domain_Adaptation_for_Heterogeneous_Medical_Image_Segmentation_Based_on_3D_Masked_Autoencoding_and_Pseudo-Labeling,,,,,nan +KD-DETR: Knowledge Distillation for Detection Transformer with Consistent Distillation Points Sampling,Yu Wang · Xin Li · Shengzhao Wen · gang zhang · Haixiao Yue · Haocheng Feng · Junyu Han · Errui Ding, ,https://arxiv.org/abs/2311.13657,,2311.13657.pdf,Efficient Transformer Knowledge Distillation: A Performance Review,"As pretrained transformer language models continue to achieve +state-of-the-art performance, the Natural Language Processing community has +pushed for advances in model compression and efficient attention mechanisms to +address high computational requirements and limited input sequence length. +Despite these separate efforts, no investigation has been done into the +intersection of these two fields. In this work, we provide an evaluation of +model compression via knowledge distillation on efficient attention +transformers. We provide cost-performance trade-offs for the compression of +state-of-the-art efficient attention architectures and the gains made in +performance in comparison to their full attention counterparts. Furthermore, we +introduce a new long-context Named Entity Recognition dataset, GONERD, to train +and test the performance of NER models on long sequences. We find that +distilled efficient attention transformers can preserve a significant amount of +original model performance, preserving up to 98.6% across short-context tasks +(GLUE, SQUAD, CoNLL-2003), up to 94.6% across long-context +Question-and-Answering tasks (HotpotQA, TriviaQA), and up to 98.8% on +long-context Named Entity Recognition (GONERD), while decreasing inference +times by up to 57.8%. We find that, for most models on most tasks, performing +knowledge distillation is an effective method to yield high-performing +efficient attention models with low costs.",cs.CL,"['cs.CL', 'cs.LG']" +Point-VOS: Pointing Up Video Object Segmentation,Sabarinath Mahadevan · Idil Esen Zulfikar · Paul Voigtlaender · Bastian Leibe, ,https://arxiv.org/abs/2402.05917v1,,2402.05917v1.pdf,Point-VOS: Pointing Up Video Object Segmentation,"Current state-of-the-art Video Object Segmentation (VOS) methods rely on +dense per-object mask annotations both during training and testing. This +requires time-consuming and costly video annotation mechanisms. We propose a +novel Point-VOS task with a spatio-temporally sparse point-wise annotation +scheme that substantially reduces the annotation effort. We apply our +annotation scheme to two large-scale video datasets with text descriptions and +annotate over 19M points across 133K objects in 32K videos. Based on our +annotations, we propose a new Point-VOS benchmark, and a corresponding +point-based training mechanism, which we use to establish strong baseline +results. 
We show that existing VOS methods can easily be adapted to leverage +our point annotations during training, and can achieve results close to the +fully-supervised performance when trained on pseudo-masks generated from these +points. In addition, we show that our data can be used to improve models that +connect vision and language, by evaluating it on the Video Narrative Grounding +(VNG) task. We will make our code and annotations available at +https://pointvos.github.io.",cs.CV,['cs.CV'] +DragDiffusion: Harnessing Diffusion Models for Interactive Point-based Image Editing,Yujun Shi · Chuhui Xue · Jun Hao Liew · Jiachun Pan · Hanshu Yan · Wenqing Zhang · Vincent Y. F. Tan · Song Bai, ,https://arxiv.org/abs/2306.14435,,2306.14435.pdf,DragDiffusion: Harnessing Diffusion Models for Interactive Point-based Image Editing,"Accurate and controllable image editing is a challenging task that has +attracted significant attention recently. Notably, DragGAN is an interactive +point-based image editing framework that achieves impressive editing results +with pixel-level precision. However, due to its reliance on generative +adversarial networks (GANs), its generality is limited by the capacity of +pretrained GAN models. In this work, we extend this editing framework to +diffusion models and propose a novel approach DragDiffusion. By harnessing +large-scale pretrained diffusion models, we greatly enhance the applicability +of interactive point-based editing on both real and diffusion-generated images. +Our approach involves optimizing the diffusion latents to achieve precise +spatial control. The supervision signal of this optimization process is from +the diffusion model's UNet features, which are known to contain rich semantic +and geometric information. Moreover, we introduce two additional techniques, +namely LoRA fine-tuning and latent-MasaCtrl, to further preserve the identity +of the original image. Lastly, we present a challenging benchmark dataset +called DragBench -- the first benchmark to evaluate the performance of +interactive point-based image editing methods. Experiments across a wide range +of challenging cases (e.g., images with multiple objects, diverse object +categories, various styles, etc.) demonstrate the versatility and generality of +DragDiffusion. Code: https://github.com/Yujun-Shi/DragDiffusion.",cs.CV,"['cs.CV', 'cs.LG']" +Revisiting Adversarial Training at Scale,Zeyu Wang · Xianhang li · Hongru Zhu · Cihang Xie, ,https://arxiv.org/abs/2401.04727,,2401.04727.pdf,Revisiting Adversarial Training at Scale,"The machine learning community has witnessed a drastic change in the training +pipeline, pivoted by those ''foundation models'' with unprecedented scales. +However, the field of adversarial training is lagging behind, predominantly +centered around small model sizes like ResNet-50, and tiny and low-resolution +datasets like CIFAR-10. To bridge this transformation gap, this paper provides +a modern re-examination with adversarial training, investigating its potential +benefits when applied at scale. Additionally, we introduce an efficient and +effective training strategy to enable adversarial training with giant models +and web-scale data at an affordable computing cost. We denote this newly +introduced framework as AdvXL. + Empirical results demonstrate that AdvXL establishes new state-of-the-art +robust accuracy records under AutoAttack on ImageNet-1K. 
For example, by +training on DataComp-1B dataset, our AdvXL empowers a vanilla ViT-g model to +substantially surpass the previous records of $l_{\infty}$-, $l_{2}$-, and +$l_{1}$-robust accuracy by margins of 11.4%, 14.2% and 12.9%, respectively. +This achievement posits AdvXL as a pioneering approach, charting a new +trajectory for the efficient training of robust visual representations at +significantly larger scales. Our code is available at +https://github.com/UCSC-VLAA/AdvXL.",cs.CV,['cs.CV'] +Seeing Motion at Nighttime with an Event Camera,Haoyue Liu · Shihan Peng · Lin Zhu · Yi Chang · Hanyu Zhou · Luxin Yan,https://github.com/Liu-haoyue/NER-Net,https://arxiv.org/abs/2404.11884,,2404.11884.pdf,Seeing Motion at Nighttime with an Event Camera,"We focus on a very challenging task: imaging at nighttime dynamic scenes. +Most previous methods rely on the low-light enhancement of a conventional RGB +camera. However, they would inevitably face a dilemma between the long exposure +time of nighttime and the motion blur of dynamic scenes. Event cameras react to +dynamic changes with higher temporal resolution (microsecond) and higher +dynamic range (120dB), offering an alternative solution. In this work, we +present a novel nighttime dynamic imaging method with an event camera. +Specifically, we discover that the event at nighttime exhibits temporal +trailing characteristics and spatial non-stationary distribution. Consequently, +we propose a nighttime event reconstruction network (NER-Net) which mainly +includes a learnable event timestamps calibration module (LETC) to align the +temporal trailing events and a non-uniform illumination aware module (NIAM) to +stabilize the spatiotemporal distribution of events. Moreover, we construct a +paired real low-light event dataset (RLED) through a co-axial imaging system, +including 64,200 spatially and temporally aligned image GTs and low-light +events. Extensive experiments demonstrate that the proposed method outperforms +state-of-the-art methods in terms of visual quality and generalization ability +on real-world nighttime datasets. The project are available at: +https://github.com/Liu-haoyue/NER-Net.",cs.CV,['cs.CV'] +Generative Unlearning for Any Identity,Juwon Seo · Sung-Hoon Lee · Tae-Young Lee · SeungJun Moon · Gyeong-Moon Park, ,https://arxiv.org/abs/2405.09879,,2405.09879.pdf,Generative Unlearning for Any Identity,"Recent advances in generative models trained on large-scale datasets have +made it possible to synthesize high-quality samples across various domains. +Moreover, the emergence of strong inversion networks enables not only a +reconstruction of real-world images but also the modification of attributes +through various editing methods. However, in certain domains related to privacy +issues, e.g., human faces, advanced generative models along with strong +inversion methods can lead to potential misuses. In this paper, we propose an +essential yet under-explored task called generative identity unlearning, which +steers the model not to generate an image of a specific identity. In the +generative identity unlearning, we target the following objectives: (i) +preventing the generation of images with a certain identity, and (ii) +preserving the overall quality of the generative model. To satisfy these goals, +we propose a novel framework, Generative Unlearning for Any Identity (GUIDE), +which prevents the reconstruction of a specific identity by unlearning the +generator with only a single image. 
GUIDE consists of two parts: (i) finding a +target point for optimization that un-identifies the source latent code and +(ii) novel loss functions that facilitate the unlearning procedure while less +affecting the learned distribution. Our extensive experiments demonstrate that +our proposed method achieves state-of-the-art performance in the generative +machine unlearning task. The code is available at +https://github.com/KHU-AGI/GUIDE.",cs.CV,"['cs.CV', 'cs.AI']" +OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM,Yutao Hu · Yutao Hu · Tianbin · Quanfeng Lu · Wenqi Shao · Junjun He · Yu Qiao · Ping Luo, ,https://arxiv.org/abs/2402.09181,,2402.09181.pdf,OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM,"Large Vision-Language Models (LVLMs) have demonstrated remarkable +capabilities in various multimodal tasks. However, their potential in the +medical domain remains largely unexplored. A significant challenge arises from +the scarcity of diverse medical images spanning various modalities and +anatomical regions, which is essential in real-world medical applications. To +solve this problem, in this paper, we introduce OmniMedVQA, a novel +comprehensive medical Visual Question Answering (VQA) benchmark. This benchmark +is collected from 73 different medical datasets, including 12 different +modalities and covering more than 20 distinct anatomical regions. Importantly, +all images in this benchmark are sourced from authentic medical scenarios, +ensuring alignment with the requirements of the medical field and suitability +for evaluating LVLMs. Through our extensive experiments, we have found that +existing LVLMs struggle to address these medical VQA problems effectively. +Moreover, what surprises us is that medical-specialized LVLMs even exhibit +inferior performance to those general-domain models, calling for a more +versatile and robust LVLM in the biomedical field. The evaluation results not +only reveal the current limitations of LVLM in understanding real medical +images but also highlight our dataset's significance. Our code with dataset are +available at https://github.com/OpenGVLab/Multi-Modality-Arena.",eess.IV,"['eess.IV', 'cs.CV']" +Sequential Modeling Enables Scalable Learning for Large Vision Models,Yutong Bai · Xinyang Geng · Xinyang Geng · Karttikeya Mangalam · Amir Bar · Alan L. Yuille · Trevor Darrell · Jitendra Malik · Alexei A. Efros, ,https://arxiv.org/abs/2312.00785,,2312.00785.pdf,Sequential Modeling Enables Scalable Learning for Large Vision Models,"We introduce a novel sequential modeling approach which enables learning a +Large Vision Model (LVM) without making use of any linguistic data. To do this, +we define a common format, ""visual sentences"", in which we can represent raw +images and videos as well as annotated data sources such as semantic +segmentations and depth reconstructions without needing any meta-knowledge +beyond the pixels. Once this wide variety of visual data (comprising 420 +billion tokens) is represented as sequences, the model can be trained to +minimize a cross-entropy loss for next token prediction. By training across +various scales of model architecture and data diversity, we provide empirical +evidence that our models scale effectively. 
Many different vision tasks can be +solved by designing suitable visual prompts at test time.",cs.CV,['cs.CV'] +An edit friendly ddpm noise space: inversion and manipulations,Inbar Huberman-Spiegelglas · Vladimir Kulikov · Tomer Michaeli, ,https://ar5iv.labs.arxiv.org/html/2307.00522,,2307.00522.pdf,LEDITS: Real Image Editing with DDPM Inversion and Semantic Guidance,"Recent large-scale text-guided diffusion models provide powerful +image-generation capabilities. Currently, a significant effort is given to +enable the modification of these images using text only as means to offer +intuitive and versatile editing. However, editing proves to be difficult for +these generative models due to the inherent nature of editing techniques, which +involves preserving certain content from the original image. Conversely, in +text-based models, even minor modifications to the text prompt frequently +result in an entirely distinct result, making attaining one-shot generation +that accurately corresponds to the users intent exceedingly challenging. In +addition, to edit a real image using these state-of-the-art tools, one must +first invert the image into the pre-trained models domain - adding another +factor affecting the edit quality, as well as latency. In this exploratory +report, we propose LEDITS - a combined lightweight approach for real-image +editing, incorporating the Edit Friendly DDPM inversion technique with Semantic +Guidance, thus extending Semantic Guidance to real image editing, while +harnessing the editing capabilities of DDPM inversion as well. This approach +achieves versatile edits, both subtle and extensive as well as alterations in +composition and style, while requiring no optimization nor extensions to the +architecture.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +SceneFun3D: Fine-Grained Functionality and Affordance Understanding in 3D Scenes,Alexandros Delitzas · Ayça Takmaz · Federico Tombari · Robert Sumner · Marc Pollefeys · Francis Engelmann,https://scenefun3d.github.io,https://arxiv.org/html/2404.03650v1,,2404.03650v1.pdf,OpenNeRF: Open Set 3D Neural Scene Segmentation with Pixel-Wise Features and Rendered Novel Views,"Large visual-language models (VLMs), like CLIP, enable open-set image +segmentation to segment arbitrary concepts from an image in a zero-shot manner. +This goes beyond the traditional closed-set assumption, i.e., where models can +only segment classes from a pre-defined training set. More recently, first +works on open-set segmentation in 3D scenes have appeared in the literature. +These methods are heavily influenced by closed-set 3D convolutional approaches +that process point clouds or polygon meshes. However, these 3D scene +representations do not align well with the image-based nature of the +visual-language models. Indeed, point cloud and 3D meshes typically have a +lower resolution than images and the reconstructed 3D scene geometry might not +project well to the underlying 2D image sequences used to compute pixel-aligned +CLIP features. To address these challenges, we propose OpenNeRF which naturally +operates on posed images and directly encodes the VLM features within the NeRF. +This is similar in spirit to LERF, however our work shows that using pixel-wise +VLM features (instead of global CLIP features) results in an overall less +complex architecture without the need for additional DINO regularization. 
Our +OpenNeRF further leverages NeRF's ability to render novel views and extract +open-set VLM features from areas that are not well observed in the initial +posed images. For 3D point cloud segmentation on the Replica dataset, OpenNeRF +outperforms recent open-vocabulary methods such as LERF and OpenScene by at +least +4.9 mIoU.",cs.CV,['cs.CV'] +Feature 3DGS: Supercharging 3D Gaussian Splatting to Enable Distilled Feature Fields,Shijie Zhou · Haoran Chang · Sicheng Jiang · Zhiwen Fan · Zehao Zhu · Dejia Xu · Dejia Xu · Pradyumna Chari · Suya You · Zhangyang Wang · Achuta Kadambi, ,https://arxiv.org/abs/2312.03203,,2312.03203.pdf,Feature 3DGS: Supercharging 3D Gaussian Splatting to Enable Distilled Feature Fields,"3D scene representations have gained immense popularity in recent years. +Methods that use Neural Radiance fields are versatile for traditional tasks +such as novel view synthesis. In recent times, some work has emerged that aims +to extend the functionality of NeRF beyond view synthesis, for semantically +aware tasks such as editing and segmentation using 3D feature field +distillation from 2D foundation models. However, these methods have two major +limitations: (a) they are limited by the rendering speed of NeRF pipelines, and +(b) implicitly represented feature fields suffer from continuity artifacts +reducing feature quality. Recently, 3D Gaussian Splatting has shown +state-of-the-art performance on real-time radiance field rendering. In this +work, we go one step further: in addition to radiance field rendering, we +enable 3D Gaussian splatting on arbitrary-dimension semantic features via 2D +foundation model distillation. This translation is not straightforward: naively +incorporating feature fields in the 3DGS framework encounters significant +challenges, notably the disparities in spatial resolution and channel +consistency between RGB images and feature maps. We propose architectural and +training changes to efficiently avert this problem. Our proposed method is +general, and our experiments showcase novel view semantic segmentation, +language-guided editing and segment anything through learning feature fields +from state-of-the-art 2D foundation models such as SAM and CLIP-LSeg. Across +experiments, our distillation method is able to provide comparable or better +results, while being significantly faster to both train and render. +Additionally, to the best of our knowledge, we are the first method to enable +point and bounding-box prompting for radiance field manipulation, by leveraging +the SAM model. Project website at: https://feature-3dgs.github.io/",cs.CV,['cs.CV'] +Taming Mode Collapse in Score Distillation for Text-to-3D Generation,Peihao Wang · Dejia Xu · Dejia Xu · Zhiwen Fan · Dilin Wang · Sreyas Mohan · Forrest Iandola · Rakesh Ranjan · Yilei Li · Qiang Liu · Zhangyang Wang · Vikas Chandra, ,https://arxiv.org/abs/2401.00909,,2401.00909.pdf,Taming Mode Collapse in Score Distillation for Text-to-3D Generation,"Despite the remarkable performance of score distillation in text-to-3D +generation, such techniques notoriously suffer from view inconsistency issues, +also known as ""Janus"" artifact, where the generated objects fake each view with +multiple front faces. Although empirically effective methods have approached +this problem via score debiasing or prompt engineering, a more rigorous +perspective to explain and tackle this problem remains elusive. 
In this paper, +we reveal that the existing score distillation-based text-to-3D generation +frameworks degenerate to maximal likelihood seeking on each view independently +and thus suffer from the mode collapse problem, manifesting as the Janus +artifact in practice. To tame mode collapse, we improve score distillation by +re-establishing the entropy term in the corresponding variational objective, +which is applied to the distribution of rendered images. Maximizing the entropy +encourages diversity among different views in generated 3D assets, thereby +mitigating the Janus problem. Based on this new objective, we derive a new +update rule for 3D score distillation, dubbed Entropic Score Distillation +(ESD). We theoretically reveal that ESD can be simplified and implemented by +just adopting the classifier-free guidance trick upon variational score +distillation. Although embarrassingly straightforward, our extensive +experiments successfully demonstrate that ESD can be an effective treatment for +Janus artifacts in score distillation.",cs.CV,"['cs.CV', 'cs.LG']" +LowRankOcc: Tensor Decomposition and Low-Rank Recovery for Vision-based 3D Semantic Occupancy Prediction,Linqing Zhao · Xiuwei Xu · Ziwei Wang · Yunpeng Zhang · Borui Zhang · Wenzhao Zheng · Dalong Du · Jie Zhou · Jiwen Lu, ,https://arxiv.org/abs/2405.17429,,2405.17429.pdf,GaussianFormer: Scene as Gaussians for Vision-Based 3D Semantic Occupancy Prediction,"3D semantic occupancy prediction aims to obtain 3D fine-grained geometry and +semantics of the surrounding scene and is an important task for the robustness +of vision-centric autonomous driving. Most existing methods employ dense grids +such as voxels as scene representations, which ignore the sparsity of occupancy +and the diversity of object scales and thus lead to unbalanced allocation of +resources. To address this, we propose an object-centric representation to +describe 3D scenes with sparse 3D semantic Gaussians where each Gaussian +represents a flexible region of interest and its semantic features. We +aggregate information from images through the attention mechanism and +iteratively refine the properties of 3D Gaussians including position, +covariance, and semantics. We then propose an efficient Gaussian-to-voxel +splatting method to generate 3D occupancy predictions, which only aggregates +the neighboring Gaussians for a certain position. We conduct extensive +experiments on the widely adopted nuScenes and KITTI-360 datasets. Experimental +results demonstrate that GaussianFormer achieves comparable performance with +state-of-the-art methods with only 17.8% - 24.8% of their memory consumption. +Code is available at: https://github.com/huang-yh/GaussianFormer.",cs.CV,"['cs.CV', 'cs.AI']" +mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration,Qinghao Ye · Haiyang Xu · Jiabo Ye · Ming Yan · Anwen Hu · Haowei Liu · Qi Qian · Ji Zhang · Fei Huang · Fei Huang, ,https://arxiv.org/abs/2311.04257,,2311.04257.pdf,mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration,"Multi-modal Large Language Models (MLLMs) have demonstrated impressive +instruction abilities across various open-ended tasks. However, previous +methods primarily focus on enhancing multi-modal capabilities. In this work, we +introduce a versatile multi-modal large language model, mPLUG-Owl2, which +effectively leverages modality collaboration to improve performance in both +text and multi-modal tasks. 
mPLUG-Owl2 utilizes a modularized network design, +with the language decoder acting as a universal interface for managing +different modalities. Specifically, mPLUG-Owl2 incorporates shared functional +modules to facilitate modality collaboration and introduces a modality-adaptive +module that preserves modality-specific features. Extensive experiments reveal +that mPLUG-Owl2 is capable of generalizing both text tasks and multi-modal +tasks and achieving state-of-the-art performances with a single generic model. +Notably, mPLUG-Owl2 is the first MLLM model that demonstrates the modality +collaboration phenomenon in both pure-text and multi-modal scenarios, setting a +pioneering path in the development of future multi-modal foundation models.",cs.CL,"['cs.CL', 'cs.CV']" +NoiseCLR: A Contrastive Learning Approach for Unsupervised Discovery of Interpretable Directions in Diffusion Models,Yusuf Dalva · Pinar Yanardag, ,https://arxiv.org/abs/2312.05390,,2312.05390.pdf,NoiseCLR: A Contrastive Learning Approach for Unsupervised Discovery of Interpretable Directions in Diffusion Models,"Generative models have been very popular in the recent years for their image +generation capabilities. GAN-based models are highly regarded for their +disentangled latent space, which is a key feature contributing to their success +in controlled image editing. On the other hand, diffusion models have emerged +as powerful tools for generating high-quality images. However, the latent space +of diffusion models is not as thoroughly explored or understood. Existing +methods that aim to explore the latent space of diffusion models usually relies +on text prompts to pinpoint specific semantics. However, this approach may be +restrictive in areas such as art, fashion, or specialized fields like medicine, +where suitable text prompts might not be available or easy to conceive thus +limiting the scope of existing work. In this paper, we propose an unsupervised +method to discover latent semantics in text-to-image diffusion models without +relying on text prompts. Our method takes a small set of unlabeled images from +specific domains, such as faces or cats, and a pre-trained diffusion model, and +discovers diverse semantics in unsupervised fashion using a contrastive +learning objective. Moreover, the learned directions can be applied +simultaneously, either within the same domain (such as various types of facial +edits) or across different domains (such as applying cat and face edits within +the same image) without interfering with each other. Our extensive experiments +show that our method achieves highly disentangled edits, outperforming existing +approaches in both diffusion-based and GAN-based latent space editing methods.",cs.CV,['cs.CV'] +On the Road to Portability: Compressing End-to-End Motion Planner for Autonomous Driving,Kaituo Feng · Changsheng Li · Dongchun Ren · Ye Yuan · Guoren Wang, ,https://arxiv.org/abs/2403.01238,,2403.01238.pdf,On the Road to Portability: Compressing End-to-End Motion Planner for Autonomous Driving,"End-to-end motion planning models equipped with deep neural networks have +shown great potential for enabling full autonomous driving. However, the +oversized neural networks render them impractical for deployment on +resource-constrained systems, which unavoidably requires more computational +time and resources during reference.To handle this, knowledge distillation +offers a promising approach that compresses models by enabling a smaller +student model to learn from a larger teacher model. 
Nevertheless, how to apply +knowledge distillation to compress motion planners has not been explored so +far. In this paper, we propose PlanKD, the first knowledge distillation +framework tailored for compressing end-to-end motion planners. First, +considering that driving scenes are inherently complex, often containing +planning-irrelevant or even noisy information, transferring such information is +not beneficial for the student planner. Thus, we design an information +bottleneck based strategy to only distill planning-relevant information, rather +than transfer all information indiscriminately. Second, different waypoints in +an output planned trajectory may hold varying degrees of importance for motion +planning, where a slight deviation in certain crucial waypoints might lead to a +collision. Therefore, we devise a safety-aware waypoint-attentive distillation +module that assigns adaptive weights to different waypoints based on the +importance, to encourage the student to accurately mimic more crucial +waypoints, thereby improving overall safety. Experiments demonstrate that our +PlanKD can boost the performance of smaller planners by a large margin, and +significantly reduce their reference time.",cs.CV,['cs.CV'] +Training Diffusion Models Towards Diverse Image Generation with Reinforcement Learning,Zichen Miao · Jiang Wang · Ze Wang · Zhengyuan Yang · Lijuan Wang · Qiang Qiu · Zicheng Liu, ,,https://bair.berkeley.edu/blog/2023/07/14/ddpo/,,,,,nan +HDQMF: Holographic Feature Decomposition Using Quantum Algorithms,Prathyush Poduval · Zhuowen Zou · Mohsen Imani, ,https://arxiv.org/abs/2403.17444,,,Quantum accelerated cross regression algorithm for multiview feature extraction,"Multi-view Feature Extraction (MvFE) has wide applications in machine +learning, image processing and other fields. When dealing with massive +high-dimensional data, the performance of classical computer faces severe +challenges due to MvFE involves expensive matrix calculation. To address this +challenge, a quantum-accelerated cross-regression algorithm for MvFE is +proposed. The main contributions are as follows:(1) a quantum version algorithm +for MvFE is proposed for the first time, filling the gap of quantum computing +in the field of MvFE;(2) a quantum algorithm is designed to construct the +block-encoding of the target data matrix, so that the optimal Hamiltonian +simulation technology based on the block-encoding framework can be used to +efficiently realize the quantum simulation of the target data matrix. This +approach reduces the dependence of the algorithm's on simulation errors to +enhance algorithm performance;(3) compared with the classical counterpart +algorithm, the proposed quantum algorithm has a polynomial acceleration in the +number of data points, the dimension of data points and the number of view +data.",quant-ph,['quant-ph'] +Leveraging Predicate and Triplet Learning for Scene Graph Generation,Jiankai Li · Yunhong Wang · Xiefan Guo · Ruijie Yang · Weixin Li, ,https://arxiv.org/abs/2309.03542,,2309.03542.pdf,Zero-Shot Scene Graph Generation via Triplet Calibration and Reduction,"Scene Graph Generation (SGG) plays a pivotal role in downstream +vision-language tasks. Existing SGG methods typically suffer from poor +compositional generalizations on unseen triplets. They are generally trained on +incompletely annotated scene graphs that contain dominant triplets and tend to +bias toward these seen triplets during inference. 
To address this issue, we +propose a Triplet Calibration and Reduction (T-CAR) framework in this paper. In +our framework, a triplet calibration loss is first presented to regularize the +representations of diverse triplets and to simultaneously excavate the unseen +triplets in incompletely annotated training scene graphs. Moreover, the unseen +space of scene graphs is usually several times larger than the seen space since +it contains a huge number of unrealistic compositions. Thus, we propose an +unseen space reduction loss to shift the attention of excavation to reasonable +unseen compositions to facilitate the model training. Finally, we propose a +contextual encoder to improve the compositional generalizations of unseen +triplets by explicitly modeling the relative spatial relations between subjects +and objects. Extensive experiments show that our approach achieves consistent +improvements for zero-shot SGG over state-of-the-art methods. The code is +available at https://github.com/jkli1998/T-CAR.",cs.CV,"['cs.CV', 'cs.MM']" +Open-vocabulary object 6D pose estimation,Jaime Corsetti · Davide Boscaini · Changjae Oh · Andrea Cavallaro · Fabio Poiesi, ,https://arxiv.org/abs/2312.00690v2,,2312.00690v2.pdf,Open-vocabulary object 6D pose estimation,"We introduce the new setting of open-vocabulary object 6D pose estimation, in +which a textual prompt is used to specify the object of interest. In contrast +to existing approaches, in our setting (i) the object of interest is specified +solely through the textual prompt, (ii) no object model (e.g. CAD or video +sequence) is required at inference, (iii) the object is imaged from two +different viewpoints of two different scenes, and (iv) the object was not +observed during the training phase. To operate in this setting, we introduce a +novel approach that leverages a Vision-Language Model to segment the object of +interest from two distinct scenes and to estimate its relative 6D pose. The key +of our approach is a carefully devised strategy to fuse object-level +information provided by the prompt with local image features, resulting in a +feature space that can generalize to novel concepts. We validate our approach +on a new benchmark based on two popular datasets, REAL275 and Toyota-Light, +which collectively encompass 39 object instances appearing in four thousand +image pairs. The results demonstrate that our approach outperforms both a +well-established hand-crafted method and a recent deep learning-based baseline +in estimating the relative 6D pose of objects in different scenes. Project +page: https://jcorsetti.github.io/oryon/.",cs.CV,['cs.CV'] +Matching Anything by Segmenting Anything,Siyuan Li · Lei Ke · Martin Danelljan · Luigi Piccinelli · Mattia Segu · Luc Van Gool · Fisher Yu, ,https://arxiv.org/abs/2401.16741v1,,,MESA: Matching Everything by Segmenting Anything,"Feature matching is a crucial task in the field of computer vision, which +involves finding correspondences between images. Previous studies achieve +remarkable performance using learning-based feature comparison. However, the +pervasive presence of matching redundancy between images gives rise to +unnecessary and error-prone computations in these methods, imposing limitations +on their accuracy. To address this issue, we propose MESA, a novel approach to +establish precise area (or region) matches for efficient matching redundancy +reduction. 
MESA first leverages the advanced image understanding capability of +SAM, a state-of-the-art foundation model for image segmentation, to obtain +image areas with implicit semantic. Then, a multi-relational graph is proposed +to model the spatial structure of these areas and construct their scale +hierarchy. Based on graphical models derived from the graph, the area matching +is reformulated as an energy minimization task and effectively resolved. +Extensive experiments demonstrate that MESA yields substantial precision +improvement for multiple point matchers in indoor and outdoor downstream tasks, +e.g. +13.61% for DKM in indoor pose estimation.",cs.CV,['cs.CV'] +DiffInDScene: Diffusion-based High-Quality 3D Indoor Scene Generation,Xiaoliang Ju · Zhaoyang Huang · Yijin Li · Guofeng Zhang · Yu Qiao · Hongsheng Li, ,https://ar5iv.labs.arxiv.org/html/2311.17261,,,SceneTex: High-Quality Texture Synthesis for Indoor Scenes via Diffusion Priors,"We propose SceneTex, a novel method for effectively generating high-quality +and style-consistent textures for indoor scenes using depth-to-image diffusion +priors. Unlike previous methods that either iteratively warp 2D views onto a +mesh surface or distillate diffusion latent features without accurate geometric +and style cues, SceneTex formulates the texture synthesis task as an +optimization problem in the RGB space where style and geometry consistency are +properly reflected. At its core, SceneTex proposes a multiresolution texture +field to implicitly encode the mesh appearance. We optimize the target texture +via a score-distillation-based objective function in respective RGB renderings. +To further secure the style consistency across views, we introduce a +cross-attention decoder to predict the RGB values by cross-attending to the +pre-sampled reference locations in each instance. SceneTex enables various and +accurate texture synthesis for 3D-FRONT scenes, demonstrating significant +improvements in visual quality and prompt fidelity over the prior texture +generation methods.",cs.CV,['cs.CV'] +DiffAM: Diffusion-based Adversarial Makeup Transfer for Facial Privacy Protection,Yuhao Sun · Lingyun Yu · Hongtao Xie · Jiaming Li · Yongdong Zhang, ,http://export.arxiv.org/abs/2405.09882,,2405.09882.pdf,DiffAM: Diffusion-based Adversarial Makeup Transfer for Facial Privacy Protection,"With the rapid development of face recognition (FR) systems, the privacy of +face images on social media is facing severe challenges due to the abuse of +unauthorized FR systems. Some studies utilize adversarial attack techniques to +defend against malicious FR systems by generating adversarial examples. +However, the generated adversarial examples, i.e., the protected face images, +tend to suffer from subpar visual quality and low transferability. In this +paper, we propose a novel face protection approach, dubbed DiffAM, which +leverages the powerful generative ability of diffusion models to generate +high-quality protected face images with adversarial makeup transferred from +reference images. To be specific, we first introduce a makeup removal module to +generate non-makeup images utilizing a fine-tuned diffusion model with guidance +of textual prompts in CLIP space. As the inverse process of makeup transfer, +makeup removal can make it easier to establish the deterministic relationship +between makeup domain and non-makeup domain regardless of elaborate text +prompts. 
Then, with this relationship, a CLIP-based makeup loss along with an +ensemble attack strategy is introduced to jointly guide the direction of +adversarial makeup domain, achieving the generation of protected face images +with natural-looking makeup and high black-box transferability. Extensive +experiments demonstrate that DiffAM achieves higher visual quality and attack +success rates with a gain of 12.98% under black-box setting compared with the +state of the arts. The code will be available at +https://github.com/HansSunY/DiffAM.",cs.CV,"['cs.CV', 'cs.AI']" +MoMask: Generative Masked Modeling of 3D Human Motions,chuan guo · Yuxuan Mu · Muhammad Gohar Javed · Sen Wang · Li Cheng, ,https://arxiv.org/abs/2312.00063,,2312.00063.pdf,MoMask: Generative Masked Modeling of 3D Human Motions,"We introduce MoMask, a novel masked modeling framework for text-driven 3D +human motion generation. In MoMask, a hierarchical quantization scheme is +employed to represent human motion as multi-layer discrete motion tokens with +high-fidelity details. Starting at the base layer, with a sequence of motion +tokens obtained by vector quantization, the residual tokens of increasing +orders are derived and stored at the subsequent layers of the hierarchy. This +is consequently followed by two distinct bidirectional transformers. For the +base-layer motion tokens, a Masked Transformer is designated to predict +randomly masked motion tokens conditioned on text input at training stage. +During generation (i.e. inference) stage, starting from an empty sequence, our +Masked Transformer iteratively fills up the missing tokens; Subsequently, a +Residual Transformer learns to progressively predict the next-layer tokens +based on the results from current layer. Extensive experiments demonstrate that +MoMask outperforms the state-of-art methods on the text-to-motion generation +task, with an FID of 0.045 (vs e.g. 0.141 of T2M-GPT) on the HumanML3D dataset, +and 0.228 (vs 0.514) on KIT-ML, respectively. MoMask can also be seamlessly +applied in related tasks without further model fine-tuning, such as text-guided +temporal inpainting.",cs.CV,['cs.CV'] +Self-Training Large Language Models for Improved Visual Program Synthesis With Visual Reinforcement,Zaid Khan · Vijay Kumar BG · Samuel Schulter · Yun Fu · Manmohan Chandraker, ,https://arxiv.org/abs/2404.04627,,2404.04627.pdf,Self-Training Large Language Models for Improved Visual Program Synthesis With Visual Reinforcement,"Visual program synthesis is a promising approach to exploit the reasoning +abilities of large language models for compositional computer vision tasks. +Previous work has used few-shot prompting with frozen LLMs to synthesize visual +programs. Training an LLM to write better visual programs is an attractive +prospect, but it is unclear how to accomplish this. No dataset of visual +programs for training exists, and acquisition of a visual program dataset +cannot be easily crowdsourced due to the need for expert annotators. To get +around the lack of direct supervision, we explore improving the program +synthesis abilities of an LLM using feedback from interactive experience. We +propose a method where we exploit existing annotations for a vision-language +task to improvise a coarse reward signal for that task, treat the LLM as a +policy, and apply reinforced self-training to improve the visual program +synthesis ability of the LLM for that task. 
We describe a series of experiments +on object detection, compositional visual question answering, and image-text +retrieval, and show that in each case, the self-trained LLM outperforms or +performs on par with few-shot frozen LLMs that are an order of magnitude +larger. Website: https://zaidkhan.me/ViReP",cs.CV,['cs.CV'] +Scaling Laws of Synthetic Images for Model Training ... for Now,Lijie Fan · Kaifeng Chen · Dilip Krishnan · Dina Katabi · Phillip Isola · Yonglong Tian,https://github.com/google-research/syn-rep-learn/tree/main/Scaling,https://arxiv.org/abs/2312.04567,,2312.04567.pdf,Scaling Laws of Synthetic Images for Model Training ... for Now,"Recent significant advances in text-to-image models unlock the possibility of +training vision systems using synthetic images, potentially overcoming the +difficulty of collecting curated data at scale. It is unclear, however, how +these models behave at scale, as more synthetic data is added to the training +set. In this paper we study the scaling laws of synthetic images generated by +state of the art text-to-image models, for the training of supervised models: +image classifiers with label supervision, and CLIP with language supervision. +We identify several factors, including text prompts, classifier-free guidance +scale, and types of text-to-image models, that significantly affect scaling +behavior. After tuning these factors, we observe that synthetic images +demonstrate a scaling trend similar to, but slightly less effective than, real +images in CLIP training, while they significantly underperform in scaling when +training supervised image classifiers. Our analysis indicates that the main +reason for this underperformance is the inability of off-the-shelf +text-to-image models to generate certain concepts, a limitation that +significantly impairs the training of image classifiers. Our findings also +suggest that scaling synthetic data can be particularly effective in scenarios +such as: (1) when there is a limited supply of real images for a supervised +problem (e.g., fewer than 0.5 million images in ImageNet), (2) when the +evaluation dataset diverges significantly from the training data, indicating +the out-of-distribution scenario, or (3) when synthetic data is used in +conjunction with real images, as demonstrated in the training of CLIP models.",cs.CV,['cs.CV'] +Adaptive Hyper-graph Aggregation for Modality-Agnostic Federated Learning,Fan Qi · Shuai Li, ,,https://ieeexplore.ieee.org/document/10528890,,,,,nan +DyMVHumans: A Multi-View Video Benchmark for High-Fidelity Dynamic Human Modeling,Xiaoyun Zheng · Liwei Liao · Xufeng Li · Jianbo Jiao · Rongjie Wang · Feng Gao · Shiqi Wang · Ronggang Wang,https://pku-dymvhumans.github.io/,https://arxiv.org/abs/2403.16080,,2403.16080.pdf,PKU-DyMVHumans: A Multi-View Video Benchmark for High-Fidelity Dynamic Human Modeling,"High-quality human reconstruction and photo-realistic rendering of a dynamic +scene is a long-standing problem in computer vision and graphics. Despite +considerable efforts invested in developing various capture systems and +reconstruction algorithms, recent advancements still struggle with loose or +oversized clothing and overly complex poses. In part, this is due to the +challenges of acquiring high-quality human datasets. To facilitate the +development of these fields, in this paper, we present PKU-DyMVHumans, a +versatile human-centric dataset for high-fidelity reconstruction and rendering +of dynamic human scenarios from dense multi-view videos. 
It comprises 8.2 +million frames captured by more than 56 synchronized cameras across diverse +scenarios. These sequences comprise 32 human subjects across 45 different +scenarios, each with a high-detailed appearance and realistic human motion. +Inspired by recent advancements in neural radiance field (NeRF)-based scene +representations, we carefully set up an off-the-shelf framework that is easy to +provide those state-of-the-art NeRF-based implementations and benchmark on +PKU-DyMVHumans dataset. It is paving the way for various applications like +fine-grained foreground/background decomposition, high-quality human +reconstruction and photo-realistic novel view synthesis of a dynamic scene. +Extensive studies are performed on the benchmark, demonstrating new +observations and challenges that emerge from using such high-fidelity dynamic +data.",cs.CV,['cs.CV'] +CrossMAE: Cross Modality Masked Autoencoders For Region-Aware Audio-Visual Pre-Training,Yuxin Guo · Siyang Sun · Shuailei Ma · Kecheng Zheng · Xiaoyi Bao · Shijie Ma · Wei Zou · Yun Zheng, ,https://arxiv.org/abs/2401.14391,,2401.14391.pdf,Rethinking Patch Dependence for Masked Autoencoders,"In this work, we re-examine inter-patch dependencies in the decoding +mechanism of masked autoencoders (MAE). We decompose this decoding mechanism +for masked patch reconstruction in MAE into self-attention and cross-attention. +Our investigations suggest that self-attention between mask patches is not +essential for learning good representations. To this end, we propose a novel +pretraining framework: Cross-Attention Masked Autoencoders (CrossMAE). +CrossMAE's decoder leverages only cross-attention between masked and visible +tokens, with no degradation in downstream performance. This design also enables +decoding only a small subset of mask tokens, boosting efficiency. Furthermore, +each decoder block can now leverage different encoder features, resulting in +improved representation learning. CrossMAE matches MAE in performance with 2.5 +to 3.7$\times$ less decoding compute. It also surpasses MAE on ImageNet +classification and COCO instance segmentation under the same compute. Code and +models: https://crossmae.github.io",cs.CV,['cs.CV'] +Traceable Federated Continual Learning,Qiang Wang · Bingyan Liu · Yawen Li, ,https://arxiv.org/abs/2312.13500,,2312.13500.pdf,Federated Continual Novel Class Learning,"In a privacy-focused era, Federated Learning (FL) has emerged as a promising +machine learning technique. However, most existing FL studies assume that the +data distribution remains nearly fixed over time, while real-world scenarios +often involve dynamic and continual changes. To equip FL systems with continual +model evolution capabilities, we focus on an important problem called Federated +Continual Novel Class Learning (FedCN) in this work. The biggest challenge in +FedCN is to merge and align novel classes that are discovered and learned by +different clients without compromising privacy. To address this, we propose a +Global Alignment Learning (GAL) framework that can accurately estimate the +global novel class number and provide effective guidance for local training +from a global perspective, all while maintaining privacy protection. +Specifically, GAL first locates high-density regions in the representation +space through a bi-level clustering mechanism to estimate the novel class +number, with which the global prototypes corresponding to novel classes can be +constructed. 
Then, GAL uses a novel semantic weighted loss to capture all +possible correlations between these prototypes and the training data for +mitigating the impact of pseudo-label noise and data heterogeneity. Extensive +experiments on various datasets demonstrate GAL's superior performance over +state-of-the-art novel class discovery methods. In particular, GAL achieves +significant improvements in novel-class performance, increasing the accuracy by +5.1% to 10.6% in the case of one novel class learning stage and by 7.8% to +17.9% in the case of two novel class learning stages, without sacrificing +known-class performance. Moreover, GAL is shown to be effective in equipping a +variety of different mainstream FL algorithms with novel class discovery and +learning capability, highlighting its potential for many real-world +applications.",cs.CV,['cs.CV'] +PolarMatte: Fully Computational Ground-Truth-Quality Alpha Matte Extraction for Images and Video using Polarized Screen Matting,Kenji Enomoto · TJ Rhodes · Brian Price · Gavin Miller, ,https://arxiv.org/abs/2311.13535,,2311.13535.pdf,DiffusionMat: Alpha Matting as Sequential Refinement Learning,"In this paper, we introduce DiffusionMat, a novel image matting framework +that employs a diffusion model for the transition from coarse to refined alpha +mattes. Diverging from conventional methods that utilize trimaps merely as +loose guidance for alpha matte prediction, our approach treats image matting as +a sequential refinement learning process. This process begins with the addition +of noise to trimaps and iteratively denoises them using a pre-trained diffusion +model, which incrementally guides the prediction towards a clean alpha matte. +The key innovation of our framework is a correction module that adjusts the +output at each denoising step, ensuring that the final result is consistent +with the input image's structures. We also introduce the Alpha Reliability +Propagation, a novel technique designed to maximize the utility of available +guidance by selectively enhancing the trimap regions with confident alpha +information, thus simplifying the correction task. To train the correction +module, we devise specialized loss functions that target the accuracy of the +alpha matte's edges and the consistency of its opaque and transparent regions. +We evaluate our model across several image matting benchmarks, and the results +indicate that DiffusionMat consistently outperforms existing methods. Project +page at~\url{https://cnnlstm.github.io/DiffusionMat",cs.CV,['cs.CV'] +Relightable and Animatable Neural Avatar from Sparse-View Video,Zhen Xu · Sida Peng · Chen Geng · Linzhan Mou · Zihan Yan · Jiaming Sun · Hujun Bao · Xiaowei Zhou,https://zju3dv.github.io/relightable_avatar,https://arxiv.org/abs/2308.07903,,2308.07903.pdf,Relightable and Animatable Neural Avatar from Sparse-View Video,"This paper tackles the challenge of creating relightable and animatable +neural avatars from sparse-view (or even monocular) videos of dynamic humans +under unknown illumination. Compared to studio environments, this setting is +more practical and accessible but poses an extremely challenging ill-posed +problem. Previous neural human reconstruction methods are able to reconstruct +animatable avatars from sparse views using deformed Signed Distance Fields +(SDF) but cannot recover material parameters for relighting. 
While +differentiable inverse rendering-based methods have succeeded in material +recovery of static objects, it is not straightforward to extend them to dynamic +humans as it is computationally intensive to compute pixel-surface intersection +and light visibility on deformed SDFs for inverse rendering. To solve this +challenge, we propose a Hierarchical Distance Query (HDQ) algorithm to +approximate the world space distances under arbitrary human poses. +Specifically, we estimate coarse distances based on a parametric human model +and compute fine distances by exploiting the local deformation invariance of +SDF. Based on the HDQ algorithm, we leverage sphere tracing to efficiently +estimate the surface intersection and light visibility. This allows us to +develop the first system to recover animatable and relightable neural avatars +from sparse view (or monocular) inputs. Experiments demonstrate that our +approach is able to produce superior results compared to state-of-the-art +methods. Our code will be released for reproducibility.",cs.CV,"['cs.CV', 'cs.AI', 'cs.GR']" +DeepCache: Accelerating Diffusion Models for Free,Xinyin Ma · Gongfan Fang · Xinchao Wang, ,https://arxiv.org/abs/2312.00858,,2312.00858.pdf,DeepCache: Accelerating Diffusion Models for Free,"Diffusion models have recently gained unprecedented attention in the field of +image synthesis due to their remarkable generative capabilities. +Notwithstanding their prowess, these models often incur substantial +computational costs, primarily attributed to the sequential denoising process +and cumbersome model size. Traditional methods for compressing diffusion models +typically involve extensive retraining, presenting cost and feasibility +challenges. In this paper, we introduce DeepCache, a novel training-free +paradigm that accelerates diffusion models from the perspective of model +architecture. DeepCache capitalizes on the inherent temporal redundancy +observed in the sequential denoising steps of diffusion models, which caches +and retrieves features across adjacent denoising stages, thereby curtailing +redundant computations. Utilizing the property of the U-Net, we reuse the +high-level features while updating the low-level features in a very cheap way. +This innovative strategy, in turn, enables a speedup factor of 2.3$\times$ for +Stable Diffusion v1.5 with only a 0.05 decline in CLIP Score, and 4.1$\times$ +for LDM-4-G with a slight decrease of 0.22 in FID on ImageNet. Our experiments +also demonstrate DeepCache's superiority over existing pruning and distillation +methods that necessitate retraining and its compatibility with current sampling +techniques. Furthermore, we find that under the same throughput, DeepCache +effectively achieves comparable or even marginally improved results with DDIM +or PLMS. The code is available at https://github.com/horseee/DeepCache",cs.CV,"['cs.CV', 'cs.AI']" +Unsupervised Occupancy Learning from Sparse Point Cloud,Amine Ouasfi · Adnane Boukhayma, ,https://arxiv.org/abs/2404.02759,,2404.02759.pdf,Unsupervised Occupancy Learning from Sparse Point Cloud,"Implicit Neural Representations have gained prominence as a powerful +framework for capturing complex data modalities, encompassing a wide range from +3D shapes to images and audio. Within the realm of 3D shape representation, +Neural Signed Distance Functions (SDF) have demonstrated remarkable potential +in faithfully encoding intricate shape geometry. 
However, learning SDFs from 3D +point clouds in the absence of ground truth supervision remains a very +challenging task. In this paper, we propose a method to infer occupancy fields +instead of SDFs as they are easier to learn from sparse inputs. We leverage a +margin-based uncertainty measure to differentially sample from the decision +boundary of the occupancy function and supervise the sampled boundary points +using the input point cloud. We further stabilize the optimization process at +the early stages of the training by biasing the occupancy function towards +minimal entropy fields while maximizing its entropy at the input point cloud. +Through extensive experiments and evaluations, we illustrate the efficacy of +our proposed method, highlighting its capacity to improve implicit shape +inference with respect to baselines and the state-of-the-art using synthetic +and real data.",cs.CV,"['cs.CV', 'cs.AI', 'cs.GR', 'cs.LG']" +SGC-Occ: Semantic-Geometry Consistent 3D Occupancy Prediction for Autonomous Driving,Zhiwen Yang · Xiangteng He · Yuxin Peng, ,https://arxiv.org/abs/2403.08748,,2403.08748.pdf,Real-time 3D semantic occupancy prediction for autonomous vehicles using memory-efficient sparse convolution,"In autonomous vehicles, understanding the surrounding 3D environment of the +ego vehicle in real-time is essential. A compact way to represent scenes while +encoding geometric distances and semantic object information is via 3D semantic +occupancy maps. State of the art 3D mapping methods leverage transformers with +cross-attention mechanisms to elevate 2D vision-centric camera features into +the 3D domain. However, these methods encounter significant challenges in +real-time applications due to their high computational demands during +inference. This limitation is particularly problematic in autonomous vehicles, +where GPU resources must be shared with other tasks such as localization and +planning. In this paper, we introduce an approach that extracts features from +front-view 2D camera images and LiDAR scans, then employs a sparse convolution +network (Minkowski Engine), for 3D semantic occupancy prediction. Given that +outdoor scenes in autonomous driving scenarios are inherently sparse, the +utilization of sparse convolution is particularly apt. By jointly solving the +problems of 3D scene completion of sparse scenes and 3D semantic segmentation, +we provide a more efficient learning framework suitable for real-time +applications in autonomous vehicles. We also demonstrate competitive accuracy +on the nuScenes dataset.",cs.RO,"['cs.RO', 'cs.CV']" +Countering Personalized Text-to-Image Generation with Influence Watermarks,Hanwen Liu · Zhicheng Sun · Yadong Mu, ,https://arxiv.org/abs/2312.15905,,,Cross Initialization for Personalized Text-to-Image Generation,"Recently, there has been a surge in face personalization techniques, +benefiting from the advanced capabilities of pretrained text-to-image diffusion +models. Among these, a notable method is Textual Inversion, which generates +personalized images by inverting given images into textual embeddings. However, +methods based on Textual Inversion still struggle with balancing the trade-off +between reconstruction quality and editability. In this study, we examine this +issue through the lens of initialization. Upon closely examining traditional +initialization methods, we identified a significant disparity between the +initial and learned embeddings in terms of both scale and orientation. 
The +scale of the learned embedding can be up to 100 times greater than that of the +initial embedding. Such a significant change in the embedding could increase +the risk of overfitting, thereby compromising the editability. Driven by this +observation, we introduce a novel initialization method, termed Cross +Initialization, that significantly narrows the gap between the initial and +learned embeddings. This method not only improves both reconstruction and +editability but also reduces the optimization steps from 5000 to 320. +Furthermore, we apply a regularization term to keep the learned embedding close +to the initial embedding. We show that when combined with Cross Initialization, +this regularization term can effectively improve editability. We provide +comprehensive empirical evidence to demonstrate the superior performance of our +method compared to the baseline methods. Notably, in our experiments, Cross +Initialization is the only method that successfully edits an individual's +facial expression. Additionally, a fast version of our method allows for +capturing an input image in roughly 26 seconds, while surpassing the baseline +methods in terms of both reconstruction and editability. Code will be made +publicly available.",cs.CV,['cs.CV'] +GOV-NeSF: Generalizable Open-Vocabulary Neural Semantic Fields,Yunsong Wang · Hanlin Chen · Gim Hee Lee, ,https://arxiv.org/abs/2404.00931,,2404.00931.pdf,GOV-NeSF: Generalizable Open-Vocabulary Neural Semantic Fields,"Recent advancements in vision-language foundation models have significantly +enhanced open-vocabulary 3D scene understanding. However, the generalizability +of existing methods is constrained due to their framework designs and their +reliance on 3D data. We address this limitation by introducing Generalizable +Open-Vocabulary Neural Semantic Fields (GOV-NeSF), a novel approach offering a +generalizable implicit representation of 3D scenes with open-vocabulary +semantics. We aggregate the geometry-aware features using a cost volume, and +propose a Multi-view Joint Fusion module to aggregate multi-view features +through a cross-view attention mechanism, which effectively predicts +view-specific blending weights for both colors and open-vocabulary features. +Remarkably, our GOV-NeSF exhibits state-of-the-art performance in both 2D and +3D open-vocabulary semantic segmentation, eliminating the need for ground truth +semantic labels or depth priors, and effectively generalize across scenes and +datasets without fine-tuning.",cs.CV,['cs.CV'] +NeuRAD: Neural Rendering for Autonomous Driving,Adam Tonderski · Carl Lindström · Georg Hess · William Ljungbergh · Lennart Svensson · Christoffer Petersson,https://research.zenseact.com/publications/neurad/,https://arxiv.org/abs/2311.15260,,2311.15260.pdf,NeuRAD: Neural Rendering for Autonomous Driving,"Neural radiance fields (NeRFs) have gained popularity in the autonomous +driving (AD) community. Recent methods show NeRFs' potential for closed-loop +simulation, enabling testing of AD systems, and as an advanced training data +augmentation technique. However, existing methods often require long training +times, dense semantic supervision, or lack generalizability. This, in turn, +hinders the application of NeRFs for AD at scale. In this paper, we propose +NeuRAD, a robust novel view synthesis method tailored to dynamic AD data. 
Our +method features simple network design, extensive sensor modeling for both +camera and lidar -- including rolling shutter, beam divergence and ray dropping +-- and is applicable to multiple datasets out of the box. We verify its +performance on five popular AD datasets, achieving state-of-the-art performance +across the board. To encourage further development, we will openly release the +NeuRAD source code. See https://github.com/georghess/NeuRAD .",cs.CV,['cs.CV'] +Enhanced Motion-Text Alignment for Image-to-Video Transfer Learning,Wei Zhang · Chaoqun Wan · Tongliang Liu · Xinmei Tian · Xu Shen · Jieping Ye, ,https://arxiv.org/abs/2404.00801,,2404.00801.pdf,$R^2$-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding,"Video temporal grounding (VTG) is a fine-grained video understanding problem +that aims to ground relevant clips in untrimmed videos given natural language +queries. Most existing VTG models are built upon frame-wise final-layer CLIP +features, aided by additional temporal backbones (e.g., SlowFast) with +sophisticated temporal reasoning mechanisms. In this work, we claim that CLIP +itself already shows great potential for fine-grained spatial-temporal +modeling, as each layer offers distinct yet useful information under different +granularity levels. Motivated by this, we propose Reversed Recurrent Tuning +($R^2$-Tuning), a parameter- and memory-efficient transfer learning framework +for video temporal grounding. Our method learns a lightweight $R^2$ Block +containing only 1.5% of the total parameters to perform progressive +spatial-temporal modeling. Starting from the last layer of CLIP, $R^2$ Block +recurrently aggregates spatial features from earlier layers, then refines +temporal correlation conditioning on the given query, resulting in a +coarse-to-fine scheme. $R^2$-Tuning achieves state-of-the-art performance +across three VTG tasks (i.e., moment retrieval, highlight detection, and video +summarization) on six public benchmarks (i.e., QVHighlights, Charades-STA, +Ego4D-NLQ, TACoS, YouTube Highlights, and TVSum) even without the additional +backbone, demonstrating the significance and effectiveness of the proposed +scheme. Our code is available at https://github.com/yeliudev/R2-Tuning.",cs.CV,['cs.CV'] +Enhancing Vision-Language Pretraining with Rich Supervisions,Yuan Gao · Kunyu Shi · Pengkai Zhu · Edouard Belval · Oren Nuriel · Srikar Appalaraju · Shabnam Ghadar · Zhuowen Tu · Vijay Mahadevan · Stefano Soatto, ,https://arxiv.org/abs/2403.03346,,,Enhancing Vision-Language Pre-training with Rich Supervisions,"We propose Strongly Supervised pre-training with ScreenShots (S4) - a novel +pre-training paradigm for Vision-Language Models using data from large-scale +web screenshot rendering. Using web screenshots unlocks a treasure trove of +visual and textual cues that are not present in using image-text pairs. In S4, +we leverage the inherent tree-structured hierarchy of HTML elements and the +spatial localization to carefully design 10 pre-training tasks with large scale +annotated data. These tasks resemble downstream tasks across different domains +and the annotations are cheap to obtain. 
We demonstrate that, compared to +current screenshot pre-training objectives, our innovative pre-training method +significantly enhances performance of image-to-text model in nine varied and +popular downstream tasks - up to 76.1% improvements on Table Detection, and at +least 1% on Widget Captioning.",cs.CV,['cs.CV'] +A Category Agnostic Model for Visual Rearrangement,Yuyi Liu · Xinhang Song · Weijie Li · Xiaohan Wang · Shuqiang Jiang, ,,http://vipl.ict.ac.cn/en/news/researchevents/202403/t20240315_207762.html,,,,,nan +Polos: Multimodal Metric Learning from Human Feedback for Image Captioning,Yuiga Wada · Kanta Kaneda · Daichi Saito · Komei Sugiura,https://yuiga.dev/polos,https://arxiv.org/abs/2402.18091,,2402.18091.pdf,Polos: Multimodal Metric Learning from Human Feedback for Image Captioning,"Establishing an automatic evaluation metric that closely aligns with human +judgments is essential for effectively developing image captioning models. +Recent data-driven metrics have demonstrated a stronger correlation with human +judgments than classic metrics such as CIDEr; however they lack sufficient +capabilities to handle hallucinations and generalize across diverse images and +texts partially because they compute scalar similarities merely using +embeddings learned from tasks unrelated to image captioning evaluation. In this +study, we propose Polos, a supervised automatic evaluation metric for image +captioning models. Polos computes scores from multimodal inputs, using a +parallel feature extraction mechanism that leverages embeddings trained through +large-scale contrastive learning. To train Polos, we introduce Multimodal +Metric Learning from Human Feedback (M$^2$LHF), a framework for developing +metrics based on human feedback. We constructed the Polaris dataset, which +comprises 131K human judgments from 550 evaluators, which is approximately ten +times larger than standard datasets. Our approach achieved state-of-the-art +performance on Composite, Flickr8K-Expert, Flickr8K-CF, PASCAL-50S, FOIL, and +the Polaris dataset, thereby demonstrating its effectiveness and robustness.",cs.CV,"['cs.CV', 'cs.AI', 'cs.CL']" +CLIB-FIQA: Face Image Quality Assessment with Confidence Calibration,Fu-Zhao Ou · Fu-Zhao Ou · Chongyi Li · Shiqi Wang · Sam Kwong, ,https://arxiv.org/abs/2404.12203,,2404.12203.pdf,GraFIQs: Face Image Quality Assessment Using Gradient Magnitudes,"Face Image Quality Assessment (FIQA) estimates the utility of face images for +automated face recognition (FR) systems. We propose in this work a novel +approach to assess the quality of face images based on inspecting the required +changes in the pre-trained FR model weights to minimize differences between +testing samples and the distribution of the FR training dataset. To achieve +that, we propose quantifying the discrepancy in Batch Normalization statistics +(BNS), including mean and variance, between those recorded during FR training +and those obtained by processing testing samples through the pretrained FR +model. We then generate gradient magnitudes of pretrained FR weights by +backpropagating the BNS through the pretrained model. The cumulative absolute +sum of these gradient magnitudes serves as the FIQ for our approach. 
Through +comprehensive experimentation, we demonstrate the effectiveness of our +training-free and quality labeling-free approach, achieving competitive +performance to recent state-of-theart FIQA approaches without relying on +quality labeling, the need to train regression networks, specialized +architectures, or designing and optimizing specific loss functions.",cs.CV,['cs.CV'] +EVCap: Retrieval-Augmented Image Captioning with External Visual--Name Memory for Open-World Comprehension,Jiaxuan Li · Duc Minh Vo · Akihiro Sugimoto · Hideki Nakayama, ,https://arxiv.org/abs/2311.15879v2,,2311.15879v2.pdf,EVCap: Retrieval-Augmented Image Captioning with External Visual-Name Memory for Open-World Comprehension,"Large language models (LLMs)-based image captioning has the capability of +describing objects not explicitly observed in training data; yet novel objects +occur frequently, necessitating the requirement of sustaining up-to-date object +knowledge for open-world comprehension. Instead of relying on large amounts of +data and/or scaling up network parameters, we introduce a highly effective +retrieval-augmented image captioning method that prompts LLMs with object names +retrieved from External Visual--name memory (EVCap). We build ever-changing +object knowledge memory using objects' visuals and names, enabling us to (i) +update the memory at a minimal cost and (ii) effortlessly augment LLMs with +retrieved object names by utilizing a lightweight and fast-to-train model. Our +model, which was trained only on the COCO dataset, can adapt to out-of-domain +without requiring additional fine-tuning or re-training. Our experiments +conducted on benchmarks and synthetic commonsense-violating data show that +EVCap, with only 3.97M trainable parameters, exhibits superior performance +compared to other methods based on frozen pre-trained LLMs. Its performance is +also competitive to specialist SOTAs that require extensive training.",cs.CV,['cs.CV'] +On Exact Inversion of DPM-Solvers,Seongmin Hong · Kyeonghyun Lee · Suh Yoon Jeon · Hyewon Bae · Se Young Chun,https://smhongok.github.io/inv-dpm.html,https://arxiv.org/abs/2311.18387v1,,2311.18387v1.pdf,On Exact Inversion of DPM-Solvers,"Diffusion probabilistic models (DPMs) are a key component in modern +generative models. DPM-solvers have achieved reduced latency and enhanced +quality significantly, but have posed challenges to find the exact inverse +(i.e., finding the initial noise from the given image). Here we investigate the +exact inversions for DPM-solvers and propose algorithms to perform them when +samples are generated by the first-order as well as higher-order DPM-solvers. +For each explicit denoising step in DPM-solvers, we formulated the inversions +using implicit methods such as gradient descent or forward step method to +ensure the robustness to large classifier-free guidance unlike the prior +approach using fixed-point iteration. Experimental results demonstrated that +our proposed exact inversion methods significantly reduced the error of both +image and noise reconstructions, greatly enhanced the ability to distinguish +invisible watermarks and well prevented unintended background changes +consistently during image editing. 
Project page: +\url{https://smhongok.github.io/inv-dpm.html}.",cs.CV,"['cs.CV', 'cs.LG']" +Learning Structure-from-Motion with Graph Attention Networks,Lucas Brynte · José Pedro Iglesias · Carl Olsson · Fredrik Kahl,https://github.com/lucasbrynte/gasfm/,https://arxiv.org/abs/2308.15984,,2308.15984.pdf,Learning Structure-from-Motion with Graph Attention Networks,"In this paper we tackle the problem of learning Structure-from-Motion (SfM) +through the use of graph attention networks. SfM is a classic computer vision +problem that is solved though iterative minimization of reprojection errors, +referred to as Bundle Adjustment (BA), starting from a good initialization. In +order to obtain a good enough initialization to BA, conventional methods rely +on a sequence of sub-problems (such as pairwise pose estimation, pose averaging +or triangulation) which provide an initial solution that can then be refined +using BA. In this work we replace these sub-problems by learning a model that +takes as input the 2D keypoints detected across multiple views, and outputs the +corresponding camera poses and 3D keypoint coordinates. Our model takes +advantage of graph neural networks to learn SfM-specific primitives, and we +show that it can be used for fast inference of the reconstruction for new and +unseen sequences. The experimental results show that the proposed model +outperforms competing learning-based methods, and challenges COLMAP while +having lower runtime. Our code is available at +https://github.com/lucasbrynte/gasfm/.",cs.CV,"['cs.CV', 'cs.LG']" +Plug and Play Active Learning for Object Detection,Chenhongyi Yang · Lichao Huang · Elliot Crowley, ,,https://allainews.com/item/plug-and-play-active-learning-for-object-detection-2024-03-15/,,,,,nan +MACE: Mass Concept Erasure in Diffusion Models,Shilin Lu · Zilan Wang · Leyang Li · Yanzhu Liu · Adams Wai-Kin Kong,https://github.com/Shilin-LU/MACE,https://arxiv.org/abs/2403.06135,,2403.06135.pdf,MACE: Mass Concept Erasure in Diffusion Models,"The rapid expansion of large-scale text-to-image diffusion models has raised +growing concerns regarding their potential misuse in creating harmful or +misleading content. In this paper, we introduce MACE, a finetuning framework +for the task of mass concept erasure. This task aims to prevent models from +generating images that embody unwanted concepts when prompted. Existing concept +erasure methods are typically restricted to handling fewer than five concepts +simultaneously and struggle to find a balance between erasing concept synonyms +(generality) and maintaining unrelated concepts (specificity). In contrast, +MACE differs by successfully scaling the erasure scope up to 100 concepts and +by achieving an effective balance between generality and specificity. This is +achieved by leveraging closed-form cross-attention refinement along with LoRA +finetuning, collectively eliminating the information of undesirable concepts. +Furthermore, MACE integrates multiple LoRAs without mutual interference. We +conduct extensive evaluations of MACE against prior methods across four +different tasks: object erasure, celebrity erasure, explicit content erasure, +and artistic style erasure. Our results reveal that MACE surpasses prior +methods in all evaluated tasks. 
Code is available at +https://github.com/Shilin-LU/MACE.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +Contextual Augmented Global Contrast for Multimodal Intent Recognition,Kaili Sun · Zhiwen Xie · Mang Ye · Huyin Zhang, ,https://arxiv.org/html/2312.14667v1,,2312.14667v1.pdf,Token-Level Contrastive Learning with Modality-Aware Prompting for Multimodal Intent Recognition,"Multimodal intent recognition aims to leverage diverse modalities such as +expressions, body movements and tone of speech to comprehend user's intent, +constituting a critical task for understanding human language and behavior in +real-world multimodal scenarios. Nevertheless, the majority of existing methods +ignore potential correlations among different modalities and own limitations in +effectively learning semantic features from nonverbal modalities. In this +paper, we introduce a token-level contrastive learning method with +modality-aware prompting (TCL-MAP) to address the above challenges. To +establish an optimal multimodal semantic environment for text modality, we +develop a modality-aware prompting module (MAP), which effectively aligns and +fuses features from text, video and audio modalities with similarity-based +modality alignment and cross-modality attention mechanism. Based on the +modality-aware prompt and ground truth labels, the proposed token-level +contrastive learning framework (TCL) constructs augmented samples and employs +NT-Xent loss on the label token. Specifically, TCL capitalizes on the optimal +textual semantic insights derived from intent labels to guide the learning +processes of other modalities in return. Extensive experiments show that our +method achieves remarkable improvements compared to state-of-the-art methods. +Additionally, ablation analyses demonstrate the superiority of the +modality-aware prompt over the handcrafted prompt, which holds substantial +significance for multimodal prompt learning. The codes are released at +https://github.com/thuiar/TCL-MAP.",cs.MM,"['cs.MM', 'cs.LG']" +Fixed Point Diffusion Models,Luke Melas-Kyriazi · Xingjian Bai, ,https://arxiv.org/abs/2401.08741,,2401.08741.pdf,Fixed Point Diffusion Models,"We introduce the Fixed Point Diffusion Model (FPDM), a novel approach to +image generation that integrates the concept of fixed point solving into the +framework of diffusion-based generative modeling. Our approach embeds an +implicit fixed point solving layer into the denoising network of a diffusion +model, transforming the diffusion process into a sequence of closely-related +fixed point problems. Combined with a new stochastic training method, this +approach significantly reduces model size, reduces memory usage, and +accelerates training. Moreover, it enables the development of two new +techniques to improve sampling efficiency: reallocating computation across +timesteps and reusing fixed point solutions between timesteps. We conduct +extensive experiments with state-of-the-art models on ImageNet, FFHQ, +CelebA-HQ, and LSUN-Church, demonstrating substantial improvements in +performance and efficiency. Compared to the state-of-the-art DiT model, FPDM +contains 87% fewer parameters, consumes 60% less memory during training, and +improves image generation quality in situations where sampling computation or +time is limited. 
Our code and pretrained models are available at +https://lukemelas.github.io/fixed-point-diffusion-models.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +High Fidelity Person-centric Subject-to-Image Synthesis,Yibin Wang · Weizhong Zhang · Jianwei Zheng · Cheng Jin, ,https://arxiv.org/abs/2311.10329,,2311.10329.pdf,High-fidelity Person-centric Subject-to-Image Synthesis,"Current subject-driven image generation methods encounter significant +challenges in person-centric image generation. The reason is that they learn +the semantic scene and person generation by fine-tuning a common pre-trained +diffusion, which involves an irreconcilable training imbalance. Precisely, to +generate realistic persons, they need to sufficiently tune the pre-trained +model, which inevitably causes the model to forget the rich semantic scene +prior and makes scene generation over-fit to the training data. Moreover, even +with sufficient fine-tuning, these methods can still not generate high-fidelity +persons since joint learning of the scene and person generation also lead to +quality compromise. In this paper, we propose Face-diffuser, an effective +collaborative generation pipeline to eliminate the above training imbalance and +quality compromise. Specifically, we first develop two specialized pre-trained +diffusion models, i.e., Text-driven Diffusion Model (TDM) and Subject-augmented +Diffusion Model (SDM), for scene and person generation, respectively. The +sampling process is divided into three sequential stages, i.e., semantic scene +construction, subject-scene fusion, and subject enhancement. The first and last +stages are performed by TDM and SDM respectively. The subject-scene fusion +stage, that is the collaboration achieved through a novel and highly effective +mechanism, Saliency-adaptive Noise Fusion (SNF). Specifically, it is based on +our key observation that there exists a robust link between classifier-free +guidance responses and the saliency of generated images. In each time step, SNF +leverages the unique strengths of each model and allows for the spatial +blending of predicted noises from both models automatically in a saliency-aware +manner. Extensive experiments confirm the impressive effectiveness and +robustness of the Face-diffuser.",cs.CV,"['cs.CV', 'cs.AI']" +On the Content Bias in Fréchet Video Distance,Songwei Ge · Aniruddha Mahapatra · Gaurav Parmar · Jun-Yan Zhu · Jia-Bin Huang, ,https://arxiv.org/abs/2404.12391,,2404.12391.pdf,On the Content Bias in Fréchet Video Distance,"Fr\'echet Video Distance (FVD), a prominent metric for evaluating video +generation models, is known to conflict with human perception occasionally. In +this paper, we aim to explore the extent of FVD's bias toward per-frame quality +over temporal realism and identify its sources. We first quantify the FVD's +sensitivity to the temporal axis by decoupling the frame and motion quality and +find that the FVD increases only slightly with large temporal corruption. We +then analyze the generated videos and show that via careful sampling from a +large set of generated videos that do not contain motions, one can drastically +decrease FVD without improving the temporal quality. Both studies suggest FVD's +bias towards the quality of individual frames. We further observe that the bias +can be attributed to the features extracted from a supervised video classifier +trained on the content-biased dataset. 
We show that FVD with features extracted +from the recent large-scale self-supervised video models is less biased toward +image quality. Finally, we revisit a few real-world examples to validate our +hypothesis.",cs.CV,"['cs.CV', 'cs.GR', 'cs.LG']" +Implicit Discriminative Knowledge Learning for Visible-Infrared Person Re-Identification,kaijie ren · Lei Zhang, ,https://arxiv.org/abs/2403.11708v2,,2403.11708v2.pdf,Implicit Discriminative Knowledge Learning for Visible-Infrared Person Re-Identification,"Visible-Infrared Person Re-identification (VI-ReID) is a challenging +cross-modal pedestrian retrieval task, due to significant intra-class +variations and cross-modal discrepancies among different cameras. Existing +works mainly focus on embedding images of different modalities into a unified +space to mine modality-shared features. They only seek distinctive information +within these shared features, while ignoring the identity-aware useful +information that is implicit in the modality-specific features. To address this +issue, we propose a novel Implicit Discriminative Knowledge Learning (IDKL) +network to uncover and leverage the implicit discriminative information +contained within the modality-specific. First, we extract modality-specific and +modality-shared features using a novel dual-stream network. Then, the +modality-specific features undergo purification to reduce their modality style +discrepancies while preserving identity-aware discriminative knowledge. +Subsequently, this kind of implicit knowledge is distilled into the +modality-shared feature to enhance its distinctiveness. Finally, an alignment +loss is proposed to minimize modality discrepancy on enhanced modality-shared +features. Extensive experiments on multiple public datasets demonstrate the +superiority of IDKL network over the state-of-the-art methods. Code is +available at https://github.com/1KK077/IDKL.",cs.CV,['cs.CV'] +PointBeV: A Sparse Approach for BeV Predictions,Loick Chambon · Éloi Zablocki · Mickaël Chen · Florent Bartoccioni · Patrick Pérez · Matthieu Cord, ,https://arxiv.org/abs/2312.00703,,2312.00703.pdf,PointBeV: A Sparse Approach to BeV Predictions,"Bird's-eye View (BeV) representations have emerged as the de-facto shared +space in driving applications, offering a unified space for sensor data fusion +and supporting various downstream tasks. However, conventional models use grids +with fixed resolution and range and face computational inefficiencies due to +the uniform allocation of resources across all cells. To address this, we +propose PointBeV, a novel sparse BeV segmentation model operating on sparse BeV +cells instead of dense grids. This approach offers precise control over memory +usage, enabling the use of long temporal contexts and accommodating +memory-constrained platforms. PointBeV employs an efficient two-pass strategy +for training, enabling focused computation on regions of interest. At inference +time, it can be used with various memory/performance trade-offs and flexibly +adjusts to new specific use cases. PointBeV achieves state-of-the-art results +on the nuScenes dataset for vehicle, pedestrian, and lane segmentation, +showcasing superior performance in static and temporal settings despite being +trained solely with sparse signals. We will release our code along with two new +efficient modules used in the architecture: Sparse Feature Pulling, designed +for the effective extraction of features from images to BeV, and Submanifold +Attention, which enables efficient temporal modeling. 
Our code is available at +https://github.com/valeoai/PointBeV.",cs.CV,['cs.CV'] +Behind the Veil: Enhanced Indoor 3D Scene Reconstruction with Occluded Surfaces Completion,Su Sun · Cheng Zhao · Yuliang Guo · Ruoyu Wang · Xinyu Huang · Yingjie Victor Chen · Liu Ren, ,https://arxiv.org/abs/2404.03070,,2404.03070.pdf,Behind the Veil: Enhanced Indoor 3D Scene Reconstruction with Occluded Surfaces Completion,"In this paper, we present a novel indoor 3D reconstruction method with +occluded surface completion, given a sequence of depth readings. Prior +state-of-the-art (SOTA) methods only focus on the reconstruction of the visible +areas in a scene, neglecting the invisible areas due to the occlusions, e.g., +the contact surface between furniture, occluded wall and floor. Our method +tackles the task of completing the occluded scene surfaces, resulting in a +complete 3D scene mesh. The core idea of our method is learning 3D geometry +prior from various complete scenes to infer the occluded geometry of an unseen +scene from solely depth measurements. We design a coarse-fine hierarchical +octree representation coupled with a dual-decoder architecture, i.e., +Geo-decoder and 3D Inpainter, which jointly reconstructs the complete 3D scene +geometry. The Geo-decoder with detailed representation at fine levels is +optimized online for each scene to reconstruct visible surfaces. The 3D +Inpainter with abstract representation at coarse levels is trained offline +using various scenes to complete occluded surfaces. As a result, while the +Geo-decoder is specialized for an individual scene, the 3D Inpainter can be +generally applied across different scenes. We evaluate the proposed method on +the 3D Completed Room Scene (3D-CRS) and iTHOR datasets, significantly +outperforming the SOTA methods by a gain of 16.8% and 24.2% in terms of the +completeness of 3D reconstruction. 3D-CRS dataset including a complete 3D mesh +of each scene is provided at project webpage.",cs.CV,['cs.CV'] +VidLA: Video-Language Alignment at Scale,Mamshad Nayeem Rizve · Fan Fei · Jayakrishnan Unnikrishnan · Son Dinh Tran · Benjamin Yao · Belinda Zeng · Mubarak Shah · Trishul Chilimbi, ,https://arxiv.org/abs/2403.14870,,2403.14870.pdf,VidLA: Video-Language Alignment at Scale,"In this paper, we propose VidLA, an approach for video-language alignment at +scale. There are two major limitations of previous video-language alignment +approaches. First, they do not capture both short-range and long-range temporal +dependencies and typically employ complex hierarchical deep network +architectures that are hard to integrate with existing pretrained image-text +foundation models. To effectively address this limitation, we instead keep the +network architecture simple and use a set of data tokens that operate at +different temporal resolutions in a hierarchical manner, accounting for the +temporally hierarchical nature of videos. By employing a simple two-tower +architecture, we are able to initialize our video-language model with +pretrained image-text foundation models, thereby boosting the final +performance. Second, existing video-language alignment works struggle due to +the lack of semantically aligned large-scale training data. To overcome it, we +leverage recent LLMs to curate the largest video-language dataset to date with +better visual grounding. 
Furthermore, unlike existing video-text datasets which +only contain short clips, our dataset is enriched with video clips of varying +durations to aid our temporally hierarchical data tokens in extracting better +representations at varying temporal scales. Overall, empirical results show +that our proposed approach surpasses state-of-the-art methods on multiple +retrieval benchmarks, especially on longer videos, and performs competitively +on classification benchmarks.",cs.CV,"['cs.CV', 'cs.CL', 'cs.LG']" +ODIN: A Single Model for 2D and 3D Segmentation,Ayush Jain · Pushkal Katara · Nikolaos Gkanatsios · Adam Harley · Gabriel Sarch · Kriti Aggarwal · Vishrav Chaudhary · Katerina Fragkiadaki, ,https://arxiv.org/abs/2401.02416,,2401.02416.pdf,ODIN: A Single Model for 2D and 3D Segmentation,"State-of-the-art models on contemporary 3D segmentation benchmarks like +ScanNet consume and label dataset-provided 3D point clouds, obtained through +post processing of sensed multiview RGB-D images. They are typically trained +in-domain, forego large-scale 2D pre-training and outperform alternatives that +featurize the posed RGB-D multiview images instead. The gap in performance +between methods that consume posed images versus post-processed 3D point clouds +has fueled the belief that 2D and 3D perception require distinct model +architectures. In this paper, we challenge this view and propose ODIN +(Omni-Dimensional INstance segmentation), a model that can segment and label +both 2D RGB images and 3D point clouds, using a transformer architecture that +alternates between 2D within-view and 3D cross-view information fusion. Our +model differentiates 2D and 3D feature operations through the positional +encodings of the tokens involved, which capture pixel coordinates for 2D patch +tokens and 3D coordinates for 3D feature tokens. ODIN achieves state-of-the-art +performance on ScanNet200, Matterport3D and AI2THOR 3D instance segmentation +benchmarks, and competitive performance on ScanNet, S3DIS and COCO. It +outperforms all previous works by a wide margin when the sensed 3D point cloud +is used in place of the point cloud sampled from 3D mesh. When used as the 3D +perception engine in an instructable embodied agent architecture, it sets a new +state-of-the-art on the TEACh action-from-dialogue benchmark. Our code and +checkpoints can be found at the project website (https://odin-seg.github.io).",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG', 'cs.RO']" +VideoGrounding-DINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding,Syed Talal Wasim · Muzammal Naseer · Salman Khan · Ming-Hsuan Yang · Fahad Shahbaz Khan, ,https://arxiv.org/abs/2401.00901,,2401.00901.pdf,Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding,"Video grounding aims to localize a spatio-temporal section in a video +corresponding to an input text query. This paper addresses a critical +limitation in current video grounding methodologies by introducing an +Open-Vocabulary Spatio-Temporal Video Grounding task. Unlike prevalent +closed-set approaches that struggle with open-vocabulary scenarios due to +limited training data and predefined vocabularies, our model leverages +pre-trained representations from foundational spatial grounding models. This +empowers it to effectively bridge the semantic gap between natural language and +diverse visual content, achieving strong performance in closed-set and +open-vocabulary settings. 
Our contributions include a novel spatio-temporal +video grounding model, surpassing state-of-the-art results in closed-set +evaluations on multiple datasets and demonstrating superior performance in +open-vocabulary scenarios. Notably, the proposed model outperforms +state-of-the-art methods in closed-set settings on VidSTG (Declarative and +Interrogative) and HC-STVG (V1 and V2) datasets. Furthermore, in +open-vocabulary evaluations on HC-STVG V1 and YouCook-Interactions, our model +surpasses the recent best-performing models by $4.88$ m_vIoU and $1.83\%$ +accuracy, demonstrating its efficacy in handling diverse linguistic and visual +concepts for improved video understanding. Our codes will be publicly released.",cs.CV,['cs.CV'] +Toward Generalist Anomaly Detection via In-context Residual Learning with Few-shot Sample Prompts,Jiawen Zhu · Guansong Pang, ,https://arxiv.org/abs/2403.06495,,2403.06495.pdf,Toward Generalist Anomaly Detection via In-context Residual Learning with Few-shot Sample Prompts,"This paper explores the problem of Generalist Anomaly Detection (GAD), aiming +to train one single detection model that can generalize to detect anomalies in +diverse datasets from different application domains without any further +training on the target data. Some recent studies have shown that large +pre-trained Visual-Language Models (VLMs) like CLIP have strong generalization +capabilities on detecting industrial defects from various datasets, but their +methods rely heavily on handcrafted text prompts about defects, making them +difficult to generalize to anomalies in other applications, e.g., medical image +anomalies or semantic anomalies in natural images. In this work, we propose to +train a GAD model with few-shot normal images as sample prompts for AD on +diverse datasets on the fly. To this end, we introduce a novel approach that +learns an in-context residual learning model for GAD, termed InCTRL. It is +trained on an auxiliary dataset to discriminate anomalies from normal samples +based on a holistic evaluation of the residuals between query images and +few-shot normal sample prompts. Regardless of the datasets, per definition of +anomaly, larger residuals are expected for anomalies than normal samples, +thereby enabling InCTRL to generalize across different domains without further +training. Comprehensive experiments on nine AD datasets are performed to +establish a GAD benchmark that encapsulate the detection of industrial defect +anomalies, medical anomalies, and semantic anomalies in both one-vs-all and +multi-class setting, on which InCTRL is the best performer and significantly +outperforms state-of-the-art competing methods. Code is available at +https://github.com/mala-lab/InCTRL.",cs.CV,['cs.CV'] +ODM: A Text-Image Further Alignment Pre-training Approach for Scene Text Detection and Spotting,Chen Duan · Pei Fu · Shan Guo · Qianyi Jiang · Xiaoming Wei, ,https://arxiv.org/abs/2403.00303,,2403.00303.pdf,ODM: A Text-Image Further Alignment Pre-training Approach for Scene Text Detection and Spotting,"In recent years, text-image joint pre-training techniques have shown +promising results in various tasks. However, in Optical Character Recognition +(OCR) tasks, aligning text instances with their corresponding text regions in +images poses a challenge, as it requires effective alignment between text and +OCR-Text (referring to the text in images as OCR-Text to distinguish from the +text in natural language) rather than a holistic understanding of the overall +image content. 
In this paper, we propose a new pre-training method called +OCR-Text Destylization Modeling (ODM) that transfers diverse styles of text +found in images to a uniform style based on the text prompt. With ODM, we +achieve better alignment between text and OCR-Text and enable pre-trained +models to adapt to the complex and diverse styles of scene text detection and +spotting tasks. Additionally, we have designed a new labeling generation method +specifically for ODM and combined it with our proposed Text-Controller module +to address the challenge of annotation costs in OCR tasks, allowing a larger +amount of unlabeled data to participate in pre-training. Extensive experiments +on multiple public datasets demonstrate that our method significantly improves +performance and outperforms current pre-training methods in scene text +detection and spotting tasks. Code is available at +https://github.com/PriNing/ODM.",cs.CV,['cs.CV'] +"LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning",Sijin Chen · Xin Chen · Chi Zhang · Mingsheng Li · Gang Yu · Hao Fei · Hongyuan Zhu · Jiayuan Fan · Tao Chen, ,https://arxiv.org/abs/2311.18651,,2311.18651.pdf,"LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning","Recent advances in Large Multimodal Models (LMM) have made it possible for +various applications in human-machine interactions. However, developing LMMs +that can comprehend, reason, and plan in complex and diverse 3D environments +remains a challenging topic, especially considering the demand for +understanding permutation-invariant point cloud 3D representations of the 3D +scene. Existing works seek help from multi-view images, and project 2D features +to 3D space as 3D scene representations. This, however, leads to huge +computational overhead and performance degradation. In this paper, we present +LL3DA, a Large Language 3D Assistant that takes point cloud as direct input and +respond to both textual-instructions and visual-prompts. This help LMMs better +comprehend human interactions and further help to remove the ambiguities in +cluttered 3D scenes. Experiments show that LL3DA achieves remarkable results, +and surpasses various 3D vision-language models on both 3D Dense Captioning and +3D Question Answering.",cs.CV,['cs.CV'] +UV-IDM: Identity-Conditioned Latent Diffusion Model for Face UV-Texture Generation,Hong Li · Yutang Feng · Song Xue · Xuhui Liu · Boyu Liu · Bohan Zeng · Shanglin Li · Jianzhuang Liu · Shumin Han · Baochang Zhang, ,https://arxiv.org/abs/2403.19235,,,DreamSalon: A Staged Diffusion Framework for Preserving Identity-Context in Editable Face Generation,"While large-scale pre-trained text-to-image models can synthesize diverse and +high-quality human-centered images, novel challenges arise with a nuanced task +of ""identity fine editing"": precisely modifying specific features of a subject +while maintaining its inherent identity and context. Existing personalization +methods either require time-consuming optimization or learning additional +encoders, adept in ""identity re-contextualization"". However, they often +struggle with detailed and sensitive tasks like human face editing. To address +these challenges, we introduce DreamSalon, a noise-guided, staged-editing +framework, uniquely focusing on detailed image manipulations and +identity-context preservation. 
By discerning editing and boosting stages via +the frequency and gradient of predicted noises, DreamSalon first performs +detailed manipulations on specific features in the editing stage, guided by +high-frequency information, and then employs stochastic denoising in the +boosting stage to improve image quality. For more precise editing, DreamSalon +semantically mixes source and target textual prompts, guided by differences in +their embedding covariances, to direct the model's focus on specific +manipulation areas. Our experiments demonstrate DreamSalon's ability to +efficiently and faithfully edit fine details on human faces, outperforming +existing methods both qualitatively and quantitatively.",cs.CV,['cs.CV'] +Text2QR: Harmonizing Aesthetic Customization and Scanning Robustness for Text-Guided QR Code Generation,Guangyang Wu · Xiaohong Liu · Jun Jia · Xuehao Cui · Guangtao Zhai, ,https://arxiv.org/abs/2403.06452,,2403.06452.pdf,Text2QR: Harmonizing Aesthetic Customization and Scanning Robustness for Text-Guided QR Code Generation,"In the digital era, QR codes serve as a linchpin connecting virtual and +physical realms. Their pervasive integration across various applications +highlights the demand for aesthetically pleasing codes without compromised +scannability. However, prevailing methods grapple with the intrinsic challenge +of balancing customization and scannability. Notably, stable-diffusion models +have ushered in an epoch of high-quality, customizable content generation. This +paper introduces Text2QR, a pioneering approach leveraging these advancements +to address a fundamental challenge: concurrently achieving user-defined +aesthetics and scanning robustness. To ensure stable generation of aesthetic QR +codes, we introduce the QR Aesthetic Blueprint (QAB) module, generating a +blueprint image exerting control over the entire generation process. +Subsequently, the Scannability Enhancing Latent Refinement (SELR) process +refines the output iteratively in the latent space, enhancing scanning +robustness. This approach harnesses the potent generation capabilities of +stable-diffusion models, navigating the trade-off between image aesthetics and +QR code scannability. Our experiments demonstrate the seamless fusion of visual +appeal with the practical utility of aesthetic QR codes, markedly outperforming +prior methods. Codes are available at \url{https://github.com/mulns/Text2QR}",cs.CV,['cs.CV'] +Triplane Meets Gaussian Splatting: Fast and Generalizable Single-View 3D Reconstruction with Transformers,Zi-Xin Zou · Zhipeng Yu · Yuan-Chen Guo · Yangguang Li · Yan-Pei Cao · Ding Liang · Song-Hai Zhang,https://zouzx.github.io/TriplaneGaussian/,https://arxiv.org/abs/2312.09147,,2312.09147.pdf,Triplane Meets Gaussian Splatting: Fast and Generalizable Single-View 3D Reconstruction with Transformers,"Recent advancements in 3D reconstruction from single images have been driven +by the evolution of generative models. Prominent among these are methods based +on Score Distillation Sampling (SDS) and the adaptation of diffusion models in +the 3D domain. Despite their progress, these techniques often face limitations +due to slow optimization or rendering processes, leading to extensive training +and optimization times. In this paper, we introduce a novel approach for +single-view reconstruction that efficiently generates a 3D model from a single +image via feed-forward inference. 
Our method utilizes two transformer-based +networks, namely a point decoder and a triplane decoder, to reconstruct 3D +objects using a hybrid Triplane-Gaussian intermediate representation. This +hybrid representation strikes a balance, achieving a faster rendering speed +compared to implicit representations while simultaneously delivering superior +rendering quality than explicit representations. The point decoder is designed +for generating point clouds from single images, offering an explicit +representation which is then utilized by the triplane decoder to query Gaussian +features for each point. This design choice addresses the challenges associated +with directly regressing explicit 3D Gaussian attributes characterized by their +non-structural nature. Subsequently, the 3D Gaussians are decoded by an MLP to +enable rapid rendering through splatting. Both decoders are built upon a +scalable, transformer-based architecture and have been efficiently trained on +large-scale 3D datasets. The evaluations conducted on both synthetic datasets +and real-world images demonstrate that our method not only achieves higher +quality but also ensures a faster runtime in comparison to previous +state-of-the-art techniques. Please see our project page at +https://zouzx.github.io/TriplaneGaussian/.",cs.CV,['cs.CV'] +Active Object Detection with Knowledge Aggregation and Distillation from Large Models,Dejie Yang · Yang Liu, ,https://arxiv.org/abs/2405.12509,,2405.12509.pdf,Active Object Detection with Knowledge Aggregation and Distillation from Large Models,"Accurately detecting active objects undergoing state changes is essential for +comprehending human interactions and facilitating decision-making. The existing +methods for active object detection (AOD) primarily rely on visual appearance +of the objects within input, such as changes in size, shape and relationship +with hands. However, these visual changes can be subtle, posing challenges, +particularly in scenarios with multiple distracting no-change instances of the +same category. We observe that the state changes are often the result of an +interaction being performed upon the object, thus propose to use informed +priors about object related plausible interactions (including semantics and +visual appearance) to provide more reliable cues for AOD. Specifically, we +propose a knowledge aggregation procedure to integrate the aforementioned +informed priors into oracle queries within the teacher decoder, offering more +object affordance commonsense to locate the active object. To streamline the +inference process and reduce extra knowledge inputs, we propose a knowledge +distillation approach that encourages the student decoder to mimic the +detection capabilities of the teacher decoder using the oracle query by +replicating its predictions and attention. Our proposed framework achieves +state-of-the-art performance on four datasets, namely Ego4D, Epic-Kitchens, +MECCANO, and 100DOH, which demonstrates the effectiveness of our approach in +improving AOD.",cs.CV,['cs.CV'] +Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs,shiyu xuan · Qingpei Guo · Ming Yang · Shiliang Zhang, ,https://arxiv.org/abs/2310.00582,,,Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs,"Multi-modal Large Language Models (MLLMs) have shown remarkable capabilities +in various multi-modal tasks. Nevertheless, their performance in fine-grained +image understanding tasks is still limited. 
To address this issue, this paper +proposes a new framework to enhance the fine-grained image understanding +abilities of MLLMs. Specifically, we present a new method for constructing the +instruction tuning dataset at a low cost by leveraging annotations in existing +datasets. A self-consistent bootstrapping method is also introduced to extend +existing dense object annotations into high-quality +referring-expression-bounding-box pairs. These methods enable the generation of +high-quality instruction data which includes a wide range of fundamental +abilities essential for fine-grained image perception. Moreover, we argue that +the visual encoder should be tuned during instruction tuning to mitigate the +gap between full image perception and fine-grained image perception. +Experimental results demonstrate the superior performance of our method. For +instance, our model exhibits a 5.2% accuracy improvement over Qwen-VL on GQA +and surpasses the accuracy of Kosmos-2 by 24.7% on RefCOCO_val. We have also +attained the top rank on the leaderboard of MMBench. This promising performance +is achieved by training on only publicly available data, making it easily +reproducible. The models, datasets, and codes are publicly available at +https://github.com/SY-Xuan/Pink.",cs.CV,"['cs.CV', 'cs.AI']" +PoseIRM: Enhance 3D Human Pose Estimation on Unseen Camera Settings via Invariant Risk Minimization,Yanlu Cai · Weizhong Zhang · Yuan Wu · Cheng Jin, ,https://arxiv.org/abs/2405.05216,,,FinePOSE: Fine-Grained Prompt-Driven 3D Human Pose Estimation via Diffusion Models,"The 3D Human Pose Estimation (3D HPE) task uses 2D images or videos to +predict human joint coordinates in 3D space. Despite recent advancements in +deep learning-based methods, they mostly ignore the capability of coupling +accessible texts and naturally feasible knowledge of humans, missing out on +valuable implicit supervision to guide the 3D HPE task. Moreover, previous +efforts often study this task from the perspective of the whole human body, +neglecting fine-grained guidance hidden in different body parts. To this end, +we present a new Fine-Grained Prompt-Driven Denoiser based on a diffusion model +for 3D HPE, named \textbf{FinePOSE}. It consists of three core blocks enhancing +the reverse process of the diffusion model: (1) Fine-grained Part-aware Prompt +learning (FPP) block constructs fine-grained part-aware prompts via coupling +accessible texts and naturally feasible knowledge of body parts with learnable +prompts to model implicit guidance. (2) Fine-grained Prompt-pose Communication +(FPC) block establishes fine-grained communications between learned part-aware +prompts and poses to improve the denoising quality. (3) Prompt-driven Timestamp +Stylization (PTS) block integrates learned prompt embedding and temporal +information related to the noise level to enable adaptive adjustment at each +denoising step. Extensive experiments on public single-human pose estimation +datasets show that FinePOSE outperforms state-of-the-art methods. We further +extend FinePOSE to multi-human pose estimation. Achieving 34.3mm average MPJPE +on the EgoHumans dataset demonstrates the potential of FinePOSE to deal with +complex multi-human scenarios. Code is available at +https://github.com/PKU-ICST-MIPL/FinePOSE_CVPR2024.",cs.CV,['cs.CV'] +Transcriptomics-guided Slide Representation Learning in Computational Pathology,Guillaume Jaume · Lukas Oldenburg · Anurag Vaidya · Richard J. Chen · Drew F. K. 
Williamson · Thomas Peeters · Andrew Song · Faisal Mahmood,https://github.com/mahmoodlab/TANGLE,https://arxiv.org/abs/2405.11618,,2405.11618.pdf,Transcriptomics-guided Slide Representation Learning in Computational Pathology,"Self-supervised learning (SSL) has been successful in building patch +embeddings of small histology images (e.g., 224x224 pixels), but scaling these +models to learn slide embeddings from the entirety of giga-pixel whole-slide +images (WSIs) remains challenging. Here, we leverage complementary information +from gene expression profiles to guide slide representation learning using +multimodal pre-training. Expression profiles constitute highly detailed +molecular descriptions of a tissue that we hypothesize offer a strong +task-agnostic training signal for learning slide embeddings. Our slide and +expression (S+E) pre-training strategy, called Tangle, employs +modality-specific encoders, the outputs of which are aligned via contrastive +learning. Tangle was pre-trained on samples from three different organs: liver +(n=6,597 S+E pairs), breast (n=1,020), and lung (n=1,012) from two different +species (Homo sapiens and Rattus norvegicus). Across three independent test +datasets consisting of 1,265 breast WSIs, 1,946 lung WSIs, and 4,584 liver +WSIs, Tangle shows significantly better few-shot performance compared to +supervised and SSL baselines. When assessed using prototype-based +classification and slide retrieval, Tangle also shows a substantial performance +improvement over all baselines. Code available at +https://github.com/mahmoodlab/TANGLE.",cs.CV,"['cs.CV', 'cs.AI']" +Smart Help: Strategic Opponent Modeling for Proactive and Adaptive Robot Assistance in Households,Zhihao Cao · ZiDong Wang · Siwen Xie · Anji Liu · Lifeng Fan,https://github.com/bigai-ai/smart-help,https://arxiv.org/abs/2404.09001,,2404.09001.pdf,Smart Help: Strategic Opponent Modeling for Proactive and Adaptive Robot Assistance in Households,"Despite the significant demand for assistive technology among vulnerable +groups (e.g., the elderly, children, and the disabled) in daily tasks, research +into advanced AI-driven assistive solutions that genuinely accommodate their +diverse needs remains sparse. Traditional human-machine interaction tasks often +require machines to simply help without nuanced consideration of human +abilities and feelings, such as their opportunity for practice and learning, +sense of self-improvement, and self-esteem. Addressing this gap, we define a +pivotal and novel challenge Smart Help, which aims to provide proactive yet +adaptive support to human agents with diverse disabilities and dynamic goals in +various tasks and environments. To establish this challenge, we leverage +AI2-THOR to build a new interactive 3D realistic household environment for the +Smart Help task. We introduce an innovative opponent modeling module that +provides a nuanced understanding of the main agent's capabilities and goals, in +order to optimize the assisting agent's helping policy. Rigorous experiments +validate the efficacy of our model components and show the superiority of our +holistic approach against established baselines. 
Our findings illustrate the +potential of AI-imbued assistive robots in improving the well-being of +vulnerable groups.",cs.RO,"['cs.RO', 'cs.AI', 'cs.CV']" +FocSAM: Delving Deeply into Focused Objects in Segmenting Anything,You Huang · Zongyu Lan · Liujuan Cao · Xianming Lin · Shengchuan Zhang · Guannan Jiang · Rongrong Ji, ,https://arxiv.org/abs/2405.18706,,2405.18706.pdf,FocSAM: Delving Deeply into Focused Objects in Segmenting Anything,"The Segment Anything Model (SAM) marks a notable milestone in segmentation +models, highlighted by its robust zero-shot capabilities and ability to handle +diverse prompts. SAM follows a pipeline that separates interactive segmentation +into image preprocessing through a large encoder and interactive inference via +a lightweight decoder, ensuring efficient real-time performance. However, SAM +faces stability issues in challenging samples upon this pipeline. These issues +arise from two main factors. Firstly, the image preprocessing disables SAM from +dynamically using image-level zoom-in strategies to refocus on the target +object during interaction. Secondly, the lightweight decoder struggles to +sufficiently integrate interactive information with image embeddings. To +address these two limitations, we propose FocSAM with a pipeline redesigned on +two pivotal aspects. First, we propose Dynamic Window Multi-head Self-Attention +(Dwin-MSA) to dynamically refocus SAM's image embeddings on the target object. +Dwin-MSA localizes attention computations around the target object, enhancing +object-related embeddings with minimal computational overhead. Second, we +propose Pixel-wise Dynamic ReLU (P-DyReLU) to enable sufficient integration of +interactive information from a few initial clicks that have significant impacts +on the overall segmentation results. Experimentally, FocSAM augments SAM's +interactive segmentation performance to match the existing state-of-the-art +method in segmentation quality, requiring only about 5.6% of this method's +inference time on CPUs.",cs.CV,['cs.CV'] +Towards the Uncharted: Density-Descending Feature Perturbation for Semi-supervised Semantic Segmentation,Xiaoyang Wang · Huihui Bai · Limin Yu · Yao Zhao · Jimin Xiao, ,https://arxiv.org/abs/2403.06462v2,,2403.06462v2.pdf,Towards the Uncharted: Density-Descending Feature Perturbation for Semi-supervised Semantic Segmentation,"Semi-supervised semantic segmentation allows model to mine effective +supervision from unlabeled data to complement label-guided training. Recent +research has primarily focused on consistency regularization techniques, +exploring perturbation-invariant training at both the image and feature levels. +In this work, we proposed a novel feature-level consistency learning framework +named Density-Descending Feature Perturbation (DDFP). Inspired by the +low-density separation assumption in semi-supervised learning, our key insight +is that feature density can shed a light on the most promising direction for +the segmentation classifier to explore, which is the regions with lower +density. We propose to shift features with confident predictions towards +lower-density regions by perturbation injection. The perturbed features are +then supervised by the predictions on the original features, thereby compelling +the classifier to explore less dense regions to effectively regularize the +decision boundary. Central to our method is the estimation of feature density. 
+To this end, we introduce a lightweight density estimator based on normalizing +flow, allowing for efficient capture of the feature density distribution in an +online manner. By extracting gradients from the density estimator, we can +determine the direction towards less dense regions for each feature. The +proposed DDFP outperforms other designs on feature-level perturbations and +shows state of the art performances on both Pascal VOC and Cityscapes dataset +under various partition protocols. The project is available at +https://github.com/Gavinwxy/DDFP.",cs.CV,['cs.CV'] +Rethinking the Region Classification in Open-Vocabulary Semantic Segmentation: An Image-to-Image View,Yuan Wang · Rui Sun · Naisong Luo · Yuwen Pan · Tianzhu Zhang, ,https://arxiv.org/abs/2404.00262,,2404.00262.pdf,Image-to-Image Matching via Foundation Models: A New Perspective for Open-Vocabulary Semantic Segmentation,"Open-vocabulary semantic segmentation (OVS) aims to segment images of +arbitrary categories specified by class labels or captions. However, most +previous best-performing methods, whether pixel grouping methods or region +recognition methods, suffer from false matches between image features and +category labels. We attribute this to the natural gap between the textual +features and visual features. In this work, we rethink how to mitigate false +matches from the perspective of image-to-image matching and propose a novel +relation-aware intra-modal matching (RIM) framework for OVS based on visual +foundation models. RIM achieves robust region classification by firstly +constructing diverse image-modal reference features and then matching them with +region features based on relation-aware ranking distribution. The proposed RIM +enjoys several merits. First, the intra-modal reference features are better +aligned, circumventing potential ambiguities that may arise in cross-modal +matching. Second, the ranking-based matching process harnesses the structure +information implicit in the inter-class relationships, making it more robust +than comparing individually. Extensive experiments on three benchmarks +demonstrate that RIM outperforms previous state-of-the-art methods by large +margins, obtaining a lead of more than 10% in mIoU on PASCAL VOC benchmark.",cs.CV,['cs.CV'] +Addressing Background Context Bias in Few-Shot Segmentation through Iterative Modulation,Lanyun Zhu · Tianrun Chen · Jianxiong Yin · Simon See · Jun Liu, ,https://arxiv.org/abs/2401.08407,,2401.08407.pdf,Cross-Domain Few-Shot Segmentation via Iterative Support-Query Correspondence Mining,"Cross-Domain Few-Shot Segmentation (CD-FSS) poses the challenge of segmenting +novel categories from a distinct domain using only limited exemplars. In this +paper, we undertake a comprehensive study of CD-FSS and uncover two crucial +insights: (i) the necessity of a fine-tuning stage to effectively transfer the +learned meta-knowledge across domains, and (ii) the overfitting risk during the +na\""ive fine-tuning due to the scarcity of novel category examples. With these +insights, we propose a novel cross-domain fine-tuning strategy that addresses +the challenging CD-FSS tasks. We first design Bi-directional Few-shot +Prediction (BFP), which establishes support-query correspondence in a +bi-directional manner, crafting augmented supervision to reduce the overfitting +risk. 
Then we further extend BFP into Iterative Few-shot Adaptor (IFA), which +is a recursive framework to capture the support-query correspondence +iteratively, targeting maximal exploitation of supervisory signals from the +sparse novel category samples. Extensive empirical evaluations show that our +method significantly outperforms the state-of-the-arts (+7.8\%), which verifies +that IFA tackles the cross-domain challenges and mitigates the overfitting +simultaneously. The code is available at: https://github.com/niejiahao1998/IFA.",cs.CV,['cs.CV'] +GeoChat: Grounded Large Vision-Language Model for Remote Sensing,Kartik Kuckreja · Muhammad Sohail Danish · Muzammal Naseer · Abhijit Das · Salman Khan · Fahad Shahbaz Khan, ,https://arxiv.org/abs/2311.15826,,2311.15826.pdf,GeoChat: Grounded Large Vision-Language Model for Remote Sensing,"Recent advancements in Large Vision-Language Models (VLMs) have shown great +promise in natural image domains, allowing users to hold a dialogue about given +visual content. However, such general-domain VLMs perform poorly for Remote +Sensing (RS) scenarios, leading to inaccurate or fabricated information when +presented with RS domain-specific queries. Such a behavior emerges due to the +unique challenges introduced by RS imagery. For example, to handle +high-resolution RS imagery with diverse scale changes across categories and +many small objects, region-level reasoning is necessary alongside holistic +scene interpretation. Furthermore, the lack of domain-specific multimodal +instruction following data as well as strong backbone models for RS make it +hard for the models to align their behavior with user queries. To address these +limitations, we propose GeoChat - the first versatile remote sensing VLM that +offers multitask conversational capabilities with high-resolution RS images. +Specifically, GeoChat can not only answer image-level queries but also accepts +region inputs to hold region-specific dialogue. Furthermore, it can visually +ground objects in its responses by referring to their spatial coordinates. To +address the lack of domain-specific datasets, we generate a novel RS multimodal +instruction-following dataset by extending image-text pairs from existing +diverse RS datasets. We establish a comprehensive benchmark for RS multitask +conversations and compare with a number of baseline methods. GeoChat +demonstrates robust zero-shot performance on various RS tasks, e.g., image and +region captioning, visual question answering, scene classification, visually +grounded conversations and referring detection. Our code is available at +https://github.com/mbzuai-oryx/geochat.",cs.CV,"['cs.CV', 'cs.AI']" +RankMatch: Exploring the Better Consistency Regularization for Semi-supervised Semantic Segmentation,Huayu Mai · Rui Sun · Tianzhu Zhang · Feng Wu, ,https://arxiv.org/abs/2312.08631,,2312.08631.pdf,Semi-supervised Semantic Segmentation Meets Masked Modeling:Fine-grained Locality Learning Matters in Consistency Regularization,"Semi-supervised semantic segmentation aims to utilize limited labeled images +and abundant unlabeled images to achieve label-efficient learning, wherein the +weak-to-strong consistency regularization framework, popularized by FixMatch, +is widely used as a benchmark scheme. Despite its effectiveness, we observe +that such scheme struggles with satisfactory segmentation for the local +regions. 
This can be because it originally stems from the image classification +task and lacks specialized mechanisms to capture fine-grained local semantics +that prioritizes in dense prediction. To address this issue, we propose a novel +framework called \texttt{MaskMatch}, which enables fine-grained locality +learning to achieve better dense segmentation. On top of the original +teacher-student framework, we design a masked modeling proxy task that +encourages the student model to predict the segmentation given the unmasked +image patches (even with 30\% only) and enforces the predictions to be +consistent with pseudo-labels generated by the teacher model using the complete +image. Such design is motivated by the intuition that if the predictions are +more consistent given insufficient neighboring information, stronger +fine-grained locality perception is achieved. Besides, recognizing the +importance of reliable pseudo-labels in the above locality learning and the +original consistency learning scheme, we design a multi-scale ensembling +strategy that considers context at different levels of abstraction for +pseudo-label generation. Extensive experiments on benchmark datasets +demonstrate the superiority of our method against previous approaches and its +plug-and-play flexibility.",cs.CV,['cs.CV'] +Jointly Training and Pruning CNNs via Learnable Agent Guidance and Alignment,Alireza Ganjdanesh · Shangqian Gao · Heng Huang, ,https://arxiv.org/abs/2403.19490,,2403.19490.pdf,Jointly Training and Pruning CNNs via Learnable Agent Guidance and Alignment,"Structural model pruning is a prominent approach used for reducing the +computational cost of Convolutional Neural Networks (CNNs) before their +deployment on resource-constrained devices. Yet, the majority of proposed ideas +require a pretrained model before pruning, which is costly to secure. In this +paper, we propose a novel structural pruning approach to jointly learn the +weights and structurally prune architectures of CNN models. The core element of +our method is a Reinforcement Learning (RL) agent whose actions determine the +pruning ratios of the CNN model's layers, and the resulting model's accuracy +serves as its reward. We conduct the joint training and pruning by iteratively +training the model's weights and the agent's policy, and we regularize the +model's weights to align with the selected structure by the agent. The evolving +model's weights result in a dynamic reward function for the agent, which +prevents using prominent episodic RL methods with stationary environment +assumption for our purpose. We address this challenge by designing a mechanism +to model the complex changing dynamics of the reward function and provide a +representation of it to the RL agent. To do so, we take a learnable embedding +for each training epoch and employ a recurrent model to calculate a +representation of the changing environment. We train the recurrent model and +embeddings using a decoder model to reconstruct observed rewards. Such a design +empowers our agent to effectively leverage episodic observations along with the +environment representations to learn a proper policy to determine performant +sub-networks of the CNN model. Our extensive experiments on CIFAR-10 and +ImageNet using ResNets and MobileNets demonstrate the effectiveness of our +method.",cs.CV,['cs.CV'] +WHAM: Reconstructing World-grounded Humans with Accurate 3D Motion,Soyong Shin · Juyong Kim · Eni Halilaj · Michael J. 
Black,https://wham.is.tue.mpg.de/,https://arxiv.org/abs/2312.07531,,2312.07531.pdf,WHAM: Reconstructing World-grounded Humans with Accurate 3D Motion,"The estimation of 3D human motion from video has progressed rapidly but +current methods still have several key limitations. First, most methods +estimate the human in camera coordinates. Second, prior work on estimating +humans in global coordinates often assumes a flat ground plane and produces +foot sliding. Third, the most accurate methods rely on computationally +expensive optimization pipelines, limiting their use to offline applications. +Finally, existing video-based methods are surprisingly less accurate than +single-frame methods. We address these limitations with WHAM (World-grounded +Humans with Accurate Motion), which accurately and efficiently reconstructs 3D +human motion in a global coordinate system from video. WHAM learns to lift 2D +keypoint sequences to 3D using motion capture data and fuses this with video +features, integrating motion context and visual information. WHAM exploits +camera angular velocity estimated from a SLAM method together with human motion +to estimate the body's global trajectory. We combine this with a contact-aware +trajectory refinement method that lets WHAM capture human motion in diverse +conditions, such as climbing stairs. WHAM outperforms all existing 3D human +motion recovery methods across multiple in-the-wild benchmarks. Code will be +available for research purposes at http://wham.is.tue.mpg.de/",cs.CV,['cs.CV'] +StrokeFaceNeRF: Stroke-based Facial Appearance Editing in Neural Radiance Field,Xiao-juan Li · Dingxi Zhang · Shu-Yu Chen · Feng-Lin Liu, ,https://arxiv.org/abs/2312.09913,,,LAENeRF: Local Appearance Editing for Neural Radiance Fields,"Due to the omnipresence of Neural Radiance Fields (NeRFs), the interest +towards editable implicit 3D representations has surged over the last years. +However, editing implicit or hybrid representations as used for NeRFs is +difficult due to the entanglement of appearance and geometry encoded in the +model parameters. Despite these challenges, recent research has shown first +promising steps towards photorealistic and non-photorealistic appearance edits. +The main open issues of related work include limited interactivity, a lack of +support for local edits and large memory requirements, rendering them less +useful in practice. We address these limitations with LAENeRF, a unified +framework for photorealistic and non-photorealistic appearance editing of +NeRFs. To tackle local editing, we leverage a voxel grid as starting point for +region selection. We learn a mapping from expected ray terminations to final +output color, which can optionally be supervised by a style loss, resulting in +a framework which can perform photorealistic and non-photorealistic appearance +editing of selected regions. Relying on a single point per ray for our mapping, +we limit memory requirements and enable fast optimization. To guarantee +interactivity, we compose the output color using a set of learned, modifiable +base colors, composed with additive layer mixing. Compared to concurrent work, +LAENeRF enables recoloring and stylization while keeping processing time low. 
+Furthermore, we demonstrate that our approach surpasses baseline methods both +quantitatively and qualitatively.",cs.CV,['cs.CV'] +Lift3D: Zero-Shot Lifting of Any 2D Vision Model to 3D,Mukund Varma T · Peihao Wang · Zhiwen Fan · Zhangyang Wang · Hao Su · Ravi Ramamoorthi,https://mukundvarmat.github.io/Lift3D/,https://arxiv.org/abs/2403.18922,,2403.18922.pdf,Lift3D: Zero-Shot Lifting of Any 2D Vision Model to 3D,"In recent years, there has been an explosion of 2D vision models for numerous +tasks such as semantic segmentation, style transfer or scene editing, enabled +by large-scale 2D image datasets. At the same time, there has been renewed +interest in 3D scene representations such as neural radiance fields from +multi-view images. However, the availability of 3D or multiview data is still +substantially limited compared to 2D image datasets, making extending 2D vision +models to 3D data highly desirable but also very challenging. Indeed, extending +a single 2D vision operator like scene editing to 3D typically requires a +highly creative method specialized to that task and often requires per-scene +optimization. In this paper, we ask the question of whether any 2D vision model +can be lifted to make 3D consistent predictions. We answer this question in the +affirmative; our new Lift3D method trains to predict unseen views on feature +spaces generated by a few visual models (i.e. DINO and CLIP), but then +generalizes to novel vision operators and tasks, such as style transfer, +super-resolution, open vocabulary segmentation and image colorization; for some +of these tasks, there is no comparable previous 3D method. In many cases, we +even outperform state-of-the-art methods specialized for the task in question. +Moreover, Lift3D is a zero-shot method, in the sense that it requires no +task-specific training, nor scene-specific optimization.",cs.CV,['cs.CV'] +Querying as Prompt: Parameter-Efficient Learning for Multimodal Language Model,Tian Liang · Jing Huang · Ming Kong · Luyuan Chen · Qiang Zhu, ,https://arxiv.org/html/2405.20654v1,,2405.20654v1.pdf,Passage-specific Prompt Tuning for Passage Reranking in Question Answering with Large Language Models,"Effective passage retrieval and reranking methods have been widely utilized +to identify suitable candidates in open-domain question answering tasks, recent +studies have resorted to LLMs for reranking the retrieved passages by the +log-likelihood of the question conditioned on each passage. Although these +methods have demonstrated promising results, the performance is notably +sensitive to the human-written prompt (or hard prompt), and fine-tuning LLMs +can be computationally intensive and time-consuming. Furthermore, this approach +limits the leverage of question-passage relevance pairs and passage-specific +knowledge to enhance the ranking capabilities of LLMs. In this paper, we +propose passage-specific prompt tuning for reranking in open-domain question +answering (PSPT): a parameter-efficient method that fine-tunes learnable +passage-specific soft prompts, incorporating passage-specific knowledge from a +limited set of question-passage relevance pairs. The method involves ranking +retrieved passages based on the log-likelihood of the model generating the +question conditioned on each passage and the learned soft prompt. 
We conducted +extensive experiments utilizing the Llama-2-chat-7B model across three publicly +available open-domain question answering datasets and the results demonstrate +the effectiveness of the proposed approach.",cs.CL,"['cs.CL', 'cs.IR']" +PairDETR : Joint Detection and Association of Human Bodies and Faces,Ammar Ali · Georgii Gaikov · Denis Rybalchenko · Alexander Chigorin · Ivan Laptev · Sergey Zagoruyko, ,https://arxiv.org/abs/2404.08450,,2404.08450.pdf,Joint Physical-Digital Facial Attack Detection Via Simulating Spoofing Clues,"Face recognition systems are frequently subjected to a variety of physical +and digital attacks of different types. Previous methods have achieved +satisfactory performance in scenarios that address physical attacks and digital +attacks, respectively. However, few methods are considered to integrate a model +that simultaneously addresses both physical and digital attacks, implying the +necessity to develop and maintain multiple models. To jointly detect physical +and digital attacks within a single model, we propose an innovative approach +that can adapt to any network architecture. Our approach mainly contains two +types of data augmentation, which we call Simulated Physical Spoofing Clues +augmentation (SPSC) and Simulated Digital Spoofing Clues augmentation (SDSC). +SPSC and SDSC augment live samples into simulated attack samples by simulating +spoofing clues of physical and digital attacks, respectively, which +significantly improve the capability of the model to detect ""unseen"" attack +types. Extensive experiments show that SPSC and SDSC can achieve +state-of-the-art generalization in Protocols 2.1 and 2.2 of the UniAttackData +dataset, respectively. Our method won first place in ""Unified Physical-Digital +Face Attack Detection"" of the 5th Face Anti-spoofing Challenge@CVPR2024. Our +final submission obtains 3.75% APCER, 0.93% BPCER, and 2.34% ACER, +respectively. Our code is available at +https://github.com/Xianhua-He/cvpr2024-face-anti-spoofing-challenge.",cs.CV,['cs.CV'] +Learning Dynamic Tetrahedra for High-Quality Talking Head Synthesis,Zicheng Zhang · RUOBING ZHENG · Bonan Li · Congying Han · Tianqi Li · Meng Wang · Tiande Guo · Jingdong Chen · Ziwen Liu · Ming Yang, ,https://arxiv.org/abs/2402.17364v1,,2402.17364v1.pdf,Learning Dynamic Tetrahedra for High-Quality Talking Head Synthesis,"Recent works in implicit representations, such as Neural Radiance Fields +(NeRF), have advanced the generation of realistic and animatable head avatars +from video sequences. These implicit methods are still confronted by visual +artifacts and jitters, since the lack of explicit geometric constraints poses a +fundamental challenge in accurately modeling complex facial deformations. In +this paper, we introduce Dynamic Tetrahedra (DynTet), a novel hybrid +representation that encodes explicit dynamic meshes by neural networks to +ensure geometric consistency across various motions and viewpoints. DynTet is +parameterized by the coordinate-based networks which learn signed distance, +deformation, and material texture, anchoring the training data into a +predefined tetrahedra grid. Leveraging Marching Tetrahedra, DynTet efficiently +decodes textured meshes with a consistent topology, enabling fast rendering +through a differentiable rasterizer and supervision via a pixel loss. To +enhance training efficiency, we incorporate classical 3D Morphable Models to +facilitate geometry learning and define a canonical space for simplifying +texture learning. 
These advantages are readily achievable owing to the +effective geometric representation employed in DynTet. Compared with prior +works, DynTet demonstrates significant improvements in fidelity, lip +synchronization, and real-time performance according to various metrics. Beyond +producing stable and visually appealing synthesis videos, our method also +outputs the dynamic meshes which is promising to enable many emerging +applications.",cs.CV,['cs.CV'] +SuGaR: Surface-Aligned Gaussian Splatting for Efficient 3D Mesh Reconstruction and High-Quality Mesh Rendering,Antoine Guédon · Vincent Lepetit,https://anttwo.github.io/sugar/,,https://huggingface.co/papers/2311.12775,,,,,nan +Composed Video Retrieval via Enriched Context and Discriminative Embeddings,Omkar Thawakar · Muzammal Naseer · Rao Anwer · Salman Khan · Michael Felsberg · Mubarak Shah · Fahad Shahbaz Khan, ,https://arxiv.org/abs/2403.16997,,2403.16997.pdf,Composed Video Retrieval via Enriched Context and Discriminative Embeddings,"Composed video retrieval (CoVR) is a challenging problem in computer vision +which has recently highlighted the integration of modification text with visual +queries for more sophisticated video search in large databases. Existing works +predominantly rely on visual queries combined with modification text to +distinguish relevant videos. However, such a strategy struggles to fully +preserve the rich query-specific context in retrieved target videos and only +represents the target video using visual embedding. We introduce a novel CoVR +framework that leverages detailed language descriptions to explicitly encode +query-specific contextual information and learns discriminative embeddings of +vision only, text only and vision-text for better alignment to accurately +retrieve matched target videos. Our proposed framework can be flexibly employed +for both composed video (CoVR) and image (CoIR) retrieval tasks. Experiments on +three datasets show that our approach obtains state-of-the-art performance for +both CovR and zero-shot CoIR tasks, achieving gains as high as around 7% in +terms of recall@K=1 score. Our code, models, detailed language descriptions for +WebViD-CoVR dataset are available at +\url{https://github.com/OmkarThawakar/composed-video-retrieval}",cs.CV,['cs.CV'] +Distilling Vision-Language Models on Millions of Videos,Yue Zhao · Long Zhao · Xingyi Zhou · Jialin Wu · Chun-Te Chu · Hui Miao · Florian Schroff · Hartwig Adam · Ting Liu · Boqing Gong · Philipp Krähenbühl · Liangzhe Yuan, ,https://arxiv.org/abs/2401.06129,,2401.06129.pdf,Distilling Vision-Language Models on Millions of Videos,"The recent advance in vision-language models is largely attributed to the +abundance of image-text data. We aim to replicate this success for +video-language models, but there simply is not enough human-curated video-text +data available. We thus resort to fine-tuning a video-language model from a +strong image-language baseline with synthesized instructional data. The +resulting video model by video-instruction-tuning (VIIT) is then used to +auto-label millions of videos to generate high-quality captions. We show the +adapted video-language model performs well on a wide range of video-language +benchmarks. For instance, it surpasses the best prior result on open-ended +NExT-QA by 2.8%. Besides, our model generates detailed descriptions for +previously unseen videos, which provide better textual supervision than +existing methods. 
Experiments show that a video-language dual-encoder model +contrastively trained on these auto-generated captions is 3.8% better than the +strongest baseline that also leverages vision-language models. Our best model +outperforms state-of-the-art methods on MSR-VTT zero-shot text-to-video +retrieval by 6%. As a side product, we generate the largest video caption +dataset to date.",cs.CV,['cs.CV'] +ConsistDreamer: 3D-Consistent 2D Diffusion for High-Fidelity Scene Editing,Junkun Chen · Samuel Rota Bulò · Norman Müller · Lorenzo Porzi · Peter Kontschieder · Yu-Xiong Wang, ,https://arxiv.org/abs/2308.13223,,,EfficientDreamer: High-Fidelity and Robust 3D Creation via Orthogonal-view Diffusion Prior,"While image diffusion models have made significant progress in text-driven 3D +content creation, they often fail to accurately capture the intended meaning of +text prompts, especially for view information. This limitation leads to the +Janus problem, where multi-faced 3D models are generated under the guidance of +such diffusion models. In this paper, we propose a robust high-quality 3D +content generation pipeline by exploiting orthogonal-view image guidance. +First, we introduce a novel 2D diffusion model that generates an image +consisting of four orthogonal-view sub-images based on the given text prompt. +Then, the 3D content is created using this diffusion model. Notably, the +generated orthogonal-view image provides strong geometric structure priors and +thus improves 3D consistency. As a result, it effectively resolves the Janus +problem and significantly enhances the quality of 3D content creation. +Additionally, we present a 3D synthesis fusion network that can further improve +the details of the generated 3D contents. Both quantitative and qualitative +evaluations demonstrate that our method surpasses previous text-to-3D +techniques. Project page: https://efficientdreamer.github.io.",cs.CV,['cs.CV'] +Bilateral Adaptation for Human-Object Interaction Detection with Occlusion-Robustness,Guangzhi Wang · Yangyang Guo · Ziwei Xu · Mohan Kankanhalli, ,https://arxiv.org/abs/2307.10499,,,Mining Conditional Part Semantics with Occluded Extrapolation for Human-Object Interaction Detection,"Human-Object Interaction Detection is a crucial aspect of human-centric scene +understanding, with important applications in various domains. Despite recent +progress in this field, recognizing subtle and detailed interactions remains +challenging. Existing methods try to use human-related clues to alleviate the +difficulty, but rely heavily on external annotations or knowledge, limiting +their practical applicability in real-world scenarios. In this work, we propose +a novel Part Semantic Network (PSN) to solve this problem. The core of PSN is a +Conditional Part Attention (CPA) mechanism, where human features are taken as +keys and values, and the object feature is used as query for the computation in +a cross-attention mechanism. In this way, our model learns to automatically +focus on the most informative human parts conditioned on the involved object, +generating more semantically meaningful features for interaction recognition. +Additionally, we propose an Occluded Part Extrapolation (OPE) strategy to +facilitate interaction recognition under occluded scenarios, which teaches the +model to extrapolate detailed features from partially occluded ones. Our method +consistently outperforms prior approaches on the V-COCO and HICO-DET datasets, +without external data or extra annotations. 
Additional ablation studies +validate the effectiveness of each component of our proposed method.",cs.CV,['cs.CV'] +Multi-modal learning for geospatial vegetation forecasting,Vitus Benson · Claire Robin · Christian Requena-Mesa · LAZARO ALONSO SILVA · Mélanie Weynants · Nora Linscheid · Jose Cortes · Zhihan Gao · Nuno Carvalhais · Markus Reichstein, ,https://arxiv.org/html/2405.20161v1,,2405.20161v1.pdf,Landslide mapping from Sentinel-2 imagery through change detection,"Landslides are one of the most critical and destructive geohazards. +Widespread development of human activities and settlements combined with the +effects of climate change on weather are resulting in a high increase in the +frequency and destructive power of landslides, making them a major threat to +human life and the economy. In this paper, we explore methodologies to map +newly-occurred landslides using Sentinel-2 imagery automatically. All +approaches presented are framed as a bi-temporal change detection problem, +requiring only a pair of Sentinel-2 images, taken respectively before and after +a landslide-triggering event. Furthermore, we introduce a novel deep learning +architecture for fusing Sentinel-2 bi-temporal image pairs with Digital +Elevation Model (DEM) data, showcasing its promising performances w.r.t. other +change detection models in the literature. As a parallel task, we address +limitations in existing datasets by creating a novel geodatabase, which +includes manually validated open-access landslide inventories over +heterogeneous ecoregions of the world. We release both code and dataset with an +open-source license.",cs.CV,"['cs.CV', 'eess.IV']" +LISA: Reasoning Segmentation via Large Language Model,Xin Lai · Zhuotao Tian · Yukang Chen · Yanwei Li · Yuhui Yuan · Shu Liu · Jiaya Jia, ,https://arxiv.org/abs/2308.00692,,2308.00692.pdf,LISA: Reasoning Segmentation via Large Language Model,"Although perception systems have made remarkable advancements in recent +years, they still rely on explicit human instruction or pre-defined categories +to identify the target objects before executing visual recognition tasks. Such +systems cannot actively reason and comprehend implicit user intention. In this +work, we propose a new segmentation task -- reasoning segmentation. The task is +designed to output a segmentation mask given a complex and implicit query text. +Furthermore, we establish a benchmark comprising over one thousand +image-instruction-mask data samples, incorporating intricate reasoning and +world knowledge for evaluation purposes. Finally, we present LISA: large +Language Instructed Segmentation Assistant, which inherits the language +generation capabilities of multimodal Large Language Models (LLMs) while also +possessing the ability to produce segmentation masks. We expand the original +vocabulary with a <SEG> token and propose the embedding-as-mask paradigm to +unlock the segmentation capability. Remarkably, LISA can handle cases involving +complex reasoning and world knowledge. Also, it demonstrates robust zero-shot +capability when trained exclusively on reasoning-free datasets. In addition, +fine-tuning the model with merely 239 reasoning segmentation data samples +results in further performance enhancement. Both quantitative and qualitative +experiments show our method effectively unlocks new reasoning segmentation +capabilities for multimodal LLMs. 
Code, models, and data are available at +https://github.com/dvlab-research/LISA.",cs.CV,['cs.CV'] +Instruct 4D-to-4D: Editing 4D Scenes as Pseudo-3D Scenes Using 2D Diffusion,Linzhan Mou · Junkun Chen · Yu-Xiong Wang, ,https://arxiv.org/abs/2306.09551,,2306.09551.pdf,Edit-DiffNeRF: Editing 3D Neural Radiance Fields using 2D Diffusion Model,"Recent research has demonstrated that the combination of pretrained diffusion +models with neural radiance fields (NeRFs) has emerged as a promising approach +for text-to-3D generation. Simply coupling NeRF with diffusion models will +result in cross-view inconsistency and degradation of stylized view syntheses. +To address this challenge, we propose the Edit-DiffNeRF framework, which is +composed of a frozen diffusion model, a proposed delta module to edit the +latent semantic space of the diffusion model, and a NeRF. Instead of training +the entire diffusion for each scene, our method focuses on editing the latent +semantic space in frozen pretrained diffusion models by the delta module. This +fundamental change to the standard diffusion framework enables us to make +fine-grained modifications to the rendered views and effectively consolidate +these instructions in a 3D scene via NeRF training. As a result, we are able to +produce an edited 3D scene that faithfully aligns to input text instructions. +Furthermore, to ensure semantic consistency across different viewpoints, we +propose a novel multi-view semantic consistency loss that extracts a latent +semantic embedding from the input view as a prior, and aim to reconstruct it in +different views. Our proposed method has been shown to effectively edit +real-world 3D scenes, resulting in 25% improvement in the alignment of the +performed 3D edits with text instructions compared to prior work.",cs.CV,['cs.CV'] +RepAn: Enhanced Annealing through Re-parameterization,Xiang Fei · Xiawu Zheng · Yan Wang · Fei Chao · Chenglin Wu · Liujuan Cao, ,,https://dilithjay.com/blog/the-reparameterization-trick-clearly-explained,,,,,nan +EfficientSAM: Leveraged Masked Image Pretraining for Efficient Segment Anything,Yunyang Xiong · Balakrishnan Varadarajan · Lemeng Wu · Xiaoyu Xiang · Fanyi Xiao · Chenchen Zhu · Xiaoliang Dai · Dilin Wang · Fei Sun · Forrest Iandola · Raghuraman Krishnamoorthi · Vikas Chandra, ,https://arxiv.org/abs/2312.00863,,2312.00863.pdf,EfficientSAM: Leveraged Masked Image Pretraining for Efficient Segment Anything,"Segment Anything Model (SAM) has emerged as a powerful tool for numerous +vision applications. A key component that drives the impressive performance for +zero-shot transfer and high versatility is a super large Transformer model +trained on the extensive high-quality SA-1B dataset. While beneficial, the huge +computation cost of SAM model has limited its applications to wider real-world +applications. To address this limitation, we propose EfficientSAMs, +light-weight SAM models that exhibits decent performance with largely reduced +complexity. Our idea is based on leveraging masked image pretraining, SAMI, +which learns to reconstruct features from SAM image encoder for effective +visual representation learning. Further, we take SAMI-pretrained light-weight +image encoders and mask decoder to build EfficientSAMs, and finetune the models +on SA-1B for segment anything task. 
We perform evaluations on multiple vision +tasks including image classification, object detection, instance segmentation, +and semantic object detection, and find that our proposed pretraining method, +SAMI, consistently outperforms other masked image pretraining methods. On +segment anything task such as zero-shot instance segmentation, our +EfficientSAMs with SAMI-pretrained lightweight image encoders perform favorably +with a significant gain (e.g., ~4 AP on COCO/LVIS) over other fast SAM models.",cs.CV,['cs.CV'] +Strong Transferable Adversarial Attacks via Ensembled Asymptotically Normal Distribution Learning,Zhengwei Fang · Rui Wang · Tao Huang · Liping Jing, ,https://arxiv.org/abs/2308.02897,,2308.02897.pdf,An Adaptive Model Ensemble Adversarial Attack for Boosting Adversarial Transferability,"While the transferability property of adversarial examples allows the +adversary to perform black-box attacks (i.e., the attacker has no knowledge +about the target model), the transfer-based adversarial attacks have gained +great attention. Previous works mostly study gradient variation or image +transformations to amplify the distortion on critical parts of inputs. These +methods can work on transferring across models with limited differences, i.e., +from CNNs to CNNs, but always fail in transferring across models with wide +differences, such as from CNNs to ViTs. Alternatively, model ensemble +adversarial attacks are proposed to fuse outputs from surrogate models with +diverse architectures to get an ensemble loss, making the generated adversarial +example more likely to transfer to other models as it can fool multiple models +concurrently. However, existing ensemble attacks simply fuse the outputs of the +surrogate models evenly, thus are not efficacious to capture and amplify the +intrinsic transfer information of adversarial examples. In this paper, we +propose an adaptive ensemble attack, dubbed AdaEA, to adaptively control the +fusion of the outputs from each model, via monitoring the discrepancy ratio of +their contributions towards the adversarial objective. Furthermore, an extra +disparity-reduced filter is introduced to further synchronize the update +direction. As a result, we achieve considerable improvement over the existing +ensemble attacks on various datasets, and the proposed AdaEA can also boost +existing transfer-based attacks, which further demonstrates its efficacy and +versatility.",cs.CV,['cs.CV'] +AlignSAM: Aligning Segment Anything Model to Open Context via Reinforcement Learning,Duojun Huang · Xinyu Xiong · Jie Ma · Jichang Li · Zequn Jie · Lin Ma · Guanbin Li, ,https://arxiv.org/abs/2312.03628,,2312.03628.pdf,Boosting Segment Anything Model Towards Open-Vocabulary Learning,"The recent Segment Anything Model (SAM) has emerged as a new paradigmatic +vision foundation model, showcasing potent zero-shot generalization and +flexible prompting. Despite SAM finding applications and adaptations in various +domains, its primary limitation lies in the inability to grasp object +semantics. In this paper, we present Sambor to seamlessly integrate SAM with +the open-vocabulary object detector in an end-to-end framework. While retaining +all the remarkable capabilities inherent to SAM, we enhance it with the +capacity to detect arbitrary objects based on human inputs like category names +or reference expressions. 
To accomplish this, we introduce a novel SideFormer +module that extracts SAM features to facilitate zero-shot object localization +and inject comprehensive semantic information for open-vocabulary recognition. +In addition, we devise an open-set region proposal network (Open-set RPN), +enabling the detector to acquire the open-set proposals generated by SAM. +Sambor demonstrates superior zero-shot performance across benchmarks, including +COCO and LVIS, proving highly competitive against previous SoTA methods. We +aspire for this work to serve as a meaningful endeavor in endowing SAM to +recognize diverse object categories and advancing open-vocabulary learning with +the support of vision foundation models.",cs.CV,['cs.CV'] +Consistency and Uncertainty: Identifying Unreliable Responses From Black-Box Vision-Language Models for Selective Visual Question Answering,Zaid Khan · Yun Fu, ,https://arxiv.org/abs/2404.10193,,2404.10193.pdf,Consistency and Uncertainty: Identifying Unreliable Responses From Black-Box Vision-Language Models for Selective Visual Question Answering,"The goal of selective prediction is to allow an a model to abstain when it +may not be able to deliver a reliable prediction, which is important in +safety-critical contexts. Existing approaches to selective prediction typically +require access to the internals of a model, require retraining a model or study +only unimodal models. However, the most powerful models (e.g. GPT-4) are +typically only available as black boxes with inaccessible internals, are not +retrainable by end-users, and are frequently used for multimodal tasks. We +study the possibility of selective prediction for vision-language models in a +realistic, black-box setting. We propose using the principle of +\textit{neighborhood consistency} to identify unreliable responses from a +black-box vision-language model in question answering tasks. We hypothesize +that given only a visual question and model response, the consistency of the +model's responses over the neighborhood of a visual question will indicate +reliability. It is impossible to directly sample neighbors in feature space in +a black-box setting. Instead, we show that it is possible to use a smaller +proxy model to approximately sample from the neighborhood. We find that +neighborhood consistency can be used to identify model responses to visual +questions that are likely unreliable, even in adversarial settings or settings +that are out-of-distribution to the proxy model.",cs.CV,['cs.CV'] +Distribution-aware Knowledge Prototyping for Non-exemplar Lifelong Person Re-identification,Kunlun Xu · Xu Zou · Yuxin Peng · Jiahuan Zhou, ,https://arxiv.org/abs/2405.19005,,2405.19005.pdf,Auto-selected Knowledge Adapters for Lifelong Person Re-identification,"Lifelong Person Re-Identification (LReID) extends traditional ReID by +requiring systems to continually learn from non-overlapping datasets across +different times and locations, adapting to new identities while preserving +knowledge of previous ones. Existing approaches, either rehearsal-free or +rehearsal-based, still suffer from the problem of catastrophic forgetting since +they try to cram diverse knowledge into one fixed model. To overcome this +limitation, we introduce a novel framework AdalReID, that adopts knowledge +adapters and a parameter-free auto-selection mechanism for lifelong learning. 
+Concretely, we incrementally build distinct adapters to learn domain-specific +knowledge at each step, which can effectively learn and preserve knowledge +across different datasets. Meanwhile, the proposed auto-selection strategy +adaptively calculates the knowledge similarity between the input set and the +adapters. On the one hand, the appropriate adapters are selected for the inputs +to process ReID, and on the other hand, the knowledge interaction and fusion +between adapters are enhanced to improve the generalization ability of the +model. Extensive experiments are conducted to demonstrate the superiority of +our AdalReID, which significantly outperforms SOTAs by about 10$\sim$20\% mAP +on both seen and unseen domains.",cs.CV,['cs.CV'] +Looking 3D: Anomaly Detection with 2D-3D Alignment,Ankan Kumar Bhunia · Changjian Li · Hakan Bilen,https://github.com/VICO-UoE/Looking3D,https://arxiv.org/abs/2311.14897,,2311.14897.pdf,Towards Scalable 3D Anomaly Detection and Localization: A Benchmark via 3D Anomaly Synthesis and A Self-Supervised Learning Network,"Recently, 3D anomaly detection, a crucial problem involving fine-grained +geometry discrimination, is getting more attention. However, the lack of +abundant real 3D anomaly data limits the scalability of current models. To +enable scalable anomaly data collection, we propose a 3D anomaly synthesis +pipeline to adapt existing large-scale 3Dmodels for 3D anomaly detection. +Specifically, we construct a synthetic dataset, i.e., Anomaly-ShapeNet, basedon +ShapeNet. Anomaly-ShapeNet consists of 1600 point cloud samples under 40 +categories, which provides a rich and varied collection of data, enabling +efficient training and enhancing adaptability to industrial scenarios. +Meanwhile,to enable scalable representation learning for 3D anomaly +localization, we propose a self-supervised method, i.e., Iterative Mask +Reconstruction Network (IMRNet). During training, we propose a geometry-aware +sample module to preserve potentially anomalous local regions during point +cloud down-sampling. Then, we randomly mask out point patches and sent the +visible patches to a transformer for reconstruction-based self-supervision. +During testing, the point cloud repeatedly goes through the Mask Reconstruction +Network, with each iteration's output becoming the next input. By merging and +contrasting the final reconstructed point cloud with the initial input, our +method successfully locates anomalies. Experiments show that IMRNet outperforms +previous state-of-the-art methods, achieving 66.1% in I-AUC on Anomaly-ShapeNet +dataset and 72.5% in I-AUC on Real3D-AD dataset. Our dataset will be released +at https://github.com/Chopper-233/Anomaly-ShapeNet",cs.CV,['cs.CV'] +Towards Scalable 3D Anomaly Detection and Localization: A Benchmark via 3D Anomaly Synthesis and A Self-Supervised Learning Network,wenqiao Li · Xiaohao Xu · Yao Gu · BoZhong Zheng · Shenghua Gao · Yingna Wu, ,https://arxiv.org/abs/2311.14897,,,Towards Scalable 3D Anomaly Detection and Localization: A Benchmark via 3D Anomaly Synthesis and A Self-Supervised Learning Network,"Recently, 3D anomaly detection, a crucial problem involving fine-grained +geometry discrimination, is getting more attention. However, the lack of +abundant real 3D anomaly data limits the scalability of current models. To +enable scalable anomaly data collection, we propose a 3D anomaly synthesis +pipeline to adapt existing large-scale 3Dmodels for 3D anomaly detection. 
+Specifically, we construct a synthetic dataset, i.e., Anomaly-ShapeNet, basedon +ShapeNet. Anomaly-ShapeNet consists of 1600 point cloud samples under 40 +categories, which provides a rich and varied collection of data, enabling +efficient training and enhancing adaptability to industrial scenarios. +Meanwhile,to enable scalable representation learning for 3D anomaly +localization, we propose a self-supervised method, i.e., Iterative Mask +Reconstruction Network (IMRNet). During training, we propose a geometry-aware +sample module to preserve potentially anomalous local regions during point +cloud down-sampling. Then, we randomly mask out point patches and sent the +visible patches to a transformer for reconstruction-based self-supervision. +During testing, the point cloud repeatedly goes through the Mask Reconstruction +Network, with each iteration's output becoming the next input. By merging and +contrasting the final reconstructed point cloud with the initial input, our +method successfully locates anomalies. Experiments show that IMRNet outperforms +previous state-of-the-art methods, achieving 66.1% in I-AUC on Anomaly-ShapeNet +dataset and 72.5% in I-AUC on Real3D-AD dataset. Our dataset will be released +at https://github.com/Chopper-233/Anomaly-ShapeNet",cs.CV,['cs.CV'] +Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data,Lihe Yang · Bingyi Kang · Zilong Huang · Xiaogang Xu · Jiashi Feng · Hengshuang Zhao, ,https://arxiv.org/abs/2401.10891,,2401.10891.pdf,Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data,"This work presents Depth Anything, a highly practical solution for robust +monocular depth estimation. Without pursuing novel technical modules, we aim to +build a simple yet powerful foundation model dealing with any images under any +circumstances. To this end, we scale up the dataset by designing a data engine +to collect and automatically annotate large-scale unlabeled data (~62M), which +significantly enlarges the data coverage and thus is able to reduce the +generalization error. We investigate two simple yet effective strategies that +make data scaling-up promising. First, a more challenging optimization target +is created by leveraging data augmentation tools. It compels the model to +actively seek extra visual knowledge and acquire robust representations. +Second, an auxiliary supervision is developed to enforce the model to inherit +rich semantic priors from pre-trained encoders. We evaluate its zero-shot +capabilities extensively, including six public datasets and randomly captured +photos. It demonstrates impressive generalization ability. Further, through +fine-tuning it with metric depth information from NYUv2 and KITTI, new SOTAs +are set. Our better depth model also results in a better depth-conditioned +ControlNet. 
Our models are released at +https://github.com/LiheYoung/Depth-Anything.",cs.CV,['cs.CV'] +SPOC: Imitating Shortest Paths in Simulation Enables Effective Navigation and Manipulation in the Real World,Kiana Ehsani · Tanmay Gupta · Rose Hendrix · Jordi Salvador · Luca Weihs · Kuo-Hao Zeng · Kunal Singh Singh · Yejin Kim · Winson Han · Alvaro Herrasti · Ranjay Krishna · Dustin Schwenk · Eli VanderBilt · Aniruddha Kembhavi, ,https://arxiv.org/abs/2312.02976,,2312.02976.pdf,Imitating Shortest Paths in Simulation Enables Effective Navigation and Manipulation in the Real World,"Reinforcement learning (RL) with dense rewards and imitation learning (IL) +with human-generated trajectories are the most widely used approaches for +training modern embodied agents. RL requires extensive reward shaping and +auxiliary losses and is often too slow and ineffective for long-horizon tasks. +While IL with human supervision is effective, collecting human trajectories at +scale is extremely expensive. In this work, we show that imitating +shortest-path planners in simulation produces agents that, given a language +instruction, can proficiently navigate, explore, and manipulate objects in both +simulation and in the real world using only RGB sensors (no depth map or GPS +coordinates). This surprising result is enabled by our end-to-end, +transformer-based, SPOC architecture, powerful visual encoders paired with +extensive image augmentation, and the dramatic scale and diversity of our +training data: millions of frames of shortest-path-expert trajectories +collected inside approximately 200,000 procedurally generated houses containing +40,000 unique 3D assets. Our models, data, training code, and newly proposed +10-task benchmarking suite CHORES will be open-sourced.",cs.RO,"['cs.RO', 'cs.AI', 'cs.CV']" +A Unified and Interpretable Emotion Representation and Expression Generation,Reni Paskaleva · Mykyta Holubakha · Andela Ilic · Saman Motamed · Luc Van Gool · Danda Paudel,https://emotion-diffusion.github.io/,https://arxiv.org/abs/2404.01243,,2404.01243.pdf,A Unified and Interpretable Emotion Representation and Expression Generation,"Canonical emotions, such as happy, sad, and fearful, are easy to understand +and annotate. However, emotions are often compound, e.g. happily surprised, and +can be mapped to the action units (AUs) used for expressing emotions, and +trivially to the canonical ones. Intuitively, emotions are continuous as +represented by the arousal-valence (AV) model. An interpretable unification of +these four modalities - namely, Canonical, Compound, AUs, and AV - is highly +desirable, for a better representation and understanding of emotions. However, +such unification remains to be unknown in the current literature. In this work, +we propose an interpretable and unified emotion model, referred as C2A2. We +also develop a method that leverages labels of the non-unified models to +annotate the novel unified one. Finally, we modify the text-conditional +diffusion models to understand continuous numbers, which are then used to +generate continuous expressions using our unified emotion model. Through +quantitative and qualitative experiments, we show that our generated images are +rich and capture subtle expressions. 
Our work allows a fine-grained generation +of expressions in conjunction with other textual inputs and offers a new label +space for emotions at the same time.",cs.CV,['cs.CV'] +Regularized Parameter Uncertainty for Improving Generalization in Reinforcement Learning,Pehuen Moure · Longbiao Cheng · Joachim Ott · Zuowen Wang · Shih-Chii Liu, ,,https://arxiv.org/pdf/2207.02016v4,,,,,nan +Understanding Video Transformers via Universal Concept Discovery,Matthew Kowal · Achal Dave · Rares Andrei Ambrus · Adrien Gaidon · Kosta Derpanis · Pavel Tokmakov,https://yorkucvil.github.io/VTCD/,https://arxiv.org/abs/2401.10831,,,Understanding Video Transformers via Universal Concept Discovery,"This paper studies the problem of concept-based interpretability of +transformer representations for videos. Concretely, we seek to explain the +decision-making process of video transformers based on high-level, +spatiotemporal concepts that are automatically discovered. Prior research on +concept-based interpretability has concentrated solely on image-level tasks. +Comparatively, video models deal with the added temporal dimension, increasing +complexity and posing challenges in identifying dynamic concepts over time. In +this work, we systematically address these challenges by introducing the first +Video Transformer Concept Discovery (VTCD) algorithm. To this end, we propose +an efficient approach for unsupervised identification of units of video +transformer representations - concepts, and ranking their importance to the +output of a model. The resulting concepts are highly interpretable, revealing +spatio-temporal reasoning mechanisms and object-centric representations in +unstructured video models. Performing this analysis jointly over a diverse set +of supervised and self-supervised representations, we discover that some of +these mechanism are universal in video transformers. Finally, we show that VTCD +can be used for fine-grained action recognition and video object segmentation.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG', 'cs.RO']" +HIPTrack: Visual Tracking with Historical Prompts,Wenrui Cai · Qingjie Liu · Yunhong Wang, ,https://arxiv.org/abs/2311.02072,,2311.02072.pdf,HIPTrack: Visual Tracking with Historical Prompts,"Trackers that follow Siamese paradigm utilize similarity matching between +template and search region features for tracking. Many methods have been +explored to enhance tracking performance by incorporating tracking history to +better handle scenarios involving target appearance variations such as +deformation and occlusion. However, the utilization of historical information +in existing methods is insufficient and incomprehensive, which typically +requires repetitive training and introduces a large amount of computation. In +this paper, we show that by providing a tracker that follows Siamese paradigm +with precise and updated historical information, a significant performance +improvement can be achieved with completely unchanged parameters. Based on +this, we propose a historical prompt network that uses refined historical +foreground masks and historical visual features of the target to provide +comprehensive and precise prompts for the tracker. We build a novel tracker +called HIPTrack based on the historical prompt network, which achieves +considerable performance improvements without the need to retrain the entire +model. 
We conduct experiments on seven datasets and experimental results +demonstrate that our method surpasses the current state-of-the-art trackers on +LaSOT, LaSOText, GOT-10k and NfS. Furthermore, the historical prompt network +can seamlessly integrate as a plug-and-play module into existing trackers, +providing performance enhancements. The source code is available at +https://github.com/WenRuiCai/HIPTrack.",cs.CV,['cs.CV'] +Self-supervised Representation Learning from Arbitrary Scenarios,Zhaowen Li · Yousong Zhu · Zhiyang Chen · Zongxin Gao · Rui Zhao · Chaoyang Zhao · Ming Tang · Jinqiao Wang, ,https://arxiv.org/abs/2403.03740,,2403.03740.pdf,Self-supervised Photographic Image Layout Representation Learning,"In the domain of image layout representation learning, the critical process +of translating image layouts into succinct vector forms is increasingly +significant across diverse applications, such as image retrieval, manipulation, +and generation. Most approaches in this area heavily rely on costly labeled +datasets and notably lack in adapting their modeling and learning methods to +the specific nuances of photographic image layouts. This shortfall makes the +learning process for photographic image layouts suboptimal. In our research, we +directly address these challenges. We innovate by defining basic layout +primitives that encapsulate various levels of layout information and by mapping +these, along with their interconnections, onto a heterogeneous graph structure. +This graph is meticulously engineered to capture the intricate layout +information within the pixel domain explicitly. Advancing further, we introduce +novel pretext tasks coupled with customized loss functions, strategically +designed for effective self-supervised learning of these layout graphs. +Building on this foundation, we develop an autoencoder-based network +architecture skilled in compressing these heterogeneous layout graphs into +precise, dimensionally-reduced layout representations. Additionally, we +introduce the LODB dataset, which features a broader range of layout categories +and richer semantics, serving as a comprehensive benchmark for evaluating the +effectiveness of layout representation learning methods. Our extensive +experimentation on this dataset demonstrates the superior performance of our +approach in the realm of photographic image layout representation learning.",cs.CV,"['cs.CV', 'cs.MM']" +Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation,Yunhao Ge · Xiaohui Zeng · Jacob Huffman · Tsung-Yi Lin · Ming-Yu Liu · Yin Cui, ,https://arxiv.org/abs/2404.19752,,2404.19752.pdf,Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation,"Existing automatic captioning methods for visual content face challenges such +as lack of detail, content hallucination, and poor instruction following. In +this work, we propose VisualFactChecker (VFC), a flexible training-free +pipeline that generates high-fidelity and detailed captions for both 2D images +and 3D objects. VFC consists of three steps: 1) proposal, where image-to-text +captioning models propose multiple initial captions; 2) verification, where a +large language model (LLM) utilizes tools such as object detection and VQA +models to fact-check proposed captions; 3) captioning, where an LLM generates +the final caption by summarizing caption proposals and the fact check +verification results. In this step, VFC can flexibly generate captions in +various styles following complex instructions. 
We conduct comprehensive +captioning evaluations using four metrics: 1) CLIP-Score for image-text +similarity; 2) CLIP-Image-Score for measuring the image-image similarity +between the original and the reconstructed image generated by a text-to-image +model using the caption. 3) human study on Amazon Mechanical Turk; 4) GPT-4V +for fine-grained evaluation. Evaluation results show that VFC outperforms +state-of-the-art open-sourced captioning methods for 2D images on the COCO +dataset and 3D assets on the Objaverse dataset. Our study demonstrates that by +combining open-source models into a pipeline, we can attain captioning +capability comparable to proprietary models such as GPT-4V, despite being over +10x smaller in model size.",cs.CV,['cs.CV'] +NViST: In the Wild New View Synthesis from a Single Image with Transformers,Wonbong Jang · Lourdes Agapito, ,https://arxiv.org/abs/2312.08568,,2312.08568.pdf,NViST: In the Wild New View Synthesis from a Single Image with Transformers,"We propose NViST, a transformer-based model for efficient and generalizable +novel-view synthesis from a single image for real-world scenes. In contrast to +many methods that are trained on synthetic data, object-centred scenarios, or +in a category-specific manner, NViST is trained on MVImgNet, a large-scale +dataset of casually-captured real-world videos of hundreds of object categories +with diverse backgrounds. NViST transforms image inputs directly into a +radiance field, conditioned on camera parameters via adaptive layer +normalisation. In practice, NViST exploits fine-tuned masked autoencoder (MAE) +features and translates them to 3D output tokens via cross-attention, while +addressing occlusions with self-attention. To move away from object-centred +datasets and enable full scene synthesis, NViST adopts a 6-DOF camera pose +model and only requires relative pose, dropping the need for canonicalization +of the training data, which removes a substantial barrier to it being used on +casually captured datasets. We show results on unseen objects and categories +from MVImgNet and even generalization to casual phone captures. We conduct +qualitative and quantitative evaluations on MVImgNet and ShapeNet to show that +our model represents a step forward towards enabling true in-the-wild +generalizable novel-view synthesis from a single image. Project webpage: +https://wbjang.github.io/nvist_webpage.",cs.CV,['cs.CV'] +Space-time Diffusion Features for Zero-shot Text-driven Motion Transfer,Rafail Fridman · Danah Yatim · Omer Bar-Tal · Yoni Kasten · Tali Dekel,https://diffusion-motion-transfer.github.io/,https://arxiv.org/abs/2311.17009,,2311.17009.pdf,Space-Time Diffusion Features for Zero-Shot Text-Driven Motion Transfer,"We present a new method for text-driven motion transfer - synthesizing a +video that complies with an input text prompt describing the target objects and +scene while maintaining an input video's motion and scene layout. Prior methods +are confined to transferring motion across two subjects within the same or +closely related object categories and are applicable for limited domains (e.g., +humans). In this work, we consider a significantly more challenging setting in +which the target and source objects differ drastically in shape and +fine-grained motion characteristics (e.g., translating a jumping dog into a +dolphin). To this end, we leverage a pre-trained and fixed text-to-video +diffusion model, which provides us with generative and motion priors. 
The +pillar of our method is a new space-time feature loss derived directly from the +model. This loss guides the generation process to preserve the overall motion +of the input video while complying with the target object in terms of shape and +fine-grained motion traits.",cs.CV,['cs.CV'] +COLMAP-Free 3D Gaussian Splatting,Yang Fu · Sifei Liu · Amey Kulkarni · Jan Kautz · Alexei A. Efros · Xiaolong Wang, ,https://arxiv.org/abs/2312.07504,,2312.07504.pdf,COLMAP-Free 3D Gaussian Splatting,"While neural rendering has led to impressive advances in scene reconstruction +and novel view synthesis, it relies heavily on accurately pre-computed camera +poses. To relax this constraint, multiple efforts have been made to train +Neural Radiance Fields (NeRFs) without pre-processed camera poses. However, the +implicit representations of NeRFs provide extra challenges to optimize the 3D +structure and camera poses at the same time. On the other hand, the recently +proposed 3D Gaussian Splatting provides new opportunities given its explicit +point cloud representations. This paper leverages both the explicit geometric +representation and the continuity of the input video stream to perform novel +view synthesis without any SfM preprocessing. We process the input frames in a +sequential manner and progressively grow the 3D Gaussians set by taking one +input frame at a time, without the need to pre-compute the camera poses. Our +method significantly improves over previous approaches in view synthesis and +camera pose estimation under large motion changes. Our project page is +https://oasisyang.github.io/colmap-free-3dgs",cs.CV,['cs.CV'] +Scaling Laws for Data Filtering: Data Curation cannot be Compute Agnostic,Sachin Goyal · Pratyush Maini · Zachary Lipton · Aditi Raghunathan · Zico Kolter, ,https://arxiv.org/abs/2404.07177v1,,2404.07177v1.pdf,Scaling Laws for Data Filtering -- Data Curation cannot be Compute Agnostic,"Vision-language models (VLMs) are trained for thousands of GPU hours on +carefully curated web datasets. In recent times, data curation has gained +prominence with several works developing strategies to retain 'high-quality' +subsets of 'raw' scraped data. For instance, the LAION public dataset retained +only 10% of the total crawled data. However, these strategies are typically +developed agnostic of the available compute for training. In this paper, we +first demonstrate that making filtering decisions independent of training +compute is often suboptimal: the limited high-quality data rapidly loses its +utility when repeated, eventually requiring the inclusion of 'unseen' but +'lower-quality' data. To address this quality-quantity tradeoff +($\texttt{QQT}$), we introduce neural scaling laws that account for the +non-homogeneous nature of web data, an angle ignored in existing literature. +Our scaling laws (i) characterize the $\textit{differing}$ 'utility' of various +quality subsets of web data; (ii) account for how utility diminishes for a data +point at its 'nth' repetition; and (iii) formulate the mutual interaction of +various data pools when combined, enabling the estimation of model performance +on a combination of multiple data pools without ever jointly training on them. +Our key message is that data curation $\textit{cannot}$ be agnostic of the +total compute that a model will be trained for. Our scaling laws allow us to +curate the best possible pool for achieving top performance on Datacomp at +various compute budgets, carving out a pareto-frontier for data curation. 
Code +is available at https://github.com/locuslab/scaling_laws_data_filtering.",cs.LG,['cs.LG'] +GaussianShader: 3D Gaussian Splatting with Shading Functions for Reflective Surfaces,Yingwenqi Jiang · Jiadong Tu · Yuan Liu · Xifeng Gao · Xiaoxiao Long · Wenping Wang · Yuexin Ma, ,https://arxiv.org/abs/2311.17977,,2311.17977.pdf,GaussianShader: 3D Gaussian Splatting with Shading Functions for Reflective Surfaces,"The advent of neural 3D Gaussians has recently brought about a revolution in +the field of neural rendering, facilitating the generation of high-quality +renderings at real-time speeds. However, the explicit and discrete +representation encounters challenges when applied to scenes featuring +reflective surfaces. In this paper, we present GaussianShader, a novel method +that applies a simplified shading function on 3D Gaussians to enhance the +neural rendering in scenes with reflective surfaces while preserving the +training and rendering efficiency. The main challenge in applying the shading +function lies in the accurate normal estimation on discrete 3D Gaussians. +Specifically, we proposed a novel normal estimation framework based on the +shortest axis directions of 3D Gaussians with a delicately designed loss to +make the consistency between the normals and the geometries of Gaussian +spheres. Experiments show that GaussianShader strikes a commendable balance +between efficiency and visual quality. Our method surpasses Gaussian Splatting +in PSNR on specular object datasets, exhibiting an improvement of 1.57dB. When +compared to prior works handling reflective surfaces, such as Ref-NeRF, our +optimization time is significantly accelerated (23h vs. 0.58h). Please click on +our project website to see more results.",cs.CV,['cs.CV'] +BerfScene: Bev-conditioned Equivariant Radiance Fields for Infinite 3D Scene Generation,Qihang Zhang · Yinghao Xu · Yujun Shen · Bo Dai · Bolei Zhou · Ceyuan Yang, ,https://arxiv.org/abs/2312.02136,,2312.02136.pdf,BerfScene: Bev-conditioned Equivariant Radiance Fields for Infinite 3D Scene Generation,"Generating large-scale 3D scenes cannot simply apply existing 3D object +synthesis technique since 3D scenes usually hold complex spatial configurations +and consist of a number of objects at varying scales. We thus propose a +practical and efficient 3D representation that incorporates an equivariant +radiance field with the guidance of a bird's-eye view (BEV) map. Concretely, +objects of synthesized 3D scenes could be easily manipulated through steering +the corresponding BEV maps. Moreover, by adequately incorporating positional +encoding and low-pass filters into the generator, the representation becomes +equivariant to the given BEV map. Such equivariance allows us to produce +large-scale, even infinite-scale, 3D scenes via synthesizing local scenes and +then stitching them with smooth consistency. Extensive experiments on 3D scene +datasets demonstrate the effectiveness of our approach. 
Our project website is +at https://zqh0253.github.io/BerfScene/.",cs.CV,['cs.CV'] +L4D-Track: Language-to-4D Modeling Towards 6-DoF Tracking and Shape Reconstruction in 3D Point Cloud Stream,Jingtao Sun · Yaonan Wang · Mingtao Feng · Yulan Guo · Ajmal Mian · Mike Zheng Shou, ,https://arxiv.org/abs/2403.12728,,2403.12728.pdf,Diffusion-Driven Self-Supervised Learning for Shape Reconstruction and Pose Estimation,"Fully-supervised category-level pose estimation aims to determine the 6-DoF +poses of unseen instances from known categories, requiring expensive mannual +labeling costs. Recently, various self-supervised category-level pose +estimation methods have been proposed to reduce the requirement of the +annotated datasets. However, most methods rely on synthetic data or 3D CAD +model for self-supervised training, and they are typically limited to +addressing single-object pose problems without considering multi-objective +tasks or shape reconstruction. To overcome these challenges and limitations, we +introduce a diffusion-driven self-supervised network for multi-object shape +reconstruction and categorical pose estimation, only leveraging the shape +priors. Specifically, to capture the SE(3)-equivariant pose features and 3D +scale-invariant shape information, we present a Prior-Aware Pyramid 3D Point +Transformer in our network. This module adopts a point convolutional layer with +radial-kernels for pose-aware learning and a 3D scale-invariant graph +convolution layer for object-level shape representation, respectively. +Furthermore, we introduce a pretrain-to-refine self-supervised training +paradigm to train our network. It enables proposed network to capture the +associations between shape priors and observations, addressing the challenge of +intra-class shape variations by utilising the diffusion mechanism. Extensive +experiments conducted on four public datasets and a self-built dataset +demonstrate that our method significantly outperforms state-of-the-art +self-supervised category-level baselines and even surpasses some +fully-supervised instance-level and category-level methods.",cs.CV,['cs.CV'] +Anomaly Heterogeneity Learning for Open-set Supervised Anomaly Detection,Jiawen Zhu · Choubo Ding · Yu Tian · Guansong Pang, ,https://arxiv.org/abs/2310.12790,,2310.12790.pdf,Anomaly Heterogeneity Learning for Open-set Supervised Anomaly Detection,"Open-set supervised anomaly detection (OSAD) - a recently emerging anomaly +detection area - aims at utilizing a few samples of anomaly classes seen during +training to detect unseen anomalies (i.e., samples from open-set anomaly +classes), while effectively identifying the seen anomalies. Benefiting from the +prior knowledge illustrated by the seen anomalies, current OSAD methods can +often largely reduce false positive errors. However, these methods are trained +in a closed-set setting and treat the anomaly examples as from a homogeneous +distribution, rendering them less effective in generalizing to unseen anomalies +that can be drawn from any distribution. This paper proposes to learn +heterogeneous anomaly distributions using the limited anomaly examples to +address this issue. To this end, we introduce a novel approach, namely Anomaly +Heterogeneity Learning (AHL), that simulates a diverse set of heterogeneous +anomaly distributions and then utilizes them to learn a unified heterogeneous +abnormality model in surrogate open-set environments. 
Further, AHL is a generic +framework that existing OSAD models can plug and play for enhancing their +abnormality modeling. Extensive experiments on nine real-world anomaly +detection datasets show that AHL can 1) substantially enhance different +state-of-the-art OSAD models in detecting seen and unseen anomalies, and 2) +effectively generalize to unseen anomalies in new domains. Code is available at +https://github.com/mala-lab/AHL.",cs.CV,['cs.CV'] +"1-Lipschitz Layers Compared: Memory, Speed, and Certifiable Robustness",Bernd Prach · Fabio Brau · Giorgio Buttazzo · Christoph Lampert, ,https://arxiv.org/abs/2311.16833,,2311.16833.pdf,"1-Lipschitz Layers Compared: Memory, Speed, and Certifiable Robustness","The robustness of neural networks against input perturbations with bounded +magnitude represents a serious concern in the deployment of deep learning +models in safety-critical systems. Recently, the scientific community has +focused on enhancing certifiable robustness guarantees by crafting 1-Lipschitz +neural networks that leverage Lipschitz bounded dense and convolutional layers. +Although different methods have been proposed in the literature to achieve this +goal, understanding the performance of such methods is not straightforward, +since different metrics can be relevant (e.g., training time, memory usage, +accuracy, certifiable robustness) for different applications. For this reason, +this work provides a thorough theoretical and empirical comparison between +methods by evaluating them in terms of memory usage, speed, and certifiable +robust accuracy. The paper also provides some guidelines and recommendations to +support the user in selecting the methods that work best depending on the +available resources. We provide code at +https://github.com/berndprach/1LipschitzLayersCompared.",cs.LG,"['cs.LG', 'cs.CV', 'cs.NE']" +Bootstrapping Autonomous Driving Radars with Self-Supervised Learning,Yiduo Hao · Sohrab Madani · Junfeng Guan · Mo Alloulah · Saurabh Gupta · Haitham Al Hassanieh, ,https://arxiv.org/abs/2312.04519,,2312.04519.pdf,Bootstrapping Autonomous Driving Radars with Self-Supervised Learning,"The perception of autonomous vehicles using radars has attracted increased +research interest due its ability to operate in fog and bad weather. However, +training radar models is hindered by the cost and difficulty of annotating +large-scale radar data. To overcome this bottleneck, we propose a +self-supervised learning framework to leverage the large amount of unlabeled +radar data to pre-train radar-only embeddings for self-driving perception +tasks. The proposed method combines radar-to-radar and radar-to-vision +contrastive losses to learn a general representation from unlabeled radar +heatmaps paired with their corresponding camera images. When used for +downstream object detection, we demonstrate that the proposed self-supervision +framework can improve the accuracy of state-of-the-art supervised baselines by +$5.8\%$ in mAP. 
Code is available at \url{https://github.com/yiduohao/Radical}.",cs.CV,['cs.CV'] +CONFORM: Contrast is All You Need for High-Fidelity Text-to-Image Diffusion Models,Tuna Han Salih Meral · Enis Simsar · Federico Tombari · Pinar Yanardag, ,https://arxiv.org/abs/2312.06059v1,,2312.06059v1.pdf,CONFORM: Contrast is All You Need For High-Fidelity Text-to-Image Diffusion Models,"Images produced by text-to-image diffusion models might not always faithfully +represent the semantic intent of the provided text prompt, where the model +might overlook or entirely fail to produce certain objects. Existing solutions +often require customly tailored functions for each of these problems, leading +to sub-optimal results, especially for complex prompts. Our work introduces a +novel perspective by tackling this challenge in a contrastive context. Our +approach intuitively promotes the segregation of objects in attention maps +while also maintaining that pairs of related attributes are kept close to each +other. We conduct extensive experiments across a wide variety of scenarios, +each involving unique combinations of objects, attributes, and scenes. These +experiments effectively showcase the versatility, efficiency, and flexibility +of our method in working with both latent and pixel-based diffusion models, +including Stable Diffusion and Imagen. Moreover, we publicly share our source +code to facilitate further research.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +Selective Hourglass Mapping for Universal Image Restoration Based on Diffusion Model,Dian Zheng · Xiao-Ming Wu · Shuzhou Yang · Jian Zhang · Jian-Fang Hu · Wei-Shi Zheng, ,https://arxiv.org/abs/2403.11157,,2403.11157.pdf,Selective Hourglass Mapping for Universal Image Restoration Based on Diffusion Model,"Universal image restoration is a practical and potential computer vision task +for real-world applications. The main challenge of this task is handling the +different degradation distributions at once. Existing methods mainly utilize +task-specific conditions (e.g., prompt) to guide the model to learn different +distributions separately, named multi-partite mapping. However, it is not +suitable for universal model learning as it ignores the shared information +between different tasks. In this work, we propose an advanced selective +hourglass mapping strategy based on diffusion model, termed DiffUIR. Two novel +considerations make our DiffUIR non-trivial. Firstly, we equip the model with +strong condition guidance to obtain accurate generation direction of diffusion +model (selective). More importantly, DiffUIR integrates a flexible shared +distribution term (SDT) into the diffusion algorithm elegantly and naturally, +which gradually maps different distributions into a shared one. In the reverse +process, combined with SDT and strong condition guidance, DiffUIR iteratively +guides the shared distribution to the task-specific distribution with high +image quality (hourglass). Without bells and whistles, by only modifying the +mapping strategy, we achieve state-of-the-art performance on five image +restoration tasks, 22 benchmarks in the universal setting and zero-shot +generalization setting. Surprisingly, by only using a lightweight model (only +0.89M), we could achieve outstanding performance. 
The source code and +pre-trained models are available at https://github.com/iSEE-Laboratory/DiffUIR",cs.CV,['cs.CV'] +ActiveDC: Distribution Calibration for Active Finetuning,Wenshuai Xu · Zhenghui Hu · Yu Lu · Jinzhou Meng · Qingjie Liu · Yunhong Wang, ,https://arxiv.org/abs/2311.07634,,2311.07634.pdf,ActiveDC: Distribution Calibration for Active Finetuning,"The pretraining-finetuning paradigm has gained popularity in various computer +vision tasks. In this paradigm, the emergence of active finetuning arises due +to the abundance of large-scale data and costly annotation requirements. Active +finetuning involves selecting a subset of data from an unlabeled pool for +annotation, facilitating subsequent finetuning. However, the use of a limited +number of training samples can lead to a biased distribution, potentially +resulting in model overfitting. In this paper, we propose a new method called +ActiveDC for the active finetuning tasks. Firstly, we select samples for +annotation by optimizing the distribution similarity between the subset to be +selected and the entire unlabeled pool in continuous space. Secondly, we +calibrate the distribution of the selected samples by exploiting implicit +category information in the unlabeled pool. The feature visualization provides +an intuitive sense of the effectiveness of our approach to distribution +calibration. We conducted extensive experiments on three image classification +datasets with different sampling ratios. The results indicate that ActiveDC +consistently outperforms the baseline performance in all image classification +tasks. The improvement is particularly significant when the sampling ratio is +low, with performance gains of up to 10%. Our code will be released.",cs.CV,['cs.CV'] +Extreme Point Supervised Instance Segmentation,Hyeonjun Lee · Sehyun Hwang · Suha Kwak, ,https://arxiv.org/abs/2405.20729,,2405.20729.pdf,Extreme Point Supervised Instance Segmentation,"This paper introduces a novel approach to learning instance segmentation +using extreme points, i.e., the topmost, leftmost, bottommost, and rightmost +points, of each object. These points are readily available in the modern +bounding box annotation process while offering strong clues for precise +segmentation, and thus allows to improve performance at the same annotation +cost with box-supervised methods. Our work considers extreme points as a part +of the true instance mask and propagates them to identify potential foreground +and background points, which are all together used for training a pseudo label +generator. Then pseudo labels given by the generator are in turn used for +supervised learning of our final model. On three public benchmarks, our method +significantly outperforms existing box-supervised methods, further narrowing +the gap with its fully supervised counterpart. In particular, our model +generates high-quality masks when a target object is separated into multiple +parts, where previous box-supervised methods often fail.",cs.CV,['cs.CV'] +Towards Robust 3D Pose Transfer with Adversarial Learning,Haoyu Chen · Hao Tang · Ehsan Adeli · Guoying Zhao, ,https://arxiv.org/abs/2404.02242,,2404.02242.pdf,Towards Robust 3D Pose Transfer with Adversarial Learning,"3D pose transfer that aims to transfer the desired pose to a target mesh is +one of the most challenging 3D generation tasks. Previous attempts rely on +well-defined parametric human models or skeletal joints as driving pose +sources. 
However, to obtain those clean pose sources, cumbersome but necessary +pre-processing pipelines are inevitable, hindering implementations of the +real-time applications. This work is driven by the intuition that the +robustness of the model can be enhanced by introducing adversarial samples into +the training, leading to a more invulnerable model to the noisy inputs, which +even can be further extended to directly handling the real-world data like raw +point clouds/scans without intermediate processing. Furthermore, we propose a +novel 3D pose Masked Autoencoder (3D-PoseMAE), a customized MAE that +effectively learns 3D extrinsic presentations (i.e., pose). 3D-PoseMAE +facilitates learning from the aspect of extrinsic attributes by simultaneously +generating adversarial samples that perturb the model and learning the +arbitrary raw noisy poses via a multi-scale masking strategy. Both qualitative +and quantitative studies show that the transferred meshes given by our network +result in much better quality. Besides, we demonstrate the strong +generalizability of our method on various poses, different domains, and even +raw scans. Experimental results also show meaningful insights that the +intermediate adversarial samples generated in the training can successfully +attack the existing pose transfer models.",cs.CV,['cs.CV'] +Improving Image Restoration through Removing Degradations in Textual Representations,Jingbo Lin · Zhilu Zhang · Yuxiang Wei · Dongwei Ren · Dongsheng Jiang · Qi Tian · Wangmeng Zuo, ,https://arxiv.org/abs/2312.17334,,2312.17334.pdf,Improving Image Restoration through Removing Degradations in Textual Representations,"In this paper, we introduce a new perspective for improving image restoration +by removing degradation in the textual representations of a given degraded +image. Intuitively, restoration is much easier on text modality than image one. +For example, it can be easily conducted by removing degradation-related words +while keeping the content-aware words. Hence, we combine the advantages of +images in detail description and ones of text in degradation removal to perform +restoration. To address the cross-modal assistance, we propose to map the +degraded images into textual representations for removing the degradations, and +then convert the restored textual representations into a guidance image for +assisting image restoration. In particular, We ingeniously embed an +image-to-text mapper and text restoration module into CLIP-equipped +text-to-image models to generate the guidance. Then, we adopt a simple +coarse-to-fine approach to dynamically inject multi-scale information from +guidance to image restoration networks. Extensive experiments are conducted on +various image restoration tasks, including deblurring, dehazing, deraining, and +denoising, and all-in-one image restoration. The results showcase that our +method outperforms state-of-the-art ones across all these tasks. 
The codes and +models are available at \url{https://github.com/mrluin/TextualDegRemoval}.",cs.CV,['cs.CV'] +Learning Coupled Dictionaries from Unpaired Data for Image Super-Resolution,Longguang Wang · Juncheng Li · Yingqian Wang · Qingyong Hu · Yulan Guo, ,,https://link.springer.com/article/10.1007/s11760-023-02936-x,,,,,nan +Dispersed Structured Light for Hyperspectral 3D Imaging,Suhyun Shin · Seokjun Choi · Felix Heide · Seung-Hwan Baek, ,https://arxiv.org/abs/2311.18287,,2311.18287.pdf,Dispersed Structured Light for Hyperspectral 3D Imaging,"Hyperspectral 3D imaging aims to acquire both depth and spectral information +of a scene. However, existing methods are either prohibitively expensive and +bulky or compromise on spectral and depth accuracy. In this work, we present +Dispersed Structured Light (DSL), a cost-effective and compact method for +accurate hyperspectral 3D imaging. DSL modifies a traditional projector-camera +system by placing a sub-millimeter thick diffraction grating film front of the +projector. The grating disperses structured light based on light wavelength. To +utilize the dispersed structured light, we devise a model for dispersive +projection image formation and a per-pixel hyperspectral 3D reconstruction +method. We validate DSL by instantiating a compact experimental prototype. DSL +achieves spectral accuracy of 18.8nm full-width half-maximum (FWHM) and depth +error of 1mm. We demonstrate that DSL outperforms prior work on practical +hyperspectral 3D imaging. DSL promises accurate and practical hyperspectral 3D +imaging for diverse application domains, including computer vision and +graphics, cultural heritage, geology, and biology.",eess.IV,"['eess.IV', 'cs.CV', 'cs.GR']" +MoPE-CLIP: Structured Pruning for Efficient Vision-Language Models with Module-wise Pruning Error Metric,Haokun Lin · Haoli Bai · Zhili Liu · Lu Hou · Muyi Sun · Linqi Song · Ying Wei · Zhenan Sun, ,,https://paperswithcode.com/paper/mope-clip-structured-pruning-for-efficient,,,,,nan +"TurboSL: Dense, Accurate and Fast 3D by Neural Inverse Structured Light",Parsa Mirdehghan · Maxx Wu · Wenzheng Chen · Wenzheng Chen · David B. Lindell · Kiriakos Kutulakos, ,https://arxiv.org/abs/2306.13361,,2306.13361.pdf,Neural 360$^\circ$ Structured Light with Learned Metasurfaces,"Structured light has proven instrumental in 3D imaging, LiDAR, and +holographic light projection. Metasurfaces, comprised of sub-wavelength-sized +nanostructures, facilitate 180$^\circ$ field-of-view (FoV) structured light, +circumventing the restricted FoV inherent in traditional optics like +diffractive optical elements. However, extant metasurface-facilitated +structured light exhibits sub-optimal performance in downstream tasks, due to +heuristic pattern designs such as periodic dots that do not consider the +objectives of the end application. In this paper, we present neural 360$^\circ$ +structured light, driven by learned metasurfaces. We propose a differentiable +framework, that encompasses a computationally-efficient 180$^\circ$ wave +propagation model and a task-specific reconstructor, and exploits both +transmission and reflection channels of the metasurface. Leveraging a +first-order optimizer within our differentiable framework, we optimize the +metasurface design, thereby realizing neural 360$^\circ$ structured light. We +have utilized neural 360$^\circ$ structured light for holographic light +projection and 3D imaging. 
Specifically, we demonstrate the first 360$^\circ$ +light projection of complex patterns, enabled by our propagation model that can +be computationally evaluated 50,000$\times$ faster than the Rayleigh-Sommerfeld +propagation. For 3D imaging, we improve depth-estimation accuracy by +5.09$\times$ in RMSE compared to the heuristically-designed structured light. +Neural 360$^\circ$ structured light promises robust 360$^\circ$ imaging and +display for robotics, extended-reality systems, and human-computer +interactions.",physics.optics,"['physics.optics', 'cs.CV', 'eess.IV']" +Unlocking the Potential of Prompt-Tuning in Bridging Generalized and Personalized Federated Learning,wenlong deng · Christos Thrampoulidis · Xiaoxiao Li, ,https://arxiv.org/abs/2310.18285,,2310.18285.pdf,Unlocking the Potential of Prompt-Tuning in Bridging Generalized and Personalized Federated Learning,"Vision Transformers (ViT) and Visual Prompt Tuning (VPT) achieve +state-of-the-art performance with improved efficiency in various computer +vision tasks. This suggests a promising paradigm shift of adapting pre-trained +ViT models to Federated Learning (FL) settings. However, the challenge of data +heterogeneity among FL clients presents a significant hurdle in effectively +deploying ViT models. Existing Generalized FL (GFL) and Personalized FL (PFL) +methods have limitations in balancing performance across both global and local +data distributions. In this paper, we present a novel algorithm, SGPT, that +integrates GFL and PFL approaches by employing a unique combination of both +shared and group-specific prompts. This design enables SGPT to capture both +common and group-specific features. A key feature of SGPT is its prompt +selection module, which facilitates the training of a single global model +capable of automatically adapting to diverse local client data distributions +without the need for local fine-tuning. To effectively train the prompts, we +utilize block coordinate descent (BCD), learning from common feature +information (shared prompts), and then more specialized knowledge (group +prompts) iteratively. Theoretically, we justify that learning the proposed +prompts can reduce the gap between global and local performance. Empirically, +we conduct experiments on both label and feature heterogeneity settings in +comparison with state-of-the-art baselines, along with extensive ablation +studies, to substantiate the superior performance of SGPT.",cs.LG,"['cs.LG', 'cs.CV']" +Causal-CoG: A Causal-Effect Look at Context Generation for Boosting Multi-modal Language Models,Shitian Zhao · Zhuowan Li · YadongLu · Alan L. Yuille · Yan Wang, ,https://arxiv.org/abs/2312.06685,,2312.06685.pdf,Causal-CoG: A Causal-Effect Look at Context Generation for Boosting Multi-modal Language Models,"While Multi-modal Language Models (MLMs) demonstrate impressive multimodal +ability, they still struggle on providing factual and precise responses for +tasks like visual question answering (VQA). In this paper, we address this +challenge from the perspective of contextual information. We propose Causal +Context Generation, Causal-CoG, which is a prompting strategy that engages +contextual information to enhance precise VQA during inference. Specifically, +we prompt MLMs to generate contexts, i.e, text description of an image, and +engage the generated contexts for question answering. 
Moreover, we investigate +the advantage of contexts on VQA from a causality perspective, introducing +causality filtering to select samples for which contextual information is +helpful. To show the effectiveness of Causal-CoG, we run extensive experiments +on 10 multimodal benchmarks and show consistent improvements, e.g., +6.30% on +POPE, +13.69% on Vizwiz and +6.43% on VQAv2 compared to direct decoding, +surpassing existing methods. We hope Casual-CoG inspires explorations of +context knowledge in multimodal models, and serves as a plug-and-play strategy +for MLM decoding.",cs.AI,['cs.AI'] +Person-in-WiFi 3D: End-to-End Multi-Person 3D Pose Estimation with Wi-Fi,Kangwei Yan · Fei Wang · Bo Qian · Han Ding · Jinsong Han · Xing Wei, ,https://arxiv.org/abs/2404.02041,,2404.02041.pdf,SelfPose3d: Self-Supervised Multi-Person Multi-View 3d Pose Estimation,"We present a new self-supervised approach, SelfPose3d, for estimating 3d +poses of multiple persons from multiple camera views. Unlike current +state-of-the-art fully-supervised methods, our approach does not require any 2d +or 3d ground-truth poses and uses only the multi-view input images from a +calibrated camera setup and 2d pseudo poses generated from an off-the-shelf 2d +human pose estimator. We propose two self-supervised learning objectives: +self-supervised person localization in 3d space and self-supervised 3d pose +estimation. We achieve self-supervised 3d person localization by training the +model on synthetically generated 3d points, serving as 3d person root +positions, and on the projected root-heatmaps in all the views. We then model +the 3d poses of all the localized persons with a bottleneck representation, map +them onto all views obtaining 2d joints, and render them using 2d Gaussian +heatmaps in an end-to-end differentiable manner. Afterwards, we use the +corresponding 2d joints and heatmaps from the pseudo 2d poses for learning. To +alleviate the intrinsic inaccuracy of the pseudo labels, we propose an adaptive +supervision attention mechanism to guide the self-supervision. Our experiments +and analysis on three public benchmark datasets, including Panoptic, Shelf, and +Campus, show the effectiveness of our approach, which is comparable to +fully-supervised methods. Code is available at +\url{https://github.com/CAMMA-public/SelfPose3D}",cs.CV,['cs.CV'] +Multi-agent Collaborative Perception via Motion-aware Robust Communication Network,Shixin Hong · Yu LIU · Zhi Li · Shaohui Li · You He, ,https://arxiv.org/abs/2401.12694,,2401.12694.pdf,Pragmatic Communication in Multi-Agent Collaborative Perception,"Collaborative perception allows each agent to enhance its perceptual +abilities by exchanging messages with others. It inherently results in a +trade-off between perception ability and communication costs. Previous works +transmit complete full-frame high-dimensional feature maps among agents, +resulting in substantial communication costs. To promote communication +efficiency, we propose only transmitting the information needed for the +collaborator's downstream task. 
This pragmatic communication strategy focuses +on three key aspects: i) pragmatic message selection, which selects +task-critical parts from the complete data, resulting in spatially and +temporally sparse feature vectors; ii) pragmatic message representation, which +achieves pragmatic approximation of high-dimensional feature vectors with a +task-adaptive dictionary, enabling communicating with integer indices; iii) +pragmatic collaborator selection, which identifies beneficial collaborators, +pruning unnecessary communication links. Following this strategy, we first +formulate a mathematical optimization framework for the +perception-communication trade-off and then propose PragComm, a multi-agent +collaborative perception system with two key components: i) single-agent +detection and tracking and ii) pragmatic collaboration. The proposed PragComm +promotes pragmatic communication and adapts to a wide range of communication +conditions. We evaluate PragComm for both collaborative 3D object detection and +tracking tasks in both real-world, V2V4Real, and simulation datasets, OPV2V and +V2X-SIM2.0. PragComm consistently outperforms previous methods with more than +32.7K times lower communication volume on OPV2V. Code is available at +github.com/PhyllisH/PragComm.",cs.CV,['cs.CV'] +Dense Optical Tracking: Connecting the Dots,Guillaume Le Moing · Jean Ponce · Cordelia Schmid,https://github.com/16lemoing/dot,https://arxiv.org/abs/2312.00786,,2312.00786.pdf,Dense Optical Tracking: Connecting the Dots,"Recent approaches to point tracking are able to recover the trajectory of any +scene point through a large portion of a video despite the presence of +occlusions. They are, however, too slow in practice to track every point +observed in a single frame in a reasonable amount of time. This paper +introduces DOT, a novel, simple and efficient method for solving this problem. +It first extracts a small set of tracks from key regions at motion boundaries +using an off-the-shelf point tracking algorithm. Given source and target +frames, DOT then computes rough initial estimates of a dense flow field and +visibility mask through nearest-neighbor interpolation, before refining them +using a learnable optical flow estimator that explicitly handles occlusions and +can be trained on synthetic data with ground-truth correspondences. We show +that DOT is significantly more accurate than current optical flow techniques, +outperforms sophisticated ""universal"" trackers like OmniMotion, and is on par +with, or better than, the best point tracking algorithms like CoTracker while +being at least two orders of magnitude faster. Quantitative and qualitative +experiments with synthetic and real videos validate the promise of the proposed +approach. Code, data, and videos showcasing the capabilities of our approach +are available in the project webpage: https://16lemoing.github.io/dot .",cs.CV,['cs.CV'] +Enhancing Post-training Quantization Calibration through Contrastive Learning,Yuzhang Shang · Gaowen Liu · Ramana Kompella · Yan Yan, ,https://arxiv.org/abs/2311.06322,,2311.06322.pdf,Post-training Quantization with Progressive Calibration and Activation Relaxing for Text-to-Image Diffusion Models,"Diffusion models have achieved great success due to their remarkable +generation ability. However, their high computational overhead is still a +troublesome problem. Recent studies have leveraged post-training quantization +(PTQ) to compress diffusion models. 
However, most of them only focus on +unconditional models, leaving the quantization of widely used large pretrained +text-to-image models, e.g., Stable Diffusion, largely unexplored. In this +paper, we propose a novel post-training quantization method PCR (Progressive +Calibration and Relaxing) for text-to-image diffusion models, which consists of +a progressive calibration strategy that considers the accumulated quantization +error across timesteps, and an activation relaxing strategy that improves the +performance with negligible cost. Additionally, we demonstrate the previous +metrics for text-to-image diffusion model quantization are not accurate due to +the distribution gap. To tackle the problem, we propose a novel QDiffBench +benchmark, which utilizes data in the same domain for more accurate evaluation. +Besides, QDiffBench also considers the generalization performance of the +quantized model outside the calibration dataset. Extensive experiments on +Stable Diffusion and Stable Diffusion XL demonstrate the superiority of our +method and benchmark. Moreover, we are the first to achieve quantization for +Stable Diffusion XL while maintaining the performance.",cs.CV,"['cs.CV', 'cs.LG']" +PanoPose: Self-supervised Relative Pose Estimation for Panoramic Images,Diantao Tu · Hainan Cui · Xianwei Zheng · Shuhan Shen, ,https://arxiv.org/abs/2404.02041,,,SelfPose3d: Self-Supervised Multi-Person Multi-View 3d Pose Estimation,"We present a new self-supervised approach, SelfPose3d, for estimating 3d +poses of multiple persons from multiple camera views. Unlike current +state-of-the-art fully-supervised methods, our approach does not require any 2d +or 3d ground-truth poses and uses only the multi-view input images from a +calibrated camera setup and 2d pseudo poses generated from an off-the-shelf 2d +human pose estimator. We propose two self-supervised learning objectives: +self-supervised person localization in 3d space and self-supervised 3d pose +estimation. We achieve self-supervised 3d person localization by training the +model on synthetically generated 3d points, serving as 3d person root +positions, and on the projected root-heatmaps in all the views. We then model +the 3d poses of all the localized persons with a bottleneck representation, map +them onto all views obtaining 2d joints, and render them using 2d Gaussian +heatmaps in an end-to-end differentiable manner. Afterwards, we use the +corresponding 2d joints and heatmaps from the pseudo 2d poses for learning. To +alleviate the intrinsic inaccuracy of the pseudo labels, we propose an adaptive +supervision attention mechanism to guide the self-supervision. Our experiments +and analysis on three public benchmark datasets, including Panoptic, Shelf, and +Campus, show the effectiveness of our approach, which is comparable to +fully-supervised methods. Code is available at +\url{https://github.com/CAMMA-public/SelfPose3D}",cs.CV,['cs.CV'] +Semantics-aware Motion Retargeting with Vision-Language Models,Haodong Zhang · ZhiKe Chen · Haocheng Xu · Lei Hao · Xiaofei Wu · Songcen Xu · Zhensong Zhang · Yue Wang · Rong Xiong, ,https://arxiv.org/abs/2312.01964,,2312.01964.pdf,Semantics-aware Motion Retargeting with Vision-Language Models,"Capturing and preserving motion semantics is essential to motion retargeting +between animation characters. However, most of the previous works neglect the +semantic information or rely on human-designed joint-level representations. 
+Here, we present a novel Semantics-aware Motion reTargeting (SMT) method with +the advantage of vision-language models to extract and maintain meaningful +motion semantics. We utilize a differentiable module to render 3D motions. Then +the high-level motion semantics are incorporated into the motion retargeting +process by feeding the vision-language model with the rendered images and +aligning the extracted semantic embeddings. To ensure the preservation of +fine-grained motion details and high-level semantics, we adopt a two-stage +pipeline consisting of skeleton-aware pre-training and fine-tuning with +semantics and geometry constraints. Experimental results show the effectiveness +of the proposed method in producing high-quality motion retargeting results +while accurately preserving motion semantics.",cs.CV,"['cs.CV', 'cs.GR']" +HanDiffuser: Text-to-Image Generation With Realistic Hand Appearances,Supreeth Narasimhaswamy · Uttaran Bhattacharya · Xiang Chen · Ishita Dasgupta · Saayan Mitra · Minh Hoai, ,https://arxiv.org/abs/2403.01693,,2403.01693.pdf,HanDiffuser: Text-to-Image Generation With Realistic Hand Appearances,"Text-to-image generative models can generate high-quality humans, but realism +is lost when generating hands. Common artifacts include irregular hand poses, +shapes, incorrect numbers of fingers, and physically implausible finger +orientations. To generate images with realistic hands, we propose a novel +diffusion-based architecture called HanDiffuser that achieves realism by +injecting hand embeddings in the generative process. HanDiffuser consists of +two components: a Text-to-Hand-Params diffusion model to generate SMPL-Body and +MANO-Hand parameters from input text prompts, and a Text-Guided +Hand-Params-to-Image diffusion model to synthesize images by conditioning on +the prompts and hand parameters generated by the previous component. We +incorporate multiple aspects of hand representation, including 3D shapes and +joint-level finger positions, orientations and articulations, for robust +learning and reliable performance during inference. We conduct extensive +quantitative and qualitative experiments and perform user studies to +demonstrate the efficacy of our method in generating images with high-quality +hands.",cs.CV,"['cs.CV', 'cs.AI']" +Communication-Efficient Collaborative Perception via Information Filling with Codebook,Yue Hu · Juntong Peng · Sifei Liu · Junhao Ge · Si Liu · Siheng Chen, ,https://arxiv.org/abs/2405.04966,,2405.04966.pdf,Communication-Efficient Collaborative Perception via Information Filling with Codebook,"Collaborative perception empowers each agent to improve its perceptual +ability through the exchange of perceptual messages with other agents. It +inherently results in a fundamental trade-off between perception ability and +communication cost. To address this bottleneck issue, our core idea is to +optimize the collaborative messages from two key aspects: representation and +selection. The proposed codebook-based message representation enables the +transmission of integer codes, rather than high-dimensional feature maps. The +proposed information-filling-driven message selection optimizes local messages +to collectively fill each agent's information demand, preventing information +overflow among multiple agents. 
By integrating these two designs, we propose +CodeFilling, a novel communication-efficient collaborative perception system, +which significantly advances the perception-communication trade-off and is +inclusive to both homogeneous and heterogeneous collaboration settings. We +evaluate CodeFilling in both a real-world dataset, DAIR-V2X, and a new +simulation dataset, OPV2VH+. Results show that CodeFilling outperforms previous +SOTA Where2comm on DAIR-V2X/OPV2VH+ with 1,333/1,206 times lower communication +volume. Our code is available at https://github.com/PhyllisH/CodeFilling.",cs.IT,"['cs.IT', 'cs.CV', 'cs.MA', 'math.IT']" +Adversarial Score Distillation: When score distillation meets GAN,Min Wei · Jingkai Zhou · Junyao Sun · Xuesong Zhang, ,https://arxiv.org/abs/2312.00739,,2312.00739.pdf,Adversarial Score Distillation: When score distillation meets GAN,"Existing score distillation methods are sensitive to classifier-free guidance +(CFG) scale: manifested as over-smoothness or instability at small CFG scales, +while over-saturation at large ones. To explain and analyze these issues, we +revisit the derivation of Score Distillation Sampling (SDS) and decipher +existing score distillation with the Wasserstein Generative Adversarial Network +(WGAN) paradigm. With the WGAN paradigm, we find that existing score +distillation either employs a fixed sub-optimal discriminator or conducts +incomplete discriminator optimization, resulting in the scale-sensitive issue. +We propose the Adversarial Score Distillation (ASD), which maintains an +optimizable discriminator and updates it using the complete optimization +objective. Experiments show that the proposed ASD performs favorably in 2D +distillation and text-to-3D tasks against existing methods. Furthermore, to +explore the generalization ability of our WGAN paradigm, we extend ASD to the +image editing task, which achieves competitive results. The project page and +code are at https://github.com/2y7c3/ASD.",cs.CV,['cs.CV'] +Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks,Bin Xiao · Haiping Wu · Weijian Xu · Xiyang Dai · Houdong Hu · Yumao Lu · Michael Zeng · Ce Liu · Lu Yuan, ,https://arxiv.org/abs/2311.06242,,2311.06242.pdf,Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks,"We introduce Florence-2, a novel vision foundation model with a unified, +prompt-based representation for a variety of computer vision and +vision-language tasks. While existing large vision models excel in transfer +learning, they struggle to perform a diversity of tasks with simple +instructions, a capability that implies handling the complexity of various +spatial hierarchy and semantic granularity. Florence-2 was designed to take +text-prompt as task instructions and generate desirable results in text forms, +whether it be captioning, object detection, grounding or segmentation. This +multi-task learning setup demands large-scale, high-quality annotated data. To +this end, we co-developed FLD-5B that consists of 5.4 billion comprehensive +visual annotations on 126 million images, using an iterative strategy of +automated image annotation and model refinement. We adopted a +sequence-to-sequence structure to train Florence-2 to perform versatile and +comprehensive vision tasks. 
Extensive evaluations on numerous tasks +demonstrated Florence-2 to be a strong vision foundation model contender with +unprecedented zero-shot and fine-tuning capabilities.",cs.CV,['cs.CV'] +GOAT-Bench: A Benchmark for Multi-modal Lifelong Navigation,Mukul Khanna · Ram Ramrakhya · Gunjan Chhablani · Sriram Yenamandra · Theo Gervet · Matthew Chang · Zsolt Kira · Devendra Singh Chaplot · Dhruv Batra · Roozbeh Mottaghi, ,https://arxiv.org/abs/2404.06609,,2404.06609.pdf,GOAT-Bench: A Benchmark for Multi-Modal Lifelong Navigation,"The Embodied AI community has made significant strides in visual navigation +tasks, exploring targets from 3D coordinates, objects, language descriptions, +and images. However, these navigation models often handle only a single input +modality as the target. With the progress achieved so far, it is time to move +towards universal navigation models capable of handling various goal types, +enabling more effective user interaction with robots. To facilitate this goal, +we propose GOAT-Bench, a benchmark for the universal navigation task referred +to as GO to AnyThing (GOAT). In this task, the agent is directed to navigate to +a sequence of targets specified by the category name, language description, or +image in an open-vocabulary fashion. We benchmark monolithic RL and modular +methods on the GOAT task, analyzing their performance across modalities, the +role of explicit and implicit scene memories, their robustness to noise in goal +specifications, and the impact of memory in lifelong scenarios.",cs.AI,"['cs.AI', 'cs.RO']" +LLaFS: When Large Language Models Meet Few-Shot Segmentation,Lanyun Zhu · Tianrun Chen · Deyi Ji · Deyi Ji · Jieping Ye · Jun Liu, ,https://arxiv.org/abs/2311.16926,,2311.16926.pdf,LLaFS: When Large Language Models Meet Few-Shot Segmentation,"This paper proposes LLaFS, the first attempt to leverage large language +models (LLMs) in few-shot segmentation. In contrast to the conventional +few-shot segmentation methods that only rely on the limited and biased +information from the annotated support images, LLaFS leverages the vast prior +knowledge gained by LLM as an effective supplement and directly uses the LLM to +segment images in a few-shot manner. To enable the text-based LLM to handle +image-related tasks, we carefully design an input instruction that allows the +LLM to produce segmentation results represented as polygons, and propose a +region-attribute table to simulate the human visual mechanism and provide +multi-modal guidance. We also synthesize pseudo samples and use curriculum +learning for pretraining to augment data and achieve better optimization. LLaFS +achieves state-of-the-art results on multiple datasets, showing the potential +of using LLMs for few-shot computer vision tasks.",cs.CV,['cs.CV'] +MVCPS-NeuS: Multi-view Constrained Photometric Stereo for Neural Surface Reconstruction,Hiroaki Santo · Fumio Okura · Yasuyuki Matsushita,https://github.com/hiroaki-santo/mvcps-neus,https://arxiv.org/abs/2405.12057,,2405.12057.pdf,NPLMV-PS: Neural Point-Light Multi-View Photometric Stereo,"In this work we present a novel multi-view photometric stereo (PS) method. +Like many works in 3D reconstruction we are leveraging neural shape +representations and learnt renderers. However, our work differs from the +state-of-the-art multi-view PS methods such as PS-NeRF or SuperNormal we +explicity leverage per-pixel intensity renderings rather than relying mainly on +estimated normals. 
+ We model point light attenuation and explicitly raytrace cast shadows in +order to best approximate each points incoming radiance. This is used as input +to a fully neural material renderer that uses minimal prior assumptions and it +is jointly optimised with the surface. Finally, estimated normal and +segmentation maps can also incorporated in order to maximise the surface +accuracy. + Our method is among the first to outperform the classical approach of +DiLiGenT-MV and achieves average 0.2mm Chamfer distance for objects imaged at +approx 1.5m distance away with approximate 400x400 resolution. Moreover, we +show robustness to poor normals in low light count scenario, achieving 0.27mm +Chamfer distance when pixel rendering is used instead of estimated normals.",cs.CV,['cs.CV'] +FlowTrack: Revisiting Optical Flow for Long-Range Dense Tracking,Seokju Cho · Gabriel Huang · Seungryong Kim · Joon-Young Lee, ,https://arxiv.org/abs/2312.00786,,,Dense Optical Tracking: Connecting the Dots,"Recent approaches to point tracking are able to recover the trajectory of any +scene point through a large portion of a video despite the presence of +occlusions. They are, however, too slow in practice to track every point +observed in a single frame in a reasonable amount of time. This paper +introduces DOT, a novel, simple and efficient method for solving this problem. +It first extracts a small set of tracks from key regions at motion boundaries +using an off-the-shelf point tracking algorithm. Given source and target +frames, DOT then computes rough initial estimates of a dense flow field and +visibility mask through nearest-neighbor interpolation, before refining them +using a learnable optical flow estimator that explicitly handles occlusions and +can be trained on synthetic data with ground-truth correspondences. We show +that DOT is significantly more accurate than current optical flow techniques, +outperforms sophisticated ""universal"" trackers like OmniMotion, and is on par +with, or better than, the best point tracking algorithms like CoTracker while +being at least two orders of magnitude faster. Quantitative and qualitative +experiments with synthetic and real videos validate the promise of the proposed +approach. Code, data, and videos showcasing the capabilities of our approach +are available in the project webpage: https://16lemoing.github.io/dot .",cs.CV,['cs.CV'] +MULTIFLOW: Shifting Towards Task-Agnostic Vision-Language Pruning,Matteo Farina · Massimiliano Mancini · Elia Cunegatti · Gaowen Liu · Giovanni Iacca · Elisa Ricci,https://github.com/FarinaMatteo/multiflow,https://arxiv.org/abs/2404.05621,,2404.05621.pdf,MULTIFLOW: Shifting Towards Task-Agnostic Vision-Language Pruning,"While excellent in transfer learning, Vision-Language models (VLMs) come with +high computational costs due to their large number of parameters. To address +this issue, removing parameters via model pruning is a viable solution. +However, existing techniques for VLMs are task-specific, and thus require +pruning the network from scratch for each new task of interest. In this work, +we explore a new direction: Task-Agnostic Vision-Language Pruning (TA-VLP). +Given a pretrained VLM, the goal is to find a unique pruned counterpart +transferable to multiple unknown downstream tasks. In this challenging setting, +the transferable representations already encoded in the pretrained model are a +key aspect to preserve. 
Thus, we propose Multimodal Flow Pruning (MULTIFLOW), a +first, gradient-free, pruning framework for TA-VLP where: (i) the importance of +a parameter is expressed in terms of its magnitude and its information flow, by +incorporating the saliency of the neurons it connects; and (ii) pruning is +driven by the emergent (multimodal) distribution of the VLM parameters after +pretraining. We benchmark eight state-of-the-art pruning algorithms in the +context of TA-VLP, experimenting with two VLMs, three vision-language tasks, +and three pruning ratios. Our experimental results show that MULTIFLOW +outperforms recent sophisticated, combinatorial competitors in the vast +majority of the cases, paving the way towards addressing TA-VLP. The code is +publicly available at https://github.com/FarinaMatteo/multiflow.",cs.CV,['cs.CV'] +In-Context Matting,He Guo · Zixuan Ye · Zhiguo Cao · Hao Lu, ,https://arxiv.org/abs/2403.15789,,2403.15789.pdf,In-Context Matting,"We introduce in-context matting, a novel task setting of image matting. Given +a reference image of a certain foreground and guided priors such as points, +scribbles, and masks, in-context matting enables automatic alpha estimation on +a batch of target images of the same foreground category, without additional +auxiliary input. This setting marries good performance in auxiliary input-based +matting and ease of use in automatic matting, which finds a good trade-off +between customization and automation. To overcome the key challenge of accurate +foreground matching, we introduce IconMatting, an in-context matting model +built upon a pre-trained text-to-image diffusion model. Conditioned on inter- +and intra-similarity matching, IconMatting can make full use of reference +context to generate accurate target alpha mattes. To benchmark the task, we +also introduce a novel testing dataset ICM-$57$, covering 57 groups of +real-world images. Quantitative and qualitative results on the ICM-57 testing +set show that IconMatting rivals the accuracy of trimap-based matting while +retaining the automation level akin to automatic matting. Code is available at +https://github.com/tiny-smart/in-context-matting",cs.CV,['cs.CV'] +Interactive Continual Learning: Fast and Slow Thinking,Biqing Qi · Xinquan Chen · Junqi Gao · Dong Li · Jianxing Liu · Ligang Wu · Bowen Zhou, ,https://arxiv.org/abs/2403.02628,,2403.02628.pdf,Interactive Continual Learning: Fast and Slow Thinking,"Advanced life forms, sustained by the synergistic interaction of neural +cognitive mechanisms, continually acquire and transfer knowledge throughout +their lifespan. In contrast, contemporary machine learning paradigms exhibit +limitations in emulating the facets of continual learning (CL). Nonetheless, +the emergence of large language models (LLMs) presents promising avenues for +realizing CL via interactions with these models. Drawing on Complementary +Learning System theory, this paper presents a novel Interactive Continual +Learning (ICL) framework, enabled by collaborative interactions among models of +various sizes. Specifically, we assign the ViT model as System1 and multimodal +LLM as System2. To enable the memory module to deduce tasks from class +information and enhance Set2Set retrieval, we propose the Class-Knowledge-Task +Multi-Head Attention (CKT-MHA). Additionally, to improve memory retrieval in +System1 through enhanced geometric representation, we introduce the CL-vMF +mechanism, based on the von Mises-Fisher (vMF) distribution. 
Meanwhile, we +introduce the von Mises-Fisher Outlier Detection and Interaction (vMF-ODI) +strategy to identify hard examples, thus enhancing collaboration between +System1 and System2 for complex reasoning realization. Comprehensive evaluation +of our proposed ICL demonstrates significant resistance to forgetting and +superior performance relative to existing methods. Code is available at +github.com/ICL.",cs.CV,"['cs.CV', 'cs.LG']" +The Devil is in the Details: StyleFeatureEditor for Detail-Rich StyleGAN Inversion and High Quality Image Editing,Denis Bobkov · Vadim Titov · Aibek Alanov · Dmitry Vetrov, ,https://ar5iv.labs.arxiv.org/html/2203.08450,,2203.08450.pdf,The Devil Is in the Details: Window-based Attention for Image Compression,"Learned image compression methods have exhibited superior rate-distortion +performance than classical image compression standards. Most existing learned +image compression models are based on Convolutional Neural Networks (CNNs). +Despite great contributions, a main drawback of CNN based model is that its +structure is not designed for capturing local redundancy, especially the +non-repetitive textures, which severely affects the reconstruction quality. +Therefore, how to make full use of both global structure and local texture +becomes the core problem for learning-based image compression. Inspired by +recent progresses of Vision Transformer (ViT) and Swin Transformer, we found +that combining the local-aware attention mechanism with the global-related +feature learning could meet the expectation in image compression. In this +paper, we first extensively study the effects of multiple kinds of attention +mechanisms for local features learning, then introduce a more straightforward +yet effective window-based local attention block. The proposed window-based +attention is very flexible which could work as a plug-and-play component to +enhance CNN and Transformer models. Moreover, we propose a novel Symmetrical +TransFormer (STF) framework with absolute transformer blocks in the +down-sampling encoder and up-sampling decoder. Extensive experimental +evaluations have shown that the proposed method is effective and outperforms +the state-of-the-art methods. The code is publicly available at +https://github.com/Googolxx/STF.",eess.IV,"['eess.IV', 'cs.CV']" +RGBD Objects in the Wild: Scaling Real-World 3D Object Learning from RGB-D Videos,Hongchi Xia · Yang Fu · Sifei Liu · Xiaolong Wang, ,https://arxiv.org/abs/2401.12592,,2401.12592.pdf,RGBD Objects in the Wild: Scaling Real-World 3D Object Learning from RGB-D Videos,"We introduce a new RGB-D object dataset captured in the wild called +WildRGB-D. Unlike most existing real-world object-centric datasets which only +come with RGB capturing, the direct capture of the depth channel allows better +3D annotations and broader downstream applications. WildRGB-D comprises +large-scale category-level RGB-D object videos, which are taken using an iPhone +to go around the objects in 360 degrees. It contains around 8500 recorded +objects and nearly 20000 RGB-D videos across 46 common object categories. These +videos are taken with diverse cluttered backgrounds with three setups to cover +as many real-world scenarios as possible: (i) a single object in one video; +(ii) multiple objects in one video; and (iii) an object with a static hand in +one video. The dataset is annotated with object masks, real-world scale camera +poses, and reconstructed aggregated point clouds from RGBD videos. 
We benchmark +four tasks with WildRGB-D including novel view synthesis, camera pose +estimation, object 6d pose estimation, and object surface reconstruction. Our +experiments show that the large-scale capture of RGB-D objects provides a large +potential to advance 3D object learning. Our project page is +https://wildrgbd.github.io/.",cs.CV,['cs.CV'] +Learning Continual Compatible Representation for Re-indexing Free Lifelong Person Re-identification,Zhenyu Cui · Jiahuan Zhou · Xun Wang · Manyu Zhu · Yuxin Peng, ,https://arxiv.org/abs/2403.16003,,2403.16003.pdf,Diverse Representation Embedding for Lifelong Person Re-Identification,"Lifelong Person Re-Identification (LReID) aims to continuously learn from +successive data streams, matching individuals across multiple cameras. The key +challenge for LReID is how to effectively preserve old knowledge while +incrementally learning new information, which is caused by task-level domain +gaps and limited old task datasets. Existing methods based on CNN backbone are +insufficient to explore the representation of each instance from different +perspectives, limiting model performance on limited old task datasets and new +task datasets. Unlike these methods, we propose a Diverse Representations +Embedding (DRE) framework that first explores a pure transformer for LReID. The +proposed DRE preserves old knowledge while adapting to new information based on +instance-level and task-level layout. Concretely, an Adaptive Constraint Module +(ACM) is proposed to implement integration and push away operations between +multiple overlapping representations generated by transformer-based backbone, +obtaining rich and discriminative representations for each instance to improve +adaptive ability of LReID. Based on the processed diverse representations, we +propose Knowledge Update (KU) and Knowledge Preservation (KP) strategies at the +task-level layout by introducing the adjustment model and the learner model. KU +strategy enhances the adaptive learning ability of learner models for new +information under the adjustment model prior, and KP strategy preserves old +knowledge operated by representation-level alignment and logit-level +supervision in limited old task datasets while guaranteeing the adaptive +learning information capacity of the LReID model. Compared to state-of-the-art +methods, our method achieves significantly improved performance in holistic, +large-scale, and occluded datasets.",cs.CV,"['cs.CV', 'cs.AI']" +6-DoF Pose Estimation with MultiScale Residual Correlation,Yuelong Li · Yafei Mao · Raja Bala · Sunil Hadap,https://github.com/amzn/mrc-net-6d-pose,https://arxiv.org/abs/2403.08019,,2403.08019.pdf,MRC-Net: 6-DoF Pose Estimation with MultiScale Residual Correlation,"We propose a single-shot approach to determining 6-DoF pose of an object with +available 3D computer-aided design (CAD) model from a single RGB image. Our +method, dubbed MRC-Net, comprises two stages. The first performs pose +classification and renders the 3D object in the classified pose. The second +stage performs regression to predict fine-grained residual pose within class. +Connecting the two stages is a novel multi-scale residual correlation (MRC) +layer that captures high-and-low level correspondences between the input image +and rendering from first stage. MRC-Net employs a Siamese network with shared +weights between both stages to learn embeddings for input and rendered images. 
+To mitigate ambiguity when predicting discrete pose class labels on symmetric +objects, we use soft probabilistic labels to define pose class in the first +stage. We demonstrate state-of-the-art accuracy, outperforming all competing +RGB-based methods on four challenging BOP benchmark datasets: T-LESS, LM-O, +YCB-V, and ITODD. Our method is non-iterative and requires no complex +post-processing.",cs.CV,['cs.CV'] +Minimal Perspective Autocalibration,Andrea Porfiri Dal Cin · Timothy Duff · Luca Magri · Tomas Pajdla, ,https://arxiv.org/abs/2405.05605,,2405.05605.pdf,Minimal Perspective Autocalibration,"We introduce a new family of minimal problems for reconstruction from +multiple views. Our primary focus is a novel approach to autocalibration, a +long-standing problem in computer vision. Traditional approaches to this +problem, such as those based on Kruppa's equations or the modulus constraint, +rely explicitly on the knowledge of multiple fundamental matrices or a +projective reconstruction. In contrast, we consider a novel formulation +involving constraints on image points, the unknown depths of 3D points, and a +partially specified calibration matrix $K$. For $2$ and $3$ views, we present a +comprehensive taxonomy of minimal autocalibration problems obtained by relaxing +some of these constraints. These problems are organized into classes according +to the number of views and any assumed prior knowledge of $K$. Within each +class, we determine problems with the fewest -- or a relatively small number of +-- solutions. From this zoo of problems, we devise three practical solvers. +Experiments with synthetic and real data and interfacing our solvers with +COLMAP demonstrate that we achieve superior accuracy compared to +state-of-the-art calibration methods. The code is available at +https://github.com/andreadalcin/MinimalPerspectiveAutocalibration",cs.CV,['cs.CV'] +Improving Spectral Snapshot Reconstruction with Spectral-Spatial Rectification,Jiancheng Zhang · Haijin Zeng · Yongyong Chen · Dengxiu Yu · Yinping Zhao,https://github.com/ZhangJC-2k/SSR,,https://ieeexplore.ieee.org/document/10411766,,,,,nan +WorDepth: Variational Language Prior for Monocular Depth Estimation,Ziyao Zeng · Hyoungseob Park · Fengyu Yang · Daniel Wang · Stefano Soatto · Dong Lao · Alex Wong, ,https://arxiv.org/abs/2404.03635,,2404.03635.pdf,WorDepth: Variational Language Prior for Monocular Depth Estimation,"Three-dimensional (3D) reconstruction from a single image is an ill-posed +problem with inherent ambiguities, i.e. scale. Predicting a 3D scene from text +description(s) is similarly ill-posed, i.e. spatial arrangements of objects +described. We investigate the question of whether two inherently ambiguous +modalities can be used in conjunction to produce metric-scaled reconstructions. +To test this, we focus on monocular depth estimation, the problem of predicting +a dense depth map from a single image, but with an additional text caption +describing the scene. To this end, we begin by encoding the text caption as a +mean and standard deviation; using a variational framework, we learn the +distribution of the plausible metric reconstructions of 3D scenes corresponding +to the text captions as a prior. To ""select"" a specific reconstruction or depth +map, we encode the given image through a conditional sampler that samples from +the latent space of the variational text encoder, which is then decoded to the +output depth map. 
Our approach is trained alternatingly between the text and +image branches: in one optimization step, we predict the mean and standard +deviation from the text description and sample from a standard Gaussian, and in +the other, we sample using a (image) conditional sampler. Once trained, we +directly predict depth from the encoded text using the conditional sampler. We +demonstrate our approach on indoor (NYUv2) and outdoor (KITTI) scenarios, where +we show that language can consistently improve performance in both.",cs.CV,"['cs.CV', 'cs.AI', 'cs.CL', 'cs.LG', 'cs.MM']" +Hierarchical Patch Diffusion Models for High-Resolution Video Generation,Ivan Skorokhodov · Willi Menapace · Aliaksandr Siarohin · Sergey Tulyakov, ,http://export.arxiv.org/abs/2310.19512,,2310.19512.pdf,VideoCrafter1: Open Diffusion Models for High-Quality Video Generation,"Video generation has increasingly gained interest in both academia and +industry. Although commercial tools can generate plausible videos, there is a +limited number of open-source models available for researchers and engineers. +In this work, we introduce two diffusion models for high-quality video +generation, namely text-to-video (T2V) and image-to-video (I2V) models. T2V +models synthesize a video based on a given text input, while I2V models +incorporate an additional image input. Our proposed T2V model can generate +realistic and cinematic-quality videos with a resolution of $1024 \times 576$, +outperforming other open-source T2V models in terms of quality. The I2V model +is designed to produce videos that strictly adhere to the content of the +provided reference image, preserving its content, structure, and style. This +model is the first open-source I2V foundation model capable of transforming a +given image into a video clip while maintaining content preservation +constraints. We believe that these open-source video generation models will +contribute significantly to the technological advancements within the +community.",cs.CV,['cs.CV'] +End-to-End Temporal Action Detection with 1B Parameters Across 1000 Frames,Shuming Liu · Chenlin Zhang · Chen Zhao · Bernard Ghanem, ,https://arxiv.org/abs/2311.17241,,2311.17241.pdf,End-to-End Temporal Action Detection with 1B Parameters Across 1000 Frames,"Recently, temporal action detection (TAD) has seen significant performance +improvement with end-to-end training. However, due to the memory bottleneck, +only models with limited scales and limited data volumes can afford end-to-end +training, which inevitably restricts TAD performance. In this paper, we reduce +the memory consumption for end-to-end training, and manage to scale up the TAD +backbone to 1 billion parameters and the input video to 1,536 frames, leading +to significant detection performance. The key to our approach lies in our +proposed temporal-informative adapter (TIA), which is a novel lightweight +module that reduces training memory. Using TIA, we free the humongous backbone +from learning to adapt to the TAD task by only updating the parameters in TIA. +TIA also leads to better TAD representation by temporally aggregating context +from adjacent frames throughout the backbone. We evaluate our model across four +representative datasets. Owing to our efficient design, we are able to train +end-to-end on VideoMAEv2-giant and achieve 75.4% mAP on THUMOS14, being the +first end-to-end model to outperform the best feature-based methods. 
Code is +available at https://github.com/sming256/AdaTAD.",cs.CV,['cs.CV'] +Dual DETRs for Multi-Label Temporal Action Detection,Yuhan Zhu · Guozhen Zhang · Jing Tan · Gangshan Wu · Limin Wang, ,https://arxiv.org/abs/2404.00653,,2404.00653.pdf,Dual DETRs for Multi-Label Temporal Action Detection,"Temporal Action Detection (TAD) aims to identify the action boundaries and +the corresponding category within untrimmed videos. Inspired by the success of +DETR in object detection, several methods have adapted the query-based +framework to the TAD task. However, these approaches primarily followed DETR to +predict actions at the instance level (i.e., identify each action by its center +point), leading to sub-optimal boundary localization. To address this issue, we +propose a new Dual-level query-based TAD framework, namely DualDETR, to detect +actions from both instance-level and boundary-level. Decoding at different +levels requires semantics of different granularity, therefore we introduce a +two-branch decoding structure. This structure builds distinctive decoding +processes for different levels, facilitating explicit capture of temporal cues +and semantics at each level. On top of the two-branch design, we present a +joint query initialization strategy to align queries from both levels. +Specifically, we leverage encoder proposals to match queries from each level in +a one-to-one manner. Then, the matched queries are initialized using position +and content prior from the matched action proposal. The aligned dual-level +queries can refine the matched proposal with complementary cues during +subsequent decoding. We evaluate DualDETR on three challenging multi-label TAD +benchmarks. The experimental results demonstrate the superior performance of +DualDETR to the existing state-of-the-art methods, achieving a substantial +improvement under det-mAP and delivering impressive results under seg-mAP.",cs.CV,['cs.CV'] +LeftRefill: Filling Right Canvas based on Left Reference through Generalized Text-to-Image Diffusion Model,Chenjie Cao · Yunuo Cai · Qiaole Dong · Yikai Wang · Yanwei Fu,https://ewrfcas.github.io/LeftRefill/,https://arxiv.org/html/2405.18416v1,,2405.18416v1.pdf,3D StreetUnveiler with Semantic-Aware 2DGS,"Unveiling an empty street from crowded observations captured by in-car +cameras is crucial for autonomous driving. However, removing all temporary +static objects, such as stopped vehicles and standing pedestrians, presents a +significant challenge. Unlike object-centric 3D inpainting, which relies on +thorough observation in a small scene, street scenes involve long trajectories +that differ from previous 3D inpainting tasks. The camera-centric moving +environment of captured videos further complicates the task due to the limited +degree and time duration of object observation. To address these obstacles, we +introduce StreetUnveiler to reconstruct an empty street. StreetUnveiler learns +a 3D representation of the empty street from crowded observations. Our +representation is based on the hard-label semantic 2D Gaussian Splatting (2DGS) +for its scalability and ability to identify Gaussians to be removed. We inpaint +rendered image after removing unwanted Gaussians to provide pseudo-labels and +subsequently re-optimize the 2DGS. Given its temporal continuous movement, we +divide the empty street scene into observed, partial-observed, and unobserved +regions, which we propose to locate through a rendered alpha map. 
This +decomposition helps us to minimize the regions that need to be inpainted. To +enhance the temporal consistency of the inpainting, we introduce a novel +time-reversal framework to inpaint frames in reverse order and use later frames +as references for earlier frames to fully utilize the long-trajectory +observations. Our experiments conducted on the street scene dataset +successfully reconstructed a 3D representation of the empty street. The mesh +representation of the empty street can be extracted for further applications. +Project page and more visualizations can be found at: +https://streetunveiler.github.io",cs.CV,['cs.CV'] +3DiffTection: 3D Object Detection with Geometry-aware Diffusion Features,Chenfeng Xu · Huan Ling · Sanja Fidler · Or Litany, ,https://arxiv.org/abs/2311.04391,,2311.04391.pdf,3DiffTection: 3D Object Detection with Geometry-Aware Diffusion Features,"We present 3DiffTection, a state-of-the-art method for 3D object detection +from single images, leveraging features from a 3D-aware diffusion model. +Annotating large-scale image data for 3D detection is resource-intensive and +time-consuming. Recently, pretrained large image diffusion models have become +prominent as effective feature extractors for 2D perception tasks. However, +these features are initially trained on paired text and image data, which are +not optimized for 3D tasks, and often exhibit a domain gap when applied to the +target data. Our approach bridges these gaps through two specialized tuning +strategies: geometric and semantic. For geometric tuning, we fine-tune a +diffusion model to perform novel view synthesis conditioned on a single image, +by introducing a novel epipolar warp operator. This task meets two essential +criteria: the necessity for 3D awareness and reliance solely on posed image +data, which are readily available (e.g., from videos) and does not require +manual annotation. For semantic refinement, we further train the model on +target data with detection supervision. Both tuning phases employ ControlNet to +preserve the integrity of the original feature capabilities. In the final step, +we harness these enhanced capabilities to conduct a test-time prediction +ensemble across multiple virtual viewpoints. Through our methodology, we obtain +3D-aware features that are tailored for 3D detection and excel in identifying +cross-view point correspondences. Consequently, our model emerges as a powerful +3D detector, substantially surpassing previous benchmarks, e.g., Cube-RCNN, a +precedent in single-view 3D detection by 9.43\% in AP3D on the +Omni3D-ARkitscene dataset. Furthermore, 3DiffTection showcases robust data +efficiency and generalization to cross-domain data.",cs.CV,['cs.CV'] +Unsupervised Feature Learning with Emergent Data-Driven Prototypicality,Yunhui Guo · Youren Zhang · Yubei Chen · Stella X. Yu, ,https://arxiv.org/abs/2307.01421,,2307.01421.pdf,Unsupervised Feature Learning with Emergent Data-Driven Prototypicality,"Given an image set without any labels, our goal is to train a model that maps +each image to a point in a feature space such that, not only proximity +indicates visual similarity, but where it is located directly encodes how +prototypical the image is according to the dataset. 
+ Our key insight is to perform unsupervised feature learning in hyperbolic +instead of Euclidean space, where the distance between points still reflect +image similarity, and yet we gain additional capacity for representing +prototypicality with the location of the point: The closer it is to the origin, +the more prototypical it is. The latter property is simply emergent from +optimizing the usual metric learning objective: The image similar to many +training instances is best placed at the center of corresponding points in +Euclidean space, but closer to the origin in hyperbolic space. + We propose an unsupervised feature learning algorithm in Hyperbolic space +with sphere pACKing. HACK first generates uniformly packed particles in the +Poincar\'e ball of hyperbolic space and then assigns each image uniquely to +each particle. Images after congealing are regarded more typical of the dataset +it belongs to. With our feature mapper simply trained to spread out training +instances in hyperbolic space, we observe that images move closer to the origin +with congealing, validating our idea of unsupervised prototypicality discovery. +We demonstrate that our data-driven prototypicality provides an easy and +superior unsupervised instance selection to reduce sample complexity, increase +model generalization with atypical instances and robustness with typical ones.",cs.CV,"['cs.CV', 'cs.AI']" +Visual In-Context Prompting,Feng Li · Qing Jiang · Hao Zhang · Shilong Liu · Huaizhe Xu · Xueyan Zou · Tianhe Ren · Hongyang Li · Lei Zhang · Chunyuan Li · Jianwei Yang · Jianfeng Gao, ,https://arxiv.org/abs/2311.13601,,2311.13601.pdf,Visual In-Context Prompting,"In-context prompting in large language models (LLMs) has become a prevalent +approach to improve zero-shot capabilities, but this idea is less explored in +the vision domain. Existing visual prompting methods focus on referring +segmentation to segment the most relevant object, falling short of addressing +many generic vision tasks like open-set segmentation and detection. In this +paper, we introduce a universal visual in-context prompting framework for both +tasks. In particular, we build on top of an encoder-decoder architecture, and +develop a versatile prompt encoder to support a variety of prompts like +strokes, boxes, and points. We further enhance it to take an arbitrary number +of reference image segments as the context. Our extensive explorations show +that the proposed visual in-context prompting elicits extraordinary referring +and generic segmentation capabilities to refer and detect, yielding competitive +performance to close-set in-domain datasets and showing promising results on +many open-set segmentation datasets. By joint training on COCO and SA-1B, our +model achieves $57.7$ PQ on COCO and $23.2$ PQ on ADE20K. Code will be +available at https://github.com/UX-Decoder/DINOv.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +Fair Federated Learning under Domain Skew with Local Consistency and Domain Diversity,Yuhang Chen · Wenke Huang · Mang Ye,https://github.com/yuhangchen0/FedHEAL,https://arxiv.org/abs/2405.16585,,2405.16585.pdf,Fair Federated Learning under Domain Skew with Local Consistency and Domain Diversity,"Federated learning (FL) has emerged as a new paradigm for privacy-preserving +collaborative training. Under domain skew, the current FL approaches are biased +and face two fairness problems. 1) Parameter Update Conflict: data disparity +among clients leads to varying parameter importance and inconsistent update +directions. 
These two disparities cause important parameters to potentially be +overwhelmed by unimportant ones of dominant updates. It consequently results in +significant performance decreases for lower-performing clients. 2) Model +Aggregation Bias: existing FL approaches introduce unfair weight allocation and +neglect domain diversity. It leads to biased model convergence objective and +distinct performance among domains. We discover a pronounced directional update +consistency in Federated Learning and propose a novel framework to tackle above +issues. First, leveraging the discovered characteristic, we selectively discard +unimportant parameter updates to prevent updates from clients with lower +performance overwhelmed by unimportant parameters, resulting in fairer +generalization performance. Second, we propose a fair aggregation objective to +prevent global model bias towards some domains, ensuring that the global model +continuously aligns with an unbiased model. The proposed method is generic and +can be combined with other existing FL methods to enhance fairness. +Comprehensive experiments on Digits and Office-Caltech demonstrate the high +fairness and performance of our method.",cs.LG,"['cs.LG', 'cs.AI']" +Reg-PTQ: Regression-specialized Post-training Quantization for Fully Quantized Object Detector,Yifu Ding · Weilun Feng · Chuyan Chen · Jinyang Guo · Xianglong Liu, ,,,,,,,nan +MAS: Multi-view Ancestral Sampling for 3D motion generation using 2D diffusion,Roy Kapon · Guy Tevet · Daniel Cohen-Or · Amit H. Bermano, ,https://arxiv.org/abs/2310.14729,,2310.14729.pdf,MAS: Multi-view Ancestral Sampling for 3D motion generation using 2D diffusion,"We introduce Multi-view Ancestral Sampling (MAS), a method for 3D motion +generation, using 2D diffusion models that were trained on motions obtained +from in-the-wild videos. As such, MAS opens opportunities to exciting and +diverse fields of motion previously under-explored as 3D data is scarce and +hard to collect. MAS works by simultaneously denoising multiple 2D motion +sequences representing different views of the same 3D motion. It ensures +consistency across all views at each diffusion step by combining the individual +generations into a unified 3D sequence, and projecting it back to the original +views. We demonstrate MAS on 2D pose data acquired from videos depicting +professional basketball maneuvers, rhythmic gymnastic performances featuring a +ball apparatus, and horse races. In each of these domains, 3D motion capture is +arduous, and yet, MAS generates diverse and realistic 3D sequences. Unlike the +Score Distillation approach, which optimizes each sample by repeatedly applying +small fixes, our method uses a sampling process that was constructed for the +diffusion framework. As we demonstrate, MAS avoids common issues such as +out-of-domain sampling and mode-collapse. https://guytevet.github.io/mas-page/",cs.CV,"['cs.CV', 'cs.GR']" +PEEKABOO: Interactive Video Generation via Masked-Diffusion,Yash Jain · Anshul Nasery · Vibhav Vineet · Harkirat Behl, ,https://arxiv.org/abs/2312.07509,,2312.07509.pdf,PEEKABOO: Interactive Video Generation via Masked-Diffusion,"Modern video generation models like Sora have achieved remarkable success in +producing high-quality videos. However, a significant limitation is their +inability to offer interactive control to users, a feature that promises to +open up unprecedented applications and creativity. 
In this work, we introduce +the first solution to equip diffusion-based video generation models with +spatio-temporal control. We present Peekaboo, a novel masked attention module, +which seamlessly integrates with current video generation models offering +control without the need for additional training or inference overhead. To +facilitate future research, we also introduce a comprehensive benchmark for +interactive video generation. This benchmark offers a standardized framework +for the community to assess the efficacy of emerging interactive video +generation models. Our extensive qualitative and quantitative assessments +reveal that Peekaboo achieves up to a 3.8x improvement in mIoU over baseline +models, all while maintaining the same latency. Code and benchmark are +available on the webpage.",cs.CV,"['cs.CV', 'cs.LG']" +Efficiently Assemble Normalization Layers and Regularization for Federated Domain Generalization,Khiem Le · Tuan Long Ho · Cuong Do · Danh Le-Phuoc · KOK SENG WONG, ,https://arxiv.org/abs/2403.15605,,2403.15605.pdf,Efficiently Assemble Normalization Layers and Regularization for Federated Domain Generalization,"Domain shift is a formidable issue in Machine Learning that causes a model to +suffer from performance degradation when tested on unseen domains. Federated +Domain Generalization (FedDG) attempts to train a global model using +collaborative clients in a privacy-preserving manner that can generalize well +to unseen clients possibly with domain shift. However, most existing FedDG +methods either cause additional privacy risks of data leakage or induce +significant costs in client communication and computation, which are major +concerns in the Federated Learning paradigm. To circumvent these challenges, +here we introduce a novel architectural method for FedDG, namely gPerXAN, which +relies on a normalization scheme working with a guiding regularizer. In +particular, we carefully design Personalized eXplicitly Assembled Normalization +to enforce client models selectively filtering domain-specific features that +are biased towards local data while retaining discrimination of those features. +Then, we incorporate a simple yet effective regularizer to guide these models +in directly capturing domain-invariant representations that the global model's +classifier can leverage. Extensive experimental results on two benchmark +datasets, i.e., PACS and Office-Home, and a real-world medical dataset, +Camelyon17, indicate that our proposed method outperforms other existing +methods in addressing this particular problem.",cs.CV,"['cs.CV', 'cs.LG']" +S$^2$MVTC: a Simple yet Efficient Scalable Multi-View Tensor Clustering,Zhen Long · Qiyuan Wang · Yazhou Ren · Yipeng Liu · Ce Zhu, ,https://arxiv.org/abs/2403.09107,,2403.09107.pdf,S^2MVTC: a Simple yet Efficient Scalable Multi-View Tensor Clustering,"Anchor-based large-scale multi-view clustering has attracted considerable +attention for its effectiveness in handling massive datasets. However, current +methods mainly seek the consensus embedding feature for clustering by exploring +global correlations between anchor graphs or projection matrices.In this paper, +we propose a simple yet efficient scalable multi-view tensor clustering +(S^2MVTC) approach, where our focus is on learning correlations of embedding +features within and across views. Specifically, we first construct the +embedding feature tensor by stacking the embedding features of different views +into a tensor and rotating it. 
Additionally, we build a novel tensor +low-frequency approximation (TLFA) operator, which incorporates graph +similarity into embedding feature learning, efficiently achieving smooth +representation of embedding features within different views. Furthermore, +consensus constraints are applied to embedding features to ensure inter-view +semantic consistency. Experimental results on six large-scale multi-view +datasets demonstrate that S^2MVTC significantly outperforms state-of-the-art +algorithms in terms of clustering performance and CPU execution time, +especially when handling massive data. The code of S^2MVTC is publicly +available at https://github.com/longzhen520/S2MVTC.",cs.LG,"['cs.LG', 'cs.CV']" +LTGC: Long-tail Recognition via Leveraging LLMs-driven Generated Content,Qihao Zhao · Yalun Dai · Hao Li · Wei Hu · Fan Zhang · Jun Liu, ,https://arxiv.org/abs/2403.05854,,2403.05854.pdf,LTGC: Long-tail Recognition via Leveraging LLMs-driven Generated Content,"Long-tail recognition is challenging because it requires the model to learn +good representations from tail categories and address imbalances across all +categories. In this paper, we propose a novel generative and fine-tuning +framework, LTGC, to handle long-tail recognition via leveraging generated +content. Firstly, inspired by the rich implicit knowledge in large-scale models +(e.g., large language models, LLMs), LTGC leverages the power of these models +to parse and reason over the original tail data to produce diverse tail-class +content. We then propose several novel designs for LTGC to ensure the quality +of the generated data and to efficiently fine-tune the model using both the +generated and original data. The visualization demonstrates the effectiveness +of the generation module in LTGC, which produces accurate and diverse tail +data. Additionally, the experimental results demonstrate that our LTGC +outperforms existing state-of-the-art methods on popular long-tailed +benchmarks.",cs.CV,['cs.CV'] +BEHAVIOR Vision Suite: Customizable Dataset Generation via Simulation,Yunhao Ge · Yihe Tang · Jiashu Xu · Cem Gokmen · Chengshu Li · Wensi Ai · Benjamin Martinez · Arman Aydin · Mona Anvari · Ayush Chakravarthy · Hong-Xing Yu · Josiah Wong · Sanjana Srivastava · Sharon Lee · Shengxin Zha · Laurent Itti · Yunzhu Li · Roberto Martín-Martín · Miao Liu · Pengchuan Zhang · Ruohan Zhang · Li Fei-Fei · Jiajun Wu, ,,,,,,,nan +Relightful Harmonization: Lighting-aware Portrait Background Replacement,Mengwei Ren · Wei Xiong · Jae Shin Yoon · Zhixin Shu · Jianming Zhang · HyunJoon Jung · Guido Gerig · He Zhang, ,https://arxiv.org/abs/2312.06886,,2312.06886.pdf,Relightful Harmonization: Lighting-aware Portrait Background Replacement,"Portrait harmonization aims to composite a subject into a new background, +adjusting its lighting and color to ensure harmony with the background scene. +Existing harmonization techniques often only focus on adjusting the global +color and brightness of the foreground and ignore crucial illumination cues +from the background such as apparent lighting direction, leading to unrealistic +compositions. We introduce Relightful Harmonization, a lighting-aware diffusion +model designed to seamlessly harmonize sophisticated lighting effect for the +foreground portrait using any background image. Our approach unfolds in three +stages. First, we introduce a lighting representation module that allows our +diffusion model to encode lighting information from target image background. 
+Second, we introduce an alignment network that aligns lighting features learned +from image background with lighting features learned from panorama environment +maps, which is a complete representation for scene illumination. Last, to +further boost the photorealism of the proposed method, we introduce a novel +data simulation pipeline that generates synthetic training pairs from a diverse +range of natural images, which are used to refine the model. Our method +outperforms existing benchmarks in visual fidelity and lighting coherence, +showing superior generalization in real-world testing scenarios, highlighting +its versatility and practicality.",cs.CV,['cs.CV'] +Image Processing GNN: Breaking Rigidity in Super-Resolution,Yuchuan Tian · Hanting Chen · Chao Xu · Yunhe Wang, ,https://arxiv.org/abs/2310.10413,,2310.10413.pdf,Image super-resolution via dynamic network,"Convolutional neural networks (CNNs) depend on deep network architectures to +extract accurate information for image super-resolution. However, obtained +information of these CNNs cannot completely express predicted high-quality +images for complex scenes. In this paper, we present a dynamic network for +image super-resolution (DSRNet), which contains a residual enhancement block, +wide enhancement block, feature refinement block and construction block. The +residual enhancement block is composed of a residual enhanced architecture to +facilitate hierarchical features for image super-resolution. To enhance +robustness of obtained super-resolution model for complex scenes, a wide +enhancement block achieves a dynamic architecture to learn more robust +information to enhance applicability of an obtained super-resolution model for +varying scenes. To prevent interference of components in a wide enhancement +block, a refinement block utilizes a stacked architecture to accurately learn +obtained features. Also, a residual learning operation is embedded in the +refinement block to prevent long-term dependency problem. Finally, a +construction block is responsible for reconstructing high-quality images. +Designed heterogeneous architecture can not only facilitate richer structural +information, but also be lightweight, which is suitable for mobile digital +devices. Experimental results shows that our method is more competitive in +terms of performance and recovering time of image super-resolution and +complexity. The code of DSRNet can be obtained at +https://github.com/hellloxiaotian/DSRNet.",eess.IV,"['eess.IV', 'cs.CV']" +TexTile: A Differentiable Metric for Texture Tileability,Carlos Rodriguez-Pardo · Dan Casas · Elena Garces · Jorge Lopez-Moreno,https://mslab.es/projects/TexTile/,,,,,,,nan +GAvatar: Animatable 3D Gaussian Avatars with Implicit Mesh Learning,Ye Yuan · Xueting Li · Yangyi Huang · Shalini De Mello · Koki Nagano · Jan Kautz · Umar Iqbal,https://nvlabs.github.io/GAvatar/,https://arxiv.org/abs/2312.11461,,2312.11461.pdf,GAvatar: Animatable 3D Gaussian Avatars with Implicit Mesh Learning,"Gaussian splatting has emerged as a powerful 3D representation that harnesses +the advantages of both explicit (mesh) and implicit (NeRF) 3D representations. +In this paper, we seek to leverage Gaussian splatting to generate realistic +animatable avatars from textual descriptions, addressing the limitations (e.g., +flexibility and efficiency) imposed by mesh or NeRF-based representations. 
+However, a naive application of Gaussian splatting cannot generate high-quality +animatable avatars and suffers from learning instability; it also cannot +capture fine avatar geometries and often leads to degenerate body parts. To +tackle these problems, we first propose a primitive-based 3D Gaussian +representation where Gaussians are defined inside pose-driven primitives to +facilitate animation. Second, to stabilize and amortize the learning of +millions of Gaussians, we propose to use neural implicit fields to predict the +Gaussian attributes (e.g., colors). Finally, to capture fine avatar geometries +and extract detailed meshes, we propose a novel SDF-based implicit mesh +learning approach for 3D Gaussians that regularizes the underlying geometries +and extracts highly detailed textured meshes. Our proposed method, GAvatar, +enables the large-scale generation of diverse animatable avatars using only +text prompts. GAvatar significantly surpasses existing methods in terms of both +appearance and geometry quality, and achieves extremely fast rendering (100 +fps) at 1K resolution.",cs.CV,"['cs.CV', 'cs.GR', 'cs.LG']" +Instance Tracking in 3D Scenes from Egocentric Videos,Yunhan Zhao · Haoyu Ma · Shu Kong · Charless Fowlkes,https://github.com/IT3DEgo/IT3DEgo/,https://arxiv.org/abs/2312.04117,,2312.04117.pdf,Instance Tracking in 3D Scenes from Egocentric Videos,"Egocentric sensors such as AR/VR devices capture human-object interactions +and offer the potential to provide task-assistance by recalling 3D locations of +objects of interest in the surrounding environment. This capability requires +instance tracking in real-world 3D scenes from egocentric videos (IT3DEgo). We +explore this problem by first introducing a new benchmark dataset, consisting +of RGB and depth videos, per-frame camera pose, and instance-level annotations +in both 2D camera and 3D world coordinates. We present an evaluation protocol +which evaluates tracking performance in 3D coordinates with two settings for +enrolling instances to track: (1) single-view online enrollment where an +instance is specified on-the-fly based on the human wearer's interactions. and +(2) multi-view pre-enrollment where images of an instance to be tracked are +stored in memory ahead of time. To address IT3DEgo, we first re-purpose methods +from relevant areas, e.g., single object tracking (SOT) -- running SOT methods +to track instances in 2D frames and lifting them to 3D using camera pose and +depth. We also present a simple method that leverages pretrained segmentation +and detection models to generate proposals from RGB frames and match proposals +with enrolled instance images. Perhaps surprisingly, our extensive experiments +show that our method (with no finetuning) significantly outperforms SOT-based +approaches. We conclude by arguing that the problem of egocentric instance +tracking is made easier by leveraging camera pose and using a 3D allocentric +(world) coordinate representation.",cs.CV,['cs.CV'] +ViT-Lens: Towards Omni-modal Representations,Stan Weixian Lei · Yixiao Ge · Kun Yi · Jianfeng Zhang · Difei Gao · Dylan Sun · Yuying Ge · Ying Shan · Mike Zheng Shou, ,https://arxiv.org/abs/2311.16081,,2311.16081.pdf,ViT-Lens: Towards Omni-modal Representations,"Aiming to advance AI agents, large foundation models significantly improve +reasoning and instruction execution, yet the current focus on vision and +language neglects the potential of perceiving diverse modalities in open-world +environments. 
However, the success of data-driven vision and language models is +costly or even infeasible to be reproduced for rare modalities. In this paper, +we present ViT-Lens-2 that facilitates efficient omni-modal representation +learning by perceiving novel modalities with a pretrained ViT and aligning them +to a pre-defined space. Specifically, the modality-specific lens is tuned to +project any-modal signals to an intermediate embedding space, which are then +processed by a strong ViT with pre-trained visual knowledge. The encoded +representations are optimized toward aligning with the modal-independent space, +pre-defined by off-the-shelf foundation models. ViT-Lens-2 provides a unified +solution for representation learning of increasing modalities with two +appealing advantages: (i) Unlocking the great potential of pretrained ViTs to +novel modalities effectively with efficient data regime; (ii) Enabling emergent +downstream capabilities through modality alignment and shared ViT parameters. +We tailor ViT-Lens-2 to learn representations for 3D point cloud, depth, audio, +tactile and EEG, and set new state-of-the-art results across various +understanding tasks, such as zero-shot classification. By seamlessly +integrating ViT-Lens-2 into Multimodal Foundation Models, we enable +Any-modality to Text and Image Generation in a zero-shot manner. Code and +models are available at https://github.com/TencentARC/ViT-Lens.",cs.CV,"['cs.CV', 'cs.AI']" +VideoDistill: Language-aware Vision Distillation for Video Question Answering,Bo Zou · Chao Yang · Yu Qiao · Chengbin Quan · Youjian Zhao, ,https://arxiv.org/abs/2404.00973,,2404.00973.pdf,VideoDistill: Language-aware Vision Distillation for Video Question Answering,"Significant advancements in video question answering (VideoQA) have been made +thanks to thriving large image-language pretraining frameworks. Although these +image-language models can efficiently represent both video and language +branches, they typically employ a goal-free vision perception process and do +not interact vision with language well during the answer generation, thus +omitting crucial visual cues. In this paper, we are inspired by the human +recognition and learning pattern and propose VideoDistill, a framework with +language-aware (i.e., goal-driven) behavior in both vision perception and +answer generation process. VideoDistill generates answers only from +question-related visual embeddings and follows a thinking-observing-answering +approach that closely resembles human behavior, distinguishing it from previous +research. Specifically, we develop a language-aware gating mechanism to replace +the standard cross-attention, avoiding language's direct fusion into visual +representations. We incorporate this mechanism into two key components of the +entire framework. The first component is a differentiable sparse sampling +module, which selects frames containing the necessary dynamics and semantics +relevant to the questions. The second component is a vision refinement module +that merges existing spatial-temporal attention layers to ensure the extraction +of multi-grained visual semantics associated with the questions. We conduct +experimental evaluations on various challenging video question-answering +benchmarks, and VideoDistill achieves state-of-the-art performance in both +general and long-form VideoQA datasets. 
In Addition, we verify that +VideoDistill can effectively alleviate the utilization of language shortcut +solutions in the EgoTaskQA dataset.",cs.CV,['cs.CV'] +Parameter Efficient Fine-tuning via Cross Block Orchestration for Segment Anything Model,Zelin Peng · Zhengqin Xu · Zhilin Zeng · Lingxi Xie · Qi Tian · Wei Shen, ,https://arxiv.org/abs/2311.17112,,2311.17112.pdf,Parameter Efficient Fine-tuning via Cross Block Orchestration for Segment Anything Model,"Parameter-efficient fine-tuning (PEFT) is an effective methodology to unleash +the potential of large foundation models in novel scenarios with limited +training data. In the computer vision community, PEFT has shown effectiveness +in image classification, but little research has studied its ability for image +segmentation. Fine-tuning segmentation models usually require a heavier +adjustment of parameters to align the proper projection directions in the +parameter space for new scenarios. This raises a challenge to existing PEFT +algorithms, as they often inject a limited number of individual parameters into +each block, which prevents substantial adjustment of the projection direction +of the parameter space due to the limitation of Hidden Markov Chain along +blocks. In this paper, we equip PEFT with a cross-block orchestration mechanism +to enable the adaptation of the Segment Anything Model (SAM) to various +downstream scenarios. We introduce a novel inter-block communication module, +which integrates a learnable relation matrix to facilitate communication among +different coefficient sets of each PEFT block's parameter space. Moreover, we +propose an intra-block enhancement module, which introduces a linear projection +head whose weights are generated from a hyper-complex layer, further enhancing +the impact of the adjustment of projection directions on the entire parameter +space. Extensive experiments on diverse benchmarks demonstrate that our +proposed approach consistently improves the segmentation performance +significantly on novel scenarios with only around 1K additional parameters.",cs.CV,['cs.CV'] +Generating Illustrated Instructions,Sachit Menon · Ishan Misra · Rohit Girdhar, ,https://arxiv.org/abs/2312.04552,,2312.04552.pdf,Generating Illustrated Instructions,"We introduce the new task of generating Illustrated Instructions, i.e., +visual instructions customized to a user's needs. We identify desiderata unique +to this task, and formalize it through a suite of automatic and human +evaluation metrics, designed to measure the validity, consistency, and efficacy +of the generations. We combine the power of large language models (LLMs) +together with strong text-to-image generation diffusion models to propose a +simple approach called StackedDiffusion, which generates such illustrated +instructions given text as input. The resulting model strongly outperforms +baseline approaches and state-of-the-art multimodal LLMs; and in 30% of cases, +users even prefer it to human-generated articles. Most notably, it enables +various new and exciting applications far beyond what static articles on the +web can provide, such as personalized instructions complete with intermediate +steps and pictures in response to a user's individual situation.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG', 'cs.MM']" +3D-LFM: Lifting Foundation Model,Mosam Dabhi · László A. 
Jeni · Simon Lucey, ,https://arxiv.org/abs/2312.11894,,2312.11894.pdf,3D-LFM: Lifting Foundation Model,"The lifting of 3D structure and camera from 2D landmarks is at the +cornerstone of the entire discipline of computer vision. Traditional methods +have been confined to specific rigid objects, such as those in +Perspective-n-Point (PnP) problems, but deep learning has expanded our +capability to reconstruct a wide range of object classes (e.g. C3DPO and PAUL) +with resilience to noise, occlusions, and perspective distortions. All these +techniques, however, have been limited by the fundamental need to establish +correspondences across the 3D training data -- significantly limiting their +utility to applications where one has an abundance of ""in-correspondence"" 3D +data. Our approach harnesses the inherent permutation equivariance of +transformers to manage varying number of points per 3D data instance, +withstands occlusions, and generalizes to unseen categories. We demonstrate +state of the art performance across 2D-3D lifting task benchmarks. Since our +approach can be trained across such a broad class of structures we refer to it +simply as a 3D Lifting Foundation Model (3D-LFM) -- the first of its kind.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +LLaMA-Excitor: General Instruction Tuning via Indirect Feature Interaction,Bo Zou · Chao Yang · Yu Qiao · Chengbin Quan · Youjian Zhao, ,https://arxiv.org/abs/2404.00913v1,,2404.00913v1.pdf,LLaMA-Excitor: General Instruction Tuning via Indirect Feature Interaction,"Existing methods to fine-tune LLMs, like Adapter, Prefix-tuning, and LoRA, +which introduce extra modules or additional input sequences to inject new +skills or knowledge, may compromise the innate abilities of LLMs. In this +paper, we propose LLaMA-Excitor, a lightweight method that stimulates the LLMs' +potential to better follow instructions by gradually paying more attention to +worthwhile information. Specifically, the LLaMA-Excitor does not directly +change the intermediate hidden state during the self-attention calculation of +the transformer structure. We designed the Excitor block as a bypass module for +the similarity score computation in LLMs' self-attention to reconstruct keys +and change the importance of values by learnable prompts. LLaMA-Excitor ensures +a self-adaptive allocation of additional attention to input instructions, thus +effectively preserving LLMs' pre-trained knowledge when fine-tuning LLMs on +low-quality instruction-following datasets. Furthermore, we unify the modeling +of multi-modal tuning and language-only tuning, extending LLaMA-Excitor to a +powerful visual instruction follower without the need for complex multi-modal +alignment. Our proposed approach is evaluated in language-only and multi-modal +tuning experimental scenarios. Notably, LLaMA-Excitor is the only method that +maintains basic capabilities while achieving a significant improvement (+6%) on +the MMLU benchmark. 
In the visual instruction tuning, we achieve a new +state-of-the-art image captioning performance of 157.5 CIDEr on MSCOCO, and a +comparable performance (88.39%) on ScienceQA to cutting-edge models with more +parameters and extensive vision-language pertaining.",cs.CV,"['cs.CV', 'cs.AI', 'cs.CL']" +3D Paintbrush: Local Stylization of 3D Shapes with Cascaded Score Distillation,Dale Decatur · Itai Lang · Kfir Aberman · Rana Hanocka, ,https://arxiv.org/abs/2311.09571,,2311.09571.pdf,3D Paintbrush: Local Stylization of 3D Shapes with Cascaded Score Distillation,"In this work we develop 3D Paintbrush, a technique for automatically +texturing local semantic regions on meshes via text descriptions. Our method is +designed to operate directly on meshes, producing texture maps which seamlessly +integrate into standard graphics pipelines. We opt to simultaneously produce a +localization map (to specify the edit region) and a texture map which conforms +to it. This synergistic approach improves the quality of both the localization +and the stylization. To enhance the details and resolution of the textured +area, we leverage multiple stages of a cascaded diffusion model to supervise +our local editing technique with generative priors learned from images at +different resolutions. Our technique, referred to as Cascaded Score +Distillation (CSD), simultaneously distills scores at multiple resolutions in a +cascaded fashion, enabling control over both the granularity and global +understanding of the supervision. We demonstrate the effectiveness of 3D +Paintbrush to locally texture a variety of shapes within different semantic +regions. Project page: https://threedle.github.io/3d-paintbrush",cs.GR,"['cs.GR', 'cs.CV']" +Epistemic Uncertainty Quantification For Pre-trained Neural Networks,Hanjing Wang · Qiang Ji, ,https://arxiv.org/abs/2404.10124,,2404.10124.pdf,Epistemic Uncertainty Quantification For Pre-trained Neural Network,"Epistemic uncertainty quantification (UQ) identifies where models lack +knowledge. Traditional UQ methods, often based on Bayesian neural networks, are +not suitable for pre-trained non-Bayesian models. Our study addresses +quantifying epistemic uncertainty for any pre-trained model, which does not +need the original training data or model modifications and can ensure broad +applicability regardless of network architectures or training techniques. +Specifically, we propose a gradient-based approach to assess epistemic +uncertainty, analyzing the gradients of outputs relative to model parameters, +and thereby indicating necessary model adjustments to accurately represent the +inputs. We first explore theoretical guarantees of gradient-based methods for +epistemic UQ, questioning the view that this uncertainty is only calculable +through differences between multiple models. We further improve gradient-driven +UQ by using class-specific weights for integrating gradients and emphasizing +distinct contributions from neural network layers. Additionally, we enhance UQ +accuracy by combining gradient and perturbation methods to refine the +gradients. 
We evaluate our approach on out-of-distribution detection, +uncertainty calibration, and active learning, demonstrating its superiority +over current state-of-the-art UQ methods for pre-trained models.",cs.LG,"['cs.LG', 'cs.CV']" +Teeth-SEG: An Efficient Instance Segmentation Framework for Orthodontic Treatment based on Anthropic Prior Knowledge,Bo Zou · Shaofeng Wang · Hao Liu · Gaoyue Sun · Yajie Wang · Zuo FeiFei · Chengbin Quan · Youjian Zhao, ,,https://paperswithcode.com/paper/teeth-seg-an-efficient-instance-segmentation,,,,,nan +Token Transformation Matters: Towards Faithful Post-hoc Explanation for Vision Transformer,Junyi Wu · Bin Duan · Weitai Kang · Hao Tang · Yan Yan, ,https://arxiv.org/abs/2403.14552,,2403.14552.pdf,Token Transformation Matters: Towards Faithful Post-hoc Explanation for Vision Transformer,"While Transformers have rapidly gained popularity in various computer vision +applications, post-hoc explanations of their internal mechanisms remain largely +unexplored. Vision Transformers extract visual information by representing +image regions as transformed tokens and integrating them via attention weights. +However, existing post-hoc explanation methods merely consider these attention +weights, neglecting crucial information from the transformed tokens, which +fails to accurately illustrate the rationales behind the models' predictions. +To incorporate the influence of token transformation into interpretation, we +propose TokenTM, a novel post-hoc explanation method that utilizes our +introduced measurement of token transformation effects. Specifically, we +quantify token transformation effects by measuring changes in token lengths and +correlations in their directions pre- and post-transformation. Moreover, we +develop initialization and aggregation rules to integrate both attention +weights and token transformation effects across all layers, capturing holistic +token contributions throughout the model. Experimental results on segmentation +and perturbation tests demonstrate the superiority of our proposed TokenTM +compared to state-of-the-art Vision Transformer explanation methods.",cs.CV,['cs.CV'] +Global Latent Neural Rendering,Thomas Tanay · Matteo Maggioni, ,https://arxiv.org/abs/2312.08338,,2312.08338.pdf,Global Latent Neural Rendering,"A recent trend among generalizable novel view synthesis methods is to learn a +rendering operator acting over single camera rays. This approach is promising +because it removes the need for explicit volumetric rendering, but it +effectively treats target images as collections of independent pixels. Here, we +propose to learn a global rendering operator acting over all camera rays +jointly. We show that the right representation to enable such rendering is a +5-dimensional plane sweep volume consisting of the projection of the input +images on a set of planes facing the target camera. Based on this +understanding, we introduce our Convolutional Global Latent Renderer (ConvGLR), +an efficient convolutional architecture that performs the rendering operation +globally in a low-resolution latent space. 
Experiments on various datasets +under sparse and generalizable setups show that our approach consistently +outperforms existing methods by significant margins.",cs.CV,['cs.CV'] +MeshGPT: Generating Triangle Meshes with Decoder-Only Transformers,Yawar Siddiqui · Antonio Alliegro · Alexey Artemov · Tatiana Tommasi · Daniele Sirigatti · Vladislav Rosov · Angela Dai · Matthias Nießner, ,https://arxiv.org/abs/2311.15475,,2311.15475.pdf,MeshGPT: Generating Triangle Meshes with Decoder-Only Transformers,"We introduce MeshGPT, a new approach for generating triangle meshes that +reflects the compactness typical of artist-created meshes, in contrast to dense +triangle meshes extracted by iso-surfacing methods from neural fields. Inspired +by recent advances in powerful large language models, we adopt a sequence-based +approach to autoregressively generate triangle meshes as sequences of +triangles. We first learn a vocabulary of latent quantized embeddings, using +graph convolutions, which inform these embeddings of the local mesh geometry +and topology. These embeddings are sequenced and decoded into triangles by a +decoder, ensuring that they can effectively reconstruct the mesh. A transformer +is then trained on this learned vocabulary to predict the index of the next +embedding given previous embeddings. Once trained, our model can be +autoregressively sampled to generate new triangle meshes, directly generating +compact meshes with sharp edges, more closely imitating the efficient +triangulation patterns of human-crafted meshes. MeshGPT demonstrates a notable +improvement over state of the art mesh generation methods, with a 9% increase +in shape coverage and a 30-point enhancement in FID scores across various +categories.",cs.CV,"['cs.CV', 'cs.LG']" +Video Recognition in Portrait Mode,Mingfei Han · Linjie Yang · Xiaojie Jin · Jiashi Feng · Xiaojun Chang · Heng Wang, ,https://arxiv.org/abs/2312.13746v1,,2312.13746v1.pdf,Video Recognition in Portrait Mode,"The creation of new datasets often presents new challenges for video +recognition and can inspire novel ideas while addressing these challenges. +While existing datasets mainly comprise landscape mode videos, our paper seeks +to introduce portrait mode videos to the research community and highlight the +unique challenges associated with this video format. With the growing +popularity of smartphones and social media applications, recognizing portrait +mode videos is becoming increasingly important. To this end, we have developed +the first dataset dedicated to portrait mode video recognition, namely +PortraitMode-400. The taxonomy of PortraitMode-400 was constructed in a +data-driven manner, comprising 400 fine-grained categories, and rigorous +quality assurance was implemented to ensure the accuracy of human annotations. +In addition to the new dataset, we conducted a comprehensive analysis of the +impact of video format (portrait mode versus landscape mode) on recognition +accuracy and spatial bias due to the different formats. Furthermore, we +designed extensive experiments to explore key aspects of portrait mode video +recognition, including the choice of data augmentation, evaluation procedure, +the importance of temporal information, and the role of audio modality. 
+Building on the insights from our experimental results and the introduction of +PortraitMode-400, our paper aims to inspire further research efforts in this +emerging research area.",cs.CV,['cs.CV'] +VGGSfM: Visual Geometry Grounded Deep Structure From Motion,Jianyuan Wang · Nikita Karaev · Christian Rupprecht · David Novotny, ,https://arxiv.org/abs/2312.04563,,2312.04563.pdf,Visual Geometry Grounded Deep Structure From Motion,"Structure-from-motion (SfM) is a long-standing problem in the computer vision +community, which aims to reconstruct the camera poses and 3D structure of a +scene from a set of unconstrained 2D images. Classical frameworks solve this +problem in an incremental manner by detecting and matching keypoints, +registering images, triangulating 3D points, and conducting bundle adjustment. +Recent research efforts have predominantly revolved around harnessing the power +of deep learning techniques to enhance specific elements (e.g., keypoint +matching), but are still based on the original, non-differentiable pipeline. +Instead, we propose a new deep pipeline VGGSfM, where each component is fully +differentiable and thus can be trained in an end-to-end manner. To this end, we +introduce new mechanisms and simplifications. First, we build on recent +advances in deep 2D point tracking to extract reliable pixel-accurate tracks, +which eliminates the need for chaining pairwise matches. Furthermore, we +recover all cameras simultaneously based on the image and track features +instead of gradually registering cameras. Finally, we optimise the cameras and +triangulate 3D points via a differentiable bundle adjustment layer. We attain +state-of-the-art performance on three popular datasets, CO3D, IMC Phototourism, +and ETH3D.",cs.CV,"['cs.CV', 'cs.RO']" +Intrinsic Image Diffusion for Indoor Single-view Material Estimation,Peter Kocsis · Vincent Sitzmann · Matthias Nießner,https://peter-kocsis.github.io/IntrinsicImageDiffusion/,https://arxiv.org/abs/2312.12274,,2312.12274.pdf,Intrinsic Image Diffusion for Indoor Single-view Material Estimation,"We present Intrinsic Image Diffusion, a generative model for appearance +decomposition of indoor scenes. Given a single input view, we sample multiple +possible material explanations represented as albedo, roughness, and metallic +maps. Appearance decomposition poses a considerable challenge in computer +vision due to the inherent ambiguity between lighting and material properties +and the lack of real datasets. To address this issue, we advocate for a +probabilistic formulation, where instead of attempting to directly predict the +true material properties, we employ a conditional generative model to sample +from the solution space. Furthermore, we show that utilizing the strong learned +prior of recent diffusion models trained on large-scale real-world images can +be adapted to material estimation and highly improves the generalization to +real images. Our method produces significantly sharper, more consistent, and +more detailed materials, outperforming state-of-the-art methods by $1.5dB$ on +PSNR and by $45\%$ better FID score on albedo prediction. 
We demonstrate the +effectiveness of our approach through experiments on both synthetic and +real-world datasets.",cs.CV,"['cs.CV', 'cs.AI', 'cs.GR', 'I.4.8; I.2.10']" +An N-Point Linear Solver for Line and Motion Estimation with Event Cameras,Ling Gao · Daniel Gehrig · Hang Su · Davide Scaramuzza · Laurent Kneip,https://mgaoling.github.io/eventail/,https://arxiv.org/abs/2404.00842v1,,2404.00842v1.pdf,An N-Point Linear Solver for Line and Motion Estimation with Event Cameras,"Event cameras respond primarily to edges--formed by strong gradients--and are +thus particularly well-suited for line-based motion estimation. Recent work has +shown that events generated by a single line each satisfy a polynomial +constraint which describes a manifold in the space-time volume. Multiple such +constraints can be solved simultaneously to recover the partial linear velocity +and line parameters. In this work, we show that, with a suitable line +parametrization, this system of constraints is actually linear in the unknowns, +which allows us to design a novel linear solver. Unlike existing solvers, our +linear solver (i) is fast and numerically stable since it does not rely on +expensive root finding, (ii) can solve both minimal and overdetermined systems +with more than 5 events, and (iii) admits the characterization of all +degenerate cases and multiple solutions. The found line parameters are +singularity-free and have a fixed scale, which eliminates the need for +auxiliary constraints typically encountered in previous work. To recover the +full linear camera velocity we fuse observations from multiple lines with a +novel velocity averaging scheme that relies on a geometrically-motivated +residual, and thus solves the problem more efficiently than previous schemes +which minimize an algebraic residual. Extensive experiments in synthetic and +real-world settings demonstrate that our method surpasses the previous work in +numerical stability, and operates over 600 times faster.",cs.CV,['cs.CV'] +Benchmarking Segmentation Models with Mask-Preserved Attribute Editing,Zijin Yin · Kongming Liang · Bing Li · Zhanyu Ma · Jun Guo, ,https://arxiv.org/abs/2403.01231,,2403.01231.pdf,Benchmarking Segmentation Models with Mask-Preserved Attribute Editing,"When deploying segmentation models in practice, it is critical to evaluate +their behaviors in varied and complex scenes. Different from the previous +evaluation paradigms only in consideration of global attribute variations (e.g. +adverse weather), we investigate both local and global attribute variations for +robustness evaluation. To achieve this, we construct a mask-preserved attribute +editing pipeline to edit visual attributes of real images with precise control +of structural information. Therefore, the original segmentation labels can be +reused for the edited images. Using our pipeline, we construct a benchmark +covering both object and image attributes (e.g. color, material, pattern, +style). We evaluate a broad variety of semantic segmentation models, spanning +from conventional close-set models to recent open-vocabulary large models on +their robustness to different types of variations. We find that both local and +global attribute variations affect segmentation performances, and the +sensitivity of models diverges across different variation types. We argue that +local attributes have the same importance as global attributes, and should be +considered in the robustness evaluation of segmentation models. 
Code: +https://github.com/PRIS-CV/Pascal-EA.",cs.CV,['cs.CV'] +How to Make Cross Encoder a Good Teacher for Efficient Image-Text Retrieval?,Yuxin Chen · Zongyang Ma · Ziqi Zhang · Zhongang Qi · Chunfeng Yuan · Bing Li · Junfu Pu · Ying Shan · Xiaojuan Qi · Weiming Hu, ,https://arxiv.org/abs/2310.19654,,2310.19654.pdf,MCAD: Multi-teacher Cross-modal Alignment Distillation for efficient image-text retrieval,"Due to the success of large-scale visual-language pretraining (VLP) models +and the widespread use of image-text retrieval in industry areas, it is now +critically necessary to reduce the model size and streamline their +mobile-device deployment. Single- and dual-stream model structures are commonly +used in image-text retrieval with the goal of closing the semantic gap between +textual and visual modalities. While single-stream models use deep feature +fusion to achieve more accurate cross-model alignment, dual-stream models are +better at offline indexing and fast inference. We propose a Multi-teacher +Cross-modality Alignment Distillation (MCAD) technique to integrate the +advantages of single- and dual-stream models. By incorporating the fused +single-stream features into the image and text features of the dual-stream +model, we formulate new modified teacher similarity distributions and features. +Then, we conduct both distribution and feature distillation to boost the +capability of the student dual-stream model, achieving high retrieval +performance without increasing inference complexity. Extensive experiments +demonstrate the remarkable performance and high efficiency of MCAD on +image-text retrieval tasks. Furthermore, we implement a lightweight CLIP model +on Snapdragon/Dimensity chips with only $\sim$100M running memory and +$\sim$8.0ms search latency, achieving the mobile-device application of VLP +models.",cs.CV,"['cs.CV', 'cs.AI']" +A Unified Diffusion Framework for Scene-aware Human Motion Estimation from Sparse Signals,Jiangnan Tang · Jingya Wang · Kaiyang Ji · Lan Xu · Jingyi Yu · Ye Shi, ,https://arxiv.org/abs/2404.04890,,2404.04890.pdf,A Unified Diffusion Framework for Scene-aware Human Motion Estimation from Sparse Signals,"Estimating full-body human motion via sparse tracking signals from +head-mounted displays and hand controllers in 3D scenes is crucial to +applications in AR/VR. One of the biggest challenges to this task is the +one-to-many mapping from sparse observations to dense full-body motions, which +endowed inherent ambiguities. To help resolve this ambiguous problem, we +introduce a new framework to combine rich contextual information provided by +scenes to benefit full-body motion tracking from sparse observations. To +estimate plausible human motions given sparse tracking signals and 3D scenes, +we develop $\text{S}^2$Fusion, a unified framework fusing \underline{S}cene and +sparse \underline{S}ignals with a conditional dif\underline{Fusion} model. +$\text{S}^2$Fusion first extracts the spatial-temporal relations residing in +the sparse signals via a periodic autoencoder, and then produces time-alignment +feature embedding as additional inputs. Subsequently, by drawing initial noisy +motion from a pre-trained prior, $\text{S}^2$Fusion utilizes conditional +diffusion to fuse scene geometry and sparse tracking signals to generate +full-body scene-aware motions.
The sampling procedure of $\text{S}^2$Fusion is +further guided by a specially designed scene-penetration loss and +phase-matching loss, which effectively regularizes the motion of the lower body +even in the absence of any tracking signals, making the generated motion much +more plausible and coherent. Extensive experimental results have demonstrated +that our $\text{S}^2$Fusion outperforms the state-of-the-art in terms of +estimation quality and smoothness.",cs.CV,['cs.CV'] +Deformable 3D Gaussians for High-Fidelity Monocular Dynamic Scene Reconstruction,Ziyi Yang · Xinyu Gao · Wen Zhou · Shaohui Jiao · Yuqing Zhang · Xiaogang Jin, ,https://arxiv.org/abs/2309.13101,,2309.13101.pdf,Deformable 3D Gaussians for High-Fidelity Monocular Dynamic Scene Reconstruction,"Implicit neural representation has paved the way for new approaches to +dynamic scene reconstruction and rendering. Nonetheless, cutting-edge dynamic +neural rendering methods rely heavily on these implicit representations, which +frequently struggle to capture the intricate details of objects in the scene. +Furthermore, implicit methods have difficulty achieving real-time rendering in +general dynamic scenes, limiting their use in a variety of tasks. To address +the issues, we propose a deformable 3D Gaussians Splatting method that +reconstructs scenes using 3D Gaussians and learns them in canonical space with +a deformation field to model monocular dynamic scenes. We also introduce an +annealing smoothing training mechanism with no extra overhead, which can +mitigate the impact of inaccurate poses on the smoothness of time interpolation +tasks in real-world datasets. Through a differential Gaussian rasterizer, the +deformable 3D Gaussians not only achieve higher rendering quality but also +real-time rendering speed. Experiments show that our method outperforms +existing methods significantly in terms of both rendering quality and speed, +making it well-suited for tasks such as novel-view synthesis, time +interpolation, and real-time rendering.",cs.CV,['cs.CV'] +Towards Generalizable Tumor Synthesis,Qi Chen · Xiaoxi Chen · Haorui Song · Alan L. Yuille · Zhiwei Xiong · Chen Wei · Zongwei Zhou, ,https://arxiv.org/abs/2402.19470,,2402.19470.pdf,Towards Generalizable Tumor Synthesis,"Tumor synthesis enables the creation of artificial tumors in medical images, +facilitating the training of AI models for tumor detection and segmentation. +However, success in tumor synthesis hinges on creating visually realistic +tumors that are generalizable across multiple organs and, furthermore, the +resulting AI models being capable of detecting real tumors in images sourced +from different domains (e.g., hospitals). This paper made a progressive stride +toward generalizable tumor synthesis by leveraging a critical observation: +early-stage tumors (< 2cm) tend to have similar imaging characteristics in +computed tomography (CT), whether they originate in the liver, pancreas, or +kidneys. We have ascertained that generative AI models, e.g., Diffusion Models, +can create realistic tumors generalized to a range of organs even when trained +on a limited number of tumor examples from only one organ. 
Moreover, we have +shown that AI models trained on these synthetic tumors can be generalized to +detect and segment real tumors from CT volumes, encompassing a broad spectrum +of patient demographics, imaging protocols, and healthcare facilities.",eess.IV,"['eess.IV', 'cs.CV']" +Prompt3D: Random Prompt Assisted Weakly-Supervised 3D Object Detection,Xiaohong Zhang · Huisheng Ye · Jingwen Li · Qinyu Tang · Yuanqi Li · Yanwen Guo · Jie Guo,https://huishengye.github.io/prompt3d/,https://arxiv.org/abs/2312.07530,,2312.07530.pdf,Weakly Supervised 3D Object Detection via Multi-Level Visual Guidance,"Weakly supervised 3D object detection aims to learn a 3D detector with lower +annotation cost, e.g., 2D labels. Unlike prior work which still relies on few +accurate 3D annotations, we propose a framework to study how to leverage +constraints between 2D and 3D domains without requiring any 3D labels. +Specifically, we employ visual data from three perspectives to establish +connections between 2D and 3D domains. First, we design a feature-level +constraint to align LiDAR and image features based on object-aware regions. +Second, the output-level constraint is developed to enforce the overlap between +2D and projected 3D box estimations. Finally, the training-level constraint is +utilized by producing accurate and consistent 3D pseudo-labels that align with +the visual data. We conduct extensive experiments on the KITTI dataset to +validate the effectiveness of the proposed three constraints. Without using any +3D labels, our method achieves favorable performance against state-of-the-art +approaches and is competitive with the method that uses 500-frame 3D +annotations. Code and models will be made publicly available at +https://github.com/kuanchihhuang/VG-W3D.",cs.CV,['cs.CV'] +Discriminative Pattern Calibration Mechanism for Source-Free Domain Adaptation,Haifeng Xia · Siyu Xia · Zhengming Ding, ,https://arxiv.org/abs/2405.02954,,2405.02954.pdf,Source-Free Domain Adaptation Guided by Vision and Vision-Language Pre-Training,"Source-free domain adaptation (SFDA) aims to adapt a source model trained on +a fully-labeled source domain to a related but unlabeled target domain. While +the source model is a key avenue for acquiring target pseudolabels, the +generated pseudolabels may exhibit source bias. In the conventional SFDA +pipeline, a large data (e.g. ImageNet) pre-trained feature extractor is used to +initialize the source model at the start of source training, and subsequently +discarded. Despite having diverse features important for generalization, the +pre-trained feature extractor can overfit to the source data distribution +during source training and forget relevant target domain knowledge. Rather than +discarding this valuable knowledge, we introduce an integrated framework to +incorporate pre-trained networks into the target adaptation process. The +proposed framework is flexible and allows us to plug modern pre-trained +networks into the adaptation process to leverage their stronger representation +learning capabilities. For adaptation, we propose the Co-learn algorithm to +improve target pseudolabel quality collaboratively through the source model and +a pre-trained feature extractor. Building on the recent success of the +vision-language model CLIP in zero-shot image recognition, we present an +extension Co-learn++ to further incorporate CLIP's zero-shot classification +decisions. 
We evaluate on 3 benchmark datasets and include more challenging +scenarios such as open-set, partial-set and open-partial SFDA. Experimental +results demonstrate that our proposed strategy improves adaptation performance +and can be successfully integrated with existing SFDA methods.",cs.CV,"['cs.CV', 'cs.LG']" +Reconstruction-free Cascaded Adaptive Compressive Sensing,Chenxi Qiu · Tao Yue · Xuemei Hu, ,https://arxiv.org/abs/2403.17006,,2403.17006.pdf,Invertible Diffusion Models for Compressed Sensing,"While deep neural networks (NN) significantly advance image compressed +sensing (CS) by improving reconstruction quality, the necessity of training +current CS NNs from scratch constrains their effectiveness and hampers rapid +deployment. Although recent methods utilize pre-trained diffusion models for +image reconstruction, they struggle with slow inference and restricted +adaptability to CS. To tackle these challenges, this paper proposes Invertible +Diffusion Models (IDM), a novel efficient, end-to-end diffusion-based CS +method. IDM repurposes a large-scale diffusion sampling process as a +reconstruction model, and finetunes it end-to-end to recover original images +directly from CS measurements, moving beyond the traditional paradigm of +one-step noise estimation learning. To enable such memory-intensive end-to-end +finetuning, we propose a novel two-level invertible design to transform both +(1) the multi-step sampling process and (2) the noise estimation U-Net in each +step into invertible networks. As a result, most intermediate features are +cleared during training to reduce up to 93.8% GPU memory. In addition, we +develop a set of lightweight modules to inject measurements into noise +estimator to further facilitate reconstruction. Experiments demonstrate that +IDM outperforms existing state-of-the-art CS networks by up to 2.64dB in PSNR. +Compared to the recent diffusion model-based approach DDNM, our IDM achieves up +to 10.09dB PSNR gain and 14.54 times faster inference.",cs.CV,['cs.CV'] +CAD-SIGNet: CAD Language Inference from Point Clouds using Layer-wise Sketch Instance Guided Attention,Mohammad Sadil Khan · Elona Dupont · Sk Aziz Ali · Kseniya Cherenkova · Anis Kacem · Djamila Aouada,https://cvi2.uni.lu/cadsig-net/,https://arxiv.org/abs/2402.17678,,2402.17678.pdf,CAD-SIGNet: CAD Language Inference from Point Clouds using Layer-wise Sketch Instance Guided Attention,"Reverse engineering in the realm of Computer-Aided Design (CAD) has been a +longstanding aspiration, though not yet entirely realized. Its primary aim is +to uncover the CAD process behind a physical object given its 3D scan. We +propose CAD-SIGNet, an end-to-end trainable and auto-regressive architecture to +recover the design history of a CAD model represented as a sequence of +sketch-and-extrusion from an input point cloud. Our model learns +visual-language representations by layer-wise cross-attention between point +cloud and CAD language embedding. In particular, a new Sketch instance Guided +Attention (SGA) module is proposed in order to reconstruct the fine-grained +details of the sketches. Thanks to its auto-regressive nature, CAD-SIGNet not +only reconstructs a unique full design history of the corresponding CAD model +given an input point cloud but also provides multiple plausible design choices. +This allows for an interactive reverse engineering scenario by providing +designers with multiple next-step choices along with the design process. 
+Extensive experiments on publicly available CAD datasets showcase the +effectiveness of our approach against existing baseline models in two settings, +namely, full design history recovery and conditional auto-completion from point +clouds.",cs.CV,['cs.CV'] +SG-BEV: Satellite-Guided BEV Fusion for Cross-View Semantic Segmentation,Junyan Ye · Qiyan Luo · Jinhua Yu · Huaping Zhong · Zhimeng Zheng · Conghui He · Weijia Li, ,https://arxiv.org/abs/2404.02638,,2404.02638.pdf,SG-BEV: Satellite-Guided BEV Fusion for Cross-View Semantic Segmentation,"This paper aims at achieving fine-grained building attribute segmentation in +a cross-view scenario, i.e., using satellite and street-view image pairs. The +main challenge lies in overcoming the significant perspective differences +between street views and satellite views. In this work, we introduce SG-BEV, a +novel approach for satellite-guided BEV fusion for cross-view semantic +segmentation. To overcome the limitations of existing cross-view projection +methods in capturing the complete building facade features, we innovatively +incorporate Bird's Eye View (BEV) method to establish a spatially explicit +mapping of street-view features. Moreover, we fully leverage the advantages of +multiple perspectives by introducing a novel satellite-guided reprojection +module, optimizing the uneven feature distribution issues associated with +traditional BEV methods. Our method demonstrates significant improvements on +four cross-view datasets collected from multiple cities, including New York, +San Francisco, and Boston. On average across these datasets, our method +achieves an increase in mIOU by 10.13% and 5.21% compared with the +state-of-the-art satellite-based and cross-view methods. The code and datasets +of this work will be released at https://github.com/yejy53/SG-BEV.",cs.CV,['cs.CV'] +MSU-4S - The Michigan State University Four Seasons Dataset,Daniel Kent · Mohammed Alyaqoub · Xiaohu Lu · Sayed Khatounabadi · Kookjin Sung · Cole Scheller · Alexander Dalat · Xinwei Guo · Asma Bin Thabit · Roberto Muntaner Whitley · Hayder Radha, ,,https://msuspartans.com/news/2024/5/1/womens-basketball-fralick-adds-four-to-womens-basketball-roster.aspx?print=true,,,,,nan +Retraining-free Model Quantization via One-Shot Weight-Coupling Learning,Chen Tang · Yuan Meng · Jiacheng Jiang · Shuzhao Xie · Rongwei Lu · Xinzhu Ma · Zhi Wang · Wenwu Zhu, ,https://arxiv.org/abs/2401.01543,,2401.01543.pdf,Retraining-free Model Quantization via One-Shot Weight-Coupling Learning,"Quantization is of significance for compressing the over-parameterized deep +neural models and deploying them on resource-limited devices. Fixed-precision +quantization suffers from performance drop due to the limited numerical +representation ability. Conversely, mixed-precision quantization (MPQ) is +advocated to compress the model effectively by allocating heterogeneous +bit-width for layers. MPQ is typically organized into a searching-retraining +two-stage process. Previous works only focus on determining the optimal +bit-width configuration in the first stage efficiently, while ignoring the +considerable time costs in the second stage. However, retraining always +consumes hundreds of GPU-hours on the cutting-edge GPUs, thus hindering +deployment efficiency significantly. In this paper, we devise a one-shot +training-searching paradigm for mixed-precision model compression. 
+Specifically, in the first stage, all potential bit-width configurations are +coupled and thus optimized simultaneously within a set of shared weights. +However, our observations reveal a previously unseen and severe bit-width +interference phenomenon among highly coupled weights during optimization, +leading to considerable performance degradation under a high compression ratio. +To tackle this problem, we first design a bit-width scheduler to dynamically +freeze the most turbulent bit-width of layers during training, to ensure the +rest bit-widths converged properly. Then, taking inspiration from information +theory, we present an information distortion mitigation technique to align the +behaviour of the bad-performing bit-widths to the well-performing ones.",cs.CV,['cs.CV'] +Arbitrary-Scale Image Generation and Upsampling using Latent Diffusion Model and Implicit Neural Decoder,Jinseok Kim · Tae-Kyun Kim, ,https://arxiv.org/abs/2403.10255,,2403.10255.pdf,Arbitrary-Scale Image Generation and Upsampling using Latent Diffusion Model and Implicit Neural Decoder,"Super-resolution (SR) and image generation are important tasks in computer +vision and are widely adopted in real-world applications. Most existing +methods, however, generate images only at fixed-scale magnification and suffer +from over-smoothing and artifacts. Additionally, they do not offer enough +diversity of output images nor image consistency at different scales. Most +relevant work applied Implicit Neural Representation (INR) to the denoising +diffusion model to obtain continuous-resolution yet diverse and high-quality SR +results. Since this model operates in the image space, the larger the +resolution of image is produced, the more memory and inference time is +required, and it also does not maintain scale-specific consistency. We propose +a novel pipeline that can super-resolve an input image or generate from a +random noise a novel image at arbitrary scales. The method consists of a +pretrained auto-encoder, a latent diffusion model, and an implicit neural +decoder, and their learning strategies. The proposed method adopts diffusion +processes in a latent space, thus efficient, yet aligned with output image +space decoded by MLPs at arbitrary scales. More specifically, our +arbitrary-scale decoder is designed by the symmetric decoder w/o up-scaling +from the pretrained auto-encoder, and Local Implicit Image Function (LIIF) in +series. The latent diffusion process is learnt by the denoising and the +alignment losses jointly. Errors in output images are backpropagated via the +fixed decoder, improving the quality of output images. In the extensive +experiments using multiple public benchmarks on the two tasks i.e. image +super-resolution and novel image generation at arbitrary scales, the proposed +method outperforms relevant methods in metrics of image quality, diversity and +scale consistency. 
It is significantly better than the relevant prior-art in +the inference speed and memory usage.",cs.CV,['cs.CV'] +Incremental Nuclei Segmentation from Histopathological Images via Future-class Awareness and Compatibility-inspired Distillation,Huyong Wang · Huisi Wu · Jing Qin, ,,https://bmcmedimaging.biomedcentral.com/articles/10.1186/s12880-023-01121-3,,,,,nan +PLGSLAM: Progressive Neural Scene Represenation with Local to Global Bundle Adjustment,Tianchen Deng · Guole Shen · Tong Qin · jianyu wang · Wentao Zhao · Jingchuan Wang · Danwei Wang · Weidong Chen, ,https://arxiv.org/abs/2312.09866,,2312.09866.pdf,PLGSLAM: Progressive Neural Scene Represenation with Local to Global Bundle Adjustment,"Neural implicit scene representations have recently shown encouraging results +in dense visual SLAM. However, existing methods produce low-quality scene +reconstruction and low-accuracy localization performance when scaling up to +large indoor scenes and long sequences. These limitations are mainly due to +their single, global radiance field with finite capacity, which does not adapt +to large scenarios. Their end-to-end pose networks are also not robust enough +with the growth of cumulative errors in large scenes. To this end, we introduce +PLGSLAM, a neural visual SLAM system capable of high-fidelity surface +reconstruction and robust camera tracking in real-time. To handle large-scale +indoor scenes, PLGSLAM proposes a progressive scene representation method which +dynamically allocates new local scene representation trained with frames within +a local sliding window. This allows us to scale up to larger indoor scenes and +improves robustness (even under pose drifts). In local scene representation, +PLGSLAM utilizes tri-planes for local high-frequency features with multi-layer +perceptron (MLP) networks for the low-frequency feature, achieving smoothness +and scene completion in unobserved areas. Moreover, we propose local-to-global +bundle adjustment method with a global keyframe database to address the +increased pose drifts on long sequences. Experimental results demonstrate that +PLGSLAM achieves state-of-the-art scene reconstruction results and tracking +performance across various datasets and scenarios (both in small and +large-scale indoor environments).",cs.CV,['cs.CV'] +Bayesian Exploration of Pre-trained Models for Low-shot Image Classification,Yibo Miao · Yu lei · Feng Zhou · Zhijie Deng, ,https://arxiv.org/abs/2404.00312,,2404.00312.pdf,Bayesian Exploration of Pre-trained Models for Low-shot Image Classification,"Low-shot image classification is a fundamental task in computer vision, and +the emergence of large-scale vision-language models such as CLIP has greatly +advanced the forefront of research in this field. However, most existing +CLIP-based methods lack the flexibility to effectively incorporate other +pre-trained models that encompass knowledge distinct from CLIP. To bridge the +gap, this work proposes a simple and effective probabilistic model ensemble +framework based on Gaussian processes, which have previously demonstrated +remarkable efficacy in processing small data. We achieve the integration of +prior knowledge by specifying the mean function with CLIP and the kernel +function with an ensemble of deep kernels built upon various pre-trained +models. By regressing the classification label directly, our framework enables +analytical inference, straightforward uncertainty quantification, and +principled hyper-parameter tuning. 
Through extensive experiments on standard +benchmarks, we demonstrate that our method consistently outperforms competitive +ensemble baselines regarding predictive performance. Additionally, we assess +the robustness of our method and the quality of the yielded uncertainty +estimates on out-of-distribution datasets. We also illustrate that our method, +despite relying on label regression, still enjoys superior model calibration +compared to most deterministic baselines.",cs.CV,"['cs.CV', 'cs.AI']" +What If the TV Was Off? Examining Counterfactual Reasoning Abilities of Multi-modal Language Models,Letian Zhang · Xiaotong Zhai · Zhongkai Zhao · Yongshuo Zong · Xin Wen · Bingchen Zhao, ,https://arxiv.org/abs/2310.06627,,2310.06627.pdf,What If the TV Was Off? Examining Counterfactual Reasoning Abilities of Multi-modal Language Models,"Counterfactual reasoning, a fundamental aspect of human cognition, involves +contemplating alternatives to established facts or past events, significantly +enhancing our abilities in planning and decision-making. In light of the +advancements in current multi-modal large language models, we explore their +effectiveness in counterfactual reasoning. To facilitate this investigation, we +introduce a novel dataset, C-VQA, specifically designed to test the +counterfactual reasoning capabilities of modern multi-modal large language +models. This dataset is constructed by infusing original questions with +counterfactual presuppositions, spanning various types such as numerical and +boolean queries. It encompasses a mix of real and synthetic data, representing +a wide range of difficulty levels. Our thorough evaluations of contemporary +vision-language models using this dataset have revealed substantial performance +drops, with some models showing up to a 40% decrease, highlighting a +significant gap between current models and human-like vision reasoning +capabilities. We hope our dataset will serve as a vital benchmark for +evaluating the counterfactual reasoning capabilities of models. Code and +dataset are publicly available at https://bzhao.me/C-VQA/.",cs.CL,"['cs.CL', 'cs.CV', 'cs.LG']" +Vision-and-Language Navigation via Causal Learning,Liuyi Wang · Zongtao He · Ronghao Dang · mengjiao shen · Chengju Liu · Qijun Chen, ,https://arxiv.org/abs/2404.10241,,2404.10241.pdf,Vision-and-Language Navigation via Causal Learning,"In the pursuit of robust and generalizable environment perception and +language understanding, the ubiquitous challenge of dataset bias continues to +plague vision-and-language navigation (VLN) agents, hindering their performance +in unseen environments. This paper introduces the generalized cross-modal +causal transformer (GOAT), a pioneering solution rooted in the paradigm of +causal inference. By delving into both observable and unobservable confounders +within vision, language, and history, we propose the back-door and front-door +adjustment causal learning (BACL and FACL) modules to promote unbiased learning +by comprehensively mitigating potential spurious correlations. Additionally, to +capture global confounder features, we propose a cross-modal feature pooling +(CFP) module supervised by contrastive learning, which is also shown to be +effective in improving cross-modal representations during pre-training. +Extensive experiments across multiple VLN datasets (R2R, REVERIE, RxR, and +SOON) underscore the superiority of our proposed method over previous +state-of-the-art approaches. 
Code is available at +https://github.com/CrystalSixone/VLN-GOAT.",cs.CV,"['cs.CV', 'cs.AI']" +TIM: A Time Interval Machine for Audio-Visual Action Recognition,Jacob Chalk · Jaesung Huh · Evangelos Kazakos · Andrew Zisserman · Dima Damen,https://jacobchalk.github.io/TIM-Project/,https://arxiv.org/abs/2404.05559,,2404.05559.pdf,TIM: A Time Interval Machine for Audio-Visual Action Recognition,"Diverse actions give rise to rich audio-visual signals in long videos. Recent +works showcase that the two modalities of audio and video exhibit different +temporal extents of events and distinct labels. We address the interplay +between the two modalities in long videos by explicitly modelling the temporal +extents of audio and visual events. We propose the Time Interval Machine (TIM) +where a modality-specific time interval poses as a query to a transformer +encoder that ingests a long video input. The encoder then attends to the +specified interval, as well as the surrounding context in both modalities, in +order to recognise the ongoing action. + We test TIM on three long audio-visual video datasets: EPIC-KITCHENS, +Perception Test, and AVE, reporting state-of-the-art (SOTA) for recognition. On +EPIC-KITCHENS, we beat previous SOTA that utilises LLMs and significantly +larger pre-training by 2.9% top-1 action recognition accuracy. Additionally, we +show that TIM can be adapted for action detection, using dense multi-scale +interval queries, outperforming SOTA on EPIC-KITCHENS-100 for most metrics, and +showing strong performance on the Perception Test. Our ablations show the +critical role of integrating the two modalities and modelling their time +intervals in achieving this performance. Code and models at: +https://github.com/JacobChalk/TIM",cs.CV,['cs.CV'] +Retrieval-Augmented Open-Vocabulary Object Detection,Jooyeon Kim · Eulrang Cho · Sehyung Kim · Hyunwoo J. Kim, ,https://arxiv.org/abs/2404.05687,,2404.05687.pdf,Retrieval-Augmented Open-Vocabulary Object Detection,"Open-vocabulary object detection (OVD) has been studied with Vision-Language +Models (VLMs) to detect novel objects beyond the pre-trained categories. +Previous approaches improve the generalization ability to expand the knowledge +of the detector, using 'positive' pseudo-labels with additional 'class' names, +e.g., sock, iPod, and alligator. To extend the previous methods in two aspects, +we propose Retrieval-Augmented Losses and visual Features (RALF). Our method +retrieves related 'negative' classes and augments loss functions. Also, visual +features are augmented with 'verbalized concepts' of classes, e.g., worn on the +feet, handheld music player, and sharp teeth. Specifically, RALF consists of +two modules: Retrieval Augmented Losses (RAL) and Retrieval-Augmented visual +Features (RAF). RAL constitutes two losses reflecting the semantic similarity +with negative vocabularies. In addition, RAF augments visual features with the +verbalized concepts from a large language model (LLM). Our experiments +demonstrate the effectiveness of RALF on COCO and LVIS benchmark datasets. We +achieve improvement up to 3.4 box AP$_{50}^{\text{N}}$ on novel categories of +the COCO dataset and 3.6 mask AP$_{\text{r}}$ gains on the LVIS dataset. 
Code +is available at https://github.com/mlvlab/RALF .",cs.CV,['cs.CV'] +Continual Motion Prediction Learning Framework via Meta-Representation Learning and Optimal Memory Buffer Retention Strategy,Dae Jun Kang · Dongsuk Kum · Sanmin Kim, ,https://arxiv.org/html/2311.11908v3,,2311.11908v3.pdf,Continual Learning: Applications and the Road Forward,"Continual learning is a subfield of machine learning, which aims to allow +machine learning models to continuously learn on new data, by accumulating +knowledge without forgetting what was learned in the past. In this work, we +take a step back, and ask: ""Why should one care about continual learning in the +first place?"". We set the stage by examining recent continual learning papers +published at four major machine learning conferences, and show that +memory-constrained settings dominate the field. Then, we discuss five open +problems in machine learning, and even though they might seem unrelated to +continual learning at first sight, we show that continual learning will +inevitably be part of their solution. These problems are model editing, +personalization and specialization, on-device learning, faster (re-)training +and reinforcement learning. Finally, by comparing the desiderata from these +unsolved problems and the current assumptions in continual learning, we +highlight and discuss four future directions for continual learning research. +We hope that this work offers an interesting perspective on the future of +continual learning, while displaying its potential value and the paths we have +to pursue in order to make it successful. This work is the result of the many +discussions the authors had at the Dagstuhl seminar on Deep Continual Learning, +in March 2023.",cs.LG,"['cs.LG', 'cs.AI', 'cs.CV']" +Learning Visual Prompt for Gait Recognition,Kang Ma · Ying Fu · Chunshui Cao · Saihui Hou · Yongzhen Huang · Dezhi Zheng, ,https://arxiv.org/abs/2402.19122,,2402.19122.pdf,BigGait: Learning Gait Representation You Want by Large Vision Models,"Gait recognition stands as one of the most pivotal remote identification +technologies and progressively expands across research and industry +communities. However, existing gait recognition methods heavily rely on +task-specific upstream driven by supervised learning to provide explicit gait +representations like silhouette sequences, which inevitably introduce expensive +annotation costs and potential error accumulation. Escaping from this trend, +this work explores effective gait representations based on the all-purpose +knowledge produced by task-agnostic Large Vision Models (LVMs) and proposes a +simple yet efficient gait framework, termed BigGait. Specifically, the Gait +Representation Extractor (GRE) within BigGait draws upon design principles from +established gait representations, effectively transforming all-purpose +knowledge into implicit gait representations without requiring third-party +supervision signals. Experiments on CCPG, CAISA-B* and SUSTech1K indicate that +BigGait significantly outperforms the previous methods in both within-domain +and cross-domain tasks in most cases, and provides a more practical paradigm +for learning the next-generation gait representation. Finally, we delve into +prospective challenges and promising directions in LVMs-based gait recognition, +aiming to inspire future work in this emerging topic. 
The source code is +available at https://github.com/ShiqiYu/OpenGait.",cs.CV,['cs.CV'] +Zero-Reference Low-Light Enhancement via Physical Quadruple Priors,Wenjing Wang · Huan Yang · Jianlong Fu · Jiaying Liu,https://daooshee.github.io/QuadPrior-Website/,https://arxiv.org/abs/2403.12933,,2403.12933.pdf,Zero-Reference Low-Light Enhancement via Physical Quadruple Priors,"Understanding illumination and reducing the need for supervision pose a +significant challenge in low-light enhancement. Current approaches are highly +sensitive to data usage during training and illumination-specific +hyper-parameters, limiting their ability to handle unseen scenarios. In this +paper, we propose a new zero-reference low-light enhancement framework +trainable solely with normal light images. To accomplish this, we devise an +illumination-invariant prior inspired by the theory of physical light transfer. +This prior serves as the bridge between normal and low-light images. Then, we +develop a prior-to-image framework trained without low-light data. During +testing, this framework is able to restore our illumination-invariant prior +back to images, automatically achieving low-light enhancement. Within this +framework, we leverage a pretrained generative diffusion model for model +ability, introduce a bypass decoder to handle detail distortion, as well as +offer a lightweight version for practicality. Extensive experiments demonstrate +our framework's superiority in various scenarios as well as good +interpretability, robustness, and efficiency. Code is available on our project +homepage: http://daooshee.github.io/QuadPrior-Website/",cs.CV,['cs.CV'] +Differentiable Information Bottleneck for Deterministic Multi-view Clustering,Xiaoqiang Yan · Zhixiang Jin · Fengshou Han · Yangdong Ye, ,https://arxiv.org/abs/2403.15681,,2403.15681.pdf,Differentiable Information Bottleneck for Deterministic Multi-view Clustering,"In recent several years, the information bottleneck (IB) principle provides +an information-theoretic framework for deep multi-view clustering (MVC) by +compressing multi-view observations while preserving the relevant information +of multiple views. Although existing IB-based deep MVC methods have achieved +huge success, they rely on variational approximation and distribution +assumption to estimate the lower bound of mutual information, which is a +notoriously hard and impractical problem in high-dimensional multi-view spaces. +In this work, we propose a new differentiable information bottleneck (DIB) +method, which provides a deterministic and analytical MVC solution by fitting +the mutual information without the necessity of variational approximation. +Specifically, we first propose to directly fit the mutual information of +high-dimensional spaces by leveraging normalized kernel Gram matrix, which does +not require any auxiliary neural estimator to estimate the lower bound of +mutual information. Then, based on the new mutual information measurement, a +deterministic multi-view neural network with analytical gradients is explicitly +trained to parameterize IB principle, which derives a deterministic compression +of input variables from different views. Finally, a triplet consistency +discovery mechanism is devised, which is capable of mining the feature +consistency, cluster consistency and joint consistency based on the +deterministic and compact representations. 
Extensive experimental results show +the superiority of our DIB method on 6 benchmarks compared with 13 +state-of-the-art baselines.",cs.IT,"['cs.IT', 'cs.LG', 'math.IT']" +Training on Synthetic Data Beats Real Data in Multimodal Relation Extraction,Zilin Du · Haoxin Li · Xu Guo · Boyang Li, ,https://arxiv.org/abs/2312.03025,,2312.03025.pdf,Training on Synthetic Data Beats Real Data in Multimodal Relation Extraction,"The task of multimodal relation extraction has attracted significant research +attention, but progress is constrained by the scarcity of available training +data. One natural thought is to extend existing datasets with cross-modal +generative models. In this paper, we consider a novel problem setting, where +only unimodal data, either text or image, are available during training. We aim +to train a multimodal classifier from synthetic data that perform well on real +multimodal test data. However, training with synthetic data suffers from two +obstacles: lack of data diversity and label information loss. To alleviate the +issues, we propose Mutual Information-aware Multimodal Iterated Relational dAta +GEneration (MI2RAGE), which applies Chained Cross-modal Generation (CCG) to +promote diversity in the generated data and exploits a teacher network to +select valuable training samples with high mutual information with the +ground-truth labels. Comparing our method to direct training on synthetic data, +we observed a significant improvement of 24.06% F1 with synthetic text and +26.42% F1 with synthetic images. Notably, our best model trained on completely +synthetic images outperforms prior state-of-the-art models trained on real +multimodal data by a margin of 3.76% in F1. Our codebase will be made available +upon acceptance.",cs.AI,"['cs.AI', 'cs.CL', 'cs.CV', 'cs.LG']" +DiffusionMTL: Learning Multi-Task Denoising Diffusion Model from Partially Annotated Data,Hanrong Ye · Dan Xu, ,https://arxiv.org/abs/2403.15389,,2403.15389.pdf,DiffusionMTL: Learning Multi-Task Denoising Diffusion Model from Partially Annotated Data,"Recently, there has been an increased interest in the practical problem of +learning multiple dense scene understanding tasks from partially annotated +data, where each training sample is only labeled for a subset of the tasks. The +missing of task labels in training leads to low-quality and noisy predictions, +as can be observed from state-of-the-art methods. To tackle this issue, we +reformulate the partially-labeled multi-task dense prediction as a pixel-level +denoising problem, and propose a novel multi-task denoising diffusion framework +coined as DiffusionMTL. It designs a joint diffusion and denoising paradigm to +model a potential noisy distribution in the task prediction or feature maps and +generate rectified outputs for different tasks. To exploit multi-task +consistency in denoising, we further introduce a Multi-Task Conditioning +strategy, which can implicitly utilize the complementary nature of the tasks to +help learn the unlabeled tasks, leading to an improvement in the denoising +performance of the different tasks. Extensive quantitative and qualitative +experiments demonstrate that the proposed multi-task denoising diffusion model +can significantly improve multi-task prediction maps, and outperform the +state-of-the-art methods on three challenging multi-task benchmarks, under two +different partial-labeling evaluation settings. 
The code is available at +https://prismformore.github.io/diffusionmtl/.",cs.CV,"['cs.CV', 'cs.LG']" +Retrieval-Augmented Embodied Agents,Yichen Zhu · Zhicai Ou · Xiaofeng Mou · Jian Tang, ,https://arxiv.org/abs/2404.11699,,2404.11699.pdf,Retrieval-Augmented Embodied Agents,"Embodied agents operating in complex and uncertain environments face +considerable challenges. While some advanced agents handle complex manipulation +tasks with proficiency, their success often hinges on extensive training data +to develop their capabilities. In contrast, humans typically rely on recalling +past experiences and analogous situations to solve new problems. Aiming to +emulate this human approach in robotics, we introduce the Retrieval-Augmented +Embodied Agent (RAEA). This innovative system equips robots with a form of +shared memory, significantly enhancing their performance. Our approach +integrates a policy retriever, allowing robots to access relevant strategies +from an external policy memory bank based on multi-modal inputs. Additionally, +a policy generator is employed to assimilate these strategies into the learning +process, enabling robots to formulate effective responses to tasks. Extensive +testing of RAEA in both simulated and real-world scenarios demonstrates its +superior performance over traditional methods, representing a major leap +forward in robotic technology.",cs.RO,['cs.RO'] +Generative Rendering: Controllable 4D-Guided Video Generation with 2D Diffusion Models,Shengqu Cai · Duygu Ceylan · Matheus Gadelha · Chun-Hao P. Huang · Tuanfeng Y. Wang · Gordon Wetzstein,https://primecai.github.io/generative_rendering/,https://arxiv.org/abs/2312.01409,,2312.01409.pdf,Generative Rendering: Controllable 4D-Guided Video Generation with 2D Diffusion Models,"Traditional 3D content creation tools empower users to bring their +imagination to life by giving them direct control over a scene's geometry, +appearance, motion, and camera path. Creating computer-generated videos, +however, is a tedious manual process, which can be automated by emerging +text-to-video diffusion models. Despite great promise, video diffusion models +are difficult to control, hindering a user to apply their own creativity rather +than amplifying it. To address this challenge, we present a novel approach that +combines the controllability of dynamic 3D meshes with the expressivity and +editability of emerging diffusion models. For this purpose, our approach takes +an animated, low-fidelity rendered mesh as input and injects the ground truth +correspondence information obtained from the dynamic mesh into various stages +of a pre-trained text-to-image generation model to output high-quality and +temporally consistent frames. 
We demonstrate our approach on various examples +where motion can be obtained by animating rigged assets or changing the camera +path.",cs.CV,"['cs.CV', 'cs.AI', 'cs.GR']" +Not All Classes Stand on Same Embeddings: Calibrating a Semantic Distance with Metric Tensor,Jae Hyeon Park · Gyoomin Lee · Seunggi Park · Sung In Cho, ,,https://stackoverflow.com/questions/76678783/langchains-chroma-vectordb-similarity-search-with-score-and-vectordb-simil,,,,,nan +OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation,Qidong Huang · Xiaoyi Dong · Pan Zhang · Bin Wang · Conghui He · Jiaqi Wang · Dahua Lin · Weiming Zhang · Nenghai Yu, ,https://arxiv.org/abs/2311.17911,,2311.17911.pdf,OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation,"Hallucination, posed as a pervasive challenge of multi-modal large language +models (MLLMs), has significantly impeded their real-world usage that demands +precise judgment. Existing methods mitigate this issue with either training +with specific designed data or inferencing with external knowledge from other +sources, incurring inevitable additional costs. In this paper, we present +OPERA, a novel MLLM decoding method grounded in an Over-trust Penalty and a +Retrospection-Allocation strategy, serving as a nearly free lunch to alleviate +the hallucination issue without additional data, knowledge, or training. Our +approach begins with an interesting observation that, most hallucinations are +closely tied to the knowledge aggregation patterns manifested in the +self-attention matrix, i.e., MLLMs tend to generate new tokens by focusing on a +few summary tokens, but not all the previous tokens. Such partial over-trust +inclination results in the neglecting of image tokens and describes the image +content with hallucination. Based on the observation, OPERA introduces a +penalty term on the model logits during the beam-search decoding to mitigate +the over-trust issue, along with a rollback strategy that retrospects the +presence of summary tokens in the previously generated tokens, and re-allocate +the token selection if necessary. With extensive experiments, OPERA shows +significant hallucination-mitigating performance on different MLLMs and +metrics, proving its effectiveness and generality. Our code is available at: +https://github.com/shikiw/OPERA.",cs.CV,['cs.CV'] +Combining Frame and GOP Embeddings for Neural Video Representation,Jens Eirik Saethre · Roberto Azevedo · Christopher Schroers, ,https://arxiv.org/abs/2403.15679,,2403.15679.pdf,DS-NeRV: Implicit Neural Video Representation with Decomposed Static and Dynamic Codes,"Implicit neural representations for video (NeRV) have recently become a novel +way for high-quality video representation. However, existing works employ a +single network to represent the entire video, which implicitly confuse static +and dynamic information. This leads to an inability to effectively compress the +redundant static information and lack the explicitly modeling of global +temporal-coherent dynamic details. To solve above problems, we propose DS-NeRV, +which decomposes videos into sparse learnable static codes and dynamic codes +without the need for explicit optical flow or residual supervision. 
By setting +different sampling rates for two codes and applying weighted sum and +interpolation sampling methods, DS-NeRV efficiently utilizes redundant static +information while maintaining high-frequency details. Additionally, we design a +cross-channel attention-based (CCA) fusion module to efficiently fuse these two +codes for frame decoding. Our approach achieves a high quality reconstruction +of 31.2 PSNR with only 0.35M parameters thanks to separate static and dynamic +codes representation and outperforms existing NeRV methods in many downstream +tasks. Our project website is at https://haoyan14.github.io/DS-NeRV.",cs.CV,"['cs.CV', 'cs.MM']" +FlowIE:Efficient Image Enhancement via Rectified Flow,Yixuan Zhu · Wenliang Zhao · Ao Li · Yansong Tang · Jie Zhou · Jiwen Lu, ,https://arxiv.org/abs/2405.14677,,2405.14677.pdf,RectifID: Personalizing Rectified Flow with Anchored Classifier Guidance,"Customizing diffusion models to generate identity-preserving images from +user-provided reference images is an intriguing new problem. The prevalent +approaches typically require training on extensive domain-specific images to +achieve identity preservation, which lacks flexibility across different use +cases. To address this issue, we exploit classifier guidance, a training-free +technique that steers diffusion models using an existing classifier, for +personalized image generation. Our study shows that based on a recent rectified +flow framework, the major limitation of vanilla classifier guidance in +requiring a special classifier can be resolved with a simple fixed-point +solution, allowing flexible personalization with off-the-shelf image +discriminators. Moreover, its solving procedure proves to be stable when +anchored to a reference flow trajectory, with a convergence guarantee. The +derived method is implemented on rectified flow with different off-the-shelf +image discriminators, delivering advantageous personalization results for human +faces, live subjects, and certain objects. Code is available at +https://github.com/feifeiobama/RectifID.",cs.CV,"['cs.CV', 'cs.LG']" +CPR-Coach: Recognizing Composite Error Actions based on Single-class Training,Shunli Wang · Shuaibing Wang · Dingkang Yang · Mingcheng Li · Haopeng Kuang · Xiao Zhao · Liuzhen Su · Peng Zhai · Lihua Zhang, ,https://arxiv.org/abs/2309.11718,,2309.11718.pdf,CPR-Coach: Recognizing Composite Error Actions based on Single-class Training,"The fine-grained medical action analysis task has received considerable +attention from pattern recognition communities recently, but it faces the +problems of data and algorithm shortage. Cardiopulmonary Resuscitation (CPR) is +an essential skill in emergency treatment. Currently, the assessment of CPR +skills mainly depends on dummies and trainers, leading to high training costs +and low efficiency. For the first time, this paper constructs a vision-based +system to complete error action recognition and skill assessment in CPR. +Specifically, we define 13 types of single-error actions and 74 types of +composite error actions during external cardiac compression and then develop a +video dataset named CPR-Coach. By taking the CPR-Coach as a benchmark, this +paper thoroughly investigates and compares the performance of existing action +recognition models based on different data modalities. 
To solve the unavoidable +Single-class Training & Multi-class Testing problem, we propose a +human-cognition-inspired framework named ImagineNet to improve the model's +multi-error recognition performance under restricted supervision. Extensive +experiments verify the effectiveness of the framework. We hope this work could +advance research toward fine-grained medical action analysis and skill +assessment. The CPR-Coach dataset and the code of ImagineNet are publicly +available on Github.",cs.CV,"['cs.CV', 'I.5.4']" +Embracing Unimodal Aleatoric Uncertainty for Robust Multimodal Fusion,Zixian Gao · Xun Jiang · Xing Xu · Fumin Shen · Yujie Li · Heng Tao Shen, ,https://arxiv.org/abs/2307.16121,,2307.16121.pdf,Uncertainty-Encoded Multi-Modal Fusion for Robust Object Detection in Autonomous Driving,"Multi-modal fusion has shown initial promising results for object detection +of autonomous driving perception. However, many existing fusion schemes do not +consider the quality of each fusion input and may suffer from adverse +conditions on one or more sensors. While predictive uncertainty has been +applied to characterize single-modal object detection performance at run time, +incorporating uncertainties into the multi-modal fusion still lacks effective +solutions due primarily to the uncertainty's cross-modal incomparability and +distinct sensitivities to various adverse conditions. To fill this gap, this +paper proposes Uncertainty-Encoded Mixture-of-Experts (UMoE) that explicitly +incorporates single-modal uncertainties into LiDAR-camera fusion. UMoE uses +individual expert network to process each sensor's detection result together +with encoded uncertainty. Then, the expert networks' outputs are analyzed by a +gating network to determine the fusion weights. The proposed UMoE module can be +integrated into any proposal fusion pipeline. Evaluation shows that UMoE +achieves a maximum of 10.67%, 3.17%, and 5.40% performance gain compared with +the state-of-the-art proposal-level multi-modal object detectors under extreme +weather, adversarial, and blinding attack scenarios.",cs.CV,"['cs.CV', 'cs.AI']" +Unsupervised Deep Unrolling Networks for Phase Unwrapping,Zhile Chen · Yuhui Quan · Hui Ji, ,,https://ieeexplore.ieee.org/document/10520881,,,,,nan +Transductive Zero-Shot $\&$ Few-Shot CLIP,Ségolène Martin · Yunshi HUANG · Fereshteh Shakeri · Jean-Christophe Pesquet · Ismail Ben Ayed, ,https://arxiv.org/abs/2405.18437,,2405.18437.pdf,Transductive Zero-Shot and Few-Shot CLIP,"Transductive inference has been widely investigated in few-shot image +classification, but completely overlooked in the recent, fast growing +literature on adapting vision-language models like CLIP. This paper addresses +the transductive zero-shot and few-shot CLIP classification challenge, in which +inference is performed jointly across a mini-batch of unlabeled query samples, +rather than treating each instance independently. We initially construct +informative vision-text probability features, leading to a classification +problem on the unit simplex set. Inspired by Expectation-Maximization (EM), our +optimization-based classification objective models the data probability +distribution for each class using a Dirichlet law. The minimization problem is +then tackled with a novel block Majorization-Minimization algorithm, which +simultaneously estimates the distribution parameters and class assignments.
+Extensive numerical experiments on 11 datasets underscore the benefits and +efficacy of our batch inference approach. On zero-shot tasks with test batches +of 75 samples, our approach yields near 20% improvement in ImageNet accuracy +over CLIP's zero-shot performance. Additionally, we outperform state-of-the-art +methods in the few-shot setting. The code is available at: +https://github.com/SegoleneMartin/transductive-CLIP.",cs.CV,"['cs.CV', 'cs.AI']" +Just Add $\pi$! Pose Induced Video Transformers for Understanding Activities of Daily Living,Dominick Reilly · Srijan Das, ,https://arxiv.org/abs/2311.18840,,2311.18840.pdf,Just Add $π$! Pose Induced Video Transformers for Understanding Activities of Daily Living,"Video transformers have become the de facto standard for human action +recognition, yet their exclusive reliance on the RGB modality still limits +their adoption in certain domains. One such domain is Activities of Daily +Living (ADL), where RGB alone is not sufficient to distinguish between visually +similar actions, or actions observed from multiple viewpoints. To facilitate +the adoption of video transformers for ADL, we hypothesize that the +augmentation of RGB with human pose information, known for its sensitivity to +fine-grained motion and multiple viewpoints, is essential. Consequently, we +introduce the first Pose Induced Video Transformer: PI-ViT (or $\pi$-ViT), a +novel approach that augments the RGB representations learned by video +transformers with 2D and 3D pose information. The key elements of $\pi$-ViT are +two plug-in modules, 2D Skeleton Induction Module and 3D Skeleton Induction +Module, that are responsible for inducing 2D and 3D pose information into the +RGB representations. These modules operate by performing pose-aware auxiliary +tasks, a design choice that allows $\pi$-ViT to discard the modules during +inference. Notably, $\pi$-ViT achieves the state-of-the-art performance on +three prominent ADL datasets, encompassing both real-world and large-scale +RGB-D datasets, without requiring poses or additional computational overhead at +inference.",cs.CV,['cs.CV'] +Specularity Factorization for Low Light Enhancement,Saurabh Saini · P. J. Narayanan, ,https://arxiv.org/abs/2404.01998,,2404.01998.pdf,Specularity Factorization for Low-Light Enhancement,"We present a new additive image factorization technique that treats images to +be composed of multiple latent specular components which can be simply +estimated recursively by modulating the sparsity during decomposition. Our +model-driven {\em RSFNet} estimates these factors by unrolling the optimization +into network layers requiring only a few scalars to be learned. The resultant +factors are interpretable by design and can be fused for different image +enhancement tasks via a network or combined directly by the user in a +controllable fashion. Based on RSFNet, we detail a zero-reference Low Light +Enhancement (LLE) application trained without paired or unpaired supervision. +Our system improves the state-of-the-art performance on standard benchmarks and +achieves better generalization on multiple other datasets. We also integrate +our factors with other task specific fusion networks for applications like +deraining, deblurring and dehazing with negligible overhead thereby +highlighting the multi-domain and multi-task generalizability of our proposed +RSFNet.
The code and data is released for reproducibility on the project +homepage.",cs.CV,"['cs.CV', 'cs.LG']" +Draw Step by Step: Reconstructing CAD Construction Sequences from Point Clouds via Multimodal Diffusion.,Weijian Ma · Shuaiqi Chen · Yunzhong Lou · Xueyang Li · Xiangdong Zhou, ,https://arxiv.org/abs/2405.15188,,2405.15188.pdf,PS-CAD: Local Geometry Guidance via Prompting and Selection for CAD Reconstruction,"Reverse engineering CAD models from raw geometry is a classic but challenging +research problem. In particular, reconstructing the CAD modeling sequence from +point clouds provides great interpretability and convenience for editing. To +improve upon this problem, we introduce geometric guidance into the +reconstruction network. Our proposed model, PS-CAD, reconstructs the CAD +modeling sequence one step at a time. At each step, we provide two forms of +geometric guidance. First, we provide the geometry of surfaces where the +current reconstruction differs from the complete model as a point cloud. This +helps the framework to focus on regions that still need work. Second, we use +geometric analysis to extract a set of planar prompts, that correspond to +candidate surfaces where a CAD extrusion step could be started. Our framework +has three major components. Geometric guidance computation extracts the two +types of geometric guidance. Single-step reconstruction computes a single +candidate CAD modeling step for each provided prompt. Single-step selection +selects among the candidate CAD modeling steps. The process continues until the +reconstruction is completed. Our quantitative results show a significant +improvement across all metrics. For example, on the dataset DeepCAD, PS-CAD +improves upon the best published SOTA method by reducing the geometry errors +(CD and HD) by 10%, and the structural error (ECD metric) by about 15%.",cs.CV,['cs.CV'] +Logarithmic Lenses: Exploring Log RGB Data for Image Classification,Bruce Maxwell · Bruce Maxwell · Sumegha Singhania · Avnish Patel · Rahul Kumar · Heather Fryling · Sihan Li · Haonan Sun · Ping He · Zewen Li, ,,https://medium.com/@adjileyeb/unlocking-visual-insights-applying-the-logit-lens-to-image-data-with-vision-transformers-b99cb70dd704,,,,,nan +D$^4$M: Dataset Distillation via Disentangled Diffusion Model,Duo Su · Junjie Hou · Weizhi Gao · Yingjie Tian · Bowen Tang, ,https://arxiv.org/abs/2403.03881,,2403.03881.pdf,Latent Dataset Distillation with Diffusion Models,"The efficacy of machine learning has traditionally relied on the availability +of increasingly larger datasets. However, large datasets pose storage +challenges and contain non-influential samples, which could be ignored during +training without impacting the final accuracy of the model. In response to +these limitations, the concept of distilling the information on a dataset into +a condensed set of (synthetic) samples, namely a distilled dataset, emerged. +One crucial aspect is the selected architecture (usually ConvNet) for linking +the original and synthetic datasets. However, the final accuracy is lower if +the employed model architecture differs from the model used during +distillation. Another challenge is the generation of high-resolution images, +e.g., 128x128 and higher. In this paper, we propose Latent Dataset Distillation +with Diffusion Models (LD3M) that combine diffusion in latent space with +dataset distillation to tackle both challenges. 
LD3M incorporates a novel +diffusion process tailored for dataset distillation, which improves the +gradient norms for learning synthetic images. By adjusting the number of +diffusion steps, LD3M also offers a straightforward way of controlling the +trade-off between speed and accuracy. We evaluate our approach in several +ImageNet subsets and for high-resolution images (128x128 and 256x256). As a +result, LD3M consistently outperforms state-of-the-art distillation techniques +by up to 4.8 p.p. and 4.2 p.p. for 1 and 10 images per class, respectively.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +Probing the 3D Awareness of Visual Foundation Models,Mohamed El Banani · Amit Raj · Kevis-kokitsi Maninis · Abhishek Kar · Yuanzhen Li · Michael Rubinstein · Deqing Sun · Leonidas Guibas · Justin Johnson · Varun Jampani, ,https://arxiv.org/abs/2404.08636,,2404.08636.pdf,Probing the 3D Awareness of Visual Foundation Models,"Recent advances in large-scale pretraining have yielded visual foundation +models with strong capabilities. Not only can recent models generalize to +arbitrary images for their training task, their intermediate representations +are useful for other visual tasks such as detection and segmentation. Given +that such models can classify, delineate, and localize objects in 2D, we ask +whether they also represent their 3D structure? In this work, we analyze the 3D +awareness of visual foundation models. We posit that 3D awareness implies that +representations (1) encode the 3D structure of the scene and (2) consistently +represent the surface across views. We conduct a series of experiments using +task-specific probes and zero-shot inference procedures on frozen features. Our +experiments reveal several limitations of the current models. Our code and +analysis can be found at https://github.com/mbanani/probe3d.",cs.CV,['cs.CV'] +HyperDreamBooth: HyperNetworks for Fast Personalization of Text-to-Image Models,Nataniel Ruiz · Yuanzhen Li · Varun Jampani · Wei Wei · Tingbo Hou · Yael Pritch · Neal Wadhwa · Michael Rubinstein · Kfir Aberman, ,https://arxiv.org/abs/2307.06949,,2307.06949.pdf,HyperDreamBooth: HyperNetworks for Fast Personalization of Text-to-Image Models,"Personalization has emerged as a prominent aspect within the field of +generative AI, enabling the synthesis of individuals in diverse contexts and +styles, while retaining high-fidelity to their identities. However, the process +of personalization presents inherent challenges in terms of time and memory +requirements. Fine-tuning each personalized model needs considerable GPU time +investment, and storing a personalized model per subject can be demanding in +terms of storage capacity. To overcome these challenges, we propose +HyperDreamBooth-a hypernetwork capable of efficiently generating a small set of +personalized weights from a single image of a person. By composing these +weights into the diffusion model, coupled with fast finetuning, HyperDreamBooth +can generate a person's face in various contexts and styles, with high subject +details while also preserving the model's crucial knowledge of diverse styles +and semantic modifications. Our method achieves personalization on faces in +roughly 20 seconds, 25x faster than DreamBooth and 125x faster than Textual +Inversion, using as few as one reference image, with the same quality and style +diversity as DreamBooth. Also our method yields a model that is 10000x smaller +than a normal DreamBooth model. 
Project page: https://hyperdreambooth.github.io",cs.CV,"['cs.CV', 'cs.AI', 'cs.GR', 'cs.LG']" +VideoRF: Rendering Dynamic Radiance Fields as 2D Feature Video Streams,Liao Wang · Kaixin Yao · Chengcheng Guo · Zhirui Zhang · Qiang Hu · Jingyi Yu · Lan Xu · Minye Wu, ,https://arxiv.org/abs/2312.01407,,2312.01407.pdf,VideoRF: Rendering Dynamic Radiance Fields as 2D Feature Video Streams,"Neural Radiance Fields (NeRFs) excel in photorealistically rendering static +scenes. However, rendering dynamic, long-duration radiance fields on ubiquitous +devices remains challenging, due to data storage and computational constraints. +In this paper, we introduce VideoRF, the first approach to enable real-time +streaming and rendering of dynamic radiance fields on mobile platforms. At the +core is a serialized 2D feature image stream representing the 4D radiance field +all in one. We introduce a tailored training scheme directly applied to this 2D +domain to impose the temporal and spatial redundancy of the feature image +stream. By leveraging the redundancy, we show that the feature image stream can +be efficiently compressed by 2D video codecs, which allows us to exploit video +hardware accelerators to achieve real-time decoding. On the other hand, based +on the feature image stream, we propose a novel rendering pipeline for VideoRF, +which has specialized space mappings to query radiance properties efficiently. +Paired with a deferred shading model, VideoRF has the capability of real-time +rendering on mobile devices thanks to its efficiency. We have developed a +real-time interactive player that enables online streaming and rendering of +dynamic scenes, offering a seamless and immersive free-viewpoint experience +across a range of devices, from desktops to mobile phones.",cs.CV,['cs.CV'] +GLaMM: Pixel Grounding Large Multimodal Model,Hanoona Rasheed · Muhammad Maaz · Sahal Shaji Mullappilly · Abdelrahman Shaker · Salman Khan · Hisham Cholakkal · Rao Anwer · Eric P. Xing · Ming-Hsuan Yang · Fahad Shahbaz Khan, ,https://arxiv.org/abs/2311.03356v1,,2311.03356v1.pdf,GLaMM: Pixel Grounding Large Multimodal Model,"Large Multimodal Models (LMMs) extend Large Language Models to the vision +domain. Initial efforts towards LMMs used holistic images and text prompts to +generate ungrounded textual responses. Very recently, region-level LMMs have +been used to generate visually grounded responses. However, they are limited to +only referring a single object category at a time, require users to specify the +regions in inputs, or cannot offer dense pixel-wise object grounding. In this +work, we present Grounding LMM (GLaMM), the first model that can generate +natural language responses seamlessly intertwined with corresponding object +segmentation masks. GLaMM not only grounds objects appearing in the +conversations but is flexible enough to accept both textual and optional visual +prompts (region of interest) as input. This empowers users to interact with the +model at various levels of granularity, both in textual and visual domains. Due +to the lack of standard benchmarks for the novel setting of generating visually +grounded detailed conversations, we introduce a comprehensive evaluation +protocol with our curated grounded conversations. Our proposed Grounded +Conversation Generation (GCG) task requires densely grounded concepts in +natural scenes at a large-scale. 
To this end, we propose a densely annotated +Grounding-anything Dataset (GranD) using our proposed automated annotation +pipeline that encompasses 7.5M unique concepts grounded in a total of 810M +regions available with segmentation masks. Besides GCG, GLaMM also performs +effectively on several downstream tasks e.g., referring expression +segmentation, image and region-level captioning and vision-language +conversations. Project Page: https://mbzuai-oryx.github.io/groundingLMM.",cs.CV,"['cs.CV', 'cs.AI']" +pix2gestalt: Amodal Segmentation by Synthesizing Wholes,Ege Ozguroglu · Ruoshi Liu · Dídac Surís · Dian Chen · Achal Dave · Pavel Tokmakov · Carl Vondrick, ,https://arxiv.org/abs/2401.14398,,2401.14398.pdf,pix2gestalt: Amodal Segmentation by Synthesizing Wholes,"We introduce pix2gestalt, a framework for zero-shot amodal segmentation, +which learns to estimate the shape and appearance of whole objects that are +only partially visible behind occlusions. By capitalizing on large-scale +diffusion models and transferring their representations to this task, we learn +a conditional diffusion model for reconstructing whole objects in challenging +zero-shot cases, including examples that break natural and physical priors, +such as art. As training data, we use a synthetically curated dataset +containing occluded objects paired with their whole counterparts. Experiments +show that our approach outperforms supervised baselines on established +benchmarks. Our model can furthermore be used to significantly improve the +performance of existing object recognition and 3D reconstruction methods in the +presence of occlusions.",cs.CV,"['cs.CV', 'cs.LG']" +LightOctree: Lightweight 3D Spatially-Coherent Indoor Lighting Estimation,Xuecan Wang · Shibang Xiao · Xiaohui Liang, ,https://arxiv.org/abs/2404.03925,,2404.03925.pdf,LightOctree: Lightweight 3D Spatially-Coherent Indoor Lighting Estimation,"We present a lightweight solution for estimating spatially-coherent indoor +lighting from a single RGB image. Previous methods for estimating illumination +using volumetric representations have overlooked the sparse distribution of +light sources in space, necessitating substantial memory and computational +resources for achieving high-quality results. We introduce a unified, voxel +octree-based illumination estimation framework to produce 3D spatially-coherent +lighting. Additionally, a differentiable voxel octree cone tracing rendering +layer is proposed to eliminate regular volumetric representation throughout the +entire process and ensure the retention of features across different frequency +domains. This reduction significantly decreases spatial usage and required +floating-point operations without substantially compromising precision. +Experimental results demonstrate that our approach achieves high-quality +coherent estimation with minimal cost compared to previous methods.",cs.CV,['cs.CV'] +3D Geometry-aware Deformable Gaussian Splatting for Dynamic View Synthesis,Zhicheng Lu · xiang guo · Le Hui · Tianrui Chen · Min Yang · Xiao Tang · feng zhu · Yuchao Dai, ,https://arxiv.org/abs/2404.06270,,2404.06270.pdf,3D Geometry-aware Deformable Gaussian Splatting for Dynamic View Synthesis,"In this paper, we propose a 3D geometry-aware deformable Gaussian Splatting +method for dynamic view synthesis. Existing neural radiance fields (NeRF) based +solutions learn the deformation in an implicit manner, which cannot incorporate +3D scene geometry. 
Therefore, the learned deformation is not necessarily +geometrically coherent, which results in unsatisfactory dynamic view synthesis +and 3D dynamic reconstruction. Recently, 3D Gaussian Splatting provides a new +representation of the 3D scene, building upon which the 3D geometry could be +exploited in learning the complex 3D deformation. Specifically, the scenes are +represented as a collection of 3D Gaussian, where each 3D Gaussian is optimized +to move and rotate over time to model the deformation. To enforce the 3D scene +geometry constraint during deformation, we explicitly extract 3D geometry +features and integrate them in learning the 3D deformation. In this way, our +solution achieves 3D geometry-aware deformation modeling, which enables +improved dynamic view synthesis and 3D dynamic reconstruction. Extensive +experimental results on both synthetic and real datasets prove the superiority +of our solution, which achieves new state-of-the-art performance. + The project is available at https://npucvr.github.io/GaGS/",cs.CV,['cs.CV'] +PromptAD: Learning Prompts with only Normal Samples for Few-Shot Anomaly Detection,Qihang Ma · Zhizhong Zhang · Xin Tan · Yanyun Qu · Chengwei Chen · Yuan Xie · Lizhuang Ma, ,https://arxiv.org/abs/2404.05231,,2404.05231.pdf,PromptAD: Learning Prompts with only Normal Samples for Few-Shot Anomaly Detection,"The vision-language model has brought great improvement to few-shot +industrial anomaly detection, which usually needs to design of hundreds of +prompts through prompt engineering. For automated scenarios, we first use +conventional prompt learning with many-class paradigm as the baseline to +automatically learn prompts but found that it can not work well in one-class +anomaly detection. To address the above problem, this paper proposes a +one-class prompt learning method for few-shot anomaly detection, termed +PromptAD. First, we propose semantic concatenation which can transpose normal +prompts into anomaly prompts by concatenating normal prompts with anomaly +suffixes, thus constructing a large number of negative samples used to guide +prompt learning in one-class setting. Furthermore, to mitigate the training +challenge caused by the absence of anomaly images, we introduce the concept of +explicit anomaly margin, which is used to explicitly control the margin between +normal prompt features and anomaly prompt features through a hyper-parameter. +For image-level/pixel-level anomaly detection, PromptAD achieves first place in +11/12 few-shot settings on MVTec and VisA.",cs.CV,['cs.CV'] +Diffeomorphic Template Registration for Atmospheric Turbulence Mitigation,Dong Lao · Congli Wang · Alex Wong · Stefano Soatto, ,,https://www.semanticscholar.org/paper/Diffeomorphic-Template-Registration-for-Atmospheric-Lao-Wang/d03a9da146a21840a76c6a42b1a1572736fe5a14/figure/2,,,,,nan +From Variance to Veracity: Unbundling and Mitigating Gradient Variance in Differentiable Bundle Adjustment Layers,Swaminathan Gurumurthy · Karnik Ram · Bingqing Chen · Zachary Manchester · Zico Kolter, ,https://arxiv.org/abs/2307.08873,,2307.08873.pdf,An Alternative to Variance: Gini Deviation for Risk-averse Policy Gradient,"Restricting the variance of a policy's return is a popular choice in +risk-averse Reinforcement Learning (RL) due to its clear mathematical +definition and easy interpretability. Traditional methods directly restrict the +total return variance. Recent methods restrict the per-step reward variance as +a proxy. 
We thoroughly examine the limitations of these variance-based methods, +such as sensitivity to numerical scale and hindering of policy learning, and +propose to use an alternative risk measure, Gini deviation, as a substitute. We +study various properties of this new risk measure and derive a policy gradient +algorithm to minimize it. Empirical evaluation in domains where risk-aversion +can be clearly defined, shows that our algorithm can mitigate the limitations +of variance-based risk measures and achieves high return with low risk in terms +of variance and Gini deviation when others fail to learn a reasonable policy.",cs.LG,"['cs.LG', 'cs.AI']" +HarmonyView: Harmonizing Consistency and Diversity in One-Image-to-3D,Sangmin Woo · byeongjun park · Hyojun Go · Jin-Young Kim · Changick Kim, ,,https://github.com/byeongjun-park/HarmonyView,,,,,nan +Learning SO(3)-Invariant Semantic Correspondence via Local Shape Transform,Chunghyun Park · Seungwook Kim · Jaesik Park · Minsu Cho, ,https://arxiv.org/abs/2404.11156,,2404.11156.pdf,Learning SO(3)-Invariant Semantic Correspondence via Local Shape Transform,"Establishing accurate 3D correspondences between shapes stands as a pivotal +challenge with profound implications for computer vision and robotics. However, +existing self-supervised methods for this problem assume perfect input shape +alignment, restricting their real-world applicability. In this work, we +introduce a novel self-supervised Rotation-Invariant 3D correspondence learner +with Local Shape Transform, dubbed RIST, that learns to establish dense +correspondences between shapes even under challenging intra-class variations +and arbitrary orientations. Specifically, RIST learns to dynamically formulate +an SO(3)-invariant local shape transform for each point, which maps the +SO(3)-equivariant global shape descriptor of the input shape to a local shape +descriptor. These local shape descriptors are provided as inputs to our decoder +to facilitate point cloud self- and cross-reconstruction. Our proposed +self-supervised training pipeline encourages semantically corresponding points +from different shapes to be mapped to similar local shape descriptors, enabling +RIST to establish dense point-wise correspondences. RIST demonstrates +state-of-the-art performances on 3D part label transfer and semantic keypoint +transfer given arbitrarily rotated point cloud pairs, outperforming existing +methods by significant margins.",cs.CV,['cs.CV'] +BSNet: Box-Supervised Simulation-assisted Mean Teacher for 3D Instance Segmentation,Jiahao Lu · Jiacheng Deng · Tianzhu Zhang, ,https://arxiv.org/abs/2403.15019,,2403.15019.pdf,BSNet: Box-Supervised Simulation-assisted Mean Teacher for 3D Instance Segmentation,"3D instance segmentation (3DIS) is a crucial task, but point-level +annotations are tedious in fully supervised settings. Thus, using bounding +boxes (bboxes) as annotations has shown great potential. The current mainstream +approach is a two-step process, involving the generation of pseudo-labels from +box annotations and the training of a 3DIS network with the pseudo-labels. +However, due to the presence of intersections among bboxes, not every point has +a determined instance label, especially in overlapping areas. To generate +higher quality pseudo-labels and achieve more precise weakly supervised 3DIS +results, we propose the Box-Supervised Simulation-assisted Mean Teacher for 3D +Instance Segmentation (BSNet), which devises a novel pseudo-labeler called +Simulation-assisted Transformer. 
The labeler consists of two main components. +The first is Simulation-assisted Mean Teacher, which introduces Mean Teacher +for the first time in this task and constructs simulated samples to assist the +labeler in acquiring prior knowledge about overlapping areas. To better model +local-global structure, we also propose Local-Global Aware Attention as the +decoder for teacher and student labelers. Extensive experiments conducted on +the ScanNetV2 and S3DIS datasets verify the superiority of our designs. Code is +available at +\href{https://github.com/peoplelu/BSNet}{https://github.com/peoplelu/BSNet}.",cs.CV,['cs.CV'] +Motion Diversification Networks,Hee Jae Kim · Eshed Ohn-Bar, ,,https://www.kdramastars.com/articles/131362/20230922/moving-actor-stuns-viewers-unrecognizable-transformation-villain.htm,,,,,nan +PikeLPN: Mitigating Overlooked Inefficiencies of Low-Precision Neural Networks,Marina Neseem · Conor McCullough · Randy Hsin · Chas Leichner · Shan Li · In Suk Chong · Andrew Howard · Lukasz Lew · Sherief Reda · Ville-Mikko Rautio · Daniele Moro, ,https://arxiv.org/abs/2404.00103,,2404.00103.pdf,PikeLPN: Mitigating Overlooked Inefficiencies of Low-Precision Neural Networks,"Low-precision quantization is recognized for its efficacy in neural network +optimization. Our analysis reveals that non-quantized elementwise operations +which are prevalent in layers such as parameterized activation functions, batch +normalization, and quantization scaling dominate the inference cost of +low-precision models. These non-quantized elementwise operations are commonly +overlooked in SOTA efficiency metrics such as Arithmetic Computation Effort +(ACE). In this paper, we propose ACEv2 - an extended version of ACE which +offers a better alignment with the inference cost of quantized models and their +energy consumption on ML hardware. Moreover, we introduce PikeLPN, a model that +addresses these efficiency issues by applying quantization to both elementwise +operations and multiply-accumulate operations. In particular, we present a +novel quantization technique for batch normalization layers named QuantNorm +which allows for quantizing the batch normalization parameters without +compromising the model performance. Additionally, we propose applying Double +Quantization where the quantization scaling parameters are quantized. +Furthermore, we recognize and resolve the issue of distribution mismatch in +Separable Convolution layers by introducing Distribution-Heterogeneous +Quantization which enables quantizing them to low-precision. PikeLPN achieves +Pareto-optimality in efficiency-accuracy trade-off with up to 3X efficiency +improvement compared to SOTA low-precision models.",cs.LG,"['cs.LG', 'cs.CV']" +Efficient LoFTR: Semi-Dense Local Feature Matching with Sparse-Like Speed,Yifan Wang · Xingyi He · Sida Peng · Dongli Tan · Xiaowei Zhou, ,https://arxiv.org/abs/2403.04765,,2403.04765.pdf,Efficient LoFTR: Semi-Dense Local Feature Matching with Sparse-Like Speed,"We present a novel method for efficiently producing semi-dense matches across +images. Previous detector-free matcher LoFTR has shown remarkable matching +capability in handling large-viewpoint change and texture-poor scenarios but +suffers from low efficiency. We revisit its design choices and derive multiple +improvements for both efficiency and accuracy. 
One key observation is that +performing the transformer over the entire feature map is redundant due to +shared local information, therefore we propose an aggregated attention +mechanism with adaptive token selection for efficiency. Furthermore, we find +spatial variance exists in LoFTR's fine correlation module, which is adverse to +matching accuracy. A novel two-stage correlation layer is proposed to achieve +accurate subpixel correspondences for accuracy improvement. Our efficiency +optimized model is $\sim 2.5\times$ faster than LoFTR which can even surpass +state-of-the-art efficient sparse matching pipeline SuperPoint + LightGlue. +Moreover, extensive experiments show that our method can achieve higher +accuracy compared with competitive semi-dense matchers, with considerable +efficiency benefits. This opens up exciting prospects for large-scale or +latency-sensitive applications such as image retrieval and 3D reconstruction. +Project page: https://zju3dv.github.io/efficientloftr.",cs.CV,['cs.CV'] +"Uncovering What, Why and How: A Comprehensive Benchmark for Causation Understanding of Video Anomaly",Hang Du · Sicheng Zhang · Binzhu Xie · Guoshun Nan · Jiayang Zhang · Junrui Xu · Hangyu Liu · Sicong Leng · Jiangming Liu · Hehe Fan · Dajiu Huang · Jing Feng · Linli Chen · Can Zhang · Xuhuan Li · Hao Zhang · Jianhang Chen · Qimei Cui · Xiaofeng Tao, ,https://arxiv.org/abs/2405.00181,,2405.00181.pdf,"Uncovering What, Why and How: A Comprehensive Benchmark for Causation Understanding of Video Anomaly","Video anomaly understanding (VAU) aims to automatically comprehend unusual +occurrences in videos, thereby enabling various applications such as traffic +surveillance and industrial manufacturing. While existing VAU benchmarks +primarily concentrate on anomaly detection and localization, our focus is on +more practicality, prompting us to raise the following crucial questions: ""what +anomaly occurred?"", ""why did it happen?"", and ""how severe is this abnormal +event?"". In pursuit of these answers, we present a comprehensive benchmark for +Causation Understanding of Video Anomaly (CUVA). Specifically, each instance of +the proposed benchmark involves three sets of human annotations to indicate the +""what"", ""why"" and ""how"" of an anomaly, including 1) anomaly type, start and end +times, and event descriptions, 2) natural language explanations for the cause +of an anomaly, and 3) free text reflecting the effect of the abnormality. In +addition, we also introduce MMEval, a novel evaluation metric designed to +better align with human preferences for CUVA, facilitating the measurement of +existing LLMs in comprehending the underlying cause and corresponding effect of +video anomalies. Finally, we propose a novel prompt-based method that can serve +as a baseline approach for the challenging CUVA. We conduct extensive +experiments to show the superiority of our evaluation metric and the +prompt-based approach. Our code and dataset are available at +https://github.com/fesvhtr/CUVA.",cs.CV,"['cs.CV', 'cs.AI']" +GART: Gaussian Articulated Template Models,Jiahui Lei · Yufu Wang · Georgios Pavlakos · Lingjie Liu · Kostas Daniilidis, ,https://arxiv.org/abs/2311.16099,,2311.16099.pdf,GART: Gaussian Articulated Template Models,"We introduce Gaussian Articulated Template Model GART, an explicit, +efficient, and expressive representation for non-rigid articulated subject +capturing and rendering from monocular videos. 
GART utilizes a mixture of +moving 3D Gaussians to explicitly approximate a deformable subject's geometry +and appearance. It takes advantage of a categorical template model prior (SMPL, +SMAL, etc.) with learnable forward skinning while further generalizing to more +complex non-rigid deformations with novel latent bones. GART can be +reconstructed via differentiable rendering from monocular videos in seconds or +minutes and rendered in novel poses faster than 150fps.",cs.CV,"['cs.CV', 'cs.GR']" +Scaffold-GS: Structured 3D Gaussians for View-Adaptive Rendering,Tao Lu · Mulin Yu · Linning Xu · Yuanbo Xiangli · Limin Wang · Dahua Lin · Bo Dai, ,https://arxiv.org/abs/2312.00109,,2312.00109.pdf,Scaffold-GS: Structured 3D Gaussians for View-Adaptive Rendering,"Neural rendering methods have significantly advanced photo-realistic 3D scene +rendering in various academic and industrial applications. The recent 3D +Gaussian Splatting method has achieved the state-of-the-art rendering quality +and speed combining the benefits of both primitive-based representations and +volumetric representations. However, it often leads to heavily redundant +Gaussians that try to fit every training view, neglecting the underlying scene +geometry. Consequently, the resulting model becomes less robust to significant +view changes, texture-less area and lighting effects. We introduce Scaffold-GS, +which uses anchor points to distribute local 3D Gaussians, and predicts their +attributes on-the-fly based on viewing direction and distance within the view +frustum. Anchor growing and pruning strategies are developed based on the +importance of neural Gaussians to reliably improve the scene coverage. We show +that our method effectively reduces redundant Gaussians while delivering +high-quality rendering. We also demonstrates an enhanced capability to +accommodate scenes with varying levels-of-detail and view-dependent +observations, without sacrificing the rendering speed.",cs.CV,['cs.CV'] +DiSR-NeRF: Diffusion-Guided View-Consistent Super-Resolution NeRF,Jie Long Lee · Chen Li · Gim Hee Lee, ,https://arxiv.org/abs/2404.00874,,2404.00874.pdf,DiSR-NeRF: Diffusion-Guided View-Consistent Super-Resolution NeRF,"We present DiSR-NeRF, a diffusion-guided framework for view-consistent +super-resolution (SR) NeRF. Unlike prior works, we circumvent the requirement +for high-resolution (HR) reference images by leveraging existing powerful 2D +super-resolution models. Nonetheless, independent SR 2D images are often +inconsistent across different views. We thus propose Iterative 3D +Synchronization (I3DS) to mitigate the inconsistency problem via the inherent +multi-view consistency property of NeRF. Specifically, our I3DS alternates +between upscaling low-resolution (LR) rendered images with diffusion models, +and updating the underlying 3D representation with standard NeRF training. We +further introduce Renoised Score Distillation (RSD), a novel score-distillation +objective for 2D image resolution. Our RSD combines features from ancestral +sampling and Score Distillation Sampling (SDS) to generate sharp images that +are also LR-consistent. Qualitative and quantitative results on both synthetic +and real-world datasets demonstrate that our DiSR-NeRF can achieve better +results on NeRF super-resolution compared with existing works. 
Code and video +results available at the project website.",cs.CV,['cs.CV'] +SelfOcc: Self-Supervised Vision-Based 3D Occupancy Prediction,Yuanhui Huang · Wenzhao Zheng · Borui Zhang · Jie Zhou · Jiwen Lu, ,https://arxiv.org/abs/2311.12754,,2311.12754.pdf,SelfOcc: Self-Supervised Vision-Based 3D Occupancy Prediction,"3D occupancy prediction is an important task for the robustness of +vision-centric autonomous driving, which aims to predict whether each point is +occupied in the surrounding 3D space. Existing methods usually require 3D +occupancy labels to produce meaningful results. However, it is very laborious +to annotate the occupancy status of each voxel. In this paper, we propose +SelfOcc to explore a self-supervised way to learn 3D occupancy using only video +sequences. We first transform the images into the 3D space (e.g., bird's eye +view) to obtain 3D representation of the scene. We directly impose constraints +on the 3D representations by treating them as signed distance fields. We can +then render 2D images of previous and future frames as self-supervision signals +to learn the 3D representations. We propose an MVS-embedded strategy to +directly optimize the SDF-induced weights with multiple depth proposals. Our +SelfOcc outperforms the previous best method SceneRF by 58.7% using a single +frame as input on SemanticKITTI and is the first self-supervised work that +produces reasonable 3D occupancy for surround cameras on nuScenes. SelfOcc +produces high-quality depth and achieves state-of-the-art results on novel +depth synthesis, monocular depth estimation, and surround-view depth estimation +on the SemanticKITTI, KITTI-2015, and nuScenes, respectively. Code: +https://github.com/huang-yh/SelfOcc.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +Seeing the Unseen: Visual Common Sense for Semantic Placement,Ram Ramrakhya · Aniruddha Kembhavi · Dhruv Batra · Zsolt Kira · Kuo-Hao Zeng · Luca Weihs, ,https://arxiv.org/abs/2401.07770,,2401.07770.pdf,Seeing the Unseen: Visual Common Sense for Semantic Placement,"Computer vision tasks typically involve describing what is present in an +image (e.g. classification, detection, segmentation, and captioning). We study +a visual common sense task that requires understanding what is not present. +Specifically, given an image (e.g. of a living room) and name of an object +(""cushion""), a vision system is asked to predict semantically-meaningful +regions (masks or bounding boxes) in the image where that object could be +placed or is likely be placed by humans (e.g. on the sofa). We call this task: +Semantic Placement (SP) and believe that such common-sense visual understanding +is critical for assitive robots (tidying a house), and AR devices +(automatically rendering an object in the user's space). Studying the invisible +is hard. Datasets for image description are typically constructed by curating +relevant images and asking humans to annotate the contents of the image; +neither of those two steps are straightforward for objects not present in the +image. We overcome this challenge by operating in the opposite direction: we +start with an image of an object in context from web, and then remove that +object from the image via inpainting. This automated pipeline converts +unstructured web data into a dataset comprising pairs of images with/without +the object. Using this, we collect a novel dataset, with ${\sim}1.3$M images +across $9$ object categories, and train a SP prediction model called CLIP-UNet. 
+CLIP-UNet outperforms existing VLMs and baselines that combine semantic priors +with object detectors on real-world and simulated images. In our user studies, +we find that the SP masks predicted by CLIP-UNet are favored $43.7\%$ and +$31.3\%$ times when comparing against the $4$ SP baselines on real and +simulated images. In addition, we demonstrate leveraging SP mask predictions +from CLIP-UNet enables downstream applications like building tidying robots in +indoor environments.",cs.CV,['cs.CV'] +Non-autoregressive Sequence-to-Sequence Vision-Language Models,Kunyu Shi · Qi Dong · Luis Goncalves · Zhuowen Tu · Stefano Soatto, ,https://arxiv.org/abs/2403.02249,,2403.02249.pdf,Non-autoregressive Sequence-to-Sequence Vision-Language Models,"Sequence-to-sequence vision-language models are showing promise, but their +applicability is limited by their inference latency due to their autoregressive +way of generating predictions. We propose a parallel decoding +sequence-to-sequence vision-language model, trained with a Query-CTC loss, that +marginalizes over multiple inference paths in the decoder. This allows us to +model the joint distribution of tokens, rather than restricting to conditional +distribution as in an autoregressive model. The resulting model, NARVL, +achieves performance on-par with its state-of-the-art autoregressive +counterpart, but is faster at inference time, reducing from the linear +complexity associated with the sequential generation of tokens to a paradigm of +constant time joint inference.",cs.CV,"['cs.CV', 'cs.AI']" +Deep Video Inverse Tone Mapping Based on Temporal Clues,Yuyao Ye · Ning Zhang · Yang Zhao · Hongbin Cao · Ronggang Wang, ,,https://dl.acm.org/doi/10.1145/3648570,,,,,nan +L2B: Learning to Bootstrap Robust Models for Combating Label Noise,Yuyin Zhou · Xianhang li · Fengze Liu · Qingyue Wei · Xuxi Chen · Lequan Yu · Cihang Xie · Matthew P. Lungren · Lei Xing, ,,https://link.springer.com/chapter/10.1007/978-3-031-43415-0_1,,,,,nan +Spherical Mask: Coarse-to-Fine 3D Point Cloud Instance Segmentation with Spherical Representation,Sangyun Shin · Kaichen Zhou · Madhu Vankadari · Andrew Markham · Niki Trigoni, ,https://arxiv.org/abs/2312.11269,,2312.11269.pdf,Spherical Mask: Coarse-to-Fine 3D Point Cloud Instance Segmentation with Spherical Representation,"Coarse-to-fine 3D instance segmentation methods show weak performances +compared to recent Grouping-based, Kernel-based and Transformer-based methods. +We argue that this is due to two limitations: 1) Instance size overestimation +by axis-aligned bounding box(AABB) 2) False negative error accumulation from +inaccurate box to the refinement phase. In this work, we introduce Spherical +Mask, a novel coarse-to-fine approach based on spherical representation, +overcoming those two limitations with several benefits. Specifically, our +coarse detection estimates each instance with a 3D polygon using a center and +radial distance predictions, which avoids excessive size estimation of AABB. To +cut the error propagation in the existing coarse-to-fine approaches, we +virtually migrate points based on the polygon, allowing all foreground points, +including false negatives, to be refined. During inference, the proposal and +point migration modules run in parallel and are assembled to form binary masks +of instances. We also introduce two margin-based losses for the point migration +to enforce corrections for the false positives/negatives and cohesion of +foreground points, significantly improving the performance. 
Experimental +results from three datasets, such as ScanNetV2, S3DIS, and STPLS3D, show that +our proposed method outperforms existing works, demonstrating the effectiveness +of the new instance representation with spherical coordinates.",cs.CV,"['cs.CV', 'cs.LG']" +DiffSCI: Zero-Shot Snapshot Compressive Imaging via Iterative Spectral Diffusion Model,Zhenghao Pan · Haijin Zeng · Jiezhang Cao · Kai Zhang · Yongyong Chen,https://github.com/PAN083/DiffSCI,https://arxiv.org/abs/2311.11417,,2311.11417.pdf,DiffSCI: Zero-Shot Snapshot Compressive Imaging via Iterative Spectral Diffusion Model,"This paper endeavors to advance the precision of snapshot compressive imaging +(SCI) reconstruction for multispectral image (MSI). To achieve this, we +integrate the advantageous attributes of established SCI techniques and an +image generative model, propose a novel structured zero-shot diffusion model, +dubbed DiffSCI. DiffSCI leverages the structural insights from the deep prior +and optimization-based methodologies, complemented by the generative +capabilities offered by the contemporary denoising diffusion model. +Specifically, firstly, we employ a pre-trained diffusion model, which has been +trained on a substantial corpus of RGB images, as the generative denoiser +within the Plug-and-Play framework for the first time. This integration allows +for the successful completion of SCI reconstruction, especially in the case +that current methods struggle to address effectively. Secondly, we +systematically account for spectral band correlations and introduce a robust +methodology to mitigate wavelength mismatch, thus enabling seamless adaptation +of the RGB diffusion model to MSIs. Thirdly, an accelerated algorithm is +implemented to expedite the resolution of the data subproblem. This +augmentation not only accelerates the convergence rate but also elevates the +quality of the reconstruction process. We present extensive testing to show +that DiffSCI exhibits discernible performance enhancements over prevailing +self-supervised and zero-shot approaches, surpassing even supervised +transformer counterparts across both simulated and real datasets. Our code will +be available.",cs.CV,['cs.CV'] +$\mathsf{LQMFormer}$:~Language-aware Query Mask Transformer for Referring Image Segmentation,Nisarg Shah · Vibashan VS · Vishal M. Patel, ,https://arxiv.org/abs/2312.12198,,,Mask Grounding for Referring Image Segmentation,"Referring Image Segmentation (RIS) is a challenging task that requires an +algorithm to segment objects referred by free-form language expressions. +Despite significant progress in recent years, most state-of-the-art (SOTA) +methods still suffer from considerable language-image modality gap at the pixel +and word level. These methods generally 1) rely on sentence-level language +features for language-image alignment and 2) lack explicit training supervision +for fine-grained visual grounding. Consequently, they exhibit weak object-level +correspondence between visual and language features. Without well-grounded +features, prior methods struggle to understand complex expressions that require +strong reasoning over relationships among multiple objects, especially when +dealing with rarely used or ambiguous clauses. To tackle this challenge, we +introduce a novel Mask Grounding auxiliary task that significantly improves +visual grounding within language features, by explicitly teaching the model to +learn fine-grained correspondence between masked textual tokens and their +matching visual objects. 
Mask Grounding can be directly used on prior RIS +methods and consistently bring improvements. Furthermore, to holistically +address the modality gap, we also design a cross-modal alignment loss and an +accompanying alignment module. These additions work synergistically with Mask +Grounding. With all these techniques, our comprehensive approach culminates in +MagNet (Mask-grounded Network), an architecture that significantly outperforms +prior arts on three key benchmarks (RefCOCO, RefCOCO+ and G-Ref), demonstrating +our method's effectiveness in addressing current limitations of RIS algorithms. +Our code and pre-trained weights will be released.",cs.CV,['cs.CV'] +CLIP as RNN: Segment Countless Visual Concepts without Training Endeavor,Shuyang Sun · Runjia Li · Philip H.S. Torr · Xiuye Gu · Siyang Li, ,https://arxiv.org/abs/2312.07661,,2312.07661.pdf,CLIP as RNN: Segment Countless Visual Concepts without Training Endeavor,"Existing open-vocabulary image segmentation methods require a fine-tuning +step on mask labels and/or image-text datasets. Mask labels are +labor-intensive, which limits the number of categories in segmentation +datasets. Consequently, the vocabulary capacity of pre-trained VLMs is severely +reduced after fine-tuning. However, without fine-tuning, VLMs trained under +weak image-text supervision tend to make suboptimal mask predictions. To +alleviate these issues, we introduce a novel recurrent framework that +progressively filters out irrelevant texts and enhances mask quality without +training efforts. The recurrent unit is a two-stage segmenter built upon a +frozen VLM. Thus, our model retains the VLM's broad vocabulary space and equips +it with segmentation ability. Experiments show that our method outperforms not +only the training-free counterparts, but also those fine-tuned with millions of +data samples, and sets the new state-of-the-art records for both zero-shot +semantic and referring segmentation. Concretely, we improve the current record +by 28.8, 16.0, and 6.9 mIoU on Pascal VOC, COCO Object, and Pascal Context.",cs.CV,"['cs.CV', 'cs.CL', 'cs.LG', 'cs.MM']" +Improving Generalization via Meta-Learning on Hard Samples,Nishant Jain · Arun Suggala · Pradeep Shenoy, ,https://arxiv.org/abs/2403.12236,,2403.12236.pdf,Improving Generalization via Meta-Learning on Hard Samples,"Learned reweighting (LRW) approaches to supervised learning use an +optimization criterion to assign weights for training instances, in order to +maximize performance on a representative validation dataset. We pose and +formalize the problem of optimized selection of the validation set used in LRW +training, to improve classifier generalization. In particular, we show that +using hard-to-classify instances in the validation set has both a theoretical +connection to, and strong empirical evidence of generalization. We provide an +efficient algorithm for training this meta-optimized model, as well as a simple +train-twice heuristic for careful comparative study. We demonstrate that LRW +with easy validation data performs consistently worse than LRW with hard +validation data, establishing the validity of our meta-optimization problem. +Our proposed algorithm outperforms a wide range of baselines on a range of +datasets and domain shift challenges (Imagenet-1K, CIFAR-100, Clothing-1M, +CAMELYON, WILDS, etc.), with ~1% gains using VIT-B on Imagenet. 
We also show +that using naturally hard examples for validation (Imagenet-R / Imagenet-A) in +LRW training for Imagenet improves performance on both clean and naturally hard +test instances by 1-2%. Secondary analyses show that using hard validation data +in an LRW framework improves margins on test data, hinting at the mechanism +underlying our empirical gains. We believe this work opens up new research +directions for the meta-optimization of meta-learning in a supervised learning +context.",cs.LG,"['cs.LG', 'cs.CV']" +PICTURE: PhotorealistIC virtual Try-on from UnconstRained dEsigns,Shuliang Ning · Duomin Wang · Yipeng Qin · Zirong Jin · Baoyuan Wang · Xiaoguang Han, ,https://arxiv.org/abs/2312.04534,,2312.04534.pdf,PICTURE: PhotorealistIC virtual Try-on from UnconstRained dEsigns,"In this paper, we propose a novel virtual try-on from unconstrained designs +(ucVTON) task to enable photorealistic synthesis of personalized composite +clothing on input human images. Unlike prior arts constrained by specific input +types, our method allows flexible specification of style (text or image) and +texture (full garment, cropped sections, or texture patches) conditions. To +address the entanglement challenge when using full garment images as +conditions, we develop a two-stage pipeline with explicit disentanglement of +style and texture. In the first stage, we generate a human parsing map +reflecting the desired style conditioned on the input. In the second stage, we +composite textures onto the parsing map areas based on the texture input. To +represent complex and non-stationary textures that have never been achieved in +previous fashion editing works, we first propose extracting hierarchical and +balanced CLIP features and applying position encoding in VTON. Experiments +demonstrate superior synthesis quality and personalization enabled by our +method. The flexible control over style and texture mixing brings virtual +try-on to a new level of user experience for online shopping and fashion +design.",cs.CV,['cs.CV'] +KPConvX: Modernizing Kernel Point Convolution with Kernel Attention,Hugues Thomas · Yao-Hung Hubert Tsai · Timothy Barfoot · Jian Zhang, ,https://arxiv.org/abs/2405.13194,,2405.13194.pdf,KPConvX: Modernizing Kernel Point Convolution with Kernel Attention,"In the field of deep point cloud understanding, KPConv is a unique +architecture that uses kernel points to locate convolutional weights in space, +instead of relying on Multi-Layer Perceptron (MLP) encodings. While it +initially achieved success, it has since been surpassed by recent MLP networks +that employ updated designs and training strategies. Building upon the kernel +point principle, we present two novel designs: KPConvD (depthwise KPConv), a +lighter design that enables the use of deeper architectures, and KPConvX, an +innovative design that scales the depthwise convolutional weights of KPConvD +with kernel attention values. Using KPConvX with a modern architecture and +training strategy, we are able to outperform current state-of-the-art +approaches on the ScanObjectNN, Scannetv2, and S3DIS datasets. 
We validate our +design choices through ablation studies and release our code and models.",cs.CV,['cs.CV'] +FedAS: Bridging Inconsistency in Personalized Federated Learning,Xiyuan Yang · Wenke Huang · Mang Ye,https://github.com/xiyuanyang45/FedAS,,https://dl.acm.org/doi/10.5555/3666122.3669282,,,,,nan +DeIl: Direct and Inverse CLIP for Open-World Few-Shot Learning,Shuai Shao · Yu Bai · Yan WANG · Bao-di Liu · Yicong Zhou, ,,https://www.semanticscholar.org/paper/Collaborative-Consortium-of-Foundation-Models-for-Shao-Bai/90668de8b1c5dcb0471444e3177dc28e20fce5d4,,,,,nan +"Synthesize, Diagnose, and Optimize: Towards Fine-Grained Vision-Language Understanding",Wujian Peng · Sicheng Xie · Zuyao You · Shiyi Lan · Zuxuan Wu,https://github.com/wjpoom/SPEC,https://arxiv.org/abs/2312.00081,,2312.00081.pdf,"Synthesize, Diagnose, and Optimize: Towards Fine-Grained Vision-Language Understanding","Vision language models (VLM) have demonstrated remarkable performance across +various downstream tasks. However, understanding fine-grained visual-linguistic +concepts, such as attributes and inter-object relationships, remains a +significant challenge. While several benchmarks aim to evaluate VLMs in finer +granularity, their primary focus remains on the linguistic aspect, neglecting +the visual dimension. Here, we highlight the importance of evaluating VLMs from +both a textual and visual perspective. We introduce a progressive pipeline to +synthesize images that vary in a specific attribute while ensuring consistency +in all other aspects. Utilizing this data engine, we carefully design a +benchmark, SPEC, to diagnose the comprehension of object size, position, +existence, and count. Subsequently, we conduct a thorough evaluation of four +leading VLMs on SPEC. Surprisingly, their performance is close to random guess, +revealing significant limitations. With this in mind, we propose a simple yet +effective approach to optimize VLMs in fine-grained understanding, achieving +significant improvements on SPEC without compromising the zero-shot +performance. Results on two additional fine-grained benchmarks also show +consistent improvements, further validating the transferability of our +approach. Code and data are available at https://github.com/wjpoom/SPEC.",cs.CV,['cs.CV'] +CPR: Retrieval Augmented Generation for Copyright Protection,Aditya Golatkar · Alessandro Achille · Luca Zancato · Yu-Xiang Wang · Ashwin Swaminathan · Stefano Soatto · Stefano Soatto, ,https://arxiv.org/abs/2403.18920,,2403.18920.pdf,CPR: Retrieval Augmented Generation for Copyright Protection,"Retrieval Augmented Generation (RAG) is emerging as a flexible and robust +technique to adapt models to private users data without training, to handle +credit attribution, and to allow efficient machine unlearning at scale. +However, RAG techniques for image generation may lead to parts of the retrieved +samples being copied in the model's output. To reduce risks of leaking private +information contained in the retrieved set, we introduce Copy-Protected +generation with Retrieval (CPR), a new method for RAG with strong copyright +protection guarantees in a mixed-private setting for diffusion models.CPR +allows to condition the output of diffusion models on a set of retrieved +images, while also guaranteeing that unique identifiable information about +those example is not exposed in the generated outputs. 
In particular, it does +so by sampling from a mixture of public (safe) distribution and private (user) +distribution by merging their diffusion scores at inference. We prove that CPR +satisfies Near Access Freeness (NAF) which bounds the amount of information an +attacker may be able to extract from the generated images. We provide two +algorithms for copyright protection, CPR-KL and CPR-Choose. Unlike previously +proposed rejection-sampling-based NAF methods, our methods enable efficient +copyright-protected sampling with a single run of backward diffusion. We show +that our method can be applied to any pre-trained conditional diffusion model, +such as Stable Diffusion or unCLIP. In particular, we empirically show that +applying CPR on top of unCLIP improves quality and text-to-image alignment of +the generated results (81.4 to 83.17 on TIFA benchmark), while enabling credit +attribution, copy-right protection, and deterministic, constant time, +unlearning.",cs.CR,"['cs.CR', 'cs.AI', 'cs.CV']" +FaceTalk: Audio-Driven Motion Diffusion for Neural Parametric Head Models,Shivangi Aneja · Justus Thies · Angela Dai · Matthias Nießner, ,https://arxiv.org/abs/2312.08459,,2312.08459.pdf,FaceTalk: Audio-Driven Motion Diffusion for Neural Parametric Head Models,"We introduce FaceTalk, a novel generative approach designed for synthesizing +high-fidelity 3D motion sequences of talking human heads from input audio +signal. To capture the expressive, detailed nature of human heads, including +hair, ears, and finer-scale eye movements, we propose to couple speech signal +with the latent space of neural parametric head models to create high-fidelity, +temporally coherent motion sequences. We propose a new latent diffusion model +for this task, operating in the expression space of neural parametric head +models, to synthesize audio-driven realistic head sequences. In the absence of +a dataset with corresponding NPHM expressions to audio, we optimize for these +correspondences to produce a dataset of temporally-optimized NPHM expressions +fit to audio-video recordings of people talking. To the best of our knowledge, +this is the first work to propose a generative approach for realistic and +high-quality motion synthesis of volumetric human heads, representing a +significant advancement in the field of audio-driven 3D animation. Notably, our +approach stands out in its ability to generate plausible motion sequences that +can produce high-fidelity head animation coupled with the NPHM shape space. Our +experimental results substantiate the effectiveness of FaceTalk, consistently +achieving superior and visually natural motion, encompassing diverse facial +expressions and styles, outperforming existing methods by 75% in perceptual +user study evaluation.",cs.CV,"['cs.CV', 'cs.AI', 'cs.GR', 'cs.SD', 'eess.AS']" +Binding Touch to Everything: Learning Unified Multimodal Tactile Representations,Fengyu Yang · Chao Feng · Ziyang Chen · Hyoungseob Park · Daniel Wang · Yiming Dou · Ziyao Zeng · xien chen · Suchisrit Gangopadhyay · Andrew Owens · Alex Wong, ,https://arxiv.org/abs/2401.18084,,2401.18084.pdf,Binding Touch to Everything: Learning Unified Multimodal Tactile Representations,"The ability to associate touch with other modalities has huge implications +for humans and computational systems. However, multimodal learning with touch +remains challenging due to the expensive data collection process and +non-standardized sensor outputs. 
We introduce UniTouch, a unified tactile model +for vision-based touch sensors connected to multiple modalities, including +vision, language, and sound. We achieve this by aligning our UniTouch +embeddings to pretrained image embeddings already associated with a variety of +other modalities. We further propose learnable sensor-specific tokens, allowing +the model to learn from a set of heterogeneous tactile sensors, all at the same +time. UniTouch is capable of conducting various touch sensing tasks in the +zero-shot setting, from robot grasping prediction to touch image question +answering. To the best of our knowledge, UniTouch is the first to demonstrate +such capabilities. Project page: https://cfeng16.github.io/UniTouch/",cs.CV,"['cs.CV', 'cs.RO']" +Text-Conditioned Generative Model of 3D Strand-based Human Hairstyles,Vanessa Sklyarova · Egor Zakharov · Otmar Hilliges · Michael J. Black · Justus Thies,https://haar.is.tue.mpg.de/,https://arxiv.org/abs/2312.11666,,2312.11666.pdf,HAAR: Text-Conditioned Generative Model of 3D Strand-based Human Hairstyles,"We present HAAR, a new strand-based generative model for 3D human hairstyles. +Specifically, based on textual inputs, HAAR produces 3D hairstyles that could +be used as production-level assets in modern computer graphics engines. Current +AI-based generative models take advantage of powerful 2D priors to reconstruct +3D content in the form of point clouds, meshes, or volumetric functions. +However, by using the 2D priors, they are intrinsically limited to only +recovering the visual parts. Highly occluded hair structures can not be +reconstructed with those methods, and they only model the ''outer shell'', +which is not ready to be used in physics-based rendering or simulation +pipelines. In contrast, we propose a first text-guided generative method that +uses 3D hair strands as an underlying representation. Leveraging 2D visual +question-answering (VQA) systems, we automatically annotate synthetic hair +models that are generated from a small set of artist-created hairstyles. This +allows us to train a latent diffusion model that operates in a common hairstyle +UV space. In qualitative and quantitative studies, we demonstrate the +capabilities of the proposed model and compare it to existing hairstyle +generation approaches.",cs.CV,"['cs.CV', 'cs.GR']" +Sieve: Multimodal Dataset Pruning using Image-Captioning Models,Anas Mahmoud · Mostafa Elhoushi · Amro Abbas · Yu Yang · Newsha Ardalani · Hugh Leather · Ari Morcos, ,https://arxiv.org/abs/2310.02110,,2310.02110.pdf,Sieve: Multimodal Dataset Pruning Using Image Captioning Models,"Vision-Language Models (VLMs) are pretrained on large, diverse, and noisy +web-crawled datasets. This underscores the critical need for dataset pruning, +as the quality of these datasets is strongly correlated with the performance of +VLMs on downstream tasks. Using CLIPScore from a pretrained model to only train +models using highly-aligned samples is one of the most successful methods for +pruning. We argue that this approach suffers from multiple limitations +including: false positives and negatives due to CLIP's pretraining on noisy +labels. We propose a pruning signal, Sieve, that employs synthetic captions +generated by image-captioning models pretrained on small, diverse, and +well-aligned image-text pairs to evaluate the alignment of noisy image-text +pairs. 
To bridge the gap between the limited diversity of generated captions +and the high diversity of alternative text (alt-text), we estimate the semantic +textual similarity in the embedding space of a language model pretrained on +unlabeled text corpus. Using DataComp, a multimodal dataset filtering +benchmark, when evaluating on 38 downstream tasks, our pruning approach, +surpasses CLIPScore by 2.6\% and 1.7\% on medium and large scale respectively. +In addition, on retrieval tasks, Sieve leads to a significant improvement of +2.7% and 4.5% on medium and large scale respectively.",cs.CV,['cs.CV'] +Streaming Dense Video Captioning,Xingyi Zhou · Anurag Arnab · Shyamal Buch · Shen Yan · Austin Myers · Xuehan Xiong · Arsha Nagrani · Cordelia Schmid, ,https://arxiv.org/abs/2404.01297,,2404.01297.pdf,Streaming Dense Video Captioning,"An ideal model for dense video captioning -- predicting captions localized +temporally in a video -- should be able to handle long input videos, predict +rich, detailed textual descriptions, and be able to produce outputs before +processing the entire video. Current state-of-the-art models, however, process +a fixed number of downsampled frames, and make a single full prediction after +seeing the whole video. We propose a streaming dense video captioning model +that consists of two novel components: First, we propose a new memory module, +based on clustering incoming tokens, which can handle arbitrarily long videos +as the memory is of a fixed size. Second, we develop a streaming decoding +algorithm that enables our model to make predictions before the entire video +has been processed. Our model achieves this streaming ability, and +significantly improves the state-of-the-art on three dense video captioning +benchmarks: ActivityNet, YouCook2 and ViTT. Our code is released at +https://github.com/google-research/scenic.",cs.CV,['cs.CV'] +DS-NeRV: Implicit Neural Video Representation with Decomposed Static and Dynamic Codes,Hao Yan · Zhihui Ke · Xiaobo Zhou · Tie Qiu · Xidong Shi · DaDong Jiang,https://haoyan14.github.io/DS-NeRV/,https://arxiv.org/abs/2403.15679,,,DS-NeRV: Implicit Neural Video Representation with Decomposed Static and Dynamic Codes,"Implicit neural representations for video (NeRV) have recently become a novel +way for high-quality video representation. However, existing works employ a +single network to represent the entire video, which implicitly confuse static +and dynamic information. This leads to an inability to effectively compress the +redundant static information and lack the explicitly modeling of global +temporal-coherent dynamic details. To solve above problems, we propose DS-NeRV, +which decomposes videos into sparse learnable static codes and dynamic codes +without the need for explicit optical flow or residual supervision. By setting +different sampling rates for two codes and applying weighted sum and +interpolation sampling methods, DS-NeRV efficiently utilizes redundant static +information while maintaining high-frequency details. Additionally, we design a +cross-channel attention-based (CCA) fusion module to efficiently fuse these two +codes for frame decoding. Our approach achieves a high quality reconstruction +of 31.2 PSNR with only 0.35M parameters thanks to separate static and dynamic +codes representation and outperforms existing NeRV methods in many downstream +tasks. 
Our project website is at https://haoyan14.github.io/DS-NeRV.",cs.CV,"['cs.CV', 'cs.MM']" +SANeRF-HQ: Segment Anything for NeRF in High Quality,Yichen Liu · Benran Hu · Chi-Keung Tang · Yu-Wing Tai, ,https://arxiv.org/abs/2312.01531,,2312.01531.pdf,SANeRF-HQ: Segment Anything for NeRF in High Quality,"Recently, the Segment Anything Model (SAM) has showcased remarkable +capabilities of zero-shot segmentation, while NeRF (Neural Radiance Fields) has +gained popularity as a method for various 3D problems beyond novel view +synthesis. Though there exist initial attempts to incorporate these two methods +into 3D segmentation, they face the challenge of accurately and consistently +segmenting objects in complex scenarios. In this paper, we introduce the +Segment Anything for NeRF in High Quality (SANeRF-HQ) to achieve high-quality +3D segmentation of any target object in a given scene. SANeRF-HQ utilizes SAM +for open-world object segmentation guided by user-supplied prompts, while +leveraging NeRF to aggregate information from different viewpoints. To overcome +the aforementioned challenges, we employ density field and RGB similarity to +enhance the accuracy of segmentation boundary during the aggregation. +Emphasizing on segmentation accuracy, we evaluate our method on multiple NeRF +datasets where high-quality ground-truths are available or manually annotated. +SANeRF-HQ shows a significant quality improvement over state-of-the-art methods +in NeRF object segmentation, provides higher flexibility for object +localization, and enables more consistent object segmentation across multiple +views. Results and code are available at the project site: +https://lyclyc52.github.io/SANeRF-HQ/.",cs.CV,['cs.CV'] +\emph{RealCustom}: Narrowing Real Text Word for Real-Time Open-Domain Text-to-Image Customization,Mengqi Huang · Zhendong Mao · Mingcong Liu · Qian HE · Yongdong Zhang,https://corleone-huang.github.io/realcustom/,https://arxiv.org/abs/2403.00483,,2403.00483.pdf,RealCustom: Narrowing Real Text Word for Real-Time Open-Domain Text-to-Image Customization,"Text-to-image customization, which aims to synthesize text-driven images for +the given subjects, has recently revolutionized content creation. Existing +works follow the pseudo-word paradigm, i.e., represent the given subjects as +pseudo-words and then compose them with the given text. However, the inherent +entangled influence scope of pseudo-words with the given text results in a +dual-optimum paradox, i.e., the similarity of the given subjects and the +controllability of the given text could not be optimal simultaneously. We +present RealCustom that, for the first time, disentangles similarity from +controllability by precisely limiting subject influence to relevant parts only, +achieved by gradually narrowing real text word from its general connotation to +the specific subject and using its cross-attention to distinguish relevance. +Specifically, RealCustom introduces a novel ""train-inference"" decoupled +framework: (1) during training, RealCustom learns general alignment between +visual conditions to original textual conditions by a novel adaptive scoring +module to adaptively modulate influence quantity; (2) during inference, a novel +adaptive mask guidance strategy is proposed to iteratively update the influence +scope and influence quantity of the given subjects to gradually narrow the +generation of the real text word. 
Comprehensive experiments demonstrate the +superior real-time customization ability of RealCustom in the open domain, +achieving both unprecedented similarity of the given subjects and +controllability of the given text for the first time. The project page is +https://corleone-huang.github.io/realcustom/.",cs.CV,['cs.CV'] +Leveraging Vision-Language Models for Improving Domain Generalization in Image Classification,Sravanti Addepalli · Ashish Asokan · Lakshay Sharma · R. Venkatesh Babu, ,https://arxiv.org/abs/2310.08255,,2310.08255.pdf,Leveraging Vision-Language Models for Improving Domain Generalization in Image Classification,"Vision-Language Models (VLMs) such as CLIP are trained on large amounts of +image-text pairs, resulting in remarkable generalization across several data +distributions. However, in several cases, their expensive training and data +collection/curation costs do not justify the end application. This motivates a +vendor-client paradigm, where a vendor trains a large-scale VLM and grants only +input-output access to clients on a pay-per-query basis in a black-box setting. +The client aims to minimize inference cost by distilling the VLM to a student +model using the limited available task-specific data, and further deploying +this student model in the downstream application. While naive distillation +largely improves the In-Domain (ID) accuracy of the student, it fails to +transfer the superior out-of-distribution (OOD) generalization of the VLM +teacher using the limited available labeled images. To mitigate this, we +propose Vision-Language to Vision - Align, Distill, Predict (VL2V-ADiP), which +first aligns the vision and language modalities of the teacher model with the +vision modality of a pre-trained student model, and further distills the +aligned VLM representations to the student. This maximally retains the +pre-trained features of the student, while also incorporating the rich +representations of the VLM image encoder and the superior generalization of the +text embeddings. The proposed approach achieves state-of-the-art results on the +standard Domain Generalization benchmarks in a black-box teacher setting as +well as a white-box setting where the weights of the VLM are accessible.",cs.CV,['cs.CV'] +TransLoc4D: Transformer-based 4D Radar Place Recognition,Guohao Peng · Heshan Li · Yangyang Zhao · Jun Zhang · Zhenyu Wu · Pengyu Zheng · Danwei Wang, ,https://arxiv.org/abs/2401.13082,,2401.13082.pdf,PlaceFormer: Transformer-based Visual Place Recognition using Multi-Scale Patch Selection and Fusion,"Visual place recognition is a challenging task in the field of computer +vision, and autonomous robotics and vehicles, which aims to identify a location +or a place from visual inputs. Contemporary methods in visual place recognition +employ convolutional neural networks and utilize every region within the image +for the place recognition task. However, the presence of dynamic and +distracting elements in the image may impact the effectiveness of the place +recognition process. Therefore, it is meaningful to focus on task-relevant +regions of the image for improved recognition. In this paper, we present +PlaceFormer, a novel transformer-based approach for visual place recognition. +PlaceFormer employs patch tokens from the transformer to create global image +descriptors, which are then used for image retrieval. To re-rank the retrieved +images, PlaceFormer merges the patch tokens from the transformer to form +multi-scale patches. 
Utilizing the transformer's self-attention mechanism, it +selects patches that correspond to task-relevant areas in an image. These +selected patches undergo geometric verification, generating similarity scores +across different patch sizes. Subsequently, spatial scores from each patch size +are fused to produce a final similarity score. This score is then used to +re-rank the images initially retrieved using global image descriptors. +Extensive experiments on benchmark datasets demonstrate that PlaceFormer +outperforms several state-of-the-art methods in terms of accuracy and +computational efficiency, requiring less time and memory.",cs.CV,"['cs.CV', 'cs.RO']" +Domain Gap Embeddings for Generative Dataset Augmentation,Yinong Wang · Younjoon Chung · Chen Henry Wu · Fernando De la Torre, ,https://arxiv.org/abs/2312.05387,,2312.05387.pdf,Cross Domain Generative Augmentation: Domain Generalization with Latent Diffusion Models,"Despite the huge effort in developing novel regularizers for Domain +Generalization (DG), adding simple data augmentation to the vanilla ERM which +is a practical implementation of the Vicinal Risk Minimization principle (VRM) +\citep{chapelle2000vicinal} outperforms or stays competitive with many of the +proposed regularizers. The VRM reduces the estimation error in ERM by replacing +the point-wise kernel estimates with a more precise estimation of true data +distribution that reduces the gap between data points \textbf{within each +domain}. However, in the DG setting, the estimation error of true data +distribution by ERM is mainly caused by the distribution shift \textbf{between +domains} which cannot be fully addressed by simple data augmentation techniques +within each domain. Inspired by this limitation of VRM, we propose a novel data +augmentation named Cross Domain Generative Augmentation (CDGA) that replaces +the pointwise kernel estimates in ERM with new density estimates in the +\textbf{vicinity of domain pairs} so that the gap between domains is further +reduced. To this end, CDGA, which is built upon latent diffusion models (LDM), +generates synthetic images to fill the gap between all domains and as a result, +reduces the non-iidness. We show that CDGA outperforms SOTA DG methods under +the Domainbed benchmark. To explain the effectiveness of CDGA, we generate more +than 5 Million synthetic images and perform extensive ablation studies +including data scaling laws, distribution visualization, domain shift +quantification, adversarial robustness, and loss landscape analysis.",cs.LG,['cs.LG'] +Detours for Navigating Instructional Videos,Kumar Ashutosh · Zihui Xue · Tushar Nagarajan · Kristen Grauman, ,https://arxiv.org/abs/2401.01823,,2401.01823.pdf,Detours for Navigating Instructional Videos,"We introduce the video detours problem for navigating instructional videos. +Given a source video and a natural language query asking to alter the how-to +video's current path of execution in a certain way, the goal is to find a +related ''detour video'' that satisfies the requested alteration. To address +this challenge, we propose VidDetours, a novel video-language approach that +learns to retrieve the targeted temporal segments from a large repository of +how-to's using video-and-text conditioned queries. Furthermore, we devise a +language-based pipeline that exploits how-to video narration text to create +weakly supervised training data. 
We demonstrate our idea applied to the domain +of how-to cooking videos, where a user can detour from their current recipe to +find steps with alternate ingredients, tools, and techniques. Validating on a +ground truth annotated dataset of 16K samples, we show our model's significant +improvements over best available methods for video retrieval and question +answering, with recall rates exceeding the state of the art by 35%.",cs.CV,['cs.CV'] +Iterated Learning Improves Compositionality in Large Vision-Language Models,Chenhao Zheng · Jieyu Zhang · Aniruddha Kembhavi · Ranjay Krishna, ,https://arxiv.org/abs/2404.02145,,2404.02145.pdf,Iterated Learning Improves Compositionality in Large Vision-Language Models,"A fundamental characteristic common to both human vision and natural language +is their compositional nature. Yet, despite the performance gains contributed +by large vision and language pretraining, recent investigations find that +most-if not all-our state-of-the-art vision-language models struggle at +compositionality. They are unable to distinguish between images of "" a girl in +white facing a man in black"" and ""a girl in black facing a man in white"". +Moreover, prior work suggests that compositionality doesn't arise with scale: +larger model sizes or training data don't help. This paper develops a new +iterated training algorithm that incentivizes compositionality. We draw on +decades of cognitive science research that identifies cultural transmission-the +need to teach a new generation-as a necessary inductive prior that incentivizes +humans to develop compositional languages. Specifically, we reframe +vision-language contrastive learning as the Lewis Signaling Game between a +vision agent and a language agent, and operationalize cultural transmission by +iteratively resetting one of the agent's weights during training. After every +iteration, this training paradigm induces representations that become ""easier +to learn"", a property of compositional languages: e.g. our model trained on +CC3M and CC12M improves standard CLIP by 4.7%, 4.0% respectively in the +SugarCrepe benchmark.",cs.CV,['cs.CV'] +Contrastive Mean-Shift Learning for Generalized Category Discovery,Sua Choi · Dahyun Kang · Minsu Cho, ,https://arxiv.org/abs/2404.09451,,2404.09451.pdf,Contrastive Mean-Shift Learning for Generalized Category Discovery,"We address the problem of generalized category discovery (GCD) that aims to +partition a partially labeled collection of images; only a small part of the +collection is labeled and the total number of target classes is unknown. To +address this generalized image clustering problem, we revisit the mean-shift +algorithm, i.e., a classic, powerful technique for mode seeking, and +incorporate it into a contrastive learning framework. The proposed method, +dubbed Contrastive Mean-Shift (CMS) learning, trains an image encoder to +produce representations with better clustering properties by an iterative +process of mean shift and contrastive update.
Experiments demonstrate that our +method, both in settings with and without the total number of clusters being +known, achieves state-of-the-art performance on six public GCD benchmarks +without bells and whistles.",cs.CV,['cs.CV'] +Volumetric Environment Representation for Vision-Language Navigation,Liu · Wenguan Wang · Yi Yang, ,https://arxiv.org/abs/2403.14158v1,,2403.14158v1.pdf,Volumetric Environment Representation for Vision-Language Navigation,"Vision-language navigation (VLN) requires an agent to navigate through an 3D +environment based on visual observations and natural language instructions. It +is clear that the pivotal factor for successful navigation lies in the +comprehensive scene understanding. Previous VLN agents employ monocular +frameworks to extract 2D features of perspective views directly. Though +straightforward, they struggle for capturing 3D geometry and semantics, leading +to a partial and incomplete environment representation. To achieve a +comprehensive 3D representation with fine-grained details, we introduce a +Volumetric Environment Representation (VER), which voxelizes the physical world +into structured 3D cells. For each cell, VER aggregates multi-view 2D features +into such a unified 3D space via 2D-3D sampling. Through coarse-to-fine feature +extraction and multi-task learning for VER, our agent predicts 3D occupancy, 3D +room layout, and 3D bounding boxes jointly. Based on online collected VERs, our +agent performs volume state estimation and builds episodic memory for +predicting the next step. Experimental results show our environment +representations from multi-task learning lead to evident performance gains on +VLN. Our model achieves state-of-the-art performance across VLN benchmarks +(R2R, REVERIE, and R4R).",cs.CV,['cs.CV'] +DIRECT-3D: Learning Direct Text-to-3D Generation on Massive Noisy 3D Data,Qihao Liu · Yi Zhang · Song Bai · Adam Kortylewski · Alan L. Yuille, ,https://arxiv.org/abs/2405.14832,,2405.14832.pdf,Direct3D: Scalable Image-to-3D Generation via 3D Latent Diffusion Transformer,"Generating high-quality 3D assets from text and images has long been +challenging, primarily due to the absence of scalable 3D representations +capable of capturing intricate geometry distributions. In this work, we +introduce Direct3D, a native 3D generative model scalable to in-the-wild input +images, without requiring a multiview diffusion model or SDS optimization. Our +approach comprises two primary components: a Direct 3D Variational Auto-Encoder +(D3D-VAE) and a Direct 3D Diffusion Transformer (D3D-DiT). D3D-VAE efficiently +encodes high-resolution 3D shapes into a compact and continuous latent triplane +space. Notably, our method directly supervises the decoded geometry using a +semi-continuous surface sampling strategy, diverging from previous methods +relying on rendered images as supervision signals. D3D-DiT models the +distribution of encoded 3D latents and is specifically designed to fuse +positional information from the three feature maps of the triplane latent, +enabling a native 3D generative model scalable to large-scale 3D datasets. +Additionally, we introduce an innovative image-to-3D generation pipeline +incorporating semantic and pixel-level image conditions, allowing the model to +produce 3D shapes consistent with the provided conditional image input. 
+Extensive experiments demonstrate the superiority of our large-scale +pre-trained Direct3D over previous image-to-3D approaches, achieving +significantly better generation quality and generalization ability, thus +establishing a new state-of-the-art for 3D content creation. Project page: +https://nju-3dv.github.io/projects/Direct3D/.",cs.CV,['cs.CV'] +One Prompt Word is Enough to Boost Adversarial Robustness for Pre-trained Vision-Language Models,Lin Li · Haoyan Guan · Jianing Qiu · Michael Spratling,https://github.com/TreeLLi/APT,https://arxiv.org/abs/2403.01849,,2403.01849.pdf,One Prompt Word is Enough to Boost Adversarial Robustness for Pre-trained Vision-Language Models,"Large pre-trained Vision-Language Models (VLMs) like CLIP, despite having +remarkable generalization ability, are highly vulnerable to adversarial +examples. This work studies the adversarial robustness of VLMs from the novel +perspective of the text prompt instead of the extensively studied model weights +(frozen in this work). We first show that the effectiveness of both adversarial +attack and defense are sensitive to the used text prompt. Inspired by this, we +propose a method to improve resilience to adversarial attacks by learning a +robust text prompt for VLMs. The proposed method, named Adversarial Prompt +Tuning (APT), is effective while being both computationally and data efficient. +Extensive experiments are conducted across 15 datasets and 4 data sparsity +schemes (from 1-shot to full training data settings) to show APT's superiority +over hand-engineered prompts and other state-of-the-art adaption methods. APT +demonstrated excellent abilities in terms of the in-distribution performance +and the generalization under input distribution shift and across datasets. +Surprisingly, by simply adding one learned word to the prompts, APT can +significantly boost the accuracy and robustness (epsilon=4/255) over the +hand-engineered prompts by +13% and +8.5% on average respectively. The +improvement further increases, in our most effective setting, to +26.4% for +accuracy and +16.7% for robustness. Code is available at +https://github.com/TreeLLi/APT.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +Progressive Divide-and-Conquer via Subsampling Decomposition for Accelerated MRI,Chong Wang · Lanqing Guo · Yufei Wang · Hao Cheng · Yi Yu · Bihan Wen,https://github.com/ChongWang1024/PDAC,https://arxiv.org/abs/2403.10064,,2403.10064.pdf,Progressive Divide-and-Conquer via Subsampling Decomposition for Accelerated MRI,"Deep unfolding networks (DUN) have emerged as a popular iterative framework +for accelerated magnetic resonance imaging (MRI) reconstruction. However, +conventional DUN aims to reconstruct all the missing information within the +entire null space in each iteration. Thus it could be challenging when dealing +with highly ill-posed degradation, usually leading to unsatisfactory +reconstruction. In this work, we propose a Progressive Divide-And-Conquer +(PDAC) strategy, aiming to break down the subsampling process in the actual +severe degradation and thus perform reconstruction sequentially. Starting from +decomposing the original maximum-a-posteriori problem of accelerated MRI, we +present a rigorous derivation of the proposed PDAC framework, which could be +further unfolded into an end-to-end trainable network. Specifically, each +iterative stage in PDAC focuses on recovering a distinct moderate degradation +according to the decomposition. 
Furthermore, as part of the PDAC iteration, +such decomposition is adaptively learned as an auxiliary task through a +degradation predictor which provides an estimation of the decomposed sampling +mask. Following this prediction, the sampling mask is further integrated via a +severity conditioning module to ensure awareness of the degradation severity at +each stage. Extensive experiments demonstrate that our proposed method achieves +superior performance on the publicly available fastMRI and Stanford2D FSE +datasets in both multi-coil and single-coil settings.",eess.IV,"['eess.IV', 'cs.CV']" +Physical Backdoor: Towards Temperature-based Backdoor Attacks in the Physical World,Wen Yin · Jian Lou · Pan Zhou · Yulai Xie · Dan Feng · Yuhua Sun · Tailai Zhang · Lichao Sun, ,http://export.arxiv.org/abs/2404.19417,,2404.19417.pdf,Physical Backdoor: Towards Temperature-based Backdoor Attacks in the Physical World,"Backdoor attacks have been well-studied in visible light object detection +(VLOD) in recent years. However, VLOD can not effectively work in dark and +temperature-sensitive scenarios. Instead, thermal infrared object detection +(TIOD) is the most accessible and practical in such environments. In this +paper, our team is the first to investigate the security vulnerabilities +associated with TIOD in the context of backdoor attacks, spanning both the +digital and physical realms. We introduce two novel types of backdoor attacks +on TIOD, each offering unique capabilities: Object-affecting Attack and +Range-affecting Attack. We conduct a comprehensive analysis of key factors +influencing trigger design, which include temperature, size, material, and +concealment. These factors, especially temperature, significantly impact the +efficacy of backdoor attacks on TIOD. A thorough understanding of these factors +will serve as a foundation for designing physical triggers and temperature +controlling experiments. Our study includes extensive experiments conducted in +both digital and physical environments. In the digital realm, we evaluate our +approach using benchmark datasets for TIOD, achieving an Attack Success Rate +(ASR) of up to 98.21%. In the physical realm, we test our approach in two +real-world settings: a traffic intersection and a parking lot, using a thermal +infrared camera. Here, we attain an ASR of up to 98.38%.",cs.CV,['cs.CV'] +Diffusion Model Alignment Using Direct Preference Optimization,Bram Wallace · Meihua Dang · Rafael Rafailov · Linqi Zhou · Aaron Lou · Senthil Purushwalkam · Stefano Ermon · Caiming Xiong · Shafiq Joty · Nikhil Naik, ,https://arxiv.org/abs/2311.12908,,2311.12908.pdf,Diffusion Model Alignment Using Direct Preference Optimization,"Large language models (LLMs) are fine-tuned using human comparison data with +Reinforcement Learning from Human Feedback (RLHF) methods to make them better +aligned with users' preferences. In contrast to LLMs, human preference learning +has not been widely explored in text-to-image diffusion models; the best +existing approach is to fine-tune a pretrained model using carefully curated +high quality images and captions to improve visual appeal and text alignment. +We propose Diffusion-DPO, a method to align diffusion models to human +preferences by directly optimizing on human comparison data. Diffusion-DPO is +adapted from the recently developed Direct Preference Optimization (DPO), a +simpler alternative to RLHF which directly optimizes a policy that best +satisfies human preferences under a classification objective. 
We re-formulate +DPO to account for a diffusion model notion of likelihood, utilizing the +evidence lower bound to derive a differentiable objective. Using the Pick-a-Pic +dataset of 851K crowdsourced pairwise preferences, we fine-tune the base model +of the state-of-the-art Stable Diffusion XL (SDXL)-1.0 model with +Diffusion-DPO. Our fine-tuned base model significantly outperforms both base +SDXL-1.0 and the larger SDXL-1.0 model consisting of an additional refinement +model in human evaluation, improving visual appeal and prompt alignment. We +also develop a variant that uses AI feedback and has comparable performance to +training on human preferences, opening the door for scaling of diffusion model +alignment methods.",cs.CV,"['cs.CV', 'cs.AI', 'cs.GR', 'cs.LG']" +LAMP: Learn A Motion Pattern for Few-Shot Video Generation,Rui-Qi Wu · Liangyu Chen · Tong Yang · Chun-Le Guo · Chongyi Li · Xiangyu Zhang,https://rq-wu.github.io/projects/LAMP/index.html,https://arxiv.org/abs/2310.10769,,2310.10769.pdf,LAMP: Learn A Motion Pattern for Few-Shot-Based Video Generation,"With the impressive progress in diffusion-based text-to-image generation, +extending such powerful generative ability to text-to-video raises enormous +attention. Existing methods either require large-scale text-video pairs and a +large number of training resources or learn motions that are precisely aligned +with template videos. It is non-trivial to balance a trade-off between the +degree of generation freedom and the resource costs for video generation. In +our study, we present a few-shot-based tuning framework, LAMP, which enables +text-to-image diffusion model Learn A specific Motion Pattern with 8~16 videos +on a single GPU. Specifically, we design a first-frame-conditioned pipeline +that uses an off-the-shelf text-to-image model for content generation so that +our tuned video diffusion model mainly focuses on motion learning. The +well-developed text-to-image techniques can provide visually pleasing and +diverse content as generation conditions, which highly improves video quality +and generation freedom. To capture the features of temporal dimension, we +expand the pretrained 2D convolution layers of the T2I model to our novel +temporal-spatial motion learning layers and modify the attention blocks to the +temporal level. Additionally, we develop an effective inference trick, +shared-noise sampling, which can improve the stability of videos with +computational costs. Our method can also be flexibly applied to other tasks, +e.g. real-world image animation and video editing. Extensive experiments +demonstrate that LAMP can effectively learn the motion pattern on limited data +and generate high-quality videos. The code and models are available at +https://rq-wu.github.io/projects/LAMP.",cs.CV,['cs.CV'] +Instance-Adaptive and Geometric-Aware Keypoint Learning for Category-Level 6D Object Pose Estimation,Xiao Lin · Wenfei Yang · Yuan Gao · Tianzhu Zhang, ,https://arxiv.org/abs/2403.19527,,2403.19527.pdf,Instance-Adaptive and Geometric-Aware Keypoint Learning for Category-Level 6D Object Pose Estimation,"Category-level 6D object pose estimation aims to estimate the rotation, +translation and size of unseen instances within specific categories. In this +area, dense correspondence-based methods have achieved leading performance. 
+However, they do not explicitly consider the local and global geometric information of +different instances, resulting in poor generalization ability to +unseen instances with significant shape variations. To deal with this problem, +we propose a novel Instance-Adaptive and Geometric-Aware Keypoint Learning +method for category-level 6D object pose estimation (AG-Pose), which includes +two key designs: (1) The first design is an Instance-Adaptive Keypoint +Detection module, which can adaptively detect a set of sparse keypoints for +various instances to represent their geometric structures. (2) The second +design is a Geometric-Aware Feature Aggregation module, which can efficiently +integrate the local and global geometric information into keypoint features. +These two modules can work together to establish robust keypoint-level +correspondences for unseen instances, thus enhancing the generalization ability +of the model. Experimental results on CAMERA25 and REAL275 datasets show that +the proposed AG-Pose outperforms state-of-the-art methods by a large margin +without category-specific shape priors.",cs.CV,['cs.CV'] +RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback,Tianyu Yu · Yuan Yao · Haoye Zhang · Taiwen He · Yifeng Han · Ganqu Cui · Jinyi Hu · Zhiyuan Liu · Hai-Tao Zheng · Maosong Sun,https://github.com/RLHF-V/RLHF-V,https://arxiv.org/abs/2312.00849,,2312.00849.pdf,RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback,"Multimodal Large Language Models (MLLMs) have recently demonstrated +impressive capabilities in multimodal understanding, reasoning, and +interaction. However, existing MLLMs prevalently suffer from serious +hallucination problems, generating text that is not factually grounded in +associated images. The problem makes existing MLLMs untrustworthy and thus +impractical in real-world (especially high-stakes) applications. To address the +challenge, we present RLHF-V, which enhances MLLM trustworthiness via behavior +alignment from fine-grained correctional human feedback. Specifically, RLHF-V +collects human preference in the form of segment-level corrections on +hallucinations, and performs dense direct preference optimization over the +human feedback. Comprehensive experiments on five benchmarks in both automatic +and human evaluation show that, RLHF-V can enable substantially more +trustworthy MLLM behaviors with promising data and computation efficiency. +Remarkably, using 1.4k annotated data samples, RLHF-V significantly reduces the +hallucination rate of the base MLLM by 34.8%, outperforming the concurrent +LLaVA-RLHF trained on 10k annotated data. The final model achieves +state-of-the-art performance in trustworthiness among open-source MLLMs, and +shows better robustness than GPT-4V in preventing hallucinations aroused from +over-generalization. We open-source our code, model, and data at +https://github.com/RLHF-V/RLHF-V.",cs.CL,"['cs.CL', 'cs.CV']" +"WWW: A Unified Framework for Explaining What, Where and Why of Neural Networks by Interpretation of Neuron Concept",Yong Hyun Ahn · Hyeon Bae Kim · Seong Tae Kim, ,https://arxiv.org/abs/2402.18956,,2402.18956.pdf,"WWW: A Unified Framework for Explaining What, Where and Why of Neural Networks by Interpretation of Neuron Concepts","Recent advancements in neural networks have showcased their remarkable +capabilities across various domains. Despite these successes, the ""black box"" +problem still remains.
Addressing this, we propose a novel framework, WWW, that +offers the 'what', 'where', and 'why' of the neural network decisions in +human-understandable terms. Specifically, WWW utilizes adaptive selection for +concept discovery, employing adaptive cosine similarity and thresholding +techniques to effectively explain 'what'. To address the 'where' and 'why', we +proposed a novel combination of neuron activation maps (NAMs) with Shapley +values, generating localized concept maps and heatmaps for individual inputs. +Furthermore, WWW introduces a method for predicting uncertainty, leveraging +heatmap similarities to estimate 'how' reliable the prediction is. Experimental +evaluations of WWW demonstrate superior performance in both quantitative and +qualitative metrics, outperforming existing methods in interpretability. WWW +provides a unified solution for explaining 'what', 'where', and 'why', +introducing a method for localized explanations from global interpretations and +offering a plug-and-play solution adaptable to various architectures.",cs.CV,['cs.CV'] +Towards Variable and Coordinated Holistic Co-Speech Motion Generation,Yifei Liu · Qiong Cao · Yandong Wen · Huaiguang Jiang · Changxing Ding, ,https://arxiv.org/abs/2404.00368,,2404.00368.pdf,Towards Variable and Coordinated Holistic Co-Speech Motion Generation,"This paper addresses the problem of generating lifelike holistic co-speech +motions for 3D avatars, focusing on two key aspects: variability and +coordination. Variability allows the avatar to exhibit a wide range of motions +even with similar speech content, while coordination ensures a harmonious +alignment among facial expressions, hand gestures, and body poses. We aim to +achieve both with ProbTalk, a unified probabilistic framework designed to +jointly model facial, hand, and body movements in speech. ProbTalk builds on +the variational autoencoder (VAE) architecture and incorporates three core +designs. First, we introduce product quantization (PQ) to the VAE, which +enriches the representation of complex holistic motion. Second, we devise a +novel non-autoregressive model that embeds 2D positional encoding into the +product-quantized representation, thereby preserving essential structure +information of the PQ codes. Last, we employ a secondary stage to refine the +preliminary prediction, further sharpening the high-frequency details. Coupling +these three designs enables ProbTalk to generate natural and diverse holistic +co-speech motions, outperforming several state-of-the-art methods in +qualitative and quantitative evaluations, particularly in terms of realism. Our +code and model will be released for research purposes at +https://feifeifeiliu.github.io/probtalk/.",cs.CV,['cs.CV'] +MatchU: Matching Unseen Objects for 6D Pose Estimation from RGB-D Images,Junwen Huang · Hao Yu · Kuan-Ting Yu · Nassir Navab · Slobodan Ilic · Benjamin Busam, ,https://arxiv.org/abs/2403.01517,,2403.01517.pdf,MatchU: Matching Unseen Objects for 6D Pose Estimation from RGB-D Images,"Recent learning methods for object pose estimation require resource-intensive +training for each individual object instance or category, hampering their +scalability in real applications when confronted with previously unseen +objects. In this paper, we propose MatchU, a Fuse-Describe-Match strategy for +6D pose estimation from RGB-D images. MatchU is a generic approach that fuses +2D texture and 3D geometric cues for 6D pose prediction of unseen objects. 
We +rely on learning geometric 3D descriptors that are rotation-invariant by +design. By encoding pose-agnostic geometry, the learned descriptors naturally +generalize to unseen objects and capture symmetries. To tackle ambiguous +associations using 3D geometry only, we fuse additional RGB information into +our descriptor. This is achieved through a novel attention-based mechanism that +fuses cross-modal information, together with a matching loss that leverages the +latent space learned from RGB data to guide the descriptor learning process. +Extensive experiments reveal the generalizability of both the RGB-D fusion +strategy as well as the descriptor efficacy. Benefiting from the novel designs, +MatchU surpasses all existing methods by a significant margin in terms of both +accuracy and speed, even without the requirement of expensive re-training or +rendering.",cs.CV,['cs.CV'] +SyncMask: Synchronized Attentional Masking for Fashion-centric Vision-Language Pretraining,Chull Hwan Song · Taebaek Hwang · Jooyoung Yoon · Shunghyun Choi · Yeong Hyeon Gu, ,https://arxiv.org/abs/2404.01156,,2404.01156.pdf,SyncMask: Synchronized Attentional Masking for Fashion-centric Vision-Language Pretraining,"Vision-language models (VLMs) have made significant strides in cross-modal +understanding through large-scale paired datasets. However, in fashion domain, +datasets often exhibit a disparity between the information conveyed in image +and text. This issue stems from datasets containing multiple images of a single +fashion item all paired with one text, leading to cases where some textual +details are not visible in individual images. This mismatch, particularly when +non-co-occurring elements are masked, undermines the training of conventional +VLM objectives like Masked Language Modeling and Masked Image Modeling, thereby +hindering the model's ability to accurately align fine-grained visual and +textual features. Addressing this problem, we propose Synchronized attentional +Masking (SyncMask), which generate masks that pinpoint the image patches and +word tokens where the information co-occur in both image and text. This +synchronization is accomplished by harnessing cross-attentional features +obtained from a momentum model, ensuring a precise alignment between the two +modalities. Additionally, we enhance grouped batch sampling with semi-hard +negatives, effectively mitigating false negative issues in Image-Text Matching +and Image-Text Contrastive learning objectives within fashion datasets. Our +experiments demonstrate the effectiveness of the proposed approach, +outperforming existing methods in three downstream tasks.",cs.CV,"['cs.CV', 'cs.AI']" +Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception,Junwen He · Yifan Wang · Lijun Wang · Huchuan Lu · Bin Luo · Jun-Yan He · Jin-Peng Lan · Xuansong Xie, ,https://arxiv.org/abs/2403.02969,,2403.02969.pdf,Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception,"Multimodal Large Language Model (MLLMs) leverages Large Language Models as a +cognitive framework for diverse visual-language tasks. Recent efforts have been +made to equip MLLMs with visual perceiving and grounding capabilities. However, +there still remains a gap in providing fine-grained pixel-level perceptions and +extending interactions beyond text-specific inputs. 
In this work, we propose +{\bf{AnyRef}}, a general MLLM model that can generate pixel-wise object +perceptions and natural language descriptions from multi-modality references, +such as texts, boxes, images, or audio. This innovation empowers users with +greater flexibility to engage with the model beyond textual and regional +prompts, without modality-specific designs. Through our proposed refocusing +mechanism, the generated grounding output is guided to better focus on the +referenced object, implicitly incorporating additional pixel-level supervision. +This simple modification utilizes attention scores generated during the +inference of LLM, eliminating the need for extra computations while exhibiting +performance enhancements in both grounding masks and referring expressions. +With only publicly available training data, our model achieves state-of-the-art +results across multiple benchmarks, including diverse modality referring +segmentation and region-level referring expression generation.",cs.CV,['cs.CV'] +Neural 3D Strokes: Creating Stylized 3D Scenes with Vectorized 3D Strokes,Haobin Duan · Miao Wang · Yanxun Li · Yong-Liang Yang, ,https://arxiv.org/abs/2311.15637,,2311.15637.pdf,Neural 3D Strokes: Creating Stylized 3D Scenes with Vectorized 3D Strokes,"We present Neural 3D Strokes, a novel technique to generate stylized images +of a 3D scene at arbitrary novel views from multi-view 2D images. Different +from existing methods which apply stylization to trained neural radiance fields +at the voxel level, our approach draws inspiration from image-to-painting +methods, simulating the progressive painting process of human artwork with +vector strokes. We develop a palette of stylized 3D strokes from basic +primitives and splines, and consider the 3D scene stylization task as a +multi-view reconstruction process based on these 3D stroke primitives. Instead +of directly searching for the parameters of these 3D strokes, which would be +too costly, we introduce a differentiable renderer that allows optimizing +stroke parameters using gradient descent, and propose a training scheme to +alleviate the vanishing gradient issue. The extensive evaluation demonstrates +that our approach effectively synthesizes 3D scenes with significant geometric +and aesthetic stylization while maintaining a consistent appearance across +different views. Our method can be further integrated with style loss and +image-text contrastive models to extend its applications, including color +transfer and text-driven 3D scene drawing. Results and code are available at +http://buaavrcg.github.io/Neural3DStrokes.",cs.CV,"['cs.CV', 'cs.GR']" +A Theory of Joint Light and Heat Transport for Lambertian Scenes,Mani Ramanagopal · Sriram Narayanan · Aswin C. Sankaranarayanan · Srinivasa G. Narasimhan, ,,https://dl.acm.org/doi/10.1145/3596711.3596745,,,,,nan +Enhancing 3D Fidelity of Text-to-3D using Cross-View Correspondences,Seungwook Kim · Kejie Li · Xueqing Deng · Yichun Shi · Minsu Cho · Peng Wang, ,https://arxiv.org/abs/2404.10603,,2404.10603.pdf,Enhancing 3D Fidelity of Text-to-3D using Cross-View Correspondences,"Leveraging multi-view diffusion models as priors for 3D optimization have +alleviated the problem of 3D consistency, e.g., the Janus face problem or the +content drift problem, in zero-shot text-to-3D models. However, the 3D +geometric fidelity of the output remains an unresolved issue; albeit the +rendered 2D views are realistic, the underlying geometry may contain errors +such as unreasonable concavities. 
In this work, we propose CorrespondentDream, +an effective method to leverage annotation-free, cross-view correspondences +yielded from the diffusion U-Net to provide additional 3D prior to the NeRF +optimization process. We find that these correspondences are strongly +consistent with human perception, and by adopting it in our loss design, we are +able to produce NeRF models with geometries that are more coherent with common +sense, e.g., more smoothed object surface, yielding higher 3D fidelity. We +demonstrate the efficacy of our approach through various comparative +qualitative results and a solid user study.",cs.CV,['cs.CV'] +Generative Region-Language Pretraining for Open-Ended Object Detection,Chuang Lin · Yi Jiang · Lizhen Qu · Zehuan Yuan · Jianfei Cai, ,https://arxiv.org/abs/2403.10191,,2403.10191.pdf,Generative Region-Language Pretraining for Open-Ended Object Detection,"In recent research, significant attention has been devoted to the +open-vocabulary object detection task, aiming to generalize beyond the limited +number of classes labeled during training and detect objects described by +arbitrary category names at inference. Compared with conventional object +detection, open vocabulary object detection largely extends the object +detection categories. However, it relies on calculating the similarity between +image regions and a set of arbitrary category names with a pretrained +vision-and-language model. This implies that, despite its open-set nature, the +task still needs the predefined object categories during the inference stage. +This raises the question: What if we do not have exact knowledge of object +categories during inference? In this paper, we call such a new setting as +generative open-ended object detection, which is a more general and practical +problem. To address it, we formulate object detection as a generative problem +and propose a simple framework named GenerateU, which can detect dense objects +and generate their names in a free-form way. Particularly, we employ Deformable +DETR as a region proposal generator with a language model translating visual +regions to object names. To assess the free-form object detection task, we +introduce an evaluation method designed to quantitatively measure the +performance of generative outcomes. Extensive experiments demonstrate strong +zero-shot detection performance of our GenerateU. For example, on the LVIS +dataset, our GenerateU achieves comparable results to the open-vocabulary +object detection method GLIP, even though the category names are not seen by +GenerateU during inference. Code is available at: https:// +github.com/FoundationVision/GenerateU .",cs.CV,['cs.CV'] +GroupContrast: Semantic-aware Self-supervised Representation Learning for 3D Understanding,Chengyao Wang · Li Jiang · Xiaoyang Wu · Zhuotao Tian · Bohao Peng · Hengshuang Zhao · Jiaya Jia,https://github.com/dvlab-research/GroupContrast,https://arxiv.org/abs/2403.09639,,2403.09639.pdf,GroupContrast: Semantic-aware Self-supervised Representation Learning for 3D Understanding,"Self-supervised 3D representation learning aims to learn effective +representations from large-scale unlabeled point clouds. Most existing +approaches adopt point discrimination as the pretext task, which assigns +matched points in two distinct views as positive pairs and unmatched points as +negative pairs. 
However, this approach often results in semantically identical +points having dissimilar representations, leading to a high number of false +negatives and introducing a ""semantic conflict"" problem. To address this issue, +we propose GroupContrast, a novel approach that combines segment grouping and +semantic-aware contrastive learning. Segment grouping partitions points into +semantically meaningful regions, which enhances semantic coherence and provides +semantic guidance for the subsequent contrastive representation learning. +Semantic-aware contrastive learning augments the semantic information extracted +from segment grouping and helps to alleviate the issue of ""semantic conflict"". +We conducted extensive experiments on multiple 3D scene understanding tasks. +The results demonstrate that GroupContrast learns semantically meaningful +representations and achieves promising transfer learning performance.",cs.CV,['cs.CV'] +Improved Visual Grounding through Self-Consistent Explanations,Ruozhen He · Paola Cascante-Bonilla · Ziyan Yang · Alex Berg · Vicente Ordonez,https://catherine-r-he.github.io/SelfEQ/,https://arxiv.org/abs/2312.04554v1,,2312.04554v1.pdf,Improved Visual Grounding through Self-Consistent Explanations,"Vision-and-language models trained to match images with text can be combined +with visual explanation methods to point to the locations of specific objects +in an image. Our work shows that the localization --""grounding""-- abilities of +these models can be further improved by finetuning for self-consistent visual +explanations. We propose a strategy for augmenting existing text-image datasets +with paraphrases using a large language model, and SelfEQ, a weakly-supervised +strategy on visual explanation maps for paraphrases that encourages +self-consistency. Specifically, for an input textual phrase, we attempt to +generate a paraphrase and finetune the model so that the phrase and paraphrase +map to the same region in the image. We posit that this both expands the +vocabulary that the model is able to handle, and improves the quality of the +object locations highlighted by gradient-based visual explanation methods (e.g. +GradCAM). We demonstrate that SelfEQ improves performance on Flickr30k, +ReferIt, and RefCOCO+ over a strong baseline method and several prior works. +Particularly, comparing to other methods that do not use any type of box +annotations, we obtain 84.07% on Flickr30k (an absolute improvement of 4.69%), +67.40% on ReferIt (an absolute improvement of 7.68%), and 75.10%, 55.49% on +RefCOCO+ test sets A and B respectively (an absolute improvement of 3.74% on +average).",cs.CV,"['cs.CV', 'cs.CL', 'cs.LG']" +Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution,Shangchen Zhou · Peiqing Yang · Jianyi Wang · Yihang Luo · Chen Change Loy, ,https://arxiv.org/abs/2312.06640,,2312.06640.pdf,Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution,"Text-based diffusion models have exhibited remarkable success in generation +and editing, showing great promise for enhancing visual content with their +generative prior. However, applying these models to video super-resolution +remains challenging due to the high demands for output fidelity and temporal +consistency, which is complicated by the inherent randomness in diffusion +models. Our study introduces Upscale-A-Video, a text-guided latent diffusion +framework for video upscaling. 
This framework ensures temporal coherence +through two key mechanisms: locally, it integrates temporal layers into U-Net +and VAE-Decoder, maintaining consistency within short sequences; globally, +without training, a flow-guided recurrent latent propagation module is +introduced to enhance overall video stability by propagating and fusing latent +across the entire sequences. Thanks to the diffusion paradigm, our model also +offers greater flexibility by allowing text prompts to guide texture creation +and adjustable noise levels to balance restoration and generation, enabling a +trade-off between fidelity and quality. Extensive experiments show that +Upscale-A-Video surpasses existing methods in both synthetic and real-world +benchmarks, as well as in AI-generated videos, showcasing impressive visual +realism and temporal consistency.",cs.CV,['cs.CV'] +Image Neural Field Diffusion Models,Yinbo Chen · Oliver Wang · Richard Zhang · Eli Shechtman · Xiaolong Wang · Michaël Gharbi, ,https://arxiv.org/abs/2310.08337,,2310.08337.pdf,Neural Diffusion Models,"Diffusion models have shown remarkable performance on many generative tasks. +Despite recent success, most diffusion models are restricted in that they only +allow linear transformation of the data distribution. In contrast, broader +family of transformations can potentially help train generative distributions +more efficiently, simplifying the reverse process and closing the gap between +the true negative log-likelihood and the variational approximation. In this +paper, we present Neural Diffusion Models (NDMs), a generalization of +conventional diffusion models that enables defining and learning time-dependent +non-linear transformations of data. We show how to optimise NDMs using a +variational bound in a simulation-free setting. Moreover, we derive a +time-continuous formulation of NDMs, which allows fast and reliable inference +using off-the-shelf numerical ODE and SDE solvers. Finally, we demonstrate the +utility of NDMs with learnable transformations through experiments on standard +image generation benchmarks, including CIFAR-10, downsampled versions of +ImageNet and CelebA-HQ. NDMs outperform conventional diffusion models in terms +of likelihood and produce high-quality samples.",cs.LG,"['cs.LG', 'stat.ML']" +ViTamin: Designing Scalable Vision Models in the Vision-Language Era,Jieneng Chen · Qihang Yu · Xiaohui Shen · Alan L. Yuille · Liang-Chieh Chen, ,https://arxiv.org/abs/2404.02132,,2404.02132.pdf,ViTamin: Designing Scalable Vision Models in the Vision-Language Era,"Recent breakthroughs in vision-language models (VLMs) start a new page in the +vision community. The VLMs provide stronger and more generalizable feature +embeddings compared to those from ImageNet-pretrained models, thanks to the +training on the large-scale Internet image-text pairs. However, despite the +amazing achievement from the VLMs, vanilla Vision Transformers (ViTs) remain +the default choice for the image encoder. Although pure transformer proves its +effectiveness in the text encoding area, it remains questionable whether it is +also the case for image encoding, especially considering that various types of +networks are proposed on the ImageNet benchmark, which, unfortunately, are +rarely studied in VLMs. Due to small data/model scale, the original conclusions +of model design on ImageNet can be limited and biased. 
In this paper, we aim at +building an evaluation protocol of vision models in the vision-language era +under the contrastive language-image pretraining (CLIP) framework. We provide a +comprehensive way to benchmark different vision models, covering their +zero-shot performance and scalability in both model and training data sizes. To +this end, we introduce ViTamin, a new vision models tailored for VLMs. +ViTamin-L significantly outperforms ViT-L by 2.0% ImageNet zero-shot accuracy, +when using the same publicly available DataComp-1B dataset and the same +OpenCLIP training scheme. ViTamin-L presents promising results on 60 diverse +benchmarks, including classification, retrieval, open-vocabulary detection and +segmentation, and large multi-modal models. When further scaling up the model +size, our ViTamin-XL with only 436M parameters attains 82.9% ImageNet zero-shot +accuracy, surpassing 82.0% achieved by EVA-E that has ten times more parameters +(4.4B).",cs.CV,['cs.CV'] +Rethinking Transformers Pre-training for Multi-Spectral Satellite Imagery,Mubashir Noman · Muzammal Naseer · Hisham Cholakkal · Rao Anwer · Salman Khan · Fahad Shahbaz Khan, ,https://web3.arxiv.org/abs/2403.05419,,2403.05419.pdf,Rethinking Transformers Pre-training for Multi-Spectral Satellite Imagery,"Recent advances in unsupervised learning have demonstrated the ability of +large vision models to achieve promising results on downstream tasks by +pre-training on large amount of unlabelled data. Such pre-training techniques +have also been explored recently in the remote sensing domain due to the +availability of large amount of unlabelled data. Different from standard +natural image datasets, remote sensing data is acquired from various sensor +technologies and exhibit diverse range of scale variations as well as +modalities. Existing satellite image pre-training methods either ignore the +scale information present in the remote sensing imagery or restrict themselves +to use only a single type of data modality. In this paper, we re-visit +transformers pre-training and leverage multi-scale information that is +effectively utilized with multiple modalities. Our proposed approach, named +SatMAE++, performs multi-scale pre-training and utilizes convolution based +upsampling blocks to reconstruct the image at higher scales making it +extensible to include more scales. Compared to existing works, the proposed +SatMAE++ with multi-scale pre-training is equally effective for both optical as +well as multi-spectral imagery. Extensive experiments on six datasets reveal +the merits of proposed contributions, leading to state-of-the-art performance +on all datasets. SatMAE++ achieves mean average precision (mAP) gain of 2.5\% +for multi-label classification task on BigEarthNet dataset. Our code and +pre-trained models are available at \url{https://github.com/techmn/satmae_pp}.",cs.CV,['cs.CV'] +RCL: Reliable Continual Learning for Unified Failure Detection,Fei Zhu · Zhen Cheng · Xu-Yao Zhang · Cheng-Lin Liu · Zhaoxiang Zhang, ,https://arxiv.org/abs/2403.02886,,2403.02886.pdf,Revisiting Confidence Estimation: Towards Reliable Failure Prediction,"Reliable confidence estimation is a challenging yet fundamental requirement +in many risk-sensitive applications. However, modern deep neural networks are +often overconfident for their incorrect predictions, i.e., misclassified +samples from known classes, and out-of-distribution (OOD) samples from unknown +classes. 
In recent years, many confidence calibration and OOD detection methods +have been developed. In this paper, we find a general, widely existing but +actually-neglected phenomenon that most confidence estimation methods are +harmful for detecting misclassification errors. We investigate this problem and +reveal that popular calibration and OOD detection methods often lead to worse +confidence separation between correctly classified and misclassified examples, +making it difficult to decide whether to trust a prediction or not. Finally, we +propose to enlarge the confidence gap by finding flat minima, which yields +state-of-the-art failure prediction performance under various settings +including balanced, long-tailed, and covariate-shift classification scenarios. +Our study not only provides a strong baseline for reliable confidence +estimation but also acts as a bridge between understanding calibration, OOD +detection, and failure prediction. The code is available at +\url{https://github.com/Impression2805/FMFP}.",cs.CV,"['cs.CV', 'cs.LG']" +Boosting Neural Representations for Videos with a Conditional Decoder,XINJIE ZHANG · Ren Yang · Dailan He · Xingtong Ge · Tongda Xu · Yan Wang · Hongwei Qin · Jun Zhang,https://github.com/Xinjie-Q/Boosting-NeRV,https://arxiv.org/abs/2402.18152,,2402.18152.pdf,Boosting Neural Representations for Videos with a Conditional Decoder,"Implicit neural representations (INRs) have emerged as a promising approach +for video storage and processing, showing remarkable versatility across various +video tasks. However, existing methods often fail to fully leverage their +representation capabilities, primarily due to inadequate alignment of +intermediate features during target frame decoding. This paper introduces a +universal boosting framework for current implicit video representation +approaches. Specifically, we utilize a conditional decoder with a +temporal-aware affine transform module, which uses the frame index as a prior +condition to effectively align intermediate features with target frames. +Besides, we introduce a sinusoidal NeRV-like block to generate diverse +intermediate features and achieve a more balanced parameter distribution, +thereby enhancing the model's capacity. With a high-frequency +information-preserving reconstruction loss, our approach successfully boosts +multiple baseline INRs in the reconstruction quality and convergence speed for +video regression, and exhibits superior inpainting and interpolation results. +Further, we integrate a consistent entropy minimization technique and develop +video codecs based on these boosted INRs. Experiments on the UVG dataset +confirm that our enhanced codecs significantly outperform baseline INRs and +offer competitive rate-distortion performance compared to traditional and +learning-based codecs. Code is available at +https://github.com/Xinjie-Q/Boosting-NeRV.",eess.IV,"['eess.IV', 'cs.AI', 'cs.CV']" +Uncertainty Visualization via Low-Dimensional Posterior Projections,Omer Yair · Tomer Michaeli · Elias Nehme, ,https://arxiv.org/abs/2312.07804,,2312.07804.pdf,Uncertainty Visualization via Low-Dimensional Posterior Projections,"In ill-posed inverse problems, it is commonly desirable to obtain insight +into the full spectrum of plausible solutions, rather than extracting only a +single reconstruction. Information about the plausible solutions and their +likelihoods is encoded in the posterior distribution. However, for +high-dimensional data, this distribution is challenging to visualize. 
In this +work, we introduce a new approach for estimating and visualizing posteriors by +employing energy-based models (EBMs) over low-dimensional subspaces. +Specifically, we train a conditional EBM that receives an input measurement and +a set of directions that span some low-dimensional subspace of solutions, and +outputs the probability density function of the posterior within that space. We +demonstrate the effectiveness of our method across a diverse range of datasets +and image restoration problems, showcasing its strength in uncertainty +quantification and visualization. As we show, our method outperforms a baseline +that projects samples from a diffusion-based posterior sampler, while being +orders of magnitude faster. Furthermore, it is more accurate than a baseline +that assumes a Gaussian posterior.",cs.CV,['cs.CV'] +ElasticDiffusion: Training-free Arbitrary Size Image Generation,Moayed Haji Ali · Guha Balakrishnan · Vicente Ordonez, ,https://arxiv.org/abs/2311.18822,,2311.18822.pdf,ElasticDiffusion: Training-free Arbitrary Size Image Generation through Global-Local Content Separation,"Diffusion models have revolutionized image generation in recent years, yet +they are still limited to a few sizes and aspect ratios. We propose +ElasticDiffusion, a novel training-free decoding method that enables pretrained +text-to-image diffusion models to generate images with various sizes. +ElasticDiffusion attempts to decouple the generation trajectory of a pretrained +model into local and global signals. The local signal controls low-level pixel +information and can be estimated on local patches, while the global signal is +used to maintain overall structural consistency and is estimated with a +reference image. We test our method on CelebA-HQ (faces) and LAION-COCO +(objects/indoor/outdoor scenes). Our experiments and qualitative results show +superior image coherence quality across aspect ratios compared to +MultiDiffusion and the standard decoding strategy of Stable Diffusion. Project +page: https://elasticdiffusion.github.io/",cs.CV,['cs.CV'] +Exploiting Diffusion Prior for Generalizable Dense Prediction,Hsin-Ying Lee · Hung-Yu Tseng · Hsin-Ying Lee · Ming-Hsuan Yang,https://shinying.github.io/dmp,https://arxiv.org/abs/2311.18832,,2311.18832.pdf,Exploiting Diffusion Prior for Generalizable Dense Prediction,"Contents generated by recent advanced Text-to-Image (T2I) diffusion models +are sometimes too imaginative for existing off-the-shelf dense predictors to +estimate due to the immitigable domain gap. We introduce DMP, a pipeline +utilizing pre-trained T2I models as a prior for dense prediction tasks. To +address the misalignment between deterministic prediction tasks and stochastic +T2I models, we reformulate the diffusion process through a sequence of +interpolations, establishing a deterministic mapping between input RGB images +and output prediction distributions. To preserve generalizability, we use +low-rank adaptation to fine-tune pre-trained models. Extensive experiments +across five tasks, including 3D property estimation, semantic segmentation, and +intrinsic image decomposition, showcase the efficacy of the proposed method. 
+Despite limited-domain training data, the approach yields faithful estimations +for arbitrary images, surpassing existing state-of-the-art algorithms.",cs.CV,['cs.CV'] +Make-It-Vivid: Dressing Your Animatable Biped Cartoon Characters from Text,Junshu Tang · Yanhong Zeng · Ke Fan · Xuheng Wang · Bo Dai · Kai Chen · Lizhuang Ma, ,https://arxiv.org/abs/2403.16897,,2403.16897.pdf,Make-It-Vivid: Dressing Your Animatable Biped Cartoon Characters from Text,"Creating and animating 3D biped cartoon characters is crucial and valuable in +various applications. Compared with geometry, the diverse texture design plays +an important role in making 3D biped cartoon characters vivid and charming. +Therefore, we focus on automatic texture design for cartoon characters based on +input instructions. This is challenging for domain-specific requirements and a +lack of high-quality data. To address this challenge, we propose Make-It-Vivid, +the first attempt to enable high-quality texture generation from text in UV +space. We prepare a detailed text-texture paired data for 3D characters by +using vision-question-answering agents. Then we customize a pretrained +text-to-image model to generate texture map with template structure while +preserving the natural 2D image knowledge. Furthermore, to enhance fine-grained +details, we propose a novel adversarial learning scheme to shorten the domain +gap between original dataset and realistic texture domain. Extensive +experiments show that our approach outperforms current texture generation +methods, resulting in efficient character texturing and faithful generation +with prompts. Besides, we showcase various applications such as out of domain +generation and texture stylization. We also provide an efficient generation +system for automatic text-guided textured character generation and animation.",cs.CV,['cs.CV'] +Eclipse: Disambiguating Illumination and Materials using Unintended Shadows,Dor Verbin · Ben Mildenhall · Peter Hedman · Jonathan T. Barron · Todd Zickler · Pratul P. Srinivasan, ,,https://www.youtube.com/watch?v=amQLGyza3EU,,,,,nan +Transcending Forgery Specificity with Latent Space Augmentation for Generalizable Deepfake Detection,Zhiyuan Yan · Yuhao Luo · Siwei Lyu · Qingshan Liu · Baoyuan Wu, ,https://arxiv.org/abs/2311.11278v1,,2311.11278v1.pdf,Transcending Forgery Specificity with Latent Space Augmentation for Generalizable Deepfake Detection,"Deepfake detection faces a critical generalization hurdle, with performance +deteriorating when there is a mismatch between the distributions of training +and testing data. A broadly received explanation is the tendency of these +detectors to be overfitted to forgery-specific artifacts, rather than learning +features that are widely applicable across various forgeries. To address this +issue, we propose a simple yet effective detector called LSDA +(\underline{L}atent \underline{S}pace \underline{D}ata +\underline{A}ugmentation), which is based on a heuristic idea: representations +with a wider variety of forgeries should be able to learn a more generalizable +decision boundary, thereby mitigating the overfitting of method-specific +features (see Figure. 1). Following this idea, we propose to enlarge the +forgery space by constructing and simulating variations within and across +forgery features in the latent space. This approach encompasses the acquisition +of enriched, domain-specific features and the facilitation of smoother +transitions between different forgery types, effectively bridging domain gaps. 
+Our approach culminates in refining a binary classifier that leverages the +distilled knowledge from the enhanced features, striving for a generalizable +deepfake detector. Comprehensive experiments show that our proposed method is +surprisingly effective and transcends state-of-the-art detectors across several +widely used benchmarks.",cs.CV,['cs.CV'] +Revisiting Sampson Approximations for Geometric Estimation Problems,Felix Rydell · Angelica Torres · Viktor Larsson, ,https://arxiv.org/abs/2401.07114,,2401.07114.pdf,Revisiting Sampson Approximations for Geometric Estimation Problems,"Many problems in computer vision can be formulated as geometric estimation +problems, i.e. given a collection of measurements (e.g. point correspondences) +we wish to fit a model (e.g. an essential matrix) that agrees with our +observations. This necessitates some measure of how much an observation +``agrees"" with a given model. A natural choice is to consider the smallest +perturbation that makes the observation exactly satisfy the constraints. +However, for many problems, this metric is expensive or otherwise intractable +to compute. The so-called Sampson error approximates this geometric error +through a linearization scheme. For epipolar geometry, the Sampson error is a +popular choice and in practice known to yield very tight approximations of the +corresponding geometric residual (the reprojection error). + In this paper we revisit the Sampson approximation and provide new +theoretical insights as to why and when this approximation works, as well as +provide explicit bounds on the tightness under some mild assumptions. Our +theoretical results are validated in several experiments on real data and in +the context of different geometric estimation tasks.",cs.CV,"['cs.CV', 'math.AG', '68T45 (Primary), 14Q99 (Secondary), 68W30']" +Pick-or-Mix: Dynamic Channel Sampling for ConvNets,Ashish Kumar · Daneul Kim · Jaesik Park · Laxmidhar Behera, ,,https://openreview.net/forum?id=Howb7fXB4V,,,,,nan +FreePoint: Unsupervised Point Cloud Instance Segmentation,Zhikai Zhang · Jian Ding · Li Jiang · Dengxin Dai · Gui-Song Xia, ,,https://medium.com/forestree/reviewing-unsupervised-semantic-segmentation-methods-for-point-cloud-a50a508f7f88,,,,,nan +Mind marginal non-crack regions: Clustering-inspired representation learning for crack segmentation,zhuangzhuang chen · Zhuonan Lai · Jie Chen · Jianqiang Li, ,https://arxiv.org/html/2403.03063v1,,2403.03063v1.pdf,CrackNex: a Few-shot Low-light Crack Segmentation Model Based on Retinex Theory for UAV Inspections,"Routine visual inspections of concrete structures are imperative for +upholding the safety and integrity of critical infrastructure. Such visual +inspections sometimes happen under low-light conditions, e.g., checking for +bridge health. Crack segmentation under such conditions is challenging due to +the poor contrast between cracks and their surroundings. However, most deep +learning methods are designed for well-illuminated crack images and hence their +performance drops dramatically in low-light scenes. In addition, conventional +approaches require many annotated low-light crack images which is +time-consuming. In this paper, we address these challenges by proposing +CrackNex, a framework that utilizes reflectance information based on Retinex +Theory to help the model learn a unified illumination-invariant representation. +Furthermore, we utilize few-shot segmentation to solve the inefficient training +data problem. 
In CrackNex, both a support prototype and a reflectance prototype +are extracted from the support set. Then, a prototype fusion module is designed +to integrate the features from both prototypes. CrackNex outperforms the SOTA +methods on multiple datasets. Additionally, we present the first benchmark +dataset, LCSD, for low-light crack segmentation. LCSD consists of 102 +well-illuminated crack images and 41 low-light crack images. The dataset and +code are available at https://github.com/zy1296/CrackNex.",cs.CV,['cs.CV'] +MV-Adapter: Exploring Parameter Efficient Learning for Video Text Retrieval,bowen zhang · Xiaojie Jin · Weibo Gong · Kai Xu · Xueqing Deng · Peng Wang · Zhao Zhang · Xiaohui Shen · Jiashi Feng, ,https://arxiv.org/abs/2405.19465,,2405.19465.pdf,RAP: Efficient Text-Video Retrieval with Sparse-and-Correlated Adapter,"Text-Video Retrieval (TVR) aims to align relevant video content with natural +language queries. To date, most state-of-the-art TVR methods learn +image-to-video transfer learning based on large-scale pre-trained +visionlanguage models (e.g., CLIP). However, fully fine-tuning these +pre-trained models for TVR incurs prohibitively expensive computation costs. To +this end, we propose to conduct efficient text-video Retrieval with a +sparse-andcorrelated AdaPter (RAP), i.e., fine-tuning the pre-trained model +with a few parameterized layers. To accommodate the text-video scenario, we +equip our RAP with two indispensable characteristics: temporal sparsity and +correlation. Specifically, we propose a low-rank modulation module to refine +the per-image features from the frozen CLIP backbone, which accentuates salient +frames within the video features while alleviating temporal redundancy. +Besides, we introduce an asynchronous self-attention mechanism that first +selects the top responsive visual patches and augments the correlation modeling +between them with learnable temporal and patch offsets. Extensive experiments +on four TVR datasets demonstrate that RAP achieves superior or comparable +performance compared to the fully fine-tuned counterpart and other +parameter-efficient fine-tuning methods.",cs.CV,['cs.CV'] +Few-shot Learner Parameterization by Diffusion Time-steps,Zhongqi Yue · Pan Zhou · Richang Hong · Hanwang Zhang · Qianru Sun, ,https://arxiv.org/abs/2403.02649,,2403.02649.pdf,Few-shot Learner Parameterization by Diffusion Time-steps,"Even when using large multi-modal foundation models, few-shot learning is +still challenging -- if there is no proper inductive bias, it is nearly +impossible to keep the nuanced class attributes while removing the visually +prominent attributes that spuriously correlate with class labels. To this end, +we find an inductive bias that the time-steps of a Diffusion Model (DM) can +isolate the nuanced class attributes, i.e., as the forward diffusion adds noise +to an image at each time-step, nuanced attributes are usually lost at an +earlier time-step than the spurious attributes that are visually prominent. +Building on this, we propose Time-step Few-shot (TiF) learner. We train +class-specific low-rank adapters for a text-conditioned DM to make up for the +lost attributes, such that images can be accurately reconstructed from their +noisy ones given a prompt. Hence, at a small time-step, the adapter and prompt +are essentially a parameterization of only the nuanced class attributes. For a +test image, we can use the parameterization to only extract the nuanced class +attributes for classification. 
TiF learner significantly outperforms OpenCLIP +and its adapters on a variety of fine-grained and customized few-shot learning +tasks. Codes are in https://github.com/yue-zhongqi/tif.",cs.CV,['cs.CV'] +"SPIN: Simultaneous Perception, Interaction and Navigation",Shagun Uppal · Ananye Agarwal · Haoyu Xiong · Kenneth Shaw · Deepak Pathak, ,https://arxiv.org/abs/2405.07991,,2405.07991.pdf,"SPIN: Simultaneous Perception, Interaction and Navigation","While there has been remarkable progress recently in the fields of +manipulation and locomotion, mobile manipulation remains a long-standing +challenge. Compared to locomotion or static manipulation, a mobile system must +make a diverse range of long-horizon tasks feasible in unstructured and dynamic +environments. While the applications are broad and interesting, there are a +plethora of challenges in developing these systems such as coordination between +the base and arm, reliance on onboard perception for perceiving and interacting +with the environment, and most importantly, simultaneously integrating all +these parts together. Prior works approach the problem using disentangled +modular skills for mobility and manipulation that are trivially tied together. +This causes several limitations such as compounding errors, delays in +decision-making, and no whole-body coordination. In this work, we present a +reactive mobile manipulation framework that uses an active visual system to +consciously perceive and react to its environment. Similar to how humans +leverage whole-body and hand-eye coordination, we develop a mobile manipulator +that exploits its ability to move and see, more specifically -- to move in +order to see and to see in order to move. This allows it to not only move +around and interact with its environment but also, choose ""when"" to perceive +""what"" using an active visual system. We observe that such an agent learns to +navigate around complex cluttered scenarios while displaying agile whole-body +coordination using only ego-vision without needing to create environment maps. +Results visualizations and videos at https://spin-robot.github.io/",cs.RO,"['cs.RO', 'cs.AI', 'cs.CV', 'cs.LG', 'cs.SY', 'eess.SY']" +Pose-Guided Self-Training with Two-Stage Clustering for Unsupervised Landmark Discovery,Siddharth Tourani · Ahmed Alwheibi · Arif Mahmood · Muhammad Haris Khan, ,https://arxiv.org/abs/2403.16194,,2403.16194.pdf,Pose-Guided Self-Training with Two-Stage Clustering for Unsupervised Landmark Discovery,"Unsupervised landmarks discovery (ULD) for an object category is a +challenging computer vision problem. In pursuit of developing a robust ULD +framework, we explore the potential of a recent paradigm of self-supervised +learning algorithms, known as diffusion models. Some recent works have shown +that these models implicitly contain important correspondence cues. Towards +harnessing the potential of diffusion models for the ULD task, we make the +following core contributions. First, we propose a ZeroShot ULD baseline based +on simple clustering of random pixel locations with nearest neighbour matching. +It delivers better results than existing ULD methods. Second, motivated by the +ZeroShot performance, we develop a ULD algorithm based on diffusion features +using self-training and clustering which also outperforms prior methods by +notable margins. 
Third, we introduce a new proxy task based on generating +latent pose codes and also propose a two-stage clustering mechanism to +facilitate effective pseudo-labeling, resulting in a significant performance +improvement. Overall, our approach consistently outperforms state-of-the-art +methods on four challenging benchmarks AFLW, MAFL, CatHeads and LS3D by +significant margins.",cs.CV,['cs.CV'] +Towards CLIP-driven Language-free 3D Visual Grounding via 2D-3D Relational Enhancement and Consistency,Yuqi Zhang · Han Luo · Yinjie Lei, ,https://arxiv.org/abs/2311.15383,,2311.15383.pdf,Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding,"3D Visual Grounding (3DVG) aims at localizing 3D object based on textual +descriptions. Conventional supervised methods for 3DVG often necessitate +extensive annotations and a predefined vocabulary, which can be restrictive. To +address this issue, we propose a novel visual programming approach for +zero-shot open-vocabulary 3DVG, leveraging the capabilities of large language +models (LLMs). Our approach begins with a unique dialog-based method, engaging +with LLMs to establish a foundational understanding of zero-shot 3DVG. Building +on this, we design a visual program that consists of three types of modules, +i.e., view-independent, view-dependent, and functional modules. These modules, +specifically tailored for 3D scenarios, work collaboratively to perform complex +reasoning and inference. Furthermore, we develop an innovative language-object +correlation module to extend the scope of existing 3D object detectors into +open-vocabulary scenarios. Extensive experiments demonstrate that our zero-shot +approach can outperform some supervised baselines, marking a significant stride +towards effective 3DVG.",cs.CV,['cs.CV'] +POPDG: Popular 3D Dance Generation with PopDanceSet,Zhenye Luo · Min Ren · Xuecai Hu · Yongzhen Huang · Li Yao, ,https://arxiv.org/abs/2405.03178,,2405.03178.pdf,POPDG: Popular 3D Dance Generation with PopDanceSet,"Generating dances that are both lifelike and well-aligned with music +continues to be a challenging task in the cross-modal domain. This paper +introduces PopDanceSet, the first dataset tailored to the preferences of young +audiences, enabling the generation of aesthetically oriented dances. And it +surpasses the AIST++ dataset in music genre diversity and the intricacy and +depth of dance movements. Moreover, the proposed POPDG model within the iDDPM +framework enhances dance diversity and, through the Space Augmentation +Algorithm, strengthens spatial physical connections between human body joints, +ensuring that increased diversity does not compromise generation quality. A +streamlined Alignment Module is also designed to improve the temporal alignment +between dance and music. Extensive experiments show that POPDG achieves SOTA +results on two datasets. Furthermore, the paper also expands on current +evaluation metrics. The dataset and code are available at +https://github.com/Luke-Luo1/POPDG.",cs.SD,"['cs.SD', 'eess.AS']" +CLiC: Concept Learning in Context,Mehdi Safaee · Aryan Mikaeili · Or Patashnik · Daniel Cohen-Or · Ali Mahdavi Amiri, ,https://arxiv.org/abs/2311.17083,,2311.17083.pdf,CLiC: Concept Learning in Context,"This paper addresses the challenge of learning a local visual pattern of an +object from one image, and generating images depicting objects with that +pattern. 
Learning a localized concept and placing it on an object in a target +image is a nontrivial task, as the objects may have different orientations and +shapes. Our approach builds upon recent advancements in visual concept +learning. It involves acquiring a visual concept (e.g., an ornament) from a +source image and subsequently applying it to an object (e.g., a chair) in a +target image. Our key idea is to perform in-context concept learning, acquiring +the local visual concept within the broader context of the objects they belong +to. To localize the concept learning, we employ soft masks that contain both +the concept within the mask and the surrounding image area. We demonstrate our +approach through object generation within an image, showcasing plausible +embedding of in-context learned concepts. We also introduce methods for +directing acquired concepts to specific locations within target images, +employing cross-attention mechanisms, and establishing correspondences between +source and target objects. The effectiveness of our method is demonstrated +through quantitative and qualitative experiments, along with comparisons +against baseline techniques.",cs.CV,['cs.CV'] +Suppress and Rebalance: Towards Generalized Multi-Modal Face Anti-Spoofing,Xun Lin · Shuai Wang · RIZHAO CAI · Yizhong Liu · Ying Fu · Wenzhong Tang · Zitong YU · Alex C. Kot, ,https://arxiv.org/abs/2402.19298,,2402.19298.pdf,Suppress and Rebalance: Towards Generalized Multi-Modal Face Anti-Spoofing,"Face Anti-Spoofing (FAS) is crucial for securing face recognition systems +against presentation attacks. With advancements in sensor manufacture and +multi-modal learning techniques, many multi-modal FAS approaches have emerged. +However, they face challenges in generalizing to unseen attacks and deployment +conditions. These challenges arise from (1) modality unreliability, where some +modality sensors like depth and infrared undergo significant domain shifts in +varying environments, leading to the spread of unreliable information during +cross-modal feature fusion, and (2) modality imbalance, where training overly +relies on a dominant modality hinders the convergence of others, reducing +effectiveness against attack types that are indistinguishable sorely using the +dominant modality. To address modality unreliability, we propose the +Uncertainty-Guided Cross-Adapter (U-Adapter) to recognize unreliably detected +regions within each modality and suppress the impact of unreliable regions on +other modalities. For modality imbalance, we propose a Rebalanced Modality +Gradient Modulation (ReGrad) strategy to rebalance the convergence speed of all +modalities by adaptively adjusting their gradients. Besides, we provide the +first large-scale benchmark for evaluating multi-modal FAS performance under +domain generalization scenarios. Extensive experiments demonstrate that our +method outperforms state-of-the-art methods. Source code and protocols will be +released on https://github.com/OMGGGGG/mmdg.",cs.CV,['cs.CV'] +Alchemist: Parametric Control of Material Properties with Diffusion Models,Prafull Sharma · Varun Jampani · Yuanzhen Li · Xuhui Jia · Dmitry Lagun · Fredo Durand · William Freeman · Mark Matthews, ,https://arxiv.org/abs/2312.02970,,2312.02970.pdf,Alchemist: Parametric Control of Material Properties with Diffusion Models,"We propose a method to control material attributes of objects like roughness, +metallic, albedo, and transparency in real images. 
Our method capitalizes on +the generative prior of text-to-image models known for photorealism, employing +a scalar value and instructions to alter low-level material properties. +Addressing the lack of datasets with controlled material attributes, we +generated an object-centric synthetic dataset with physically-based materials. +Fine-tuning a modified pre-trained text-to-image model on this synthetic +dataset enables us to edit material properties in real-world images while +preserving all other attributes. We show the potential application of our model +to material edited NeRFs.",cs.CV,"['cs.CV', 'cs.AI', 'cs.GR']" +Noisy One-point Homographies are Surprisingly Good,Yaqing Ding · Jonathan Astermark · Magnus Oskarsson · Viktor Larsson, ,,https://vlarsson.github.io/publications/,,,,,nan +Small Scale Data-Free Knowledge Distillation,He Liu · Yikai Wang · Huaping Liu · Fuchun Sun · Anbang Yao, ,https://arxiv.org/abs/2403.19539,,2403.19539.pdf,De-confounded Data-free Knowledge Distillation for Handling Distribution Shifts,"Data-Free Knowledge Distillation (DFKD) is a promising task to train +high-performance small models to enhance actual deployment without relying on +the original training data. Existing methods commonly avoid relying on private +data by utilizing synthetic or sampled data. However, a long-overlooked issue +is that the severe distribution shifts between their substitution and original +data, which manifests as huge differences in the quality of images and class +proportions. The harmful shifts are essentially the confounder that +significantly causes performance bottlenecks. To tackle the issue, this paper +proposes a novel perspective with causal inference to disentangle the student +models from the impact of such shifts. By designing a customized causal graph, +we first reveal the causalities among the variables in the DFKD task. +Subsequently, we propose a Knowledge Distillation Causal Intervention (KDCI) +framework based on the backdoor adjustment to de-confound the confounder. KDCI +can be flexibly combined with most existing state-of-the-art baselines. +Experiments in combination with six representative DFKD methods demonstrate the +effectiveness of our KDCI, which can obviously help existing methods under +almost all settings, \textit{e.g.}, improving the baseline by up to 15.54\% +accuracy on the CIFAR-100 dataset.",cs.CV,['cs.CV'] +Efficient Multitask Dense Predictor via Binarization,Yuzhang Shang · Dan Xu · Gaowen Liu · Ramana Kompella · Yan Yan, ,https://arxiv.org/abs/2405.14136,,2405.14136.pdf,Efficient Multitask Dense Predictor via Binarization,"Multi-task learning for dense prediction has emerged as a pivotal area in +computer vision, enabling simultaneous processing of diverse yet interrelated +pixel-wise prediction tasks. However, the substantial computational demands of +state-of-the-art (SoTA) models often limit their widespread deployment. This +paper addresses this challenge by introducing network binarization to compress +resource-intensive multi-task dense predictors. Specifically, our goal is to +significantly accelerate multi-task dense prediction models via Binary Neural +Networks (BNNs) while maintaining and even improving model performance at the +same time. To reach this goal, we propose a Binary Multi-task Dense Predictor, +Bi-MTDP, and several variants of Bi-MTDP, in which a multi-task dense predictor +is constructed via specified binarized modules. 
Our systematical analysis of +this predictor reveals that performance drop from binarization is primarily +caused by severe information degradation. To address this issue, we introduce a +deep information bottleneck layer that enforces representations for downstream +tasks satisfying Gaussian distribution in forward propagation. Moreover, we +introduce a knowledge distillation mechanism to correct the direction of +information flow in backward propagation. Intriguingly, one variant of Bi-MTDP +outperforms full-precision (FP) multi-task dense prediction SoTAs, ARTC +(CNN-based) and InvPT (ViT-Based). This result indicates that Bi-MTDP is not +merely a naive trade-off between performance and efficiency, but is rather a +benefit of the redundant information flow thanks to the multi-task +architecture. Code is available at https://github.com/42Shawn/BiMTDP.",cs.CV,['cs.CV'] +Neural Super-Resolution for Real-time Rendering with Radiance Demodulation,Jia Li · Ziling Chen · Xiaolong Wu · Lu Wang · Beibei Wang · Lei Zhang, ,https://arxiv.org/abs/2308.06699,,2308.06699.pdf,Neural Super-Resolution for Real-time Rendering with Radiance Demodulation,"It is time-consuming to render high-resolution images in applications such as +video games and virtual reality, and thus super-resolution technologies become +increasingly popular for real-time rendering. However, it is challenging to +preserve sharp texture details, keep the temporal stability and avoid the +ghosting artifacts in real-time super-resolution rendering. To address this +issue, we introduce radiance demodulation to separate the rendered image or +radiance into a lighting component and a material component, considering the +fact that the light component is smoother than the rendered image so that the +high-resolution material component with detailed textures can be easily +obtained. We perform the super-resolution on the lighting component only and +re-modulate it with the high-resolution material component to obtain the final +super-resolution image with more texture details. A reliable warping module is +proposed by explicitly marking the occluded regions to avoid the ghosting +artifacts. To further enhance the temporal stability, we design a +frame-recurrent neural network and a temporal loss to aggregate the previous +and current frames, which can better capture the spatial-temporal consistency +among reconstructed frames. As a result, our method is able to produce +temporally stable results in real-time rendering with high-quality details, +even in the challenging 4 $\times$ 4 super-resolution scenarios.",cs.GR,['cs.GR'] +Multiple View Geometry Transformers for 3D Human Pose Estimation,Ziwei Liao · jialiang zhu · Chunyu Wang · Han Hu · Steven L. Waslander, ,https://arxiv.org/abs/2311.10983,,2311.10983.pdf,Multiple View Geometry Transformers for 3D Human Pose Estimation,"In this work, we aim to improve the 3D reasoning ability of Transformers in +multi-view 3D human pose estimation. Recent works have focused on end-to-end +learning-based transformer designs, which struggle to resolve geometric +information accurately, particularly during occlusion. Instead, we propose a +novel hybrid model, MVGFormer, which has a series of geometric and appearance +modules organized in an iterative manner. The geometry modules are +learning-free and handle all viewpoint-dependent 3D tasks geometrically which +notably improves the model's generalization ability. 
The appearance modules are +learnable and are dedicated to estimating 2D poses from image signals +end-to-end which enables them to achieve accurate estimates even when occlusion +occurs, leading to a model that is both accurate and generalizable to new +cameras and geometries. We evaluate our approach for both in-domain and +out-of-domain settings, where our model consistently outperforms +state-of-the-art methods, and especially does so by a significant margin in the +out-of-domain setting. We will release the code and models: +https://github.com/XunshanMan/MVGFormer.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +Efficient Scene Recovery Using Luminous Flux Prior,ZhongYu Li · Lei Zhang, ,,,,,,,nan +ARTrackV2: Prompting Autoregressive Tracker Where to Look and How to Describe,Yifan Bai · Zeyang Zhao · Yihong Gong · Xing Wei, ,https://arxiv.org/abs/2312.17133,,2312.17133.pdf,ARTrackV2: Prompting Autoregressive Tracker Where to Look and How to Describe,"We present ARTrackV2, which integrates two pivotal aspects of tracking: +determining where to look (localization) and how to describe (appearance +analysis) the target object across video frames. Building on the foundation of +its predecessor, ARTrackV2 extends the concept by introducing a unified +generative framework to ""read out"" object's trajectory and ""retell"" its +appearance in an autoregressive manner. This approach fosters a time-continuous +methodology that models the joint evolution of motion and visual features, +guided by previous estimates. Furthermore, ARTrackV2 stands out for its +efficiency and simplicity, obviating the less efficient intra-frame +autoregression and hand-tuned parameters for appearance updates. Despite its +simplicity, ARTrackV2 achieves state-of-the-art performance on prevailing +benchmark datasets while demonstrating remarkable efficiency improvement. In +particular, ARTrackV2 achieves AO score of 79.5\% on GOT-10k, and AUC of 86.1\% +on TrackingNet while being $3.6 \times$ faster than ARTrack. The code will be +released.",cs.CV,['cs.CV'] +ConsistNet: Enforcing 3D Consistency for Multi-view Images Diffusion,Jiayu Yang · Ziang Cheng · Yunfei Duan · Pan Ji · Hongdong Li, ,https://arxiv.org/abs/2310.10343,,2310.10343.pdf,ConsistNet: Enforcing 3D Consistency for Multi-view Images Diffusion,"Given a single image of a 3D object, this paper proposes a novel method +(named ConsistNet) that is able to generate multiple images of the same object, +as if seen they are captured from different viewpoints, while the 3D +(multi-view) consistencies among those multiple generated images are +effectively exploited. Central to our method is a multi-view consistency block +which enables information exchange across multiple single-view diffusion +processes based on the underlying multi-view geometry principles. ConsistNet is +an extension to the standard latent diffusion model, and consists of two +sub-modules: (a) a view aggregation module that unprojects multi-view features +into global 3D volumes and infer consistency, and (b) a ray aggregation module +that samples and aggregate 3D consistent features back to each view to enforce +consistency. Our approach departs from previous methods in multi-view image +generation, in that it can be easily dropped-in pre-trained LDMs without +requiring explicit pixel correspondences or depth prediction. Experiments show +that our method effectively learns 3D consistency over a frozen Zero123 +backbone and can generate 16 surrounding views of the object within 40 seconds +on a single A100 GPU. 
Our code will be made available on +https://github.com/JiayuYANG/ConsistNet",cs.CV,['cs.CV'] +Text2HOI: Text-guided 3D Motion Generation for Hand-Object Interaction,Junuk Cha · Jihyeon Kim · Jae Shin Yoon · Seungryul Baek, ,https://arxiv.org/abs/2404.00562,,2404.00562.pdf,Text2HOI: Text-guided 3D Motion Generation for Hand-Object Interaction,"This paper introduces the first text-guided work for generating the sequence +of hand-object interaction in 3D. The main challenge arises from the lack of +labeled data where existing ground-truth datasets are nowhere near +generalizable in interaction type and object category, which inhibits the +modeling of diverse 3D hand-object interaction with the correct physical +implication (e.g., contacts and semantics) from text prompts. To address this +challenge, we propose to decompose the interaction generation task into two +subtasks: hand-object contact generation; and hand-object motion generation. +For contact generation, a VAE-based network takes as input a text and an object +mesh, and generates the probability of contacts between the surfaces of hands +and the object during the interaction. The network learns a variety of local +geometry structure of diverse objects that is independent of the objects' +category, and thus, it is applicable to general objects. For motion generation, +a Transformer-based diffusion model utilizes this 3D contact map as a strong +prior for generating physically plausible hand-object motion as a function of +text prompts by learning from the augmented labeled dataset; where we annotate +text labels from many existing 3D hand and object motion data. Finally, we +further introduce a hand refiner module that minimizes the distance between the +object surface and hand joints to improve the temporal stability of the +object-hand contacts and to suppress the penetration artifacts. In the +experiments, we demonstrate that our method can generate more realistic and +diverse interactions compared to other baseline methods. We also show that our +method is applicable to unseen objects. We will release our model and newly +labeled data as a strong foundation for future research. Codes and data are +available in: https://github.com/JunukCha/Text2HOI.",cs.CV,['cs.CV'] +Holo-Relighting: Controllable Volumetric Portrait Relighting from a Single Image,Yiqun Mei · Yu Zeng · He Zhang · Zhixin Shu · Xuaner Zhang · Sai Bi · Jianming Zhang · HyunJoon Jung · Vishal M. Patel,https://yiqunmei.net/holo-web/,https://arxiv.org/abs/2403.09632,,2403.09632.pdf,Holo-Relighting: Controllable Volumetric Portrait Relighting from a Single Image,"At the core of portrait photography is the search for ideal lighting and +viewpoint. The process often requires advanced knowledge in photography and an +elaborate studio setup. In this work, we propose Holo-Relighting, a volumetric +relighting method that is capable of synthesizing novel viewpoints, and novel +lighting from a single image. Holo-Relighting leverages the pretrained 3D GAN +(EG3D) to reconstruct geometry and appearance from an input portrait as a set +of 3D-aware features. We design a relighting module conditioned on a given +lighting to process these features, and predict a relit 3D representation in +the form of a tri-plane, which can render to an arbitrary viewpoint through +volume rendering. Besides viewpoint and lighting control, Holo-Relighting also +takes the head pose as a condition to enable head-pose-dependent lighting +effects. 
With these novel designs, Holo-Relighting can generate complex +non-Lambertian lighting effects (e.g., specular highlights and cast shadows) +without using any explicit physical lighting priors. We train Holo-Relighting +with data captured with a light stage, and propose two data-rendering +techniques to improve the data quality for training the volumetric relighting +system. Through quantitative and qualitative experiments, we demonstrate +Holo-Relighting can achieve state-of-the-arts relighting quality with better +photorealism, 3D consistency and controllability.",cs.CV,['cs.CV'] +Uncertainty-Guided Never-Ending Learning to Drive,Lei Lai · Eshed Ohn-Bar · Sanjay Arora · John Yi, ,,https://paperswithcode.com/paper/learning-to-drive-anywhere,,,,,nan +Positive-Unlabeled Learning by Latent Group-Aware Meta Disambiguation,Lin Long · Haobo Wang · Zhijie Jiang · Lei Feng · Chang Yao · Gang Chen · Junbo Zhao, ,https://arxiv.org/abs/2307.15973,,2307.15973.pdf,Debiased Pairwise Learning from Positive-Unlabeled Implicit Feedback,"Learning contrastive representations from pairwise comparisons has achieved +remarkable success in various fields, such as natural language processing, +computer vision, and information retrieval. Collaborative filtering algorithms +based on pairwise learning also rooted in this paradigm. A significant concern +is the absence of labels for negative instances in implicit feedback data, +which often results in the random selected negative instances contains false +negatives and inevitably, biased embeddings. To address this issue, we +introduce a novel correction method for sampling bias that yields a modified +loss for pairwise learning called debiased pairwise loss (DPL). The key idea +underlying DPL is to correct the biased probability estimates that result from +false negatives, thereby correcting the gradients to approximate those of fully +supervised data. The implementation of DPL only requires a small modification +of the codes. Experimental studies on five public datasets validate the +effectiveness of proposed learning method.",cs.IR,['cs.IR'] +Overcoming Generic Knowledge Loss with Selective Parameter Update,Wenxuan Zhang · Paul Janson · Rahaf Aljundi · Mohamed Elhoseiny, ,https://arxiv.org/abs/2308.12462,,2308.12462.pdf,Overcoming Generic Knowledge Loss with Selective Parameter Update,"Foundation models encompass an extensive knowledge base and offer remarkable +transferability. However, this knowledge becomes outdated or insufficient over +time. The challenge lies in continuously updating foundation models to +accommodate novel information while retaining their original capabilities. +Leveraging the fact that foundation models have initial knowledge on various +tasks and domains, we propose a novel approach that, instead of updating all +parameters equally, localizes the updates to a sparse set of parameters +relevant to the task being learned. We strike a balance between efficiency and +new task performance, while maintaining the transferability and +generalizability of foundation models. We extensively evaluate our method on +foundational vision-language models with a diverse spectrum of continual +learning tasks. Our method achieves improvements on the accuracy of the newly +learned tasks up to 7% while preserving the pretraining knowledge with a +negligible decrease of 0.9% on a representative control set accuracy.",cs.CV,['cs.CV'] +Projecting Trackable Thermal Patterns for Dynamic Computer Vision,Mark Sheinin · Aswin C. Sankaranarayanan · Srinivasa G. 
Narasimhan, ,,https://www.globotreks.com/destinations/canada/day-trips-manitoba-winnipeg/,,,,,nan +DePT: Decoupled Prompt Tuning,Ji Zhang · Shihan Wu · Lianli Gao · Heng Tao Shen · Jingkuan Song, ,https://arxiv.org/abs/2309.07439,,2309.07439.pdf,DePT: Decoupled Prompt Tuning,"This work breaks through the Base-New Tradeoff (BNT)dilemma in prompt tuning, +i.e., the better the tuned model generalizes to the base (or target) task, the +worse it generalizes to new tasks, and vice versa. Specifically, through an +in-depth analysis of the learned features of the base and new tasks, we observe +that the BNT stems from a channel bias issue, i.e., the vast majority of +feature channels are occupied by base-specific knowledge, resulting in the +collapse of taskshared knowledge important to new tasks. To address this, we +propose the Decoupled Prompt Tuning (DePT) framework, which decouples +base-specific knowledge from feature channels into an isolated feature space +during prompt tuning, so as to maximally preserve task-shared knowledge in the +original feature space for achieving better zero-shot generalization on new +tasks. Importantly, our DePT is orthogonal to existing prompt tuning methods, +hence it can improve all of them. Extensive experiments on 11 datasets show the +strong flexibility and effectiveness of DePT. Our code and pretrained models +are available at https://github.com/Koorye/DePT.",cs.CV,['cs.CV'] +Sharingan: A Transformer Architecture for Multi-Person Gaze Following,Samy Tafasca · Anshul Gupta · Jean-marc Odobez, ,https://arxiv.org/abs/2310.00816,,2310.00816.pdf,Sharingan: A Transformer-based Architecture for Gaze Following,"Gaze is a powerful form of non-verbal communication and social interaction +that humans develop from an early age. As such, modeling this behavior is an +important task that can benefit a broad set of application domains ranging from +robotics to sociology. In particular, Gaze Following is defined as the +prediction of the pixel-wise 2D location where a person in the image is +looking. Prior efforts in this direction have focused primarily on CNN-based +architectures to perform the task. In this paper, we introduce a novel +transformer-based architecture for 2D gaze prediction. We experiment with 2 +variants: the first one retains the same task formulation of predicting a gaze +heatmap for one person at a time, while the second one casts the problem as a +2D point regression and allows us to perform multi-person gaze prediction with +a single forward pass. This new architecture achieves state-of-the-art results +on the GazeFollow and VideoAttentionTarget datasets. The code for this paper +will be made publicly available.",cs.CV,['cs.CV'] +Fully Exploiting Every Real Sample: Super-Pixel Sample Gradient Model Stealing,Yunlong Zhao · Xiaoheng Deng · Yijing Liu · Xinjun Pei · Jiazhi Xia · Wei Chen, ,https://ar5iv.labs.arxiv.org/html/2309.10058,,2309.10058.pdf,Dual Student Networks for Data-Free Model Stealing,"Existing data-free model stealing methods use a generator to produce samples +in order to train a student model to match the target model outputs. To this +end, the two main challenges are estimating gradients of the target model +without access to its parameters, and generating a diverse set of training +samples that thoroughly explores the input space. We propose a Dual Student +method where two students are symmetrically trained in order to provide the +generator a criterion to generate samples that the two students disagree on. 
On +one hand, disagreement on a sample implies at least one student has classified +the sample incorrectly when compared to the target model. This incentive +towards disagreement implicitly encourages the generator to explore more +diverse regions of the input space. On the other hand, our method utilizes +gradients of student models to indirectly estimate gradients of the target +model. We show that this novel training objective for the generator network is +equivalent to optimizing a lower bound on the generator's loss if we had access +to the target model gradients. We show that our new optimization framework +provides more accurate gradient estimation of the target model and better +accuracies on benchmark classification datasets. Additionally, our approach +balances improved query efficiency with training computation cost. Finally, we +demonstrate that our method serves as a better proxy model for transfer-based +adversarial attacks than existing data-free model stealing methods.",cs.LG,"['cs.LG', 'cs.CR']" +MaskClustering: View Consensus based Mask Graph Clustering for Open-Vocabulary 3D Instance Segmentation,Mi Yan · Jiazhao Zhang · Yan Zhu · He Wang,https://pku-epic.github.io/MaskClustering/,https://arxiv.org/abs/2401.07745,,2401.07745.pdf,MaskClustering: View Consensus based Mask Graph Clustering for Open-Vocabulary 3D Instance Segmentation,"Open-vocabulary 3D instance segmentation is cutting-edge for its ability to +segment 3D instances without predefined categories. However, progress in 3D +lags behind its 2D counterpart due to limited annotated 3D data. To address +this, recent works first generate 2D open-vocabulary masks through 2D models +and then merge them into 3D instances based on metrics calculated between two +neighboring frames. In contrast to these local metrics, we propose a novel +metric, view consensus rate, to enhance the utilization of multi-view +observations. The key insight is that two 2D masks should be deemed part of the +same 3D instance if a significant number of other 2D masks from different views +contain both these two masks. Using this metric as edge weight, we construct a +global mask graph where each mask is a node. Through iterative clustering of +masks showing high view consensus, we generate a series of clusters, each +representing a distinct 3D instance. Notably, our model is training-free. +Through extensive experiments on publicly available datasets, including +ScanNet++, ScanNet200 and MatterPort3D, we demonstrate that our method achieves +state-of-the-art performance in open-vocabulary 3D instance segmentation. Our +project page is at https://pku-epic.github.io/MaskClustering.",cs.CV,['cs.CV'] +Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts,Jialin Wu · Xia Hu · Yaqing Wang · Bo Pang · Radu Soricut, ,https://arxiv.org/abs/2312.00968,,2312.00968.pdf,Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts,"Large multi-modal models (LMMs) exhibit remarkable performance across +numerous tasks. However, generalist LMMs often suffer from performance +degradation when tuned over a large collection of tasks. Recent research +suggests that Mixture of Experts (MoE) architectures are useful for instruction +tuning, but for LMMs of parameter size around O(50-100B), the prohibitive cost +of replicating and storing the expert models severely limits the number of +experts we can use. 
We propose Omni-SMoLA, an architecture that uses the Soft +MoE approach to (softly) mix many multimodal low rank experts, and avoids +introducing a significant number of new parameters compared to conventional MoE +models. The core intuition here is that the large model provides a foundational +backbone, while different lightweight experts residually learn specialized +knowledge, either per-modality or multimodally. Extensive experiments +demonstrate that the SMoLA approach helps improve the generalist performance +across a broad range of generative vision-and-language tasks, achieving new +SoTA generalist performance that often matches or outperforms single +specialized LMM baselines, as well as new SoTA specialist performance.",cs.CV,"['cs.CV', 'cs.CL']" +SwiftBrush: One-Step Text-to-Image Diffusion Model with Variational Score Distillation,Thuan Nguyen · Anh Tran,thuanz123.github.io/swiftbrush,https://arxiv.org/abs/2312.05239,,2312.05239.pdf,SwiftBrush: One-Step Text-to-Image Diffusion Model with Variational Score Distillation,"Despite their ability to generate high-resolution and diverse images from +text prompts, text-to-image diffusion models often suffer from slow iterative +sampling processes. Model distillation is one of the most effective directions +to accelerate these models. However, previous distillation methods fail to +retain the generation quality while requiring a significant amount of images +for training, either from real data or synthetically generated by the teacher +model. In response to this limitation, we present a novel image-free +distillation scheme named $\textbf{SwiftBrush}$. Drawing inspiration from +text-to-3D synthesis, in which a 3D neural radiance field that aligns with the +input prompt can be obtained from a 2D text-to-image diffusion prior via a +specialized loss without the use of any 3D data ground-truth, our approach +re-purposes that same loss for distilling a pretrained multi-step text-to-image +model to a student network that can generate high-fidelity images with just a +single inference step. In spite of its simplicity, our model stands as one of +the first one-step text-to-image generators that can produce images of +comparable quality to Stable Diffusion without reliance on any training image +data. Remarkably, SwiftBrush achieves an FID score of $\textbf{16.67}$ and a +CLIP score of $\textbf{0.29}$ on the COCO-30K benchmark, achieving competitive +results or even substantially surpassing existing state-of-the-art distillation +techniques.",cs.CV,['cs.CV'] +HardMo: A Large-Scale Hardcase Dataset for Motion Capture,Jiaqi Liao · Chuanchen Luo · Yinuo Du · Yuxi Wang · Xu-Cheng Yin · Man Zhang · Zhaoxiang Zhang · Junran Peng, ,,,,,,,nan +Align Your Gaussians: Text-to-4D with Dynamic 3D Gaussians and Composed Diffusion Models,Huan Ling · Seung Wook Kim · Antonio Torralba · Sanja Fidler · Karsten Kreis, ,https://arxiv.org/abs/2312.13763,,2312.13763.pdf,Align Your Gaussians: Text-to-4D with Dynamic 3D Gaussians and Composed Diffusion Models,"Text-guided diffusion models have revolutionized image and video generation +and have also been successfully used for optimization-based 3D object +synthesis. Here, we instead focus on the underexplored text-to-4D setting and +synthesize dynamic, animated 3D objects using score distillation methods with +an additional temporal dimension. 
Compared to previous work, we pursue a novel +compositional generation-based approach, and combine text-to-image, +text-to-video, and 3D-aware multiview diffusion models to provide feedback +during 4D object optimization, thereby simultaneously enforcing temporal +consistency, high-quality visual appearance and realistic geometry. Our method, +called Align Your Gaussians (AYG), leverages dynamic 3D Gaussian Splatting with +deformation fields as 4D representation. Crucial to AYG is a novel method to +regularize the distribution of the moving 3D Gaussians and thereby stabilize +the optimization and induce motion. We also propose a motion amplification +mechanism as well as a new autoregressive synthesis scheme to generate and +combine multiple 4D sequences for longer generation. These techniques allow us +to synthesize vivid dynamic scenes, outperform previous work qualitatively and +quantitatively and achieve state-of-the-art text-to-4D performance. Due to the +Gaussian 4D representation, different 4D animations can be seamlessly combined, +as we demonstrate. AYG opens up promising avenues for animation, simulation and +digital content creation as well as synthetic data generation.",cs.CV,"['cs.CV', 'cs.LG']" +THRONE: A Hallucination Benchmark for the Free-form Generations of Large Vision-Language Models,Prannay Kaul · Zhizhong Li · Hao Yang · Yonatan Dukler · Ashwin Swaminathan · CJ Taylor · Stefano Soatto, ,,,,,,,nan +CVT-xRF: Contrastive In-Voxel Transformer for 3D Consistent Radiance Fields from Sparse Inputs,Yingji Zhong · Lanqing Hong · Zhenguo Li · Dan Xu, ,,,,,,,nan +Total-Decom: Decomposed 3D Scene Reconstruction with Minimal Interaction,Xiaoyang Lyu · Chirui Chang · Peng Dai · Yangtian Sun · Xiaojuan Qi, ,https://arxiv.org/abs/2403.19314,,2403.19314.pdf,Total-Decom: Decomposed 3D Scene Reconstruction with Minimal Interaction,"Scene reconstruction from multi-view images is a fundamental problem in +computer vision and graphics. Recent neural implicit surface reconstruction +methods have achieved high-quality results; however, editing and manipulating +the 3D geometry of reconstructed scenes remains challenging due to the absence +of naturally decomposed object entities and complex object/background +compositions. In this paper, we present Total-Decom, a novel method for +decomposed 3D reconstruction with minimal human interaction. Our approach +seamlessly integrates the Segment Anything Model (SAM) with hybrid +implicit-explicit neural surface representations and a mesh-based +region-growing technique for accurate 3D object decomposition. Total-Decom +requires minimal human annotations while providing users with real-time control +over the granularity and quality of decomposition. We extensively evaluate our +method on benchmark datasets and demonstrate its potential for downstream +applications, such as animation and scene editing.
The code is available at +https://github.com/CVMI-Lab/Total-Decom.git.",cs.CV,['cs.CV'] +Contrastive Pre-Training with Multi-View Fusion for No-Reference Point Cloud Quality Assessment,Ziyu Shan · Yujie Zhang · Qi Yang · Haichen Yang · Yiling Xu · Jenq-Neng Hwang · Xiaozhong Xu · Shan Liu, ,https://arxiv.org/abs/2403.10066,,2403.10066.pdf,Contrastive Pre-Training with Multi-View Fusion for No-Reference Point Cloud Quality Assessment,"No-reference point cloud quality assessment (NR-PCQA) aims to automatically +evaluate the perceptual quality of distorted point clouds without available +reference, which have achieved tremendous improvements due to the utilization +of deep neural networks. However, learning-based NR-PCQA methods suffer from +the scarcity of labeled data and usually perform suboptimally in terms of +generalization. To solve the problem, we propose a novel contrastive +pre-training framework tailored for PCQA (CoPA), which enables the pre-trained +model to learn quality-aware representations from unlabeled data. To obtain +anchors in the representation space, we project point clouds with different +distortions into images and randomly mix their local patches to form mixed +images with multiple distortions. Utilizing the generated anchors, we constrain +the pre-training process via a quality-aware contrastive loss following the +philosophy that perceptual quality is closely related to both content and +distortion. Furthermore, in the model fine-tuning stage, we propose a +semantic-guided multi-view fusion module to effectively integrate the features +of projected images from multiple perspectives. Extensive experiments show that +our method outperforms the state-of-the-art PCQA methods on popular benchmarks. +Further investigations demonstrate that CoPA can also benefit existing +learning-based PCQA models.",cs.CV,"['cs.CV', 'cs.MM']" +PredToken: Predicting Unknown Tokens and Beyond with Coarse-to-Fine Iterative Decoding,Xuesong Nie · Haoyuan Jin · Yunfeng Yan · Xi Chen · Zhihang Zhu · Donglian Qi, ,http://export.arxiv.org/abs/2310.18698,,2310.18698.pdf,Triplet Attention Transformer for Spatiotemporal Predictive Learning,"Spatiotemporal predictive learning offers a self-supervised learning paradigm +that enables models to learn both spatial and temporal patterns by predicting +future sequences based on historical sequences. Mainstream methods are +dominated by recurrent units, yet they are limited by their lack of +parallelization and often underperform in real-world scenarios. To improve +prediction quality while maintaining computational efficiency, we propose an +innovative triplet attention transformer designed to capture both inter-frame +dynamics and intra-frame static features. Specifically, the model incorporates +the Triplet Attention Module (TAM), which replaces traditional recurrent units +by exploring self-attention mechanisms in temporal, spatial, and channel +dimensions. In this configuration: (i) temporal tokens contain abstract +representations of inter-frame, facilitating the capture of inherent temporal +dependencies; (ii) spatial and channel attention combine to refine the +intra-frame representation by performing fine-grained interactions across +spatial and channel dimensions. Alternating temporal, spatial, and +channel-level attention allows our approach to learn more complex short- and +long-range spatiotemporal dependencies. 
Extensive experiments demonstrate +performance surpassing existing recurrent-based and recurrent-free methods, +achieving state-of-the-art under multi-scenario examination including moving +object trajectory prediction, traffic flow prediction, driving scene +prediction, and human motion capture.",cs.CV,"['cs.CV', 'cs.LG']" +U-VAP: User-specified Visual Appearance Personalization via Decoupled Self Augmentation,You Wu · Kean Liu · Xiaoyue Mi · Fan Tang · Juan Cao · Jintao Li, ,https://arxiv.org/abs/2403.20231,,2403.20231.pdf,U-VAP: User-specified Visual Appearance Personalization via Decoupled Self Augmentation,"Concept personalization methods enable large text-to-image models to learn +specific subjects (e.g., objects/poses/3D models) and synthesize renditions in +new contexts. Given that the image references are highly biased towards visual +attributes, state-of-the-art personalization models tend to overfit the whole +subject and cannot disentangle visual characteristics in pixel space. In this +study, we proposed a more challenging setting, namely fine-grained visual +appearance personalization. Different from existing methods, we allow users to +provide a sentence describing the desired attributes. A novel decoupled +self-augmentation strategy is proposed to generate target-related and +non-target samples to learn user-specified visual attributes. These augmented +data allow for refining the model's understanding of the target attribute while +mitigating the impact of unrelated attributes. At the inference stage, +adjustments are conducted on semantic space through the learned target and +non-target embeddings to further enhance the disentanglement of target +attributes. Extensive experiments on various kinds of visual attributes with +SOTA personalization methods show the ability of the proposed method to mimic +target visual appearance in novel contexts, thus improving the controllability +and flexibility of personalization.",cs.CV,['cs.CV'] +OVMR: Open-Vocabulary Recognition with Multi-Modal References,Zehong Ma · Shiliang Zhang · Longhui Wei · Qi Tian, ,https://arxiv.org/abs/2306.05493,,2306.05493.pdf,Multi-Modal Classifiers for Open-Vocabulary Object Detection,"The goal of this paper is open-vocabulary object detection (OVOD) +$\unicode{x2013}$ building a model that can detect objects beyond the set of +categories seen at training, thus enabling the user to specify categories of +interest at inference without the need for model retraining. We adopt a +standard two-stage object detector architecture, and explore three ways for +specifying novel categories: via language descriptions, via image exemplars, or +via a combination of the two. We make three contributions: first, we prompt a +large language model (LLM) to generate informative language descriptions for +object classes, and construct powerful text-based classifiers; second, we +employ a visual aggregator on image exemplars that can ingest any number of +images as input, forming vision-based classifiers; and third, we provide a +simple method to fuse information from language descriptions and image +exemplars, yielding a multi-modal classifier. 
When evaluating on the +challenging LVIS open-vocabulary benchmark we demonstrate that: (i) our +text-based classifiers outperform all previous OVOD works; (ii) our +vision-based classifiers perform as well as text-based classifiers in prior +work; (iii) using multi-modal classifiers perform better than either modality +alone; and finally, (iv) our text-based and multi-modal classifiers yield +better performance than a fully-supervised detector.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG', 'I.4.6; I.4.8; I.4.9; I.2.10']" +Dynamic Prompt Optimizing for Text-to-Image Generation,Wenyi Mo · Tianyu Zhang · Yalong Bai · Bing Su · Ji-Rong Wen · Qing Yang, ,https://arxiv.org/abs/2404.04095,,2404.04095.pdf,Dynamic Prompt Optimizing for Text-to-Image Generation,"Text-to-image generative models, specifically those based on diffusion models +like Imagen and Stable Diffusion, have made substantial advancements. Recently, +there has been a surge of interest in the delicate refinement of text prompts. +Users assign weights or alter the injection time steps of certain words in the +text prompts to improve the quality of generated images. However, the success +of fine-control prompts depends on the accuracy of the text prompts and the +careful selection of weights and time steps, which requires significant manual +intervention. To address this, we introduce the \textbf{P}rompt +\textbf{A}uto-\textbf{E}diting (PAE) method. Besides refining the original +prompts for image generation, we further employ an online reinforcement +learning strategy to explore the weights and injection time steps of each word, +leading to the dynamic fine-control prompts. The reward function during +training encourages the model to consider aesthetic score, semantic +consistency, and user preferences. Experimental results demonstrate that our +proposed method effectively improves the original prompts, generating visually +more appealing images while maintaining semantic alignment. Code is available +at https://github.com/Mowenyii/PAE.",cs.CV,"['cs.CV', 'cs.AI']" +DPMesh: Exploiting Diffusion Prior for Occluded Human Mesh Recovery,Yixuan Zhu · Ao Li · Yansong Tang · Wenliang Zhao · Jie Zhou · Jiwen Lu, ,https://arxiv.org/abs/2404.01424,,2404.01424.pdf,DPMesh: Exploiting Diffusion Prior for Occluded Human Mesh Recovery,"The recovery of occluded human meshes presents challenges for current methods +due to the difficulty in extracting effective image features under severe +occlusion. In this paper, we introduce DPMesh, an innovative framework for +occluded human mesh recovery that capitalizes on the profound diffusion prior +about object structure and spatial relationships embedded in a pre-trained +text-to-image diffusion model. Unlike previous methods reliant on conventional +backbones for vanilla feature extraction, DPMesh seamlessly integrates the +pre-trained denoising U-Net with potent knowledge as its image backbone and +performs a single-step inference to provide occlusion-aware information. To +enhance the perception capability for occluded poses, DPMesh incorporates +well-designed guidance via condition injection, which produces effective +controls from 2D observations for the denoising U-Net. Furthermore, we explore +a dedicated noisy key-point reasoning approach to mitigate disturbances arising +from occlusion and crowded scenarios. This strategy fully unleashes the +perceptual capability of the diffusion prior, thereby enhancing accuracy. 
+Extensive experiments affirm the efficacy of our framework, as we outperform +state-of-the-art methods on both occlusion-specific and standard datasets. The +persuasive results underscore its ability to achieve precise and robust 3D +human mesh recovery, particularly in challenging scenarios involving occlusion +and crowded scenes.",cs.CV,['cs.CV'] +Learning Inclusion Matching for Animation Paint Bucket Colorization,Yuekun Dai · Shangchen Zhou · Blake Li · Chongyi Li · Chen Change Loy,https://ykdai.github.io/projects/InclusionMatching,https://arxiv.org/abs/2403.18342,,2403.18342.pdf,Learning Inclusion Matching for Animation Paint Bucket Colorization,"Colorizing line art is a pivotal task in the production of hand-drawn cel +animation. This typically involves digital painters using a paint bucket tool +to manually color each segment enclosed by lines, based on RGB values +predetermined by a color designer. This frame-by-frame process is both arduous +and time-intensive. Current automated methods mainly focus on segment matching. +This technique migrates colors from a reference to the target frame by aligning +features within line-enclosed segments across frames. However, issues like +occlusion and wrinkles in animations often disrupt these direct +correspondences, leading to mismatches. In this work, we introduce a new +learning-based inclusion matching pipeline, which directs the network to +comprehend the inclusion relationships between segments rather than relying +solely on direct visual correspondences. Our method features a two-stage +pipeline that integrates a coarse color warping module with an inclusion +matching module, enabling more nuanced and accurate colorization. To facilitate +the training of our network, we also develope a unique dataset, referred to as +PaintBucket-Character. This dataset includes rendered line arts alongside their +colorized counterparts, featuring various 3D characters. Extensive experiments +demonstrate the effectiveness and superiority of our method over existing +techniques.",cs.CV,['cs.CV'] +Grounded Question-Answering in Long Egocentric Videos,Shangzhe Di · Weidi Xie,https://github.com/Becomebright/GroundVQA,https://arxiv.org/abs/2312.06505,,2312.06505.pdf,Grounded Question-Answering in Long Egocentric Videos,"Existing approaches to video understanding, mainly designed for short videos +from a third-person perspective, are limited in their applicability in certain +fields, such as robotics. In this paper, we delve into open-ended +question-answering (QA) in long, egocentric videos, which allows individuals or +robots to inquire about their own past visual experiences. This task presents +unique challenges, including the complexity of temporally grounding queries +within extensive video content, the high resource demands for precise data +annotation, and the inherent difficulty of evaluating open-ended answers due to +their ambiguous nature. Our proposed approach tackles these challenges by (i) +integrating query grounding and answering within a unified model to reduce +error propagation; (ii) employing large language models for efficient and +scalable data synthesis; and (iii) introducing a close-ended QA task for +evaluation, to manage answer ambiguity. Extensive experiments demonstrate the +effectiveness of our method, which also achieves state-of-the-art performance +on the QaEgo4D and Ego4D-NLQ benchmarks. 
Code, data, and models are available +at https://github.com/Becomebright/GroundVQA.",cs.CV,['cs.CV'] +SimAC: A Simple Anti-Customization Method for Protecting Face Privacy against Text-to-Image Synthesis of Diffusion Models,Feifei Wang · Zhentao Tan · Tianyi Wei · Yue Wu · Qidong Huang, ,https://arxiv.org/abs/2312.07865,,2312.07865.pdf,SimAC: A Simple Anti-Customization Method for Protecting Face Privacy against Text-to-Image Synthesis of Diffusion Models,"Despite the success of diffusion-based customization methods on visual +content creation, increasing concerns have been raised about such techniques +from both privacy and political perspectives. To tackle this issue, several +anti-customization methods have been proposed in very recent months, +predominantly grounded in adversarial attacks. Unfortunately, most of these +methods adopt straightforward designs, such as end-to-end optimization with a +focus on adversarially maximizing the original training loss, thereby +neglecting nuanced internal properties intrinsic to the diffusion model, and +even leading to ineffective optimization in some diffusion time steps.In this +paper, we strive to bridge this gap by undertaking a comprehensive exploration +of these inherent properties, to boost the performance of current +anti-customization approaches. Two aspects of properties are investigated: 1) +We examine the relationship between time step selection and the model's +perception in the frequency domain of images and find that lower time steps can +give much more contributions to adversarial noises. This inspires us to propose +an adaptive greedy search for optimal time steps that seamlessly integrates +with existing anti-customization methods. 2) We scrutinize the roles of +features at different layers during denoising and devise a sophisticated +feature-based optimization framework for anti-customization.Experiments on +facial benchmarks demonstrate that our approach significantly increases +identity disruption, thereby protecting user privacy and copyright. Our code is +available at: https://github.com/somuchtome/SimAC.",cs.CV,['cs.CV'] +DYSON: Dynamic Feature Space Self-Organization for Online Task-Free Class Incremental Learning,Yuhang He · YingJie Chen · Yuhan Jin · Songlin Dong · Xing Wei · Yihong Gong, ,https://arxiv.org/abs/2405.08533,,2405.08533.pdf,Dynamic Feature Learning and Matching for Class-Incremental Learning,"Class-incremental learning (CIL) has emerged as a means to learn new classes +incrementally without catastrophic forgetting of previous classes. Recently, +CIL has undergone a paradigm shift towards dynamic architectures due to their +superior performance. However, these models are still limited by the following +aspects: (i) Data augmentation (DA), which are tightly coupled with CIL, +remains under-explored in dynamic architecture scenarios. (ii) Feature +representation. The discriminativeness of dynamic feature are sub-optimal and +possess potential for refinement. (iii) Classifier. The misalignment between +dynamic feature and classifier constrains the capabilities of the model. To +tackle the aforementioned drawbacks, we propose the Dynamic Feature Learning +and Matching (DFLM) model in this paper from above three perspectives. +Specifically, we firstly introduce class weight information and non-stationary +functions to extend the mix DA method for dynamically adjusting the focus on +memory during training. 
Then, von Mises-Fisher (vMF) classifier is employed to +effectively model the dynamic feature distribution and implicitly learn their +discriminative properties. Finally, the matching loss is proposed to facilitate +the alignment between the learned dynamic features and the classifier by +minimizing the distribution distance. Extensive experiments on CIL benchmarks +validate that our proposed model achieves significant performance improvements +over existing methods.",cs.CV,['cs.CV'] +NightCC: Nighttime Color Constancy via Adaptive Channel Masking,Shuwei Li · Robby T. Tan, ,,,,,,,nan +G$^3$-LQ: Marrying Hyperbolic Alignment with Explicit Semantic-Geometric Modeling for 3D Visual Grounding,Yuan Wang · Yali Li · Shengjin Wang, ,https://arxiv.org/abs/2403.08182,,2403.08182.pdf,SeCG: Semantic-Enhanced 3D Visual Grounding via Cross-modal Graph Attention,"3D visual grounding aims to automatically locate the 3D region of the +specified object given the corresponding textual description. Existing works +fail to distinguish similar objects especially when multiple referred objects +are involved in the description. Experiments show that direct matching of +language and visual modal has limited capacity to comprehend complex +referential relationships in utterances. It is mainly due to the interference +caused by redundant visual information in cross-modal alignment. To strengthen +relation-orientated mapping between different modalities, we propose SeCG, a +semantic-enhanced relational learning model based on a graph network with our +designed memory graph attention layer. Our method replaces original +language-independent encoding with cross-modal encoding in visual analysis. +More text-related feature expressions are obtained through the guidance of +global semantics and implicit relationships. Experimental results on ReferIt3D +and ScanRefer benchmarks show that the proposed method outperforms the existing +state-of-the-art methods, particularly improving the localization performance +for the multi-relation challenges.",cs.CV,['cs.CV'] +Training Like a Medical Resident: Context-Prior Learning Toward Universal Medical Image Segmentation,Yunhe Gao, ,https://arxiv.org/abs/2306.02416,,2306.02416.pdf,Training Like a Medical Resident: Context-Prior Learning Toward Universal Medical Image Segmentation,"A major focus of clinical imaging workflow is disease diagnosis and +management, leading to medical imaging datasets strongly tied to specific +clinical objectives. This scenario has led to the prevailing practice of +developing task-specific segmentation models, without gaining insights from +widespread imaging cohorts. Inspired by the training program of medical +radiology residents, we propose a shift towards universal medical image +segmentation, a paradigm aiming to build medical image understanding foundation +models by leveraging the diversity and commonality across clinical targets, +body regions, and imaging modalities. Towards this goal, we develop Hermes, a +novel context-prior learning approach to address the challenges of data +heterogeneity and annotation differences in medical image segmentation. In a +large collection of eleven diverse datasets (2,438 3D images) across five +modalities (CT, PET, T1, T2 and cine MRI) and multiple body regions, we +demonstrate the merit of the universal paradigm over the traditional paradigm +on addressing multiple tasks within a single model. 
By exploiting the synergy +across tasks, Hermes achieves state-of-the-art performance on all testing +datasets and shows superior model scalability. Results on two additional +datasets reveals Hermes' strong performance for transfer learning, incremental +learning, and generalization to downstream tasks. Hermes's learned priors +demonstrate an appealing trait to reflect the intricate relations among tasks +and modalities, which aligns with the established anatomical and imaging +principles in radiology. The code is available: +https://github.com/yhygao/universal-medical-image-segmentation.",cs.CV,['cs.CV'] +Generative Quanta Color Imaging,Vishal Purohit · Junjie Luo · Yiheng Chi · Qi Guo · Stanley H. Chan · Qiang Qiu, ,https://arxiv.org/abs/2403.19066,,2403.19066.pdf,Generative Quanta Color Imaging,"The astonishing development of single-photon cameras has created an +unprecedented opportunity for scientific and industrial imaging. However, the +high data throughput generated by these 1-bit sensors creates a significant +bottleneck for low-power applications. In this paper, we explore the +possibility of generating a color image from a single binary frame of a +single-photon camera. We evidently find this problem being particularly +difficult to standard colorization approaches due to the substantial degree of +exposure variation. The core innovation of our paper is an exposure synthesis +model framed under a neural ordinary differential equation (Neural ODE) that +allows us to generate a continuum of exposures from a single observation. This +innovation ensures consistent exposure in binary images that colorizers take +on, resulting in notably enhanced colorization. We demonstrate applications of +the method in single-image and burst colorization and show superior generative +performance over baselines. Project website can be found at +https://vishal-s-p.github.io/projects/2023/generative_quanta_color.html.",cs.CV,"['cs.CV', 'cs.AI']" +Polarization Wavefront Lidar: Learning Large Scene Reconstruction from Polarized Wavefronts,Dominik Scheuble · Chenyang Lei · Mario Bijelic · Seung-Hwan Baek · Felix Heide, ,,https://cg.postech.ac.kr/2024/03/01/9-papers-are-accepted-to-cvpr-2024/,,,,,nan +MirageRoom: 3D Scene Segmentation with 2D Pre-trained Models by Mirage Projection,Haowen Sun · Yueqi Duan · Juncheng Yan · Yifan Liu · Jiwen Lu, ,https://arxiv.org/abs/2403.06403,,2403.06403.pdf,PointSeg: A Training-Free Paradigm for 3D Scene Segmentation via Foundation Models,"Recent success of vision foundation models have shown promising performance +for the 2D perception tasks. However, it is difficult to train a 3D foundation +network directly due to the limited dataset and it remains under explored +whether existing foundation models can be lifted to 3D space seamlessly. In +this paper, we present PointSeg, a novel training-free paradigm that leverages +off-the-shelf vision foundation models to address 3D scene perception tasks. +PointSeg can segment anything in 3D scene by acquiring accurate 3D prompts to +align their corresponding pixels across frames. Concretely, we design a +two-branch prompts learning structure to construct the 3D point-box prompts +pairs, combining with the bidirectional matching strategy for accurate point +and proposal prompts generation. Then, we perform the iterative post-refinement +adaptively when cooperated with different vision foundation models. Moreover, +we design a affinity-aware merging algorithm to improve the final ensemble +masks. 
PointSeg demonstrates impressive segmentation performance across various +datasets, all without training. Specifically, our approach significantly +surpasses the state-of-the-art specialist model by 13.4$\%$, 11.3$\%$, and +12$\%$ mAP on ScanNet, ScanNet++, and KITTI-360 datasets, respectively. On top +of that, PointSeg can incorporate with various segmentation models and even +surpasses the supervised methods.",cs.CV,['cs.CV'] +Overcoming Data Limitations for High-Quality Video Diffusion Models,Haoxin Chen · Yong Zhang · Xiaodong Cun · Menghan Xia · Xintao Wang · CHAO WENG · Ying Shan, ,,,,,,,nan +FakeInversion: Learning to Detect Images from Unseen Text-to-Image Models by Inverting Stable Diffusion,George Cazenavette · Avneesh Sud · Thomas Leung · Ben Usman, ,https://ar5iv.labs.arxiv.org/html/2210.06998,,2210.06998.pdf,DE-FAKE: Detection and Attribution of Fake Images Generated by Text-to-Image Generation Models,"Text-to-image generation models that generate images based on prompt +descriptions have attracted an increasing amount of attention during the past +few months. Despite their encouraging performance, these models raise concerns +about the misuse of their generated fake images. To tackle this problem, we +pioneer a systematic study on the detection and attribution of fake images +generated by text-to-image generation models. Concretely, we first build a +machine learning classifier to detect the fake images generated by various +text-to-image generation models. We then attribute these fake images to their +source models, such that model owners can be held responsible for their models' +misuse. We further investigate how prompts that generate fake images affect +detection and attribution. We conduct extensive experiments on four popular +text-to-image generation models, including DALL$\cdot$E 2, Stable Diffusion, +GLIDE, and Latent Diffusion, and two benchmark prompt-image datasets. Empirical +results show that (1) fake images generated by various models can be +distinguished from real ones, as there exists a common artifact shared by fake +images from different models; (2) fake images can be effectively attributed to +their source models, as different models leave unique fingerprints in their +generated images; (3) prompts with the ``person'' topic or a length between 25 +and 75 enable models to generate fake images with higher authenticity. All +findings contribute to the community's insight into the threats caused by +text-to-image generation models. We appeal to the community's consideration of +the counterpart solutions, like ours, against the rapidly-evolving fake image +generation.",cs.CR,"['cs.CR', 'cs.CV', 'cs.LG']" +"Separating the ""Chirp"" from the ""Chat"": Self-supervised Visual Grounding of Sound and Language",Mark Hamilton · Andrew Zisserman · John Hershey · William Freeman, ,https://arxiv.org/abs/2404.19696,,,Naturally Supervised 3D Visual Grounding with Language-Regularized Concept Learners,"3D visual grounding is a challenging task that often requires direct and +dense supervision, notably the semantic label for each object in the scene. In +this paper, we instead study the naturally supervised setting that learns from +only 3D scene and QA pairs, where prior works underperform. We propose the +Language-Regularized Concept Learner (LARC), which uses constraints from +language as regularization to significantly improve the accuracy of +neuro-symbolic concept learners in the naturally supervised setting. 
Our +approach is based on two core insights: the first is that language constraints +(e.g., a word's relation to another) can serve as effective regularization for +structured representations in neuro-symbolic models; the second is that we can +query large language models to distill such constraints from language +properties. We show that LARC improves performance of prior works in naturally +supervised 3D visual grounding, and demonstrates a wide range of 3D visual +reasoning capabilities-from zero-shot composition, to data efficiency and +transferability. Our method represents a promising step towards regularizing +structured visual reasoning frameworks with language-based priors, for learning +in settings without dense supervision.",cs.CV,"['cs.CV', 'cs.AI', 'cs.CL', 'cs.LG']" +Bootstrapping Chest CT Image Understanding by Distilling Knowledge from X-ray Expert Models,Weiwei Cao · Jianpeng Zhang · Yingda Xia · Tony C. W. MOK · Zi Li · Xianghua Ye · Le Lu · Jian Zheng · Yuxing Tang · Ling Zhang, ,https://arxiv.org/abs/2404.04936,,2404.04936.pdf,Bootstrapping Chest CT Image Understanding by Distilling Knowledge from X-ray Expert Models,"Radiologists highly desire fully automated versatile AI for medical imaging +interpretation. However, the lack of extensively annotated large-scale +multi-disease datasets has hindered the achievement of this goal. In this +paper, we explore the feasibility of leveraging language as a naturally +high-quality supervision for chest CT imaging. In light of the limited +availability of image-report pairs, we bootstrap the understanding of 3D chest +CT images by distilling chest-related diagnostic knowledge from an extensively +pre-trained 2D X-ray expert model. Specifically, we propose a language-guided +retrieval method to match each 3D CT image with its semantically closest 2D +X-ray image, and perform pair-wise and semantic relation knowledge +distillation. Subsequently, we use contrastive learning to align images and +reports within the same patient while distinguishing them from the other +patients. However, the challenge arises when patients have similar semantic +diagnoses, such as healthy patients, potentially confusing if treated as +negatives. We introduce a robust contrastive learning that identifies and +corrects these false negatives. We train our model with over 12,000 pairs of +chest CT images and radiology reports. Extensive experiments across multiple +scenarios, including zero-shot learning, report generation, and fine-tuning +processes, demonstrate the model's feasibility in interpreting chest CT images.",cs.CV,['cs.CV'] +Towards Automated Movie Trailer Generation,Dawit Argaw Argaw · Mattia Soldan · Alejandro Pardo · Chen Zhao · Fabian Caba Heilbron · Joon Chung · Bernard Ghanem, ,https://arxiv.org/abs/2404.03477,,2404.03477.pdf,Towards Automated Movie Trailer Generation,"Movie trailers are an essential tool for promoting films and attracting +audiences. However, the process of creating trailers can be time-consuming and +expensive. To streamline this process, we propose an automatic trailer +generation framework that generates plausible trailers from a full movie by +automating shot selection and composition. Our approach draws inspiration from +machine translation techniques and models the movies and trailers as sequences +of shots, thus formulating the trailer generation problem as a +sequence-to-sequence task. We introduce Trailer Generation Transformer (TGT), a +deep-learning framework utilizing an encoder-decoder architecture. 
TGT movie +encoder is tasked with contextualizing each movie shot representation via +self-attention, while the autoregressive trailer decoder predicts the feature +representation of the next trailer shot, accounting for the relevance of shots' +temporal order in trailers. Our TGT significantly outperforms previous methods +on a comprehensive suite of metrics.",cs.CV,['cs.CV'] +COCONut: Modernizing COCO Segmentation,Xueqing Deng · Qihang Yu · Peng Wang · Xiaohui Shen · Liang-Chieh Chen, ,,,,,,,nan +Investigating Compositional Challenges in Vision-Language Models for Visual Grounding,Yunan Zeng · Yan Huang · Jinjin Zhang · Zequn Jie · Zhenhua Chai · Liang Wang, ,https://arxiv.org/html/2405.17104v1,,2405.17104v1.pdf,LLM-Optic: Unveiling the Capabilities of Large Language Models for Universal Visual Grounding,"Visual grounding is an essential tool that links user-provided text queries +with query-specific regions within an image. Despite advancements in visual +grounding models, their ability to comprehend complex queries remains limited. +To overcome this limitation, we introduce LLM-Optic, an innovative method that +utilizes Large Language Models (LLMs) as an optical lens to enhance existing +visual grounding models in comprehending complex text queries involving +intricate text structures, multiple objects, or object spatial relationships, +situations that current models struggle with. LLM-Optic first employs an LLM as +a Text Grounder to interpret complex text queries and accurately identify +objects the user intends to locate. Then a pre-trained visual grounding model +is used to generate candidate bounding boxes given the refined query by the +Text Grounder. After that, LLM-Optic annotates the candidate bounding boxes +with numerical marks to establish a connection between text and specific image +regions, thereby linking two distinct modalities. Finally, it employs a Large +Multimodal Model (LMM) as a Visual Grounder to select the marked candidate +objects that best correspond to the original text query. Through LLM-Optic, we +have achieved universal visual grounding, which allows for the detection of +arbitrary objects specified by arbitrary human language input. Importantly, our +method achieves this enhancement without requiring additional training or +fine-tuning. Extensive experiments across various challenging benchmarks +demonstrate that LLM-Optic achieves state-of-the-art zero-shot visual grounding +capabilities.",cs.CV,"['cs.CV', 'cs.AI', 'cs.CL']" +View-decoupled Transformer for Person Re-identification under Aerial-ground Camera Network,Quan Zhang · Lei Wang · Vishal M. Patel · Xiaohua Xie · Jianhuang Lai, ,https://arxiv.org/abs/2403.14513,,2403.14513.pdf,View-decoupled Transformer for Person Re-identification under Aerial-ground Camera Network,"Existing person re-identification methods have achieved remarkable advances +in appearance-based identity association across homogeneous cameras, such as +ground-ground matching. However, as a more practical scenario, aerial-ground +person re-identification (AGPReID) among heterogeneous cameras has received +minimal attention. To alleviate the disruption of discriminative identity +representation by dramatic view discrepancy as the most significant challenge +in AGPReID, the view-decoupled transformer (VDT) is proposed as a simple yet +effective framework. 
Two major components are designed in VDT to decouple +view-related and view-unrelated features, namely hierarchical subtractive +separation and orthogonal loss, where the former separates these two features +inside the VDT, and the latter constrains these two to be independent. In +addition, we contribute a large-scale AGPReID dataset called CARGO, consisting +of five/eight aerial/ground cameras, 5,000 identities, and 108,563 images. +Experiments on two datasets show that VDT is a feasible and effective solution +for AGPReID, surpassing the previous method on mAP/Rank1 by up to 5.0%/2.7% on +CARGO and 3.7%/5.2% on AG-ReID, keeping the same magnitude of computational +complexity. Our project is available at https://github.com/LinlyAC/VDT-AGPReID",cs.CV,['cs.CV'] +Towards Accurate Post-training Quantization for Diffusion Models,Changyuan Wang · Ziwei Wang · Xiuwei Xu · Yansong Tang · Jie Zhou · Jiwen Lu, ,https://arxiv.org/abs/2404.05662,,2404.05662.pdf,Towards Accurate Binarization of Diffusion Model,"With the advancement of diffusion models (DMs) and the substantially +increased computational requirements, quantization emerges as a practical +solution to obtain compact and efficient low-bit DMs. However, the highly +discrete representation leads to severe accuracy degradation, hindering the +quantization of diffusion models to ultra-low bit-widths. This paper proposes a +novel quantization-aware training approach for DMs, namely BinaryDM. The +proposed method pushes DMs' weights toward accurate and efficient binarization, +considering the representation and computation properties. From the +representation perspective, we present a Learnable Multi-basis Binarizer (LMB) +to recover the representations generated by the binarized DM. The LMB enhances +detailed information through the flexible combination of dual binary bases +while applying to parameter-sparse locations of DM architectures to achieve +minor burdens. From the optimization perspective, a Low-rank Representation +Mimicking (LRM) is applied to assist the optimization of binarized DMs. The LRM +mimics the representations of full-precision DMs in low-rank space, alleviating +the direction ambiguity of the optimization process caused by fine-grained +alignment. Moreover, a quick progressive warm-up is applied to BinaryDM, +avoiding convergence difficulties by layerwisely progressive quantization at +the beginning of training. Comprehensive experiments demonstrate that BinaryDM +achieves significant accuracy and efficiency gains compared to SOTA +quantization methods of DMs under ultra-low bit-widths. With 1.1-bit weight and +4-bit activation (W1.1A4), BinaryDM achieves as low as 7.11 FID and saves the +performance from collapse (baseline FID 39.69). 
As the first binarization +method for diffusion models, W1.1A4 BinaryDM achieves impressive 9.3 times OPs +and 24.8 times model size savings, showcasing its substantial potential for +edge deployment.",cs.CV,['cs.CV'] +Density-Adaptive Model Based on Motif Matrix for Multi-Agent Trajectory Prediction,Di Wen · Haoran Xu · Zhaocheng He · Zhe Wu · Guang Tan · Peixi Peng, ,,https://ietresearch.onlinelibrary.wiley.com/doi/full/10.1049/itr2.12502,,,,,nan +Improving Transferable Targeted Adversarial Attacks with Model Self-Enhancement,Han Wu · Guanyan Ou · Weibin Wu · Zibin Zheng, ,https://arxiv.org/abs/2312.04913,,2312.04913.pdf,SA-Attack: Improving Adversarial Transferability of Vision-Language Pre-training Models via Self-Augmentation,"Current Visual-Language Pre-training (VLP) models are vulnerable to +adversarial examples. These adversarial examples present substantial security +risks to VLP models, as they can leverage inherent weaknesses in the models, +resulting in incorrect predictions. In contrast to white-box adversarial +attacks, transfer attacks (where the adversary crafts adversarial examples on a +white-box model to fool another black-box model) are more reflective of +real-world scenarios, thus making them more meaningful for research. By +summarizing and analyzing existing research, we identified two factors that can +influence the efficacy of transfer attacks on VLP models: inter-modal +interaction and data diversity. Based on these insights, we propose a +self-augment-based transfer attack method, termed SA-Attack. Specifically, +during the generation of adversarial images and adversarial texts, we apply +different data augmentation methods to the image modality and text modality, +respectively, with the aim of improving the adversarial transferability of the +generated adversarial images and texts. Experiments conducted on the FLickr30K +and COCO datasets have validated the effectiveness of our method. Our code will +be available after this paper is accepted.",cs.CV,"['cs.CV', 'cs.AI', 'cs.CR', 'cs.LG']" +Disentangled Prompt Representation for Domain Generalization,De Cheng · Zhipeng Xu · XINYANG JIANG · Nannan Wang · Dongsheng Li · Xinbo Gao, ,https://arxiv.org/abs/2403.08506,,,DiPrompT: Disentangled Prompt Tuning for Multiple Latent Domain Generalization in Federated Learning,"Federated learning (FL) has emerged as a powerful paradigm for learning from +decentralized data, and federated domain generalization further considers the +test dataset (target domain) is absent from the decentralized training data +(source domains). However, most existing FL methods assume that domain labels +are provided during training, and their evaluation imposes explicit constraints +on the number of domains, which must strictly match the number of clients. +Because of the underutilization of numerous edge devices and additional +cross-client domain annotations in the real world, such restrictions may be +impractical and involve potential privacy leaks. In this paper, we propose an +efficient and novel approach, called Disentangled Prompt Tuning (DiPrompT), a +method that tackles the above restrictions by learning adaptive prompts for +domain generalization in a distributed manner. Specifically, we first design +two types of prompts, i.e., global prompt to capture general knowledge across +all clients and domain prompts to capture domain-specific knowledge. They +eliminate the restriction on the one-to-one mapping between source domains and +local clients. 
Furthermore, a dynamic query metric is introduced to +automatically search the suitable domain label for each sample, which includes +two-substep text-image alignments based on prompt tuning without +labor-intensive annotation. Extensive experiments on multiple datasets +demonstrate that our DiPrompT achieves superior domain generalization +performance over state-of-the-art FL methods when domain labels are not +provided, and even outperforms many centralized learning methods using domain +labels.",cs.LG,"['cs.LG', 'cs.AI', 'cs.CV']" +Towards More Accurate Diffusion Model Acceleration with A Timestep Aligner,Mengfei Xia · Yujun Shen · Changsong Lei · Yu Zhou · Deli Zhao · Ran Yi · Wenping Wang · Yong-Jin Liu, ,https://arxiv.org/abs/2310.09469,,2310.09469.pdf,Towards More Accurate Diffusion Model Acceleration with A Timestep Aligner,"A diffusion model, which is formulated to produce an image using thousands of +denoising steps, usually suffers from a slow inference speed. Existing +acceleration algorithms simplify the sampling by skipping most steps yet +exhibit considerable performance degradation. By viewing the generation of +diffusion models as a discretized integrating process, we argue that the +quality drop is partly caused by applying an inaccurate integral direction to a +timestep interval. To rectify this issue, we propose a timestep aligner that +helps find a more accurate integral direction for a particular interval at the +minimum cost. Specifically, at each denoising step, we replace the original +parameterization by conditioning the network on a new timestep, which is +obtained by aligning the sampling distribution to the real distribution. +Extensive experiments show that our plug-in design can be trained efficiently +and boost the inference performance of various state-of-the-art acceleration +methods, especially when there are few denoising steps. For example, when using +10 denoising steps on the popular LSUN Bedroom dataset, we improve the FID of +DDIM from 9.65 to 6.07, simply by adopting our method for a more appropriate +set of timesteps. Code will be made publicly available.",cs.CV,['cs.CV'] +AutoAD III: The Prequel -- Back to the Pixels,Tengda Han · Max Bain · Arsha Nagrani · Gül Varol · Weidi Xie · Andrew Zisserman, ,https://arxiv.org/abs/2404.14412v1,,2404.14412v1.pdf,AutoAD III: The Prequel -- Back to the Pixels,"Generating Audio Description (AD) for movies is a challenging task that +requires fine-grained visual understanding and an awareness of the characters +and their names. Currently, visual language models for AD generation are +limited by a lack of suitable training data, and also their evaluation is +hampered by using performance measures not specialized to the AD domain. In +this paper, we make three contributions: (i) We propose two approaches for +constructing AD datasets with aligned video data, and build training and +evaluation datasets using these. These datasets will be publicly released; (ii) +We develop a Q-former-based architecture which ingests raw video and generates +AD, using frozen pre-trained visual encoders and large language models; and +(iii) We provide new evaluation metrics to benchmark AD quality that are +well-matched to human performance. 
Taken together, we improve the state of the +art on AD generation.",cs.CV,['cs.CV'] +Attention-Propagation Network for Egocentric Heatmap to 3D Pose Lifting,Taeho Kang · Youngki Lee,https://tho-kn.github.io/projects/EgoTAP/,https://arxiv.org/abs/2402.18330,,2402.18330.pdf,Attention-Propagation Network for Egocentric Heatmap to 3D Pose Lifting,"We present EgoTAP, a heatmap-to-3D pose lifting method for highly accurate +stereo egocentric 3D pose estimation. Severe self-occlusion and out-of-view +limbs in egocentric camera views make accurate pose estimation a challenging +problem. To address the challenge, prior methods employ joint +heatmaps-probabilistic 2D representations of the body pose, but heatmap-to-3D +pose conversion still remains an inaccurate process. We propose a novel +heatmap-to-3D lifting method composed of the Grid ViT Encoder and the +Propagation Network. The Grid ViT Encoder summarizes joint heatmaps into +effective feature embedding using self-attention. Then, the Propagation Network +estimates the 3D pose by utilizing skeletal information to better estimate the +position of obscure joints. Our method significantly outperforms the previous +state-of-the-art qualitatively and quantitatively demonstrated by a 23.9\% +reduction of error in an MPJPE metric. Our source code is available in GitHub.",cs.CV,['cs.CV'] +Data Valuation and Detections in Federated Learning,Wenqian Li · Shuran Fu · Fengrui Zhang · Yan Pang,https://github.com/muz1lee/MOTdata/tree/main,https://arxiv.org/abs/2311.05304v2,,2311.05304v2.pdf,Data Valuation and Detections in Federated Learning,"Federated Learning (FL) enables collaborative model training while preserving +the privacy of raw data. A challenge in this framework is the fair and +efficient valuation of data, which is crucial for incentivizing clients to +contribute high-quality data in the FL task. In scenarios involving numerous +data clients within FL, it is often the case that only a subset of clients and +datasets are pertinent to a specific learning task, while others might have +either a negative or negligible impact on the model training process. This +paper introduces a novel privacy-preserving method for evaluating client +contributions and selecting relevant datasets without a pre-specified training +algorithm in an FL task. Our proposed approach FedBary, utilizes Wasserstein +distance within the federated context, offering a new solution for data +valuation in the FL framework. This method ensures transparent data valuation +and efficient computation of the Wasserstein barycenter and reduces the +dependence on validation datasets. Through extensive empirical experiments and +theoretical analyses, we demonstrate the potential of this data valuation +method as a promising avenue for FL research.",cs.LG,"['cs.LG', 'cs.AI', 'cs.CR']" +WALT3D: Generating Realistic Training Data from Time-Lapse Imagery for Reconstructing Dynamic Objects under Occlusion,Khiem Vuong · N. Dinesh Reddy · Robert Tamburo · Srinivasa G. Narasimhan, ,https://arxiv.org/abs/2403.19022,,2403.19022.pdf,WALT3D: Generating Realistic Training Data from Time-Lapse Imagery for Reconstructing Dynamic Objects under Occlusion,"Current methods for 2D and 3D object understanding struggle with severe +occlusions in busy urban environments, partly due to the lack of large-scale +labeled ground-truth annotations for learning occlusion. 
In this work, we +introduce a novel framework for automatically generating a large, realistic +dataset of dynamic objects under occlusions using freely available time-lapse +imagery. By leveraging off-the-shelf 2D (bounding box, segmentation, keypoint) +and 3D (pose, shape) predictions as pseudo-groundtruth, unoccluded 3D objects +are identified automatically and composited into the background in a clip-art +style, ensuring realistic appearances and physically accurate occlusion +configurations. The resulting clip-art image with pseudo-groundtruth enables +efficient training of object reconstruction methods that are robust to +occlusions. Our method demonstrates significant improvements in both 2D and 3D +reconstruction, particularly in scenarios with heavily occluded objects like +vehicles and people in urban scenes.",cs.CV,['cs.CV'] +Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models,Yushi Hu · Otilia Stretcu · Chun-Ta Lu · Krishnamurthy Viswanathan · Kenji Hata · Enming Luo · Ranjay Krishna · Ariel Fuxman, ,https://arxiv.org/abs/2312.03052,,2312.03052.pdf,Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models,"Solving complex visual tasks such as ""Who invented the musical instrument on +the right?"" involves a composition of skills: understanding space, recognizing +instruments, and also retrieving prior knowledge. Recent work shows promise by +decomposing such tasks using a large language model (LLM) into an executable +program that invokes specialized vision models. However, generated programs are +error-prone: they omit necessary steps, include spurious ones, and are unable +to recover when the specialized models give incorrect outputs. Moreover, they +require loading multiple models, incurring high latency and computation costs. +We propose Visual Program Distillation (VPD), an instruction tuning framework +that produces a vision-language model (VLM) capable of solving complex visual +tasks with a single forward pass. VPD distills the reasoning ability of LLMs by +using them to sample multiple candidate programs, which are then executed and +verified to identify a correct one. It translates each correct program into a +language description of the reasoning steps, which are then distilled into a +VLM. Extensive experiments show that VPD improves the VLM's ability to count, +understand spatial relations, and reason compositionally. Our VPD-trained +PaLI-X outperforms all prior VLMs, achieving state-of-the-art performance +across complex vision tasks, including MMBench, OK-VQA, A-OKVQA, TallyQA, POPE, +and Hateful Memes. An evaluation with human annotators also confirms that VPD +improves model response factuality and consistency. 
Finally, experiments on +content moderation demonstrate that VPD is also helpful for adaptation to +real-world applications with limited data.",cs.CV,"['cs.CV', 'cs.CL']" +Learning Multi-dimensional Human Preference for Text-to-Image Generation,Sixian Zhang · Bohan Wang · Junqiang Wu · Yan Li · Tingting Gao · Di ZHANG · Zhongyuan Wang,https://wangbohan97.github.io/MPS/,,,,,,,nan +IQ-VFI: Implicit Quadratic Motion Estimation for Video Frame Interpolation,Mengshun Hu · Kui Jiang · Zhihang Zhong · Zheng Wang · Yinqiang Zheng, ,https://arxiv.org/abs/2404.13534,,2404.13534.pdf,Motion-aware Latent Diffusion Models for Video Frame Interpolation,"With the advancement of AIGC, video frame interpolation (VFI) has become a +crucial component in existing video generation frameworks, attracting +widespread research interest. For the VFI task, the motion estimation between +neighboring frames plays a crucial role in avoiding motion ambiguity. However, +existing VFI methods always struggle to accurately predict the motion +information between consecutive frames, and this imprecise estimation leads to +blurred and visually incoherent interpolated frames. In this paper, we propose +a novel diffusion framework, motion-aware latent diffusion models (MADiff), +which is specifically designed for the VFI task. By incorporating motion priors +between the conditional neighboring frames with the target interpolated frame +predicted throughout the diffusion sampling procedure, MADiff progressively +refines the intermediate outcomes, culminating in generating both visually +smooth and realistic results. Extensive experiments conducted on benchmark +datasets demonstrate that our method achieves state-of-the-art performance +significantly outperforming existing approaches, especially under challenging +scenarios involving dynamic textures with complex motion.",cs.CV,['cs.CV'] +MVD-Fusion: Single-view 3D via Depth-consistent Multi-view Generation,Hanzhe Hu · Zhizhuo Zhou · Varun Jampani · Shubham Tulsiani, ,https://arxiv.org/abs/2404.03656,,2404.03656.pdf,MVD-Fusion: Single-view 3D via Depth-consistent Multi-view Generation,"We present MVD-Fusion: a method for single-view 3D inference via generative +modeling of multi-view-consistent RGB-D images. While recent methods pursuing +3D inference advocate learning novel-view generative models, these generations +are not 3D-consistent and require a distillation process to generate a 3D +output. We instead cast the task of 3D inference as directly generating +mutually-consistent multiple views and build on the insight that additionally +inferring depth can provide a mechanism for enforcing this consistency. +Specifically, we train a denoising diffusion model to generate multi-view RGB-D +images given a single RGB input image and leverage the (intermediate noisy) +depth estimates to obtain reprojection-based conditioning to maintain +multi-view consistency. We train our model using large-scale synthetic dataset +Obajverse as well as the real-world CO3D dataset comprising of generic camera +viewpoints. We demonstrate that our approach can yield more accurate synthesis +compared to recent state-of-the-art, including distillation-based 3D inference +and prior multi-view generation methods. 
We also evaluate the geometry induced +by our multi-view depth prediction and find that it yields a more accurate +representation than other direct 3D inference approaches.",cs.CV,['cs.CV'] +Video ReCap: Recursive Captioning of Hour-Long Videos,Md Mohaiminul Islam · Vu Bao Ngan Ho · Xitong Yang · Tushar Nagarajan · Lorenzo Torresani · Gedas Bertasius, ,https://arxiv.org/abs/2402.13250,,2402.13250.pdf,Video ReCap: Recursive Captioning of Hour-Long Videos,"Most video captioning models are designed to process short video clips of few +seconds and output text describing low-level visual concepts (e.g., objects, +scenes, atomic actions). However, most real-world videos last for minutes or +hours and have a complex hierarchical structure spanning different temporal +granularities. We propose Video ReCap, a recursive video captioning model that +can process video inputs of dramatically different lengths (from 1 second to 2 +hours) and output video captions at multiple hierarchy levels. The recursive +video-language architecture exploits the synergy between different video +hierarchies and can process hour-long videos efficiently. We utilize a +curriculum learning training scheme to learn the hierarchical structure of +videos, starting from clip-level captions describing atomic actions, then +focusing on segment-level descriptions, and concluding with generating +summaries for hour-long videos. Furthermore, we introduce Ego4D-HCap dataset by +augmenting Ego4D with 8,267 manually collected long-range video summaries. Our +recursive model can flexibly generate captions at different hierarchy levels +while also being useful for other complex video understanding tasks, such as +VideoQA on EgoSchema. Data, code, and models are available at: +https://sites.google.com/view/vidrecap",cs.CV,['cs.CV'] +SuperSVG: Superpixel-based Scalable Vector Graphics Synthesis,Teng Hu · Ran Yi · Baihong Qian · Jiangning Zhang · Paul L. Rosin · Yu-Kun Lai, ,https://arxiv.org/html/2405.02962v1,,2405.02962v1.pdf,VectorPainter: A Novel Approach to Stylized Vector Graphics Synthesis with Vectorized Strokes,"We propose a novel method, VectorPainter, for the task of stylized vector +graphics synthesis. Given a text prompt and a reference style image, +VectorPainter generates a vector graphic that aligns in content with the text +prompt and remains faithful in style to the reference image. We recognize that +the key to this task lies in fully leveraging the intrinsic properties of +vector graphics. Innovatively, we conceptualize the stylization process as the +rearrangement of vectorized strokes extracted from the reference image. +VectorPainter employs an optimization-based pipeline. It begins by extracting +vectorized strokes from the reference image, which are then used to initialize +the synthesis process. To ensure fidelity to the reference style, a novel style +preservation loss is introduced. Extensive experiments have been conducted to +demonstrate that our method is capable of aligning with the text description +while remaining faithful to the reference image.",cs.CV,['cs.CV'] +GeoAuxNet: Towards Universal 3D Representation Learning for Multi-sensor Point Clouds,Shengjun Zhang · Xin Fei · Yueqi Duan, ,https://arxiv.org/abs/2403.19220,,2403.19220.pdf,GeoAuxNet: Towards Universal 3D Representation Learning for Multi-sensor Point Clouds,"Point clouds captured by different sensors such as RGB-D cameras and LiDAR +possess non-negligible domain gaps. 
Most existing methods design different +network architectures and train separately on point clouds from various +sensors. Typically, point-based methods achieve outstanding performances on +even-distributed dense point clouds from RGB-D cameras, while voxel-based +methods are more efficient for large-range sparse LiDAR point clouds. In this +paper, we propose geometry-to-voxel auxiliary learning to enable voxel +representations to access point-level geometric information, which supports +better generalisation of the voxel-based backbone with additional +interpretations of multi-sensor point clouds. Specifically, we construct +hierarchical geometry pools generated by a voxel-guided dynamic point network, +which efficiently provide auxiliary fine-grained geometric information adapted +to different stages of voxel features. We conduct experiments on joint +multi-sensor datasets to demonstrate the effectiveness of GeoAuxNet. Enjoying +elaborate geometric information, our method outperforms other models +collectively trained on multi-sensor datasets, and achieve competitive results +with the-state-of-art experts on each single dataset.",cs.CV,['cs.CV'] +BOTH2Hands: Inferring 3D Hands from Both Text Prompts and Body Dynamics,Wenqian Zhang · Molin Huang · Yuxuan Zhou · Juze Zhang · Jingyi Yu · Jingya Wang · Lan Xu,https://github.com/Godheritage/BOTH2Hands,https://arxiv.org/abs/2312.07937,,2312.07937.pdf,BOTH2Hands: Inferring 3D Hands from Both Text Prompts and Body Dynamics,"The recently emerging text-to-motion advances have spired numerous attempts +for convenient and interactive human motion generation. Yet, existing methods +are largely limited to generating body motions only without considering the +rich two-hand motions, let alone handling various conditions like body dynamics +or texts. To break the data bottleneck, we propose BOTH57M, a novel multi-modal +dataset for two-hand motion generation. Our dataset includes accurate motion +tracking for the human body and hands and provides pair-wised finger-level hand +annotations and body descriptions. We further provide a strong baseline method, +BOTH2Hands, for the novel task: generating vivid two-hand motions from both +implicit body dynamics and explicit text prompts. We first warm up two parallel +body-to-hand and text-to-hand diffusion models and then utilize the +cross-attention transformer for motion blending. Extensive experiments and +cross-validations demonstrate the effectiveness of our approach and dataset for +generating convincing two-hand motions from the hybrid body-and-textual +conditions. Our dataset and code will be disseminated to the community for +future research.",cs.CV,['cs.CV'] +Paint3D: Paint Anything 3D with Lighting-less Texture Diffusion Models,Xianfang Zeng · Xin Chen · Zhongqi Qi · Wen Liu · Zibo Zhao · Zhibin Wang · Bin Fu · Yong Liu · Gang Yu, ,https://arxiv.org/abs/2312.13913,,2312.13913.pdf,Paint3D: Paint Anything 3D with Lighting-Less Texture Diffusion Models,"This paper presents Paint3D, a novel coarse-to-fine generative framework that +is capable of producing high-resolution, lighting-less, and diverse 2K UV +texture maps for untextured 3D meshes conditioned on text or image inputs. The +key challenge addressed is generating high-quality textures without embedded +illumination information, which allows the textures to be re-lighted or +re-edited within modern graphics pipelines. 
To achieve this, our method first +leverages a pre-trained depth-aware 2D diffusion model to generate +view-conditional images and perform multi-view texture fusion, producing an +initial coarse texture map. However, as 2D models cannot fully represent 3D +shapes and disable lighting effects, the coarse texture map exhibits incomplete +areas and illumination artifacts. To resolve this, we train separate UV +Inpainting and UVHD diffusion models specialized for the shape-aware refinement +of incomplete areas and the removal of illumination artifacts. Through this +coarse-to-fine process, Paint3D can produce high-quality 2K UV textures that +maintain semantic consistency while being lighting-less, significantly +advancing the state-of-the-art in texturing 3D objects.",cs.CV,['cs.CV'] +Overload: Latency Attacks on Object Detection for Edge Devices,Erh-Chung Chen · Pin-Yu Chen · I-Hsin Chung · Che-Rung Lee, ,https://ar5iv.labs.arxiv.org/html/2304.05370,,2304.05370.pdf,Overload: Latency Attacks on Object Detection for Edge Devices,"Nowadays, the deployment of deep learning-based applications is an essential +task owing to the increasing demands on intelligent services. In this paper, we +investigate latency attacks on deep learning applications. Unlike common +adversarial attacks for misclassification, the goal of latency attacks is to +increase the inference time, which may stop applications from responding to the +requests within a reasonable time. This kind of attack is ubiquitous for +various applications, and we use object detection to demonstrate how such kind +of attacks work. We also design a framework named Overload to generate latency +attacks at scale. Our method is based on a newly formulated optimization +problem and a novel technique, called spatial attention. This attack serves to +escalate the required computing costs during the inference time, consequently +leading to an extended inference time for object detection. It presents a +significant threat, especially to systems with limited computing resources. We +conducted experiments using YOLOv5 models on Nvidia NX. Compared to existing +methods, our method is simpler and more effective. The experimental results +show that with latency attacks, the inference time of a single image can be +increased ten times longer in reference to the normal setting. Moreover, our +findings pose a potential new threat to all object detection tasks requiring +non-maximum suppression (NMS), as our attack is NMS-agnostic.",cs.CV,['cs.CV'] +OmniGlue: Generalizable Feature Matching with Foundation Model Guidance,Hanwen Jiang · Arjun Karpur · Bingyi Cao · Qixing Huang · André Araujo, ,https://arxiv.org/abs/2405.12979,,2405.12979.pdf,OmniGlue: Generalizable Feature Matching with Foundation Model Guidance,"The image matching field has been witnessing a continuous emergence of novel +learnable feature matching techniques, with ever-improving performance on +conventional benchmarks. However, our investigation shows that despite these +gains, their potential for real-world applications is restricted by their +limited generalization capabilities to novel image domains. In this paper, we +introduce OmniGlue, the first learnable image matcher that is designed with +generalization as a core principle. OmniGlue leverages broad knowledge from a +vision foundation model to guide the feature matching process, boosting +generalization to domains not seen at training time. 
Additionally, we propose a +novel keypoint position-guided attention mechanism which disentangles spatial +and appearance information, leading to enhanced matching descriptors. We +perform comprehensive experiments on a suite of $7$ datasets with varied image +domains, including scene-level, object-centric and aerial images. OmniGlue's +novel components lead to relative gains on unseen domains of $20.9\%$ with +respect to a directly comparable reference model, while also outperforming the +recent LightGlue method by $9.5\%$ relatively.Code and model can be found at +https://hwjiang1510.github.io/OmniGlue",cs.CV,['cs.CV'] +InstaGen: Enhancing Object Detection by Training on Synthetic Dataset,Chengjian Feng · Yujie Zhong · Zequn Jie · Weidi Xie · Lin Ma, ,https://arxiv.org/abs/2402.05937,,2402.05937.pdf,InstaGen: Enhancing Object Detection by Training on Synthetic Dataset,"In this paper, we present a novel paradigm to enhance the ability of object +detector, e.g., expanding categories or improving detection performance, by +training on synthetic dataset generated from diffusion models. Specifically, we +integrate an instance-level grounding head into a pre-trained, generative +diffusion model, to augment it with the ability of localising instances in the +generated images. The grounding head is trained to align the text embedding of +category names with the regional visual feature of the diffusion model, using +supervision from an off-the-shelf object detector, and a novel self-training +scheme on (novel) categories not covered by the detector. We conduct thorough +experiments to show that, this enhanced version of diffusion model, termed as +InstaGen, can serve as a data synthesizer, to enhance object detectors by +training on its generated samples, demonstrating superior performance over +existing state-of-the-art methods in open-vocabulary (+4.5 AP) and data-sparse +(+1.2 to 5.2 AP) scenarios. Project page with code: +https://fcjian.github.io/InstaGen.",cs.CV,['cs.CV'] +LTM: Lightweight Textured Mesh Extraction and Refinement of Large Unbounded Scenes for Efficient Storage and Real-time Rendering,Jaehoon Choi · Rajvi Shah · Qinbo Li · Yipeng Wang · Ayush Saraf · Changil Kim · Jia-Bin Huang · Dinesh Manocha · Suhib Alsisan · Johannes Kopf,https://jh-choi.github.io/LTMM,https://arxiv.org/html/2404.15891v2,,2404.15891v2.pdf,OMEGAS: Object Mesh Extraction from Large Scenes Guided by Gaussian Segmentation,"Recent advancements in 3D reconstruction technologies have paved the way for +high-quality and real-time rendering of complex 3D scenes. Despite these +achievements, a notable challenge persists: it is difficult to precisely +reconstruct specific objects from large scenes. Current scene reconstruction +techniques frequently result in the loss of object detail textures and are +unable to reconstruct object portions that are occluded or unseen in views. To +address this challenge, we delve into the meticulous 3D reconstruction of +specific objects within large scenes and propose a framework termed OMEGAS: +Object Mesh Extraction from Large Scenes Guided by GAussian Segmentation. +OMEGAS employs a multi-step approach, grounded in several excellent +off-the-shelf methodologies. Specifically, initially, we utilize the Segment +Anything Model (SAM) to guide the segmentation of 3D Gaussian Splatting (3DGS), +thereby creating a basic 3DGS model of the target object. 
Then, we leverage +large-scale diffusion priors to further refine the details of the 3DGS model, +especially aimed at addressing invisible or occluded object portions from the +original scene views. Subsequently, by re-rendering the 3DGS model onto the +scene views, we achieve accurate object segmentation and effectively remove the +background. Finally, these target-only images are used to improve the 3DGS +model further and extract the definitive 3D object mesh by the SuGaR model. In +various scenarios, our experiments demonstrate that OMEGAS significantly +surpasses existing scene reconstruction methods. Our project page is at: +https://github.com/CrystalWlz/OMEGAS",cs.CV,['cs.CV'] +Open-Vocabulary Attention Maps with Token Optimization for Semantic Segmentation in Diffusion Models,Pablo Marcos-Manchón · Roberto Alcover-Couso · Juan SanMiguel · Jose M. Martinez,https://github.com/vpulab/ovam,https://arxiv.org/abs/2403.14291v1,,2403.14291v1.pdf,Open-Vocabulary Attention Maps with Token Optimization for Semantic Segmentation in Diffusion Models,"Diffusion models represent a new paradigm in text-to-image generation. Beyond +generating high-quality images from text prompts, models such as Stable +Diffusion have been successfully extended to the joint generation of semantic +segmentation pseudo-masks. However, current extensions primarily rely on +extracting attentions linked to prompt words used for image synthesis. This +approach limits the generation of segmentation masks derived from word tokens +not contained in the text prompt. In this work, we introduce Open-Vocabulary +Attention Maps (OVAM)-a training-free method for text-to-image diffusion models +that enables the generation of attention maps for any word. In addition, we +propose a lightweight optimization process based on OVAM for finding tokens +that generate accurate attention maps for an object class with a single +annotation. We evaluate these tokens within existing state-of-the-art Stable +Diffusion extensions. The best-performing model improves its mIoU from 52.1 to +86.6 for the synthetic images' pseudo-masks, demonstrating that our optimized +tokens are an efficient way to improve the performance of existing methods +without architectural changes or retraining.",cs.CV,['cs.CV'] +Leveraging Frame Affinity for sRGB-to-RAW Video De-rendering,Chen Zhang · Wencheng Han · Yang Zhou · Jianbing Shen · Cheng-Zhong Xu · Wentao Liu, ,https://arxiv.org/abs/2404.09490,,2404.09490.pdf,Leveraging Temporal Contextualization for Video Action Recognition,"Pretrained vision-language models have shown effectiveness in video +understanding. However, recent studies have not sufficiently leveraged +essential temporal information from videos, simply averaging frame-wise +representations or referencing consecutive frames. We introduce Temporally +Contextualized CLIP (TC-CLIP), a pioneering framework for video understanding +that effectively and efficiently leverages comprehensive video information. We +propose Temporal Contextualization (TC), a novel layer-wise temporal +information infusion mechanism for video that extracts core information from +each frame, interconnects relevant information across the video to summarize +into context tokens, and ultimately leverages the context tokens during the +feature encoding process. Furthermore, our Video-conditional Prompting (VP) +module manufactures context tokens to generate informative prompts in text +modality. 
We conduct extensive experiments in zero-shot, few-shot, +base-to-novel, and fully-supervised action recognition to validate the +superiority of our TC-CLIP. Ablation studies for TC and VP guarantee our design +choices. Code is available at https://github.com/naver-ai/tc-clip",cs.CV,['cs.CV'] +UDiFF: Generating Conditional Unsigned Distance Fields with Optimal Wavelet Diffusion,Junsheng Zhou · Weiqi Zhang · Baorui Ma · Kanle Shi · Yu-Shen Liu · Zhizhong Han, ,https://arxiv.org/abs/2404.06851,,2404.06851.pdf,UDiFF: Generating Conditional Unsigned Distance Fields with Optimal Wavelet Diffusion,"Diffusion models have shown remarkable results for image generation, editing +and inpainting. Recent works explore diffusion models for 3D shape generation +with neural implicit functions, i.e., signed distance function and occupancy +function. However, they are limited to shapes with closed surfaces, which +prevents them from generating diverse 3D real-world contents containing open +surfaces. In this work, we present UDiFF, a 3D diffusion model for unsigned +distance fields (UDFs) which is capable to generate textured 3D shapes with +open surfaces from text conditions or unconditionally. Our key idea is to +generate UDFs in spatial-frequency domain with an optimal wavelet +transformation, which produces a compact representation space for UDF +generation. Specifically, instead of selecting an appropriate wavelet +transformation which requires expensive manual efforts and still leads to large +information loss, we propose a data-driven approach to learn the optimal +wavelet transformation for UDFs. We evaluate UDiFF to show our advantages by +numerical and visual comparisons with the latest methods on widely used +benchmarks. Page: https://weiqi-zhang.github.io/UDiFF.",cs.CV,['cs.CV'] +OVFoodSeg: Elevating Open-Vocabulary Food Image Segmentation via Image-Informed Textual Representation,Xiongwei Wu · Sicheng Yu · Ee-Peng Lim · Chong Wah Ngo, ,https://arxiv.org/abs/2404.01409,,2404.01409.pdf,OVFoodSeg: Elevating Open-Vocabulary Food Image Segmentation via Image-Informed Textual Representation,"In the realm of food computing, segmenting ingredients from images poses +substantial challenges due to the large intra-class variance among the same +ingredients, the emergence of new ingredients, and the high annotation costs +associated with large food segmentation datasets. Existing approaches primarily +utilize a closed-vocabulary and static text embeddings setting. These methods +often fall short in effectively handling the ingredients, particularly new and +diverse ones. In response to these limitations, we introduce OVFoodSeg, a +framework that adopts an open-vocabulary setting and enhances text embeddings +with visual context. By integrating vision-language models (VLMs), our approach +enriches text embedding with image-specific information through two innovative +modules, eg, an image-to-text learner FoodLearner and an Image-Informed Text +Encoder. The training process of OVFoodSeg is divided into two stages: the +pre-training of FoodLearner and the subsequent learning phase for segmentation. +The pre-training phase equips FoodLearner with the capability to align visual +information with corresponding textual representations that are specifically +related to food, while the second phase adapts both the FoodLearner and the +Image-Informed Text Encoder for the segmentation task. 
By addressing the +deficiencies of previous models, OVFoodSeg demonstrates a significant +improvement, achieving an 4.9\% increase in mean Intersection over Union (mIoU) +on the FoodSeg103 dataset, setting a new milestone for food image segmentation.",cs.CV,"['cs.CV', 'cs.AI', 'cs.MM']" +LaneCPP: Continuous 3D Lane Detection using Physical Priors,Maximilian Pittner · Joel Janai · Alexandru Paul Condurache, ,https://arxiv.org/abs/2401.08036,,2401.08036.pdf,3D Lane Detection from Front or Surround-View using Joint-Modeling & Matching,"3D lanes offer a more comprehensive understanding of the road surface +geometry than 2D lanes, thereby providing crucial references for driving +decisions and trajectory planning. While many efforts aim to improve prediction +accuracy, we recognize that an efficient network can bring results closer to +lane modeling. However, if the modeling data is imprecise, the results might +not accurately capture the real-world scenario. Therefore, accurate lane +modeling is essential to align prediction results closely with the environment. +This study centers on efficient and accurate lane modeling, proposing a joint +modeling approach that combines Bezier curves and interpolation methods. +Furthermore, based on this lane modeling approach, we developed a Global2Local +Lane Matching method with Bezier Control-Point and Key-Point, which serve as a +comprehensive solution that leverages hierarchical features with two +mathematical models to ensure a precise match. We also introduce a novel 3D +Spatial Encoder, representing an exploration of 3D surround-view lane detection +research. The framework is suitable for front-view or surround-view 3D lane +detection. By directly outputting the key points of lanes in 3D space, it +overcomes the limitations of anchor-based methods, enabling accurate prediction +of closed-loop or U-shaped lanes and effective adaptation to complex road +conditions. This innovative method establishes a new benchmark in front-view 3D +lane detection on the Openlane dataset and achieves competitive performance in +surround-view 2D lane detection on the Argoverse2 dataset.",cs.CV,['cs.CV'] +MonoNPHM: Dynamic Head Reconstruction from Monocular Videos,Simon Giebenhain · Tobias Kirschstein · Markos Georgopoulos · Martin Rünz · Lourdes Agapito · Matthias Nießner,https://simongiebenhain.github.io/MonoNPHM/,https://arxiv.org/abs/2312.06740,,2312.06740.pdf,MonoNPHM: Dynamic Head Reconstruction from Monocular Videos,"We present Monocular Neural Parametric Head Models (MonoNPHM) for dynamic 3D +head reconstructions from monocular RGB videos. To this end, we propose a +latent appearance space that parameterizes a texture field on top of a neural +parametric model. We constrain predicted color values to be correlated with the +underlying geometry such that gradients from RGB effectively influence latent +geometry codes during inverse rendering. To increase the representational +capacity of our expression space, we augment our backward deformation field +with hyper-dimensions, thus improving color and geometry representation in +topologically challenging expressions. Using MonoNPHM as a learned prior, we +approach the task of 3D head reconstruction using signed distance field based +volumetric rendering. By numerically inverting our backward deformation field, +we incorporated a landmark loss using facial anchor points that are closely +tied to our canonical geometry representation. 
To evaluate the task of dynamic +face reconstruction from monocular RGB videos we record 20 challenging Kinect +sequences under casual conditions. MonoNPHM outperforms all baselines with a +significant margin, and makes an important step towards easily accessible +neural parametric face models through RGB tracking.",cs.CV,['cs.CV'] +Retrieval-Augmented Egocentric Video Captioning,Jilan Xu · Yifei Huang · Junlin Hou · Guo Chen · Yuejie Zhang · Rui Feng · Weidi Xie, ,https://arxiv.org/abs/2401.00789,,2401.00789.pdf,Retrieval-Augmented Egocentric Video Captioning,"Understanding human actions from videos of first-person view poses +significant challenges. Most prior approaches explore representation learning +on egocentric videos only, while overlooking the potential benefit of +exploiting existing large-scale third-person videos. In this paper, (1) we +develop EgoInstructor, a retrieval-augmented multimodal captioning model that +automatically retrieves semantically relevant third-person instructional videos +to enhance the video captioning of egocentric videos. (2) For training the +cross-view retrieval module, we devise an automatic pipeline to discover +ego-exo video pairs from distinct large-scale egocentric and exocentric +datasets. (3) We train the cross-view retrieval module with a novel EgoExoNCE +loss that pulls egocentric and exocentric video features closer by aligning +them to shared text features that describe similar actions. (4) Through +extensive experiments, our cross-view retrieval module demonstrates superior +performance across seven benchmarks. Regarding egocentric video captioning, +EgoInstructor exhibits significant improvements by leveraging third-person +videos as references.",cs.CV,['cs.CV'] +Relaxed Contrastive Learning for Federated Learning,Seonguk Seo · Jinkyu Kim · Geeho Kim · Bohyung Han, ,https://arxiv.org/abs/2401.04928,,2401.04928.pdf,Relaxed Contrastive Learning for Federated Learning,"We propose a novel contrastive learning framework to effectively address the +challenges of data heterogeneity in federated learning. We first analyze the +inconsistency of gradient updates across clients during local training and +establish its dependence on the distribution of feature representations, +leading to the derivation of the supervised contrastive learning (SCL) +objective to mitigate local deviations. In addition, we show that a na\""ive +adoption of SCL in federated learning leads to representation collapse, +resulting in slow convergence and limited performance gains. To address this +issue, we introduce a relaxed contrastive learning loss that imposes a +divergence penalty on excessively similar sample pairs within each class. This +strategy prevents collapsed representations and enhances feature +transferability, facilitating collaborative training and leading to significant +performance improvements. Our framework outperforms all existing federated +learning approaches by huge margins on the standard benchmarks through +extensive experimental results.",cs.LG,['cs.LG'] +Rewrite the stars,Xu Ma · Xiyang Dai · Yue Bai · Yizhou Wang · Yun Fu, ,https://arxiv.org/abs/2403.19967,,2403.19967.pdf,Rewrite the Stars,"Recent studies have drawn attention to the untapped potential of the ""star +operation"" (element-wise multiplication) in network design. While intuitive +explanations abound, the foundational rationale behind its application remains +largely unexplored. 
Our study attempts to reveal the star operation's ability +to map inputs into high-dimensional, non-linear feature spaces -- akin to +kernel tricks -- without widening the network. We further introduce StarNet, a +simple yet powerful prototype, demonstrating impressive performance and low +latency under compact network structure and efficient budget. Like stars in the +sky, the star operation appears unremarkable but holds a vast universe of +potential. Our work encourages further exploration across tasks, with codes +available at https://github.com/ma-xu/Rewrite-the-Stars.",cs.CV,['cs.CV'] +PointInfinity: Resolution-Invariant Point Diffusion Models,Zixuan Huang · Justin Johnson · Shoubhik Debnath · James Rehg · Chao-Yuan Wu,https://zixuanh.com/projects/pointinfinity,https://arxiv.org/abs/2404.03566v1,,2404.03566v1.pdf,PointInfinity: Resolution-Invariant Point Diffusion Models,"We present PointInfinity, an efficient family of point cloud diffusion +models. Our core idea is to use a transformer-based architecture with a +fixed-size, resolution-invariant latent representation. This enables efficient +training with low-resolution point clouds, while allowing high-resolution point +clouds to be generated during inference. More importantly, we show that scaling +the test-time resolution beyond the training resolution improves the fidelity +of generated point clouds and surfaces. We analyze this phenomenon and draw a +link to classifier-free guidance commonly used in diffusion models, +demonstrating that both allow trading off fidelity and variability during +inference. Experiments on CO3D show that PointInfinity can efficiently generate +high-resolution point clouds (up to 131k points, 31 times more than Point-E) +with state-of-the-art quality.",cs.CV,['cs.CV'] +JeDi: Joint-Image Diffusion Models for Finetuning-Free Personalized Text-to-Image Generation,Yu Zeng · Vishal M. Patel · Haochen Wang · Xun Huang · Ting-Chun Wang · Ming-Yu Liu · Yogesh Balaji,https://research.nvidia.com/labs/dir/jedi/,https://arxiv.org/html/2307.04725v2,,2307.04725v2.pdf,AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning,"With the advance of text-to-image (T2I) diffusion models (e.g., Stable +Diffusion) and corresponding personalization techniques such as DreamBooth and +LoRA, everyone can manifest their imagination into high-quality images at an +affordable cost. However, adding motion dynamics to existing high-quality +personalized T2Is and enabling them to generate animations remains an open +challenge. In this paper, we present AnimateDiff, a practical framework for +animating personalized T2I models without requiring model-specific tuning. At +the core of our framework is a plug-and-play motion module that can be trained +once and seamlessly integrated into any personalized T2Is originating from the +same base T2I. Through our proposed training strategy, the motion module +effectively learns transferable motion priors from real-world videos. Once +trained, the motion module can be inserted into a personalized T2I model to +form a personalized animation generator. We further propose MotionLoRA, a +lightweight fine-tuning technique for AnimateDiff that enables a pre-trained +motion module to adapt to new motion patterns, such as different shot types, at +a low training and data collection cost. We evaluate AnimateDiff and MotionLoRA +on several public representative personalized T2I models collected from the +community. 
The results demonstrate that our approaches help these models +generate temporally smooth animation clips while preserving the visual quality +and motion diversity. Codes and pre-trained weights are available at +https://github.com/guoyww/AnimateDiff.",cs.CV,"['cs.CV', 'cs.GR', 'cs.LG']" +Prompting Hard or Hardly Prompting: Prompt Inversion for Text-to-Image Diffusion Models,Shweta Mahajan · Tanzila Rahman · Kwang Moo Yi · Leonid Sigal, ,https://arxiv.org/abs/2312.12416,,2312.12416.pdf,Prompting Hard or Hardly Prompting: Prompt Inversion for Text-to-Image Diffusion Models,"The quality of the prompts provided to text-to-image diffusion models +determines how faithful the generated content is to the user's intent, often +requiring `prompt engineering'. To harness visual concepts from target images +without prompt engineering, current approaches largely rely on embedding +inversion by optimizing and then mapping them to pseudo-tokens. However, +working with such high-dimensional vector representations is challenging +because they lack semantics and interpretability, and only allow simple vector +operations when using them. Instead, this work focuses on inverting the +diffusion model to obtain interpretable language prompts directly. The +challenge of doing this lies in the fact that the resulting optimization +problem is fundamentally discrete and the space of prompts is exponentially +large; this makes using standard optimization techniques, such as stochastic +gradient descent, difficult. To this end, we utilize a delayed projection +scheme to optimize for prompts representative of the vocabulary space in the +model. Further, we leverage the findings that different timesteps of the +diffusion process cater to different levels of detail in an image. The later, +noisy, timesteps of the forward diffusion process correspond to the semantic +information, and therefore, prompt inversion in this range provides tokens +representative of the image semantics. We show that our approach can identify +semantically interpretable and meaningful prompts for a target image which can +be used to synthesize diverse images with similar content. We further +illustrate the application of the optimized prompts in evolutionary image +generation and concept removal.",cs.CV,"['cs.CV', 'cs.LG']" +Pixel Aligned Language Models,Jiarui Xu · Xingyi Zhou · Shen Yan · Xiuye Gu · Anurag Arnab · Chen Sun · Xiaolong Wang · Cordelia Schmid,https://jerryxu.net/PixelLLM/,https://arxiv.org/abs/2312.09237,,2312.09237.pdf,Pixel Aligned Language Models,"Large language models have achieved great success in recent years, so as +their variants in vision. Existing vision-language models can describe images +in natural languages, answer visual-related questions, or perform complex +reasoning about the image. However, it is yet unclear how localization tasks, +such as word grounding or referring localization, can be performed using large +language models. In this work, we aim to develop a vision-language model that +can take locations, for example, a set of points or boxes, as either inputs or +outputs. When taking locations as inputs, the model performs +location-conditioned captioning, which generates captions for the indicated +object or region. When generating locations as outputs, our model regresses +pixel coordinates for each output word generated by the language model, and +thus performs dense word grounding. 
Our model is pre-trained on the Localized +Narrative dataset, which contains pixel-word-aligned captioning from human +attention. We show our model can be applied to various location-aware +vision-language tasks, including referring localization, location-conditioned +captioning, and dense object captioning, archiving state-of-the-art performance +on RefCOCO and Visual Genome. Project page: https://jerryxu.net/PixelLLM .",cs.CV,['cs.CV'] +Skeleton-in-Context: Unified Skeleton Sequence Modeling with In-Context Learning,Xinshun Wang · Zhongbin Fang · Xia Li · Xiangtai Li · Chen Chen · Mengyuan Liu, ,https://arxiv.org/abs/2312.03703,,2312.03703.pdf,Skeleton-in-Context: Unified Skeleton Sequence Modeling with In-Context Learning,"In-context learning provides a new perspective for multi-task modeling for +vision and NLP. Under this setting, the model can perceive tasks from prompts +and accomplish them without any extra task-specific head predictions or model +fine-tuning. However, Skeleton sequence modeling via in-context learning +remains unexplored. Directly applying existing in-context models from other +areas onto skeleton sequences fails due to the inter-frame and cross-task pose +similarity that makes it outstandingly hard to perceive the task correctly from +a subtle context. To address this challenge, we propose Skeleton-in-Context +(SiC), an effective framework for in-context skeleton sequence modeling. Our +SiC is able to handle multiple skeleton-based tasks simultaneously after a +single training process and accomplish each task from context according to the +given prompt. It can further generalize to new, unseen tasks according to +customized prompts. To facilitate context perception, we additionally propose a +task-unified prompt, which adaptively learns tasks of different natures, such +as partial joint-level generation, sequence-level prediction, or 2D-to-3D +motion prediction. We conduct extensive experiments to evaluate the +effectiveness of our SiC on multiple tasks, including motion prediction, pose +estimation, joint completion, and future pose estimation. We also evaluate its +generalization capability on unseen tasks such as motion-in-between. These +experiments show that our model achieves state-of-the-art multi-task +performance and even outperforms single-task methods on certain tasks.",cs.CV,['cs.CV'] +CLIPtone: Unsupervised Learning for Text-based Image Tone Adjustment,Hyeongmin Lee · Kyoungkook Kang · Jungseul Ok · Sunghyun Cho, ,https://arxiv.org/abs/2404.01123,,2404.01123.pdf,CLIPtone: Unsupervised Learning for Text-based Image Tone Adjustment,"Recent image tone adjustment (or enhancement) approaches have predominantly +adopted supervised learning for learning human-centric perceptual assessment. +However, these approaches are constrained by intrinsic challenges of supervised +learning. Primarily, the requirement for expertly-curated or retouched images +escalates the data acquisition expenses. Moreover, their coverage of target +style is confined to stylistic variants inferred from the training data. To +surmount the above challenges, we propose an unsupervised learning-based +approach for text-based image tone adjustment method, CLIPtone, that extends an +existing image enhancement method to accommodate natural language descriptions. +Specifically, we design a hyper-network to adaptively modulate the pretrained +parameters of the backbone model based on text description. 
To assess whether +the adjusted image aligns with the text description without ground truth image, +we utilize CLIP, which is trained on a vast set of language-image pairs and +thus encompasses knowledge of human perception. The major advantages of our +approach are three fold: (i) minimal data collection expenses, (ii) support for +a range of adjustments, and (iii) the ability to handle novel text descriptions +unseen in training. Our approach's efficacy is demonstrated through +comprehensive experiments, including a user study.",cs.CV,"['cs.CV', 'cs.GR', 'eess.IV']" +PeerAiD: Improving Adversarial Distillation from a Specialized Peer Tutor,Jaewon Jung · Hongsun Jang · Jaeyong Song · Jinho Lee,https://github.com/jaewonalive/PeerAiD,https://arxiv.org/abs/2403.06668,,2403.06668.pdf,PeerAiD: Improving Adversarial Distillation from a Specialized Peer Tutor,"Adversarial robustness of the neural network is a significant concern when it +is applied to security-critical domains. In this situation, adversarial +distillation is a promising option which aims to distill the robustness of the +teacher network to improve the robustness of a small student network. Previous +works pretrain the teacher network to make it robust against the adversarial +examples aimed at itself. However, the adversarial examples are dependent on +the parameters of the target network. The fixed teacher network inevitably +degrades its robustness against the unseen transferred adversarial examples +which target the parameters of the student network in the adversarial +distillation process. We propose PeerAiD to make a peer network learn the +adversarial examples of the student network instead of adversarial examples +aimed at itself. PeerAiD is an adversarial distillation that trains the peer +network and the student network simultaneously in order to specialize the peer +network for defending the student network. We observe that such peer networks +surpass the robustness of the pretrained robust teacher model against +adversarial examples aimed at the student network. With this peer network and +adversarial distillation, PeerAiD achieves significantly higher robustness of +the student network with AutoAttack (AA) accuracy by up to 1.66%p and improves +the natural accuracy of the student network by up to 4.72%p with ResNet-18 on +TinyImageNet dataset. Code is available at +https://github.com/jaewonalive/PeerAiD.",cs.LG,"['cs.LG', 'cs.CV']" +MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer,Jianjian Cao · Peng Ye · Shengze Li · Chong Yu · Yansong Tang · Jiwen Lu · Tao Chen,https://github.com/double125/MADTP,https://arxiv.org/abs/2403.02991,,2403.02991.pdf,MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer,"Vision-Language Transformers (VLTs) have shown great success recently, but +are meanwhile accompanied by heavy computation costs, where a major reason can +be attributed to the large number of visual and language tokens. Existing token +pruning research for compressing VLTs mainly follows a single-modality-based +scheme yet ignores the critical role of aligning different modalities for +guiding the token pruning process, causing the important tokens for one +modality to be falsely pruned in another modality branch. Meanwhile, existing +VLT pruning works also lack the flexibility to dynamically compress each layer +based on different input samples. 
To this end, we propose a novel framework +named Multimodal Alignment-Guided Dynamic Token Pruning (MADTP) for +accelerating various VLTs. Specifically, we first introduce a well-designed +Multi-modality Alignment Guidance (MAG) module that can align features of the +same semantic concept from different modalities, to ensure the pruned tokens +are less important for all modalities. We further design a novel Dynamic Token +Pruning (DTP) module, which can adaptively adjust the token compression ratio +in each layer based on different input instances. Extensive experiments on +various benchmarks demonstrate that MADTP significantly reduces the +computational complexity of kinds of multimodal models while preserving +competitive performance. Notably, when applied to the BLIP model in the NLVR2 +dataset, MADTP can reduce the GFLOPs by 80% with less than 4% performance +degradation.",cs.CV,['cs.CV'] +VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models,Hyeonho Jeong · Geon Yeong Park · Jong Chul Ye,https://video-motion-customization.github.io/,https://arxiv.org/abs/2312.00845,,2312.00845.pdf,VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models,"Text-to-video diffusion models have advanced video generation significantly. +However, customizing these models to generate videos with tailored motions +presents a substantial challenge. In specific, they encounter hurdles in (a) +accurately reproducing motion from a target video, and (b) creating diverse +visual variations. For example, straightforward extensions of static image +customization methods to video often lead to intricate entanglements of +appearance and motion data. To tackle this, here we present the Video Motion +Customization (VMC) framework, a novel one-shot tuning approach crafted to +adapt temporal attention layers within video diffusion models. Our approach +introduces a novel motion distillation objective using residual vectors between +consecutive frames as a motion reference. The diffusion process then preserves +low-frequency motion trajectories while mitigating high-frequency +motion-unrelated noise in image space. We validate our method against +state-of-the-art video generative models across diverse real-world motions and +contexts. Our codes, data and the project demo can be found at +https://video-motion-customization.github.io",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +Tailored Visions: Enhancing Text-to-Image Generation with Personalized Prompt Rewriting,Zijie Chen · Lichao Zhang · Fangsheng Weng · Lili Pan · ZHENZHONG Lan,https://github.com/zzjchen/Tailored-Visions,https://arxiv.org/abs/2310.08129,,2310.08129.pdf,Tailored Visions: Enhancing Text-to-Image Generation with Personalized Prompt Rewriting,"Despite significant progress in the field, it is still challenging to create +personalized visual representations that align closely with the desires and +preferences of individual users. This process requires users to articulate +their ideas in words that are both comprehensible to the models and accurately +capture their vision, posing difficulties for many users. In this paper, we +tackle this challenge by leveraging historical user interactions with the +system to enhance user prompts. We propose a novel approach that involves +rewriting user prompts based on a newly collected large-scale text-to-image +dataset with over 300k prompts from 3115 users. 
Our rewriting model enhances +the expressiveness and alignment of user prompts with their intended visual +outputs. Experimental results demonstrate the superiority of our methods over +baseline approaches, as evidenced in our new offline evaluation method and +online tests. Our code and dataset are available at +https://github.com/zzjchen/Tailored-Visions.",cs.CV,['cs.CV'] +VideoBooth: Diffusion-based Video Generation with Image Prompts,Yuming Jiang · Tianxing Wu · Shuai Yang · Chenyang Si · Dahua Lin · Yu Qiao · Chen Change Loy · Ziwei Liu, ,https://arxiv.org/abs/2312.00777,,2312.00777.pdf,VideoBooth: Diffusion-based Video Generation with Image Prompts,"Text-driven video generation witnesses rapid progress. However, merely using +text prompts is not enough to depict the desired subject appearance that +accurately aligns with users' intents, especially for customized content +creation. In this paper, we study the task of video generation with image +prompts, which provide more accurate and direct content control beyond the text +prompts. Specifically, we propose a feed-forward framework VideoBooth, with two +dedicated designs: 1) We propose to embed image prompts in a coarse-to-fine +manner. Coarse visual embeddings from image encoder provide high-level +encodings of image prompts, while fine visual embeddings from the proposed +attention injection module provide multi-scale and detailed encoding of image +prompts. These two complementary embeddings can faithfully capture the desired +appearance. 2) In the attention injection module at fine level, multi-scale +image prompts are fed into different cross-frame attention layers as additional +keys and values. This extra spatial information refines the details in the +first frame and then it is propagated to the remaining frames, which maintains +temporal consistency. Extensive experiments demonstrate that VideoBooth +achieves state-of-the-art performance in generating customized high-quality +videos with subjects specified in image prompts. Notably, VideoBooth is a +generalizable framework where a single model works for a wide range of image +prompts with feed-forward pass.",cs.CV,['cs.CV'] +FreeU: Free Lunch in Diffusion U-Net,Chenyang Si · Ziqi Huang · Yuming Jiang · Ziwei Liu,https://chenyangsi.top/FreeU/,https://arxiv.org/abs/2309.11497,,2309.11497.pdf,FreeU: Free Lunch in Diffusion U-Net,"In this paper, we uncover the untapped potential of diffusion U-Net, which +serves as a ""free lunch"" that substantially improves the generation quality on +the fly. We initially investigate the key contributions of the U-Net +architecture to the denoising process and identify that its main backbone +primarily contributes to denoising, whereas its skip connections mainly +introduce high-frequency features into the decoder module, causing the network +to overlook the backbone semantics. Capitalizing on this discovery, we propose +a simple yet effective method-termed ""FreeU"" - that enhances generation quality +without additional training or finetuning. Our key insight is to strategically +re-weight the contributions sourced from the U-Net's skip connections and +backbone feature maps, to leverage the strengths of both components of the +U-Net architecture. Promising results on image and video generation tasks +demonstrate that our FreeU can be readily integrated to existing diffusion +models, e.g., Stable Diffusion, DreamBooth, ModelScope, Rerender and ReVersion, +to improve the generation quality with only a few lines of code. 
All you need +is to adjust two scaling factors during inference. Project page: +https://chenyangsi.top/FreeU/.",cs.CV,['cs.CV'] +One-Shot Structure-Aware Stylized Image Synthesis,Hansam Cho · Jonghyun Lee · Seunggyu Chang · Yonghyun Jeong,https://github.com/hansam95/OSASIS,https://arxiv.org/abs/2402.17275,,2402.17275.pdf,One-Shot Structure-Aware Stylized Image Synthesis,"While GAN-based models have been successful in image stylization tasks, they +often struggle with structure preservation while stylizing a wide range of +input images. Recently, diffusion models have been adopted for image +stylization but still lack the capability to maintain the original quality of +input images. Building on this, we propose OSASIS: a novel one-shot stylization +method that is robust in structure preservation. We show that OSASIS is able to +effectively disentangle the semantics from the structure of an image, allowing +it to control the level of content and style implemented to a given input. We +apply OSASIS to various experimental settings, including stylization with +out-of-domain reference images and stylization with text-driven manipulation. +Results show that OSASIS outperforms other stylization methods, especially for +input images that were rarely encountered during training, providing a +promising solution to stylization via diffusion models.",cs.CV,['cs.CV'] +OrCo: Towards Better Generalization via Orthogonality and Contrast for Few-Shot Class-Incremental Learning,Noor Ahmed · Anna Kukleva · Bernt Schiele, ,https://arxiv.org/abs/2403.18550,,2403.18550.pdf,OrCo: Towards Better Generalization via Orthogonality and Contrast for Few-Shot Class-Incremental Learning,"Few-Shot Class-Incremental Learning (FSCIL) introduces a paradigm in which +the problem space expands with limited data. FSCIL methods inherently face the +challenge of catastrophic forgetting as data arrives incrementally, making +models susceptible to overwriting previously acquired knowledge. Moreover, +given the scarcity of labeled samples available at any given time, models may +be prone to overfitting and find it challenging to strike a balance between +extensive pretraining and the limited incremental data. To address these +challenges, we propose the OrCo framework built on two core principles: +features' orthogonality in the representation space, and contrastive learning. +In particular, we improve the generalization of the embedding space by +employing a combination of supervised and self-supervised contrastive losses +during the pretraining phase. Additionally, we introduce OrCo loss to address +challenges arising from data limitations during incremental sessions. Through +feature space perturbations and orthogonality between classes, the OrCo loss +maximizes margins and reserves space for the following incremental data. This, +in turn, ensures the accommodation of incoming classes in the feature space +without compromising previously acquired knowledge. Our experimental results +showcase state-of-the-art performance across three benchmark datasets, +including mini-ImageNet, CIFAR100, and CUB datasets. Code is available at +https://github.com/noorahmedds/OrCo",cs.CV,['cs.CV'] +ZeroShape: Regression-based Zero-shot Shape Reconstruction,Zixuan Huang · Stefan Stojanov · Anh Thai · Varun Jampani · James Rehg, ,https://arxiv.org/abs/2312.14198,,2312.14198.pdf,ZeroShape: Regression-based Zero-shot Shape Reconstruction,"We study the problem of single-image zero-shot 3D shape reconstruction. 
+Recent works learn zero-shot shape reconstruction through generative modeling +of 3D assets, but these models are computationally expensive at train and +inference time. In contrast, the traditional approach to this problem is +regression-based, where deterministic models are trained to directly regress +the object shape. Such regression methods possess much higher computational +efficiency than generative methods. This raises a natural question: is +generative modeling necessary for high performance, or conversely, are +regression-based approaches still competitive? To answer this, we design a +strong regression-based model, called ZeroShape, based on the converging +findings in this field and a novel insight. We also curate a large real-world +evaluation benchmark, with objects from three different real-world 3D datasets. +This evaluation benchmark is more diverse and an order of magnitude larger than +what prior works use to quantitatively evaluate their models, aiming at +reducing the evaluation variance in our field. We show that ZeroShape not only +achieves superior performance over state-of-the-art methods, but also +demonstrates significantly higher computational and data efficiency.",cs.CV,['cs.CV'] +Robust Self-calibration of Focal Lengths from the Fundamental Matrix,Viktor Kocur · Daniel Kyselica · Zuzana Kukelova,https://github.com/kocurvik/robust_self_calibration,https://arxiv.org/abs/2311.16304,,2311.16304.pdf,Robust Self-calibration of Focal Lengths from the Fundamental Matrix,"The problem of self-calibration of two cameras from a given fundamental +matrix is one of the basic problems in geometric computer vision. Under the +assumption of known principal points and square pixels, the well-known Bougnoux +formula offers a means to compute the two unknown focal lengths. However, in +many practical situations, the formula yields inaccurate results due to +commonly occurring singularities. Moreover, the estimates are sensitive to +noise in the computed fundamental matrix and to the assumed positions of the +principal points. In this paper, we therefore propose an efficient and robust +iterative method to estimate the focal lengths along with the principal points +of the cameras given a fundamental matrix and priors for the estimated camera +parameters. In addition, we study a computationally efficient check of models +generated within RANSAC that improves the accuracy of the estimated models +while reducing the total computational time. 
Extensive experiments on real and +synthetic data show that our iterative method brings significant improvements +in terms of the accuracy of the estimated focal lengths over the Bougnoux +formula and other state-of-the-art methods, even when relying on inaccurate +priors.",cs.CV,['cs.CV'] +GauHuman: Articulated Gaussian Splatting from Monocular Human Videos,Shoukang Hu · Tao Hu · Ziwei Liu, ,,https://paperswithcode.com/paper/gauhuman-articulated-gaussian-splatting-from,,,,,nan +Diffusion-EDFs: Bi-equivariant Denoising Generative Modeling on SE(3) for Visual Robotic Manipulation,Hyunwoo Ryu · Jiwoo Kim · Hyunseok An · Junwoo Chang · Joohwan Seo · Taehan Kim · Yubin Kim · Chaewon Hwang · Jongeun Choi · Roberto Horowitz,https://sites.google.com/view/diffusion-edfs,https://arxiv.org/abs/2309.02685,,2309.02685.pdf,Diffusion-EDFs: Bi-equivariant Denoising Generative Modeling on SE(3) for Visual Robotic Manipulation,"Diffusion generative modeling has become a promising approach for learning +robotic manipulation tasks from stochastic human demonstrations. In this paper, +we present Diffusion-EDFs, a novel SE(3)-equivariant diffusion-based approach +for visual robotic manipulation tasks. We show that our proposed method +achieves remarkable data efficiency, requiring only 5 to 10 human +demonstrations for effective end-to-end training in less than an hour. +Furthermore, our benchmark experiments demonstrate that our approach has +superior generalizability and robustness compared to state-of-the-art methods. +Lastly, we validate our methods with real hardware experiments. Project +Website: https://sites.google.com/view/diffusion-edfs/home",cs.RO,"['cs.RO', 'cs.AI', 'cs.LG']" +DREAM: Diffusion Rectification and Estimation-Adaptive Models,Jinxin Zhou · Tianyu Ding · Tianyi Chen · Jiachen Jiang · Ilya Zharkov · Zhihui Zhu · Luming Liang, ,https://arxiv.org/abs/2312.00210,,2312.00210.pdf,DREAM: Diffusion Rectification and Estimation-Adaptive Models,"We present DREAM, a novel training framework representing Diffusion +Rectification and Estimation Adaptive Models, requiring minimal code changes +(just three lines) yet significantly enhancing the alignment of training with +sampling in diffusion models. DREAM features two components: diffusion +rectification, which adjusts training to reflect the sampling process, and +estimation adaptation, which balances perception against distortion. When +applied to image super-resolution (SR), DREAM adeptly navigates the tradeoff +between minimizing distortion and preserving high image quality. Experiments +demonstrate DREAM's superiority over standard diffusion-based SR methods, +showing a $2$ to $3\times $ faster training convergence and a $10$ to +$20\times$ reduction in sampling steps to achieve comparable results. We hope +DREAM will inspire a rethinking of diffusion model training paradigms.",cs.CV,"['cs.CV', 'cs.AI']" +Zero-TPrune: Zero-Shot Token Pruning through Leveraging of the Attention Graph in Pre-Trained Transformers,Hongjie Wang · Bhishma Dedhia · Niraj Jha,https://jha-lab.github.io/zerotprune/,https://ar5iv.labs.arxiv.org/html/2305.17328,,2305.17328.pdf,Zero-TPrune: Zero-Shot Token Pruning through Leveraging of the Attention Graph in Pre-Trained Transformers,"Deployment of Transformer models on edge devices is becoming increasingly +challenging due to the exponentially growing inference cost that scales +quadratically with the number of tokens in the input sequence. 
Token pruning is +an emerging solution to address this challenge due to its ease of deployment on +various Transformer backbones. However, most token pruning methods require +computationally expensive fine-tuning, which is undesirable in many edge +deployment cases. In this work, we propose Zero-TPrune, the first zero-shot +method that considers both the importance and similarity of tokens in +performing token pruning. It leverages the attention graph of pre-trained +Transformer models to produce an importance distribution for tokens via our +proposed Weighted Page Rank (WPR) algorithm. This distribution further guides +token partitioning for efficient similarity-based pruning. Due to the +elimination of the fine-tuning overhead, Zero-TPrune can prune large models at +negligible computational cost, switch between different pruning configurations +at no computational cost, and perform hyperparameter tuning efficiently. We +evaluate the performance of Zero-TPrune on vision tasks by applying it to +various vision Transformer backbones and testing them on ImageNet. Without any +fine-tuning, Zero-TPrune reduces the FLOPs cost of DeiT-S by 34.7% and improves +its throughput by 45.3% with only 0.4% accuracy loss. Compared with +state-of-the-art pruning methods that require fine-tuning, Zero-TPrune not only +eliminates the need for fine-tuning after pruning but also does so with only +0.1% accuracy loss. Compared with state-of-the-art fine-tuning-free pruning +methods, Zero-TPrune reduces accuracy loss by up to 49% with similar FLOPs +budgets. Project webpage: https://jha-lab.github.io/zerotprune.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG', 'eess.IV']" +FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects,Bowen Wen · Wei Yang · Jan Kautz · Stan Birchfield, ,https://arxiv.org/abs/2312.08344,,2312.08344.pdf,FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects,"We present FoundationPose, a unified foundation model for 6D object pose +estimation and tracking, supporting both model-based and model-free setups. Our +approach can be instantly applied at test-time to a novel object without +fine-tuning, as long as its CAD model is given, or a small number of reference +images are captured. We bridge the gap between these two setups with a neural +implicit representation that allows for effective novel view synthesis, keeping +the downstream pose estimation modules invariant under the same unified +framework. Strong generalizability is achieved via large-scale synthetic +training, aided by a large language model (LLM), a novel transformer-based +architecture, and contrastive learning formulation. Extensive evaluation on +multiple public datasets involving challenging scenarios and objects indicate +our unified approach outperforms existing methods specialized for each task by +a large margin. In addition, it even achieves comparable results to +instance-level methods despite the reduced assumptions. Project page: +https://nvlabs.github.io/FoundationPose/",cs.CV,"['cs.CV', 'cs.AI', 'cs.RO']" +Improving Bird’s Eye View Semantic Segmentation by Task Decomposition,Tianhao Zhao · Yongcan Chen · Yu Wu · Tianyang Liu · Bo Du · Peilun Xiao · shi qiu · Hongda Yang · Guozhen Li · yi yang · Yutian Lin, ,https://arxiv.org/abs/2404.01925v1,,2404.01925v1.pdf,Improving Bird's Eye View Semantic Segmentation by Task Decomposition,"Semantic segmentation in bird's eye view (BEV) plays a crucial role in +autonomous driving. 
Previous methods usually follow an end-to-end pipeline, +directly predicting the BEV segmentation map from monocular RGB inputs. +However, the challenge arises when the RGB inputs and BEV targets from distinct +perspectives, making the direct point-to-point predicting hard to optimize. In +this paper, we decompose the original BEV segmentation task into two stages, +namely BEV map reconstruction and RGB-BEV feature alignment. In the first +stage, we train a BEV autoencoder to reconstruct the BEV segmentation maps +given corrupted noisy latent representation, which urges the decoder to learn +fundamental knowledge of typical BEV patterns. The second stage involves +mapping RGB input images into the BEV latent space of the first stage, directly +optimizing the correlations between the two views at the feature level. Our +approach simplifies the complexity of combining perception and generation into +distinct steps, equipping the model to handle intricate and challenging scenes +effectively. Besides, we propose to transform the BEV segmentation map from the +Cartesian to the polar coordinate system to establish the column-wise +correspondence between RGB images and BEV maps. Moreover, our method requires +neither multi-scale features nor camera intrinsic parameters for depth +estimation and saves computational overhead. Extensive experiments on nuScenes +and Argoverse show the effectiveness and efficiency of our method. Code is +available at https://github.com/happytianhao/TaDe.",cs.CV,"['cs.CV', 'cs.AI']" +Optimal Transport Aggregation for Visual Place Recognition,Sergio Izquierdo · Javier Civera,https://serizba.github.io/salad.html,https://arxiv.org/abs/2311.15937,,2311.15937.pdf,Optimal Transport Aggregation for Visual Place Recognition,"The task of Visual Place Recognition (VPR) aims to match a query image +against references from an extensive database of images from different places, +relying solely on visual cues. State-of-the-art pipelines focus on the +aggregation of features extracted from a deep backbone, in order to form a +global descriptor for each image. In this context, we introduce SALAD (Sinkhorn +Algorithm for Locally Aggregated Descriptors), which reformulates NetVLAD's +soft-assignment of local features to clusters as an optimal transport problem. +In SALAD, we consider both feature-to-cluster and cluster-to-feature relations +and we also introduce a 'dustbin' cluster, designed to selectively discard +features deemed non-informative, enhancing the overall descriptor quality. +Additionally, we leverage and fine-tune DINOv2 as a backbone, which provides +enhanced description power for the local features, and dramatically reduces the +required training time. As a result, our single-stage method not only surpasses +single-stage baselines in public VPR datasets, but also surpasses two-stage +methods that add a re-ranking with significantly higher cost. 
Code and models +are available at https://github.com/serizba/salad.",cs.CV,['cs.CV'] +DAP: A Dynamic Adversarial Patch for Evading Person Detectors,Amira Guesmi · Ruitian Ding · Muhammad Abdullah Hanif · Ihsen Alouani · Muhammad Shafique, ,,https://dblp.org/rec/journals/corr/abs-2305-11618,,,,,nan +UltrAvatar: A Realistic Animatable 3D Avatar Diffusion Model with Authenticity Guided Textures,Mingyuan Zhou · Rakib Hyder · Ziwei Xuan · Guo-Jun Qi,https://usrc-sea.github.io/UltrAvatar/,https://arxiv.org/abs/2401.11078,,2401.11078.pdf,UltrAvatar: A Realistic Animatable 3D Avatar Diffusion Model with Authenticity Guided Textures,"Recent advances in 3D avatar generation have gained significant attentions. +These breakthroughs aim to produce more realistic animatable avatars, narrowing +the gap between virtual and real-world experiences. Most of existing works +employ Score Distillation Sampling (SDS) loss, combined with a differentiable +renderer and text condition, to guide a diffusion model in generating 3D +avatars. However, SDS often generates oversmoothed results with few facial +details, thereby lacking the diversity compared with ancestral sampling. On the +other hand, other works generate 3D avatar from a single image, where the +challenges of unwanted lighting effects, perspective views, and inferior image +quality make them difficult to reliably reconstruct the 3D face meshes with the +aligned complete textures. In this paper, we propose a novel 3D avatar +generation approach termed UltrAvatar with enhanced fidelity of geometry, and +superior quality of physically based rendering (PBR) textures without unwanted +lighting. To this end, the proposed approach presents a diffuse color +extraction model and an authenticity guided texture diffusion model. The former +removes the unwanted lighting effects to reveal true diffuse colors so that the +generated avatars can be rendered under various lighting conditions. The latter +follows two gradient-based guidances for generating PBR textures to render +diverse face-identity features and details better aligning with 3D mesh +geometry. We demonstrate the effectiveness and robustness of the proposed +method, outperforming the state-of-the-art methods by a large margin in the +experiments.",cs.CV,['cs.CV'] +IntrinsicAvatar: Physically Based Inverse Rendering of Dynamic Humans from Monocular Videos via Explicit Ray Tracing,Shaofei Wang · Bozidar Antic · Andreas Geiger · Siyu Tang, ,https://arxiv.org/abs/2312.05210,,2312.05210.pdf,IntrinsicAvatar: Physically Based Inverse Rendering of Dynamic Humans from Monocular Videos via Explicit Ray Tracing,"We present IntrinsicAvatar, a novel approach to recovering the intrinsic +properties of clothed human avatars including geometry, albedo, material, and +environment lighting from only monocular videos. Recent advancements in +human-based neural rendering have enabled high-quality geometry and appearance +reconstruction of clothed humans from just monocular videos. However, these +methods bake intrinsic properties such as albedo, material, and environment +lighting into a single entangled neural representation. On the other hand, only +a handful of works tackle the problem of estimating geometry and disentangled +appearance properties of clothed humans from monocular videos. They usually +achieve limited quality and disentanglement due to approximations of secondary +shading effects via learned MLPs. In this work, we propose to model secondary +shading effects explicitly via Monte-Carlo ray tracing. 
We model the rendering +process of clothed humans as a volumetric scattering process, and combine ray +tracing with body articulation. Our approach can recover high-quality geometry, +albedo, material, and lighting properties of clothed humans from a single +monocular video, without requiring supervised pre-training using ground truth +materials. Furthermore, since we explicitly model the volumetric scattering +process and ray tracing, our model naturally generalizes to novel poses, +enabling animation of the reconstructed avatar in novel lighting conditions.",cs.CV,['cs.CV'] +A Call to Reflect on Evaluation Practices for Age Estimation: Comparative Analysis of the State-of-the-Art and a Unified Benchmark,Jakub Paplham · Vojtech Franc, ,https://arxiv.org/abs/2307.04570v2,,2307.04570v2.pdf,A Call to Reflect on Evaluation Practices for Age Estimation: Comparative Analysis of the State-of-the-Art and a Unified Benchmark,"Comparing different age estimation methods poses a challenge due to the +unreliability of published results stemming from inconsistencies in the +benchmarking process. Previous studies have reported continuous performance +improvements over the past decade using specialized methods; however, our +findings challenge these claims. This paper identifies two trivial, yet +persistent issues with the currently used evaluation protocol and describes how +to resolve them. We describe our evaluation protocol in detail and provide +specific examples of how the protocol should be used. We utilize the protocol +to offer an extensive comparative analysis for state-of-the-art facial age +estimation methods. Surprisingly, we find that the performance differences +between the methods are negligible compared to the effect of other factors, +such as facial alignment, facial coverage, image resolution, model +architecture, or the amount of data used for pretraining. We use the gained +insights to propose using FaRL as the backbone model and demonstrate its +efficiency. The results emphasize the importance of consistent data +preprocessing practices for reliable and meaningful comparisons. We make our +source code public at +https://github.com/paplhjak/Facial-Age-Estimation-Benchmark.",cs.CV,"['cs.CV', 'cs.LG']" +REACTO: Reconstructing Articulated Objects from a Single Video,Chaoyue Song · Jiacheng Wei · Chuan-Sheng Foo · Guosheng Lin · Fayao Liu,https://chaoyuesong.github.io/REACTO/,https://arxiv.org/abs/2404.11151,,2404.11151.pdf,REACTO: Reconstructing Articulated Objects from a Single Video,"In this paper, we address the challenge of reconstructing general articulated +3D objects from a single video. Existing works employing dynamic neural +radiance fields have advanced the modeling of articulated objects like humans +and animals from videos, but face challenges with piece-wise rigid general +articulated objects due to limitations in their deformation models. To tackle +this, we propose Quasi-Rigid Blend Skinning, a novel deformation model that +enhances the rigidity of each part while maintaining flexible deformation of +the joints. Our primary insight combines three distinct approaches: 1) an +enhanced bone rigging system for improved component modeling, 2) the use of +quasi-sparse skinning weights to boost part rigidity and reconstruction +fidelity, and 3) the application of geodesic point assignment for precise +motion and seamless deformation. 
Our method outperforms previous works in +producing higher-fidelity 3D reconstructions of general articulated objects, as +demonstrated on both real and synthetic datasets. Project page: +https://chaoyuesong.github.io/REACTO.",cs.CV,['cs.CV'] +DiffSHEG: A Diffusion-Based Approach for Real-Time Speech-driven Holistic 3D Expression and Gesture Generation,Junming Chen · Yunfei Liu · Jianan Wang · Ailing Zeng · Yu Li · Qifeng Chen,https://jeremycjm.github.io/proj/DiffSHEG/,https://arxiv.org/abs/2401.04747,,2401.04747.pdf,DiffSHEG: A Diffusion-Based Approach for Real-Time Speech-driven Holistic 3D Expression and Gesture Generation,"We propose DiffSHEG, a Diffusion-based approach for Speech-driven Holistic 3D +Expression and Gesture generation with arbitrary length. While previous works +focused on co-speech gesture or expression generation individually, the joint +generation of synchronized expressions and gestures remains barely explored. To +address this, our diffusion-based co-speech motion generation transformer +enables uni-directional information flow from expression to gesture, +facilitating improved matching of joint expression-gesture distributions. +Furthermore, we introduce an outpainting-based sampling strategy for arbitrary +long sequence generation in diffusion models, offering flexibility and +computational efficiency. Our method provides a practical solution that +produces high-quality synchronized expression and gesture generation driven by +speech. Evaluated on two public datasets, our approach achieves +state-of-the-art performance both quantitatively and qualitatively. +Additionally, a user study confirms the superiority of DiffSHEG over prior +approaches. By enabling the real-time generation of expressive and synchronized +motions, DiffSHEG showcases its potential for various applications in the +development of digital humans and embodied agents.",cs.SD,"['cs.SD', 'cs.AI', 'cs.CV', 'cs.GR', 'eess.AS']" +Building Vision-Language Models on Solid Foundations with Masked Distillation,Sepehr Sameni · Kushal Kafle · Hao Tan · Simon Jenni, ,https://arxiv.org/abs/2311.03149,,2311.03149.pdf,Asymmetric Masked Distillation for Pre-Training Small Foundation Models,"Self-supervised foundation models have shown great potential in computer +vision thanks to the pre-training paradigm of masked autoencoding. Scale is a +primary factor influencing the performance of these foundation models. However, +these large foundation models often result in high computational cost. This +paper focuses on pre-training relatively small vision transformer models that +could be efficiently adapted to downstream tasks. Specifically, taking +inspiration from knowledge distillation in model compression, we propose a new +asymmetric masked distillation (AMD) framework for pre-training relatively +small models with autoencoding. The core of AMD is to devise an asymmetric +masking strategy, where the teacher model is enabled to see more context +information with a lower masking ratio, while the student model is still +equipped with a high masking ratio. We design customized multi-layer feature +alignment between the teacher encoder and student encoder to regularize the +pre-training of student MAE. To demonstrate the effectiveness and versatility +of AMD, we apply it to both ImageMAE and VideoMAE for pre-training relatively +small ViT models. AMD achieved 84.6% classification accuracy on IN1K using the +ViT-B model. 
And AMD achieves 73.3% classification accuracy using the ViT-B +model on the Something-in-Something V2 dataset, a 3.7% improvement over the +original ViT-B model from VideoMAE. We also transfer AMD pre-trained models to +downstream tasks and obtain consistent performance improvement over the +original masked autoencoding. The code and models are available at +https://github.com/MCG-NJU/AMD.",cs.CV,['cs.CV'] +Small Steps and Level Sets: Fitting Neural Surface Models with Point Guidance,Chamin Hewa Koneputugodage · Yizhak Ben-Shabat · Dylan Campbell · Stephen Gould, ,http://export.arxiv.org/abs/2310.07997,,2310.07997.pdf,PG-NeuS: Robust and Efficient Point Guidance for Multi-View Neural Surface Reconstruction,"Recently, learning multi-view neural surface reconstruction with the +supervision of point clouds or depth maps has been a promising way. However, +due to the underutilization of prior information, current methods still +struggle with the challenges of limited accuracy and excessive time complexity. +In addition, prior data perturbation is also an important but rarely considered +issue. To address these challenges, we propose a novel point-guided method +named PG-NeuS, which achieves accurate and efficient reconstruction while +robustly coping with point noise. Specifically, aleatoric uncertainty of the +point cloud is modeled to capture the distribution of noise, leading to noise +robustness. Furthermore, a Neural Projection module connecting points and +images is proposed to add geometric constraints to implicit surface, achieving +precise point guidance. To better compensate for geometric bias between volume +rendering and point modeling, high-fidelity points are filtered into a Bias +Network to further improve details representation. Benefiting from the +effective point guidance, even with a lightweight network, the proposed PG-NeuS +achieves fast convergence with an impressive 11x speedup compared to NeuS. +Extensive experiments show that our method yields high-quality surfaces with +high efficiency, especially for fine-grained details and smooth regions, +outperforming the state-of-the-art methods. Moreover, it exhibits strong +robustness to noisy data and sparse data.",cs.CV,"['cs.CV', 'cs.AI']" +Self-correcting LLM-controlled Diffusion,Tsung-Han Wu · Long Lian · Joseph Gonzalez · Boyi Li · Trevor Darrell,https://self-correcting-llm-diffusion.github.io/,https://arxiv.org/abs/2311.16090,,2311.16090.pdf,Self-correcting LLM-controlled Diffusion Models,"Text-to-image generation has witnessed significant progress with the advent +of diffusion models. Despite the ability to generate photorealistic images, +current text-to-image diffusion models still often struggle to accurately +interpret and follow complex input text prompts. In contrast to existing models +that aim to generate images only with their best effort, we introduce +Self-correcting LLM-controlled Diffusion (SLD). SLD is a framework that +generates an image from the input prompt, assesses its alignment with the +prompt, and performs self-corrections on the inaccuracies in the generated +image. Steered by an LLM controller, SLD turns text-to-image generation into an +iterative closed-loop process, ensuring correctness in the resulting image. SLD +is not only training-free but can also be seamlessly integrated with diffusion +models behind API access, such as DALL-E 3, to further boost the performance of +state-of-the-art diffusion models. 
Experimental results show that our approach +can rectify a majority of incorrect generations, particularly in generative +numeracy, attribute binding, and spatial relationships. Furthermore, by simply +adjusting the instructions to the LLM, SLD can perform image editing tasks, +bridging the gap between text-to-image generation and image editing pipelines. +We will make our code available for future research and applications.",cs.CV,['cs.CV'] +Knowledge-Enhanced Dual-stream Zero-shot Composed Image Retrieval,Yucheng Suo · Fan Ma · Linchao Zhu · Yi Yang, ,https://arxiv.org/abs/2403.16005,,2403.16005.pdf,Knowledge-Enhanced Dual-stream Zero-shot Composed Image Retrieval,"We study the zero-shot Composed Image Retrieval (ZS-CIR) task, which is to +retrieve the target image given a reference image and a description without +training on the triplet datasets. Previous works generate pseudo-word tokens by +projecting the reference image features to the text embedding space. However, +they focus on the global visual representation, ignoring the representation of +detailed attributes, e.g., color, object number and layout. To address this +challenge, we propose a Knowledge-Enhanced Dual-stream zero-shot composed image +retrieval framework (KEDs). KEDs implicitly models the attributes of the +reference images by incorporating a database. The database enriches the +pseudo-word tokens by providing relevant images and captions, emphasizing +shared attribute information in various aspects. In this way, KEDs recognizes +the reference image from diverse perspectives. Moreover, KEDs adopts an extra +stream that aligns pseudo-word tokens with textual concepts, leveraging +pseudo-triplets mined from image-text pairs. The pseudo-word tokens generated +in this stream are explicitly aligned with fine-grained semantics in the text +embedding space. Extensive experiments on widely used benchmarks, i.e. +ImageNet-R, COCO object, Fashion-IQ and CIRR, show that KEDs outperforms +previous zero-shot composed image retrieval methods.",cs.CV,['cs.CV'] +MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training,Pavan Kumar Anasosalu Vasu · Hadi Pouransari · Fartash Faghri · Raviteja Vemulapalli · Oncel Tuzel,https://github.com/apple/ml-mobileclip,https://arxiv.org/abs/2311.17049,,2311.17049.pdf,MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training,"Contrastive pretraining of image-text foundation models, such as CLIP, +demonstrated excellent zero-shot performance and improved robustness on a wide +range of downstream tasks. However, these models utilize large +transformer-based encoders with significant memory and latency overhead which +pose challenges for deployment on mobile devices. In this work, we introduce +MobileCLIP -- a new family of efficient image-text models optimized for runtime +performance along with a novel and efficient training approach, namely +multi-modal reinforced training. The proposed training approach leverages +knowledge transfer from an image captioning model and an ensemble of strong +CLIP encoders to improve the accuracy of efficient models. Our approach avoids +train-time compute overhead by storing the additional knowledge in a reinforced +dataset. MobileCLIP sets a new state-of-the-art latency-accuracy tradeoff for +zero-shot classification and retrieval tasks on several datasets. Our +MobileCLIP-S2 variant is 2.3$\times$ faster while more accurate compared to +previous best CLIP model based on ViT-B/16. 
We further demonstrate the +effectiveness of our multi-modal reinforced training by training a CLIP model +based on ViT-B/16 image backbone and achieving +2.9% average performance +improvement on 38 evaluation benchmarks compared to the previous best. +Moreover, we show that the proposed approach achieves 10$\times$-1000$\times$ +improved learning efficiency when compared with non-reinforced CLIP training. +Code and models are available at https://github.com/apple/ml-mobileclip .",cs.CV,"['cs.CV', 'cs.CL', 'cs.LG']" +Locally Adaptive Neural 3D Morphable Models,Michail Tarasiou · Rolandos Alexandros Potamias · Eimear O' Sullivan · Stylianos Ploumpis · Stefanos Zafeiriou, ,https://arxiv.org/abs/2401.02937,,2401.02937.pdf,Locally Adaptive Neural 3D Morphable Models,"We present the Locally Adaptive Morphable Model (LAMM), a highly flexible +Auto-Encoder (AE) framework for learning to generate and manipulate 3D meshes. +We train our architecture following a simple self-supervised training scheme in +which input displacements over a set of sparse control vertices are used to +overwrite the encoded geometry in order to transform one training sample into +another. During inference, our model produces a dense output that adheres +locally to the specified sparse geometry while maintaining the overall +appearance of the encoded object. This approach results in state-of-the-art +performance in both disentangling manipulated geometry and 3D mesh +reconstruction. To the best of our knowledge LAMM is the first end-to-end +framework that enables direct local control of 3D vertex geometry in a single +forward pass. A very efficient computational graph allows our network to train +with only a fraction of the memory required by previous methods and run faster +during inference, generating 12k vertex meshes at $>$60fps on a single CPU +thread. We further leverage local geometry control as a primitive for higher +level editing operations and present a set of derivative capabilities such as +swapping and sampling object parts. Code and pretrained models can be found at +https://github.com/michaeltrs/LAMM.",cs.CV,['cs.CV'] +Finsler-Laplace-Beltrami Operators with Application to Shape Analysis,Simon Weber · Thomas Dagès · Maolin Gao · Daniel Cremers, ,https://arxiv.org/abs/2404.03999,,2404.03999.pdf,Finsler-Laplace-Beltrami Operators with Application to Shape Analysis,"The Laplace-Beltrami operator (LBO) emerges from studying manifolds equipped +with a Riemannian metric. It is often called the Swiss army knife of geometry +processing as it allows to capture intrinsic shape information and gives rise +to heat diffusion, geodesic distances, and a multitude of shape descriptors. It +also plays a central role in geometric deep learning. In this work, we explore +Finsler manifolds as a generalization of Riemannian manifolds. We revisit the +Finsler heat equation and derive a Finsler heat kernel and a +Finsler-Laplace-Beltrami Operator (FLBO): a novel theoretically justified +anisotropic Laplace-Beltrami operator (ALBO). In experimental evaluations we +demonstrate that the proposed FLBO is a valuable alternative to the traditional +Riemannian-based LBO and ALBOs for spatial filtering and shape correspondence +estimation. 
We hope that the proposed Finsler heat kernel and the FLBO will +inspire further exploration of Finsler geometry in the computer vision +community.",cs.CV,['cs.CV'] +InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks,Zhe Chen · Jiannan Wu · Wenhai Wang · Weijie Su · Guo Chen · Sen Xing · Zhong Muyan · Qing-Long Zhang · Xizhou Zhu · Lewei Lu · Bin Li · Ping Luo · Tong Lu · Yu Qiao · Jifeng Dai, ,https://arxiv.org/abs/2312.14238,,2312.14238.pdf,InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks,"The exponential growth of large language models (LLMs) has opened up numerous +possibilities for multimodal AGI systems. However, the progress in vision and +vision-language foundation models, which are also critical elements of +multi-modal AGI, has not kept pace with LLMs. In this work, we design a +large-scale vision-language foundation model (InternVL), which scales up the +vision foundation model to 6 billion parameters and progressively aligns it +with the LLM, using web-scale image-text data from various sources. This model +can be broadly applied to and achieve state-of-the-art performance on 32 +generic visual-linguistic benchmarks including visual perception tasks such as +image-level or pixel-level recognition, vision-language tasks such as zero-shot +image/video classification, zero-shot image/video-text retrieval, and link with +LLMs to create multi-modal dialogue systems. It has powerful visual +capabilities and can be a good alternative to the ViT-22B. We hope that our +research could contribute to the development of multi-modal large models. Code +and models are available at https://github.com/OpenGVLab/InternVL.",cs.CV,['cs.CV'] +LEAP-VO: Long-term Effective Any Point Tracking for Visual Odometry,Weirong Chen · Le Chen · Rui Wang · Marc Pollefeys,https://chiaki530.github.io/projects/leapvo/,https://arxiv.org/abs/2401.01887v1,,2401.01887v1.pdf,LEAP-VO: Long-term Effective Any Point Tracking for Visual Odometry,"Visual odometry estimates the motion of a moving camera based on visual +input. Existing methods, mostly focusing on two-view point tracking, often +ignore the rich temporal context in the image sequence, thereby overlooking the +global motion patterns and providing no assessment of the full trajectory +reliability. These shortcomings hinder performance in scenarios with occlusion, +dynamic objects, and low-texture areas. To address these challenges, we present +the Long-term Effective Any Point Tracking (LEAP) module. LEAP innovatively +combines visual, inter-track, and temporal cues with mindfully selected anchors +for dynamic track estimation. Moreover, LEAP's temporal probabilistic +formulation integrates distribution updates into a learnable iterative +refinement module to reason about point-wise uncertainty. Based on these +traits, we develop LEAP-VO, a robust visual odometry system adept at handling +occlusions and dynamic scenes. Our mindful integration showcases a novel +practice by employing long-term point tracking as the front-end. 
Extensive +experiments demonstrate that the proposed pipeline significantly outperforms +existing baselines across various visual odometry benchmarks.",cs.CV,['cs.CV'] +MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding,Bo He · Hengduo Li · Young Kyun Jang · Menglin Jia · Xuefei Cao · Ashish Shah · Abhinav Shrivastava · Ser-Nam Lim, ,https://arxiv.org/html/2404.05726v2,,2404.05726v2.pdf,MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding,"With the success of large language models (LLMs), integrating the vision +model into LLMs to build vision-language foundation models has gained much more +interest recently. However, existing LLM-based large multimodal models (e.g., +Video-LLaMA, VideoChat) can only take in a limited number of frames for short +video understanding. In this study, we mainly focus on designing an efficient +and effective model for long-term video understanding. Instead of trying to +process more frames simultaneously like most existing work, we propose to +process videos in an online manner and store past video information in a memory +bank. This allows our model to reference historical video content for long-term +analysis without exceeding LLMs' context length constraints or GPU memory +limits. Our memory bank can be seamlessly integrated into current multimodal +LLMs in an off-the-shelf manner. We conduct extensive experiments on various +video understanding tasks, such as long-video understanding, video question +answering, and video captioning, and our model can achieve state-of-the-art +performances across multiple datasets. Code available at +https://boheumd.github.io/MA-LMM/.",cs.CV,['cs.CV'] +Closely Interactive Human Reconstruction with Proxemics and Physics-Guided Adaption,Buzhen Huang · Chen Li · Chongyang Xu · Liang Pan · Yangang Wang · Gim Hee Lee, ,https://arxiv.org/abs/2404.11291,,2404.11291.pdf,Closely Interactive Human Reconstruction with Proxemics and Physics-Guided Adaption,"Existing multi-person human reconstruction approaches mainly focus on +recovering accurate poses or avoiding penetration, but overlook the modeling of +close interactions. In this work, we tackle the task of reconstructing closely +interactive humans from a monocular video. The main challenge of this task +comes from insufficient visual information caused by depth ambiguity and severe +inter-person occlusion. In view of this, we propose to leverage knowledge from +proxemic behavior and physics to compensate the lack of visual information. +This is based on the observation that human interaction has specific patterns +following the social proxemics. Specifically, we first design a latent +representation based on Vector Quantised-Variational AutoEncoder (VQ-VAE) to +model human interaction. A proxemics and physics guided diffusion model is then +introduced to denoise the initial distribution. We design the diffusion model +as dual branch with each branch representing one individual such that the +interaction can be modeled via cross attention. With the learned priors of +VQ-VAE and physical constraint as the additional information, our proposed +approach is capable of estimating accurate poses that are also proxemics and +physics plausible. Experimental results on Hi4D, 3DPW, and CHI3D demonstrate +that our method outperforms existing approaches. 
The code is available at +\url{https://github.com/boycehbz/HumanInteraction}.",cs.CV,['cs.CV'] +The Unreasonable Effectiveness of Pre-Trained Features for Camera Pose Refinement,Gabriele Trivigno · Carlo Masone · Barbara Caputo · Torsten Sattler, ,https://arxiv.org/abs/2404.10438v1,,2404.10438v1.pdf,The Unreasonable Effectiveness of Pre-Trained Features for Camera Pose Refinement,"Pose refinement is an interesting and practically relevant research +direction. Pose refinement can be used to (1) obtain a more accurate pose +estimate from an initial prior (e.g., from retrieval), (2) as pre-processing, +i.e., to provide a better starting point to a more expensive pose estimator, +(3) as post-processing of a more accurate localizer. Existing approaches focus +on learning features / scene representations for the pose refinement task. This +involves training an implicit scene representation or learning features while +optimizing a camera pose-based loss. A natural question is whether training +specific features / representations is truly necessary or whether similar +results can be already achieved with more generic features. In this work, we +present a simple approach that combines pre-trained features with a particle +filter and a renderable representation of the scene. Despite its simplicity, it +achieves state-of-the-art results, demonstrating that one can easily build a +pose refiner without the need for specific training. The code is at +https://github.com/ga1i13o/mcloc_poseref",cs.CV,['cs.CV'] +GDA: Generalized Diffusion for Robust Test-time Adaptation,Yun-Yun Tsai · Fu-Chen Chen · Albert Chen · Junfeng Yang · Che-Chun Su · Min Sun · Cheng-Hao Kuo, ,https://arxiv.org/abs/2404.00095,,2404.00095.pdf,GDA: Generalized Diffusion for Robust Test-time Adaptation,"Machine learning models struggle with generalization when encountering +out-of-distribution (OOD) samples with unexpected distribution shifts. For +vision tasks, recent studies have shown that test-time adaptation employing +diffusion models can achieve state-of-the-art accuracy improvements on OOD +samples by generating new samples that align with the model's domain without +the need to modify the model's weights. Unfortunately, those studies have +primarily focused on pixel-level corruptions, thereby lacking the +generalization to adapt to a broader range of OOD types. We introduce +Generalized Diffusion Adaptation (GDA), a novel diffusion-based test-time +adaptation method robust against diverse OOD types. Specifically, GDA +iteratively guides the diffusion by applying a marginal entropy loss derived +from the model, in conjunction with style and content preservation losses +during the reverse sampling process. In other words, GDA considers the model's +output behavior with the semantic information of the samples as a whole, which +can reduce ambiguity in downstream tasks during the generation process. +Evaluation across various popular model architectures and OOD benchmarks shows +that GDA consistently outperforms prior work on diffusion-driven adaptation. +Notably, it achieves the highest classification accuracy improvements, ranging +from 4.4\% to 5.02\% on ImageNet-C and 2.5\% to 7.4\% on Rendition, Sketch, and +Stylized benchmarks. 
This performance highlights GDA's generalization to a +broader range of OOD benchmarks.",cs.CV,['cs.CV'] +RecDiffusion: Rectangling for Image Stitching with Diffusion Models,Tianhao Zhou · Li Haipeng · Ziyi Wang · Ao Luo · Chenlin Zhang · Jiajun Li · Bing Zeng · Shuaicheng Liu, ,https://arxiv.org/abs/2403.19164,,2403.19164.pdf,RecDiffusion: Rectangling for Image Stitching with Diffusion Models,"Image stitching from different captures often results in non-rectangular +boundaries, which is often considered unappealing. To solve non-rectangular +boundaries, current solutions involve cropping, which discards image content, +inpainting, which can introduce unrelated content, or warping, which can +distort non-linear features and introduce artifacts. To overcome these issues, +we introduce a novel diffusion-based learning framework, \textbf{RecDiffusion}, +for image stitching rectangling. This framework combines Motion Diffusion +Models (MDM) to generate motion fields, effectively transitioning from the +stitched image's irregular borders to a geometrically corrected intermediary. +Followed by Content Diffusion Models (CDM) for image detail refinement. +Notably, our sampling process utilizes a weighted map to identify regions +needing correction during each iteration of CDM. Our RecDiffusion ensures +geometric accuracy and overall visual appeal, surpassing all previous methods +in both quantitative and qualitative measures when evaluated on public +benchmarks. Code is released at https://github.com/lhaippp/RecDiffusion.",cs.CV,['cs.CV'] +"Rethinking Interactive Image Segmentation with Low Latency, High Quality, and Diverse Prompts",Qin Liu · Jaemin Cho · Mohit Bansal · Marc Niethammer,https://github.com/uncbiag/SegNext,https://arxiv.org/abs/2404.00741,,2404.00741.pdf,"Rethinking Interactive Image Segmentation with Low Latency, High Quality, and Diverse Prompts","The goal of interactive image segmentation is to delineate specific regions +within an image via visual or language prompts. Low-latency and high-quality +interactive segmentation with diverse prompts remain challenging for existing +specialist and generalist models. Specialist models, with their limited prompts +and task-specific designs, experience high latency because the image must be +recomputed every time the prompt is updated, due to the joint encoding of image +and visual prompts. Generalist models, exemplified by the Segment Anything +Model (SAM), have recently excelled in prompt diversity and efficiency, lifting +image segmentation to the foundation model era. However, for high-quality +segmentations, SAM still lags behind state-of-the-art specialist models despite +SAM being trained with x100 more segmentation masks. In this work, we delve +deep into the architectural differences between the two types of models. We +observe that dense representation and fusion of visual prompts are the key +design choices contributing to the high segmentation quality of specialist +models. In light of this, we reintroduce this dense design into the generalist +models, to facilitate the development of generalist models with high +segmentation quality. To densely represent diverse visual prompts, we propose +to use a dense map to capture five types: clicks, boxes, polygons, scribbles, +and masks. Thus, we propose SegNext, a next-generation interactive segmentation +approach offering low latency, high quality, and diverse prompt support. 
Our +method outperforms current state-of-the-art methods on HQSeg-44K and DAVIS, +both quantitatively and qualitatively.",cs.CV,['cs.CV'] +Sherpa3D: Boosting High-Fidelity Text-to-3D Generation via Coarse 3D Prior,Fangfu Liu · Diankun Wu · Yi Wei · Yongming Rao · Yueqi Duan,https://liuff19.github.io/Sherpa3D/,https://arxiv.org/abs/2312.06655,,2312.06655.pdf,Sherpa3D: Boosting High-Fidelity Text-to-3D Generation via Coarse 3D Prior,"Recently, 3D content creation from text prompts has demonstrated remarkable +progress by utilizing 2D and 3D diffusion models. While 3D diffusion models +ensure great multi-view consistency, their ability to generate high-quality and +diverse 3D assets is hindered by the limited 3D data. In contrast, 2D diffusion +models find a distillation approach that achieves excellent generalization and +rich details without any 3D data. However, 2D lifting methods suffer from +inherent view-agnostic ambiguity thereby leading to serious multi-face Janus +issues, where text prompts fail to provide sufficient guidance to learn +coherent 3D results. Instead of retraining a costly viewpoint-aware model, we +study how to fully exploit easily accessible coarse 3D knowledge to enhance the +prompts and guide 2D lifting optimization for refinement. In this paper, we +propose Sherpa3D, a new text-to-3D framework that achieves high-fidelity, +generalizability, and geometric consistency simultaneously. Specifically, we +design a pair of guiding strategies derived from the coarse 3D prior generated +by the 3D diffusion model: a structural guidance for geometric fidelity and a +semantic guidance for 3D coherence. Employing the two types of guidance, the 2D +diffusion model enriches the 3D content with diversified and high-quality +results. Extensive experiments show the superiority of our Sherpa3D over the +state-of-the-art text-to-3D methods in terms of quality and 3D consistency.",cs.CV,"['cs.CV', 'cs.GR', 'cs.LG']" +Identifying Important Group of Pixels using Interactions,Kosuke Sumiyasu · Kazuhiko Kawamoto · Hiroshi Kera, ,https://arxiv.org/abs/2401.03785,,2401.03785.pdf,Identifying Important Group of Pixels using Interactions,"To better understand the behavior of image classifiers, it is useful to +visualize the contribution of individual pixels to the model prediction. In +this study, we propose a method, MoXI ($\textbf{Mo}$del e$\textbf{X}$planation +by $\textbf{I}$nteractions), that efficiently and accurately identifies a group +of pixels with high prediction confidence. The proposed method employs +game-theoretic concepts, Shapley values and interactions, taking into account +the effects of individual pixels and the cooperative influence of pixels on +model confidence. Theoretical analysis and experiments demonstrate that our +method better identifies the pixels that are highly contributing to the model +outputs than widely-used visualization by Grad-CAM, Attention rollout, and +Shapley value. While prior studies have suffered from the exponential +computational cost in the computation of Shapley value and interactions, we +show that this can be reduced to quadratic cost for our task. 
The code is +available at https://github.com/KosukeSumiyasu/MoXI.",cs.CV,"['cs.CV', 'cs.LG']" +DualAD: Disentangling the Dynamic and Static World for End-to-End Driving,Simon Doll · Niklas Hanselmann · Lukas Schneider · Richard Schulz · Marius Cordts · Markus Enzweiler · Hendrik Lensch, ,https://arxiv.org/html/2306.16927v2,,2306.16927v2.pdf,End-to-end Autonomous Driving: Challenges and Frontiers,"The autonomous driving community has witnessed a rapid growth in approaches +that embrace an end-to-end algorithm framework, utilizing raw sensor input to +generate vehicle motion plans, instead of concentrating on individual tasks +such as detection and motion prediction. End-to-end systems, in comparison to +modular pipelines, benefit from joint feature optimization for perception and +planning. This field has flourished due to the availability of large-scale +datasets, closed-loop evaluation, and the increasing need for autonomous +driving algorithms to perform effectively in challenging scenarios. In this +survey, we provide a comprehensive analysis of more than 270 papers, covering +the motivation, roadmap, methodology, challenges, and future trends in +end-to-end autonomous driving. We delve into several critical challenges, +including multi-modality, interpretability, causal confusion, robustness, and +world models, amongst others. Additionally, we discuss current advancements in +foundation models and visual pre-training, as well as how to incorporate these +techniques within the end-to-end driving framework. we maintain an active +repository that contains up-to-date literature and open-source projects at +https://github.com/OpenDriveLab/End-to-end-Autonomous-Driving.",cs.RO,"['cs.RO', 'cs.AI', 'cs.CV', 'cs.LG']" +A Video is Worth 256 Bases: Spatial-Temporal Expectation-Maximization Inversion for Zero-Shot Video Editing,Li Maomao · Yu Li · Tianyu Yang · Yunfei Liu · Dongxu Yue · Zhihui Lin · Dong Xu, ,https://arxiv.org/abs/2312.05856,,2312.05856.pdf,A Video is Worth 256 Bases: Spatial-Temporal Expectation-Maximization Inversion for Zero-Shot Video Editing,"This paper presents a video inversion approach for zero-shot video editing, +which models the input video with low-rank representation during the inversion +process. The existing video editing methods usually apply the typical 2D DDIM +inversion or naive spatial-temporal DDIM inversion before editing, which +leverages time-varying representation for each frame to derive noisy latent. +Unlike most existing approaches, we propose a Spatial-Temporal +Expectation-Maximization (STEM) inversion, which formulates the dense video +feature under an expectation-maximization manner and iteratively estimates a +more compact basis set to represent the whole video. Each frame applies the +fixed and global representation for inversion, which is more friendly for +temporal consistency during reconstruction and editing. Extensive qualitative +and quantitative experiments demonstrate that our STEM inversion can achieve +consistent improvement on two state-of-the-art video editing methods. Project +page: https://stem-inv.github.io/page/.",cs.CV,['cs.CV'] +"Time-, Memory- and Parameter-Efficient Visual Adaptation",Otniel-Bogdan Mercea · Alexey Gritsenko · Cordelia Schmid · Anurag Arnab, ,https://arxiv.org/abs/2402.02887,,2402.02887.pdf,"Time-, Memory- and Parameter-Efficient Visual Adaptation","As foundation models become more popular, there is a growing need to +efficiently finetune them for downstream tasks. 
Although numerous adaptation +methods have been proposed, they are designed to be efficient only in terms of +how many parameters are trained. They, however, typically still require +backpropagating gradients throughout the model, meaning that their +training-time and -memory cost does not reduce as significantly. We propose an +adaptation method which does not backpropagate gradients through the backbone. +We achieve this by designing a lightweight network in parallel that operates on +features from the frozen, pretrained backbone. As a result, our method is +efficient not only in terms of parameters, but also in training-time and memory +usage. Our approach achieves state-of-the-art accuracy-parameter trade-offs on +the popular VTAB benchmark, and we further show how we outperform prior works +with respect to training-time and -memory usage too. We further demonstrate the +training efficiency and scalability of our method by adapting a vision +transformer backbone of 4 billion parameters for the computationally demanding +task of video classification, without any intricate model parallelism. Here, we +outperform a prior adaptor-based method which could only scale to a 1 billion +parameter backbone, or fully-finetuning a smaller backbone, with the same GPU +and less training time.",cs.CV,"['cs.CV', 'cs.LG']" +WildlifeMapper: Aerial Image Analysis for Multi-Species Detection and Identification,Satish Kumar · Bowen Zhang · Chandrakanth Gudavalli · Connor Levenson · Lacey Hughey · Jared Stabach · Irene Amoke · Gordon Ojwang · Joseph Mukeka · Howard Frederick · Stephen Mwiu · Joseph Ochieng Ogutu · B S Manjunath, ,https://arxiv.org/abs/2311.12956,,2311.12956.pdf,Innovative Horizons in Aerial Imagery: LSKNet Meets DiffusionDet for Advanced Object Detection,"In the realm of aerial image analysis, object detection plays a pivotal role, +with significant implications for areas such as remote sensing, urban planning, +and disaster management. This study addresses the inherent challenges in this +domain, notably the detection of small objects, managing densely packed +elements, and accounting for diverse orientations. We present an in-depth +evaluation of an object detection model that integrates the Large Selective +Kernel Network (LSKNet)as its backbone with the DiffusionDet head, utilizing +the iSAID dataset for empirical analysis. Our approach encompasses the +introduction of novel methodologies and extensive ablation studies. These +studies critically assess various aspects such as loss functions, box +regression techniques, and classification strategies to refine the model's +precision in object detection. The paper details the experimental application +of the LSKNet backbone in synergy with the DiffusionDet heads, a combination +tailored to meet the specific challenges in aerial image object detection. The +findings of this research indicate a substantial enhancement in the model's +performance, especially in the accuracy-time tradeoff. The proposed model +achieves a mean average precision (MAP) of approximately 45.7%, which is a +significant improvement, outperforming the RCNN model by 4.7% on the same +dataset. This advancement underscores the effectiveness of the proposed +modifications and sets a new benchmark in aerial image analysis, paving the way +for more accurate and efficient object detection methodologies. 
The code is +publicly available at https://github.com/SashaMatsun/LSKDiffDet",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +Flattening the Parent Bias: Hierarchical Semantic Segmentation in the Poincaré Ball,Simon Weber · Barış Zöngür · Nikita Araslanov · Daniel Cremers, ,https://arxiv.org/abs/2404.03778,,2404.03778.pdf,Flattening the Parent Bias: Hierarchical Semantic Segmentation in the Poincaré Ball,"Hierarchy is a natural representation of semantic taxonomies, including the +ones routinely used in image segmentation. Indeed, recent work on semantic +segmentation reports improved accuracy from supervised training leveraging +hierarchical label structures. Encouraged by these results, we revisit the +fundamental assumptions behind that work. We postulate and then empirically +verify that the reasons for the observed improvement in segmentation accuracy +may be entirely unrelated to the use of the semantic hierarchy. To demonstrate +this, we design a range of cross-domain experiments with a representative +hierarchical approach. We find that on the new testing domains, a flat +(non-hierarchical) segmentation network, in which the parents are inferred from +the children, has superior segmentation accuracy to the hierarchical approach +across the board. Complementing these findings and inspired by the intrinsic +properties of hyperbolic spaces, we study a more principled approach to +hierarchical segmentation using the Poincar\'e ball model. The hyperbolic +representation largely outperforms the previous (Euclidean) hierarchical +approach as well and is on par with our flat Euclidean baseline in terms of +segmentation accuracy. However, it additionally exhibits surprisingly strong +calibration quality of the parent nodes in the semantic hierarchy, especially +on the more challenging domains. Our combined analysis suggests that the +established practice of hierarchical segmentation may be limited to in-domain +settings, whereas flat classifiers generalize substantially better, especially +if they are modeled in the hyperbolic space.",cs.CV,['cs.CV'] +OTE: Exploring Accurate Scene Text Recognition Using One Token,Jianjun Xu · Yuxin Wang · Hongtao Xie · Yongdong Zhang, ,https://arxiv.org/html/2403.07518v1,,2403.07518v1.pdf,Open-Vocabulary Scene Text Recognition via Pseudo-Image Labeling and Margin Loss,"Scene text recognition is an important and challenging task in computer +vision. However, most prior works focus on recognizing pre-defined words, while +there are various out-of-vocabulary (OOV) words in real-world applications. + In this paper, we propose a novel open-vocabulary text recognition framework, +Pseudo-OCR, to recognize OOV words. The key challenge in this task is the lack +of OOV training data. To solve this problem, we first propose a pseudo label +generation module that leverages character detection and image inpainting to +produce substantial pseudo OOV training data from real-world images. Unlike +previous synthetic data, our pseudo OOV data contains real characters and +backgrounds to simulate real-world applications. Secondly, to reduce noises in +pseudo data, we present a semantic checking mechanism to filter semantically +meaningful data. Thirdly, we introduce a quality-aware margin loss to boost the +training with pseudo data. Our loss includes a margin-based part to enhance the +classification ability, and a quality-aware part to penalize low-quality +samples in both real and pseudo data. 
+ Extensive experiments demonstrate that our approach outperforms the +state-of-the-art on eight datasets and achieves the first rank in the ICDAR2022 +challenge.",cs.CV,['cs.CV'] +Language-Driven Anchors for Zero-Shot Adversarial Robustness,Xiao Li · Wei Zhang · Yining Liu · Zhanhao Hu · Bo Zhang · Xiaolin Hu,https://github.com/LixiaoTHU/LAAT,,https://paperswithcode.com/search?q=author:Xiaolin+Hu&order_by=stars,,,,,nan +DAVE -- A Detect-and-Verify Paradigm for Low-Shot Counting,Jer Pelhan · Alan Lukezic · Vitjan Zavrtanik · Matej Kristan, ,https://arxiv.org/abs/2404.16622,,2404.16622.pdf,DAVE -- A Detect-and-Verify Paradigm for Low-Shot Counting,"Low-shot counters estimate the number of objects corresponding to a selected +category, based on only few or no exemplars annotated in the image. The current +state-of-the-art estimates the total counts as the sum over the object location +density map, but does not provide individual object locations and sizes, which +are crucial for many applications. This is addressed by detection-based +counters, which, however fall behind in the total count accuracy. Furthermore, +both approaches tend to overestimate the counts in the presence of other object +classes due to many false positives. We propose DAVE, a low-shot counter based +on a detect-and-verify paradigm, that avoids the aforementioned issues by first +generating a high-recall detection set and then verifying the detections to +identify and remove the outliers. This jointly increases the recall and +precision, leading to accurate counts. DAVE outperforms the top density-based +counters by ~20% in the total count MAE, it outperforms the most recent +detection-based counter by ~20% in detection quality and sets a new +state-of-the-art in zero-shot as well as text-prompt-based counting.",cs.CV,['cs.CV'] +SCULPT: Shape-Conditioned Unpaired Learning of Pose-dependent Clothed and Textured Human Meshes,Soubhik Sanyal · Partha Ghosh · Jinlong Yang · Michael J. Black · Justus Thies · Timo Bolkart,https://sculpt.is.tue.mpg.de/,https://arxiv.org/html/2308.10638v2,,2308.10638v2.pdf,SCULPT: Shape-Conditioned Unpaired Learning of Pose-dependent Clothed and Textured Human Meshes,"We present SCULPT, a novel 3D generative model for clothed and textured 3D +meshes of humans. Specifically, we devise a deep neural network that learns to +represent the geometry and appearance distribution of clothed human bodies. +Training such a model is challenging, as datasets of textured 3D meshes for +humans are limited in size and accessibility. Our key observation is that there +exist medium-sized 3D scan datasets like CAPE, as well as large-scale 2D image +datasets of clothed humans and multiple appearances can be mapped to a single +geometry. To effectively learn from the two data modalities, we propose an +unpaired learning procedure for pose-dependent clothed and textured human +meshes. Specifically, we learn a pose-dependent geometry space from 3D scan +data. We represent this as per vertex displacements w.r.t. the SMPL model. +Next, we train a geometry conditioned texture generator in an unsupervised way +using the 2D image data. We use intermediate activations of the learned +geometry model to condition our texture generator. To alleviate entanglement +between pose and clothing type, and pose and clothing appearance, we condition +both the texture and geometry generators with attribute labels such as clothing +types for the geometry, and clothing colors for the texture generator. 
We +automatically generated these conditioning labels for the 2D images based on +the visual question answering model BLIP and CLIP. We validate our method on +the SCULPT dataset, and compare to state-of-the-art 3D generative models for +clothed human bodies. Our code and data can be found at +https://sculpt.is.tue.mpg.de.",cs.CV,"['cs.CV', 'cs.AI', 'cs.GR', 'cs.LG']" +Brush2Prompt: Contextual Prompt Generator for Object Inpainting,Mang Tik Chiu · Yuqian Zhou · Lingzhi Zhang · Zhe Lin · Connelly Barnes · Sohrab Amirghodsi · Eli Shechtman · Humphrey Shi, ,https://ar5iv.labs.arxiv.org/html/2204.07845,,2204.07845.pdf,Shape-guided Object Inpainting,"Previous works on image inpainting mainly focus on inpainting background or +partially missing objects, while the problem of inpainting an entire missing +object remains unexplored. This work studies a new image inpainting task, i.e. +shape-guided object inpainting. Given an incomplete input image, the goal is to +fill in the hole by generating an object based on the context and implicit +guidance given by the hole shape. Since previous methods for image inpainting +are mainly designed for background inpainting, they are not suitable for this +task. Therefore, we propose a new data preparation method and a novel +Contextual Object Generator (CogNet) for the object inpainting task. On the +data side, we incorporate object priors into training data by using object +instances as holes. The CogNet has a two-stream architecture that combines the +standard bottom-up image completion process with a top-down object generation +process. A predictive class embedding module bridges the two streams by +predicting the class of the missing object from the bottom-up features, from +which a semantic object map is derived as the input of the top-down stream. +Experiments demonstrate that the proposed method can generate realistic objects +that fit the context in terms of both visual appearance and semantic meanings. +Code can be found at the project page +\url{https://zengxianyu.github.io/objpaint}",cs.CV,"['cs.CV', 'cs.MM']" +AV-RIR: Audio-Visual Room Impulse Response Estimation,Anton Ratnarajah · Sreyan Ghosh · Sonal Kumar · Purva Chiniya · Dinesh Manocha,https://anton-jeran.github.io/AVRIR/,https://arxiv.org/abs/2312.00834,,2312.00834.pdf,AV-RIR: Audio-Visual Room Impulse Response Estimation,"Accurate estimation of Room Impulse Response (RIR), which captures an +environment's acoustic properties, is important for speech processing and AR/VR +applications. We propose AV-RIR, a novel multi-modal multi-task learning +approach to accurately estimate the RIR from a given reverberant speech signal +and the visual cues of its corresponding environment. AV-RIR builds on a novel +neural codec-based architecture that effectively captures environment geometry +and materials properties and solves speech dereverberation as an auxiliary task +by using multi-task learning. We also propose Geo-Mat features that augment +material information into visual cues and CRIP that improves late reverberation +components in the estimated RIR via image-to-RIR retrieval by 86%. Empirical +results show that AV-RIR quantitatively outperforms previous audio-only and +visual-only approaches by achieving 36% - 63% improvement across various +acoustic metrics in RIR estimation. Additionally, it also achieves higher +preference scores in human evaluation. 
As an auxiliary benefit, dereverbed +speech from AV-RIR shows competitive performance with the state-of-the-art in +various spoken language processing tasks and outperforms reverberation time +error score in the real-world AVSpeech dataset. Qualitative examples of both +synthesized reverberant speech and enhanced speech can be found at +https://www.youtube.com/watch?v=tTsKhviukAE.",cs.SD,"['cs.SD', 'cs.CV']" +DreamMatcher: Appearance Matching Self-Attention for Semantically-Consistent Text-to-Image Personalization,Jisu Nam · Heesu Kim · DongJae Lee · Siyoon Jin · Seungryong Kim · Seunggyu Chang, ,https://arxiv.org/abs/2402.09812,,2402.09812.pdf,DreamMatcher: Appearance Matching Self-Attention for Semantically-Consistent Text-to-Image Personalization,"The objective of text-to-image (T2I) personalization is to customize a +diffusion model to a user-provided reference concept, generating diverse images +of the concept aligned with the target prompts. Conventional methods +representing the reference concepts using unique text embeddings often fail to +accurately mimic the appearance of the reference. To address this, one solution +may be explicitly conditioning the reference images into the target denoising +process, known as key-value replacement. However, prior works are constrained +to local editing since they disrupt the structure path of the pre-trained T2I +model. To overcome this, we propose a novel plug-in method, called +DreamMatcher, which reformulates T2I personalization as semantic matching. +Specifically, DreamMatcher replaces the target values with reference values +aligned by semantic matching, while leaving the structure path unchanged to +preserve the versatile capability of pre-trained T2I models for generating +diverse structures. We also introduce a semantic-consistent masking strategy to +isolate the personalized concept from irrelevant regions introduced by the +target prompts. Compatible with existing T2I models, DreamMatcher shows +significant improvements in complex scenarios. Intensive analyses demonstrate +the effectiveness of our approach.",cs.CV,['cs.CV'] +"FSRT: Facial Scene Representation Transformer for Face Reenactment from Factorized Appearance, Head-pose, and Facial Expression Features",Andre Rochow · Max Schwarz · Sven Behnke,https://andrerochow.github.io/fsrt,https://arxiv.org/abs/2404.09736,,2404.09736.pdf,"FSRT: Facial Scene Representation Transformer for Face Reenactment from Factorized Appearance, Head-pose, and Facial Expression Features","The task of face reenactment is to transfer the head motion and facial +expressions from a driving video to the appearance of a source image, which may +be of a different person (cross-reenactment). Most existing methods are +CNN-based and estimate optical flow from the source image to the current +driving frame, which is then inpainted and refined to produce the output +animation. We propose a transformer-based encoder for computing a set-latent +representation of the source image(s). We then predict the output color of a +query pixel using a transformer-based decoder, which is conditioned with +keypoints and a facial expression vector extracted from the driving frame. +Latent representations of the source person are learned in a self-supervised +manner that factorize their appearance, head pose, and facial expressions. +Thus, they are perfectly suited for cross-reenactment. 
In contrast to most +related work, our method naturally extends to multiple source images and can +thus adapt to person-specific facial dynamics. We also propose data +augmentation and regularization schemes that are necessary to prevent +overfitting and support generalizability of the learned representations. We +evaluated our approach in a randomized user study. The results indicate +superior performance compared to the state-of-the-art in terms of motion +transfer quality and temporal consistency.",cs.CV,['cs.CV'] +Attention-Driven Training-Free Efficiency Enhancement of Diffusion Models,Hongjie Wang · Difan Liu · Yan Kang · Yijun Li · Zhe Lin · Niraj Jha · Yuchen Liu,https://atedm.github.io/,https://arxiv.org/abs/2405.05252,,2405.05252.pdf,Attention-Driven Training-Free Efficiency Enhancement of Diffusion Models,"Diffusion Models (DMs) have exhibited superior performance in generating +high-quality and diverse images. However, this exceptional performance comes at +the cost of expensive architectural design, particularly due to the attention +module heavily used in leading models. Existing works mainly adopt a retraining +process to enhance DM efficiency. This is computationally expensive and not +very scalable. To this end, we introduce the Attention-driven Training-free +Efficient Diffusion Model (AT-EDM) framework that leverages attention maps to +perform run-time pruning of redundant tokens, without the need for any +retraining. Specifically, for single-denoising-step pruning, we develop a novel +ranking algorithm, Generalized Weighted Page Rank (G-WPR), to identify +redundant tokens, and a similarity-based recovery method to restore tokens for +the convolution operation. In addition, we propose a Denoising-Steps-Aware +Pruning (DSAP) approach to adjust the pruning budget across different denoising +timesteps for better generation quality. Extensive evaluations show that AT-EDM +performs favorably against prior art in terms of efficiency (e.g., 38.8% FLOPs +saving and up to 1.53x speed-up over Stable Diffusion XL) while maintaining +nearly the same FID and CLIP scores as the full model. Project webpage: +https://atedm.github.io.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG', 'eess.IV', 'eess.SP']" +Laplacian-guided Entropy Model in Neural Codec with Blur-dissipated Synthesis,Atefeh Khoshkhahtinat · Ali Zafari · Piyush Mehta · Nasser Nasrabadi, ,https://arxiv.org/abs/2403.16258,,2403.16258.pdf,Laplacian-guided Entropy Model in Neural Codec with Blur-dissipated Synthesis,"While replacing Gaussian decoders with a conditional diffusion model enhances +the perceptual quality of reconstructions in neural image compression, their +lack of inductive bias for image data restricts their ability to achieve +state-of-the-art perceptual levels. To address this limitation, we adopt a +non-isotropic diffusion model at the decoder side. This model imposes an +inductive bias aimed at distinguishing between frequency contents, thereby +facilitating the generation of high-quality images. Moreover, our framework is +equipped with a novel entropy model that accurately models the probability +distribution of latent representation by exploiting spatio-channel correlations +in latent space, while accelerating the entropy decoding step. This +channel-wise entropy model leverages both local and global spatial contexts +within each channel chunk. The global spatial context is built upon the +Transformer, which is specifically designed for image compression tasks. 
The +designed Transformer employs a Laplacian-shaped positional encoding, the +learnable parameters of which are adaptively adjusted for each channel cluster. +Our experiments demonstrate that our proposed framework yields better +perceptual quality compared to cutting-edge generative-based codecs, and the +proposed entropy model contributes to notable bitrate savings.",eess.IV,"['eess.IV', 'cs.CV', 'cs.IT', 'cs.LG', 'math.IT']" +Tackling the Singularities at the Endpoints of Time Intervals in Diffusion Models,Pengze Zhang · Hubery Yin · Chen Li · Xiaohua Xie,https://pangzecheung.github.io/SingDiffusion/,https://arxiv.org/abs/2403.08381,,2403.08381.pdf,Tackling the Singularities at the Endpoints of Time Intervals in Diffusion Models,"Most diffusion models assume that the reverse process adheres to a Gaussian +distribution. However, this approximation has not been rigorously validated, +especially at singularities, where t=0 and t=1. Improperly dealing with such +singularities leads to an average brightness issue in applications, and limits +the generation of images with extreme brightness or darkness. We primarily +focus on tackling singularities from both theoretical and practical +perspectives. Initially, we establish the error bounds for the reverse process +approximation, and showcase its Gaussian characteristics at singularity time +steps. Based on this theoretical insight, we confirm the singularity at t=1 is +conditionally removable while it at t=0 is an inherent property. Upon these +significant conclusions, we propose a novel plug-and-play method SingDiffusion +to address the initial singular time step sampling, which not only effectively +resolves the average brightness issue for a wide range of diffusion models +without extra training efforts, but also enhances their generation capability +in achieving notable lower FID scores.",cs.CV,['cs.CV'] +"Looking Similar, Sounding Different: Leveraging Counterfactual Cross-Modal Pairs for Audiovisual Representation Learning",Nikhil Singh · Chih-Wei Wu · Iroro Orife · Kalayeh, ,https://arxiv.org/abs/2404.17753,,,Leveraging Cross-Modal Neighbor Representation for Improved CLIP Classification,"CLIP showcases exceptional cross-modal matching capabilities due to its +training on image-text contrastive learning tasks. However, without specific +optimization for unimodal scenarios, its performance in single-modality feature +extraction might be suboptimal. Despite this, some studies have directly used +CLIP's image encoder for tasks like few-shot classification, introducing a +misalignment between its pre-training objectives and feature extraction +methods. This inconsistency can diminish the quality of the image's feature +representation, adversely affecting CLIP's effectiveness in target tasks. In +this paper, we view text features as precise neighbors of image features in +CLIP's space and present a novel CrOss-moDal nEighbor Representation(CODER) +based on the distance structure between images and their neighbor texts. This +feature extraction method aligns better with CLIP's pre-training objectives, +thereby fully leveraging CLIP's robust cross-modal capabilities. The key to +construct a high-quality CODER lies in how to create a vast amount of +high-quality and diverse texts to match with images. We introduce the Auto Text +Generator(ATG) to automatically generate the required texts in a data-free and +training-free manner. We apply CODER to CLIP's zero-shot and few-shot image +classification tasks. 
Experiment results across various datasets and models +confirm CODER's effectiveness. Code is available +at:https://github.com/YCaigogogo/CVPR24-CODER.",cs.CV,"['cs.CV', 'cs.AI']" +MaGGIe: Masked Guided Gradual Human Instance Matting,Chuong Huynh · Seoung Wug Oh · Abhinav Shrivastava · Joon-Young Lee,https://maggie-matt.github.io,https://arxiv.org/abs/2404.16035v1,,2404.16035v1.pdf,MaGGIe: Masked Guided Gradual Human Instance Matting,"Human matting is a foundation task in image and video processing, where human +foreground pixels are extracted from the input. Prior works either improve the +accuracy by additional guidance or improve the temporal consistency of a single +instance across frames. We propose a new framework MaGGIe, Masked Guided +Gradual Human Instance Matting, which predicts alpha mattes progressively for +each human instances while maintaining the computational cost, precision, and +consistency. Our method leverages modern architectures, including transformer +attention and sparse convolution, to output all instance mattes simultaneously +without exploding memory and latency. Although keeping constant inference costs +in the multiple-instance scenario, our framework achieves robust and versatile +performance on our proposed synthesized benchmarks. With the higher quality +image and video matting benchmarks, the novel multi-instance synthesis approach +from publicly available sources is introduced to increase the generalization of +models in real-world scenarios.",cs.CV,"['cs.CV', 'cs.AI']" +Aligning and Prompting Everything All at Once for Universal Visual Perception,Yunhang Shen · Chaoyou Fu · Peixian Chen · Mengdan Zhang · Ke Li · Xing Sun · Yunsheng Wu · Shaohui Lin · Rongrong Ji, ,https://arxiv.org/abs/2312.02153v1,,2312.02153v1.pdf,Aligning and Prompting Everything All at Once for Universal Visual Perception,"Vision foundation models have been explored recently to build general-purpose +vision systems. However, predominant paradigms, driven by casting +instance-level tasks as an object-word alignment, bring heavy cross-modality +interaction, which is not effective in prompting object detection and visual +grounding. Another line of work that focuses on pixel-level tasks often +encounters a large annotation gap of things and stuff, and suffers from mutual +interference between foreground-object and background-class segmentation. In +stark contrast to the prevailing methods, we present APE, a universal visual +perception model for aligning and prompting everything all at once in an image +to perform diverse tasks, i.e., detection, segmentation, and grounding, as an +instance-level sentence-object matching paradigm. Specifically, APE advances +the convergence of detection and grounding by reformulating language-guided +grounding as open-vocabulary detection, which efficiently scales up model +prompting to thousands of category vocabularies and region descriptions while +maintaining the effectiveness of cross-modality fusion. To bridge the +granularity gap of different pixel-level tasks, APE equalizes semantic and +panoptic segmentation to proxy instance learning by considering any isolated +regions as individual instances. APE aligns vision and language representation +on broad data with natural and challenging characteristics all at once without +task-specific fine-tuning. 
The extensive experiments on over 160 datasets +demonstrate that, with only one-suit of weights, APE outperforms (or is on par +with) the state-of-the-art models, proving that an effective yet universal +perception for anything aligning and prompting is indeed feasible. Codes and +trained models are released at https://github.com/shenyunhang/APE.",cs.CV,['cs.CV'] +A General and Efficient Training for Transformer via Token Expansion,Wenxuan Huang · Yunhang Shen · Jiao Xie · Baochang Zhang · Gaoqi He · Ke Li · Xing Sun · Shaohui Lin, ,https://arxiv.org/abs/2404.00672v1,,2404.00672v1.pdf,A General and Efficient Training for Transformer via Token Expansion,"The remarkable performance of Vision Transformers (ViTs) typically requires +an extremely large training cost. Existing methods have attempted to accelerate +the training of ViTs, yet typically disregard method universality with accuracy +dropping. Meanwhile, they break the training consistency of the original +transformers, including the consistency of hyper-parameters, architecture, and +strategy, which prevents them from being widely applied to different +Transformer networks. In this paper, we propose a novel token growth scheme +Token Expansion (termed ToE) to achieve consistent training acceleration for +ViTs. We introduce an ""initialization-expansion-merging"" pipeline to maintain +the integrity of the intermediate feature distribution of original +transformers, preventing the loss of crucial learnable information in the +training process. ToE can not only be seamlessly integrated into the training +and fine-tuning process of transformers (e.g., DeiT and LV-ViT), but also +effective for efficient training frameworks (e.g., EfficientTrain), without +twisting the original training hyper-parameters, architecture, and introducing +additional training strategies. Extensive experiments demonstrate that ToE +achieves about 1.3x faster for the training of ViTs in a lossless manner, or +even with performance gains over the full-token training baselines. Code is +available at https://github.com/Osilly/TokenExpansion .",cs.LG,"['cs.LG', 'cs.AI', 'cs.CL', 'cs.CV']" +RankED: Addressing Imbalance and Uncertainty in Edge Detection Using Ranking-based Losses,bedrettin cetinkaya · Sinan Kalkan · Emre Akbas,https://ranked-cvpr24.github.io/,https://arxiv.org/abs/2403.01795,,2403.01795.pdf,RankED: Addressing Imbalance and Uncertainty in Edge Detection Using Ranking-based Losses,"Detecting edges in images suffers from the problems of (P1) heavy imbalance +between positive and negative classes as well as (P2) label uncertainty owing +to disagreement between different annotators. Existing solutions address P1 +using class-balanced cross-entropy loss and dice loss and P2 by only predicting +edges agreed upon by most annotators. In this paper, we propose RankED, a +unified ranking-based approach that addresses both the imbalance problem (P1) +and the uncertainty problem (P2). RankED tackles these two problems with two +components: One component which ranks positive pixels over negative pixels, and +the second which promotes high confidence edge pixels to have more label +certainty. We show that RankED outperforms previous studies and sets a new +state-of-the-art on NYUD-v2, BSDS500 and Multi-cue datasets. 
Code is available +at https://ranked-cvpr24.github.io.",cs.CV,['cs.CV'] +Solving the Catastrophic Forgetting Problem in Generalized Category Discovery,Xinzi Cao · Xiawu Zheng · Guanhong Wang · Weijiang Yu · Yunhang Shen · Ke Li · Yutong Lu · Yonghong Tian, ,https://arxiv.org/abs/2308.12112,,2308.12112.pdf,Generalized Continual Category Discovery,"Most of Continual Learning (CL) methods push the limit of supervised learning +settings, where an agent is expected to learn new labeled tasks and not forget +previous knowledge. However, these settings are not well aligned with real-life +scenarios, where a learning agent has access to a vast amount of unlabeled data +encompassing both novel (entirely unlabeled) classes and examples from known +classes. Drawing inspiration from Generalized Category Discovery (GCD), we +introduce a novel framework that relaxes this assumption. Precisely, in any +task, we allow for the existence of novel and known classes, and one must use +continual version of unsupervised learning methods to discover them. We call +this setting Generalized Continual Category Discovery (GCCD). It unifies CL and +GCD, bridging the gap between synthetic benchmarks and real-life scenarios. +With a series of experiments, we present that existing methods fail to +accumulate knowledge from subsequent tasks in which unlabeled samples of novel +classes are present. In light of these limitations, we propose a method that +incorporates both supervised and unsupervised signals and mitigates the +forgetting through the use of centroid adaptation. Our method surpasses strong +CL methods adopted for GCD techniques and presents a superior representation +learning performance.",cs.LG,"['cs.LG', 'cs.CV']" +Resurrecting Old Classes with New Data for Exemplar-Free Continual Learning,Dipam Goswami · Albin Soutif · Yuyang Liu · Sandesh Kamath · Bartłomiej Twardowski · Joost van de Weijer,https://github.com/dipamgoswami/ADC,https://arxiv.org/abs/2405.19074,,2405.19074.pdf,Resurrecting Old Classes with New Data for Exemplar-Free Continual Learning,"Continual learning methods are known to suffer from catastrophic forgetting, +a phenomenon that is particularly hard to counter for methods that do not store +exemplars of previous tasks. Therefore, to reduce potential drift in the +feature extractor, existing exemplar-free methods are typically evaluated in +settings where the first task is significantly larger than subsequent tasks. +Their performance drops drastically in more challenging settings starting with +a smaller first task. To address this problem of feature drift estimation for +exemplar-free methods, we propose to adversarially perturb the current samples +such that their embeddings are close to the old class prototypes in the old +model embedding space. We then estimate the drift in the embedding space from +the old to the new model using the perturbed images and compensate the +prototypes accordingly. We exploit the fact that adversarial samples are +transferable from the old to the new feature space in a continual learning +setting. The generation of these images is simple and computationally cheap. We +demonstrate in our experiments that the proposed approach better tracks the +movement of prototypes in embedding space and outperforms existing methods on +several standard continual learning benchmarks as well as on fine-grained +datasets. 
Code is available at https://github.com/dipamgoswami/ADC.",cs.CV,"['cs.CV', 'cs.AI']" +Can Protective Perturbation Safeguard Personal Data from Being Exploited by Stable Diffusion?,Zhengyue Zhao · Jinhao Duan · Kaidi Xu · Chenan Wang · Rui Zhang · Zidong Du · Qi Guo · Xing Hu, ,https://arxiv.org/abs/2312.00084,,2312.00084.pdf,Can Protective Perturbation Safeguard Personal Data from Being Exploited by Stable Diffusion?,"Stable Diffusion has established itself as a foundation model in generative +AI artistic applications, receiving widespread research and application. Some +recent fine-tuning methods have made it feasible for individuals to implant +personalized concepts onto the basic Stable Diffusion model with minimal +computational costs on small datasets. However, these innovations have also +given rise to issues like facial privacy forgery and artistic copyright +infringement. In recent studies, researchers have explored the addition of +imperceptible adversarial perturbations to images to prevent potential +unauthorized exploitation and infringements when personal data is used for +fine-tuning Stable Diffusion. Although these studies have demonstrated the +ability to protect images, it is essential to consider that these methods may +not be entirely applicable in real-world scenarios. In this paper, we +systematically evaluate the use of perturbations to protect images within a +practical threat model. The results suggest that these approaches may not be +sufficient to safeguard image privacy and copyright effectively. Furthermore, +we introduce a purification method capable of removing protected perturbations +while preserving the original image structure to the greatest extent possible. +Experiments reveal that Stable Diffusion can effectively learn from purified +images over all protective methods.",cs.CV,['cs.CV'] +MoST: Multi-modality Scene Tokenization for Motion Prediction,Norman Mu · Jingwei Ji · Zhenpei Yang · Nathan Harada · Haotian Tang · Kan Chen · Charles R. Qi · Runzhou Ge · Kratarth Goel · Zoey Yang · Scott Ettinger · Rami Al-Rfou · Dragomir Anguelov · Yin Zhou, ,http://export.arxiv.org/abs/2404.19531,,2404.19531.pdf,MoST: Multi-modality Scene Tokenization for Motion Prediction,"Many existing motion prediction approaches rely on symbolic perception +outputs to generate agent trajectories, such as bounding boxes, road graph +information and traffic lights. This symbolic representation is a high-level +abstraction of the real world, which may render the motion prediction model +vulnerable to perception errors (e.g., failures in detecting open-vocabulary +obstacles) while missing salient information from the scene context (e.g., poor +road conditions). An alternative paradigm is end-to-end learning from raw +sensors. However, this approach suffers from the lack of interpretability and +requires significantly more training resources. In this work, we propose +tokenizing the visual world into a compact set of scene elements and then +leveraging pre-trained image foundation models and LiDAR neural networks to +encode all the scene elements in an open-vocabulary manner. The image +foundation model enables our scene tokens to encode the general knowledge of +the open world while the LiDAR neural network encodes geometry information. Our +proposed representation can efficiently encode the multi-frame multi-modality +observations with a few hundred tokens and is compatible with most +transformer-based architectures. 
To evaluate our method, we have augmented +Waymo Open Motion Dataset with camera embeddings. Experiments over Waymo Open +Motion Dataset show that our approach leads to significant performance +improvements over the state-of-the-art.",cs.CV,['cs.CV'] +Task-Driven Wavelets using Constrained Empirical Risk Minimization,Eric Marcus · Ray Sheombarsing · Jan-Jakob Sonke · Jonas Teuwen,https://github.com/NKI-AI/CERM,,https://aiforoncology.nl/news/2024-02-27/two-papers-accepted-at-cvpr-2024/,,,,,nan +Insights from the Use of Previously Unseen Neural Architecture Search Datasets,Rob Geada · David Towers · Matthew Forshaw · Amir Atapour-Abarghouei · Stephen McGough,https://github.com/Towers-D/NAS-Unseen-Datasets,https://arxiv.org/abs/2404.02189,,2404.02189.pdf,Insights from the Use of Previously Unseen Neural Architecture Search Datasets,"The boundless possibility of neural networks which can be used to solve a +problem -- each with different performance -- leads to a situation where a Deep +Learning expert is required to identify the best neural network. This goes +against the hope of removing the need for experts. Neural Architecture Search +(NAS) offers a solution to this by automatically identifying the best +architecture. However, to date, NAS work has focused on a small set of datasets +which we argue are not representative of real-world problems. We introduce +eight new datasets created for a series of NAS Challenges: AddNIST, Language, +MultNIST, CIFARTile, Gutenberg, Isabella, GeoClassing, and Chesseract. These +datasets and challenges are developed to direct attention to issues in NAS +development and to encourage authors to consider how their models will perform +on datasets unknown to them at development time. We present experimentation +using standard Deep Learning methods as well as the best results from challenge +participants.",cs.LG,"['cs.LG', 'cs.AI', 'cs.CV']" +Low-Rank Rescaled Vision Transformer Fine-Tuning: A Residual Design Approach,Wei Dong · Xing Zhang · Bihui Chen · Dawei Yan · Zhijun Lin · Qingsen Yan · Peng Wang · Yang Yang, ,https://arxiv.org/abs/2403.19067,,2403.19067.pdf,Low-Rank Rescaled Vision Transformer Fine-Tuning: A Residual Design Approach,"Parameter-efficient fine-tuning for pre-trained Vision Transformers aims to +adeptly tailor a model to downstream tasks by learning a minimal set of new +adaptation parameters while preserving the frozen majority of pre-trained +parameters. Striking a balance between retaining the generalizable +representation capacity of the pre-trained model and acquiring task-specific +features poses a key challenge. Currently, there is a lack of focus on guiding +this delicate trade-off. In this study, we approach the problem from the +perspective of Singular Value Decomposition (SVD) of pre-trained parameter +matrices, providing insights into the tuning dynamics of existing methods. +Building upon this understanding, we propose a Residual-based Low-Rank +Rescaling (RLRR) fine-tuning strategy. This strategy not only enhances +flexibility in parameter tuning but also ensures that new parameters do not +deviate excessively from the pre-trained model through a residual design. +Extensive experiments demonstrate that our method achieves competitive +performance across various downstream image classification tasks, all while +maintaining comparable new parameters. 
We believe this work takes a step +forward in offering a unified perspective for interpreting existing methods and +serves as motivation for the development of new approaches that move closer to +effectively considering the crucial trade-off mentioned above. Our code is +available at +\href{https://github.com/zstarN70/RLRR.git}{https://github.com/zstarN70/RLRR.git}.",cs.CV,['cs.CV'] +Kandinsky Conformal Prediction: Efficient Calibration of Image Segmentation Algorithms,Joren Brunekreef · Eric Marcus · Ray Sheombarsing · Jan-Jakob Sonke · Jonas Teuwen,https://github.com/NKI-AI/kandinsky-calibration,https://arxiv.org/abs/2311.11837v1,,2311.11837v1.pdf,Kandinsky Conformal Prediction: Efficient Calibration of Image Segmentation Algorithms,"Image segmentation algorithms can be understood as a collection of pixel +classifiers, for which the outcomes of nearby pixels are correlated. Classifier +models can be calibrated using Inductive Conformal Prediction, but this +requires holding back a sufficiently large calibration dataset for computing +the distribution of non-conformity scores of the model's predictions. If one +only requires only marginal calibration on the image level, this calibration +set consists of all individual pixels in the images available for calibration. +However, if the goal is to attain proper calibration for each individual pixel +classifier, the calibration set consists of individual images. In a scenario +where data are scarce (such as the medical domain), it may not always be +possible to set aside sufficiently many images for this pixel-level +calibration. The method we propose, dubbed ``Kandinsky calibration'', makes use +of the spatial structure present in the distribution of natural images to +simultaneously calibrate the classifiers of ``similar'' pixels. This can be +seen as an intermediate approach between marginal (imagewise) and conditional +(pixelwise) calibration, where non-conformity scores are aggregated over +similar image regions, thereby making more efficient use of the images +available for calibration. We run experiments on segmentation algorithms +trained and calibrated on subsets of the public MS-COCO and Medical Decathlon +datasets, demonstrating that Kandinsky calibration method can significantly +improve the coverage. When compared to both pixelwise and imagewise calibration +on little data, the Kandinsky method achieves much lower coverage errors, +indicating the data efficiency of the Kandinsky calibration.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +OAKINK2: A Dataset of Bimanual Hands-Object Manipulation in Complex Task Completion,Xinyu Zhan · Lixin Yang · Yifei Zhao · Kangrui Mao · Hanlin Xu · Zenan Lin · Kailin Li · Cewu Lu, ,,https://paperswithcode.com/paper/oakink2-a-dataset-of-bimanual-hands-object,,,,,nan +What Moves Together Belongs Together,Jenny Seidenschwarz · Aljoša Ošep · Francesco Ferroni · Simon Lucey · Laura Leal-Taixe,https://research.nvidia.com/labs/dvl/projects/semoli/,https://arxiv.org/abs/2402.19463,,2402.19463.pdf,SeMoLi: What Moves Together Belongs Together,"We tackle semi-supervised object detection based on motion cues. Recent +results suggest that heuristic-based clustering methods in conjunction with +object trackers can be used to pseudo-label instances of moving objects and use +these as supervisory signals to train 3D object detectors in Lidar data without +manual supervision. 
We re-think this approach and suggest that both, object +detection, as well as motion-inspired pseudo-labeling, can be tackled in a +data-driven manner. We leverage recent advances in scene flow estimation to +obtain point trajectories from which we extract long-term, class-agnostic +motion patterns. Revisiting correlation clustering in the context of message +passing networks, we learn to group those motion patterns to cluster points to +object instances. By estimating the full extent of the objects, we obtain +per-scan 3D bounding boxes that we use to supervise a Lidar object detection +network. Our method not only outperforms prior heuristic-based approaches (57.5 +AP, +14 improvement over prior work), more importantly, we show we can +pseudo-label and train object detectors across datasets.",cs.CV,['cs.CV'] +Stratified Avatar Generation from Sparse Observations,Han Feng · Wenchao Ma · Quankai Gao · Xianwei Zheng · Nan Xue · Huijuan Xu, ,https://arxiv.org/abs/2405.20786,,2405.20786.pdf,Stratified Avatar Generation from Sparse Observations,"Estimating 3D full-body avatars from AR/VR devices is essential for creating +immersive experiences in AR/VR applications. This task is challenging due to +the limited input from Head Mounted Devices, which capture only sparse +observations from the head and hands. Predicting the full-body avatars, +particularly the lower body, from these sparse observations presents +significant difficulties. In this paper, we are inspired by the inherent +property of the kinematic tree defined in the Skinned Multi-Person Linear +(SMPL) model, where the upper body and lower body share only one common +ancestor node, bringing the potential of decoupled reconstruction. We propose a +stratified approach to decouple the conventional full-body avatar +reconstruction pipeline into two stages, with the reconstruction of the upper +body first and a subsequent reconstruction of the lower body conditioned on the +previous stage. To implement this straightforward idea, we leverage the latent +diffusion model as a powerful probabilistic generator, and train it to follow +the latent distribution of decoupled motions explored by a VQ-VAE +encoder-decoder model. Extensive experiments on AMASS mocap dataset demonstrate +our state-of-the-art performance in the reconstruction of full-body motions.",cs.CV,"['cs.CV', 'cs.HC']" +VLP: Vision Language Planning for Autonomous Driving,Chenbin Pan · Burhaneddin Yaman · Tommaso Nesti · Abhirup Mallik · Alessandro G Allievi · Senem Velipasalar · Liu Ren, ,https://arxiv.org/abs/2401.05577,,2401.05577.pdf,VLP: Vision Language Planning for Autonomous Driving,"Autonomous driving is a complex and challenging task that aims at safe motion +planning through scene understanding and reasoning. While vision-only +autonomous driving methods have recently achieved notable performance, through +enhanced scene understanding, several key issues, including lack of reasoning, +low generalization performance and long-tail scenarios, still need to be +addressed. In this paper, we present VLP, a novel Vision-Language-Planning +framework that exploits language models to bridge the gap between linguistic +understanding and autonomous driving. VLP enhances autonomous driving systems +by strengthening both the source memory foundation and the self-driving car's +contextual understanding. 
VLP achieves state-of-the-art end-to-end planning +performance on the challenging NuScenes dataset by achieving 35.9\% and 60.5\% +reduction in terms of average L2 error and collision rates, respectively, +compared to the previous best method. Moreover, VLP shows improved performance +in challenging long-tail scenarios and strong generalization capabilities when +faced with new urban environments.",cs.CV,['cs.CV'] +Omni-Q: Omni-Directional Scene Understanding for Unsupervised Visual Grounding,Sai Wang · Yutian Lin · Yu Wu, ,https://ar5iv.labs.arxiv.org/html/2312.09625,,2312.09625.pdf,Weakly-Supervised 3D Visual Grounding based on Visual Linguistic Alignment,"Learning to ground natural language queries to target objects or regions in +3D point clouds is quite essential for 3D scene understanding. Nevertheless, +existing 3D visual grounding approaches require a substantial number of +bounding box annotations for text queries, which is time-consuming and +labor-intensive to obtain. In this paper, we propose \textbf{3D-VLA}, a weakly +supervised approach for \textbf{3D} visual grounding based on \textbf{V}isual +\textbf{L}inguistic \textbf{A}lignment. Our 3D-VLA exploits the superior +ability of current large-scale vision-language models (VLMs) on aligning the +semantics between texts and 2D images, as well as the naturally existing +correspondences between 2D images and 3D point clouds, and thus implicitly +constructs correspondences between texts and 3D point clouds with no need for +fine-grained box annotations in the training procedure. During the inference +stage, the learned text-3D correspondence will help us ground the text queries +to the 3D target objects even without 2D images. To the best of our knowledge, +this is the first work to investigate 3D visual grounding in a weakly +supervised manner by involving large scale vision-language models, and +extensive experiments on ReferIt3D and ScanRefer datasets demonstrate that our +3D-VLA achieves comparable and even superior results over the fully supervised +methods.",cs.CV,"['cs.CV', 'cs.CL']" +3D Building Reconstruction from Monocular Remote Sensing Images with Multi-level Supervisions,Weijia Li · Haote Yang · Zhenghao Hu · Juepeng Zheng · Gui-Song Xia · Conghui He, ,https://arxiv.org/abs/2404.04823,,2404.04823.pdf,3D Building Reconstruction from Monocular Remote Sensing Images with Multi-level Supervisions,"3D building reconstruction from monocular remote sensing images is an +important and challenging research problem that has received increasing +attention in recent years, owing to its low cost of data acquisition and +availability for large-scale applications. However, existing methods rely on +expensive 3D-annotated samples for fully-supervised training, restricting their +application to large-scale cross-city scenarios. In this work, we propose +MLS-BRN, a multi-level supervised building reconstruction network that can +flexibly utilize training samples with different annotation levels to achieve +better reconstruction results in an end-to-end manner. To alleviate the demand +on full 3D supervision, we design two new modules, Pseudo Building Bbox +Calculator and Roof-Offset guided Footprint Extractor, as well as new tasks and +training strategies for different types of samples. 
Experimental results on +several public and new datasets demonstrate that our proposed MLS-BRN achieves +competitive performance using much fewer 3D-annotated samples, and +significantly improves the footprint extraction and 3D reconstruction +performance compared with current state-of-the-art. The code and datasets of +this work will be released at https://github.com/opendatalab/MLS-BRN.git.",cs.CV,['cs.CV'] +SeaBird: Segmentation in Bird’s View with Dice Loss Improves Monocular 3D Detection of Large Objects,Abhinav Kumar · Yuliang Guo · Xinyu Huang · Liu Ren · Xiaoming Liu,https://github.com/abhi1kumar/SeaBird,https://arxiv.org/abs/2403.20318,,2403.20318.pdf,SeaBird: Segmentation in Bird's View with Dice Loss Improves Monocular 3D Detection of Large Objects,"Monocular 3D detectors achieve remarkable performance on cars and smaller +objects. However, their performance drops on larger objects, leading to fatal +accidents. Some attribute the failures to training data scarcity or their +receptive field requirements of large objects. In this paper, we highlight this +understudied problem of generalization to large objects. We find that modern +frontal detectors struggle to generalize to large objects even on nearly +balanced datasets. We argue that the cause of failure is the sensitivity of +depth regression losses to noise of larger objects. To bridge this gap, we +comprehensively investigate regression and dice losses, examining their +robustness under varying error levels and object sizes. We mathematically prove +that the dice loss leads to superior noise-robustness and model convergence for +large objects compared to regression losses for a simplified case. Leveraging +our theoretical insights, we propose SeaBird (Segmentation in Bird's View) as +the first step towards generalizing to large objects. SeaBird effectively +integrates BEV segmentation on foreground objects for 3D detection, with the +segmentation head trained with the dice loss. SeaBird achieves SoTA results on +the KITTI-360 leaderboard and improves existing detectors on the nuScenes +leaderboard, particularly for large objects. Code and models at +https://github.com/abhi1kumar/SeaBird",cs.CV,"['cs.CV', 'cs.AI']" +Learning to Count without Annotations,Lukas Knobel · Tengda Han · Yuki Asano,https://github.com/lukasknobel/SelfCollages,https://web3.arxiv.org/abs/2307.08727,,2307.08727.pdf,Learning to Count without Annotations,"While recent supervised methods for reference-based object counting continue +to improve the performance on benchmark datasets, they have to rely on small +datasets due to the cost associated with manually annotating dozens of objects +in images. We propose UnCounTR, a model that can learn this task without +requiring any manual annotations. To this end, we construct ""Self-Collages"", +images with various pasted objects as training samples, that provide a rich +learning signal covering arbitrary object types and counts. Our method builds +on existing unsupervised representations and segmentation techniques to +successfully demonstrate for the first time the ability of reference-based +counting without manual supervision. 
Our experiments show that our method not +only outperforms simple baselines and generic models such as FasterRCNN and +DETR, but also matches the performance of supervised counting models in some +domains.",cs.CV,['cs.CV'] +AM-RADIO: Agglomerative Models - Reduce All Domains Into One,Mike Ranzinger · Greg Heinrich · Jan Kautz · Pavlo Molchanov,https://github.com/NVlabs/RADIO,https://arxiv.org/abs/2312.06709,,2312.06709.pdf,AM-RADIO: Agglomerative Vision Foundation Model -- Reduce All Domains Into One,"A handful of visual foundation models (VFMs) have recently emerged as the +backbones for numerous downstream tasks. VFMs like CLIP, DINOv2, SAM are +trained with distinct objectives, exhibiting unique characteristics for various +downstream tasks. We find that despite their conceptual differences, these +models can be effectively merged into a unified model through multi-teacher +distillation. We name this approach AM-RADIO (Agglomerative Model -- Reduce All +Domains Into One). This integrative approach not only surpasses the performance +of individual teacher models but also amalgamates their distinctive features, +such as zero-shot vision-language comprehension, detailed pixel-level +understanding, and open vocabulary segmentation capabilities. In pursuit of the +most hardware-efficient backbone, we evaluated numerous architectures in our +multi-teacher distillation pipeline using the same training recipe. This led to +the development of a novel architecture (E-RADIO) that exceeds the performance +of its predecessors and is at least 7x faster than the teacher models. Our +comprehensive benchmarking process covers downstream tasks including ImageNet +classification, ADE20k semantic segmentation, COCO object detection and +LLaVa-1.5 framework. + Code: https://github.com/NVlabs/RADIO",cs.CV,['cs.CV'] +Activity-Biometrics: Person Identification from Daily Activities,Shehreen Azad · Yogesh S. Rawat, ,https://arxiv.org/abs/2403.17360,,2403.17360.pdf,Activity-Biometrics: Person Identification from Daily Activities,"In this work, we study a novel problem which focuses on person identification +while performing daily activities. Learning biometric features from RGB videos +is challenging due to spatio-temporal complexity and presence of appearance +biases such as clothing color and background. We propose ABNet, a novel +framework which leverages disentanglement of biometric and non-biometric +features to perform effective person identification from daily activities. +ABNet relies on a bias-less teacher to learn biometric features from RGB videos +and explicitly disentangle non-biometric features with the help of biometric +distortion. In addition, ABNet also exploits activity prior for biometrics +which is enabled by joint biometric and activity learning. We perform +comprehensive evaluation of the proposed approach across five different +datasets which are derived from existing activity recognition benchmarks. +Furthermore, we extensively compare ABNet with existing works in person +identification and demonstrate its effectiveness for activity-based biometrics +across all five datasets. 
The code and dataset can be accessed at: +\url{https://github.com/sacrcv/Activity-Biometrics/}",cs.CV,['cs.CV'] +Unsupervised Template-assisted Point Cloud Shape Correspondence Network,Jiacheng Deng · Jiahao Lu · Tianzhu Zhang, ,https://arxiv.org/abs/2403.16412,,2403.16412.pdf,Unsupervised Template-assisted Point Cloud Shape Correspondence Network,"Unsupervised point cloud shape correspondence aims to establish point-wise +correspondences between source and target point clouds. Existing methods obtain +correspondences directly by computing point-wise feature similarity between +point clouds. However, non-rigid objects possess strong deformability and +unusual shapes, making it a longstanding challenge to directly establish +correspondences between point clouds with unconventional shapes. To address +this challenge, we propose an unsupervised Template-Assisted point cloud shape +correspondence Network, termed TANet, including a template generation module +and a template assistance module. The proposed TANet enjoys several merits. +Firstly, the template generation module establishes a set of learnable +templates with explicit structures. Secondly, we introduce a template +assistance module that extensively leverages the generated templates to +establish more accurate shape correspondences from multiple perspectives. +Extensive experiments on four human and animal datasets demonstrate that TANet +achieves favorable performance against state-of-the-art methods.",cs.CV,['cs.CV'] +Real-time Acquisition and Reconstruction of Dynamic Volumes with Neural Structured Illumination,Yixin Zeng · Zoubin Bi · Yin Mingrui · Xiang Feng · Kun Zhou · Hongzhi Wu, ,https://arxiv.org/html/2404.10766v1,,2404.10766v1.pdf,RapidVol: Rapid Reconstruction of 3D Ultrasound Volumes from Sensorless 2D Scans,"Two-dimensional (2D) freehand ultrasonography is one of the most commonly +used medical imaging modalities, particularly in obstetrics and gynaecology. +However, it only captures 2D cross-sectional views of inherently 3D anatomies, +losing valuable contextual information. As an alternative to requiring costly +and complex 3D ultrasound scanners, 3D volumes can be constructed from 2D scans +using machine learning. However this usually requires long computational time. +Here, we propose RapidVol: a neural representation framework to speed up +slice-to-volume ultrasound reconstruction. We use tensor-rank decomposition, to +decompose the typical 3D volume into sets of tri-planes, and store those +instead, as well as a small neural network. A set of 2D ultrasound scans, with +their ground truth (or estimated) 3D position and orientation (pose) is all +that is required to form a complete 3D reconstruction. Reconstructions are +formed from real fetal brain scans, and then evaluated by requesting novel +cross-sectional views. When compared to prior approaches based on fully +implicit representation (e.g. neural radiance fields), our method is over 3x +quicker, 46% more accurate, and if given inaccurate poses is more robust. 
+Further speed-up is also possible by reconstructing from a structural prior +rather than from scratch.",eess.IV,"['eess.IV', 'cs.CV']" +FC-GNN: Recovering Reliable and Accurate Correspondences from Interferences,Haobo Xu · Jun Zhou · Hua Yang · Renjie Pan · Cunyan Li, ,,https://www.researchgate.net/publication/376796777_Matching-to-Detecting_Establishing_Dense_and_Reliable_Correspondences_Between_Images,,,,,nan +CFAT: Unleashing Triangular Windows for Image Super-resolution,Abhisek Ray · Gaurav Kumar · Maheshkumar Kolekar, ,https://arxiv.org/abs/2403.16143,,2403.16143.pdf,CFAT: Unleashing TriangularWindows for Image Super-resolution,"Transformer-based models have revolutionized the field of image +super-resolution (SR) by harnessing their inherent ability to capture complex +contextual features. The overlapping rectangular shifted window technique used +in transformer architecture nowadays is a common practice in super-resolution +models to improve the quality and robustness of image upscaling. However, it +suffers from distortion at the boundaries and has limited unique shifting +modes. To overcome these weaknesses, we propose a non-overlapping triangular +window technique that synchronously works with the rectangular one to mitigate +boundary-level distortion and allows the model to access more unique sifting +modes. In this paper, we propose a Composite Fusion Attention Transformer +(CFAT) that incorporates triangular-rectangular window-based local attention +with a channel-based global attention technique in image super-resolution. As a +result, CFAT enables attention mechanisms to be activated on more image pixels +and captures long-range, multi-scale features to improve SR performance. The +extensive experimental results and ablation study demonstrate the effectiveness +of CFAT in the SR domain. Our proposed model shows a significant 0.7 dB +performance improvement over other state-of-the-art SR architectures.",eess.IV,"['eess.IV', 'cs.CV', 'cs.LG', 'cs.MM']" +Deciphering ‘What’ and ‘Where’ Visual Pathways from Spectral Clustering of Layer-Distributed Neural Representations,Xiao Zhang · David Yunis · Michael Maire, ,https://arxiv.org/abs/2312.06716,,2312.06716.pdf,Deciphering 'What' and 'Where' Visual Pathways from Spectral Clustering of Layer-Distributed Neural Representations,"We present an approach for analyzing grouping information contained within a +neural network's activations, permitting extraction of spatial layout and +semantic segmentation from the behavior of large pre-trained vision models. +Unlike prior work, our method conducts a wholistic analysis of a network's +activation state, leveraging features from all layers and obviating the need to +guess which part of the model contains relevant information. Motivated by +classic spectral clustering, we formulate this analysis in terms of an +optimization objective involving a set of affinity matrices, each formed by +comparing features within a different layer. Solving this optimization problem +using gradient descent allows our technique to scale from single images to +dataset-level analysis, including, in the latter, both intra- and inter-image +relationships. Analyzing a pre-trained generative transformer provides insight +into the computational strategy learned by such models. Equating affinity with +key-query similarity across attention layers yields eigenvectors encoding scene +spatial layout, whereas defining affinity by value vector similarity yields +eigenvectors encoding object identity. 
This result suggests that key and query +vectors coordinate attentional information flow according to spatial proximity +(a `where' pathway), while value vectors refine a semantic category +representation (a `what' pathway).",cs.CV,['cs.CV'] +DART: Implicit Doppler Tomography for Radar Novel View Synthesis,Tianshu Huang · John Miller · Akarsh Prabhakara · Tao Jin · Tarana Laroia · Zico Kolter · Anthony Rowe,https://wiselabcmu.github.io/dart/,https://arxiv.org/abs/2403.03896v1,,2403.03896v1.pdf,DART: Implicit Doppler Tomography for Radar Novel View Synthesis,"Simulation is an invaluable tool for radio-frequency system designers that +enables rapid prototyping of various algorithms for imaging, target detection, +classification, and tracking. However, simulating realistic radar scans is a +challenging task that requires an accurate model of the scene, radio frequency +material properties, and a corresponding radar synthesis function. Rather than +specifying these models explicitly, we propose DART - Doppler Aided Radar +Tomography, a Neural Radiance Field-inspired method which uses radar-specific +physics to create a reflectance and transmittance-based rendering pipeline for +range-Doppler images. We then evaluate DART by constructing a custom data +collection platform and collecting a novel radar dataset together with accurate +position and instantaneous velocity measurements from lidar-based localization. +In comparison to state-of-the-art baselines, DART synthesizes superior radar +range-Doppler images from novel views across all datasets and additionally can +be used to generate high quality tomographic images.",cs.CV,"['cs.CV', 'cs.LG']" +Don’t drop your samples! Coherence-aware training benefits Conditional diffusion,Nicolas Dufour · Victor Besnier · Vicky Kalogeiton · David Picard,https://nicolas-dufour.github.io/cad,https://arxiv.org/abs/2405.20324,,2405.20324.pdf,Don't drop your samples! Coherence-aware training benefits Conditional diffusion,"Conditional diffusion models are powerful generative models that can leverage +various types of conditional information, such as class labels, segmentation +masks, or text captions. However, in many real-world scenarios, conditional +information may be noisy or unreliable due to human annotation errors or weak +alignment. In this paper, we propose the Coherence-Aware Diffusion (CAD), a +novel method that integrates coherence in conditional information into +diffusion models, allowing them to learn from noisy annotations without +discarding data. We assume that each data point has an associated coherence +score that reflects the quality of the conditional information. We then +condition the diffusion model on both the conditional information and the +coherence score. In this way, the model learns to ignore or discount the +conditioning when the coherence is low. We show that CAD is theoretically sound +and empirically effective on various conditional generation tasks. 
Moreover, we +show that leveraging coherence generates realistic and diverse samples that +respect conditional information better than models trained on cleaned datasets +where samples with low coherence have been discarded.",cs.CV,"['cs.CV', 'cs.LG']" +"Check, Locate, Rectify: A Training-Free Layout Calibration System for Text-to-Image Generation",Biao Gong · Siteng Huang · Yutong Feng · Shiwei Zhang · Yuyuan Li · Yu Liu, ,https://arxiv.org/abs/2311.15773,,2311.15773.pdf,"Check, Locate, Rectify: A Training-Free Layout Calibration System for Text-to-Image Generation","Diffusion models have recently achieved remarkable progress in generating +realistic images. However, challenges remain in accurately understanding and +synthesizing the layout requirements in the textual prompts. To align the +generated image with layout instructions, we present a training-free layout +calibration system SimM that intervenes in the generative process on the fly +during inference time. Specifically, following a ""check-locate-rectify"" +pipeline, the system first analyses the prompt to generate the target layout +and compares it with the intermediate outputs to automatically detect errors. +Then, by moving the located activations and making intra- and inter-map +adjustments, the rectification process can be performed with negligible +computational overhead. To evaluate SimM over a range of layout requirements, +we present a benchmark SimMBench that compensates for the lack of superlative +spatial relations in existing datasets. And both quantitative and qualitative +results demonstrate the effectiveness of the proposed SimM in calibrating the +layout inconsistencies. Our project page is at https://simm-t2i.github.io/SimM.",cs.CV,['cs.CV'] +GenHowTo: Learning to Generate Actions and State Transformations from Instructional Videos,Tomas Soucek · Dima Damen · Michael Wray · Ivan Laptev · Josef Sivic,https://soczech.github.io/genhowto/,https://arxiv.org/abs/2312.07322,,2312.07322.pdf,GenHowTo: Learning to Generate Actions and State Transformations from Instructional Videos,"We address the task of generating temporally consistent and physically +plausible images of actions and object state transformations. Given an input +image and a text prompt describing the targeted transformation, our generated +images preserve the environment and transform objects in the initial image. Our +contributions are threefold. First, we leverage a large body of instructional +videos and automatically mine a dataset of triplets of consecutive frames +corresponding to initial object states, actions, and resulting object +transformations. Second, equipped with this data, we develop and train a +conditioned diffusion model dubbed GenHowTo. Third, we evaluate GenHowTo on a +variety of objects and actions and show superior performance compared to +existing methods. In particular, we introduce a quantitative evaluation where +GenHowTo achieves 88% and 74% on seen and unseen interaction categories, +respectively, outperforming prior work by a large margin.",cs.CV,['cs.CV'] +From Coarse to Fine-Grained Open-Set Recognition,Nico Lang · Vésteinn Snæbjarnarson · Elijah Cole · Oisin Mac Aodha · Christian Igel · Serge Belongie, ,https://arxiv.org/abs/2307.07214,,2307.07214.pdf,Complementary Frequency-Varying Awareness Network for Open-Set Fine-Grained Image Recognition,"Open-set image recognition is a challenging topic in computer vision. 
Most of +the existing works in literature focus on learning more discriminative features +from the input images, however, they are usually insensitive to the high- or +low-frequency components in features, resulting in a decreasing performance on +fine-grained image recognition. To address this problem, we propose a +Complementary Frequency-varying Awareness Network that could better capture +both high-frequency and low-frequency information, called CFAN. The proposed +CFAN consists of three sequential modules: (i) a feature extraction module is +introduced for learning preliminary features from the input images; (ii) a +frequency-varying filtering module is designed to separate out both high- and +low-frequency components from the preliminary features in the frequency domain +via a frequency-adjustable filter; (iii) a complementary temporal aggregation +module is designed for aggregating the high- and low-frequency components via +two Long Short-Term Memory networks into discriminative features. Based on +CFAN, we further propose an open-set fine-grained image recognition method, +called CFAN-OSFGR, which learns image features via CFAN and classifies them via +a linear classifier. Experimental results on 3 fine-grained datasets and 2 +coarse-grained datasets demonstrate that CFAN-OSFGR performs significantly +better than 9 state-of-the-art methods in most cases.",cs.CV,['cs.CV'] +FinePOSE: Fine-Grained Prompt-Driven 3D Human Pose Estimation via Diffusion Models,Jinglin Xu · Yijie Guo · Yuxin Peng,https://pku-icst-mipl.github.io/FinePOSE_ProjectPage/,https://arxiv.org/abs/2405.05216,,,FinePOSE: Fine-Grained Prompt-Driven 3D Human Pose Estimation via Diffusion Models,"The 3D Human Pose Estimation (3D HPE) task uses 2D images or videos to +predict human joint coordinates in 3D space. Despite recent advancements in +deep learning-based methods, they mostly ignore the capability of coupling +accessible texts and naturally feasible knowledge of humans, missing out on +valuable implicit supervision to guide the 3D HPE task. Moreover, previous +efforts often study this task from the perspective of the whole human body, +neglecting fine-grained guidance hidden in different body parts. To this end, +we present a new Fine-Grained Prompt-Driven Denoiser based on a diffusion model +for 3D HPE, named \textbf{FinePOSE}. It consists of three core blocks enhancing +the reverse process of the diffusion model: (1) Fine-grained Part-aware Prompt +learning (FPP) block constructs fine-grained part-aware prompts via coupling +accessible texts and naturally feasible knowledge of body parts with learnable +prompts to model implicit guidance. (2) Fine-grained Prompt-pose Communication +(FPC) block establishes fine-grained communications between learned part-aware +prompts and poses to improve the denoising quality. (3) Prompt-driven Timestamp +Stylization (PTS) block integrates learned prompt embedding and temporal +information related to the noise level to enable adaptive adjustment at each +denoising step. Extensive experiments on public single-human pose estimation +datasets show that FinePOSE outperforms state-of-the-art methods. We further +extend FinePOSE to multi-human pose estimation. Achieving 34.3mm average MPJPE +on the EgoHumans dataset demonstrates the potential of FinePOSE to deal with +complex multi-human scenarios. 
Code is available at +https://github.com/PKU-ICST-MIPL/FinePOSE_CVPR2024.",cs.CV,['cs.CV'] +FedMef: Towards Memory-efficient Federated Dynamic Pruning,Hong Huang · Weiming Zhuang · Chen Chen · Lingjuan Lyu, ,https://arxiv.org/abs/2403.14737,,2403.14737.pdf,FedMef: Towards Memory-efficient Federated Dynamic Pruning,"Federated learning (FL) promotes decentralized training while prioritizing +data confidentiality. However, its application on resource-constrained devices +is challenging due to the high demand for computation and memory resources to +train deep learning models. Neural network pruning techniques, such as dynamic +pruning, could enhance model efficiency, but directly adopting them in FL still +poses substantial challenges, including post-pruning performance degradation, +high activation memory usage, etc. To address these challenges, we propose +FedMef, a novel and memory-efficient federated dynamic pruning framework. +FedMef comprises two key components. First, we introduce the budget-aware +extrusion that maintains pruning efficiency while preserving post-pruning +performance by salvaging crucial information from parameters marked for pruning +within a given budget. Second, we propose scaled activation pruning to +effectively reduce activation memory footprints, which is particularly +beneficial for deploying FL to memory-limited devices. Extensive experiments +demonstrate the effectiveness of our proposed FedMef. In particular, it +achieves a significant reduction of 28.5% in memory footprint compared to +state-of-the-art methods while obtaining superior accuracy.",cs.LG,"['cs.LG', 'cs.DC']" +FreeControl: Training-Free Spatial Control of Any Text-to-Image Diffusion Model with Any Condition,Sicheng Mo · Fangzhou Mu · Kuan Heng Lin · Yanli Liu · Bochen Guan · Yin Li · Bolei Zhou, ,https://arxiv.org/abs/2312.07536,,2312.07536.pdf,FreeControl: Training-Free Spatial Control of Any Text-to-Image Diffusion Model with Any Condition,"Recent approaches such as ControlNet offer users fine-grained spatial control +over text-to-image (T2I) diffusion models. However, auxiliary modules have to +be trained for each type of spatial condition, model architecture, and +checkpoint, putting them at odds with the diverse intents and preferences a +human designer would like to convey to the AI models during the content +creation process. In this work, we present FreeControl, a training-free +approach for controllable T2I generation that supports multiple conditions, +architectures, and checkpoints simultaneously. FreeControl designs structure +guidance to facilitate the structure alignment with a guidance image, and +appearance guidance to enable the appearance sharing between images generated +using the same seed. Extensive qualitative and quantitative experiments +demonstrate the superior performance of FreeControl across a variety of +pre-trained T2I models. 
In particular, FreeControl facilitates convenient +training-free control over many different architectures and checkpoints, allows +the challenging input conditions on which most of the existing training-free +methods fail, and achieves competitive synthesis quality with training-based +approaches.",cs.CV,['cs.CV'] +Revamping Federated Learning Security from a Defender's Perspective: A Unified Defense with Homomorphic Encrypted Data Space,Naveen Kumar Kummari · Reshmi Mitra · Krishna Mohan Chalavadi,https://github.com/NaveenKumar-1311/FCD,https://arxiv.org/abs/2307.08672,,2307.08672.pdf,FedDefender: Backdoor Attack Defense in Federated Learning,"Federated Learning (FL) is a privacy-preserving distributed machine learning +technique that enables individual clients (e.g., user participants, edge +devices, or organizations) to train a model on their local data in a secure +environment and then share the trained model with an aggregator to build a +global model collaboratively. In this work, we propose FedDefender, a defense +mechanism against targeted poisoning attacks in FL by leveraging differential +testing. Our proposed method fingerprints the neuron activations of clients' +models on the same input and uses differential testing to identify a +potentially malicious client containing a backdoor. We evaluate FedDefender +using MNIST and FashionMNIST datasets with 20 and 30 clients, and our results +demonstrate that FedDefender effectively mitigates such attacks, reducing the +attack success rate (ASR) to 10\% without deteriorating the global model +performance.",cs.CR,"['cs.CR', 'cs.AI', 'cs.CV', 'cs.LG']" +DeCoTR: Enhancing Depth Completion with 2D and 3D Attentions,Yunxiao Shi · Manish Singh · Hong Cai · Fatih Porikli, ,https://arxiv.org/abs/2403.12202,,2403.12202.pdf,DeCoTR: Enhancing Depth Completion with 2D and 3D Attentions,"In this paper, we introduce a novel approach that harnesses both 2D and 3D +attentions to enable highly accurate depth completion without requiring +iterative spatial propagations. Specifically, we first enhance a baseline +convolutional depth completion model by applying attention to 2D features in +the bottleneck and skip connections. This effectively improves the performance +of this simple network and sets it on par with the latest, complex +transformer-based models. Leveraging the initial depths and features from this +network, we uplift the 2D features to form a 3D point cloud and construct a 3D +point transformer to process it, allowing the model to explicitly learn and +exploit 3D geometric features. In addition, we propose normalization techniques +to process the point cloud, which improves learning and leads to better +accuracy than directly using point transformers off the shelf. Furthermore, we +incorporate global attention on downsampled point cloud features, which enables +long-range context while still being computationally feasible. We evaluate our +method, DeCoTR, on established depth completion benchmarks, including NYU Depth +V2 and KITTI, showcasing that it sets new state-of-the-art performance. We +further conduct zero-shot evaluations on ScanNet and DDAD benchmarks and +demonstrate that DeCoTR has superior generalizability compared to existing +approaches.",cs.CV,['cs.CV'] +TiNO-Edit: Timestep and Noise Optimization for Robust Diffusion-Based Image Editing,Sherry X. 
Chen · Yaron Vaxman · Elad Ben Baruch · David Asulin · Aviad Moreshet · Kuo-Chin Lien · Misha Sra · Pradeep Sen,https://github.com/SherryXTChen/TiNO-Edit,https://arxiv.org/abs/2404.11120,,2404.11120.pdf,TiNO-Edit: Timestep and Noise Optimization for Robust Diffusion-Based Image Editing,"Despite many attempts to leverage pre-trained text-to-image models (T2I) like +Stable Diffusion (SD) for controllable image editing, producing good +predictable results remains a challenge. Previous approaches have focused on +either fine-tuning pre-trained T2I models on specific datasets to generate +certain kinds of images (e.g., with a specific object or person), or on +optimizing the weights, text prompts, and/or learning features for each input +image in an attempt to coax the image generator to produce the desired result. +However, these approaches all have shortcomings and fail to produce good +results in a predictable and controllable manner. To address this problem, we +present TiNO-Edit, an SD-based method that focuses on optimizing the noise +patterns and diffusion timesteps during editing, something previously +unexplored in the literature. With this simple change, we are able to generate +results that both better align with the original images and reflect the desired +result. Furthermore, we propose a set of new loss functions that operate in the +latent domain of SD, greatly speeding up the optimization when compared to +prior approaches, which operate in the pixel domain. Our method can be easily +applied to variations of SD including Textual Inversion and DreamBooth that +encode new concepts and incorporate them into the edited results. We present a +host of image-editing capabilities enabled by our approach. Our code is +publicly available at https://github.com/SherryXTChen/TiNO-Edit.",cs.CV,['cs.CV'] +Memory-Scalable and Simplified Functional Map Learning,Robin Magnet · Maks Ovsjanikov, ,https://arxiv.org/abs/2404.00330,,2404.00330.pdf,Memory-Scalable and Simplified Functional Map Learning,"Deep functional maps have emerged in recent years as a prominent +learning-based framework for non-rigid shape matching problems. While early +methods in this domain only focused on learning in the functional domain, the +latest techniques have demonstrated that by promoting consistency between +functional and pointwise maps leads to significant improvements in accuracy. +Unfortunately, existing approaches rely heavily on the computation of large +dense matrices arising from soft pointwise maps, which compromises their +efficiency and scalability. To address this limitation, we introduce a novel +memory-scalable and efficient functional map learning pipeline. By leveraging +the specific structure of functional maps, we offer the possibility to achieve +identical results without ever storing the pointwise map in memory. +Furthermore, based on the same approach, we present a differentiable map +refinement layer adapted from an existing axiomatic refinement algorithm. +Unlike many functional map learning methods, which use this algorithm at a +post-processing step, ours can be easily used at train time, enabling to +enforce consistency between the refined and initial versions of the map. 
Our +resulting approach is both simpler, more efficient and more numerically stable, +by avoiding differentiation through a linear system, while achieving close to +state-of-the-art results in challenging scenarios.",cs.CV,"['cs.CV', 'cs.AI']" +FineParser: A Fine-grained Spatio-temporal Action Parser for Human-centric Action Quality Assessment,Jinglin Xu · Sibo Yin · Guohao Zhao · Zishuo Wang · Yuxin Peng, ,https://arxiv.org/abs/2405.06887,,2405.06887.pdf,FineParser: A Fine-grained Spatio-temporal Action Parser for Human-centric Action Quality Assessment,"Existing action quality assessment (AQA) methods mainly learn deep +representations at the video level for scoring diverse actions. Due to the lack +of a fine-grained understanding of actions in videos, they harshly suffer from +low credibility and interpretability, thus insufficient for stringent +applications, such as Olympic diving events. We argue that a fine-grained +understanding of actions requires the model to perceive and parse actions in +both time and space, which is also the key to the credibility and +interpretability of the AQA technique. Based on this insight, we propose a new +fine-grained spatial-temporal action parser named \textbf{FineParser}. It +learns human-centric foreground action representations by focusing on target +action regions within each frame and exploiting their fine-grained alignments +in time and space to minimize the impact of invalid backgrounds during the +assessment. In addition, we construct fine-grained annotations of human-centric +foreground action masks for the FineDiving dataset, called +\textbf{FineDiving-HM}. With refined annotations on diverse target action +procedures, FineDiving-HM can promote the development of real-world AQA +systems. Through extensive experiments, we demonstrate the effectiveness of +FineParser, which outperforms state-of-the-art methods while supporting more +tasks of fine-grained action understanding. Data and code are available at +\url{https://github.com/PKU-ICST-MIPL/FineParser_CVPR2024}.",cs.CV,['cs.CV'] +Spike-guided Motion Deblurring with Unknown Modal Spatiotemporal Alignment,Jiyuan Zhang · Shiyan Chen · Yajing Zheng · Zhaofei Yu · Tiejun Huang, ,https://arxiv.org/abs/2403.09486,,2403.09486.pdf,SpikeReveal: Unlocking Temporal Sequences from Real Blurry Inputs with Spike Streams,"Reconstructing a sequence of sharp images from the blurry input is crucial +for enhancing our insights into the captured scene and poses a significant +challenge due to the limited temporal features embedded in the image. Spike +cameras, sampling at rates up to 40,000 Hz, have proven effective in capturing +motion features and beneficial for solving this ill-posed problem. Nonetheless, +existing methods fall into the supervised learning paradigm, which suffers from +notable performance degradation when applied to real-world scenarios that +diverge from the synthetic training data domain. Moreover, the quality of +reconstructed images is capped by the generated images based on motion analysis +interpolation, which inherently differs from the actual scene, affecting the +generalization ability of these methods in real high-speed scenarios. To +address these challenges, we propose the first self-supervised framework for +the task of spike-guided motion deblurring. Our approach begins with the +formulation of a spike-guided deblurring model that explores the theoretical +relationships among spike streams, blurry images, and their corresponding sharp +sequences. 
We subsequently develop a self-supervised cascaded framework to +alleviate the issues of spike noise and spatial-resolution mismatching +encountered in the deblurring model. With knowledge distillation and +re-blurring loss, we further design a lightweight deblur network to generate +high-quality sequences with brightness and texture consistency with the +original input. Quantitative and qualitative experiments conducted on our +real-world and synthetic datasets with spikes validate the superior +generalization of the proposed framework. Our code, data and trained models +will be available at \url{https://github.com/chenkang455/S-SDM}.",cs.CV,['cs.CV'] +Weakly Supervised Point Cloud Semantic Segmentation via Artificial Oracle,Hyeokjun Kweon · Jihun Kim · Kuk-Jin Yoon, ,,https://ietresearch.onlinelibrary.wiley.com/doi/full/10.1049/cit2.12239,,,,,nan +From SAM to CAMs: Exploring Segment Anything Model for Weakly Supervised Semantic Segmentation,Hyeokjun Kweon · Kuk-Jin Yoon, ,https://arxiv.org/abs/2312.03585,,2312.03585.pdf,Foundation Model Assisted Weakly Supervised Semantic Segmentation,"This work aims to leverage pre-trained foundation models, such as contrastive +language-image pre-training (CLIP) and segment anything model (SAM), to address +weakly supervised semantic segmentation (WSSS) using image-level labels. To +this end, we propose a coarse-to-fine framework based on CLIP and SAM for +generating high-quality segmentation seeds. Specifically, we construct an image +classification task and a seed segmentation task, which are jointly performed +by CLIP with frozen weights and two sets of learnable task-specific prompts. A +SAM-based seeding (SAMS) module is designed and applied to each task to produce +either coarse or fine seed maps. Moreover, we design a multi-label contrastive +loss supervised by image-level labels and a CAM activation loss supervised by +the generated coarse seed map. These losses are used to learn the prompts, +which are the only parts need to be learned in our framework. Once the prompts +are learned, we input each image along with the learned segmentation-specific +prompts into CLIP and the SAMS module to produce high-quality segmentation +seeds. These seeds serve as pseudo labels to train an off-the-shelf +segmentation network like other two-stage WSSS methods. Experiments show that +our method achieves the state-of-the-art performance on PASCAL VOC 2012 and +competitive results on MS COCO 2014. Code is available at +https://github.com/HAL-42/FMA-WSSS.git.",cs.CV,"['cs.CV', 'cs.AI']" +SchurVINS: Schur Complement-Based Lightweight Visual Inertial Navigation System,Yunfei Fan · Tianyu Zhao · Guidong Wang,https://github.com/bytedance/SchurVINS,https://arxiv.org/abs/2312.01616,,2312.01616.pdf,SchurVINS: Schur Complement-Based Lightweight Visual Inertial Navigation System,"Accuracy and computational efficiency are the most important metrics to +Visual Inertial Navigation System (VINS). The existing VINS algorithms with +either high accuracy or low computational complexity, are difficult to provide +the high precision localization in resource-constrained devices. To this end, +we propose a novel filter-based VINS framework named SchurVINS, which could +guarantee both high accuracy by building a complete residual model and low +computational complexity with Schur complement. Technically, we first formulate +the full residual model where Gradient, Hessian and observation covariance are +explicitly modeled. 
Then Schur complement is employed to decompose the full +model into ego-motion residual model and landmark residual model. Finally, +Extended Kalman Filter (EKF) update is implemented in these two models with +high efficiency. Experiments on EuRoC and TUM-VI datasets show that our method +notably outperforms state-of-the-art (SOTA) methods in both accuracy and +computational complexity. The experimental code of SchurVINS is available at +https://github.com/bytedance/SchurVINS.",cs.CV,"['cs.CV', 'cs.RO']" +CAGE: Controllable Articulation GEneration,Jiayi Liu · Hou In Ivan Tam · Ali Mahdavi Amiri · Manolis Savva, ,https://arxiv.org/abs/2312.09570,,2312.09570.pdf,CAGE: Controllable Articulation GEneration,"We address the challenge of generating 3D articulated objects in a +controllable fashion. Currently, modeling articulated 3D objects is either +achieved through laborious manual authoring, or using methods from prior work +that are hard to scale and control directly. We leverage the interplay between +part shape, connectivity, and motion using a denoising diffusion-based method +with attention modules designed to extract correlations between part +attributes. Our method takes an object category label and a part connectivity +graph as input and generates an object's geometry and motion parameters. The +generated objects conform to user-specified constraints on the object category, +part shape, and part articulation. Our experiments show that our method +outperforms the state-of-the-art in articulated object generation, producing +more realistic objects while conforming better to user constraints. + Video Summary at: http://youtu.be/cH_rbKbyTpE",cs.CV,['cs.CV'] +Neural Modes: Self-supervised Learning of Nonlinear Modal Subspaces,Jiahong Wang · Yinwei DU · Stelian Coros · Bernhard Thomaszewski, ,https://arxiv.org/abs/2404.17620,,2404.17620.pdf,Neural Modes: Self-supervised Learning of Nonlinear Modal Subspaces,"We propose a self-supervised approach for learning physics-based subspaces +for real-time simulation. Existing learning-based methods construct subspaces +by approximating pre-defined simulation data in a purely geometric way. +However, this approach tends to produce high-energy configurations, leads to +entangled latent space dimensions, and generalizes poorly beyond the training +set. To overcome these limitations, we propose a self-supervised approach that +directly minimizes the system's mechanical energy during training. We show that +our method leads to learned subspaces that reflect physical equilibrium +constraints, resolve overfitting issues of previous methods, and offer +interpretable latent space parameters.",cs.LG,"['cs.LG', 'cs.CV', 'cs.GR']" +Beyond Average: Individualized Visual Scanpath Prediction,Xianyu Chen · Ming Jiang · Qi Zhao, ,https://arxiv.org/abs/2404.12235,,2404.12235.pdf,Beyond Average: Individualized Visual Scanpath Prediction,"Understanding how attention varies across individuals has significant +scientific and societal impacts. However, existing visual scanpath models treat +attention uniformly, neglecting individual differences. To bridge this gap, +this paper focuses on individualized scanpath prediction (ISP), a new attention +modeling task that aims to accurately predict how different individuals shift +their attention in diverse visual tasks. 
It proposes an ISP method featuring +three novel technical components: (1) an observer encoder to characterize and +integrate an observer's unique attention traits, (2) an observer-centric +feature integration approach that holistically combines visual features, task +guidance, and observer-specific characteristics, and (3) an adaptive fixation +prioritization mechanism that refines scanpath predictions by dynamically +prioritizing semantic feature maps based on individual observers' attention +traits. These novel components allow scanpath models to effectively address the +attention variations across different observers. Our method is generally +applicable to different datasets, model architectures, and visual tasks, +offering a comprehensive tool for transforming general scanpath models into +individualized ones. Comprehensive evaluations using value-based and +ranking-based metrics verify the method's effectiveness and generalizability.",cs.CV,['cs.CV'] +CLIP-BEVFormer: Enhancing Multi-View Image-Based BEV Detector with Ground Truth Flow,Chenbin Pan · Burhaneddin Yaman · Senem Velipasalar · Liu Ren, ,https://arxiv.org/abs/2403.08919,,2403.08919.pdf,CLIP-BEVFormer: Enhancing Multi-View Image-Based BEV Detector with Ground Truth Flow,"Autonomous driving stands as a pivotal domain in computer vision, shaping the +future of transportation. Within this paradigm, the backbone of the system +plays a crucial role in interpreting the complex environment. However, a +notable challenge has been the loss of clear supervision when it comes to +Bird's Eye View elements. To address this limitation, we introduce +CLIP-BEVFormer, a novel approach that leverages the power of contrastive +learning techniques to enhance the multi-view image-derived BEV backbones with +ground truth information flow. We conduct extensive experiments on the +challenging nuScenes dataset and showcase significant and consistent +improvements over the SOTA. Specifically, CLIP-BEVFormer achieves an impressive +8.5\% and 9.2\% enhancement in terms of NDS and mAP, respectively, over the +previous best BEV model on the 3D object detection task.",cs.CV,['cs.CV'] +"OmniParser: A Unified Framework for Text Spotting, Key Information Extraction and Table Recognition",Jianqiang Wan · Sibo Song · Wenwen Yu · Yuliang Liu · Wenqing Cheng · Fei Huang · Xiang Bai · Cong Yao · Zhibo Yang, ,https://arxiv.org/abs/2403.19128,,2403.19128.pdf,"OmniParser: A Unified Framework for Text Spotting, Key Information Extraction and Table Recognition","Recently, visually-situated text parsing (VsTP) has experienced notable +advancements, driven by the increasing demand for automated document +understanding and the emergence of Generative Large Language Models (LLMs) +capable of processing document-based questions. Various methods have been +proposed to address the challenging problem of VsTP. However, due to the +diversified targets and heterogeneous schemas, previous works usually design +task-specific architectures and objectives for individual tasks, which +inadvertently leads to modal isolation and complex workflow. In this paper, we +propose a unified paradigm for parsing visually-situated text across diverse +scenarios. Specifically, we devise a universal model, called OmniParser, which +can simultaneously handle three typical visually-situated text parsing tasks: +text spotting, key information extraction, and table recognition. 
In +OmniParser, all tasks share the unified encoder-decoder architecture, the +unified objective: point-conditioned text generation, and the unified input & +output representation: prompt & structured sequences. Extensive experiments +demonstrate that the proposed OmniParser achieves state-of-the-art (SOTA) or +highly competitive performances on 7 datasets for the three visually-situated +text parsing tasks, despite its unified, concise design. The code is available +at https://github.com/AlibabaResearch/AdvancedLiterateMachinery.",cs.CV,['cs.CV'] +ZeroNVS: Zero-Shot 360-Degree View Synthesis from a Single Image,Kyle Sargent · Zizhang Li · Tanmay Shah · Charles Herrmann · Hong-Xing Yu · Yunzhi Zhang · Eric Ryan Chan · Dmitry Lagun · Li Fei-Fei · Deqing Sun · Jiajun Wu,kylesargent.github.io/zeronvs,https://arxiv.org/abs/2310.17994,,2310.17994.pdf,ZeroNVS: Zero-Shot 360-Degree View Synthesis from a Single Image,"We introduce a 3D-aware diffusion model, ZeroNVS, for single-image novel view +synthesis for in-the-wild scenes. While existing methods are designed for +single objects with masked backgrounds, we propose new techniques to address +challenges introduced by in-the-wild multi-object scenes with complex +backgrounds. Specifically, we train a generative prior on a mixture of data +sources that capture object-centric, indoor, and outdoor scenes. To address +issues from data mixture such as depth-scale ambiguity, we propose a novel +camera conditioning parameterization and normalization scheme. Further, we +observe that Score Distillation Sampling (SDS) tends to truncate the +distribution of complex backgrounds during distillation of 360-degree scenes, +and propose ""SDS anchoring"" to improve the diversity of synthesized novel +views. Our model sets a new state-of-the-art result in LPIPS on the DTU dataset +in the zero-shot setting, even outperforming methods specifically trained on +DTU. We further adapt the challenging Mip-NeRF 360 dataset as a new benchmark +for single-image novel view synthesis, and demonstrate strong performance in +this setting. Our code and data are at http://kylesargent.github.io/zeronvs/",cs.CV,"['cs.CV', 'cs.GR']" +Efficient Privacy-Preserving Visual Localization Using 3D Ray Clouds,Heejoon Moon · Chunghwan Lee · Je Hyeong Hong,https://github.com/PHANTOM0122/Ray-cloud,,https://ieeexplore.ieee.org/abstract/document/10203590,,,,,nan +BigGait: Learning Gait Representation You Want by Large Vision Models,Dingqiang Ye · Chao Fan · Jingzhe Ma · Xiaoming Liu · Shiqi Yu,https://github.com/ShiqiYu/OpenGait,https://arxiv.org/abs/2402.19122,,,BigGait: Learning Gait Representation You Want by Large Vision Models,"Gait recognition stands as one of the most pivotal remote identification +technologies and progressively expands across research and industry +communities. However, existing gait recognition methods heavily rely on +task-specific upstream driven by supervised learning to provide explicit gait +representations like silhouette sequences, which inevitably introduce expensive +annotation costs and potential error accumulation. Escaping from this trend, +this work explores effective gait representations based on the all-purpose +knowledge produced by task-agnostic Large Vision Models (LVMs) and proposes a +simple yet efficient gait framework, termed BigGait. 
Specifically, the Gait +Representation Extractor (GRE) within BigGait draws upon design principles from +established gait representations, effectively transforming all-purpose +knowledge into implicit gait representations without requiring third-party +supervision signals. Experiments on CCPG, CAISA-B* and SUSTech1K indicate that +BigGait significantly outperforms the previous methods in both within-domain +and cross-domain tasks in most cases, and provides a more practical paradigm +for learning the next-generation gait representation. Finally, we delve into +prospective challenges and promising directions in LVMs-based gait recognition, +aiming to inspire future work in this emerging topic. The source code is +available at https://github.com/ShiqiYu/OpenGait.",cs.CV,['cs.CV'] +ULIP-2: Towards Scalable Multimodal Pre-training for 3D Understanding,Le Xue · Ning Yu · Shu Zhang · Artemis Panagopoulou · Junnan Li · Roberto Martín-Martín · Jiajun Wu · Caiming Xiong · Ran Xu · Juan Carlos Niebles · Silvio Savarese, ,https://ar5iv.labs.arxiv.org/html/2305.08275,,2305.08275.pdf,ULIP-2: Towards Scalable Multimodal Pre-training for 3D Understanding,"Recent advancements in multimodal pre-training have shown promising efficacy +in 3D representation learning by aligning multimodal features across 3D shapes, +their 2D counterparts, and language descriptions. However, the methods used by +existing frameworks to curate such multimodal data, in particular language +descriptions for 3D shapes, are not scalable, and the collected language +descriptions are not diverse. To address this, we introduce ULIP-2, a simple +yet effective tri-modal pre-training framework that leverages large multimodal +models to automatically generate holistic language descriptions for 3D shapes. +It only needs 3D data as input, eliminating the need for any manual 3D +annotations, and is therefore scalable to large datasets. ULIP-2 is also +equipped with scaled-up backbones for better multimodal representation +learning. We conduct experiments on two large-scale 3D datasets, Objaverse and +ShapeNet, and augment them with tri-modal datasets of 3D point clouds, images, +and language for training ULIP-2. Experiments show that ULIP-2 demonstrates +substantial benefits in three downstream tasks: zero-shot 3D classification, +standard 3D classification with fine-tuning, and 3D captioning (3D-to-language +generation). It achieves a new SOTA of 50.6% (top-1) on Objaverse-LVIS and +84.7% (top-1) on ModelNet40 in zero-shot classification. In the ScanObjectNN +benchmark for standard fine-tuning, ULIP-2 reaches an overall accuracy of 91.5% +with a compact model of only 1.4 million parameters. ULIP-2 sheds light on a +new paradigm for scalable multimodal 3D representation learning without human +annotations and shows significant improvements over existing baselines. The +code and datasets are released at https://github.com/salesforce/ULIP.",cs.CV,['cs.CV'] +On the test-time zero-shot generalization of vision-language models: Do we really need prompt learning?,Maxime Zanella · Ismail Ben Ayed,https://github.com/MaxZanella/MTA,https://arxiv.org/abs/2405.02266,,2405.02266.pdf,On the test-time zero-shot generalization of vision-language models: Do we really need prompt learning?,"The development of large vision-language models, notably CLIP, has catalyzed +research into effective adaptation techniques, with a particular focus on soft +prompt tuning. 
Conjointly, test-time augmentation, which utilizes multiple +augmented views of a single image to enhance zero-shot generalization, is +emerging as a significant area of interest. This has predominantly directed +research efforts toward test-time prompt tuning. In contrast, we introduce a +robust MeanShift for Test-time Augmentation (MTA), which surpasses prompt-based +methods without requiring this intensive training procedure. This positions MTA +as an ideal solution for both standalone and API-based applications. +Additionally, our method does not rely on ad hoc rules (e.g., confidence +threshold) used in some previous test-time augmentation techniques to filter +the augmented views. Instead, MTA incorporates a quality assessment variable +for each view directly into its optimization process, termed as the inlierness +score. This score is jointly optimized with a density mode seeking process, +leading to an efficient training- and hyperparameter-free approach. We +extensively benchmark our method on 15 datasets and demonstrate MTA's +superiority and computational efficiency. Deployed easily as plug-and-play +module on top of zero-shot models and state-of-the-art few-shot methods, MTA +shows systematic and consistent improvements.",cs.CV,['cs.CV'] +Context-Guided Spatio-Temporal Video Grounding,Xin Gu · Heng Fan · Yan Huang · Tiejian Luo · Libo Zhang, ,https://arxiv.org/abs/2401.01578,,2401.01578.pdf,Context-Guided Spatio-Temporal Video Grounding,"Spatio-temporal video grounding (or STVG) task aims at locating a +spatio-temporal tube for a specific instance given a text query. Despite +advancements, current methods easily suffer the distractors or heavy object +appearance variations in videos due to insufficient object information from the +text, leading to degradation. Addressing this, we propose a novel framework, +context-guided STVG (CG-STVG), which mines discriminative instance context for +object in videos and applies it as a supplementary guidance for target +localization. The key of CG-STVG lies in two specially designed modules, +including instance context generation (ICG), which focuses on discovering +visual context information (in both appearance and motion) of the instance, and +instance context refinement (ICR), which aims to improve the instance context +from ICG by eliminating irrelevant or even harmful information from the +context. During grounding, ICG, together with ICR, are deployed at each +decoding stage of a Transformer architecture for instance context learning. +Particularly, instance context learned from one decoding stage is fed to the +next stage, and leveraged as a guidance containing rich and discriminative +object feature to enhance the target-awareness in decoding feature, which +conversely benefits generating better new instance context for improving +localization finally. Compared to existing methods, CG-STVG enjoys object +information in text query and guidance from mined instance visual context for +more accurate target localization. In our experiments on three benchmarks, +including HCSTVG-v1/-v2 and VidSTG, CG-STVG sets new state-of-the-arts in +m_tIoU and m_vIoU on all of them, showing its efficacy. 
The code will be +released at https://github.com/HengLan/CGSTVG.",cs.CV,['cs.CV'] +GaussianEditor: Editing 3D Gaussians Delicately with Text Instructions,Junjie Wang · Jiemin Fang · Xiaopeng Zhang · Lingxi Xie · Qi Tian, ,https://arxiv.org/abs/2311.16037,,2311.16037.pdf,GaussianEditor: Editing 3D Gaussians Delicately with Text Instructions,"Recently, impressive results have been achieved in 3D scene editing with text +instructions based on a 2D diffusion model. However, current diffusion models +primarily generate images by predicting noise in the latent space, and the +editing is usually applied to the whole image, which makes it challenging to +perform delicate, especially localized, editing for 3D scenes. Inspired by +recent 3D Gaussian splatting, we propose a systematic framework, named +GaussianEditor, to edit 3D scenes delicately via 3D Gaussians with text +instructions. Benefiting from the explicit property of 3D Gaussians, we design +a series of techniques to achieve delicate editing. Specifically, we first +extract the region of interest (RoI) corresponding to the text instruction, +aligning it to 3D Gaussians. The Gaussian RoI is further used to control the +editing process. Our framework can achieve more delicate and precise editing of +3D scenes than previous methods while enjoying much faster training speed, i.e. +within 20 minutes on a single V100 GPU, more than twice as fast as +Instruct-NeRF2NeRF (45 minutes -- 2 hours).",cs.CV,"['cs.CV', 'cs.GR']" +GPS-Gaussian: Generalizable Pixel-wise 3D Gaussian Splatting for Real-time Human Novel View Synthesis,Shunyuan Zheng · Boyao ZHOU · Ruizhi Shao · Boning Liu · Shengping Zhang · Liqiang Nie · Yebin Liu,https://shunyuanzheng.github.io/GPS-Gaussian,https://arxiv.org/abs/2312.02155,,2312.02155.pdf,GPS-Gaussian: Generalizable Pixel-wise 3D Gaussian Splatting for Real-time Human Novel View Synthesis,"We present a new approach, termed GPS-Gaussian, for synthesizing novel views +of a character in a real-time manner. The proposed method enables 2K-resolution +rendering under a sparse-view camera setting. Unlike the original Gaussian +Splatting or neural implicit rendering methods that necessitate per-subject +optimizations, we introduce Gaussian parameter maps defined on the source views +and regress directly Gaussian Splatting properties for instant novel view +synthesis without any fine-tuning or optimization. To this end, we train our +Gaussian parameter regression module on a large amount of human scan data, +jointly with a depth estimation module to lift 2D parameter maps to 3D space. +The proposed framework is fully differentiable and experiments on several +datasets demonstrate that our method outperforms state-of-the-art methods while +achieving an exceeding rendering speed.",cs.CV,['cs.CV'] +OpenStreetView-5M: The Many Roads to Global Visual Geolocation,Guillaume Astruc · Nicolas Dufour · Ioannis Siglidis · Constantin Aronssohn · Nacim Bouia · Stephanie Fu · Romain Loiseau · Van Nguyen Nguyen · Charles Raude · Elliot Vincent · Lintao XU · Hongyu Zhou · Loic Landrieu,https://imagine.enpc.fr/~ioannis.siglidis/osv5m/,https://arxiv.org/abs/2404.18873v1,,2404.18873v1.pdf,OpenStreetView-5M: The Many Roads to Global Visual Geolocation,"Determining the location of an image anywhere on Earth is a complex visual +task, which makes it particularly relevant for evaluating computer vision +algorithms. Yet, the absence of standard, large-scale, open-access datasets +with reliably localizable images has limited its potential. 
To address this +issue, we introduce OpenStreetView-5M, a large-scale, open-access dataset +comprising over 5.1 million geo-referenced street view images, covering 225 +countries and territories. In contrast to existing benchmarks, we enforce a +strict train/test separation, allowing us to evaluate the relevance of learned +geographical features beyond mere memorization. To demonstrate the utility of +our dataset, we conduct an extensive benchmark of various state-of-the-art +image encoders, spatial representations, and training strategies. All +associated codes and models can be found at https://github.com/gastruc/osv5m.",cs.CV,"['cs.CV', 'cs.AI']" +Sat2Scene: 3D Urban Scene Generation from Satellite Images with Diffusion,Zuoyue Li · Zhenqiang Li · Zhaopeng Cui · Marc Pollefeys · Martin R. Oswald, ,https://arxiv.org/abs/2401.10786,,2401.10786.pdf,Sat2Scene: 3D Urban Scene Generation from Satellite Images with Diffusion,"Directly generating scenes from satellite imagery offers exciting +possibilities for integration into applications like games and map services. +However, challenges arise from significant view changes and scene scale. +Previous efforts mainly focused on image or video generation, lacking +exploration into the adaptability of scene generation for arbitrary views. +Existing 3D generation works either operate at the object level or are +difficult to utilize the geometry obtained from satellite imagery. To overcome +these limitations, we propose a novel architecture for direct 3D scene +generation by introducing diffusion models into 3D sparse representations and +combining them with neural rendering techniques. Specifically, our approach +generates texture colors at the point level for a given geometry using a 3D +diffusion model first, which is then transformed into a scene representation in +a feed-forward manner. The representation can be utilized to render arbitrary +views which would excel in both single-frame quality and inter-frame +consistency. Experiments in two city-scale datasets show that our model +demonstrates proficiency in generating photo-realistic street-view image +sequences and cross-view urban scenes from satellite imagery.",cs.CV,['cs.CV'] +CADTalk: An Algorithm and Benchmark for Semantic Commenting of CAD Programs,Haocheng Yuan · Jing Xu · Hao Pan · Adrien Bousseau · Niloy J. Mitra · Changjian Li,https://enigma-li.github.io/CADTalk/,https://arxiv.org/abs/2311.16703,,2311.16703.pdf,CADTalk: An Algorithm and Benchmark for Semantic Commenting of CAD Programs,"CAD programs are a popular way to compactly encode shapes as a sequence of +operations that are easy to parametrically modify. However, without sufficient +semantic comments and structure, such programs can be challenging to +understand, let alone modify. We introduce the problem of semantic commenting +CAD programs, wherein the goal is to segment the input program into code blocks +corresponding to semantically meaningful shape parts and assign a semantic +label to each block. We solve the problem by combining program parsing with +visual-semantic analysis afforded by recent advances in foundational language +and vision models. Specifically, by executing the input programs, we create +shapes, which we use to generate conditional photorealistic images to make use +of semantic annotators for such images. We then distill the information across +the images and link back to the original programs to semantically comment on +them. 
Additionally, we collected and annotated a benchmark dataset, CADTalk, +consisting of 5,288 machine-made programs and 45 human-made programs with +ground truth semantic comments. We extensively evaluated our approach, compared +it to a GPT-based baseline, and an open-set shape segmentation baseline, and +reported an 83.24% accuracy on the new CADTalk dataset. Code and data: +https://enigma-li.github.io/CADTalk/.",cs.CV,"['cs.CV', 'cs.GR']" +MTLoRA: Low-Rank Adaptation Approach for Efficient Multi-Task Learning,Ahmed Agiza · Marina Neseem · Sherief Reda,https://github.com/scale-lab/MTLoRA,https://arxiv.org/abs/2403.20320,,2403.20320.pdf,MTLoRA: A Low-Rank Adaptation Approach for Efficient Multi-Task Learning,"Adapting models pre-trained on large-scale datasets to a variety of +downstream tasks is a common strategy in deep learning. Consequently, +parameter-efficient fine-tuning methods have emerged as a promising way to +adapt pre-trained models to different tasks while training only a minimal +number of parameters. While most of these methods are designed for single-task +adaptation, parameter-efficient training in Multi-Task Learning (MTL) +architectures is still unexplored. In this paper, we introduce MTLoRA, a novel +framework for parameter-efficient training of MTL models. MTLoRA employs +Task-Agnostic and Task-Specific Low-Rank Adaptation modules, which effectively +disentangle the parameter space in MTL fine-tuning, thereby enabling the model +to adeptly handle both task specialization and interaction within MTL contexts. +We applied MTLoRA to hierarchical-transformer-based MTL architectures, adapting +them to multiple downstream dense prediction tasks. Our extensive experiments +on the PASCAL dataset show that MTLoRA achieves higher accuracy on downstream +tasks compared to fully fine-tuning the MTL model while reducing the number of +trainable parameters by 3.6x. Furthermore, MTLoRA establishes a Pareto-optimal +trade-off between the number of trainable parameters and the accuracy of the +downstream tasks, outperforming current state-of-the-art parameter-efficient +training methods in both accuracy and efficiency. Our code is publicly +available.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +View-Category Interactive Sharing Transformer for Incomplete Multi-View Multi-Label Learning,Shilong Ou · Zhe Xue · Yawen Li · Meiyu Liang · Yuanqiang Cai · junjiang wu, ,https://arxiv.org/abs/2404.17340,,2404.17340.pdf,Masked Two-channel Decoupling Framework for Incomplete Multi-view Weak Multi-label Learning,"Multi-view learning has become a popular research topic in recent years, but +research on the cross-application of classic multi-label classification and +multi-view learning is still in its early stages. In this paper, we focus on +the complex yet highly realistic task of incomplete multi-view weak multi-label +learning and propose a masked two-channel decoupling framework based on deep +neural networks to solve this problem. The core innovation of our method lies +in decoupling the single-channel view-level representation, which is common in +deep multi-view learning methods, into a shared representation and a +view-proprietary representation. We also design a cross-channel contrastive +loss to enhance the semantic property of the two channels. Additionally, we +exploit supervised information to design a label-guided graph regularization +loss, helping the extracted embedding features preserve the geometric structure +among samples. 
Inspired by the success of masking mechanisms in image and text +analysis, we develop a random fragment masking strategy for vector features to +improve the learning ability of encoders. Finally, it is important to emphasize +that our model is fully adaptable to arbitrary view and label absences while +also performing well on the ideal full data. We have conducted sufficient and +convincing experiments to confirm the effectiveness and advancement of our +model.",cs.CV,['cs.CV'] +FineSports: A Multi-person Hierarchical Sports Video Dataset for Fine-grained Action Understanding,Jinglin Xu · Guohao Zhao · Sibo Yin · Wenhao Zhou · Yuxin Peng, ,,,,,,,nan +Visual Layout Composer: Image-Vector Dual Diffusion Model for Design Layout Generation,Mohammad Amin Shabani · Zhaowen Wang · Difan Liu · Nanxuan Zhao · Jimei Yang · Yasutaka Furukawa,https://aminshabani.github.io/visual_layout_composer/index.html,https://web3.arxiv.org/abs/2402.04754,,2402.04754.pdf,Towards Aligned Layout Generation via Diffusion Model with Aesthetic Constraints,"Controllable layout generation refers to the process of creating a plausible +visual arrangement of elements within a graphic design (e.g., document and web +designs) with constraints representing design intentions. Although recent +diffusion-based models have achieved state-of-the-art FID scores, they tend to +exhibit more pronounced misalignment compared to earlier transformer-based +models. In this work, we propose the $\textbf{LA}$yout $\textbf{C}$onstraint +diffusion mod$\textbf{E}$l (LACE), a unified model to handle a broad range of +layout generation tasks, such as arranging elements with specified attributes +and refining or completing a coarse layout design. The model is based on +continuous diffusion models. Compared with existing methods that use discrete +diffusion models, continuous state-space design can enable the incorporation of +differentiable aesthetic constraint functions in training. For conditional +generation, we introduce conditions via masked input. Extensive experiment +results show that LACE produces high-quality layouts and outperforms existing +state-of-the-art baselines.",cs.CV,"['cs.CV', 'cs.LG']" +AdaShift: Learning Discriminative Self-Gated Neural Feature Activation With an Adaptive Shift Factor,Sudong Cai,https://github.com/SudongCAI/AdaShift,,https://www.nature.com/articles/s41598-024-60598-2,,,,,nan +Generalized Event Cameras,Varun Sundar · Matthew Dutson · Andrei Ardelean · Claudio Bruschini · Edoardo Charbon · Mohit Gupta,https://wisionlab.com/project/generalized-event-cameras/,,https://aim.autm.net/public/project/73780/,,,,,nan +DiLiGenRT: A Photometric Stereo Dataset with Quantified Roughness and Translucency,Heng Guo · Jieji Ren · Feishi Wang · Boxin Shi · Mingjun Ren · Yasuyuki Matsushita, ,,,,,,,nan +Instantaneous Perception of Moving Objects in 3D,Di Liu · Bingbing Zhuang · Dimitris N. Metaxas · Manmohan Chandraker, ,https://arxiv.org/abs/2405.02781,,2405.02781.pdf,Instantaneous Perception of Moving Objects in 3D,"The perception of 3D motion of surrounding traffic participants is crucial +for driving safety. While existing works primarily focus on general large +motions, we contend that the instantaneous detection and quantification of +subtle motions is equally important as they indicate the nuances in driving +behavior that may be safety critical, such as behaviors near a stop sign of +parking positions. 
We delve into this under-explored task, examining its unique +challenges and developing our solution, accompanied by a carefully designed +benchmark. Specifically, due to the lack of correspondences between consecutive +frames of sparse Lidar point clouds, static objects might appear to be moving - +the so-called swimming effect. This intertwines with the true object motion, +thereby posing ambiguity in accurate estimation, especially for subtle motions. +To address this, we propose to leverage local occupancy completion of object +point clouds to densify the shape cue, and mitigate the impact of swimming +artifacts. The occupancy completion is learned in an end-to-end fashion +together with the detection of moving objects and the estimation of their +motion, instantaneously as soon as objects start to move. Extensive experiments +demonstrate superior performance compared to standard 3D motion estimation +approaches, particularly highlighting our method's specialized treatment of +subtle motions.",cs.CV,['cs.CV'] +eTraM: Event-based Traffic Monitoring Dataset,Aayush Atul Verma · Bharatesh Chakravarthi · Arpitsinh Vaghela · Hua Wei · 'YZ' Yezhou Yang,https://eventbasedvision.github.io/eTraM/,https://arxiv.org/abs/2403.19976,,2403.19976.pdf,eTraM: Event-based Traffic Monitoring Dataset,"Event cameras, with their high temporal and dynamic range and minimal memory +usage, have found applications in various fields. However, their potential in +static traffic monitoring remains largely unexplored. To facilitate this +exploration, we present eTraM - a first-of-its-kind, fully event-based traffic +monitoring dataset. eTraM offers 10 hr of data from different traffic scenarios +in various lighting and weather conditions, providing a comprehensive overview +of real-world situations. Providing 2M bounding box annotations, it covers +eight distinct classes of traffic participants, ranging from vehicles to +pedestrians and micro-mobility. eTraM's utility has been assessed using +state-of-the-art methods for traffic participant detection, including RVT, RED, +and YOLOv8. We quantitatively evaluate the ability of event-based models to +generalize on nighttime and unseen scenes. Our findings substantiate the +compelling potential of leveraging event cameras for traffic monitoring, +opening new avenues for research and application. eTraM is available at +https://eventbasedvision.github.io/eTraM",cs.CV,['cs.CV'] +Self-Discovering Interpretable Diffusion Latent Directions for Responsible Text-to-Image Generation,Hang Li · Chengzhi Shen · Philip H.S. Torr · Volker Tresp · Jindong Gu, ,https://arxiv.org/abs/2311.17216,,2311.17216.pdf,Self-Discovering Interpretable Diffusion Latent Directions for Responsible Text-to-Image Generation,"Diffusion-based models have gained significant popularity for text-to-image +generation due to their exceptional image-generation capabilities. A risk with +these models is the potential generation of inappropriate content, such as +biased or harmful images. However, the underlying reasons for generating such +undesired content from the perspective of the diffusion model's internal +representation remain unclear. Previous work interprets vectors in an +interpretable latent space of diffusion models as semantic concepts. However, +existing approaches cannot discover directions for arbitrary concepts, such as +those related to inappropriate concepts. In this work, we propose a novel +self-supervised approach to find interpretable latent directions for a given +concept. 
With the discovered vectors, we further propose a simple approach to +mitigate inappropriate generation. Extensive experiments have been conducted to +verify the effectiveness of our mitigation approach, namely, for fair +generation, safe generation, and responsible text-enhancing generation. Project +page: \url{https://interpretdiffusion.github.io}.",cs.CV,['cs.CV'] +GenNBV: Generalizable Next-Best-View Policy for Active 3D Reconstruction,Xiao Chen · Quanyi Li · Tai Wang · Tianfan Xue · Jiangmiao Pang, ,https://arxiv.org/abs/2402.16174,,2402.16174.pdf,GenNBV: Generalizable Next-Best-View Policy for Active 3D Reconstruction,"While recent advances in neural radiance field enable realistic digitization +for large-scale scenes, the image-capturing process is still time-consuming and +labor-intensive. Previous works attempt to automate this process using the +Next-Best-View (NBV) policy for active 3D reconstruction. However, the existing +NBV policies heavily rely on hand-crafted criteria, limited action space, or +per-scene optimized representations. These constraints limit their +cross-dataset generalizability. To overcome them, we propose GenNBV, an +end-to-end generalizable NBV policy. Our policy adopts a reinforcement learning +(RL)-based framework and extends typical limited action space to 5D free space. +It empowers our agent drone to scan from any viewpoint, and even interact with +unseen geometries during training. To boost the cross-dataset generalizability, +we also propose a novel multi-source state embedding, including geometric, +semantic, and action representations. We establish a benchmark using the Isaac +Gym simulator with the Houses3K and OmniObject3D datasets to evaluate this NBV +policy. Experiments demonstrate that our policy achieves a 98.26% and 97.12% +coverage ratio on unseen building-scale objects from these datasets, +respectively, outperforming prior solutions.",cs.CV,"['cs.CV', 'cs.AI', 'cs.RO']" +On the Robustness of Language Guidance for Low-Level Vision Tasks: Findings from Depth Estimation,Agneet Chatterjee · Tejas Gokhale · Chitta Baral · 'YZ' Yezhou Yang,https://agneetchatterjee.com/robustness_depth_lang/,https://arxiv.org/abs/2404.08540,,2404.08540.pdf,On the Robustness of Language Guidance for Low-Level Vision Tasks: Findings from Depth Estimation,"Recent advances in monocular depth estimation have been made by incorporating +natural language as additional guidance. Although yielding impressive results, +the impact of the language prior, particularly in terms of generalization and +robustness, remains unexplored. In this paper, we address this gap by +quantifying the impact of this prior and introduce methods to benchmark its +effectiveness across various settings. We generate ""low-level"" sentences that +convey object-centric, three-dimensional spatial relationships, incorporate +them as additional language priors and evaluate their downstream impact on +depth estimation. Our key finding is that current language-guided depth +estimators perform optimally only with scene-level descriptions and +counter-intuitively fare worse with low level descriptions. Despite leveraging +additional data, these methods are not robust to directed adversarial attacks +and decline in performance with an increase in distribution shift. Finally, to +provide a foundation for future research, we identify points of failures and +offer insights to better understand these shortcomings. 
With an increasing +number of methods using language for depth estimation, our findings highlight +the opportunities and pitfalls that require careful consideration for effective +deployment in real-world settings",cs.CV,['cs.CV'] +Towards a Perceptual Evaluation Framework for Lighting Estimation,Justine Giroux · Mohammad Reza Karimi Dastjerdi · Yannick Hold-Geoffroy · Javier Vazquez-Corral · Jean-François Lalonde, ,https://arxiv.org/abs/2312.04334,,2312.04334.pdf,Towards a Perceptual Evaluation Framework for Lighting Estimation,"Progress in lighting estimation is tracked by computing existing image +quality assessment (IQA) metrics on images from standard datasets. While this +may appear to be a reasonable approach, we demonstrate that doing so does not +correlate to human preference when the estimated lighting is used to relight a +virtual scene into a real photograph. To study this, we design a controlled +psychophysical experiment where human observers must choose their preference +amongst rendered scenes lit using a set of lighting estimation algorithms +selected from the recent literature, and use it to analyse how these algorithms +perform according to human perception. Then, we demonstrate that none of the +most popular IQA metrics from the literature, taken individually, correctly +represent human perception. Finally, we show that by learning a combination of +existing IQA metrics, we can more accurately represent human preference. This +provides a new perceptual framework to help evaluate future lighting estimation +algorithms.",cs.CV,['cs.CV'] +HOLD: Category-agnostic 3D Reconstruction of Interacting Hands and Objects from Video,Zicong Fan · Maria Parelli · Maria Kadoglou · Xu Chen · Muhammed Kocabas · Michael J. Black · Otmar Hilliges,https://zc-alexfan.github.io/hold,https://arxiv.org/abs/2311.18448v1,,2311.18448v1.pdf,HOLD: Category-agnostic 3D Reconstruction of Interacting Hands and Objects from Video,"Since humans interact with diverse objects every day, the holistic 3D capture +of these interactions is important to understand and model human behaviour. +However, most existing methods for hand-object reconstruction from RGB either +assume pre-scanned object templates or heavily rely on limited 3D hand-object +data, restricting their ability to scale and generalize to more unconstrained +interaction settings. To this end, we introduce HOLD -- the first +category-agnostic method that reconstructs an articulated hand and object +jointly from a monocular interaction video. We develop a compositional +articulated implicit model that can reconstruct disentangled 3D hand and object +from 2D images. We also further incorporate hand-object constraints to improve +hand-object poses and consequently the reconstruction quality. Our method does +not rely on 3D hand-object annotations while outperforming fully-supervised +baselines in both in-the-lab and challenging in-the-wild settings. Moreover, we +qualitatively show its robustness in reconstructing from in-the-wild videos. +Code: https://github.com/zc-alexfan/hold",cs.CV,['cs.CV'] +Language-driven All-in-one Adverse Weather Removal,Hao Yang · Liyuan Pan · Yan Yang · Wei Liang, ,https://arxiv.org/abs/2312.01381,,2312.01381.pdf,Language-driven All-in-one Adverse Weather Removal,"All-in-one (AiO) frameworks restore various adverse weather degradations with +a single set of networks jointly. 
To handle various weather conditions, an AiO +framework is expected to adaptively learn weather-specific knowledge for +different degradations and shared knowledge for common patterns. However, +existing methods: 1) rely on extra supervision signals, which are usually +unknown in real-world applications; 2) employ fixed network structures, which +restrict the diversity of weather-specific knowledge. In this paper, we propose +a Language-driven Restoration framework (LDR) to alleviate the aforementioned +issues. First, we leverage the power of pre-trained vision-language (PVL) +models to enrich the diversity of weather-specific knowledge by reasoning about +the occurrence, type, and severity of degradation, generating description-based +degradation priors. Then, with the guidance of degradation prior, we sparsely +select restoration experts from a candidate list dynamically based on a +Mixture-of-Experts (MoE) structure. This enables us to adaptively learn the +weather-specific and shared knowledge to handle various weather conditions +(e.g., unknown or mixed weather). Experiments on extensive restoration +scenarios show our superior performance (see Fig. 1). The source code will be +made available.",cs.CV,['cs.CV'] +Gaussian Splatting SLAM,Hidenobu Matsuki · Riku Murai · Paul Kelly · Andrew J. Davison,https://rmurai.co.uk/projects/GaussianSplattingSLAM/,https://arxiv.org/abs/2312.06741,,2312.06741.pdf,Gaussian Splatting SLAM,"We present the first application of 3D Gaussian Splatting in monocular SLAM, +the most fundamental but the hardest setup for Visual SLAM. Our method, which +runs live at 3fps, utilises Gaussians as the only 3D representation, unifying +the required representation for accurate, efficient tracking, mapping, and +high-quality rendering. Designed for challenging monocular settings, our +approach is seamlessly extendable to RGB-D SLAM when an external depth sensor +is available. Several innovations are required to continuously reconstruct 3D +scenes with high fidelity from a live camera. First, to move beyond the +original 3DGS algorithm, which requires accurate poses from an offline +Structure from Motion (SfM) system, we formulate camera tracking for 3DGS using +direct optimisation against the 3D Gaussians, and show that this enables fast +and robust tracking with a wide basin of convergence. Second, by utilising the +explicit nature of the Gaussians, we introduce geometric verification and +regularisation to handle the ambiguities occurring in incremental 3D dense +reconstruction. Finally, we introduce a full SLAM system which not only +achieves state-of-the-art results in novel view synthesis and trajectory +estimation but also reconstruction of tiny and even transparent objects.",cs.CV,"['cs.CV', 'cs.RO']" +Backdoor Defense via Test-Time Detecting and Repairing,Jiyang Guan · Jian Liang · Ran He, ,https://arxiv.org/abs/2308.06107,,2308.06107.pdf,Test-Time Backdoor Defense via Detecting and Repairing,"Deep neural networks have played a crucial part in many critical domains, +such as autonomous driving, face recognition, and medical diagnosis. However, +deep neural networks are facing security threats from backdoor attacks and can +be manipulated into attacker-decided behaviors by the backdoor attacker. To +defend the backdoor, prior research has focused on using clean data to remove +backdoor attacks before model deployment. 
In this paper, we investigate the +possibility of defending against backdoor attacks at test time by utilizing +partially poisoned data to remove the backdoor from the model. To address the +problem, a two-stage method Test-Time Backdoor Defense (TTBD) is proposed. In +the first stage, we propose a backdoor sample detection method DDP to identify +poisoned samples from a batch of mixed, partially poisoned samples. Once the +poisoned samples are detected, we employ Shapley estimation to calculate the +contribution of each neuron's significance in the network, locate the poisoned +neurons, and prune them to remove backdoor in the models. Our experiments +demonstrate that TTBD removes the backdoor successfully with only a batch of +partially poisoned data across different model architectures and datasets +against different types of backdoor attacks.",cs.CR,['cs.CR'] +XCube: Large-Scale 3D Generative Modeling using Sparse Voxel Hierarchies,Xuanchi Ren · Jiahui Huang · Xiaohui Zeng · Ken Museth · Sanja Fidler · Francis Williams, ,https://arxiv.org/abs/2312.03806,,2312.03806.pdf,XCube ($\mathcal{X}^3$): Large-Scale 3D Generative Modeling using Sparse Voxel Hierarchies,"We present $\mathcal{X}^3$ (pronounced XCube), a novel generative model for +high-resolution sparse 3D voxel grids with arbitrary attributes. Our model can +generate millions of voxels with a finest effective resolution of up to +$1024^3$ in a feed-forward fashion without time-consuming test-time +optimization. To achieve this, we employ a hierarchical voxel latent diffusion +model which generates progressively higher resolution grids in a coarse-to-fine +manner using a custom framework built on the highly efficient VDB data +structure. Apart from generating high-resolution objects, we demonstrate the +effectiveness of XCube on large outdoor scenes at scales of 100m$\times$100m +with a voxel size as small as 10cm. We observe clear qualitative and +quantitative improvements over past approaches. In addition to unconditional +generation, we show that our model can be used to solve a variety of tasks such +as user-guided editing, scene completion from a single scan, and text-to-3D. +More results and details can be found at +https://research.nvidia.com/labs/toronto-ai/xcube/.",cs.CV,"['cs.CV', 'cs.GR', 'cs.LG']" +MultiPhys: Multi-Person Physics-aware 3D Motion Estimation,Nicolás Ugrinovic · Boxiao Pan · Georgios Pavlakos · Despoina Paschalidou · Bokui Shen · Jordi Sanchez-Riera · Francesc Moreno-Noguer · Leonidas Guibas, ,https://arxiv.org/abs/2404.11987,,2404.11987.pdf,MultiPhys: Multi-Person Physics-aware 3D Motion Estimation,"We introduce MultiPhys, a method designed for recovering multi-person motion +from monocular videos. Our focus lies in capturing coherent spatial placement +between pairs of individuals across varying degrees of engagement. MultiPhys, +being physically aware, exhibits robustness to jittering and occlusions, and +effectively eliminates penetration issues between the two individuals. We +devise a pipeline in which the motion estimated by a kinematic-based method is +fed into a physics simulator in an autoregressive manner. We introduce distinct +components that enable our model to harness the simulator's properties without +compromising the accuracy of the kinematic estimates. This results in final +motion estimates that are both kinematically coherent and physically compliant. 
+Extensive evaluations on three challenging datasets characterized by +substantial inter-person interaction show that our method significantly reduces +errors associated with penetration and foot skating, while performing +competitively with the state-of-the-art on motion accuracy and smoothness. +Results and code can be found on our project page +(http://www.iri.upc.edu/people/nugrinovic/multiphys/).",cs.CV,['cs.CV'] +Implicit Motion Function,Yue Gao · Jiahao Li · Lei Chu · Yan Lu, ,,https://ieeexplore.ieee.org/document/10378136/citations?tabFilter=papers,,,,,nan +Improving Generalized Zero-Shot Learning by Exploring the Diverse Semantics from External Class Names,Yapeng Li · Yong Luo · Zengmao Wang · Bo Du, ,,https://ieeexplore.ieee.org/document/10283906,,,,,nan +Unsupervised 3D Structure Inference from Category-Specific Image Collections,Weikang Wang · Dongliang Cao · Florian Bernard,https://wei-kang-wang.github.io/unsuper3Dstructure/,,,,,,,nan +Text-Driven Image Editing via Learnable Regions,Yuanze Lin · Yi-Wen Chen · Yi-Hsuan Tsai · Lu Jiang · Ming-Hsuan Yang,https://yuanze-lin.me/LearnableRegions_page/,https://arxiv.org/abs/2311.16432,,2311.16432.pdf,Text-Driven Image Editing via Learnable Regions,"Language has emerged as a natural interface for image editing. In this paper, +we introduce a method for region-based image editing driven by textual prompts, +without the need for user-provided masks or sketches. Specifically, our +approach leverages an existing pre-trained text-to-image model and introduces a +bounding box generator to identify the editing regions that are aligned with +the textual prompts. We show that this simple approach enables flexible editing +that is compatible with current image generation models, and is able to handle +complex prompts featuring multiple objects, complex sentences, or lengthy +paragraphs. We conduct an extensive user study to compare our method against +state-of-the-art methods. The experiments demonstrate the competitive +performance of our method in manipulating images with high fidelity and realism +that correspond to the provided language descriptions. Our project webpage can +be found at: https://yuanze-lin.me/LearnableRegions_page.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +Balancing Act: Distribution-Guided Debiasing in Diffusion Models,Rishubh Parihar · Abhijnya Bhat · Abhipsa Basu · Saswat Mallick · Jogendra Kundu Kundu · R. Venkatesh Babu, ,https://arxiv.org/abs/2402.18206,,2402.18206.pdf,Balancing Act: Distribution-Guided Debiasing in Diffusion Models,"Diffusion Models (DMs) have emerged as powerful generative models with +unprecedented image generation capability. These models are widely used for +data augmentation and creative applications. However, DMs reflect the biases +present in the training datasets. This is especially concerning in the context +of faces, where the DM prefers one demographic subgroup vs others (eg. female +vs male). In this work, we present a method for debiasing DMs without relying +on additional data or model retraining. Specifically, we propose Distribution +Guidance, which enforces the generated images to follow the prescribed +attribute distribution. To realize this, we build on the key insight that the +latent features of denoising UNet hold rich demographic semantics, and the same +can be leveraged to guide debiased generation. We train Attribute Distribution +Predictor (ADP) - a small mlp that maps the latent features to the distribution +of attributes. 
ADP is trained with pseudo labels generated from existing +attribute classifiers. The proposed Distribution Guidance with ADP enables us +to do fair generation. Our method reduces bias across single/multiple +attributes and outperforms the baseline by a significant margin for +unconditional and text-conditional diffusion models. Further, we present a +downstream task of training a fair attribute classifier by rebalancing the +training set with our generated data.",cs.CV,['cs.CV'] +Close Imitation of Expert Retouching for Black-and-White Photography,Seunghyun Shin · Jisu Shin · Jihwan Bae · Inwook Shim · Hae-Gon Jeon,https://github.com/seunghyuns98/Decolorization,,https://retouchinglabs.com/retouching-black-and-white-photos/,,,,,nan +Generative Image Dynamics,Zhengqi Li · Richard Tucker · Noah Snavely · Aleksander Holynski, ,https://arxiv.org/abs/2309.07906,,2309.07906.pdf,Generative Image Dynamics,"We present an approach to modeling an image-space prior on scene motion. Our +prior is learned from a collection of motion trajectories extracted from real +video sequences depicting natural, oscillatory dynamics such as trees, flowers, +candles, and clothes swaying in the wind. We model this dense, long-term motion +prior in the Fourier domain:given a single image, our trained model uses a +frequency-coordinated diffusion sampling process to predict a spectral volume, +which can be converted into a motion texture that spans an entire video. Along +with an image-based rendering module, these trajectories can be used for a +number of downstream applications, such as turning still images into seamlessly +looping videos, or allowing users to realistically interact with objects in +real pictures by interpreting the spectral volumes as image-space modal bases, +which approximate object dynamics.",cs.CV,['cs.CV'] +RAM-Avatar: Real-time Photo-Realistic Avatar from Monocular Videos with Full-body Control,xiang deng · Zerong Zheng · Yuxiang Zhang · Jingxiang Sun · Chao Xu · Xiaodong Yang · Lizhen Wang · Yebin Liu, ,https://arxiv.org/html/2303.10275v2,,2303.10275v2.pdf,MoRF: Mobile Realistic Fullbody Avatars from a Monocular Video,"We present a system to create Mobile Realistic Fullbody (MoRF) avatars. MoRF +avatars are rendered in real-time on mobile devices, learned from monocular +videos, and have high realism. We use SMPL-X as a proxy geometry and render it +with DNR (neural texture and image-2-image network). We improve on prior work, +by overfitting per-frame warping fields in the neural texture space, allowing +to better align the training signal between different frames. We also refine +SMPL-X mesh fitting procedure to improve the overall avatar quality. In the +comparisons to other monocular video-based avatar systems, MoRF avatars achieve +higher image sharpness and temporal consistency. Participants of our user study +also preferred avatars generated by MoRF.",cs.CV,['cs.CV'] +SCoFT: Self-Contrastive Fine-Tuning for Equitable Image Generation,Zhixuan Liu · Peter Schaldenbrand · Beverley-Claire Okogwu · Wenxuan Peng · Youngsik Yun · Andrew Hundt · Jihie Kim · Jean Oh,ariannaliu.github.io/SCoFT/,https://arxiv.org/abs/2401.08053,,2401.08053.pdf,SCoFT: Self-Contrastive Fine-Tuning for Equitable Image Generation,"Accurate representation in media is known to improve the well-being of the +people who consume it. Generative image models trained on large web-crawled +datasets such as LAION are known to produce images with harmful stereotypes and +misrepresentations of cultures. 
We improve inclusive representation in +generated images by (1) engaging with communities to collect a culturally +representative dataset that we call the Cross-Cultural Understanding Benchmark +(CCUB) and (2) proposing a novel Self-Contrastive Fine-Tuning (SCoFT) method +that leverages the model's known biases to self-improve. SCoFT is designed to +prevent overfitting on small datasets, encode only high-level information from +the data, and shift the generated distribution away from misrepresentations +encoded in a pretrained model. Our user study conducted on 51 participants from +5 different countries based on their self-selected national cultural +affiliation shows that fine-tuning on CCUB consistently generates images with +higher cultural relevance and fewer stereotypes when compared to the Stable +Diffusion baseline, which is further improved with our SCoFT technique.",cs.CV,['cs.CV'] +Rendering Every Pixel for High-Fidelity Geometry in 3D GANs,Alex Trevithick · Matthew Chan · Towaki Takikawa · Umar Iqbal · Shalini De Mello · Manmohan Chandraker · Ravi Ramamoorthi · Koki Nagano, ,https://arxiv.org/abs/2401.02411,,2401.02411.pdf,What You See is What You GAN: Rendering Every Pixel for High-Fidelity Geometry in 3D GANs,"3D-aware Generative Adversarial Networks (GANs) have shown remarkable +progress in learning to generate multi-view-consistent images and 3D geometries +of scenes from collections of 2D images via neural volume rendering. Yet, the +significant memory and computational costs of dense sampling in volume +rendering have forced 3D GANs to adopt patch-based training or employ +low-resolution rendering with post-processing 2D super resolution, which +sacrifices multiview consistency and the quality of resolved geometry. +Consequently, 3D GANs have not yet been able to fully resolve the rich 3D +geometry present in 2D images. In this work, we propose techniques to scale +neural volume rendering to the much higher resolution of native 2D images, +thereby resolving fine-grained 3D geometry with unprecedented detail. Our +approach employs learning-based samplers for accelerating neural rendering for +3D GAN training using up to 5 times fewer depth samples. This enables us to +explicitly ""render every pixel"" of the full-resolution image during training +and inference without post-processing superresolution in 2D. Together with our +strategy to learn high-quality surface geometry, our method synthesizes +high-resolution 3D geometry and strictly view-consistent images while +maintaining image quality on par with baselines relying on post-processing +super resolution. We demonstrate state-of-the-art 3D gemetric quality on FFHQ +and AFHQ, setting a new standard for unsupervised learning of 3D shapes in 3D +GANs.",cs.CV,"['cs.CV', 'cs.AI', 'cs.GR', 'cs.LG']" +An Interactive Navigation Method with Effect-oriented Affordance,Xiaohan Wang · Yuehu LIU · Xinhang Song · Yuyi Liu · Sixian Zhang · Shuqiang Jiang, ,https://arxiv.org/abs/2310.08873,,2310.08873.pdf,Interactive Navigation in Environments with Traversable Obstacles Using Large Language and Vision-Language Models,"This paper proposes an interactive navigation framework by using large +language and vision-language models, allowing robots to navigate in +environments with traversable obstacles. We utilize the large language model +(GPT-3.5) and the open-set Vision-language Model (Grounding DINO) to create an +action-aware costmap to perform effective path planning without fine-tuning. 
+With the large models, we can achieve an end-to-end system from textual +instructions like ""Can you pass through the curtains to deliver medicines to +me?"", to bounding boxes (e.g., curtains) with action-aware attributes. They can +be used to segment LiDAR point clouds into two parts: traversable and +untraversable parts, and then an action-aware costmap is constructed for +generating a feasible path. The pre-trained large models have great +generalization ability and do not require additional annotated data for +training, allowing fast deployment in the interactive navigation tasks. We +choose to use multiple traversable objects such as curtains and grasses for +verification by instructing the robot to traverse them. Besides, traversing +curtains in a medical scenario was tested. All experimental results +demonstrated the proposed framework's effectiveness and adaptability to diverse +environments.",cs.RO,"['cs.RO', 'cs.AI']" +Communication-Efficient Federated Learning with Accelerated Client Gradient,Geeho Kim · Jinkyu Kim · Bohyung Han, ,,https://openreview.net/forum?id=qwymfs6cKe,,,,,nan +InceptionNeXt: When Inception Meets ConvNeXt,Weihao Yu · Pan Zhou · Shuicheng Yan · Xinchao Wang,https://github.com/sail-sg/inceptionnext,,https://dblp.org/rec/journals/corr/abs-2303-16900,,,,,nan +MULAN: A Multi Layer Annotated Dataset for Controllable Text-to-Image Generation,Petru-Daniel Tudosiu · Yongxin Yang · Shifeng Zhang · Fei Chen · Steven McDonagh · Gerasimos Lampouras · Ignacio Iacobacci · Sarah Parisot,https://mulan-dataset.github.io/,https://arxiv.org/abs/2404.02790,,2404.02790.pdf,MULAN: A Multi Layer Annotated Dataset for Controllable Text-to-Image Generation,"Text-to-image generation has achieved astonishing results, yet precise +spatial controllability and prompt fidelity remain highly challenging. This +limitation is typically addressed through cumbersome prompt engineering, scene +layout conditioning, or image editing techniques which often require hand drawn +masks. Nonetheless, pre-existing works struggle to take advantage of the +natural instance-level compositionality of scenes due to the typically flat +nature of rasterized RGB output images. Towards adressing this challenge, we +introduce MuLAn: a novel dataset comprising over 44K MUlti-Layer ANnotations of +RGB images as multilayer, instance-wise RGBA decompositions, and over 100K +instance images. To build MuLAn, we developed a training free pipeline which +decomposes a monocular RGB image into a stack of RGBA layers comprising of +background and isolated instances. We achieve this through the use of +pretrained general-purpose models, and by developing three modules: image +decomposition for instance discovery and extraction, instance completion to +reconstruct occluded areas, and image re-assembly. We use our pipeline to +create MuLAn-COCO and MuLAn-LAION datasets, which contain a variety of image +decompositions in terms of style, composition and complexity. With MuLAn, we +provide the first photorealistic resource providing instance decomposition and +occlusion information for high quality images, opening up new avenues for +text-to-image generative AI research. With this, we aim to encourage the +development of novel generation and editing technology, in particular +layer-wise solutions. 
MuLAn data resources are available at +https://MuLAn-dataset.github.io/.",cs.CV,['cs.CV'] +Ink Dot-Oriented Differentiable Optimization for Neural Image Halftoning,Hao Jiang · Bingfeng Zhou · Yadong Mu, ,,https://ietresearch.onlinelibrary.wiley.com/doi/full/10.1049/ipr2.12998,,,,,nan +On the Scalability of Diffusion-based Text-to-Image Generation,Hao Li · Yang Zou · Ying Wang · Orchid Majumder · Yusheng Xie · R. Manmatha · Ashwin Swaminathan · Zhuowen Tu · Stefano Ermon · Stefano Soatto, ,https://arxiv.org/abs/2404.02883,,2404.02883.pdf,On the Scalability of Diffusion-based Text-to-Image Generation,"Scaling up model and data size has been quite successful for the evolution of +LLMs. However, the scaling law for the diffusion based text-to-image (T2I) +models is not fully explored. It is also unclear how to efficiently scale the +model for better performance at reduced cost. The different training settings +and expensive training cost make a fair model comparison extremely difficult. +In this work, we empirically study the scaling properties of diffusion based +T2I models by performing extensive and rigours ablations on scaling both +denoising backbones and training set, including training scaled UNet and +Transformer variants ranging from 0.4B to 4B parameters on datasets upto 600M +images. For model scaling, we find the location and amount of cross attention +distinguishes the performance of existing UNet designs. And increasing the +transformer blocks is more parameter-efficient for improving text-image +alignment than increasing channel numbers. We then identify an efficient UNet +variant, which is 45% smaller and 28% faster than SDXL's UNet. On the data +scaling side, we show the quality and diversity of the training set matters +more than simply dataset size. Increasing caption density and diversity +improves text-image alignment performance and the learning efficiency. Finally, +we provide scaling functions to predict the text-image alignment performance as +functions of the scale of model size, compute and dataset size.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +G-FARS: Gradient-Field-based Auto-Regressive Sampling for 3D Part Grouping,Junfeng Cheng · Tania Stathaki,https://github.com/J-F-Cheng/G-FARS-3DPartGrouping,https://arxiv.org/abs/2405.06828,,2405.06828.pdf,G-FARS: Gradient-Field-based Auto-Regressive Sampling for 3D Part Grouping,"This paper proposes a novel task named ""3D part grouping"". Suppose there is a +mixed set containing scattered parts from various shapes. This task requires +algorithms to find out every possible combination among all the parts. To +address this challenge, we propose the so called Gradient Field-based +Auto-Regressive Sampling framework (G-FARS) tailored specifically for the 3D +part grouping task. In our framework, we design a gradient-field-based +selection graph neural network (GNN) to learn the gradients of a log +conditional probability density in terms of part selection, where the condition +is the given mixed part set. This innovative approach, implemented through the +gradient-field-based selection GNN, effectively captures complex relationships +among all the parts in the input. Upon completion of the training process, our +framework becomes capable of autonomously grouping 3D parts by iteratively +selecting them from the mixed part set, leveraging the knowledge acquired by +the trained gradient-field-based selection GNN. 
Our code is available at: +https://github.com/J-F-Cheng/G-FARS-3DPartGrouping.",cs.CV,['cs.CV'] +Unsupervised Salient Instance Detection,Xin Tian · Ke Xu · Rynson W.H. Lau, ,https://arxiv.org/abs/2404.14759,,2404.14759.pdf,Unified Unsupervised Salient Object Detection via Knowledge Transfer,"Recently, unsupervised salient object detection (USOD) has gained increasing +attention due to its annotation-free nature. However, current methods mainly +focus on specific tasks such as RGB and RGB-D, neglecting the potential for +task migration. In this paper, we propose a unified USOD framework for generic +USOD tasks. Firstly, we propose a Progressive Curriculum Learning-based +Saliency Distilling (PCL-SD) mechanism to extract saliency cues from a +pre-trained deep network. This mechanism starts with easy samples and +progressively moves towards harder ones, to avoid initial interference caused +by hard samples. Afterwards, the obtained saliency cues are utilized to train a +saliency detector, and we employ a Self-rectify Pseudo-label Refinement (SPR) +mechanism to improve the quality of pseudo-labels. Finally, an adapter-tuning +method is devised to transfer the acquired saliency knowledge, leveraging +shared knowledge to attain superior transferring performance on the target +tasks. Extensive experiments on five representative SOD tasks confirm the +effectiveness and feasibility of our proposed method. Code and supplement +materials are available at https://github.com/I2-Multimedia-Lab/A2S-v3.",cs.CV,['cs.CV'] +TextureDreamer: Image-guided Texture Synthesis through Geometry-aware Diffusion,Yu-Ying Yeh · Jia-Bin Huang · Changil Kim · Lei Xiao · Thu Nguyen-Phuoc · Numair Khan · Cheng Zhang · Manmohan Chandraker · Carl Marshall · Zhao Dong · Zhengqin Li, ,,https://huggingface.co/papers/2401.09416,,,,,nan +RAVE: Randomized Noise Shuffling for Fast and Consistent Video Editing with Diffusion Models,Ozgur Kara · Bariscan Kurtkaya · Hidir Yesiltepe · James Rehg · Pinar Yanardag,https://rave-video.github.io/,https://arxiv.org/abs/2312.04524,,2312.04524.pdf,RAVE: Randomized Noise Shuffling for Fast and Consistent Video Editing with Diffusion Models,"Recent advancements in diffusion-based models have demonstrated significant +success in generating images from text. However, video editing models have not +yet reached the same level of visual quality and user control. To address this, +we introduce RAVE, a zero-shot video editing method that leverages pre-trained +text-to-image diffusion models without additional training. RAVE takes an input +video and a text prompt to produce high-quality videos while preserving the +original motion and semantic structure. It employs a novel noise shuffling +strategy, leveraging spatio-temporal interactions between frames, to produce +temporally consistent videos faster than existing methods. It is also efficient +in terms of memory requirements, allowing it to handle longer videos. RAVE is +capable of a wide range of edits, from local attribute modifications to shape +transformations. In order to demonstrate the versatility of RAVE, we create a +comprehensive video evaluation dataset ranging from object-focused scenes to +complex human activities like dancing and typing, and dynamic scenes featuring +swimming fish and boats. Our qualitative and quantitative experiments highlight +the effectiveness of RAVE in diverse video editing scenarios compared to +existing methods. 
Our code, dataset and videos can be found in +https://rave-video.github.io.",cs.CV,['cs.CV'] +CosmicMan: A Text-to-Image Foundation Model for Humans,Shikai Li · Jianglin Fu · Kaiyuan Liu · Wentao Wang · Kwan-Yee Lin · Wayne Wu, ,http://export.arxiv.org/abs/2404.01294,,2404.01294.pdf,CosmicMan: A Text-to-Image Foundation Model for Humans,"We present CosmicMan, a text-to-image foundation model specialized for +generating high-fidelity human images. Unlike current general-purpose +foundation models that are stuck in the dilemma of inferior quality and +text-image misalignment for humans, CosmicMan enables generating +photo-realistic human images with meticulous appearance, reasonable structure, +and precise text-image alignment with detailed dense descriptions. At the heart +of CosmicMan's success are the new reflections and perspectives on data and +models: (1) We found that data quality and a scalable data production flow are +essential for the final results from trained models. Hence, we propose a new +data production paradigm, Annotate Anyone, which serves as a perpetual data +flywheel to produce high-quality data with accurate yet cost-effective +annotations over time. Based on this, we constructed a large-scale dataset, +CosmicMan-HQ 1.0, with 6 Million high-quality real-world human images in a mean +resolution of 1488x1255, and attached with precise text annotations deriving +from 115 Million attributes in diverse granularities. (2) We argue that a +text-to-image foundation model specialized for humans must be pragmatic -- easy +to integrate into down-streaming tasks while effective in producing +high-quality human images. Hence, we propose to model the relationship between +dense text descriptions and image pixels in a decomposed manner, and present +Decomposed-Attention-Refocusing (Daring) training framework. It seamlessly +decomposes the cross-attention features in existing text-to-image diffusion +model, and enforces attention refocusing without adding extra modules. Through +Daring, we show that explicitly discretizing continuous text space into several +basic groups that align with human body structure is the key to tackling the +misalignment problem in a breeze.",cs.CV,['cs.CV'] +Think Twice Before Selection: Federated Evidential Active Learning for Medical Image Analysis with Domain Shifts,Jiayi Chen · Benteng Ma · Hengfei Cui · Kwang-Ting Cheng · Yong Xia, ,https://arxiv.org/abs/2312.02567,,2312.02567.pdf,Think Twice Before Selection: Federated Evidential Active Learning for Medical Image Analysis with Domain Shifts,"Federated learning facilitates the collaborative learning of a global model +across multiple distributed medical institutions without centralizing data. +Nevertheless, the expensive cost of annotation on local clients remains an +obstacle to effectively utilizing local data. To mitigate this issue, federated +active learning methods suggest leveraging local and global model predictions +to select a relatively small amount of informative local data for annotation. +However, existing methods mainly focus on all local data sampled from the same +domain, making them unreliable in realistic medical scenarios with domain +shifts among different clients. In this paper, we make the first attempt to +assess the informativeness of local data derived from diverse domains and +propose a novel methodology termed Federated Evidential Active Learning (FEAL) +to calibrate the data evaluation under domain shift. 
Specifically, we introduce +a Dirichlet prior distribution in both local and global models to treat the +prediction as a distribution over the probability simplex and capture both +aleatoric and epistemic uncertainties by using the Dirichlet-based evidential +model. Then we employ the epistemic uncertainty to calibrate the aleatoric +uncertainty. Afterward, we design a diversity relaxation strategy to reduce +data redundancy and maintain data diversity. Extensive experiments and analysis +on five real multi-center medical image datasets demonstrate the superiority of +FEAL over the state-of-the-art active learning methods in federated scenarios +with domain shifts. The code will be available at +https://github.com/JiayiChen815/FEAL.",cs.CV,['cs.CV'] +Riemannian Multinomial Logistics Regression for SPD Neural Networks,Ziheng Chen · Yue Song · Gaowen Liu · Ramana Kompella · Xiaojun Wu · Nicu Sebe,https://github.com/GitZH-Chen/SPDMLR.git,,https://openreview.net/forum?id=S0DUtGgkTM,,,,,nan +Visual Delta Generator with Large Multi-modal Models for Semi-supervised Composed Image Retrieval,Young Kyun Jang · Donghyun Kim · Zihang Meng · Dat Huynh · Ser-Nam Lim,https://youngkyunjang.github.io/VDG_project/,https://arxiv.org/abs/2404.15516,,2404.15516.pdf,Visual Delta Generator with Large Multi-modal Models for Semi-supervised Composed Image Retrieval,"Composed Image Retrieval (CIR) is a task that retrieves images similar to a +query, based on a provided textual modification. Current techniques rely on +supervised learning for CIR models using labeled triplets of the reference +image, text, target image. These specific triplets are not as commonly +available as simple image-text pairs, limiting the widespread use of CIR and +its scalability. On the other hand, zero-shot CIR can be relatively easily +trained with image-caption pairs without considering the image-to-image +relation, but this approach tends to yield lower accuracy. We propose a new +semi-supervised CIR approach where we search for a reference and its related +target images in auxiliary data and learn our large language model-based Visual +Delta Generator (VDG) to generate text describing the visual difference (i.e., +visual delta) between the two. VDG, equipped with fluent language knowledge and +being model agnostic, can generate pseudo triplets to boost the performance of +CIR models. Our approach significantly improves the existing supervised +learning approaches and achieves state-of-the-art results on the CIR +benchmarks.",cs.CV,"['cs.CV', 'cs.AI']" +LiDAR-based Person Re-identification,Wenxuan Guo · Zhiyu Pan · Yingping Liang · Ziheng Xi · Zhi Chen Zhong · Jianjiang Feng · Jie Zhou,https://github.com/GWxuan/ReID3D,https://arxiv.org/abs/2312.03033,,2312.03033.pdf,LiDAR-based Person Re-identification,"Camera-based person re-identification (ReID) systems have been widely applied +in the field of public security. However, cameras often lack the perception of +3D morphological information of human and are susceptible to various +limitations, such as inadequate illumination, complex background, and personal +privacy. In this paper, we propose a LiDAR-based ReID framework, ReID3D, that +utilizes pre-training strategy to retrieve features of 3D body shape and +introduces Graph-based Complementary Enhancement Encoder for extracting +comprehensive features. Due to the lack of LiDAR datasets, we build LReID, the +first LiDAR-based person ReID dataset, which is collected in several outdoor +scenes with variations in natural conditions. 
Additionally, we introduce +LReID-sync, a simulated pedestrian dataset designed for pre-training encoders +with tasks of point cloud completion and shape parameter learning. Extensive +experiments on LReID show that ReID3D achieves exceptional performance with a +rank-1 accuracy of 94.0, highlighting the significant potential of LiDAR in +addressing person ReID tasks. To the best of our knowledge, we are the first to +propose a solution for LiDAR-based ReID. The code and datasets will be released +soon.",cs.CV,['cs.CV'] +Multi-scale Dynamic and Hierarchical Relationship Modeling for Facial Action Units Recognition,Zihan Wang · Siyang Song · Cheng Luo · Songhe Deng · Weicheng Xie · Linlin Shen,https://github.com/CVI-SZU/MDHR,https://arxiv.org/abs/2404.06443,,2404.06443.pdf,Multi-scale Dynamic and Hierarchical Relationship Modeling for Facial Action Units Recognition,"Human facial action units (AUs) are mutually related in a hierarchical +manner, as not only they are associated with each other in both spatial and +temporal domains but also AUs located in the same/close facial regions show +stronger relationships than those of different facial regions. While none of +existing approach thoroughly model such hierarchical inter-dependencies among +AUs, this paper proposes to comprehensively model multi-scale AU-related +dynamic and hierarchical spatio-temporal relationship among AUs for their +occurrences recognition. Specifically, we first propose a novel multi-scale +temporal differencing network with an adaptive weighting block to explicitly +capture facial dynamics across frames at different spatial scales, which +specifically considers the heterogeneity of range and magnitude in different +AUs' activation. Then, a two-stage strategy is introduced to hierarchically +model the relationship among AUs based on their spatial distribution (i.e., +local and cross-region AU relationship modelling). Experimental results +achieved on BP4D and DISFA show that our approach is the new state-of-the-art +in the field of AU occurrence recognition. Our code is publicly available at +https://github.com/CVI-SZU/MDHR.",cs.CV,['cs.CV'] +Adapt or Perish: Adaptive Sparse Transformer with Attentive Feature Refinement for Image Restoration,Shihao Zhou · Duosheng Chen · Jinshan Pan · Jinglei Shi · Jufeng Yang,https://github.com/joshyZhou/AST,https://arxiv.org/abs/2312.06874,,2312.06874.pdf,Dozerformer: Sequence Adaptive Sparse Transformer for Multivariate Time Series Forecasting,"Transformers have achieved remarkable performance in multivariate time +series(MTS) forecasting due to their capability to capture long-term +dependencies. However, the canonical attention mechanism has two key +limitations: (1) its quadratic time complexity limits the sequence length, and +(2) it generates future values from the entire historical sequence. To address +this, we propose a Dozer Attention mechanism consisting of three sparse +components: (1) Local, each query exclusively attends to keys within a +localized window of neighboring time steps. (2) Stride, enables each query to +attend to keys at predefined intervals. (3) Vary, allows queries to selectively +attend to keys from a subset of the historical sequence. Notably, the size of +this subset dynamically expands as forecasting horizons extend. Those three +components are designed to capture essential attributes of MTS data, including +locality, seasonality, and global temporal dependencies. 
Additionally, we +present the Dozerformer Framework, incorporating the Dozer Attention mechanism +for the MTS forecasting task. We evaluated the proposed Dozerformer framework +with recent state-of-the-art methods on nine benchmark datasets and confirmed +its superior performance. The code will be released after the manuscript is +accepted.",cs.LG,"['cs.LG', 'cs.CL']" +Circuit Design and Efficient Simulation of Quantum Inner Product and Empirical Studies of Its Effect on Near-Term Hybrid Quantum-Classic Machine Learning,Hao Xiong · Yehui Tang · Xinyu Ye · Junchi Yan,https://github.com/ShawXh/qip_cvpr24,https://arxiv.org/abs/2310.03978,,2310.03978.pdf,Efficient Quantum Circuit Simulation by Tensor Network Methods on Modern GPUs,"Efficient simulation of quantum circuits has become indispensable with the +rapid development of quantum hardware. The primary simulation methods are based +on state vectors and tensor networks. As the number of qubits and quantum gates +grows larger in current quantum devices, traditional state-vector based quantum +circuit simulation methods prove inadequate due to the overwhelming size of the +Hilbert space and extensive entanglement. Consequently, brutal force tensor +network simulation algorithms become the only viable solution in such +scenarios. The two main challenges faced in tensor network simulation +algorithms are optimal contraction path finding and efficient execution on +modern computing devices, with the latter determines the actual efficiency. In +this study, we investigate the optimization of such tensor network simulations +on modern GPUs and propose general optimization strategies from two aspects: +computational efficiency and accuracy. Firstly, we propose to transform +critical Einstein summation operations into GEMM operations, leveraging the +specific features of tensor network simulations to amplify the efficiency of +GPUs. Secondly, by analyzing the data characteristics of quantum circuits, we +employ extended precision to ensure the accuracy of simulation results and +mixed precision to fully exploit the potential of GPUs, resulting in faster and +more precise simulations. Our numerical experiments demonstrate that our +approach can achieve a 3.96x reduction in verification time for random quantum +circuit samples in the 18-cycle case of Sycamore, with sustained performance +exceeding 21 TFLOPS on one A100. This method can be easily extended to the +20-cycle case, maintaining the same performance, accelerating by 12.5x compared +to the state-of-the-art CPU-based results and 4.48-6.78x compared to the +state-of-the-art GPU-based results reported in the literature.",quant-ph,"['quant-ph', 'cs.DC', 'physics.comp-ph']" +Image Sculpting: Precise Object Editing with 3D Geometry Control,Jiraphon Yenphraphai · Xichen Pan · Sainan Liu · Daniele Panozzo · Saining Xie,https://image-sculpting.github.io/,https://arxiv.org/abs/2401.01702,,2401.01702.pdf,Image Sculpting: Precise Object Editing with 3D Geometry Control,"We present Image Sculpting, a new framework for editing 2D images by +incorporating tools from 3D geometry and graphics. This approach differs +markedly from existing methods, which are confined to 2D spaces and typically +rely on textual instructions, leading to ambiguity and limited control. Image +Sculpting converts 2D objects into 3D, enabling direct interaction with their +3D geometry. 
Post-editing, these objects are re-rendered into 2D, merging into +the original image to produce high-fidelity results through a coarse-to-fine +enhancement process. The framework supports precise, quantifiable, and +physically-plausible editing options such as pose editing, rotation, +translation, 3D composition, carving, and serial addition. It marks an initial +step towards combining the creative freedom of generative models with the +precision of graphics pipelines.",cs.GR,"['cs.GR', 'cs.CV']" +Test-Time Domain Generalization for Face Anti-Spoofing,Qianyu Zhou · Ke-Yue Zhang · Taiping Yao · Xuequan Lu · Shouhong Ding · Lizhuang Ma, ,https://arxiv.org/abs/2403.19334,,2403.19334.pdf,Test-Time Domain Generalization for Face Anti-Spoofing,"Face Anti-Spoofing (FAS) is pivotal in safeguarding facial recognition +systems against presentation attacks. While domain generalization (DG) methods +have been developed to enhance FAS performance, they predominantly focus on +learning domain-invariant features during training, which may not guarantee +generalizability to unseen data that differs largely from the source +distributions. Our insight is that testing data can serve as a valuable +resource to enhance the generalizability beyond mere evaluation for DG FAS. In +this paper, we introduce a novel Test-Time Domain Generalization (TTDG) +framework for FAS, which leverages the testing data to boost the model's +generalizability. Our method, consisting of Test-Time Style Projection (TTSP) +and Diverse Style Shifts Simulation (DSSS), effectively projects the unseen +data to the seen domain space. In particular, we first introduce the innovative +TTSP to project the styles of the arbitrarily unseen samples of the testing +distribution to the known source space of the training distributions. We then +design the efficient DSSS to synthesize diverse style shifts via learnable +style bases with two specifically designed losses in a hyperspherical feature +space. Our method eliminates the need for model updates at the test time and +can be seamlessly integrated into not only the CNN but also ViT backbones. +Comprehensive experiments on widely used cross-domain FAS benchmarks +demonstrate our method's state-of-the-art performance and effectiveness.",cs.CV,['cs.CV'] +Towards Learning a Generalist Model for Embodied Navigation,Duo Zheng · Shijia Huang · Lin Zhao · Yiwu Zhong · Liwei Wang, ,https://arxiv.org/abs/2312.02010,,2312.02010.pdf,Towards Learning a Generalist Model for Embodied Navigation,"Building a generalist agent that can interact with the world is the +intriguing target of AI systems, thus spurring the research for embodied +navigation, where an agent is required to navigate according to instructions or +respond to queries. Despite the major progress attained, previous works +primarily focus on task-specific agents and lack generalizability to unseen +scenarios. Recently, LLMs have presented remarkable capabilities across various +fields, and provided a promising opportunity for embodied navigation. Drawing +on this, we propose the first generalist model for embodied navigation, +NaviLLM. It adapts LLMs to embodied navigation by introducing schema-based +instruction. The schema-based instruction flexibly casts various tasks into +generation problems, thereby unifying a wide range of tasks. This approach +allows us to integrate diverse data sources from various datasets into the +training, equipping NaviLLM with a wide range of capabilities required by +embodied navigation. 
We conduct extensive experiments to evaluate the +performance and generalizability of our model. The experimental results +demonstrate that our unified model achieves state-of-the-art performance on +CVDN, SOON, and ScanQA. Specifically, it surpasses the previous +stats-of-the-art method by a significant margin of 29% in goal progress on +CVDN. Moreover, our model also demonstrates strong generalizability and +presents impressive results on unseen tasks, e.g., embodied question answering +and 3D captioning.",cs.CV,"['cs.CV', 'cs.AI']" +Integrating Efficient Optimal Transport and Functional Maps For Unsupervised Shape Correspondence Learning,Tung Le · Khai Nguyen · Shanlin Sun · Nhat Ho · Xiaohui Xie, ,https://arxiv.org/abs/2403.01781v1,,2403.01781v1.pdf,Integrating Efficient Optimal Transport and Functional Maps For Unsupervised Shape Correspondence Learning,"In the realm of computer vision and graphics, accurately establishing +correspondences between geometric 3D shapes is pivotal for applications like +object tracking, registration, texture transfer, and statistical shape +analysis. Moving beyond traditional hand-crafted and data-driven feature +learning methods, we incorporate spectral methods with deep learning, focusing +on functional maps (FMs) and optimal transport (OT). Traditional OT-based +approaches, often reliant on entropy regularization OT in learning-based +framework, face computational challenges due to their quadratic cost. Our key +contribution is to employ the sliced Wasserstein distance (SWD) for OT, which +is a valid fast optimal transport metric in an unsupervised shape matching +framework. This unsupervised framework integrates functional map regularizers +with a novel OT-based loss derived from SWD, enhancing feature alignment +between shapes treated as discrete probability measures. We also introduce an +adaptive refinement process utilizing entropy regularized OT, further refining +feature alignments for accurate point-to-point correspondences. Our method +demonstrates superior performance in non-rigid shape matching, including +near-isometric and non-isometric scenarios, and excels in downstream tasks like +segmentation transfer. The empirical results on diverse datasets highlight our +framework's effectiveness and generalization capabilities, setting new +standards in non-rigid shape matching with efficient OT metrics and an adaptive +refinement module.",cs.CV,"['cs.CV', 'cs.AI']" +NeRSP: Neural 3D Reconstruction for Reflective Objects with Sparse Polarized Images,Yufei Han · Heng Guo · Koki Fukai · Hiroaki Santo · Boxin Shi · Fumio Okura · Zhanyu Ma · Yunpeng Jia, ,,,,,,,nan +Towards Transferable Targeted 3D Adversarial Attack in the Physical World,Yao Huang · Yinpeng Dong · Shouwei Ruan · Xiao Yang · Hang Su · Xingxing Wei, ,https://arxiv.org/abs/2312.09558,,2312.09558.pdf,Towards Transferable Targeted 3D Adversarial Attack in the Physical World,"Compared with transferable untargeted attacks, transferable targeted +adversarial attacks could specify the misclassification categories of +adversarial samples, posing a greater threat to security-critical tasks. In the +meanwhile, 3D adversarial samples, due to their potential of multi-view +robustness, can more comprehensively identify weaknesses in existing deep +learning systems, possessing great application value. However, the field of +transferable targeted 3D adversarial attacks remains vacant. 
The goal of this +work is to develop a more effective technique that could generate transferable +targeted 3D adversarial examples, filling the gap in this field. To achieve +this goal, we design a novel framework named TT3D that could rapidly +reconstruct from few multi-view images into Transferable Targeted 3D textured +meshes. While existing mesh-based texture optimization methods compute +gradients in the high-dimensional mesh space and easily fall into local optima, +leading to unsatisfactory transferability and distinct distortions, TT3D +innovatively performs dual optimization towards both feature grid and +Multi-layer Perceptron (MLP) parameters in the grid-based NeRF space, which +significantly enhances black-box transferability while enjoying naturalness. +Experimental results show that TT3D not only exhibits superior cross-model +transferability but also maintains considerable adaptability across different +renders and vision tasks. More importantly, we produce 3D adversarial examples +with 3D printing techniques in the real world and verify their robust +performance under various scenarios.",cs.CV,['cs.CV'] +Matching 2D Images in 3D: Metric Relative Pose from Metric Correspondences,Axel Barroso-Laguna · Sowmya Munukutla · Victor Adrian Prisacariu · Eric Brachmann,https://nianticlabs.github.io/mickey/,https://arxiv.org/abs/2404.06337,,2404.06337.pdf,Matching 2D Images in 3D: Metric Relative Pose from Metric Correspondences,"Given two images, we can estimate the relative camera pose between them by +establishing image-to-image correspondences. Usually, correspondences are +2D-to-2D and the pose we estimate is defined only up to scale. Some +applications, aiming at instant augmented reality anywhere, require +scale-metric pose estimates, and hence, they rely on external depth estimators +to recover the scale. We present MicKey, a keypoint matching pipeline that is +able to predict metric correspondences in 3D camera space. By learning to match +3D coordinates across images, we are able to infer the metric relative pose +without depth measurements. Depth measurements are also not required for +training, nor are scene reconstructions or image overlap information. MicKey is +supervised only by pairs of images and their relative poses. MicKey achieves +state-of-the-art performance on the Map-Free Relocalisation benchmark while +requiring less supervision than competing approaches.",cs.CV,['cs.CV'] +3D LiDAR Mapping in Dynamic Environments using a 4D Implicit Neural Representation,Xingguang Zhong · Yue Pan · Cyrill Stachniss · Jens Behley,https://github.com/PRBonn/4dNDF,http://export.arxiv.org/abs/2405.03388,,2405.03388.pdf,3D LiDAR Mapping in Dynamic Environments Using a 4D Implicit Neural Representation,"Building accurate maps is a key building block to enable reliable +localization, planning, and navigation of autonomous vehicles. We propose a +novel approach for building accurate maps of dynamic environments utilizing a +sequence of LiDAR scans. To this end, we propose encoding the 4D scene into a +novel spatio-temporal implicit neural map representation by fitting a +time-dependent truncated signed distance function to each point. Using our +representation, we extract the static map by filtering the dynamic parts. Our +neural representation is based on sparse feature grids, a globally shared +decoder, and time-dependent basis functions, which we jointly optimize in an +unsupervised fashion. 
To learn this representation from a sequence of LiDAR +scans, we design a simple yet efficient loss function to supervise the map +optimization in a piecewise way. We evaluate our approach on various scenes +containing moving objects in terms of the reconstruction quality of static maps +and the segmentation of dynamic point clouds. The experimental results +demonstrate that our method is capable of removing the dynamic part of the +input point clouds while reconstructing accurate and complete 3D maps, +outperforming several state-of-the-art methods. Codes are available at: +https://github.com/PRBonn/4dNDF",cs.CV,"['cs.CV', 'cs.RO']" +Robust Emotion Recognition in Context Debiasing,Dingkang Yang · Kun Yang · Mingcheng Li · Shunli Wang · Shuaibing Wang · Lihua Zhang, ,https://arxiv.org/abs/2403.05963,,2403.05963.pdf,Robust Emotion Recognition in Context Debiasing,"Context-aware emotion recognition (CAER) has recently boosted the practical +applications of affective computing techniques in unconstrained environments. +Mainstream CAER methods invariably extract ensemble representations from +diverse contexts and subject-centred characteristics to perceive the target +person's emotional state. Despite advancements, the biggest challenge remains +due to context bias interference. The harmful bias forces the models to rely on +spurious correlations between background contexts and emotion labels in +likelihood estimation, causing severe performance bottlenecks and confounding +valuable context priors. In this paper, we propose a counterfactual emotion +inference (CLEF) framework to address the above issue. Specifically, we first +formulate a generalized causal graph to decouple the causal relationships among +the variables in CAER. Following the causal graph, CLEF introduces a +non-invasive context branch to capture the adverse direct effect caused by the +context bias. During the inference, we eliminate the direct context effect from +the total causal effect by comparing factual and counterfactual outcomes, +resulting in bias mitigation and robust prediction. As a model-agnostic +framework, CLEF can be readily integrated into existing methods, bringing +consistent performance gains.",cs.CV,"['cs.CV', 'cs.LG']" +Learning to Produce Semi-dense Correspondences for Visual Localization,Khang Truong Giang · Soohwan Song · Sungho Jo,https://github.com/TruongKhang/DeViLoc,https://arxiv.org/abs/2402.08359,,2402.08359.pdf,Learning to Produce Semi-dense Correspondences for Visual Localization,"This study addresses the challenge of performing visual localization in +demanding conditions such as night-time scenarios, adverse weather, and +seasonal changes. While many prior studies have focused on improving +image-matching performance to facilitate reliable dense keypoint matching +between images, existing methods often heavily rely on predefined feature +points on a reconstructed 3D model. Consequently, they tend to overlook +unobserved keypoints during the matching process. Therefore, dense keypoint +matches are not fully exploited, leading to a notable reduction in accuracy, +particularly in noisy scenes. To tackle this issue, we propose a novel +localization method that extracts reliable semi-dense 2D-3D matching points +based on dense keypoint matches. This approach involves regressing semi-dense +2D keypoints into 3D scene coordinates using a point inference network. The +network utilizes both geometric and visual cues to effectively infer 3D +coordinates for unobserved keypoints from the observed ones. 
The abundance of +matching information significantly enhances the accuracy of camera pose +estimation, even in scenarios involving noisy or sparse 3D models. +Comprehensive evaluations demonstrate that the proposed method outperforms +other methods in challenging scenes and achieves competitive results in +large-scale visual localization benchmarks. The code will be available.",cs.CV,['cs.CV'] +Distilling CLIP with Dual Guidance for Learning Discriminative Human Body Shape Representation,Feng Liu · Minchul Kim · Zhiyuan Ren · Xiaoming Liu, ,https://arxiv.org/abs/2307.12732,,,CLIP-KD: An Empirical Study of CLIP Model Distillation,"Contrastive Language-Image Pre-training (CLIP) has become a promising +language-supervised visual pre-training framework. This paper aims to distill +small CLIP models supervised by a large teacher CLIP model. We propose several +distillation strategies, including relation, feature, gradient and contrastive +paradigms, to examine the effectiveness of CLIP-Knowledge Distillation (KD). We +show that a simple feature mimicry with Mean Squared Error loss works +surprisingly well. Moreover, interactive contrastive learning across teacher +and student encoders is also effective in performance improvement. We explain +that the success of CLIP-KD can be attributed to maximizing the feature +similarity between teacher and student. The unified method is applied to +distill several student models trained on CC3M+12M. CLIP-KD improves student +CLIP models consistently over zero-shot ImageNet classification and cross-modal +retrieval benchmarks. When using ViT-L/14 pretrained on Laion-400M as the +teacher, CLIP-KD achieves 57.5\% and 55.4\% zero-shot top-1 ImageNet accuracy +over ViT-B/16 and ResNet-50, surpassing the original CLIP without KD by 20.5\% +and 20.1\% margins, respectively. Our code is released on +https://github.com/winycg/CLIP-KD.",cs.CV,['cs.CV'] +From Feature to Gaze: A Generalizable Replacement of Linear Layer for Gaze Estimation,Yiwei Bao · Feng Lu, ,https://arxiv.org/abs/2309.02165,,2309.02165.pdf,PCFGaze: Physics-Consistent Feature for Appearance-based Gaze Estimation,"Although recent deep learning based gaze estimation approaches have achieved +much improvement, we still know little about how gaze features are connected to +the physics of gaze. In this paper, we try to answer this question by analyzing +the gaze feature manifold. Our analysis revealed the insight that the geodesic +distance between gaze features is consistent with the gaze differences between +samples. According to this finding, we construct the Physics- Consistent +Feature (PCF) in an analytical way, which connects gaze feature to the physical +definition of gaze. We further propose the PCFGaze framework that directly +optimizes gaze feature space by the guidance of PCF. Experimental results +demonstrate that the proposed framework alleviates the overfitting problem and +significantly improves cross-domain gaze estimation accuracy without extra +training data. 
The insight of gaze feature has the potential to benefit other +regression tasks with physical meanings.",cs.CV,['cs.CV'] +Building a Strong Pre-Training Baseline for Universal 3D Large-Scale Perception,Haoming Chen · Zhizhong Zhang · Yanyun Qu · Ruixin Zhang · Xin Tan · Yuan Xie, ,https://arxiv.org/abs/2405.07201,,2405.07201.pdf,Building a Strong Pre-Training Baseline for Universal 3D Large-Scale Perception,"An effective pre-training framework with universal 3D representations is +extremely desired in perceiving large-scale dynamic scenes. However, +establishing such an ideal framework that is both task-generic and +label-efficient poses a challenge in unifying the representation of the same +primitive across diverse scenes. The current contrastive 3D pre-training +methods typically follow a frame-level consistency, which focuses on the 2D-3D +relationships in each detached image. Such inconsiderate consistency greatly +hampers the promising path of reaching an universal pre-training framework: (1) +The cross-scene semantic self-conflict, i.e., the intense collision between +primitive segments of the same semantics from different scenes; (2) Lacking a +globally unified bond that pushes the cross-scene semantic consistency into 3D +representation learning. To address above challenges, we propose a CSC +framework that puts a scene-level semantic consistency in the heart, bridging +the connection of the similar semantic segments across various scenes. To +achieve this goal, we combine the coherent semantic cues provided by the vision +foundation model and the knowledge-rich cross-scene prototypes derived from the +complementary multi-modality information. These allow us to train a universal +3D pre-training model that facilitates various downstream tasks with less +fine-tuning efforts. Empirically, we achieve consistent improvements over SOTA +pre-training approaches in semantic segmentation (+1.4% mIoU), object detection +(+1.0% mAP), and panoptic segmentation (+3.0% PQ) using their task-specific 3D +network on nuScenes. Code is released at https://github.com/chenhaomingbob/CSC, +hoping to inspire future research.",cs.CV,['cs.CV'] +Spectral Meets Spatial: Harmonising 3D Shape Matching and Interpolation,Dongliang Cao · Marvin Eisenberger · Nafie El Amrani · Daniel Cremers · Florian Bernard, ,https://web3.arxiv.org/abs/2402.18920,,2402.18920.pdf,Spectral Meets Spatial: Harmonising 3D Shape Matching and Interpolation,"Although 3D shape matching and interpolation are highly interrelated, they +are often studied separately and applied sequentially to relate different 3D +shapes, thus resulting in sub-optimal performance. In this work we present a +unified framework to predict both point-wise correspondences and shape +interpolation between 3D shapes. To this end, we combine the deep functional +map framework with classical surface deformation models to map shapes in both +spectral and spatial domains. On the one hand, by incorporating spatial maps, +our method obtains more accurate and smooth point-wise correspondences compared +to previous functional map methods for shape matching. On the other hand, by +introducing spectral maps, our method gets rid of commonly used but +computationally expensive geodesic distance constraints that are only valid for +near-isometric shape deformations. Furthermore, we propose a novel test-time +adaptation scheme to capture both pose-dominant and shape-dominant +deformations. 
Using different challenging datasets, we demonstrate that our +method outperforms previous state-of-the-art methods for both shape matching +and interpolation, even compared to supervised approaches.",cs.CV,"['cs.CV', 'cs.AI', 'cs.CG']" +SinSR: Diffusion-Based Image Super-Resolution in a Single Step,Yufei Wang · Wenhan Yang · Xinyuan Chen · Yaohui Wang · Lanqing Guo · Lap-Pui Chau · Ziwei Liu · Yu Qiao · Alex C. Kot · Bihan Wen, ,https://arxiv.org/abs/2311.14760,,2311.14760.pdf,SinSR: Diffusion-Based Image Super-Resolution in a Single Step,"While super-resolution (SR) methods based on diffusion models exhibit +promising results, their practical application is hindered by the substantial +number of required inference steps. Recent methods utilize degraded images in +the initial state, thereby shortening the Markov chain. Nevertheless, these +solutions either rely on a precise formulation of the degradation process or +still necessitate a relatively lengthy generation path (e.g., 15 iterations). +To enhance inference speed, we propose a simple yet effective method for +achieving single-step SR generation, named SinSR. Specifically, we first derive +a deterministic sampling process from the most recent state-of-the-art (SOTA) +method for accelerating diffusion-based SR. This allows the mapping between the +input random noise and the generated high-resolution image to be obtained in a +reduced and acceptable number of inference steps during training. We show that +this deterministic mapping can be distilled into a student model that performs +SR within only one inference step. Additionally, we propose a novel +consistency-preserving loss to simultaneously leverage the ground-truth image +during the distillation process, ensuring that the performance of the student +model is not solely bound by the feature manifold of the teacher model, +resulting in further performance improvement. Extensive experiments conducted +on synthetic and real-world datasets demonstrate that the proposed method can +achieve comparable or even superior performance compared to both previous SOTA +methods and the teacher model, in just one sampling step, resulting in a +remarkable up to x10 speedup for inference. Our code will be released at +https://github.com/wyf0912/SinSR",cs.CV,['cs.CV'] +SRTube: Video-Language Pre-Training with Action-Centric Video Tube Features and Semantic Role Labeling,Juhee Lee · Jewon Kang, ,https://arxiv.org/abs/2402.03161,,2402.03161.pdf,Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization,"In light of recent advances in multimodal Large Language Models (LLMs), there +is increasing attention to scaling them from image-text data to more +informative real-world videos. Compared to static images, video poses unique +challenges for effective large-scale pre-training due to the modeling of its +spatiotemporal dynamics. In this paper, we address such limitations in +video-language pre-training with an efficient video decomposition that +represents each video as keyframes and temporal motions. These are then adapted +to an LLM using well-designed tokenizers that discretize visual and temporal +information as a few tokens, thus enabling unified generative pre-training of +videos, images, and text. At inference, the generated tokens from the LLM are +carefully recovered to the original continuous pixel space to create various +video content. 
Our proposed framework is both capable of comprehending and +generating image and video content, as demonstrated by its competitive +performance across 13 multimodal benchmarks in image and video understanding +and generation. Our code and models are available at +https://video-lavit.github.io.",cs.CV,"['cs.CV', 'cs.CL']" +Quantifying Task Priority for Multi-Task Optimization,Wooseong Jeong · Kuk-Jin Yoon, ,https://arxiv.org/abs/2403.16162,,2403.16162.pdf,Multi-Task Learning with Multi-Task Optimization,"Multi-task learning solves multiple correlated tasks. However, conflicts may +exist between them. In such circumstances, a single solution can rarely +optimize all the tasks, leading to performance trade-offs. To arrive at a set +of optimized yet well-distributed models that collectively embody different +trade-offs in one algorithmic pass, this paper proposes to view Pareto +multi-task learning through the lens of multi-task optimization. Multi-task +learning is first cast as a multi-objective optimization problem, which is then +decomposed into a diverse set of unconstrained scalar-valued subproblems. These +subproblems are solved jointly using a novel multi-task gradient descent +method, whose uniqueness lies in the iterative transfer of model parameters +among the subproblems during the course of optimization. A theorem proving +faster convergence through the inclusion of such transfers is presented. We +investigate the proposed multi-task learning with multi-task optimization for +solving various problem settings including image classification, scene +understanding, and multi-target regression. Comprehensive experiments confirm +that the proposed method significantly advances the state-of-the-art in +discovering sets of Pareto-optimized models. Notably, on the large image +dataset we tested on, namely NYUv2, the hypervolume convergence achieved by our +method was found to be nearly two times faster than the next-best among the +state-of-the-art.",cs.AI,['cs.AI'] +Contrasting intra-modal and ranking cross-modal hard negatives to enhance visio-linguistic compositional understanding,Le Zhang · Rabiul Awal · Aishwarya Agrawal,https://github.com/lezhang7/Enhance-FineGrained,https://arxiv.org/abs/2306.08832,,2306.08832.pdf,Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Compositional Understanding,"Vision-Language Models (VLMs), such as CLIP, exhibit strong image-text +comprehension abilities, facilitating advances in several downstream tasks such +as zero-shot image classification, image-text retrieval, and text-to-image +generation. However, the compositional reasoning abilities of existing VLMs +remains subpar. The root of this limitation lies in the inadequate alignment +between the images and captions in the pretraining datasets. Additionally, the +current contrastive learning objective fails to focus on fine-grained grounding +components like relations, actions, and attributes, resulting in ""bag-of-words"" +representations. We introduce a simple and effective method to improve +compositional reasoning in VLMs. Our method better leverages available datasets +by refining and expanding the standard image-text contrastive learning +framework. Our approach does not require specific annotations and does not +incur extra parameters. When integrated with CLIP, our technique yields notable +improvement over state-of-the-art baselines across five vision-language +compositional benchmarks. 
We open-source our code at +https://github.com/lezhang7/Enhance-FineGrained.",cs.CV,['cs.CV'] +CricaVPR: Cross-image Correlation-aware Representation Learning for Visual Place Recognition,Feng Lu · Xiangyuan Lan · Lijun Zhang · Dongmei Jiang · Yaowei Wang · Chun Yuan, ,https://arxiv.org/abs/2402.19231,,2402.19231.pdf,CricaVPR: Cross-image Correlation-aware Representation Learning for Visual Place Recognition,"Over the past decade, most methods in visual place recognition (VPR) have +used neural networks to produce feature representations. These networks +typically produce a global representation of a place image using only this +image itself and neglect the cross-image variations (e.g. viewpoint and +illumination), which limits their robustness in challenging scenes. In this +paper, we propose a robust global representation method with cross-image +correlation awareness for VPR, named CricaVPR. Our method uses the attention +mechanism to correlate multiple images within a batch. These images can be +taken in the same place with different conditions or viewpoints, or even +captured from different places. Therefore, our method can utilize the +cross-image variations as a cue to guide the representation learning, which +ensures more robust features are produced. To further facilitate the +robustness, we propose a multi-scale convolution-enhanced adaptation method to +adapt pre-trained visual foundation models to the VPR task, which introduces +the multi-scale local information to further enhance the cross-image +correlation-aware representation. Experimental results show that our method +outperforms state-of-the-art methods by a large margin with significantly less +training time. The code is released at https://github.com/Lu-Feng/CricaVPR.",cs.CV,"['cs.CV', 'cs.RO']" +Dual Prior Unfolding for Snapshot Compressive Imaging,Jiancheng Zhang · Haijin Zeng · Jiezhang Cao · Yongyong Chen · Dengxiu Yu · Yinping Zhao,https://github.com/ZhangJC-2k/DPU,,https://link.springer.com/article/10.1007/s11263-023-01844-4,,,,,nan +Task2Box: Box Embeddings for Modeling Asymmetric Task Relationships,Rangel Daroya · Aaron Sun · Subhransu Maji,https://github.com/cvl-umass/task2box,https://arxiv.org/abs/2403.17173,,2403.17173.pdf,Task2Box: Box Embeddings for Modeling Asymmetric Task Relationships,"Modeling and visualizing relationships between tasks or datasets is an +important step towards solving various meta-tasks such as dataset discovery, +multi-tasking, and transfer learning. However, many relationships, such as +containment and transferability, are naturally asymmetric and current +approaches for representation and visualization (e.g., t-SNE) do not readily +support this. We propose Task2Box, an approach to represent tasks using box +embeddings -- axis-aligned hyperrectangles in low dimensional spaces -- that +can capture asymmetric relationships between them through volumetric overlaps. +We show that Task2Box accurately predicts unseen hierarchical relationships +between nodes in ImageNet and iNaturalist datasets, as well as transferability +between tasks in the Taskonomy benchmark. We also show that box embeddings +estimated from task representations (e.g., CLIP, Task2Vec, or attribute based) +can be used to predict relationships between unseen tasks more accurately than +classifiers trained on the same representations, as well as handcrafted +asymmetric distances (e.g., KL divergence). 
This suggests that low-dimensional +box embeddings can effectively capture these task relationships and have the +added advantage of being interpretable. We use the approach to visualize +relationships among publicly available image classification datasets on popular +dataset hosting platform called Hugging Face.",cs.CV,['cs.CV'] +Shadow Generation for Composite Image Using Diffusion Model,Qingyang Liu · Junqi You · Jian-Ting Wang · Xinhao Tao · Bo Zhang · Li Niu, ,https://arxiv.org/abs/2403.15234,,2403.15234.pdf,Shadow Generation for Composite Image Using Diffusion model,"In the realm of image composition, generating realistic shadow for the +inserted foreground remains a formidable challenge. Previous works have +developed image-to-image translation models which are trained on paired +training data. However, they are struggling to generate shadows with accurate +shapes and intensities, hindered by data scarcity and inherent task complexity. +In this paper, we resort to foundation model with rich prior knowledge of +natural shadow images. Specifically, we first adapt ControlNet to our task and +then propose intensity modulation modules to improve the shadow intensity. +Moreover, we extend the small-scale DESOBA dataset to DESOBAv2 using a novel +data acquisition pipeline. Experimental results on both DESOBA and DESOBAv2 +datasets as well as real composite images demonstrate the superior capability +of our model for shadow generation task. The dataset, code, and model are +released at https://github.com/bcmi/Object-Shadow-Generation-Dataset-DESOBAv2.",cs.CV,['cs.CV'] +NeRF On-the-go: Exploiting Uncertainty for Distractor-free NeRFs in the Wild,Weining Ren · Zihan Zhu · Boyang Sun · Jiaqi Chen · Marc Pollefeys · Songyou Peng,https://rwn17.github.io/nerf-on-the-go,https://arxiv.org/abs/2405.18715,,2405.18715.pdf,NeRF On-the-go: Exploiting Uncertainty for Distractor-free NeRFs in the Wild,"Neural Radiance Fields (NeRFs) have shown remarkable success in synthesizing +photorealistic views from multi-view images of static scenes, but face +challenges in dynamic, real-world environments with distractors like moving +objects, shadows, and lighting changes. Existing methods manage controlled +environments and low occlusion ratios but fall short in render quality, +especially under high occlusion scenarios. In this paper, we introduce NeRF +On-the-go, a simple yet effective approach that enables the robust synthesis of +novel views in complex, in-the-wild scenes from only casually captured image +sequences. Delving into uncertainty, our method not only efficiently eliminates +distractors, even when they are predominant in captures, but also achieves a +notably faster convergence speed. Through comprehensive experiments on various +scenes, our method demonstrates a significant improvement over state-of-the-art +techniques. This advancement opens new avenues for NeRF in diverse and dynamic +real-world applications.",cs.CV,['cs.CV'] +Improved Baselines with Visual Instruction Tuning,Haotian Liu · Chunyuan Li · Yuheng Li · Yong Jae Lee,https://llava-vl.github.io,https://arxiv.org/abs/2310.03744,,2310.03744.pdf,Improved Baselines with Visual Instruction Tuning,"Large multimodal models (LMM) have recently shown encouraging progress with +visual instruction tuning. In this note, we show that the fully-connected +vision-language cross-modal connector in LLaVA is surprisingly powerful and +data-efficient. 
With simple modifications to LLaVA, namely, using +CLIP-ViT-L-336px with an MLP projection and adding academic-task-oriented VQA +data with simple response formatting prompts, we establish stronger baselines +that achieve state-of-the-art across 11 benchmarks. Our final 13B checkpoint +uses merely 1.2M publicly available data, and finishes full training in ~1 day +on a single 8-A100 node. We hope this can make state-of-the-art LMM research +more accessible. Code and model will be publicly available.",cs.CV,"['cs.CV', 'cs.AI', 'cs.CL', 'cs.LG']" +ParamISP: Learned Forward and Inverse ISPs using Camera Parameters,Woohyeok Kim · Geonu Kim · Junyong Lee · Seungyong Lee · Seung-Hwan Baek · Sunghyun Cho,https://woo525.github.io/ParamISP/,https://arxiv.org/abs/2312.13313,,2312.13313.pdf,ParamISP: Learned Forward and Inverse ISPs using Camera Parameters,"RAW images are rarely shared mainly due to its excessive data size compared +to their sRGB counterparts obtained by camera ISPs. Learning the forward and +inverse processes of camera ISPs has been recently demonstrated, enabling +physically-meaningful RAW-level image processing on input sRGB images. However, +existing learning-based ISP methods fail to handle the large variations in the +ISP processes with respect to camera parameters such as ISO and exposure time, +and have limitations when used for various applications. In this paper, we +propose ParamISP, a learning-based method for forward and inverse conversion +between sRGB and RAW images, that adopts a novel neural-network module to +utilize camera parameters, which is dubbed as ParamNet. Given the camera +parameters provided in the EXIF data, ParamNet converts them into a feature +vector to control the ISP networks. Extensive experiments demonstrate that +ParamISP achieve superior RAW and sRGB reconstruction results compared to +previous methods and it can be effectively used for a variety of applications +such as deblurring dataset synthesis, raw deblurring, HDR reconstruction, and +camera-to-camera transfer.",eess.IV,"['eess.IV', 'cs.CV']" +ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts,Mu Cai · Haotian Liu · Siva Mustikovela · Gregory P. Meyer · Yuning Chai · Dennis Park · Yong Jae Lee,https://vip-llava.github.io/,https://arxiv.org/abs/2312.00784,,2312.00784.pdf,ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts,"While existing large vision-language multimodal models focus on whole image +understanding, there is a prominent gap in achieving region-specific +comprehension. Current approaches that use textual coordinates or spatial +encodings often fail to provide a user-friendly interface for visual prompting. +To address this challenge, we introduce a novel multimodal model capable of +decoding arbitrary visual prompts. This allows users to intuitively mark images +and interact with the model using natural cues like a ""red bounding box"" or +""pointed arrow"". Our simple design directly overlays visual markers onto the +RGB image, eliminating the need for complex region encodings, yet achieves +state-of-the-art performance on region-understanding tasks like Visual7W, +PointQA, and Visual Commonsense Reasoning benchmark. Furthermore, we present +ViP-Bench, a comprehensive benchmark to assess the capability of models in +understanding visual prompts across multiple dimensions, enabling future +research in this domain. 
Code, data, and model are publicly available.",cs.CV,"['cs.CV', 'cs.AI', 'cs.CL', 'cs.LG']" +Compact 3D Gaussian Representation for Radiance Field,Joo Chan Lee · Daniel Rho · Xiangyu Sun · Jong Hwan Ko · Eunbyung Park,https://maincold2.github.io/c3dgs/,https://arxiv.org/abs/2311.13681,,2311.13681.pdf,Compact 3D Gaussian Representation for Radiance Field,"Neural Radiance Fields (NeRFs) have demonstrated remarkable potential in +capturing complex 3D scenes with high fidelity. However, one persistent +challenge that hinders the widespread adoption of NeRFs is the computational +bottleneck due to the volumetric rendering. On the other hand, 3D Gaussian +splatting (3DGS) has recently emerged as an alternative representation that +leverages a 3D Gaussisan-based representation and adopts the rasterization +pipeline to render the images rather than volumetric rendering, achieving very +fast rendering speed and promising image quality. However, a significant +drawback arises as 3DGS entails a substantial number of 3D Gaussians to +maintain the high fidelity of the rendered images, which requires a large +amount of memory and storage. To address this critical issue, we place a +specific emphasis on two key objectives: reducing the number of Gaussian points +without sacrificing performance and compressing the Gaussian attributes, such +as view-dependent color and covariance. To this end, we propose a learnable +mask strategy that significantly reduces the number of Gaussians while +preserving high performance. In addition, we propose a compact but effective +representation of view-dependent color by employing a grid-based neural field +rather than relying on spherical harmonics. Finally, we learn codebooks to +compactly represent the geometric attributes of Gaussian by vector +quantization. With model compression techniques such as quantization and +entropy coding, we consistently show over 25$\times$ reduced storage and +enhanced rendering speed, while maintaining the quality of the scene +representation, compared to 3DGS. Our work provides a comprehensive framework +for 3D scene representation, achieving high performance, fast training, +compactness, and real-time rendering. Our project page is available at +https://maincold2.github.io/c3dgs/.",cs.CV,"['cs.CV', 'cs.GR']" +Masked Autoencoders for Microscopy are Scalable Learners of Cellular Biology,Oren Kraus · Kian Kenyon-Dean · Saber Saberian · Maryam Fallah · Peter McLean · Jess Leung · Vasudev Sharma · Ayla Khan · Jia Balakrishnan · Safiye Celik · Dominique Beaini · Maciej Sypetkowski · Chi Cheng · Kristen Morse · Maureen Makes · Ben Mabey · Berton Earnshaw, ,https://arxiv.org/abs/2404.10242,,2404.10242.pdf,Masked Autoencoders for Microscopy are Scalable Learners of Cellular Biology,"Featurizing microscopy images for use in biological research remains a +significant challenge, especially for large-scale experiments spanning millions +of images. This work explores the scaling properties of weakly supervised +classifiers and self-supervised masked autoencoders (MAEs) when training with +increasingly larger model backbones and microscopy datasets. Our results show +that ViT-based MAEs outperform weakly supervised classifiers on a variety of +tasks, achieving as much as a 11.5% relative improvement when recalling known +biological relationships curated from public databases. Additionally, we +develop a new channel-agnostic MAE architecture (CA-MAE) that allows for +inputting images of different numbers and orders of channels at inference time. 
+We demonstrate that CA-MAEs effectively generalize by inferring and evaluating +on a microscopy image dataset (JUMP-CP) generated under different experimental +conditions with a different channel structure than our pretraining data +(RPI-93M). Our findings motivate continued research into scaling +self-supervised learning on microscopy data in order to create powerful +foundation models of cellular biology that have the potential to catalyze +advancements in drug discovery and beyond.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +Coarse-to-Fine Latent Diffusion for Pose-Guided Person Image Synthesis,Yanzuo Lu · Manlin Zhang · Jinhua Ma · Xiaohua Xie · Jianhuang Lai,https://github.com/YanzuoLu/CFLD,https://arxiv.org/abs/2402.18078,,2402.18078.pdf,Coarse-to-Fine Latent Diffusion for Pose-Guided Person Image Synthesis,"Diffusion model is a promising approach to image generation and has been +employed for Pose-Guided Person Image Synthesis (PGPIS) with competitive +performance. While existing methods simply align the person appearance to the +target pose, they are prone to overfitting due to the lack of a high-level +semantic understanding on the source person image. In this paper, we propose a +novel Coarse-to-Fine Latent Diffusion (CFLD) method for PGPIS. In the absence +of image-caption pairs and textual prompts, we develop a novel training +paradigm purely based on images to control the generation process of a +pre-trained text-to-image diffusion model. A perception-refined decoder is +designed to progressively refine a set of learnable queries and extract +semantic understanding of person images as a coarse-grained prompt. This allows +for the decoupling of fine-grained appearance and pose information controls at +different stages, and thus circumventing the potential overfitting problem. To +generate more realistic texture details, a hybrid-granularity attention module +is proposed to encode multi-scale fine-grained appearance features as bias +terms to augment the coarse-grained prompt. Both quantitative and qualitative +experimental results on the DeepFashion benchmark demonstrate the superiority +of our method over the state of the arts for PGPIS. Code is available at +https://github.com/YanzuoLu/CFLD.",cs.CV,['cs.CV'] +SaCo Loss: Sample-wise Affinity Consistency for Vision-Language Pre-training,WU Sitong · Haoru Tan · Zhuotao Tian · Yukang Chen · Xiaojuan Qi · Jiaya Jia, ,https://arxiv.org/abs/2405.10286,,,FFF: Fixing Flawed Foundations in contrastive pre-training results in very strong Vision-Language models,"Despite noise and caption quality having been acknowledged as important +factors impacting vision-language contrastive pre-training, in this paper, we +show that the full potential of improving the training process by addressing +such issues is yet to be realized. Specifically, we firstly study and analyze +two issues affecting training: incorrect assignment of negative pairs, and low +caption quality and diversity. Then, we devise effective solutions for +addressing both problems, which essentially require training with multiple true +positive pairs. Finally, we propose training with sigmoid loss to address such +a requirement. 
We show very large gains over the current state-of-the-art for +both image recognition ($\sim +6\%$ on average over 11 datasets) and image +retrieval ($\sim +19\%$ on Flickr30k and $\sim +15\%$ on MSCOCO).",cs.CV,"['cs.CV', 'cs.AI']" +Improving Semantic Correspondence with Viewpoint-Guided Spherical Maps,Octave Mariotti · Oisin Mac Aodha · Hakan Bilen, ,https://arxiv.org/abs/2312.13216,,2312.13216.pdf,Improving Semantic Correspondence with Viewpoint-Guided Spherical Maps,"Recent progress in self-supervised representation learning has resulted in +models that are capable of extracting image features that are not only +effective at encoding image level, but also pixel-level, semantics. These +features have been shown to be effective for dense visual semantic +correspondence estimation, even outperforming fully-supervised methods. +Nevertheless, current self-supervised approaches still fail in the presence of +challenging image characteristics such as symmetries and repeated parts. To +address these limitations, we propose a new approach for semantic +correspondence estimation that supplements discriminative self-supervised +features with 3D understanding via a weak geometric spherical prior. Compared +to more involved 3D pipelines, our model only requires weak viewpoint +information, and the simplicity of our spherical representation enables us to +inject informative geometric priors into the model during training. We propose +a new evaluation metric that better accounts for repeated part and +symmetry-induced mistakes. We present results on the challenging SPair-71k +dataset, where we show that our approach demonstrates is capable of +distinguishing between symmetric views and repeated parts across many object +categories, and also demonstrate that we can generalize to unseen classes on +the AwA dataset.",cs.CV,['cs.CV'] +XFeat: Accelerated Features for Lightweight Image Matching,Guilherme Potje · Felipe Cadar · André Araujo · Renato Martins · Erickson R. Nascimento,https://verlab.dcc.ufmg.br/descriptors/xfeat_cvpr24,https://arxiv.org/abs/2404.19174,,2404.19174.pdf,XFeat: Accelerated Features for Lightweight Image Matching,"We introduce a lightweight and accurate architecture for resource-efficient +visual correspondence. Our method, dubbed XFeat (Accelerated Features), +revisits fundamental design choices in convolutional neural networks for +detecting, extracting, and matching local features. Our new model satisfies a +critical need for fast and robust algorithms suitable to resource-limited +devices. In particular, accurate image matching requires sufficiently large +image resolutions - for this reason, we keep the resolution as large as +possible while limiting the number of channels in the network. Besides, our +model is designed to offer the choice of matching at the sparse or semi-dense +levels, each of which may be more suitable for different downstream +applications, such as visual navigation and augmented reality. Our model is the +first to offer semi-dense matching efficiently, leveraging a novel match +refinement module that relies on coarse local descriptors. XFeat is versatile +and hardware-independent, surpassing current deep learning-based local features +in speed (up to 5x faster) with comparable or better accuracy, proven in pose +estimation and visual localization. We showcase it running in real-time on an +inexpensive laptop CPU without specialized hardware optimizations. 
Code and +weights are available at www.verlab.dcc.ufmg.br/descriptors/xfeat_cvpr24.",cs.CV,['cs.CV'] +Towards Realistic Scene Generation with LiDAR Diffusion Models,Haoxi Ran · Vitor Guizilini · Yue Wang,https://lidar-diffusion.github.io/,https://arxiv.org/abs/2404.00815,,2404.00815.pdf,Towards Realistic Scene Generation with LiDAR Diffusion Models,"Diffusion models (DMs) excel in photo-realistic image synthesis, but their +adaptation to LiDAR scene generation poses a substantial hurdle. This is +primarily because DMs operating in the point space struggle to preserve the +curve-like patterns and 3D geometry of LiDAR scenes, which consumes much of +their representation power. In this paper, we propose LiDAR Diffusion Models +(LiDMs) to generate LiDAR-realistic scenes from a latent space tailored to +capture the realism of LiDAR scenes by incorporating geometric priors into the +learning pipeline. Our method targets three major desiderata: pattern realism, +geometry realism, and object realism. Specifically, we introduce curve-wise +compression to simulate real-world LiDAR patterns, point-wise coordinate +supervision to learn scene geometry, and patch-wise encoding for a full 3D +object context. With these three core designs, our method achieves competitive +performance on unconditional LiDAR generation in 64-beam scenario and state of +the art on conditional LiDAR generation, while maintaining high efficiency +compared to point-based DMs (up to 107$\times$ faster). Furthermore, by +compressing LiDAR scenes into a latent space, we enable the controllability of +DMs with various conditions such as semantic maps, camera views, and text +prompts.",cs.CV,"['cs.CV', 'cs.AI', 'cs.RO']" +DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models,Muyang Li · Tianle Cai · Jiaxin Cao · Qinsheng Zhang · Han Cai · Junjie Bai · Yangqing Jia · Kai Li · Song Han, ,https://arxiv.org/abs/2402.19481,,2402.19481.pdf,DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models,"Diffusion models have achieved great success in synthesizing high-quality +images. However, generating high-resolution images with diffusion models is +still challenging due to the enormous computational costs, resulting in a +prohibitive latency for interactive applications. In this paper, we propose +DistriFusion to tackle this problem by leveraging parallelism across multiple +GPUs. Our method splits the model input into multiple patches and assigns each +patch to a GPU. However, naively implementing such an algorithm breaks the +interaction between patches and loses fidelity, while incorporating such an +interaction will incur tremendous communication overhead. To overcome this +dilemma, we observe the high similarity between the input from adjacent +diffusion steps and propose displaced patch parallelism, which takes advantage +of the sequential nature of the diffusion process by reusing the pre-computed +feature maps from the previous timestep to provide context for the current +step. Therefore, our method supports asynchronous communication, which can be +pipelined by computation. Extensive experiments show that our method can be +applied to recent Stable Diffusion XL with no quality degradation and achieve +up to a 6.1$\times$ speedup on eight NVIDIA A100s compared to one. 
Our code is +publicly available at https://github.com/mit-han-lab/distrifuser.",cs.CV,['cs.CV'] +Modeling Collaborator: Enabling Subjective Vision Classification With Minimal Human Effort via LLM Tool-Use,Imad Eddine Toubal · Aditya Avinash · Neil Alldrin · Jan Dlabal · Wenlei Zhou · Enming Luo · Otilia Stretcu · Hao Xiong · Chun-Ta Lu · Howard Zhou · Ranjay Krishna · Ariel Fuxman · Tom Duerig, ,https://arxiv.org/abs/2403.02626,,2403.02626.pdf,Modeling Collaborator: Enabling Subjective Vision Classification With Minimal Human Effort via LLM Tool-Use,"From content moderation to wildlife conservation, the number of applications +that require models to recognize nuanced or subjective visual concepts is +growing. Traditionally, developing classifiers for such concepts requires +substantial manual effort measured in hours, days, or even months to identify +and annotate data needed for training. Even with recently proposed Agile +Modeling techniques, which enable rapid bootstrapping of image classifiers, +users are still required to spend 30 minutes or more of monotonous, repetitive +data labeling just to train a single classifier. Drawing on Fiske's Cognitive +Miser theory, we propose a new framework that alleviates manual effort by +replacing human labeling with natural language interactions, reducing the total +effort required to define a concept by an order of magnitude: from labeling +2,000 images to only 100 plus some natural language interactions. Our framework +leverages recent advances in foundation models, both large language models and +vision-language models, to carve out the concept space through conversation and +by automatically labeling training data points. Most importantly, our framework +eliminates the need for crowd-sourced annotations. Moreover, our framework +ultimately produces lightweight classification models that are deployable in +cost-sensitive scenarios. Across 15 subjective concepts and across 2 public +image classification datasets, our trained models outperform traditional Agile +Modeling as well as state-of-the-art zero-shot classification models like +ALIGN, CLIP, CuPL, and large visual question-answering models like PaLI-X.",cs.CV,"['cs.CV', 'cs.LG']" +Multi-Task Dense Prediction via Mixture of Low-Rank Experts,Yuqi Yang · Peng-Tao Jiang · Qibin Hou · Hao Zhang · Jinwei Chen · Bo Li, ,https://arxiv.org/abs/2403.17749,,2403.17749.pdf,Multi-Task Dense Prediction via Mixture of Low-Rank Experts,"Previous multi-task dense prediction methods based on the Mixture of Experts +(MoE) have received great performance but they neglect the importance of +explicitly modeling the global relations among all tasks. In this paper, we +present a novel decoder-focused method for multi-task dense prediction, called +Mixture-of-Low-Rank-Experts (MLoRE). To model the global task relationships, +MLoRE adds a generic convolution path to the original MoE structure, where each +task feature can go through this path for explicit parameter sharing. +Furthermore, to control the parameters and computational cost brought by the +increase in the number of experts, we take inspiration from LoRA and propose to +leverage the low-rank format of a vanilla convolution in the expert network. +Since the low-rank experts have fewer parameters and can be dynamically +parameterized into the generic convolution, the parameters and computational +cost do not change much with the increase of experts. 
Benefiting from this +design, we increase the number of experts and its reception field to enlarge +the representation capacity, facilitating multiple dense tasks learning in a +unified network. Extensive experiments on the PASCAL-Context and NYUD-v2 +benchmarks show that our MLoRE achieves superior performance compared to +previous state-of-the-art methods on all metrics. Our code is available at +https://github.com/YuqiYang213/MLoRE.",cs.CV,['cs.CV'] +Traffic Scene Parsing through the TSP6K Dataset,Peng-Tao Jiang · Yuqi Yang · Yang Cao · Qibin Hou · Ming-Ming Cheng · Chunhua Shen, ,https://ar5iv.labs.arxiv.org/html/2303.02835,,2303.02835.pdf,Traffic Scene Parsing through the TSP6K Dataset,"Traffic scene perception in computer vision is a critically important task to +achieve intelligent cities. To date, most existing datasets focus on autonomous +driving scenes. We observe that the models trained on those driving datasets +often yield unsatisfactory results on traffic monitoring scenes. However, +little effort has been put into improving the traffic monitoring scene +understanding, mainly due to the lack of specific datasets. To fill this gap, +we introduce a specialized traffic monitoring dataset, termed TSP6K, containing +images from the traffic monitoring scenario, with high-quality pixel-level and +instance-level annotations. The TSP6K dataset captures more crowded traffic +scenes with several times more traffic participants than the existing driving +scenes. We perform a detailed analysis of the dataset and comprehensively +evaluate previous popular scene parsing methods, instance segmentation methods +and unsupervised domain adaption methods. Furthermore, considering the vast +difference in instance sizes, we propose a detail refining decoder for scene +parsing, which recovers the details of different semantic regions in traffic +scenes owing to the proposed TSP6K dataset. Experiments show its effectiveness +in parsing the traffic monitoring scenes. Code and dataset are available at +https://github.com/PengtaoJiang/TSP6K.",cs.CV,['cs.CV'] +"FAR: Flexible, Accurate and Robust 6DoF Relative Camera Pose Estimation",Chris Rockwell · Nilesh Kulkarni · Linyi Jin · Jeong Joon Park · Justin Johnson · David Fouhey, ,https://arxiv.org/abs/2403.03221,,,"FAR: Flexible, Accurate and Robust 6DoF Relative Camera Pose Estimation","Estimating relative camera poses between images has been a central problem in +computer vision. Methods that find correspondences and solve for the +fundamental matrix offer high precision in most cases. Conversely, methods +predicting pose directly using neural networks are more robust to limited +overlap and can infer absolute translation scale, but at the expense of reduced +precision. We show how to combine the best of both methods; our approach yields +results that are both precise and robust, while also accurately inferring +translation scales. At the heart of our model lies a Transformer that (1) +learns to balance between solved and learned pose estimations, and (2) provides +a prior to guide a solver. 
A comprehensive analysis supports our design choices +and demonstrates that our method adapts flexibly to various feature extractors +and correspondence estimators, showing state-of-the-art performance in 6DoF +pose estimation on Matterport3D, InteriorNet, StreetLearn, and Map-free +Relocalization.",cs.CV,['cs.CV'] +Distraction is All You Need: Memory-Efficient Image Immunization against Diffusion-Based Image Editing,Ling Lo · Cheng Yeo · Hong-Han Shuai · Wen-Huang Cheng, ,https://arxiv.org/abs/2402.02583,,,DiffEditor: Boosting Accuracy and Flexibility on Diffusion-based Image Editing,"Large-scale Text-to-Image (T2I) diffusion models have revolutionized image +generation over the last few years. Although owning diverse and high-quality +generation capabilities, translating these abilities to fine-grained image +editing remains challenging. In this paper, we propose DiffEditor to rectify +two weaknesses in existing diffusion-based image editing: (1) in complex +scenarios, editing results often lack editing accuracy and exhibit unexpected +artifacts; (2) lack of flexibility to harmonize editing operations, e.g., +imagine new content. In our solution, we introduce image prompts in +fine-grained image editing, cooperating with the text prompt to better describe +the editing content. To increase the flexibility while maintaining content +consistency, we locally combine stochastic differential equation (SDE) into the +ordinary differential equation (ODE) sampling. In addition, we incorporate +regional score-based gradient guidance and a time travel strategy into the +diffusion sampling, further improving the editing quality. Extensive +experiments demonstrate that our method can efficiently achieve +state-of-the-art performance on various fine-grained image editing tasks, +including editing within a single image (e.g., object moving, resizing, and +content dragging) and across images (e.g., appearance replacing and object +pasting). Our source code is released at +https://github.com/MC-E/DragonDiffusion.",cs.CV,"['cs.CV', 'cs.LG']" +"Point, Segment and Count: A Generalized Framework for Object Counting",Zhizhong Huang · Mingliang Dai · Yi Zhang · Junping Zhang · Hongming Shan, ,https://arxiv.org/abs/2311.12386,,2311.12386.pdf,"Point, Segment and Count: A Generalized Framework for Object Counting","Class-agnostic object counting aims to count all objects in an image with +respect to example boxes or class names, \emph{a.k.a} few-shot and zero-shot +counting. In this paper, we propose a generalized framework for both few-shot +and zero-shot object counting based on detection. Our framework combines the +superior advantages of two foundation models without compromising their +zero-shot capability: (\textbf{i}) SAM to segment all possible objects as mask +proposals, and (\textbf{ii}) CLIP to classify proposals to obtain accurate +object counts. However, this strategy meets the obstacles of efficiency +overhead and the small crowded objects that cannot be localized and +distinguished. To address these issues, our framework, termed PseCo, follows +three steps: point, segment, and count. Specifically, we first propose a +class-agnostic object localization to provide accurate but least point prompts +for SAM, which consequently not only reduces computation costs but also avoids +missing small objects. 
Furthermore, we propose a generalized object +classification that leverages CLIP image/text embeddings as the classifier, +following a hierarchical knowledge distillation to obtain discriminative +classifications among hierarchical mask proposals. Extensive experimental +results on FSC-147, COCO, and LVIS demonstrate that PseCo achieves +state-of-the-art performance in both few-shot/zero-shot object +counting/detection. Code: https://github.com/Hzzone/PseCo",cs.CV,['cs.CV'] +On the Diversity and Realism of Distilled Dataset: An Efficient Dataset Distillation Paradigm,Peng Sun · Bei Shi · Daiwei Yu · Tao Lin, ,https://arxiv.org/abs/2312.03526,,2312.03526.pdf,On the Diversity and Realism of Distilled Dataset: An Efficient Dataset Distillation Paradigm,"Contemporary machine learning requires training large neural networks on +massive datasets and thus faces the challenges of high computational demands. +Dataset distillation, as a recent emerging strategy, aims to compress +real-world datasets for efficient training. However, this line of research +currently struggle with large-scale and high-resolution datasets, hindering its +practicality and feasibility. To this end, we re-examine the existing dataset +distillation methods and identify three properties required for large-scale +real-world applications, namely, realism, diversity, and efficiency. As a +remedy, we propose RDED, a novel computationally-efficient yet effective data +distillation paradigm, to enable both diversity and realism of the distilled +data. Extensive empirical results over various neural architectures and +datasets demonstrate the advancement of RDED: we can distill the full +ImageNet-1K to a small dataset comprising 10 images per class within 7 minutes, +achieving a notable 42% top-1 accuracy with ResNet-18 on a single RTX-4090 GPU +(while the SOTA only achieves 21% but requires 6 hours).",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +3DFIRES: Few Image 3D REconstruction for Scenes with Hidden Surfaces,Linyi Jin · Nilesh Kulkarni · David Fouhey,https://jinlinyi.github.io/3DFIRES/,https://arxiv.org/abs/2403.08768,,2403.08768.pdf,3DFIRES: Few Image 3D REconstruction for Scenes with Hidden Surface,"This paper introduces 3DFIRES, a novel system for scene-level 3D +reconstruction from posed images. Designed to work with as few as one view, +3DFIRES reconstructs the complete geometry of unseen scenes, including hidden +surfaces. With multiple view inputs, our method produces full reconstruction +within all camera frustums. A key feature of our approach is the fusion of +multi-view information at the feature level, enabling the production of +coherent and comprehensive 3D reconstruction. We train our system on +non-watertight scans from large-scale real scene dataset. We show it matches +the efficacy of single-view reconstruction methods with only one input and +surpasses existing techniques in both quantitative and qualitative measures for +sparse-view 3D reconstruction.",cs.CV,['cs.CV'] +AlignMiF: Geometry-Aligned Multimodal Implicit Field for Enhanced LiDAR-Camera Joint Synthesis,Tao Tang · Guangrun Wang · Yixing Lao · Peng Chen · Jie Liu · Liang Lin · Kaicheng Yu · Xiaodan Liang, ,https://arxiv.org/abs/2402.17483,,2402.17483.pdf,AlignMiF: Geometry-Aligned Multimodal Implicit Field for LiDAR-Camera Joint Synthesis,"Neural implicit fields have been a de facto standard in novel view synthesis. 
+Recently, there exist some methods exploring fusing multiple modalities within +a single field, aiming to share implicit features from different modalities to +enhance reconstruction performance. However, these modalities often exhibit +misaligned behaviors: optimizing for one modality, such as LiDAR, can adversely +affect another, like camera performance, and vice versa. In this work, we +conduct comprehensive analyses on the multimodal implicit field of LiDAR-camera +joint synthesis, revealing the underlying issue lies in the misalignment of +different sensors. Furthermore, we introduce AlignMiF, a geometrically aligned +multimodal implicit field with two proposed modules: Geometry-Aware Alignment +(GAA) and Shared Geometry Initialization (SGI). These modules effectively align +the coarse geometry across different modalities, significantly enhancing the +fusion process between LiDAR and camera data. Through extensive experiments +across various datasets and scenes, we demonstrate the effectiveness of our +approach in facilitating better interaction between LiDAR and camera modalities +within a unified neural field. Specifically, our proposed AlignMiF, achieves +remarkable improvement over recent implicit fusion methods (+2.01 and +3.11 +image PSNR on the KITTI-360 and Waymo datasets) and consistently surpasses +single modality performance (13.8% and 14.2% reduction in LiDAR Chamfer +Distance on the respective datasets).",cs.CV,['cs.CV'] +Facial Identity Anonymization via Intrinsic and Extrinsic Attention Distraction,Zhenzhong Kuang · Xiaochen Yang · Yingjie Shen · Chao Hu · Jun Yu, ,https://arxiv.org/abs/2309.04228,,2309.04228.pdf,FIVA: Facial Image and Video Anonymization and Anonymization Defense,"In this paper, we present a new approach for facial anonymization in images +and videos, abbreviated as FIVA. Our proposed method is able to maintain the +same face anonymization consistently over frames with our suggested +identity-tracking and guarantees a strong difference from the original face. +FIVA allows for 0 true positives for a false acceptance rate of 0.001. Our work +considers the important security issue of reconstruction attacks and +investigates adversarial noise, uniform noise, and parameter noise to disrupt +reconstruction attacks. In this regard, we apply different defense and +protection methods against these privacy threats to demonstrate the scalability +of FIVA. On top of this, we also show that reconstruction attack models can be +used for detection of deep fakes. Last but not least, we provide experimental +results showing how FIVA can even enable face swapping, which is purely trained +on a single target image.",cs.CV,['cs.CV'] +EMCAD: Efficient Multi-scale Convolutional Attention Decoding for Medical Image Segmentation,Md Mostafijur Rahman · Mustafa Munir · Radu Marculescu,https://github.com/SLDGroup/EMCAD,https://arxiv.org/abs/2405.06880,,2405.06880.pdf,EMCAD: Efficient Multi-scale Convolutional Attention Decoding for Medical Image Segmentation,"An efficient and effective decoding mechanism is crucial in medical image +segmentation, especially in scenarios with limited computational resources. +However, these decoding mechanisms usually come with high computational costs. +To address this concern, we introduce EMCAD, a new efficient multi-scale +convolutional attention decoder, designed to optimize both performance and +computational efficiency. 
EMCAD leverages a unique multi-scale depth-wise +convolution block, significantly enhancing feature maps through multi-scale +convolutions. EMCAD also employs channel, spatial, and grouped (large-kernel) +gated attention mechanisms, which are highly effective at capturing intricate +spatial relationships while focusing on salient regions. By employing group and +depth-wise convolution, EMCAD is very efficient and scales well (e.g., only +1.91M parameters and 0.381G FLOPs are needed when using a standard encoder). +Our rigorous evaluations across 12 datasets that belong to six medical image +segmentation tasks reveal that EMCAD achieves state-of-the-art (SOTA) +performance with 79.4% and 80.3% reduction in #Params and #FLOPs, respectively. +Moreover, EMCAD's adaptability to different encoders and versatility across +segmentation tasks further establish EMCAD as a promising tool, advancing the +field towards more efficient and accurate medical image analysis. Our +implementation is available at https://github.com/SLDGroup/EMCAD.",eess.IV,"['eess.IV', 'cs.CV']" +UniDepth: Universal Monocular Metric Depth Estimation,Luigi Piccinelli · Yung-Hsu Yang · Christos Sakaridis · Mattia Segu · Siyuan Li · Luc Van Gool · Fisher Yu,https://github.com/lpiccinelli-eth/unidepth,https://arxiv.org/abs/2403.18913,,2403.18913.pdf,UniDepth: Universal Monocular Metric Depth Estimation,"Accurate monocular metric depth estimation (MMDE) is crucial to solving +downstream tasks in 3D perception and modeling. However, the remarkable +accuracy of recent MMDE methods is confined to their training domains. These +methods fail to generalize to unseen domains even in the presence of moderate +domain gaps, which hinders their practical applicability. We propose a new +model, UniDepth, capable of reconstructing metric 3D scenes from solely single +images across domains. Departing from the existing MMDE methods, UniDepth +directly predicts metric 3D points from the input image at inference time +without any additional information, striving for a universal and flexible MMDE +solution. In particular, UniDepth implements a self-promptable camera module +predicting dense camera representation to condition depth features. Our model +exploits a pseudo-spherical output representation, which disentangles camera +and depth representations. In addition, we propose a geometric invariance loss +that promotes the invariance of camera-prompted depth features. Thorough +evaluations on ten datasets in a zero-shot regime consistently demonstrate the +superior performance of UniDepth, even when compared with methods directly +trained on the testing domains. Code and models are available at: +https://github.com/lpiccinelli-eth/unidepth",cs.CV,['cs.CV'] +Learning from Synthetic Human Group Activities,Che-Jui Chang · Danrui Li · Deep Patel · Parth Goel · Seonghyeon Moon · Samuel Sohn · Honglu Zhou · Sejong Yoon · Vladimir Pavlovic · Mubbasir Kapadia,https://cjerry1243.github.io/M3Act/,https://arxiv.org/abs/2306.16772,,2306.16772.pdf,M3Act: Learning from Synthetic Human Group Activities,"The study of complex human interactions and group activities has become a +focal point in human-centric computer vision. However, progress in related +tasks is often hindered by the challenges of obtaining large-scale labeled +datasets from real-world scenarios. To address the limitation, we introduce +M3Act, a synthetic data generator for multi-view multi-group multi-person human +atomic actions and group activities. 
Powered by Unity Engine, M3Act features +multiple semantic groups, highly diverse and photorealistic images, and a +comprehensive set of annotations, which facilitates the learning of +human-centered tasks across single-person, multi-person, and multi-group +conditions. We demonstrate the advantages of M3Act across three core +experiments. The results suggest our synthetic dataset can significantly +improve the performance of several downstream methods and replace real-world +datasets to reduce cost. Notably, M3Act improves the state-of-the-art MOTRv2 on +DanceTrack dataset, leading to a hop on the leaderboard from 10th to 2nd place. +Moreover, M3Act opens new research for controllable 3D group activity +generation. We define multiple metrics and propose a competitive baseline for +the novel task. Our code and data are available at our project page: +http://cjerry1243.github.io/M3Act.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +Online Task-Free Continual Generative and Discriminative Learning via Dynamic Cluster Memory,飞 叶 · Adrian Bors, ,,https://ojs.aaai.org/index.php/AAAI/article/view/29582,,,,,nan +AMU-Tuning: Learning Effective Bias for CLIP-based Few-shot Classification,Yuwei Tang · ZhenYi Lin · Qilong Wang · Pengfei Zhu · Qinghua Hu, ,https://arxiv.org/abs/2404.08958,,2404.08958.pdf,AMU-Tuning: Effective Logit Bias for CLIP-based Few-shot Learning,"Recently, pre-trained vision-language models (e.g., CLIP) have shown great +potential in few-shot learning and attracted a lot of research interest. +Although efforts have been made to improve few-shot ability of CLIP, key +factors on the effectiveness of existing methods have not been well studied, +limiting further exploration of CLIP's potential in few-shot learning. In this +paper, we first introduce a unified formulation to analyze CLIP-based few-shot +learning methods from a perspective of logit bias, which encourages us to learn +an effective logit bias for further improving performance of CLIP-based +few-shot learning methods. To this end, we disassemble three key components +involved in computation of logit bias (i.e., logit features, logit predictor, +and logit fusion) and empirically analyze the effect on performance of few-shot +classification. Based on analysis of key components, this paper proposes a +novel AMU-Tuning method to learn effective logit bias for CLIP-based few-shot +classification. Specifically, our AMU-Tuning predicts logit bias by exploiting +the appropriate $\underline{\textbf{A}}$uxiliary features, which are fed into +an efficient feature-initialized linear classifier with +$\underline{\textbf{M}}$ulti-branch training. Finally, an +$\underline{\textbf{U}}$ncertainty-based fusion is developed to incorporate +logit bias into CLIP for few-shot classification. The experiments are conducted +on several widely used benchmarks, and the results show AMU-Tuning clearly +outperforms its counterparts while achieving state-of-the-art performance of +CLIP-based few-shot learning without bells and whistles.",cs.CV,"['cs.CV', 'cs.CL', 'cs.LG']" +CoGS: Controllable Gaussian Splatting,Heng Yu · Joel Julin · Zoltán Á. Milacski · Koichiro Niinuma · László A. Jeni,https://cogs2024.github.io,https://arxiv.org/abs/2312.05664,,2312.05664.pdf,CoGS: Controllable Gaussian Splatting,"Capturing and re-animating the 3D structure of articulated objects present +significant barriers. On one hand, methods requiring extensively calibrated +multi-view setups are prohibitively complex and resource-intensive, limiting +their practical applicability. 
On the other hand, while single-camera Neural +Radiance Fields (NeRFs) offer a more streamlined approach, they have excessive +training and rendering costs. 3D Gaussian Splatting would be a suitable +alternative but for two reasons. Firstly, existing methods for 3D dynamic +Gaussians require synchronized multi-view cameras, and secondly, the lack of +controllability in dynamic scenarios. We present CoGS, a method for +Controllable Gaussian Splatting, that enables the direct manipulation of scene +elements, offering real-time control of dynamic scenes without the prerequisite +of pre-computing control signals. We evaluated CoGS using both synthetic and +real-world datasets that include dynamic objects that differ in degree of +difficulty. In our evaluations, CoGS consistently outperformed existing dynamic +and controllable neural representations in terms of visual fidelity.",cs.CV,['cs.CV'] +Neural Spline Fields for Burst Image Fusion and Layer Separation,Ilya Chugunov · David Shustin · Ruyu Yan · Chenyang Lei · Felix Heide, ,https://arxiv.org/abs/2312.14235,,2312.14235.pdf,Neural Spline Fields for Burst Image Fusion and Layer Separation,"Each photo in an image burst can be considered a sample of a complex 3D +scene: the product of parallax, diffuse and specular materials, scene motion, +and illuminant variation. While decomposing all of these effects from a stack +of misaligned images is a highly ill-conditioned task, the conventional +align-and-merge burst pipeline takes the other extreme: blending them into a +single image. In this work, we propose a versatile intermediate representation: +a two-layer alpha-composited image plus flow model constructed with neural +spline fields -- networks trained to map input coordinates to spline control +points. Our method is able to, during test-time optimization, jointly fuse a +burst image capture into one high-resolution reconstruction and decompose it +into transmission and obstruction layers. Then, by discarding the obstruction +layer, we can perform a range of tasks including seeing through occlusions, +reflection suppression, and shadow removal. Validated on complex synthetic and +in-the-wild captures we find that, with no post-processing steps or learned +priors, our generalizable model is able to outperform existing dedicated +single-image and multi-view obstruction removal approaches.",cs.CV,['cs.CV'] +Object Recognition as Next Token Prediction,Kaiyu Yue · Bor-Chun Chen · Jonas Geiping · Hengduo Li · Tom Goldstein · Ser-Nam Lim,https://github.com/kaiyuyue/nxtp,,https://www.semanticscholar.org/paper/Object-Recognition-as-Next-Token-Prediction-Yue-Chen/529a3164a4ef5c227b6a775f73936866cb51d72f,,,,,nan +GaussianDreamer: Fast Generation from Text to 3D Gaussians by Bridging 2D and 3D Diffusion Models,Taoran Yi · Jiemin Fang · Junjie Wang · Guanjun Wu · Lingxi Xie · Xiaopeng Zhang · Wenyu Liu · Qi Tian · Xinggang Wang,https://taoranyi.com/gaussiandreamer/,https://arxiv.org/abs/2310.08529v3,,2310.08529v3.pdf,GaussianDreamer: Fast Generation from Text to 3D Gaussians by Bridging 2D and 3D Diffusion Models,"In recent times, the generation of 3D assets from text prompts has shown +impressive results. Both 2D and 3D diffusion models can help generate decent 3D +objects based on prompts. 3D diffusion models have good 3D consistency, but +their quality and generalization are limited as trainable 3D data is expensive +and hard to obtain. 
2D diffusion models enjoy strong abilities of +generalization and fine generation, but 3D consistency is hard to guarantee. +This paper attempts to bridge the power from the two types of diffusion models +via the recent explicit and efficient 3D Gaussian splatting representation. A +fast 3D object generation framework, named as GaussianDreamer, is proposed, +where the 3D diffusion model provides priors for initialization and the 2D +diffusion model enriches the geometry and appearance. Operations of noisy point +growing and color perturbation are introduced to enhance the initialized +Gaussians. Our GaussianDreamer can generate a high-quality 3D instance or 3D +avatar within 15 minutes on one GPU, much faster than previous methods, while +the generated instances can be directly rendered in real time. Demos and code +are available at https://taoranyi.com/gaussiandreamer/.",cs.CV,"['cs.CV', 'cs.GR']" +APISR: Anime Production Inspired Real-World Anime Super-Resolution,Boyang Wang · Fengyu Yang · Xihang Yu · Chao Zhang · Hanbin Zhao, ,https://arxiv.org/abs/2403.01598,,2403.01598.pdf,APISR: Anime Production Inspired Real-World Anime Super-Resolution,"While real-world anime super-resolution (SR) has gained increasing attention +in the SR community, existing methods still adopt techniques from the +photorealistic domain. In this paper, we analyze the anime production workflow +and rethink how to use characteristics of it for the sake of the real-world +anime SR. First, we argue that video networks and datasets are not necessary +for anime SR due to the repetition use of hand-drawing frames. Instead, we +propose an anime image collection pipeline by choosing the least compressed and +the most informative frames from the video sources. Based on this pipeline, we +introduce the Anime Production-oriented Image (API) dataset. In addition, we +identify two anime-specific challenges of distorted and faint hand-drawn lines +and unwanted color artifacts. We address the first issue by introducing a +prediction-oriented compression module in the image degradation model and a +pseudo-ground truth preparation with enhanced hand-drawn lines. In addition, we +introduce the balanced twin perceptual loss combining both anime and +photorealistic high-level features to mitigate unwanted color artifacts and +increase visual clarity. We evaluate our method through extensive experiments +on the public benchmark, showing our method outperforms state-of-the-art anime +dataset-trained approaches.",eess.IV,"['eess.IV', 'cs.AI', 'cs.CV']" +Separate and Conquer: Decoupling Co-occurrence via Decomposition and Representation for Weakly Supervised Semantic Segmentation,Zhiwei Yang · Kexue Fu · Minghong Duan · Linhao Qu · Shuo Wang · Zhijian Song,https://github.com/zwyang6/SeCo,https://arxiv.org/abs/2402.18467,,2402.18467.pdf,Separate and Conquer: Decoupling Co-occurrence via Decomposition and Representation for Weakly Supervised Semantic Segmentation,"Weakly supervised semantic segmentation (WSSS) with image-level labels aims +to achieve segmentation tasks without dense annotations. However, attributed to +the frequent coupling of co-occurring objects and the limited supervision from +image-level labels, the challenging co-occurrence problem is widely present and +leads to false activation of objects in WSSS. In this work, we devise a +'Separate and Conquer' scheme SeCo to tackle this issue from dimensions of +image space and feature space. 
In the image space, we propose to 'separate' the +co-occurring objects with image decomposition by subdividing images into +patches. Importantly, we assign each patch a category tag from Class Activation +Maps (CAMs), which spatially helps remove the co-context bias and guide the +subsequent representation. In the feature space, we propose to 'conquer' the +false activation by enhancing semantic representation with multi-granularity +knowledge contrast. To this end, a dual-teacher-single-student architecture is +designed and tag-guided contrast is conducted, which guarantee the correctness +of knowledge and further facilitate the discrepancy among co-contexts. We +streamline the multi-staged WSSS pipeline end-to-end and tackle this issue +without external supervision. Extensive experiments are conducted, validating +the efficiency of our method and the superiority over previous single-staged +and even multi-staged competitors on PASCAL VOC and MS COCO. Code is available +at https://github.com/zwyang6/SeCo.git.",cs.CV,['cs.CV'] +Imagine Before Go: Self-Supervised Generative Map for Object Goal Navigation,Sixian Zhang · Xinyao Yu · Xinhang Song · Xiaohan Wang · Shuqiang Jiang, ,,http://vipl.ict.ac.cn/en/news/researchevents/202403/t20240315_207762.html,,,,,nan +MIGC: Multi-Instance Generation Controller for Text-to-Image Synthesis,Dewei Zhou · You Li · Fan Ma · Xiaoting Zhang · Yi Yang, ,https://arxiv.org/abs/2402.05408,,2402.05408.pdf,MIGC: Multi-Instance Generation Controller for Text-to-Image Synthesis,"We present a Multi-Instance Generation (MIG) task, simultaneously generating +multiple instances with diverse controls in one image. Given a set of +predefined coordinates and their corresponding descriptions, the task is to +ensure that generated instances are accurately at the designated locations and +that all instances' attributes adhere to their corresponding description. This +broadens the scope of current research on Single-instance generation, elevating +it to a more versatile and practical dimension. Inspired by the idea of divide +and conquer, we introduce an innovative approach named Multi-Instance +Generation Controller (MIGC) to address the challenges of the MIG task. +Initially, we break down the MIG task into several subtasks, each involving the +shading of a single instance. To ensure precise shading for each instance, we +introduce an instance enhancement attention mechanism. Lastly, we aggregate all +the shaded instances to provide the necessary information for accurately +generating multiple instances in stable diffusion (SD). To evaluate how well +generation models perform on the MIG task, we provide a COCO-MIG benchmark +along with an evaluation pipeline. Extensive experiments were conducted on the +proposed COCO-MIG benchmark, as well as on various commonly used benchmarks. +The evaluation results illustrate the exceptional control capabilities of our +model in terms of quantity, position, attribute, and interaction. Code and +demos will be released at https://migcproject.github.io/.",cs.CV,['cs.CV'] +Transfer CLIP for Generalizable Image Denoising,Jun Cheng · Dong Liang · Shan Tan,https://github.com/alwaysuu/CLIPDenoising,https://arxiv.org/abs/2403.15132,,,Transfer CLIP for Generalizable Image Denoising,"Image denoising is a fundamental task in computer vision. 
While prevailing +deep learning-based supervised and self-supervised methods have excelled in +eliminating in-distribution noise, their susceptibility to out-of-distribution +(OOD) noise remains a significant challenge. The recent emergence of +contrastive language-image pre-training (CLIP) model has showcased exceptional +capabilities in open-world image recognition and segmentation. Yet, the +potential for leveraging CLIP to enhance the robustness of low-level tasks +remains largely unexplored. This paper uncovers that certain dense features +extracted from the frozen ResNet image encoder of CLIP exhibit +distortion-invariant and content-related properties, which are highly desirable +for generalizable denoising. Leveraging these properties, we devise an +asymmetrical encoder-decoder denoising network, which incorporates dense +features including the noisy image and its multi-scale features from the frozen +ResNet encoder of CLIP into a learnable image decoder to achieve generalizable +denoising. The progressive feature augmentation strategy is further proposed to +mitigate feature overfitting and improve the robustness of the learnable +decoder. Extensive experiments and comparisons conducted across diverse OOD +noises, including synthetic noise, real-world sRGB noise, and low-dose CT image +noise, demonstrate the superior generalization ability of our method.",cs.CV,"['cs.CV', 'eess.IV']" +Continual Self-supervised Learning: Towards Universal Multi-modal Medical Data Representation Learning,Yiwen Ye · Yutong Xie · Jianpeng Zhang · Ziyang Chen · Qi Wu · Yong Xia, ,https://arxiv.org/abs/2311.17597,,2311.17597.pdf,Continual Self-supervised Learning: Towards Universal Multi-modal Medical Data Representation Learning,"Self-supervised learning is an efficient pre-training method for medical +image analysis. However, current research is mostly confined to +specific-modality data pre-training, consuming considerable time and resources +without achieving universality across different modalities. A straightforward +solution is combining all modality data for joint self-supervised pre-training, +which poses practical challenges. Firstly, our experiments reveal conflicts in +representation learning as the number of modalities increases. Secondly, +multi-modal data collected in advance cannot cover all real-world scenarios. In +this paper, we reconsider versatile self-supervised learning from the +perspective of continual learning and propose MedCoSS, a continuous +self-supervised learning approach for multi-modal medical data. Unlike joint +self-supervised learning, MedCoSS assigns different modality data to different +training stages, forming a multi-stage pre-training process. To balance modal +conflicts and prevent catastrophic forgetting, we propose a rehearsal-based +continual learning method. We introduce the k-means sampling strategy to retain +data from previous modalities and rehearse it when learning new modalities. +Instead of executing the pretext task on buffer data, a feature distillation +strategy and an intra-modal mixup strategy are applied to these data for +knowledge retention. We conduct continuous self-supervised pre-training on a +large-scale multi-modal unlabeled dataset, including clinical reports, X-rays, +CT scans, MRI scans, and pathological images. Experimental results demonstrate +MedCoSS's exceptional generalization ability across nine downstream datasets +and its significant scalability in integrating new modality data. 
Code and +pre-trained weight are available at https://github.com/yeerwen/MedCoSS.",cs.CV,['cs.CV'] +OmniVid: A Generative Framework for Universal Video Understanding,Junke Wang · Dongdong Chen · Chong Luo · Bo He · Lu Yuan · Zuxuan Wu · Yu-Gang Jiang, ,https://arxiv.org/abs/2403.17935,,2403.17935.pdf,OmniVid: A Generative Framework for Universal Video Understanding,"The core of video understanding tasks, such as recognition, captioning, and +tracking, is to automatically detect objects or actions in a video and analyze +their temporal evolution. Despite sharing a common goal, different tasks often +rely on distinct model architectures and annotation formats. In contrast, +natural language processing benefits from a unified output space, i.e., text +sequences, which simplifies the training of powerful foundational language +models, such as GPT-3, with extensive training corpora. Inspired by this, we +seek to unify the output space of video understanding tasks by using languages +as labels and additionally introducing time and box tokens. In this way, a +variety of video tasks could be formulated as video-grounded token generation. +This enables us to address various types of video tasks, including +classification (such as action recognition), captioning (covering clip +captioning, video question answering, and dense video captioning), and +localization tasks (such as visual object tracking) within a fully shared +encoder-decoder architecture, following a generative framework. Through +comprehensive experiments, we demonstrate such a simple and straightforward +idea is quite effective and can achieve state-of-the-art or competitive results +on seven video benchmarks, providing a novel perspective for more universal +video understanding. Code is available at https://github.com/wangjk666/OmniVid.",cs.CV,['cs.CV'] +Learning from One Continuous Video Stream,Joao Carreira · Michael King · Viorica Patraucean · Dilara Gokay · Catalin Ionescu · Yi Yang · Daniel Zoran · Joseph Heyward · Carl Doersch · Yusuf Aytar · Dima Damen · Andrew Zisserman, ,https://arxiv.org/abs/2312.00598,,2312.00598.pdf,Learning from One Continuous Video Stream,"We introduce a framework for online learning from a single continuous video +stream -- the way people and animals learn, without mini-batches, data +augmentation or shuffling. This poses great challenges given the high +correlation between consecutive video frames and there is very little prior +work on it. Our framework allows us to do a first deep dive into the topic and +includes a collection of streams and tasks composed from two existing video +datasets, plus methodology for performance evaluation that considers both +adaptation and generalization. We employ pixel-to-pixel modelling as a +practical and flexible way to switch between pre-training and single-stream +evaluation as well as between arbitrary tasks, without ever requiring changes +to models and always using the same pixel loss. Equipped with this framework we +obtained large single-stream learning gains from pre-training with a novel +family of future prediction tasks, found that momentum hurts, and that the pace +of weight updates matters. 
The combination of these insights leads to matching +the performance of IID learning with batch size 1, when using the same +architecture and without costly replay buffers.",cs.CV,"['cs.CV', 'cs.AI']" +Blur2Blur: Blur Conversion for Unsupervised Image Deblurring on Unknown Domains,Bang-Dang Pham · Phong Tran · Anh Tran · Cuong Pham · Rang Nguyen · Minh Hoai,https://zero1778.github.io/blur2blur/,https://arxiv.org/abs/2403.16205,,2403.16205.pdf,Blur2Blur: Blur Conversion for Unsupervised Image Deblurring on Unknown Domains,"This paper presents an innovative framework designed to train an image +deblurring algorithm tailored to a specific camera device. This algorithm works +by transforming a blurry input image, which is challenging to deblur, into +another blurry image that is more amenable to deblurring. The transformation +process, from one blurry state to another, leverages unpaired data consisting +of sharp and blurry images captured by the target camera device. Learning this +blur-to-blur transformation is inherently simpler than direct blur-to-sharp +conversion, as it primarily involves modifying blur patterns rather than the +intricate task of reconstructing fine image details. The efficacy of the +proposed approach has been demonstrated through comprehensive experiments on +various benchmarks, where it significantly outperforms state-of-the-art methods +both quantitatively and qualitatively. Our code and data are available at +https://zero1778.github.io/blur2blur/",cs.CV,['cs.CV'] +Video Harmonization with Triplet Spatio-Temporal Variation Patterns,Zonghui Guo · XinYu Han · Jie Zhang · Shiguang Shan · Haiyong Zheng,https://github.com/zhenglab/VideoTripletTransformer,,http://vipl.ict.ac.cn/en/news/researchevents/202403/t20240315_207762.html,,,,,nan +D3still: Decoupled Differential Distillation for Asymmetric Image Retrieval,Yi Xie · Yihong Lin · Wenjie Cai · Xuemiao Xu · Huaidong Zhang · Yong Du · Shengfeng He, ,https://arxiv.org/abs/2403.01431,,2403.01431.pdf,Image2Sentence based Asymmetrical Zero-shot Composed Image Retrieval,"The task of composed image retrieval (CIR) aims to retrieve images based on +the query image and the text describing the users' intent. Existing methods +have made great progress with the advanced large vision-language (VL) model in +CIR task, however, they generally suffer from two main issues: lack of labeled +triplets for model training and difficulty of deployment on resource-restricted +environments when deploying the large vision-language model. To tackle the +above problems, we propose Image2Sentence based Asymmetric zero-shot composed +image retrieval (ISA), which takes advantage of the VL model and only relies on +unlabeled images for composition learning. In the framework, we propose a new +adaptive token learner that maps an image to a sentence in the word embedding +space of VL model. The sentence adaptively captures discriminative visual +information and is further integrated with the text modifier. An asymmetric +structure is devised for flexible deployment, in which the lightweight model is +adopted for the query side while the large VL model is deployed on the gallery +side. The global contrastive distillation and the local alignment +regularization are adopted for the alignment between the light model and the VL +model for CIR task. 
Our experiments demonstrate that the proposed ISA could +better cope with the real retrieval scenarios and further improve retrieval +accuracy and efficiency.",cs.CV,['cs.CV'] +DocRes: A Generalist Model Toward Unifying Document Image Restoration Tasks,Jiaxin Zhang · Dezhi Peng · Chongyu Liu · Peirong Zhang · Lianwen Jin,https://github.com/ZZZHANG-jx/DocRes,https://arxiv.org/abs/2405.04408,,2405.04408.pdf,DocRes: A Generalist Model Toward Unifying Document Image Restoration Tasks,"Document image restoration is a crucial aspect of Document AI systems, as the +quality of document images significantly influences the overall performance. +Prevailing methods address distinct restoration tasks independently, leading to +intricate systems and the incapability to harness the potential synergies of +multi-task learning. To overcome this challenge, we propose DocRes, a +generalist model that unifies five document image restoration tasks including +dewarping, deshadowing, appearance enhancement, deblurring, and binarization. +To instruct DocRes to perform various restoration tasks, we propose a novel +visual prompt approach called Dynamic Task-Specific Prompt (DTSPrompt). The +DTSPrompt for different tasks comprises distinct prior features, which are +additional characteristics extracted from the input image. Beyond its role as a +cue for task-specific execution, DTSPrompt can also serve as supplementary +information to enhance the model's performance. Moreover, DTSPrompt is more +flexible than prior visual prompt approaches as it can be seamlessly applied +and adapted to inputs with high and variable resolutions. Experimental results +demonstrate that DocRes achieves competitive or superior performance compared +to existing state-of-the-art task-specific models. This underscores the +potential of DocRes across a broader spectrum of document image restoration +tasks. The source code is publicly available at +https://github.com/ZZZHANG-jx/DocRes",cs.CV,['cs.CV'] +Unified Entropy Optimization for Open-Set Test-Time Adaptation,Zhengqing Gao · Xu-Yao Zhang · Cheng-Lin Liu,https://github.com/gaozhengqing/UniEnt,https://arxiv.org/abs/2404.06065,,2404.06065.pdf,Unified Entropy Optimization for Open-Set Test-Time Adaptation,"Test-time adaptation (TTA) aims at adapting a model pre-trained on the +labeled source domain to the unlabeled target domain. Existing methods usually +focus on improving TTA performance under covariate shifts, while neglecting +semantic shifts. In this paper, we delve into a realistic open-set TTA setting +where the target domain may contain samples from unknown classes. Many +state-of-the-art closed-set TTA methods perform poorly when applied to open-set +scenarios, which can be attributed to the inaccurate estimation of data +distribution and model confidence. To address these issues, we propose a simple +but effective framework called unified entropy optimization (UniEnt), which is +capable of simultaneously adapting to covariate-shifted in-distribution (csID) +data and detecting covariate-shifted out-of-distribution (csOOD) data. +Specifically, UniEnt first mines pseudo-csID and pseudo-csOOD samples from test +data, followed by entropy minimization on the pseudo-csID data and entropy +maximization on the pseudo-csOOD data. Furthermore, we introduce UniEnt+ to +alleviate the noise caused by hard data partition leveraging sample-level +confidence. Extensive experiments on CIFAR benchmarks and Tiny-ImageNet-C show +the superiority of our framework. 
The code is available at +https://github.com/gaozhengqing/UniEnt",cs.CV,['cs.CV'] +Feature Re-Embedding: Towards Foundation Model-Level Performance in Computational Pathology,Wenhao Tang · Fengtao ZHOU · Sheng Huang · Xiang Zhu · Yi Zhang · Bo Liu, ,https://arxiv.org/abs/2402.17228,,2402.17228.pdf,Feature Re-Embedding: Towards Foundation Model-Level Performance in Computational Pathology,"Multiple instance learning (MIL) is the most widely used framework in +computational pathology, encompassing sub-typing, diagnosis, prognosis, and +more. However, the existing MIL paradigm typically requires an offline instance +feature extractor, such as a pre-trained ResNet or a foundation model. This +approach lacks the capability for feature fine-tuning within the specific +downstream tasks, limiting its adaptability and performance. To address this +issue, we propose a Re-embedded Regional Transformer (R$^2$T) for re-embedding +the instance features online, which captures fine-grained local features and +establishes connections across different regions. Unlike existing works that +focus on pre-training powerful feature extractor or designing sophisticated +instance aggregator, R$^2$T is tailored to re-embed instance features online. +It serves as a portable module that can seamlessly integrate into mainstream +MIL models. Extensive experimental results on common computational pathology +tasks validate that: 1) feature re-embedding improves the performance of MIL +models based on ResNet-50 features to the level of foundation model features, +and further enhances the performance of foundation model features; 2) the +R$^2$T can introduce more significant performance improvements to various MIL +models; 3) R$^2$T-MIL, as an R$^2$T-enhanced AB-MIL, outperforms other latest +methods by a large margin.The code is available at: +https://github.com/DearCaat/RRT-MIL.",cs.CV,['cs.CV'] +Gradient-based Parameter Selection for Efficient Fine-Tuning,Zhi Zhang · Qizhe Zhang · Zijun Gao · Renrui Zhang · Ekaterina Shutova · Shiji Zhou · Shanghang Zhang, ,https://arxiv.org/abs/2312.10136,,2312.10136.pdf,Gradient-based Parameter Selection for Efficient Fine-Tuning,"With the growing size of pre-trained models, full fine-tuning and storing all +the parameters for various downstream tasks is costly and infeasible. In this +paper, we propose a new parameter-efficient fine-tuning method, Gradient-based +Parameter Selection (GPS), demonstrating that only tuning a few selected +parameters from the pre-trained model while keeping the remainder of the model +frozen can generate similar or better performance compared with the full model +fine-tuning method. Different from the existing popular and state-of-the-art +parameter-efficient fine-tuning approaches, our method does not introduce any +additional parameters and computational costs during both the training and +inference stages. Another advantage is the model-agnostic and non-destructive +property, which eliminates the need for any other design specific to a +particular model. Compared with the full fine-tuning, GPS achieves 3.33% +(91.78% vs. 88.45%, FGVC) and 9.61% (73.1% vs. 65.57%, VTAB) improvement of the +accuracy with tuning only 0.36% parameters of the pre-trained model on average +over 24 image classification tasks; it also demonstrates a significant +improvement of 17% and 16.8% in mDice and mIoU, respectively, on medical image +segmentation task. 
Moreover, GPS achieves state-of-the-art performance compared +with existing PEFT methods.",cs.CV,['cs.CV'] +UnSAMFlow: Unsupervised Optical Flow Guided by Segment Anything Model,Shuai Yuan · Lei Luo · Zhuo Hui · Can Pu · Xiaoyu Xiang · Rakesh Ranjan · Denis Demandolx, ,https://arxiv.org/abs/2405.02608,,2405.02608.pdf,UnSAMFlow: Unsupervised Optical Flow Guided by Segment Anything Model,"Traditional unsupervised optical flow methods are vulnerable to occlusions +and motion boundaries due to lack of object-level information. Therefore, we +propose UnSAMFlow, an unsupervised flow network that also leverages object +information from the latest foundation model Segment Anything Model (SAM). We +first include a self-supervised semantic augmentation module tailored to SAM +masks. We also analyze the poor gradient landscapes of traditional smoothness +losses and propose a new smoothness definition based on homography instead. A +simple yet effective mask feature module has also been added to further +aggregate features on the object level. With all these adaptations, our method +produces clear optical flow estimation with sharp boundaries around objects, +which outperforms state-of-the-art methods on both KITTI and Sintel datasets. +Our method also generalizes well across domains and runs very efficiently.",cs.CV,"['cs.CV', 'cs.AI', 'cs.RO']" +Diversity-aware Channel Pruning for StyleGAN Compression,Jiwoo Chung · Sangeek Hyun · Sang-Heon Shim · Jae-Pil Heo,https://jiwoogit.github.io/DCP-GAN_site/,https://arxiv.org/abs/2403.13548,,2403.13548.pdf,Diversity-aware Channel Pruning for StyleGAN Compression,"StyleGAN has shown remarkable performance in unconditional image generation. +However, its high computational cost poses a significant challenge for +practical applications. Although recent efforts have been made to compress +StyleGAN while preserving its performance, existing compressed models still lag +behind the original model, particularly in terms of sample diversity. To +overcome this, we propose a novel channel pruning method that leverages varying +sensitivities of channels to latent vectors, which is a key factor in sample +diversity. Specifically, by assessing channel importance based on their +sensitivities to latent vector perturbations, our method enhances the diversity +of samples in the compressed model. Since our method solely focuses on the +channel pruning stage, it has complementary benefits with prior training +schemes without additional training cost. Extensive experiments demonstrate +that our method significantly enhances sample diversity across various +datasets. Moreover, in terms of FID scores, our method not only surpasses +state-of-the-art by a large margin but also achieves comparable scores with +only half training iterations.",cs.CV,['cs.CV'] +Discriminative Sample-Guided and Parameter-Efficient Feature Space Adaptation for Cross-Domain Few-Shot Learning,Rashindrie Perera · Saman Halgamuge,https://github.com/rashindrie/DIPA,https://arxiv.org/abs/2403.04492,,2403.04492.pdf,Discriminative Sample-Guided and Parameter-Efficient Feature Space Adaptation for Cross-Domain Few-Shot Learning,"In this paper, we look at cross-domain few-shot classification which presents +the challenging task of learning new classes in previously unseen domains with +few labelled examples. Existing methods, though somewhat effective, encounter +several limitations, which we alleviate through two significant improvements. 
+First, we introduce a lightweight parameter-efficient adaptation strategy to +address overfitting associated with fine-tuning a large number of parameters on +small datasets. This strategy employs a linear transformation of pre-trained +features, significantly reducing the trainable parameter count. Second, we +replace the traditional nearest centroid classifier with a discriminative +sample-aware loss function, enhancing the model's sensitivity to the inter- and +intra-class variances within the training set for improved clustering in +feature space. Empirical evaluations on the Meta-Dataset benchmark showcase +that our approach not only improves accuracy up to 7.7\% and 5.3\% on +previously seen and unseen datasets, respectively, but also achieves the above +performance while being at least $\sim3\times$ more parameter-efficient than +existing methods, establishing a new state-of-the-art in cross-domain few-shot +learning. Our code is available at https://github.com/rashindrie/DIPA.",cs.CV,['cs.CV'] +FaceLift: Semi-supervised 3D Facial Landmark Localization,David Ferman · Pablo Garrido · Gaurav Bharaj,https://davidcferman.github.io/FaceLift/,https://arxiv.org/abs/2405.19646,,2405.19646.pdf,FaceLift: Semi-supervised 3D Facial Landmark Localization,"3D facial landmark localization has proven to be of particular use for +applications, such as face tracking, 3D face modeling, and image-based 3D face +reconstruction. In the supervised learning case, such methods usually rely on +3D landmark datasets derived from 3DMM-based registration that often lack +spatial definition alignment, as compared with that chosen by hand-labeled +human consensus, e.g., how are eyebrow landmarks defined? This creates a gap +between landmark datasets generated via high-quality 2D human labels and 3DMMs, +and it ultimately limits their effectiveness. To address this issue, we +introduce a novel semi-supervised learning approach that learns 3D landmarks by +directly lifting (visible) hand-labeled 2D landmarks and ensures better +definition alignment, without the need for 3D landmark datasets. To lift 2D +landmarks to 3D, we leverage 3D-aware GANs for better multi-view consistency +learning and in-the-wild multi-frame videos for robust cross-generalization. +Empirical experiments demonstrate that our method not only achieves better +definition alignment between 2D-3D landmarks but also outperforms other +supervised learning 3D landmark localization methods on both 3DMM labeled and +photogrammetric ground truth evaluation datasets. Project Page: +https://davidcferman.github.io/FaceLift",cs.CV,['cs.CV'] +MeLFusion: Synthesizing Music from Image and Language Cues using Diffusion Models,Sanjoy Chowdhury · Sayan Nag · Joseph K J · Balaji Vasan Srinivasan · Dinesh Manocha, ,https://arxiv.org/abs/2310.13772,,2310.13772.pdf,TexFusion: Synthesizing 3D Textures with Text-Guided Image Diffusion Models,"We present TexFusion (Texture Diffusion), a new method to synthesize textures +for given 3D geometries, using large-scale text-guided image diffusion models. +In contrast to recent works that leverage 2D text-to-image diffusion models to +distill 3D objects using a slow and fragile optimization process, TexFusion +introduces a new 3D-consistent generation technique specifically designed for +texture synthesis that employs regular diffusion model sampling on different 2D +rendered views. 
Specifically, we leverage latent diffusion models, apply the +diffusion model's denoiser on a set of 2D renders of the 3D object, and +aggregate the different denoising predictions on a shared latent texture map. +Final output RGB textures are produced by optimizing an intermediate neural +color field on the decodings of 2D renders of the latent texture. We thoroughly +validate TexFusion and show that we can efficiently generate diverse, high +quality and globally coherent textures. We achieve state-of-the-art text-guided +texture synthesis performance using only image diffusion models, while avoiding +the pitfalls of previous distillation-based methods. The text-conditioning +offers detailed control and we also do not rely on any ground truth 3D textures +for training. This makes our method versatile and applicable to a broad range +of geometry and texture types. We hope that TexFusion will advance AI-based +texturing of 3D assets for applications in virtual reality, game design, +simulation, and more.",cs.CV,"['cs.CV', 'cs.LG', 'I.3.3']" +Intensity-Robust Autofocus for Spike Camera,Changqing Su · Zhiyuan Ye · Yongsheng Xiao · You Zhou · Zhen Cheng · Bo Xiong · Zhaofei Yu · Tiejun Huang, ,https://arxiv.org/abs/2405.16790,,2405.16790.pdf,SCSim: A Realistic Spike Cameras Simulator,"Spike cameras, with their exceptional temporal resolution, are +revolutionizing high-speed visual applications. Large-scale synthetic datasets +have significantly accelerated the development of these cameras, particularly +in reconstruction and optical flow. However, current synthetic datasets for +spike cameras lack sophistication. Addressing this gap, we introduce SCSim, a +novel and more realistic spike camera simulator with a comprehensive noise +model. SCSim is adept at autonomously generating driving scenarios and +synthesizing corresponding spike streams. To enhance the fidelity of these +streams, we've developed a comprehensive noise model tailored to the unique +circuitry of spike cameras. Our evaluations demonstrate that SCSim outperforms +existing simulation methods in generating authentic spike streams. Crucially, +SCSim simplifies the creation of datasets, thereby greatly advancing +spike-based visual tasks like reconstruction. Our project refers to +https://github.com/Acnext/SCSim.",cs.CV,['cs.CV'] +SOAC: Spatio-Temporal Overlap-Aware Multi-Sensor Calibration using Neural Radiance Fields,"Quentin HERAU · Nathan Piasco · Moussab Bennehar · Luis Guillermo Roldao Jimenez · Dzmitry Tsishkou · Cyrille Migniot · Modélisation Information Systèmes · Cedric Demonceaux", ,https://arxiv.org/abs/2311.15803,,2311.15803.pdf,SOAC: Spatio-Temporal Overlap-Aware Multi-Sensor Calibration using Neural Radiance Fields,"In rapidly-evolving domains such as autonomous driving, the use of multiple +sensors with different modalities is crucial to ensure high operational +precision and stability. To correctly exploit the provided information by each +sensor in a single common frame, it is essential for these sensors to be +accurately calibrated. In this paper, we leverage the ability of Neural +Radiance Fields (NeRF) to represent different sensors modalities in a common +volumetric representation to achieve robust and accurate spatio-temporal sensor +calibration. By designing a partitioning approach based on the visible part of +the scene for each sensor, we formulate the calibration problem using only the +overlapping areas. This strategy results in a more robust and accurate +calibration that is less prone to failure.
We demonstrate that our approach +works on outdoor urban scenes by validating it on multiple established driving +datasets. Results show that our method is able to get better accuracy and +robustness compared to existing methods.",cs.CV,"['cs.CV', 'cs.RO']" +Active Open-Vocabulary Recognition: Let Intelligent Moving Mitigate CLIP Limitations,Lei Fan · Jianxiong Zhou · Xiaoying Xing · Ying Wu, ,https://arxiv.org/abs/2311.17938,,2311.17938.pdf,Active Open-Vocabulary Recognition: Let Intelligent Moving Mitigate CLIP Limitations,"Active recognition, which allows intelligent agents to explore observations +for better recognition performance, serves as a prerequisite for various +embodied AI tasks, such as grasping, navigation and room arrangements. Given +the evolving environment and the multitude of object classes, it is impractical +to include all possible classes during the training stage. In this paper, we +aim at advancing active open-vocabulary recognition, empowering embodied agents +to actively perceive and classify arbitrary objects. However, directly adopting +recent open-vocabulary classification models, like Contrastive Language Image +Pretraining (CLIP), poses its unique challenges. Specifically, we observe that +CLIP's performance is heavily affected by the viewpoint and occlusions, +compromising its reliability in unconstrained embodied perception scenarios. +Further, the sequential nature of observations in agent-environment +interactions necessitates an effective method for integrating features that +maintains discriminative strength for open-vocabulary classification. To +address these issues, we introduce a novel agent for active open-vocabulary +recognition. The proposed method leverages inter-frame and inter-concept +similarities to navigate agent movements and to fuse features, without relying +on class-specific knowledge. Compared to baseline CLIP model with 29.6% +accuracy on ShapeNet dataset, the proposed agent could achieve 53.3% accuracy +for open-vocabulary recognition, without any fine-tuning to the equipped CLIP +model. Additional experiments conducted with the Habitat simulator further +affirm the efficacy of our method.",cs.CV,['cs.CV'] +2S-UDF: A Novel Two-stage UDF Learning Method for Robust Non-watertight Model Reconstruction from Multi-view Images,Junkai Deng · Fei Hou · Xuhui Chen · Wencheng Wang · Ying He, ,https://arxiv.org/abs/2308.09302,,2308.09302.pdf,Robust Audio Anti-Spoofing with Fusion-Reconstruction Learning on Multi-Order Spectrograms,"Robust audio anti-spoofing has been increasingly challenging due to the +recent advancements on deepfake techniques. While spectrograms have +demonstrated their capability for anti-spoofing, complementary information +presented in multi-order spectral patterns have not been well explored, which +limits their effectiveness for varying spoofing attacks. Therefore, we propose +a novel deep learning method with a spectral fusion-reconstruction strategy, +namely S2pecNet, to utilise multi-order spectral patterns for robust audio +anti-spoofing representations. Specifically, spectral patterns up to +second-order are fused in a coarse-to-fine manner and two branches are designed +for the fine-level fusion from the spectral and temporal contexts. A +reconstruction from the fused representation to the input spectrograms further +reduces the potential fused information loss. 
Our method achieved the +state-of-the-art performance with an EER of 0.77% on a widely used dataset: +ASVspoof2019 LA Challenge.",cs.SD,"['cs.SD', 'cs.AI', 'cs.MM', 'eess.AS']" +Sheared Backpropagation for Finetuning Foundation Models,Zhiyuan Yu · Li Shen · Liang Ding · Xinmei Tian · Yixin Chen · Dacheng Tao, ,https://arxiv.org/abs/2402.15017,,2402.15017.pdf,Towards Few-Shot Adaptation of Foundation Models via Multitask Finetuning,"Foundation models have emerged as a powerful tool for many AI problems. +Despite the tremendous success of foundation models, effective adaptation to +new tasks, particularly those with limited labels, remains an open question and +lacks theoretical understanding. An emerging solution with recent success in +vision and NLP involves finetuning a foundation model on a selection of +relevant tasks, before its adaptation to a target task with limited labeled +samples. In this paper, we study the theoretical justification of this +multitask finetuning approach. Our theoretical analysis reveals that with a +diverse set of related tasks, this multitask finetuning leads to reduced error +in the target task, in comparison to directly adapting the same pretrained +model. We quantify the relationship between finetuning tasks and target tasks +by diversity and consistency metrics, and further propose a practical task +selection algorithm. We substantiate our theoretical claims with extensive +empirical evidence. Further, we present results affirming our task selection +algorithm adeptly chooses related finetuning tasks, providing advantages to the +model performance on target tasks. We believe our study shed new light on the +effective adaptation of foundation models to new tasks that lack abundant +labels. Our code is available at +https://github.com/OliverXUZY/Foudation-Model_Multitask.",cs.LG,"['cs.LG', 'cs.AI', 'cs.CL']" +DiffusionPoser: Real-time Human Motion Reconstruction From Arbitrary Sparse Sensors Using Autoregressive Diffusion,Tom Van Wouwe · Seunghwan Lee · Antoine Falisse · Scott Delp · Karen Liu,https://diffusionposer.github.io/,https://arxiv.org/abs/2308.16682,,,DiffusionPoser: Real-time Human Motion Reconstruction From Arbitrary Sparse Sensors Using Autoregressive Diffusion,"Motion capture from a limited number of body-worn sensors, such as inertial +measurement units (IMUs) and pressure insoles, has important applications in +health, human performance, and entertainment. Recent work has focused on +accurately reconstructing whole-body motion from a specific sensor +configuration using six IMUs. While a common goal across applications is to use +the minimal number of sensors to achieve required accuracy, the optimal +arrangement of the sensors might differ from application to application. We +propose a single diffusion model, DiffusionPoser, which reconstructs human +motion in real-time from an arbitrary combination of sensors, including IMUs +placed at specified locations, and, pressure insoles. Unlike existing methods, +our model grants users the flexibility to determine the number and arrangement +of sensors tailored to the specific activity of interest, without the need for +retraining. A novel autoregressive inferencing scheme ensures real-time motion +reconstruction that closely aligns with measured sensor signals. The generative +nature of DiffusionPoser ensures realistic behavior, even for +degrees-of-freedom not directly measured. 
Qualitative results can be found on +our website: https://diffusionposer.github.io/.",cs.CV,['cs.CV'] +DMR: Decomposed Multi-Modality Representations for Frames and Events Fusion in Visual Reinforcement Learning,Haoran Xu · Peixi Peng · Guang Tan · Yuan Li · Xinhai Xu · Yonghong Tian,https://github.com/kyoran/DMR,,https://link.springer.com/article/10.1007/s11704-023-2444-y,,,,,nan +Benchmarking the Robustness of Temporal Action Detection Models Against Temporal Corruptions,Runhao Zeng · Xiaoyong Chen · Jiaming Liang · Huisi Wu · Guang-Zhong Cao · Yong Guo, ,https://arxiv.org/abs/2403.20254,,2403.20254.pdf,Benchmarking the Robustness of Temporal Action Detection Models Against Temporal Corruptions,"Temporal action detection (TAD) aims to locate action positions and recognize +action categories in long-term untrimmed videos. Although many methods have +achieved promising results, their robustness has not been thoroughly studied. +In practice, we observe that temporal information in videos can be occasionally +corrupted, such as missing or blurred frames. Interestingly, existing methods +often incur a significant performance drop even if only one frame is affected. +To formally evaluate the robustness, we establish two temporal corruption +robustness benchmarks, namely THUMOS14-C and ActivityNet-v1.3-C. In this paper, +we extensively analyze the robustness of seven leading TAD methods and obtain +some interesting findings: 1) Existing methods are particularly vulnerable to +temporal corruptions, and end-to-end methods are often more susceptible than +those with a pre-trained feature extractor; 2) Vulnerability mainly comes from +localization error rather than classification error; 3) When corruptions occur +in the middle of an action instance, TAD models tend to yield the largest +performance drop. Besides building a benchmark, we further develop a simple but +effective robust training method to defend against temporal corruptions, +through the FrameDrop augmentation and Temporal-Robust Consistency loss. +Remarkably, our approach not only improves robustness but also yields promising +improvements on clean data. We believe that this study will serve as a +benchmark for future research in robust video analysis. Source code and models +are available at https://github.com/Alvin-Zeng/temporal-robustness-benchmark.",cs.CV,['cs.CV'] +Mocap Everyone Everywhere: Lightweight Motion Capture With Smartwatches and a Head-Mounted Camera,Jiye Lee · Hanbyul Joo,https://jiyewise.github.io/projects/MocapEvery,https://arxiv.org/abs/2401.00847,,2401.00847.pdf,Mocap Everyone Everywhere: Lightweight Motion Capture With Smartwatches and a Head-Mounted Camera,"We present a lightweight and affordable motion capture method based on two +smartwatches and a head-mounted camera. In contrast to the existing approaches +that use six or more expert-level IMU devices, our approach is much more +cost-effective and convenient. Our method can make wearable motion capture +accessible to everyone everywhere, enabling 3D full-body motion capture in +diverse environments. As a key idea to overcome the extreme sparsity and +ambiguities of sensor inputs with different modalities, we integrate 6D head +poses obtained from the head-mounted cameras for motion estimation. To enable +capture in expansive indoor and outdoor scenes, we propose an algorithm to +track and update floor level changes to define head poses, coupled with a +multi-stage Transformer-based regression module. 
We also introduce novel +strategies leveraging visual cues of egocentric images to further enhance the +motion capture quality while reducing ambiguities. We demonstrate the +performance of our method on various challenging scenarios, including complex +outdoor environments and everyday motions including object interactions and +social interactions among multiple individuals.",cs.CV,"['cs.CV', 'cs.GR']" +SHViT: Single-Head Vision Transformer with Memory Efficient Macro Design,Seokju Yun · Youngmin Ro,https://github.com/ysj9909/SHViT,https://arxiv.org/abs/2401.16456,,2401.16456.pdf,SHViT: Single-Head Vision Transformer with Memory Efficient Macro Design,"Recently, efficient Vision Transformers have shown great performance with low +latency on resource-constrained devices. Conventionally, they use 4x4 patch +embeddings and a 4-stage structure at the macro level, while utilizing +sophisticated attention with multi-head configuration at the micro level. This +paper aims to address computational redundancy at all design levels in a +memory-efficient manner. We discover that using larger-stride patchify stem not +only reduces memory access costs but also achieves competitive performance by +leveraging token representations with reduced spatial redundancy from the early +stages. Furthermore, our preliminary analyses suggest that attention layers in +the early stages can be substituted with convolutions, and several attention +heads in the latter stages are computationally redundant. To handle this, we +introduce a single-head attention module that inherently prevents head +redundancy and simultaneously boosts accuracy by parallelly combining global +and local information. Building upon our solutions, we introduce SHViT, a +Single-Head Vision Transformer that obtains the state-of-the-art speed-accuracy +tradeoff. For example, on ImageNet-1k, our SHViT-S4 is 3.3x, 8.1x, and 2.4x +faster than MobileViTv2 x1.0 on GPU, CPU, and iPhone12 mobile device, +respectively, while being 1.3% more accurate. For object detection and instance +segmentation on MS COCO using Mask-RCNN head, our model achieves performance +comparable to FastViT-SA12 while exhibiting 3.8x and 2.0x lower backbone +latency on GPU and mobile device, respectively.",cs.CV,['cs.CV'] +Improved Self-Training for Test-Time Adaptation,Jing Ma, ,https://arxiv.org/abs/2309.14949v1,,2309.14949v1.pdf,Towards Real-World Test-Time Adaptation: Tri-Net Self-Training with Balanced Normalization,"Test-Time Adaptation aims to adapt source domain model to testing data at +inference stage with success demonstrated in adapting to unseen corruptions. +However, these attempts may fail under more challenging real-world scenarios. +Existing works mainly consider real-world test-time adaptation under non-i.i.d. +data stream and continual domain shift. In this work, we first complement the +existing real-world TTA protocol with a globally class imbalanced testing set. +We demonstrate that combining all settings together poses new challenges to +existing methods. We argue the failure of state-of-the-art methods is first +caused by indiscriminately adapting normalization layers to imbalanced testing +data. To remedy this shortcoming, we propose a balanced batchnorm layer to swap +out the regular batchnorm at inference stage. The new batchnorm layer is +capable of adapting without biasing towards majority classes. We are further +inspired by the success of self-training~(ST) in learning from unlabeled data +and adapt ST for test-time adaptation. 
However, ST alone is prone to over +adaption which is responsible for the poor performance under continual domain +shift. Hence, we propose to improve self-training under continual domain shift +by regularizing model updates with an anchored loss. The final TTA model, +termed as TRIBE, is built upon a tri-net architecture with balanced batchnorm +layers. We evaluate TRIBE on four datasets representing real-world TTA +settings. TRIBE consistently achieves the state-of-the-art performance across +multiple evaluation protocols. The code is available at +\url{https://github.com/Gorilla-Lab-SCUT/TRIBE}.",cs.LG,"['cs.LG', 'cs.CV']" +APSeg: Auto-Prompt Network for Cross-Domain Few-Shot Semantic Segmentation,Weizhao He · Yang Zhang · Wei Zhuo · Linlin Shen · Jiaqi Yang · Songhe Deng · Liang Sun, ,https://arxiv.org/abs/2405.15265,,2405.15265.pdf,Cross-Domain Few-Shot Semantic Segmentation via Doubly Matching Transformation,"Cross-Domain Few-shot Semantic Segmentation (CD-FSS) aims to train +generalized models that can segment classes from different domains with a few +labeled images. Previous works have proven the effectiveness of feature +transformation in addressing CD-FSS. However, they completely rely on support +images for feature transformation, and repeatedly utilizing a few support +images for each class may easily lead to overfitting and overlooking +intra-class appearance differences. In this paper, we propose a Doubly Matching +Transformation-based Network (DMTNet) to solve the above issue. Instead of +completely relying on support images, we propose Self-Matching Transformation +(SMT) to construct query-specific transformation matrices based on query images +themselves to transform domain-specific query features into domain-agnostic +ones. Calculating query-specific transformation matrices can prevent +overfitting, especially for the meta-testing stage where only one or several +images are used as support images to segment hundreds or thousands of images. +After obtaining domain-agnostic features, we exploit a Dual Hypercorrelation +Construction (DHC) module to explore the hypercorrelations between the query +image with the foreground and background of the support image, based on which +foreground and background prediction maps are generated and supervised, +respectively, to enhance the segmentation result. In addition, we propose a +Test-time Self-Finetuning (TSF) strategy to more accurately self-tune the query +prediction in unseen domains. Extensive experiments on four popular datasets +show that DMTNet achieves superior performance over state-of-the-art +approaches. Code is available at https://github.com/ChenJiayi68/DMTNet.",cs.CV,['cs.CV'] +CorrMatch: Label Propagation via Correlation Matching for Semi-Supervised Semantic Segmentation,Bo-Yuan Sun · Yuqi Yang · Le Zhang · Ming-Ming Cheng · Qibin Hou,https://github.com/BBBBchan/CorrMatch,https://arxiv.org/abs/2306.04300v3,,2306.04300v3.pdf,CorrMatch: Label Propagation via Correlation Matching for Semi-Supervised Semantic Segmentation,"This paper presents a simple but performant semi-supervised semantic +segmentation approach, called CorrMatch. Previous approaches mostly employ +complicated training strategies to leverage unlabeled data but overlook the +role of correlation maps in modeling the relationships between pairs of +locations. We observe that the correlation maps not only enable clustering +pixels of the same category easily but also contain good shape information, +which previous works have omitted. 
Motivated by these, we aim to improve the +use efficiency of unlabeled data by designing two novel label propagation +strategies. First, we propose to conduct pixel propagation by modeling the +pairwise similarities of pixels to spread the high-confidence pixels and dig +out more. Then, we perform region propagation to enhance the pseudo labels with +accurate class-agnostic masks extracted from the correlation maps. CorrMatch +achieves great performance on popular segmentation benchmarks. Taking the +DeepLabV3+ with ResNet-101 backbone as our segmentation model, we receive a +76%+ mIoU score on the Pascal VOC 2012 dataset with only 92 annotated images. +Code is available at https://github.com/BBBBchan/CorrMatch.",cs.CV,['cs.CV'] +PixelLM: Pixel Reasoning with Large Multimodal Model,Zhongwei Ren · Zhicheng Huang · Yunchao Wei · Yao Zhao · Dongmei Fu · Jiashi Feng · Xiaojie Jin, ,https://arxiv.org/abs/2312.02228,,2312.02228.pdf,PixelLM: Pixel Reasoning with Large Multimodal Model,"While large multimodal models (LMMs) have achieved remarkable progress, +generating pixel-level masks for image reasoning tasks involving multiple +open-world targets remains a challenge. To bridge this gap, we introduce +PixelLM, an effective and efficient LMM for pixel-level reasoning and +understanding. Central to PixelLM is a novel, lightweight pixel decoder and a +comprehensive segmentation codebook. The decoder efficiently produces masks +from the hidden embeddings of the codebook tokens, which encode detailed +target-relevant information. With this design, PixelLM harmonizes with the +structure of popular LMMs and avoids the need for additional costly +segmentation models. Furthermore, we propose a target refinement loss to +enhance the model's ability to differentiate between multiple targets, leading +to substantially improved mask quality. To advance research in this area, we +construct MUSE, a high-quality multi-target reasoning segmentation benchmark. +PixelLM excels across various pixel-level image reasoning and understanding +tasks, outperforming well-established methods in multiple benchmarks, including +MUSE, single- and multi-referring segmentation. Comprehensive ablations confirm +the efficacy of each proposed component. All code, models, and datasets will be +publicly available.",cs.CV,['cs.CV'] +EGTR: Extracting Graph from Transformer for Scene Graph Generation,Jinbae Im · JeongYeon Nam · Nokyung Park · Hyungmin Lee · Seunghyun Park,https://github.com/naver-ai/egtr,https://arxiv.org/abs/2404.02072,,2404.02072.pdf,EGTR: Extracting Graph from Transformer for Scene Graph Generation,"Scene Graph Generation (SGG) is a challenging task of detecting objects and +predicting relationships between objects. After DETR was developed, one-stage +SGG models based on a one-stage object detector have been actively studied. +However, complex modeling is used to predict the relationship between objects, +and the inherent relationship between object queries learned in the multi-head +self-attention of the object detector has been neglected. We propose a +lightweight one-stage SGG model that extracts the relation graph from the +various relationships learned in the multi-head self-attention layers of the +DETR decoder. By fully utilizing the self-attention by-products, the relation +graph can be extracted effectively with a shallow relation extraction head. 
+Considering the dependency of the relation extraction task on the object +detection task, we propose a novel relation smoothing technique that adjusts +the relation label adaptively according to the quality of the detected objects. +By the relation smoothing, the model is trained according to the continuous +curriculum that focuses on object detection task at the beginning of training +and performs multi-task learning as the object detection performance gradually +improves. Furthermore, we propose a connectivity prediction task that predicts +whether a relation exists between object pairs as an auxiliary task of the +relation extraction. We demonstrate the effectiveness and efficiency of our +method for the Visual Genome and Open Image V6 datasets. Our code is publicly +available at https://github.com/naver-ai/egtr.",cs.CV,"['cs.CV', 'cs.LG']" +Learning by Correction: Efficient Tuning Task for Zero-Shot Generative Vision-Language Reasoning,Rongjie Li · Yu Wu · Xuming He, ,https://arxiv.org/abs/2404.00909v1,,2404.00909v1.pdf,Learning by Correction: Efficient Tuning Task for Zero-Shot Generative Vision-Language Reasoning,"Generative vision-language models (VLMs) have shown impressive performance in +zero-shot vision-language tasks like image captioning and visual question +answering. However, improving their zero-shot reasoning typically requires +second-stage instruction tuning, which relies heavily on human-labeled or large +language model-generated annotation, incurring high labeling costs. To tackle +this challenge, we introduce Image-Conditioned Caption Correction (ICCC), a +novel pre-training task designed to enhance VLMs' zero-shot performance without +the need for labeled task-aware data. The ICCC task compels VLMs to rectify +mismatches between visual and language concepts, thereby enhancing instruction +following and text generation conditioned on visual inputs. Leveraging language +structure and a lightweight dependency parser, we construct data samples of +ICCC task from image-text datasets with low labeling and computation costs. +Experimental results on BLIP-2 and InstructBLIP demonstrate significant +improvements in zero-shot image-text generation-based VL tasks through ICCC +instruction tuning.",cs.CV,['cs.CV'] +GreedyViG: Dynamic Axial Graph Construction for Efficient Vision GNNs,Mustafa Munir · William Avery · Md Mostafijur Rahman · Radu Marculescu, ,https://arxiv.org/abs/2405.06849,,2405.06849.pdf,GreedyViG: Dynamic Axial Graph Construction for Efficient Vision GNNs,"Vision graph neural networks (ViG) offer a new avenue for exploration in +computer vision. A major bottleneck in ViGs is the inefficient k-nearest +neighbor (KNN) operation used for graph construction. To solve this issue, we +propose a new method for designing ViGs, Dynamic Axial Graph Construction +(DAGC), which is more efficient than KNN as it limits the number of considered +graph connections made within an image. Additionally, we propose a novel +CNN-GNN architecture, GreedyViG, which uses DAGC. Extensive experiments show +that GreedyViG beats existing ViG, CNN, and ViT architectures in terms of +accuracy, GMACs, and parameters on image classification, object detection, +instance segmentation, and semantic segmentation tasks. Our smallest model, +GreedyViG-S, achieves 81.1% top-1 accuracy on ImageNet-1K, 2.9% higher than +Vision GNN and 2.2% higher than Vision HyperGraph Neural Network (ViHGNN), with +less GMACs and a similar number of parameters. 
Our largest model, GreedyViG-B +obtains 83.9% top-1 accuracy, 0.2% higher than Vision GNN, with a 66.6% +decrease in parameters and a 69% decrease in GMACs. GreedyViG-B also obtains +the same accuracy as ViHGNN with a 67.3% decrease in parameters and a 71.3% +decrease in GMACs. Our work shows that hybrid CNN-GNN architectures not only +provide a new avenue for designing efficient models, but that they can also +exceed the performance of current state-of-the-art models.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +Comparing the Decision-Making Mechanisms by Transformers and CNNs via Explanation Methods,Mingqi Jiang · Saeed Khorram · Li Fuxin,https://mingqij.github.io/projects/cdmmtc,,https://www.nature.com/articles/s41598-024-59384-x,,,,,nan +LLM4SGG: Large Language Models for Weakly Supervised Scene Graph Generation,Kibum Kim · Kanghoon Yoon · Jaehyeong Jeon · Yeonjun In · Jinyoung Moon · Donghyun Kim · Chanyoung Park, ,https://arxiv.org/abs/2310.10404,,2310.10404.pdf,LLM4SGG: Large Language Models for Weakly Supervised Scene Graph Generation,"Weakly-Supervised Scene Graph Generation (WSSGG) research has recently +emerged as an alternative to the fully-supervised approach that heavily relies +on costly annotations. In this regard, studies on WSSGG have utilized image +captions to obtain unlocalized triplets while primarily focusing on grounding +the unlocalized triplets over image regions. However, they have overlooked the +two issues involved in the triplet formation process from the captions: 1) +Semantic over-simplification issue arises when extracting triplets from +captions, where fine-grained predicates in captions are undesirably converted +into coarse-grained predicates, resulting in a long-tailed predicate +distribution, and 2) Low-density scene graph issue arises when aligning the +triplets in the caption with entity/predicate classes of interest, where many +triplets are discarded and not used in training, leading to insufficient +supervision. To tackle the two issues, we propose a new approach, i.e., Large +Language Model for weakly-supervised SGG (LLM4SGG), where we mitigate the two +issues by leveraging the LLM's in-depth understanding of language and reasoning +ability during the extraction of triplets from captions and alignment of +entity/predicate classes with target data. To further engage the LLM in these +processes, we adopt the idea of Chain-of-Thought and the in-context few-shot +learning strategy. To validate the effectiveness of LLM4SGG, we conduct +extensive experiments on Visual Genome and GQA datasets, showing significant +improvements in both Recall@K and mean Recall@K compared to the +state-of-the-art WSSGG methods. A further appeal is that LLM4SGG is +data-efficient, enabling effective model training with a small amount of +training images.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +DUSt3R: Geometric 3D Vision Made Easy,Shuzhe Wang · Vincent Leroy · Yohann Cabon · Boris Chidlovskii · Jerome Revaud, ,https://arxiv.org/abs/2312.14132v1,,2312.14132v1.pdf,DUSt3R: Geometric 3D Vision Made Easy,"Multi-view stereo reconstruction (MVS) in the wild requires to first estimate +the camera parameters e.g. intrinsic and extrinsic parameters. These are +usually tedious and cumbersome to obtain, yet they are mandatory to triangulate +corresponding pixels in 3D space, which is the core of all best performing MVS +algorithms. 
In this work, we take an opposite stance and introduce DUSt3R, a +radically novel paradigm for Dense and Unconstrained Stereo 3D Reconstruction +of arbitrary image collections, i.e. operating without prior information about +camera calibration nor viewpoint poses. We cast the pairwise reconstruction +problem as a regression of pointmaps, relaxing the hard constraints of usual +projective camera models. We show that this formulation smoothly unifies the +monocular and binocular reconstruction cases. In the case where more than two +images are provided, we further propose a simple yet effective global alignment +strategy that expresses all pairwise pointmaps in a common reference frame. We +base our network architecture on standard Transformer encoders and decoders, +allowing us to leverage powerful pretrained models. Our formulation directly +provides a 3D model of the scene as well as depth information, but +interestingly, we can seamlessly recover from it, pixel matches, relative and +absolute camera. Exhaustive experiments on all these tasks showcase that the +proposed DUSt3R can unify various 3D vision tasks and set new SoTAs on +monocular/multi-view depth estimation as well as relative pose estimation. In +summary, DUSt3R makes many geometric 3D vision tasks easy.",cs.CV,['cs.CV'] +Latent Modulated Function for Computational Optimal Continuous Image Representation,Zongyao He · Zhi Jin,https://github.com/HeZongyao/LMF,https://arxiv.org/abs/2404.16451,,2404.16451.pdf,Latent Modulated Function for Computational Optimal Continuous Image Representation,"The recent work Local Implicit Image Function (LIIF) and subsequent Implicit +Neural Representation (INR) based works have achieved remarkable success in +Arbitrary-Scale Super-Resolution (ASSR) by using MLP to decode Low-Resolution +(LR) features. However, these continuous image representations typically +implement decoding in High-Resolution (HR) High-Dimensional (HD) space, leading +to a quadratic increase in computational cost and seriously hindering the +practical applications of ASSR. To tackle this problem, we propose a novel +Latent Modulated Function (LMF), which decouples the HR-HD decoding process +into shared latent decoding in LR-HD space and independent rendering in HR +Low-Dimensional (LD) space, thereby realizing the first computational optimal +paradigm of continuous image representation. Specifically, LMF utilizes an HD +MLP in latent space to generate latent modulations of each LR feature vector. +This enables a modulated LD MLP in render space to quickly adapt to any input +feature vector and perform rendering at arbitrary resolution. Furthermore, we +leverage the positive correlation between modulation intensity and input image +complexity to design a Controllable Multi-Scale Rendering (CMSR) algorithm, +offering the flexibility to adjust the decoding efficiency based on the +rendering precision. Extensive experiments demonstrate that converting existing +INR-based ASSR methods to LMF can reduce the computational cost by up to 99.9%, +accelerate inference by up to 57 times, and save up to 76% of parameters, while +maintaining competitive performance. 
The code is available at +https://github.com/HeZongyao/LMF.",cs.CV,"['cs.CV', 'cs.AI']" +Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding,Zhihao Yuan · Jinke Ren · Chun-Mei Feng · Hengshuang Zhao · Shuguang Cui · Zhen Li,https://curryyuan.github.io/ZSVG3D/,https://arxiv.org/abs/2311.15383,,,Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding,"3D Visual Grounding (3DVG) aims at localizing 3D object based on textual +descriptions. Conventional supervised methods for 3DVG often necessitate +extensive annotations and a predefined vocabulary, which can be restrictive. To +address this issue, we propose a novel visual programming approach for +zero-shot open-vocabulary 3DVG, leveraging the capabilities of large language +models (LLMs). Our approach begins with a unique dialog-based method, engaging +with LLMs to establish a foundational understanding of zero-shot 3DVG. Building +on this, we design a visual program that consists of three types of modules, +i.e., view-independent, view-dependent, and functional modules. These modules, +specifically tailored for 3D scenarios, work collaboratively to perform complex +reasoning and inference. Furthermore, we develop an innovative language-object +correlation module to extend the scope of existing 3D object detectors into +open-vocabulary scenarios. Extensive experiments demonstrate that our zero-shot +approach can outperform some supervised baselines, marking a significant stride +towards effective 3DVG.",cs.CV,['cs.CV'] +NICE: Neurogenesis Inspired Contextual Encoding for Replay-free Class Incremental Learning,Mustafa B Gurbuz · Jean Moorman · Constantine Dovrolis,https://github.com/BurakGurbuz97/NICE,https://arxiv.org/abs/2310.03898,,2310.03898.pdf,Class-Incremental Learning Using Generative Experience Replay Based on Time-aware Regularization,"Learning new tasks accumulatively without forgetting remains a critical +challenge in continual learning. Generative experience replay addresses this +challenge by synthesizing pseudo-data points for past learned tasks and later +replaying them for concurrent training along with the new tasks' data. +Generative replay is the best strategy for continual learning under a strict +class-incremental setting when certain constraints need to be met: (i) constant +model size, (ii) no pre-training dataset, and (iii) no memory buffer for +storing past tasks' data. Inspired by the biological nervous system mechanisms, +we introduce a time-aware regularization method to dynamically fine-tune the +three training objective terms used for generative replay: supervised learning, +latent regularization, and data reconstruction. Experimental results on major +benchmarks indicate that our method pushes the limit of brain-inspired +continual learners under such strict settings, improves memory retention, and +increases the average performance over continually arriving tasks.",cs.LG,['cs.LG'] +A Simple Recipe for Language-guided Domain Generalized Segmentation,Mohammad Fahes · TUAN-HUNG VU · Andrei Bursuc · Patrick Pérez · Raoul de Charette,https://astra-vision.github.io/FAMix/,https://arxiv.org/abs/2311.17922,,2311.17922.pdf,A Simple Recipe for Language-guided Domain Generalized Segmentation,"Generalization to new domains not seen during training is one of the +long-standing challenges in deploying neural networks in real-world +applications. 
Existing generalization techniques either necessitate external +images for augmentation, and/or aim at learning invariant representations by +imposing various alignment constraints. Large-scale pretraining has recently +shown promising generalization capabilities, along with the potential of +binding different modalities. For instance, the advent of vision-language +models like CLIP has opened the doorway for vision models to exploit the +textual modality. In this paper, we introduce a simple framework for +generalizing semantic segmentation networks by employing language as the source +of randomization. Our recipe comprises three key ingredients: (i) the +preservation of the intrinsic CLIP robustness through minimal fine-tuning, (ii) +language-driven local style augmentation, and (iii) randomization by locally +mixing the source and augmented styles during training. Extensive experiments +report state-of-the-art results on various generalization benchmarks. Code is +accessible at https://github.com/astra-vision/FAMix .",cs.CV,['cs.CV'] +Self-Calibrating Vicinal Risk Minimisation for Model Calibration,Jiawei Liu · Changkun Ye · Ruikai Cui · Nick Barnes, ,https://arxiv.org/abs/2307.13539,,2307.13539.pdf,Model Calibration in Dense Classification with Adaptive Label Perturbation,"For safety-related applications, it is crucial to produce trustworthy deep +neural networks whose prediction is associated with confidence that can +represent the likelihood of correctness for subsequent decision-making. +Existing dense binary classification models are prone to being over-confident. +To improve model calibration, we propose Adaptive Stochastic Label Perturbation +(ASLP) which learns a unique label perturbation level for each training image. +ASLP employs our proposed Self-Calibrating Binary Cross Entropy (SC-BCE) loss, +which unifies label perturbation processes including stochastic approaches +(like DisturbLabel), and label smoothing, to correct calibration while +maintaining classification rates. ASLP follows Maximum Entropy Inference of +classic statistical mechanics to maximise prediction entropy with respect to +missing information. It performs this while: (1) preserving classification +accuracy on known data as a conservative solution, or (2) specifically improves +model calibration degree by minimising the gap between the prediction accuracy +and expected confidence of the target training label. Extensive results +demonstrate that ASLP can significantly improve calibration degrees of dense +binary classification models on both in-distribution and out-of-distribution +data. The code is available on https://github.com/Carlisle-Liu/ASLP.",cs.CV,"['cs.CV', 'cs.LG']" +Towards Effective Usage of Human-Centric Priors in Diffusion Models for Text-based Human Image Generation,Junyan Wang · Zhenhong Sun · Stewart Tan · Xuanbai Chen · Weihua Chen · li · Cheng Zhang · Yang Song, ,https://arxiv.org/abs/2403.05239,,2403.05239.pdf,Towards Effective Usage of Human-Centric Priors in Diffusion Models for Text-based Human Image Generation,"Vanilla text-to-image diffusion models struggle with generating accurate +human images, commonly resulting in imperfect anatomies such as unnatural +postures or disproportionate limbs.Existing methods address this issue mostly +by fine-tuning the model with extra images or adding additional controls -- +human-centric priors such as pose or depth maps -- during the image generation +phase. 
This paper explores the integration of these human-centric priors +directly into the model fine-tuning stage, essentially eliminating the need for +extra conditions at the inference stage. We realize this idea by proposing a +human-centric alignment loss to strengthen human-related information from the +textual prompts within the cross-attention maps. To ensure semantic detail +richness and human structural accuracy during fine-tuning, we introduce +scale-aware and step-wise constraints within the diffusion process, according +to an in-depth analysis of the cross-attention layer. Extensive experiments +show that our method largely improves over state-of-the-art text-to-image +models to synthesize high-quality human images based on user-written prompts. +Project page: \url{https://hcplayercvpr2024.github.io}.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +Anomaly Score: Evaluating Generative Models and Individual Generated Images based on Complexity and Vulnerability,Jaehui Hwang · Junghyuk Lee · Jong-Seok Lee, ,https://arxiv.org/abs/2312.10634,,2312.10634.pdf,Anomaly Score: Evaluating Generative Models and Individual Generated Images based on Complexity and Vulnerability,"With the advancement of generative models, the assessment of generated images +becomes more and more important. Previous methods measure distances between +features of reference and generated images from trained vision models. In this +paper, we conduct an extensive investigation into the relationship between the +representation space and input space around generated images. We first propose +two measures related to the presence of unnatural elements within images: +complexity, which indicates how non-linear the representation space is, and +vulnerability, which is related to how easily the extracted feature changes by +adversarial input changes. Based on these, we introduce a new metric to +evaluating image-generative models called anomaly score (AS). Moreover, we +propose AS-i (anomaly score for individual images) that can effectively +evaluate generated images individually. Experimental results demonstrate the +validity of the proposed approach.",cs.CV,"['cs.CV', 'cs.LG']" +MuseChat: A Conversational Music Recommendation System for Videos,Zhikang Dong · Bin Chen · Xiulong Liu · Pawel Polak · Peng Zhang, ,https://arxiv.org/abs/2310.06282,,2310.06282.pdf,MuseChat: A Conversational Music Recommendation System for Videos,"Music recommendation for videos attracts growing interest in multi-modal +research. However, existing systems focus primarily on content compatibility, +often ignoring the users' preferences. Their inability to interact with users +for further refinements or to provide explanations leads to a less satisfying +experience. We address these issues with MuseChat, a first-of-its-kind +dialogue-based recommendation system that personalizes music suggestions for +videos. Our system consists of two key functionalities with associated modules: +recommendation and reasoning. The recommendation module takes a video along +with optional information including previous suggested music and user's +preference as inputs and retrieves an appropriate music matching the context. +The reasoning module, equipped with the power of Large Language Model +(Vicuna-7B) and extended to multi-modal inputs, is able to provide reasonable +explanation for the recommended music. 
To evaluate the effectiveness of +MuseChat, we build a large-scale dataset, conversational music recommendation +for videos, that simulates a two-turn interaction between a user and a +recommender based on accurate music track information. Experiment results show +that MuseChat achieves significant improvements over existing video-based music +retrieval methods as well as offers strong interpretability and +interactability.",cs.LG,"['cs.LG', 'cs.CV', 'cs.IR']" +Learning Degradation-unaware Representation with Prior-based Latent Transformations for Blind Face Restoration,Lianxin Xie · csbingbing zheng · Wen Xue · Le Jiang · Cheng Liu · Si Wu · Hau San Wong, ,https://arxiv.org/abs/2402.06106,,2402.06106.pdf,CLR-Face: Conditional Latent Refinement for Blind Face Restoration Using Score-Based Diffusion Models,"Recent generative-prior-based methods have shown promising blind face +restoration performance. They usually project the degraded images to the latent +space and then decode high-quality faces either by single-stage latent +optimization or directly from the encoding. Generating fine-grained facial +details faithful to inputs remains a challenging problem. Most existing methods +produce either overly smooth outputs or alter the identity as they attempt to +balance between generation and reconstruction. This may be attributed to the +typical trade-off between quality and resolution in the latent space. If the +latent space is highly compressed, the decoded output is more robust to +degradations but shows worse fidelity. On the other hand, a more flexible +latent space can capture intricate facial details better, but is extremely +difficult to optimize for highly degraded faces using existing techniques. To +address these issues, we introduce a diffusion-based-prior inside a VQGAN +architecture that focuses on learning the distribution over uncorrupted latent +embeddings. With such knowledge, we iteratively recover the clean embedding +conditioning on the degraded counterpart. Furthermore, to ensure the reverse +diffusion trajectory does not deviate from the underlying identity, we train a +separate Identity Recovery Network and use its output to constrain the reverse +diffusion process. Specifically, using a learnable latent mask, we add +gradients from a face-recognition network to a subset of latent features that +correlates with the finer identity-related details in the pixel space, leaving +the other features untouched. Disentanglement between perception and fidelity +in the latent space allows us to achieve the best of both worlds. We perform +extensive evaluations on multiple real and synthetic datasets to validate the +superiority of our approach.",cs.CV,['cs.CV'] +Faces that Speak: Jointly Synthesising Talking Face and Speech from Text,Youngjoon Jang · Jihoon Kim · Junseok Ahn · Doyeop Kwak · Hongsun Yang · Yooncheol Ju · ILHWAN KIM · Byeong-Yeol Kim · Joon Chung,https://mm.kaist.ac.kr/projects/faces-that-speak/,https://arxiv.org/abs/2405.10272,,2405.10272.pdf,Faces that Speak: Jointly Synthesising Talking Face and Speech from Text,"The goal of this work is to simultaneously generate natural talking faces and +speech outputs from text. We achieve this by integrating Talking Face +Generation (TFG) and Text-to-Speech (TTS) systems into a unified framework. We +address the main challenges of each task: (1) generating a range of head poses +representative of real-world scenarios, and (2) ensuring voice consistency +despite variations in facial motion for the same identity. 
To tackle these +issues, we introduce a motion sampler based on conditional flow matching, which +is capable of high-quality motion code generation in an efficient way. +Moreover, we introduce a novel conditioning method for the TTS system, which +utilises motion-removed features from the TFG model to yield uniform speech +outputs. Our extensive experiments demonstrate that our method effectively +creates natural-looking talking faces and speech that accurately match the +input text. To our knowledge, this is the first effort to build a multimodal +synthesis system that can generalise to unseen identities.",cs.CV,"['cs.CV', 'cs.AI', 'cs.SD', 'eess.AS', 'eess.IV']" +Diffusion Reflectance Map: Single-Image Stochastic Inverse Rendering of Illumination and Reflectance,Yuto Enyo · Ko Nishino, ,https://arxiv.org/abs/2312.04529,,2312.04529.pdf,Diffusion Reflectance Map: Single-Image Stochastic Inverse Rendering of Illumination and Reflectance,"Reflectance bounds the frequency spectrum of illumination in the object +appearance. In this paper, we introduce the first stochastic inverse rendering +method, which recovers the attenuated frequency spectrum of an illumination +jointly with the reflectance of an object of known geometry from a single +image. Our key idea is to solve this blind inverse problem in the reflectance +map, an appearance representation invariant to the underlying geometry, by +learning to reverse the image formation with a novel diffusion model which we +refer to as the Diffusion Reflectance Map Network (DRMNet). Given an observed +reflectance map converted and completed from the single input image, DRMNet +generates a reflectance map corresponding to a perfect mirror sphere while +jointly estimating the reflectance. The forward process can be understood as +gradually filtering a natural illumination with lower and lower frequency +reflectance and additive Gaussian noise. DRMNet learns to invert this process +with two subnetworks, IllNet and RefNet, which work in concert towards this +joint estimation. The network is trained on an extensive synthetic dataset and +is demonstrated to generalize to real images, showing state-of-the-art accuracy +on established datasets.",cs.CV,['cs.CV'] +PTM-VQA: Efficient Video Quality Assessment Leveraging Diverse PreTrained Models from the Wild,Kun Yuan · Hongbo Liu · Mading Li · Muyi Sun · Ming Sun · Jiachao Gong · Jinhua Hao · Chao Zhou · Yansong Tang, ,https://arxiv.org/abs/2405.17765,,2405.17765.pdf,PTM-VQA: Efficient Video Quality Assessment Leveraging Diverse PreTrained Models from the Wild,"Video quality assessment (VQA) is a challenging problem due to the numerous +factors that can affect the perceptual quality of a video, \eg, content +attractiveness, distortion type, motion pattern, and level. However, annotating +the Mean opinion score (MOS) for videos is expensive and time-consuming, which +limits the scale of VQA datasets, and poses a significant obstacle for deep +learning-based methods. In this paper, we propose a VQA method named PTM-VQA, +which leverages PreTrained Models to transfer knowledge from models pretrained +on various pre-tasks, enabling benefits for VQA from different aspects. + Specifically, we extract features of videos from different pretrained models +with frozen weights and integrate them to generate representation. 
Since these +models possess various fields of knowledge and are often trained with labels +irrelevant to quality, we propose an Intra-Consistency and Inter-Divisibility +(ICID) loss to impose constraints on features extracted by multiple pretrained +models. The intra-consistency constraint ensures that features extracted by +different pretrained models are in the same unified quality-aware latent space, +while the inter-divisibility introduces pseudo clusters based on the annotation +of samples and tries to separate features of samples from different clusters. +Furthermore, with a constantly growing number of pretrained models, it is +crucial to determine which models to use and how to use them. To address this +problem, we propose an efficient scheme to select suitable candidates. Models +with better clustering performance on VQA datasets are chosen to be our +candidates. Extensive experiments demonstrate the effectiveness of the proposed +method.",cs.CV,['cs.CV'] +Plug-and-Play Diffusion Distillation,Yi-Ting Hsiao · Siavash Khodadadeh · Kevin Duarte · Wei-An Lin · Hui Qu · Mingi Kwon · Ratheesh Kalarot,https://5410tiffany.github.io/plug-and-play-diffusion-distillation.github.io/,https://arxiv.org/abs/2403.12015,,2403.12015.pdf,Fast High-Resolution Image Synthesis with Latent Adversarial Diffusion Distillation,"Diffusion models are the main driver of progress in image and video +synthesis, but suffer from slow inference speed. Distillation methods, like the +recently introduced adversarial diffusion distillation (ADD) aim to shift the +model from many-shot to single-step inference, albeit at the cost of expensive +and difficult optimization due to its reliance on a fixed pretrained DINOv2 +discriminator. We introduce Latent Adversarial Diffusion Distillation (LADD), a +novel distillation approach overcoming the limitations of ADD. In contrast to +pixel-based ADD, LADD utilizes generative features from pretrained latent +diffusion models. This approach simplifies training and enhances performance, +enabling high-resolution multi-aspect ratio image synthesis. We apply LADD to +Stable Diffusion 3 (8B) to obtain SD3-Turbo, a fast model that matches the +performance of state-of-the-art text-to-image generators using only four +unguided sampling steps. Moreover, we systematically investigate its scaling +behavior and demonstrate LADD's effectiveness in various applications such as +image editing and inpainting.",cs.CV,['cs.CV'] +Masked and Shuffled Blind Spot Denoising for Real-World Images,Hamadi Chihaoui · Paolo Favaro, ,https://arxiv.org/abs/2404.09389,,2404.09389.pdf,Masked and Shuffled Blind Spot Denoising for Real-World Images,"We introduce a novel approach to single image denoising based on the Blind +Spot Denoising principle, which we call MAsked and SHuffled Blind Spot +Denoising (MASH). We focus on the case of correlated noise, which often plagues +real images. MASH is the result of a careful analysis to determine the +relationships between the level of blindness (masking) of the input and the +(unknown) noise correlation. Moreover, we introduce a shuffling technique to +weaken the local correlation of noise, which in turn yields an additional +denoising performance improvement. We evaluate MASH via extensive experiments +on real-world noisy image datasets. 
We demonstrate on par or better results +compared to existing self-supervised denoising methods.",cs.CV,"['cs.CV', 'cs.LG']" +Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation,Li Hu, ,https://arxiv.org/abs/2311.17117,,2311.17117.pdf,Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation,"Character Animation aims to generating character videos from still images +through driving signals. Currently, diffusion models have become the mainstream +in visual generation research, owing to their robust generative capabilities. +However, challenges persist in the realm of image-to-video, especially in +character animation, where temporally maintaining consistency with detailed +information from character remains a formidable problem. In this paper, we +leverage the power of diffusion models and propose a novel framework tailored +for character animation. To preserve consistency of intricate appearance +features from reference image, we design ReferenceNet to merge detail features +via spatial attention. To ensure controllability and continuity, we introduce +an efficient pose guider to direct character's movements and employ an +effective temporal modeling approach to ensure smooth inter-frame transitions +between video frames. By expanding the training data, our approach can animate +arbitrary characters, yielding superior results in character animation compared +to other image-to-video methods. Furthermore, we evaluate our method on +benchmarks for fashion video and human dance synthesis, achieving +state-of-the-art results.",cs.CV,['cs.CV'] +Bootstrapping SparseFormers from Vision Foundation Models,Ziteng Gao · Zhan Tong · Kevin Qinghong Lin · Joya Chen · Mike Zheng Shou,https://github.com/showlab/sparseformer,https://arxiv.org/abs/2312.01987,,2312.01987.pdf,Bootstrapping SparseFormers from Vision Foundation Models,"The recently proposed SparseFormer architecture provides an alternative +approach to visual understanding by utilizing a significantly lower number of +visual tokens via adjusting RoIs, greatly reducing computational costs while +still achieving promising performance. However, training SparseFormers from +scratch is still expensive, and scaling up the number of parameters can be +challenging. In this paper, we propose to bootstrap SparseFormers from +ViT-based vision foundation models in a simple and efficient way. Since the +majority of SparseFormer blocks are the standard transformer ones, we can +inherit weights from large-scale pre-trained vision transformers and freeze +them as much as possible. Therefore, we only need to train the +SparseFormer-specific lightweight focusing transformer to adjust token RoIs and +fine-tune a few early pre-trained blocks to align the final token +representation. In such a way, we can bootstrap SparseFormer architectures from +various large-scale pre-trained models (e.g., IN-21K pre-trained AugRegs or +CLIPs) using a rather smaller amount of training samples (e.g., IN-1K) and +without labels or captions within just a few hours. As a result, the +bootstrapped unimodal SparseFormer (from AugReg-ViT-L/16-384) can reach 84.9% +accuracy on IN-1K with only 49 tokens, and the multimodal SparseFormer from +CLIPs also demonstrates notable zero-shot performance with highly reduced +computational cost without seeing any caption during the bootstrapping +procedure. 
In addition, CLIP-bootstrapped SparseFormers, which align the output +space with language without seeing a word, can serve as efficient vision +encoders in multimodal large language models. Code and models are available at +https://github.com/showlab/sparseformer",cs.CV,['cs.CV'] +Self-Supervised Dual Contouring,Ramana Sundararaman · Roman Klokov · Maks Ovsjanikov, ,https://arxiv.org/abs/2405.18131,,2405.18131.pdf,Self-Supervised Dual Contouring,"Learning-based isosurface extraction methods have recently emerged as a +robust and efficient alternative to axiomatic techniques. However, the vast +majority of such approaches rely on supervised training with axiomatically +computed ground truths, thus potentially inheriting biases and data artifacts +of the corresponding axiomatic methods. Steering away from such dependencies, +we propose a self-supervised training scheme for the Neural Dual Contouring +meshing framework, resulting in our method: Self-Supervised Dual Contouring +(SDC). Instead of optimizing predicted mesh vertices with supervised training, +we use two novel self-supervised loss functions that encourage the consistency +between distances to the generated mesh up to the first order. Meshes +reconstructed by SDC surpass existing data-driven methods in capturing +intricate details while being more robust to possible irregularities in the +input. Furthermore, we use the same self-supervised training objective linking +inferred mesh and input SDF, to regularize the training process of Deep +Implicit Networks (DINs). We demonstrate that the resulting DINs produce +higher-quality implicit functions, ultimately leading to more accurate and +detail-preserving surfaces compared to prior baselines for different input +modalities. Finally, we demonstrate that our self-supervised losses improve +meshing performance in the single-view reconstruction task by enabling joint +training of predicted SDF and resulting output mesh. We open-source our code at +https://github.com/Sentient07/SDC",cs.CV,['cs.CV'] +Wired Perspectives: Multi-View Wire Art Embraces Generative AI,Zhiyu Qu · LAN YANG · Honggang Zhang · Tao Xiang · Kaiyue Pang · Yi-Zhe Song,https://dreamwireart.github.io/,https://arxiv.org/abs/2311.15421,,,Wired Perspectives: Multi-View Wire Art Embraces Generative AI,"Creating multi-view wire art (MVWA), a static 3D sculpture with diverse +interpretations from different viewpoints, is a complex task even for skilled +artists. In response, we present DreamWire, an AI system enabling everyone to +craft MVWA easily. Users express their vision through text prompts or +scribbles, freeing them from intricate 3D wire organisation. Our approach +synergises 3D B\'ezier curves, Prim's algorithm, and knowledge distillation +from diffusion models or their variants (e.g., ControlNet). This blend enables +the system to represent 3D wire art, ensuring spatial continuity and overcoming +data scarcity. 
Extensive evaluation and analysis are conducted to shed insight +on the inner workings of the proposed system, including the trade-off between +connectivity and visual aesthetics.",cs.CV,"['cs.CV', 'cs.AI']" +SSR-Encoder: Encoding Selective Subject Representation for Subject-Driven Generation,Yuxuan Zhang · Yiren Song · Jiaming Liu · Rui Wang · Jinpeng Yu · Hao Tang · Huaxia Li · Xu Tang · Yao Hu · Han Pan · Zhongliang Jing,https://ssr-encoder.github.io/,https://arxiv.org/abs/2312.16272,,2312.16272.pdf,SSR-Encoder: Encoding Selective Subject Representation for Subject-Driven Generation,"Recent advancements in subject-driven image generation have led to zero-shot +generation, yet precise selection and focus on crucial subject representations +remain challenging. Addressing this, we introduce the SSR-Encoder, a novel +architecture designed for selectively capturing any subject from single or +multiple reference images. It responds to various query modalities including +text and masks, without necessitating test-time fine-tuning. The SSR-Encoder +combines a Token-to-Patch Aligner that aligns query inputs with image patches +and a Detail-Preserving Subject Encoder for extracting and preserving fine +features of the subjects, thereby generating subject embeddings. These +embeddings, used in conjunction with original text embeddings, condition the +generation process. Characterized by its model generalizability and efficiency, +the SSR-Encoder adapts to a range of custom models and control modules. +Enhanced by the Embedding Consistency Regularization Loss for improved +training, our extensive experiments demonstrate its effectiveness in versatile +and high-quality image generation, indicating its broad applicability. Project +page: https://ssr-encoder.github.io",cs.CV,['cs.CV'] +Dual Pose-invariant Embeddings: Learning Category and Object-specific Discriminative Representations for Recognition and Retrieval,Rohan Sarkar · Avinash Kak, ,https://arxiv.org/abs/2403.00272,,2403.00272.pdf,Dual Pose-invariant Embeddings: Learning Category and Object-specific Discriminative Representations for Recognition and Retrieval,"In the context of pose-invariant object recognition and retrieval, we +demonstrate that it is possible to achieve significant improvements in +performance if both the category-based and the object-identity-based embeddings +are learned simultaneously during training. In hindsight, that sounds intuitive +because learning about the categories is more fundamental than learning about +the individual objects that correspond to those categories. However, to the +best of what we know, no prior work in pose-invariant learning has demonstrated +this effect. This paper presents an attention-based dual-encoder architecture +with specially designed loss functions that optimize the inter- and intra-class +distances simultaneously in two different embedding spaces, one for the +category embeddings and the other for the object-level embeddings. The loss +functions we have proposed are pose-invariant ranking losses that are designed +to minimize the intra-class distances and maximize the inter-class distances in +the dual representation spaces. We demonstrate the power of our approach with +three challenging multi-view datasets, ModelNet-40, ObjectPI, and FG3D. With +our dual approach, for single-view object recognition, we outperform the +previous best by 20.0% on ModelNet40, 2.0% on ObjectPI, and 46.5% on FG3D. 
On +the other hand, for single-view object retrieval, we outperform the previous +best by 33.7% on ModelNet40, 18.8% on ObjectPI, and 56.9% on FG3D.",cs.CV,"['cs.CV', 'cs.IR', 'cs.LG']" +Symphonize 3D Semantic Scene Completion with Contextual Instance Queries,Haoyi Jiang · Tianheng Cheng · Naiyu Gao · Haoyang Zhang · Tianwei Lin · Wenyu Liu · Xinggang Wang, ,https://arxiv.org/abs/2306.15670v2,,2306.15670v2.pdf,Symphonize 3D Semantic Scene Completion with Contextual Instance Queries,"`3D Semantic Scene Completion (SSC) has emerged as a nascent and pivotal +undertaking in autonomous driving, aiming to predict voxel occupancy within +volumetric scenes. However, prevailing methodologies primarily focus on +voxel-wise feature aggregation, while neglecting instance semantics and scene +context. In this paper, we present a novel paradigm termed Symphonies +(Scene-from-Insts), that delves into the integration of instance queries to +orchestrate 2D-to-3D reconstruction and 3D scene modeling. Leveraging our +proposed Serial Instance-Propagated Attentions, Symphonies dynamically encodes +instance-centric semantics, facilitating intricate interactions between +image-based and volumetric domains. Simultaneously, Symphonies enables holistic +scene comprehension by capturing context through the efficient fusion of +instance queries, alleviating geometric ambiguity such as occlusion and +perspective errors through contextual scene reasoning. Experimental results +demonstrate that Symphonies achieves state-of-the-art performance on +challenging benchmarks SemanticKITTI and SSCBench-KITTI-360, yielding +remarkable mIoU scores of 15.04 and 18.58, respectively. These results showcase +the paradigm's promising advancements. The code is available at +https://github.com/hustvl/Symphonies.",cs.CV,"['cs.CV', 'cs.RO']" +KeyPoint Relative Position Encoding for Face Recognition,Minchul Kim · Feng Liu · Yiyang Su · Anil Jain · Xiaoming Liu, ,https://arxiv.org/abs/2403.14852,,2403.14852.pdf,KeyPoint Relative Position Encoding for Face Recognition,"In this paper, we address the challenge of making ViT models more robust to +unseen affine transformations. Such robustness becomes useful in various +recognition tasks such as face recognition when image alignment failures occur. +We propose a novel method called KP-RPE, which leverages key points +(e.g.~facial landmarks) to make ViT more resilient to scale, translation, and +pose variations. We begin with the observation that Relative Position Encoding +(RPE) is a good way to bring affine transform generalization to ViTs. RPE, +however, can only inject the model with prior knowledge that nearby pixels are +more important than far pixels. Keypoint RPE (KP-RPE) is an extension of this +principle, where the significance of pixels is not solely dictated by their +proximity but also by their relative positions to specific keypoints within the +image. By anchoring the significance of pixels around keypoints, the model can +more effectively retain spatial relationships, even when those relationships +are disrupted by affine transformations. We show the merit of KP-RPE in face +and gait recognition. The experimental results demonstrate the effectiveness in +improving face recognition performance from low-quality images, particularly +where alignment is prone to failure. 
Code and pre-trained models are available.",cs.CV,['cs.CV'] +Feedback-Guided Autonomous Driving,Jimuyang Zhang · Zanming Huang · Arijit Ray · Eshed Ohn-Bar, ,https://arxiv.org/abs/2306.10014,,2306.10014.pdf,Coaching a Teachable Student,"We propose a novel knowledge distillation framework for effectively teaching +a sensorimotor student agent to drive from the supervision of a privileged +teacher agent. Current distillation for sensorimotor agents methods tend to +result in suboptimal learned driving behavior by the student, which we +hypothesize is due to inherent differences between the input, modeling +capacity, and optimization processes of the two agents. We develop a novel +distillation scheme that can address these limitations and close the gap +between the sensorimotor agent and its privileged teacher. Our key insight is +to design a student which learns to align their input features with the +teacher's privileged Bird's Eye View (BEV) space. The student then can benefit +from direct supervision by the teacher over the internal representation +learning. To scaffold the difficult sensorimotor learning task, the student +model is optimized via a student-paced coaching mechanism with various +auxiliary supervision. We further propose a high-capacity imitation learned +privileged agent that surpasses prior privileged agents in CARLA and ensures +the student learns safe driving behavior. Our proposed sensorimotor agent +results in a robust image-based behavior cloning agent in CARLA, improving over +current models by over 20.6% in driving score without requiring LiDAR, +historical observations, ensemble of models, on-policy data aggregation or +reinforcement learning.",cs.CV,"['cs.CV', 'cs.AI', 'cs.RO']" +Look-Up Table Compression for Efficient Image Restoration,Yinglong Li · Jiacheng Li · Zhiwei Xiong, ,https://arxiv.org/abs/2307.08544,,2307.08544.pdf,Reconstructed Convolution Module Based Look-Up Tables for Efficient Image Super-Resolution,"Look-up table(LUT)-based methods have shown the great efficacy in single +image super-resolution (SR) task. However, previous methods ignore the +essential reason of restricted receptive field (RF) size in LUT, which is +caused by the interaction of space and channel features in vanilla convolution. +They can only increase the RF at the cost of linearly increasing LUT size. To +enlarge RF with contained LUT sizes, we propose a novel Reconstructed +Convolution(RC) module, which decouples channel-wise and spatial calculation. +It can be formulated as $n^2$ 1D LUTs to maintain $n\times n$ receptive field, +which is obviously smaller than $n\times n$D LUT formulated before. The LUT +generated by our RC module reaches less than 1/10000 storage compared with +SR-LUT baseline. The proposed Reconstructed Convolution module based LUT +method, termed as RCLUT, can enlarge the RF size by 9 times than the +state-of-the-art LUT-based SR method and achieve superior performance on five +popular benchmark dataset. Moreover, the efficient and robust RC module can be +used as a plugin to improve other LUT-based SR methods. The code is available +at https://github.com/liuguandu/RC-LUT.",eess.IV,"['eess.IV', 'cs.CV']" +WaveMo: Learning Wavefront Modulations to See Through Scattering,Mingyang Xie · Haiyun Guo · Brandon Y. 
Feng · Lingbo Jin · Ashok Veeraraghavan · Christopher Metzler,https://wavemo-2024.github.io/,https://arxiv.org/abs/2404.07985v1,,2404.07985v1.pdf,WaveMo: Learning Wavefront Modulations to See Through Scattering,"Imaging through scattering media is a fundamental and pervasive challenge in +fields ranging from medical diagnostics to astronomy. A promising strategy to +overcome this challenge is wavefront modulation, which induces measurement +diversity during image acquisition. Despite its importance, designing optimal +wavefront modulations to image through scattering remains under-explored. This +paper introduces a novel learning-based framework to address the gap. Our +approach jointly optimizes wavefront modulations and a computationally +lightweight feedforward ""proxy"" reconstruction network. This network is trained +to recover scenes obscured by scattering, using measurements that are modified +by these modulations. The learned modulations produced by our framework +generalize effectively to unseen scattering scenarios and exhibit remarkable +versatility. During deployment, the learned modulations can be decoupled from +the proxy network to augment other more computationally expensive restoration +algorithms. Through extensive experiments, we demonstrate our approach +significantly advances the state of the art in imaging through scattering +media. Our project webpage is at https://wavemo-2024.github.io/.",cs.CV,"['cs.CV', 'eess.IV']" +Constrained Layout Generation with Factor Graphs,Mohammed Haroon Dupty · Yanfei Dong · Sicong Leng · Guoji Fu · Yong Liang Goh · Wei Lu · Wee Sun Lee, ,https://arxiv.org/abs/2404.00385,,2404.00385.pdf,Constrained Layout Generation with Factor Graphs,"This paper addresses the challenge of object-centric layout generation under +spatial constraints, seen in multiple domains including floorplan design +process. The design process typically involves specifying a set of spatial +constraints that include object attributes like size and inter-object relations +such as relative positioning. Existing works, which typically represent objects +as single nodes, lack the granularity to accurately model complex interactions +between objects. For instance, often only certain parts of an object, like a +room's right wall, interact with adjacent objects. To address this gap, we +introduce a factor graph based approach with four latent variable nodes for +each room, and a factor node for each constraint. The factor nodes represent +dependencies among the variables to which they are connected, effectively +capturing constraints that are potentially of a higher order. We then develop +message-passing on the bipartite graph, forming a factor graph neural network +that is trained to produce a floorplan that aligns with the desired +requirements. Our approach is simple and generates layouts faithful to the user +requirements, demonstrated by a large improvement in IOU scores over existing +methods. Additionally, our approach, being inferential and accurate, is +well-suited to the practical human-in-the-loop design process where +specifications evolve iteratively, offering a practical and powerful tool for +AI-guided design.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +Prompt Learning via Meta-Regularization,Jinyoung Park · Juyeon Ko · Hyunwoo J. Kim, ,https://arxiv.org/abs/2404.00851,,2404.00851.pdf,Prompt Learning via Meta-Regularization,"Pre-trained vision-language models have shown impressive success on various +computer vision tasks with their zero-shot generalizability. 
Recently, prompt +learning approaches have been explored to efficiently and effectively adapt the +vision-language models to a variety of downstream tasks. However, most existing +prompt learning methods suffer from task overfitting since the general +knowledge of the pre-trained vision language models is forgotten while the +prompts are finetuned on a small data set from a specific target task. To +address this issue, we propose a Prompt Meta-Regularization (ProMetaR) to +improve the generalizability of prompt learning for vision-language models. +Specifically, ProMetaR meta-learns both the regularizer and the soft prompts to +harness the task-specific knowledge from the downstream tasks and task-agnostic +general knowledge from the vision-language models. Further, ProMetaR augments +the task to generate multiple virtual tasks to alleviate the meta-overfitting. +In addition, we provide the analysis to comprehend how ProMetaR improves the +generalizability of prompt tuning in the perspective of the gradient alignment. +Our extensive experiments demonstrate that our ProMetaR improves the +generalizability of conventional prompt learning methods under +base-to-base/base-to-new and domain generalization settings. The code of +ProMetaR is available at https://github.com/mlvlab/ProMetaR.",cs.CV,['cs.CV'] +Bi-level Learning of Task-Specific Decoders for Joint Registration and One-Shot Medical Image Segmentation,Xin Fan · Xiaolin Wang · Jiaxin Gao · Jia Wang · Zhongxuan Luo · Risheng Liu, ,,https://dl.acm.org/doi/10.1145/3580305.3599452,,,,,nan +NeLF-Pro: Neural Light Field Probes for Multi-Scale Novel View Synthesis,Zinuo You · Andreas Geiger · Anpei Chen,https://sinoyou.github.io/nelf-pro/,https://arxiv.org/abs/2312.13328,,2312.13328.pdf,NeLF-Pro: Neural Light Field Probes for Multi-Scale Novel View Synthesis,"We present NeLF-Pro, a novel representation to model and reconstruct light +fields in diverse natural scenes that vary in extent and spatial granularity. +In contrast to previous fast reconstruction methods that represent the 3D scene +globally, we model the light field of a scene as a set of local light field +feature probes, parameterized with position and multi-channel 2D feature maps. +Our central idea is to bake the scene's light field into spatially varying +learnable representations and to query point features by weighted blending of +probes close to the camera - allowing for mipmap representation and rendering. +We introduce a novel vector-matrix-matrix (VMM) factorization technique that +effectively represents the light field feature probes as products of core +factors (i.e., VM) shared among local feature probes, and a basis factor (i.e., +M) - efficiently encoding internal relationships and patterns within the scene. +Experimentally, we demonstrate that NeLF-Pro significantly boosts the +performance of feature grid-based representations, and achieves fast +reconstruction with better rendering quality while maintaining compact +modeling. 
Project webpage https://sinoyou.github.io/nelf-pro/.",cs.CV,['cs.CV'] +ALGM: Adaptive Local-then-Global Token Merging for Efficient Semantic Segmentation with Plain Vision Transformers,Narges Norouzi · Svetlana Orlova · Daan de Geus · Gijs Dubbelman,https://www.tue-mps.org/ALGM/,https://arxiv.org/abs/2405.14467,,2405.14467.pdf,Segformer++: Efficient Token-Merging Strategies for High-Resolution Semantic Segmentation,"Utilizing transformer architectures for semantic segmentation of +high-resolution images is hindered by the attention's quadratic computational +complexity in the number of tokens. A solution to this challenge involves +decreasing the number of tokens through token merging, which has exhibited +remarkable enhancements in inference speed, training efficiency, and memory +utilization for image classification tasks. In this paper, we explore various +token merging strategies within the framework of the Segformer architecture and +perform experiments on multiple semantic segmentation and human pose estimation +datasets. Notably, without model re-training, we, for example, achieve an +inference acceleration of 61% on the Cityscapes dataset while maintaining the +mIoU performance. Consequently, this paper facilitates the deployment of +transformer-based architectures on resource-constrained devices and in +real-time applications.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +Driving-Video Dehazing with Non-Aligned Regularization for Safety Assistance,Junkai Fan · Jiangwei Weng · Kun Wang · Yijun Yang · Jianjun Qian · Jun Li · Jian Yang,https://fanjunkai1.github.io/projectpage/DVD/index.html,https://arxiv.org/abs/2405.09996,,2405.09996.pdf,Driving-Video Dehazing with Non-Aligned Regularization for Safety Assistance,"Real driving-video dehazing poses a significant challenge due to the inherent +difficulty in acquiring precisely aligned hazy/clear video pairs for effective +model training, especially in dynamic driving scenarios with unpredictable +weather conditions. In this paper, we propose a pioneering approach that +addresses this challenge through a nonaligned regularization strategy. Our core +concept involves identifying clear frames that closely match hazy frames, +serving as references to supervise a video dehazing network. Our approach +comprises two key components: reference matching and video dehazing. Firstly, +we introduce a non-aligned reference frame matching module, leveraging an +adaptive sliding window to match high-quality reference frames from clear +videos. Video dehazing incorporates flow-guided cosine attention sampler and +deformable cosine attention fusion modules to enhance spatial multiframe +alignment and fuse their improved information. To validate our approach, we +collect a GoProHazy dataset captured effortlessly with GoPro cameras in diverse +rural and urban road environments. Extensive experiments demonstrate the +superiority of the proposed method over current state-of-the-art methods in the +challenging task of real driving-video dehazing. Project page.",cs.CV,['cs.CV'] +Koala: Key frame-conditioned long video-LLM,Reuben Tan · Ximeng Sun · Ping Hu · Jui-Hsien Wang · Hanieh Deilamsalehy · Bryan A. Plummer · Bryan Russell · Kate Saenko, ,https://arxiv.org/abs/2404.04346,,2404.04346.pdf,Koala: Key frame-conditioned long video-LLM,"Long video question answering is a challenging task that involves recognizing +short-term activities and reasoning about their fine-grained relationships. 
+State-of-the-art video Large Language Models (vLLMs) hold promise as a viable +solution due to their demonstrated emergent capabilities on new tasks. However, +despite being trained on millions of short seconds-long videos, vLLMs are +unable to understand minutes-long videos and accurately answer questions about +them. To address this limitation, we propose a lightweight and self-supervised +approach, Key frame-conditioned long video-LLM (Koala), that introduces +learnable spatiotemporal queries to adapt pretrained vLLMs for generalizing to +longer videos. Our approach introduces two new tokenizers that condition on +visual tokens computed from sparse video key frames for understanding short and +long video moments. We train our proposed approach on HowTo100M and demonstrate +its effectiveness on zero-shot long video understanding benchmarks, where it +outperforms state-of-the-art large models by 3 - 6% in absolute accuracy across +all tasks. Surprisingly, we also empirically show that our approach not only +helps a pretrained vLLM to understand long videos but also improves its +accuracy on short-term action recognition.",cs.CV,['cs.CV'] +Hyperspherical Classification with Dynamic Label-to-Prototype Assignment,Mohammad Saadabadi Saadabadi · Ali Dabouei · Sahar Rahimi Malakshan · Nasser Nasrabadi, ,https://arxiv.org/abs/2403.16937,,2403.16937.pdf,Hyperspherical Classification with Dynamic Label-to-Prototype Assignment,"Aiming to enhance the utilization of metric space by the parametric softmax +classifier, recent studies suggest replacing it with a non-parametric +alternative. Although a non-parametric classifier may provide better metric +space utilization, it introduces the challenge of capturing inter-class +relationships. A shared characteristic among prior non-parametric classifiers +is the static assignment of labels to prototypes during the training, ie, each +prototype consistently represents a class throughout the training course. +Orthogonal to previous works, we present a simple yet effective method to +optimize the category assigned to each prototype (label-to-prototype +assignment) during the training. To this aim, we formalize the problem as a +two-step optimization objective over network parameters and label-to-prototype +assignment mapping. We solve this optimization using a sequential combination +of gradient descent and Bipartide matching. We demonstrate the benefits of the +proposed approach by conducting experiments on balanced and long-tail +classification problems using different backbone network architectures. In +particular, our method outperforms its competitors by 1.22\% accuracy on +CIFAR-100, and 2.15\% on ImageNet-200 using a metric space dimension half of +the size of its competitors. Code: +https://github.com/msed-Ebrahimi/DL2PA_CVPR24",cs.CV,['cs.CV'] +From Activation to Initialization: Scaling Insights for Optimizing Neural Fields,Hemanth Saratchandran · Sameera Ramasinghe · Simon Lucey, ,https://arxiv.org/abs/2403.19205,,2403.19205.pdf,From Activation to Initialization: Scaling Insights for Optimizing Neural Fields,"In the realm of computer vision, Neural Fields have gained prominence as a +contemporary tool harnessing neural networks for signal representation. Despite +the remarkable progress in adapting these networks to solve a variety of +problems, the field still lacks a comprehensive theoretical framework. 
This +article aims to address this gap by delving into the intricate interplay +between initialization and activation, providing a foundational basis for the +robust optimization of Neural Fields. Our theoretical insights reveal a +deep-seated connection among network initialization, architectural choices, and +the optimization process, emphasizing the need for a holistic approach when +designing cutting-edge Neural Fields.",cs.CV,"['cs.CV', 'cs.LG']" +Tune-An-Ellipse: CLIP Has Potential to Find What You Want,Jinheng Xie · Songhe Deng · Bing Li · Haozhe Liu · Yawen Huang · Yefeng Zheng · Jürgen Schmidhuber · Bernard Ghanem · Linlin Shen · Mike Zheng Shou, ,,https://cloud.tencent.com/developer/article/2396040,,,,,nan +Neural Lineage,Runpeng Yu · Xinchao Wang, ,https://arxiv.org/abs/2312.02470v1,,2312.02470v1.pdf,Generator Born from Classifier,"In this paper, we make a bold attempt toward an ambitious task: given a +pre-trained classifier, we aim to reconstruct an image generator, without +relying on any data samples. From a black-box perspective, this challenge seems +intractable, since it inevitably involves identifying the inverse function for +a classifier, which is, by nature, an information extraction process. As such, +we resort to leveraging the knowledge encapsulated within the parameters of the +neural network. Grounded on the theory of Maximum-Margin Bias of gradient +descent, we propose a novel learning paradigm, in which the generator is +trained to ensure that the convergence conditions of the network parameters are +satisfied over the generated distribution of the samples. Empirical validation +from various image generation tasks substantiates the efficacy of our strategy.",cs.LG,"['cs.LG', 'cs.CV']" +Ranking Distillation for Open-Ended Video Question Answering with Insufficient Labels,Tianming Liang · Chaolei Tan · Beihao Xia · Wei-Shi Zheng · Jian-Fang Hu, ,https://arxiv.org/abs/2403.14430,,2403.14430.pdf,Ranking Distillation for Open-Ended Video Question Answering with Insufficient Labels,"This paper focuses on open-ended video question answering, which aims to find +the correct answers from a large answer set in response to a video-related +question. This is essentially a multi-label classification task, since a +question may have multiple answers. However, due to annotation costs, the +labels in existing benchmarks are always extremely insufficient, typically one +answer per question. As a result, existing works tend to directly treat all the +unlabeled answers as negative labels, leading to limited ability for +generalization. In this work, we introduce a simple yet effective ranking +distillation framework (RADI) to mitigate this problem without additional +manual annotation. RADI employs a teacher model trained with incomplete labels +to generate rankings for potential answers, which contain rich knowledge about +label priority as well as label-associated visual cues, thereby enriching the +insufficient labeling information. To avoid overconfidence in the imperfect +teacher model, we further present two robust and parameter-free ranking +distillation approaches: a pairwise approach which introduces adaptive soft +margins to dynamically refine the optimization constraints on various pairwise +rankings, and a listwise approach which adopts sampling-based partial listwise +learning to resist the bias in teacher ranking. Extensive experiments on five +popular benchmarks consistently show that both our pairwise and listwise RADIs +outperform state-of-the-art methods. 
Further analysis demonstrates the +effectiveness of our methods on the insufficient labeling problem.",cs.CV,['cs.CV'] +Learn to Rectify the Bias of CLIP for Unsupervised Semantic Segmentation,Jingyun Wang · Guoliang Kang, ,https://arxiv.org/abs/2403.04547,,2403.04547.pdf,CLIP the Bias: How Useful is Balancing Data in Multimodal Learning?,"We study the effectiveness of data-balancing for mitigating biases in +contrastive language-image pretraining (CLIP), identifying areas of strength +and limitation. First, we reaffirm prior conclusions that CLIP models can +inadvertently absorb societal stereotypes. To counter this, we present a novel +algorithm, called Multi-Modal Moment Matching (M4), designed to reduce both +representation and association biases (i.e. in first- and second-order +statistics) in multimodal data. We use M4 to conduct an in-depth analysis +taking into account various factors, such as the model, representation, and +data size. Our study also explores the dynamic nature of how CLIP learns and +unlearns biases. In particular, we find that fine-tuning is effective in +countering representation biases, though its impact diminishes for association +biases. Also, data balancing has a mixed impact on quality: it tends to improve +classification but can hurt retrieval. Interestingly, data and architectural +improvements seem to mitigate the negative impact of data balancing on +performance; e.g. applying M4 to SigLIP-B/16 with data quality filters improves +COCO image-to-text retrieval @5 from 86% (without data balancing) to 87% and +ImageNet 0-shot classification from 77% to 77.5%! Finally, we conclude with +recommendations for improving the efficacy of data balancing in multimodal +systems.",cs.LG,"['cs.LG', 'cs.AI']" +FINER: Flexible spectral-bias tuning in Implicit NEural Representation by Variable-periodic Activation Functions,Zhen Liu · Hao Zhu · Qi Zhang · Jingde Fu · Weibing Deng · Zhan Ma · Yanwen Guo · Xun Cao, ,https://arxiv.org/abs/2312.02434,,2312.02434.pdf,FINER: Flexible spectral-bias tuning in Implicit NEural Representation by Variable-periodic Activation Functions,"Implicit Neural Representation (INR), which utilizes a neural network to map +coordinate inputs to corresponding attributes, is causing a revolution in the +field of signal processing. However, current INR techniques suffer from a +restricted capability to tune their supported frequency set, resulting in +imperfect performance when representing complex signals with multiple +frequencies. We have identified that this frequency-related problem can be +greatly alleviated by introducing variable-periodic activation functions, for +which we propose FINER. By initializing the bias of the neural network within +different ranges, sub-functions with various frequencies in the +variable-periodic function are selected for activation. Consequently, the +supported frequency set of FINER can be flexibly tuned, leading to improved +performance in signal representation. 
We demonstrate the capabilities of FINER +in the contexts of 2D image fitting, 3D signed distance field representation, +and 5D neural radiance fields optimization, and we show that it outperforms +existing INRs.",cs.CV,['cs.CV'] +InfLoRA: Interference-Free Low-Rank Adaptation for Continual Learning,Yan-Shuo Liang · Wu-Jun Li,https://github.com/liangyanshuo/InfLoRA,https://arxiv.org/abs/2404.00228,,2404.00228.pdf,InfLoRA: Interference-Free Low-Rank Adaptation for Continual Learning,"Continual learning requires the model to learn multiple tasks sequentially. +In continual learning, the model should possess the ability to maintain its +performance on old tasks (stability) and the ability to adapt to new tasks +continuously (plasticity). Recently, parameter-efficient fine-tuning (PEFT), +which involves freezing a pre-trained model and injecting a small number of +learnable parameters to adapt to downstream tasks, has gained increasing +popularity in continual learning. Although existing continual learning methods +based on PEFT have demonstrated superior performance compared to those not +based on PEFT, most of them do not consider how to eliminate the interference +of the new task on the old tasks, which inhibits the model from making a good +trade-off between stability and plasticity. In this work, we propose a new PEFT +method, called interference-free low-rank adaptation (InfLoRA), for continual +learning. InfLoRA injects a small number of parameters to reparameterize the +pre-trained weights and shows that fine-tuning these injected parameters is +equivalent to fine-tuning the pre-trained weights within a subspace. +Furthermore, InfLoRA designs this subspace to eliminate the interference of the +new task on the old tasks, making a good trade-off between stability and +plasticity. Experimental results show that InfLoRA outperforms existing +state-of-the-art continual learning methods on multiple datasets.",cs.LG,"['cs.LG', 'cs.AI', 'cs.CV']" +DuPL: Dual Student with Trustworthy Progressive Learning for Robust Weakly Supervised Semantic Segmentation,Yuanchen Wu · Xichen Ye · KequanYang · Jide Li · Xiaoqiang Li, ,https://arxiv.org/abs/2403.11184,,2403.11184.pdf,DuPL: Dual Student with Trustworthy Progressive Learning for Robust Weakly Supervised Semantic Segmentation,"Recently, One-stage Weakly Supervised Semantic Segmentation (WSSS) with +image-level labels has gained increasing interest due to simplification over +its cumbersome multi-stage counterpart. Limited by the inherent ambiguity of +Class Activation Map (CAM), we observe that one-stage pipelines often encounter +confirmation bias caused by incorrect CAM pseudo-labels, impairing their final +segmentation performance. Although recent works discard many unreliable +pseudo-labels to implicitly alleviate this issue, they fail to exploit +sufficient supervision for their models. To this end, we propose a dual student +framework with trustworthy progressive learning (DuPL). Specifically, we +propose a dual student network with a discrepancy loss to yield diverse CAMs +for each sub-net. The two sub-nets generate supervision for each other, +mitigating the confirmation bias caused by learning their own incorrect +pseudo-labels. In this process, we progressively introduce more trustworthy +pseudo-labels to be involved in the supervision through dynamic threshold +adjustment with an adaptive noise filtering strategy. Moreover, we believe that +every pixel, even discarded from supervision due to its unreliability, is +important for WSSS. 
Thus, we develop consistency regularization on these +discarded regions, providing supervision of every pixel. Experiment results +demonstrate the superiority of the proposed DuPL over the recent +state-of-the-art alternatives on PASCAL VOC 2012 and MS COCO datasets. Code is +available at https://github.com/Wu0409/DuPL.",cs.CV,['cs.CV'] +Siamese Learning with Joint Alignment and Regression for Weakly-Supervised Video Paragraph Grounding,Chaolei Tan · Jianhuang Lai · Wei-Shi Zheng · Jian-Fang Hu, ,https://arxiv.org/abs/2403.11463v2,,2403.11463v2.pdf,Siamese Learning with Joint Alignment and Regression for Weakly-Supervised Video Paragraph Grounding,"Video Paragraph Grounding (VPG) is an emerging task in video-language +understanding, which aims at localizing multiple sentences with semantic +relations and temporal order from an untrimmed video. However, existing VPG +approaches are heavily reliant on a considerable number of temporal labels that +are laborious and time-consuming to acquire. In this work, we introduce and +explore Weakly-Supervised Video Paragraph Grounding (WSVPG) to eliminate the +need of temporal annotations. Different from previous weakly-supervised +grounding frameworks based on multiple instance learning or reconstruction +learning for two-stage candidate ranking, we propose a novel siamese learning +framework that jointly learns the cross-modal feature alignment and temporal +coordinate regression without timestamp labels to achieve concise one-stage +localization for WSVPG. Specifically, we devise a Siamese Grounding TRansformer +(SiamGTR) consisting of two weight-sharing branches for learning complementary +supervision. An Augmentation Branch is utilized for directly regressing the +temporal boundaries of a complete paragraph within a pseudo video, and an +Inference Branch is designed to capture the order-guided feature correspondence +for localizing multiple sentences in a normal video. We demonstrate by +extensive experiments that our paradigm has superior practicability and +flexibility to achieve efficient weakly-supervised or semi-supervised learning, +outperforming state-of-the-art methods trained with the same or stronger +supervision.",cs.CV,['cs.CV'] +Prompt Augmentation for Self-supervised Text-guided Image Manipulation,Rumeysa Bodur · Binod Bhattarai · Tae-Kyun Kim, ,https://arxiv.org/html/2403.10255v1,,2403.10255v1.pdf,Arbitrary-Scale Image Generation and Upsampling using Latent Diffusion Model and Implicit Neural Decoder,"Super-resolution (SR) and image generation are important tasks in computer +vision and are widely adopted in real-world applications. Most existing +methods, however, generate images only at fixed-scale magnification and suffer +from over-smoothing and artifacts. Additionally, they do not offer enough +diversity of output images nor image consistency at different scales. Most +relevant work applied Implicit Neural Representation (INR) to the denoising +diffusion model to obtain continuous-resolution yet diverse and high-quality SR +results. Since this model operates in the image space, the larger the +resolution of image is produced, the more memory and inference time is +required, and it also does not maintain scale-specific consistency. We propose +a novel pipeline that can super-resolve an input image or generate from a +random noise a novel image at arbitrary scales. The method consists of a +pretrained auto-encoder, a latent diffusion model, and an implicit neural +decoder, and their learning strategies. 
The proposed method adopts diffusion +processes in a latent space, thus efficient, yet aligned with output image +space decoded by MLPs at arbitrary scales. More specifically, our +arbitrary-scale decoder is designed by the symmetric decoder w/o up-scaling +from the pretrained auto-encoder, and Local Implicit Image Function (LIIF) in +series. The latent diffusion process is learnt by the denoising and the +alignment losses jointly. Errors in output images are backpropagated via the +fixed decoder, improving the quality of output images. In the extensive +experiments using multiple public benchmarks on the two tasks i.e. image +super-resolution and novel image generation at arbitrary scales, the proposed +method outperforms relevant methods in metrics of image quality, diversity and +scale consistency. It is significantly better than the relevant prior-art in +the inference speed and memory usage.",cs.CV,['cs.CV'] +"Stronger, Fewer, & Superior: Harnessing Vision Foundation Models for Domain Generalized Semantic Segmentation",ZHIXIANG WEI · Lin Chen · Xiaoxiao Ma · Huaian Chen · Tianle Liu · Pengyang Ling · Jinjin Zheng · Ben Wang · Yi Jin,https://zxwei.site/rein/,https://arxiv.org/abs/2312.04265,,2312.04265.pdf,"Stronger, Fewer, & Superior: Harnessing Vision Foundation Models for Domain Generalized Semantic Segmentation","In this paper, we first assess and harness various Vision Foundation Models +(VFMs) in the context of Domain Generalized Semantic Segmentation (DGSS). +Driven by the motivation that Leveraging Stronger pre-trained models and Fewer +trainable parameters for Superior generalizability, we introduce a robust +fine-tuning approach, namely Rein, to parameter-efficiently harness VFMs for +DGSS. Built upon a set of trainable tokens, each linked to distinct instances, +Rein precisely refines and forwards the feature maps from each layer to the +next layer within the backbone. This process produces diverse refinements for +different categories within a single image. With fewer trainable parameters, +Rein efficiently fine-tunes VFMs for DGSS tasks, surprisingly surpassing full +parameter fine-tuning. Extensive experiments across various settings +demonstrate that Rein significantly outperforms state-of-the-art methods. +Remarkably, with just an extra 1% of trainable parameters within the frozen +backbone, Rein achieves a mIoU of 78.4% on the Cityscapes, without accessing +any real urban-scene datasets.Code is available at +https://github.com/w1oves/Rein.git.",cs.CV,['cs.CV'] +PoseGPT: Chatting about 3D Human Pose,Yao Feng · Jing Lin · Sai Kumar Dwivedi · Yu Sun · Priyanka Patel · Michael J. Black,https://yfeng95.github.io/ChatPose/,https://arxiv.org/abs/2311.18836,,2311.18836.pdf,ChatPose: Chatting about 3D Human Pose,"We introduce ChatPose, a framework employing Large Language Models (LLMs) to +understand and reason about 3D human poses from images or textual descriptions. +Our work is motivated by the human ability to intuitively understand postures +from a single image or a brief description, a process that intertwines image +interpretation, world knowledge, and an understanding of body language. +Traditional human pose estimation and generation methods often operate in +isolation, lacking semantic understanding and reasoning abilities. ChatPose +addresses these limitations by embedding SMPL poses as distinct signal tokens +within a multimodal LLM, enabling the direct generation of 3D body poses from +both textual and visual inputs. 
Leveraging the powerful capabilities of +multimodal LLMs, ChatPose unifies classical 3D human pose and generation tasks +while offering user interactions. Additionally, ChatPose empowers LLMs to apply +their extensive world knowledge in reasoning about human poses, leading to two +advanced tasks: speculative pose generation and reasoning about pose +estimation. These tasks involve reasoning about humans to generate 3D poses +from subtle text queries, possibly accompanied by images. We establish +benchmarks for these tasks, moving beyond traditional 3D pose generation and +estimation methods. Our results show that ChatPose outperforms existing +multimodal LLMs and task-specific methods on these newly proposed tasks. +Furthermore, ChatPose's ability to understand and generate 3D human poses based +on complex reasoning opens new directions in human pose analysis.",cs.CV,['cs.CV'] +SHAP-EDITOR: Instruction-guided Latent 3D Editing in Seconds,Minghao Chen · Junyu Xie · Iro Laina · Andrea Vedaldi, ,,https://huggingface.co/papers/2312.09246,,,,,nan +Boosting Order-Preserving and Transferability for Neural Architecture Search: a Joint Architecture Refined Search and Fine-tuning Approach,Beichen Zhang · Xiaoxing Wang · Xiaohan Qin · Junchi Yan, ,https://arxiv.org/abs/2403.11380,,2403.11380.pdf,Boosting Order-Preserving and Transferability for Neural Architecture Search: a Joint Architecture Refined Search and Fine-tuning Approach,"Supernet is a core component in many recent Neural Architecture Search (NAS) +methods. It not only helps embody the search space but also provides a +(relative) estimation of the final performance of candidate architectures. +Thus, it is critical that the top architectures ranked by a supernet should be +consistent with those ranked by true performance, which is known as the +order-preserving ability. In this work, we analyze the order-preserving ability +on the whole search space (global) and a sub-space of top architectures +(local), and empirically show that the local order-preserving for current +two-stage NAS methods still need to be improved. To rectify this, we propose a +novel concept of Supernet Shifting, a refined search strategy combining +architecture searching with supernet fine-tuning. Specifically, apart from +evaluating, the training loss is also accumulated in searching and the supernet +is updated every iteration. Since superior architectures are sampled more +frequently in evolutionary searching, the supernet is encouraged to focus on +top architectures, thus improving local order-preserving. Besides, a +pre-trained supernet is often un-reusable for one-shot methods. We show that +Supernet Shifting can fulfill transferring supernet to a new dataset. +Specifically, the last classifier layer will be unset and trained through +evolutionary searching. Comprehensive experiments show that our method has +better order-preserving ability and can find a dominating architecture. 
+Moreover, the pre-trained supernet can be easily transferred into a new dataset +with no loss of performance.",cs.CV,['cs.CV'] +Towards General Robustness Verification of MaxPool-based Convolutional Neural Networks via Tightening Linear Approximation,Yuan Xiao · Shiqing Ma · Juan Zhai · Chunrong Fang · Jinyuan Jia · Zhenyu Chen,https://github.com/xiaoyuanpigo/maxlin,,https://software.nju.edu.cn/English/News/Selected/20240228/i260151.html,,,,,nan +Harnessing Meta-Learning for Improving Full-Frame Video Stabilization,Muhammad Kashif Ali · Eun Woo Im · Dongjin Kim · Tae Hyun Kim, ,https://arxiv.org/abs/2403.03662v1,,2403.03662v1.pdf,Harnessing Meta-Learning for Improving Full-Frame Video Stabilization,"Video stabilization is a longstanding computer vision problem, particularly +pixel-level synthesis solutions for video stabilization which synthesize full +frames add to the complexity of this task. These techniques aim to stabilize +videos by synthesizing full frames while enhancing the stability of the +considered video. This intensifies the complexity of the task due to the +distinct mix of unique motion profiles and visual content present in each video +sequence, making robust generalization with fixed parameters difficult. In our +study, we introduce a novel approach to enhance the performance of pixel-level +synthesis solutions for video stabilization by adapting these models to +individual input video sequences. The proposed adaptation exploits low-level +visual cues accessible during test-time to improve both the stability and +quality of resulting videos. We highlight the efficacy of our methodology of +""test-time adaptation"" through simple fine-tuning of one of these models, +followed by significant stability gain via the integration of meta-learning +techniques. Notably, significant improvement is achieved with only a single +adaptation step. The versatility of the proposed algorithm is demonstrated by +consistently improving the performance of various pixel-level synthesis models +for video stabilization in real-world scenarios.",cs.CV,['cs.CV'] +SEED-Bench: Benchmarking Multimodal Large Language Models,Bohao Li · Yuying Ge · Yixiao Ge · Guangzhi Wang · Rui Wang · Ruimao Zhang · Ying Shan, ,https://arxiv.org/abs/2307.16125,,2307.16125.pdf,SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension,"Based on powerful Large Language Models (LLMs), recent generative Multimodal +Large Language Models (MLLMs) have gained prominence as a pivotal research +area, exhibiting remarkable capability for both comprehension and generation. +In this work, we address the evaluation of generative comprehension in MLLMs as +a preliminary step towards a comprehensive assessment of generative models, by +introducing a benchmark named SEED-Bench. SEED-Bench consists of 19K multiple +choice questions with accurate human annotations (x 6 larger than existing +benchmarks), which spans 12 evaluation dimensions including the comprehension +of both the image and video modality. We develop an advanced pipeline for +generating multiple-choice questions that target specific evaluation +dimensions, integrating both automatic filtering and manual verification +processes. Multiple-choice questions with groundtruth options derived from +human annotation enables an objective and efficient assessment of model +performance, eliminating the need for human or GPT intervention during +evaluation. 
We further evaluate the performance of 18 models across all 12 +dimensions, covering both the spatial and temporal understanding. By revealing +the limitations of existing MLLMs through evaluation results, we aim for +SEED-Bench to provide insights for motivating future research. We will launch +and consistently maintain a leaderboard to provide a platform for the community +to assess and investigate model capability.",cs.CL,"['cs.CL', 'cs.CV']" +Fusing Personal and Environmental Cues for Identification and Segmentation of First-Person Camera Wearers in Third-Person Views,Ziwei Zhao · Yuchen Wang · Chuhua Wang, ,,,,,,,nan +Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models,Zhang Li · Biao Yang · Qiang Liu · Zhiyin Ma · Shuo Zhang · Jingxu Yang · Yabo Sun · Yuliang Liu · Xiang Bai, ,https://arxiv.org/abs/2311.06607,,2311.06607.pdf,Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models,"Large Multimodal Models (LMMs) have shown promise in vision-language tasks +but struggle with high-resolution input and detailed scene understanding. +Addressing these challenges, we introduce Monkey to enhance LMM capabilities. +Firstly, Monkey processes input images by dividing them into uniform patches, +each matching the size (e.g., 448x448) used in the original training of the +well-trained vision encoder. Equipped with individual adapter for each patch, +Monkey can handle higher resolutions up to 1344x896 pixels, enabling the +detailed capture of complex visual information. Secondly, it employs a +multi-level description generation method, enriching the context for +scene-object associations. This two-part strategy ensures more effective +learning from generated data: the higher resolution allows for a more detailed +capture of visuals, which in turn enhances the effectiveness of comprehensive +descriptions. Extensive ablative results validate the effectiveness of our +designs. Additionally, experiments on 18 datasets further demonstrate that +Monkey surpasses existing LMMs in many tasks like Image Captioning and various +Visual Question Answering formats. Specially, in qualitative tests focused on +dense text question answering, Monkey has exhibited encouraging results +compared with GPT4V. Code is available at +https://github.com/Yuliang-Liu/Monkey.",cs.CV,"['cs.CV', 'cs.AI', 'cs.CL']" +CPP-Net: Embracing Multi-Scale Feature Fusion into Deep Unfolding CP-PPA Network for Compressive Sensing,Zhen Guo · Hongping Gan, ,,https://www.mdpi.com/1099-4300/25/12/1579,,,,,nan +Revisiting Counterfactual Problems in Referring Expression Comprehension,Zhihan Yu · Ruifan Li, ,,https://link.springer.com/chapter/10.1007/978-3-031-41682-8_25,,,,,nan +AdaRevD: Adaptive Patch Exiting Reversible Decoder Pushes the Limit of Image Deblurring,Xintian Mao · Xiwen Gao · Yan Wang,https://github.com/DeepMed-Lab-ECNU/Single-Image-Deblur,https://arxiv.org/abs/2402.06117,,2402.06117.pdf,Spatially-Attentive Patch-Hierarchical Network with Adaptive Sampling for Motion Deblurring,"This paper tackles the problem of motion deblurring of dynamic scenes. +Although end-to-end fully convolutional designs have recently advanced the +state-of-the-art in non-uniform motion deblurring, their performance-complexity +trade-off is still sub-optimal. Most existing approaches achieve a large +receptive field by increasing the number of generic convolution layers and +kernel size. 
In this work, we propose a pixel adaptive and feature attentive +design for handling large blur variations across different spatial locations +and process each test image adaptively. We design a content-aware global-local +filtering module that significantly improves performance by considering not +only global dependencies but also by dynamically exploiting neighboring pixel +information. We further introduce a pixel-adaptive non-uniform sampling +strategy that implicitly discovers the difficult-to-restore regions present in +the image and, in turn, performs fine-grained refinement in a progressive +manner. Extensive qualitative and quantitative comparisons with prior art on +deblurring benchmarks demonstrate that our approach performs favorably against +the state-of-the-art deblurring algorithms.",cs.CV,['cs.CV'] +E-GPS: Explainable Geometry Problem Solving via Top-Down Solver and Bottom-Up Generator,Wenjun Wu · Lingling Zhang · Jun Liu · Xi Tang · Yaxian Wang · Shaowei Wang · QianYing Wang, ,https://arxiv.org/abs/2401.16287,,2401.16287.pdf,GAPS: Geometry-Aware Problem Solver,"Geometry problem solving presents a formidable challenge within the NLP +community. Existing approaches often rely on models designed for solving math +word problems, neglecting the unique characteristics of geometry math problems. +Additionally, the current research predominantly focuses on geometry +calculation problems, while overlooking other essential aspects like proving. +In this study, we address these limitations by proposing the Geometry-Aware +Problem Solver (GAPS) model. GAPS is specifically designed to generate solution +programs for geometry math problems of various types with the help of its +unique problem-type classifier. To achieve this, GAPS treats the solution +program as a composition of operators and operands, segregating their +generation processes. Furthermore, we introduce the geometry elements +enhancement method, which enhances the ability of GAPS to recognize geometry +elements accurately. By leveraging these improvements, GAPS showcases +remarkable performance in resolving geometry math problems. Our experiments +conducted on the UniGeo dataset demonstrate the superiority of GAPS over the +state-of-the-art model, Geoformer. Specifically, GAPS achieves an accuracy +improvement of more than 5.3% for calculation tasks and an impressive 41.1% for +proving tasks. Notably, GAPS achieves an impressive accuracy of 97.5% on +proving problems, representing a significant advancement in solving geometry +proving tasks.",cs.AI,"['cs.AI', 'cs.CL']" +IS-Fusion: Instance-Scene Collaborative Fusion for Multimodal 3D Object Detection,Junbo Yin · Wenguan Wang · Runnan Chen · Wei Li · Ruigang Yang · Pascal Frossard · Jianbing Shen,https://github.com/yinjunbo/IS-Fusion,https://arxiv.org/abs/2403.15241,,2403.15241.pdf,IS-Fusion: Instance-Scene Collaborative Fusion for Multimodal 3D Object Detection,"Bird's eye view (BEV) representation has emerged as a dominant solution for +describing 3D space in autonomous driving scenarios. However, objects in the +BEV representation typically exhibit small sizes, and the associated point +cloud context is inherently sparse, which leads to great challenges for +reliable 3D perception. In this paper, we propose IS-Fusion, an innovative +multimodal fusion framework that jointly captures the Instance- and Scene-level +contextual information. 
IS-Fusion essentially differs from existing approaches +that only focus on the BEV scene-level fusion by explicitly incorporating +instance-level multimodal information, thus facilitating the instance-centric +tasks like 3D object detection. It comprises a Hierarchical Scene Fusion (HSF) +module and an Instance-Guided Fusion (IGF) module. HSF applies Point-to-Grid +and Grid-to-Region transformers to capture the multimodal scene context at +different granularities. IGF mines instance candidates, explores their +relationships, and aggregates the local multimodal context for each instance. +These instances then serve as guidance to enhance the scene feature and yield +an instance-aware BEV representation. On the challenging nuScenes benchmark, +IS-Fusion outperforms all the published multimodal works to date. Code is +available at: https://github.com/yinjunbo/IS-Fusion.",cs.CV,['cs.CV'] +Open-Vocabulary Semantic Segmentation with Image Embedding Balancing,Xiangheng Shan · Dongyue Wu · Guilin Zhu · Yuanjie Shao · Nong Sang · Changxin Gao, ,https://arxiv.org/abs/2312.04089,,2312.04089.pdf,Open-Vocabulary Segmentation with Semantic-Assisted Calibration,"This paper studies open-vocabulary segmentation (OVS) through calibrating +in-vocabulary and domain-biased embedding space with generalized contextual +prior of CLIP. As the core of open-vocabulary understanding, alignment of +visual content with the semantics of unbounded text has become the bottleneck +of this field. To address this challenge, recent works propose to utilize CLIP +as an additional classifier and aggregate model predictions with CLIP +classification results. Despite their remarkable progress, performance of OVS +methods in relevant scenarios is still unsatisfactory compared with supervised +counterparts. We attribute this to the in-vocabulary embedding and +domain-biased CLIP prediction. To this end, we present a Semantic-assisted +CAlibration Network (SCAN). In SCAN, we incorporate generalized semantic prior +of CLIP into proposal embedding to avoid collapsing on known categories. +Besides, a contextual shift strategy is applied to mitigate the lack of global +context and unnatural background noise. With above designs, SCAN achieves +state-of-the-art performance on all popular open-vocabulary segmentation +benchmarks. Furthermore, we also focus on the problem of existing evaluation +system that ignores semantic duplication across categories, and propose a new +metric called Semantic-Guided IoU (SG-IoU).",cs.CV,['cs.CV'] +Doubly Abductive Counterfactual Inference for Text-based Image Editing,Xue Song · Jiequan Cui · Hanwang Zhang · Jingjing Chen · Richang Hong · Yu-Gang Jiang,https://github.com/xuesong39/DAC,https://arxiv.org/abs/2403.02981,,2403.02981.pdf,Doubly Abductive Counterfactual Inference for Text-based Image Editing,"We study text-based image editing (TBIE) of a single image by counterfactual +inference because it is an elegant formulation to precisely address the +requirement: the edited image should retain the fidelity of the original one. +Through the lens of the formulation, we find that the crux of TBIE is that +existing techniques hardly achieve a good trade-off between editability and +fidelity, mainly due to the overfitting of the single-image fine-tuning. To +this end, we propose a Doubly Abductive Counterfactual inference framework +(DAC). We first parameterize an exogenous variable as a UNet LoRA, whose +abduction can encode all the image details. 
Second, we abduct another exogenous +variable parameterized by a text encoder LoRA, which recovers the lost +editability caused by the overfitted first abduction. Thanks to the second +abduction, which exclusively encodes the visual transition from post-edit to +pre-edit, its inversion -- subtracting the LoRA -- effectively reverts pre-edit +back to post-edit, thereby accomplishing the edit. Through extensive +experiments, our DAC achieves a good trade-off between editability and +fidelity. Thus, we can support a wide spectrum of user editing intents, +including addition, removal, manipulation, replacement, style transfer, and +facial change, which are extensively validated in both qualitative and +quantitative evaluations. Codes are in https://github.com/xuesong39/DAC.",cs.CV,['cs.CV'] +SfmCAD: Unsupervised CAD Reconstruction by Learning Sketch-based Feature Modeling Operations,Pu Li · Jianwei Guo · HUIBIN LI · Bedrich Benes · Dong-Ming Yan, ,https://ar5iv.labs.arxiv.org/html/2303.10613,,2303.10613.pdf,SECAD-Net: Self-Supervised CAD Reconstruction by Learning Sketch-Extrude Operations,"Reverse engineering CAD models from raw geometry is a classic but strenuous +research problem. Previous learning-based methods rely heavily on labels due to +the supervised design patterns or reconstruct CAD shapes that are not easily +editable. In this work, we introduce SECAD-Net, an end-to-end neural network +aimed at reconstructing compact and easy-to-edit CAD models in a +self-supervised manner. Drawing inspiration from the modeling language that is +most commonly used in modern CAD software, we propose to learn 2D sketches and +3D extrusion parameters from raw shapes, from which a set of extrusion +cylinders can be generated by extruding each sketch from a 2D plane into a 3D +body. By incorporating the Boolean operation (i.e., union), these cylinders can +be combined to closely approximate the target geometry. We advocate the use of +implicit fields for sketch representation, which allows for creating CAD +variations by interpolating latent codes in the sketch latent space. Extensive +experiments on both ABC and Fusion 360 datasets demonstrate the effectiveness +of our method, and show superiority over state-of-the-art alternatives +including the closely related method for supervised CAD reconstruction. We +further apply our approach to CAD editing and single-view CAD reconstruction. +The code is released at https://github.com/BunnySoCrazy/SECAD-Net.",cs.CV,"['cs.CV', 'cs.GR', 'cs.LG']" +Grounding and Enhancing Grid-based Models for Neural Fields,Zelin Zhao · FENGLEI FAN · Wenlong Liao · Junchi Yan,https://sites.google.com/view/cvpr24-2034-submission/home,https://arxiv.org/abs/2403.20002,,2403.20002.pdf,Grounding and Enhancing Grid-based Models for Neural Fields,"Many contemporary studies utilize grid-based models for neural field +representation, but a systematic analysis of grid-based models is still +missing, hindering the improvement of those models. Therefore, this paper +introduces a theoretical framework for grid-based models. This framework points +out that these models' approximation and generalization behaviors are +determined by grid tangent kernels (GTK), which are intrinsic properties of +grid-based models. The proposed framework facilitates a consistent and +systematic analysis of diverse grid-based models. Furthermore, the introduced +framework motivates the development of a novel grid-based model named the +Multiplicative Fourier Adaptive Grid (MulFAGrid). 
The numerical analysis +demonstrates that MulFAGrid exhibits a lower generalization bound than its +predecessors, indicating its robust generalization performance. Empirical +studies reveal that MulFAGrid achieves state-of-the-art performance in various +tasks, including 2D image fitting, 3D signed distance field (SDF) +reconstruction, and novel view synthesis, demonstrating superior representation +ability. The project website is available at +https://sites.google.com/view/cvpr24-2034-submission/home.",cs.CV,['cs.CV'] +Language Model Guided Interpretable Video Action Reasoning,Ning Wang · Guangming Zhu · Hongsheng Li · Liang Zhang · Syed Afaq Ali Shah · Mohammed Bennamoun, ,https://arxiv.org/abs/2404.01591,,2404.01591.pdf,Language Model Guided Interpretable Video Action Reasoning,"While neural networks have excelled in video action recognition tasks, their +black-box nature often obscures the understanding of their decision-making +processes. Recent approaches used inherently interpretable models to analyze +video actions in a manner akin to human reasoning. These models, however, +usually fall short in performance compared to their black-box counterparts. In +this work, we present a new framework named Language-guided Interpretable +Action Recognition framework (LaIAR). LaIAR leverages knowledge from language +models to enhance both the recognition capabilities and the interpretability of +video models. In essence, we redefine the problem of understanding video model +decisions as a task of aligning video and language models. Using the logical +reasoning captured by the language model, we steer the training of the video +model. This integrated approach not only improves the video model's +adaptability to different domains but also boosts its overall performance. +Extensive experiments on two complex video action datasets, Charades & CAD-120, +validates the improved performance and interpretability of our LaIAR framework. +The code of LaIAR is available at https://github.com/NingWang2049/LaIAR.",cs.CV,['cs.CV'] +4D-fy: Text-to-4D Generation Using Hybrid Score Distillation Sampling,Sherwin Bahmani · Ivan Skorokhodov · Victor Rong · Gordon Wetzstein · Leonidas Guibas · Peter Wonka · Sergey Tulyakov · Jeong Joon Park · Andrea Tagliasacchi · David B. Lindell,https://sherwinbahmani.github.io/4dfy,https://arxiv.org/abs/2311.17984,,2311.17984.pdf,4D-fy: Text-to-4D Generation Using Hybrid Score Distillation Sampling,"Recent breakthroughs in text-to-4D generation rely on pre-trained +text-to-image and text-to-video models to generate dynamic 3D scenes. However, +current text-to-4D methods face a three-way tradeoff between the quality of +scene appearance, 3D structure, and motion. For example, text-to-image models +and their 3D-aware variants are trained on internet-scale image datasets and +can be used to produce scenes with realistic appearance and 3D structure -- but +no motion. Text-to-video models are trained on relatively smaller video +datasets and can produce scenes with motion, but poorer appearance and 3D +structure. While these models have complementary strengths, they also have +opposing weaknesses, making it difficult to combine them in a way that +alleviates this three-way tradeoff. Here, we introduce hybrid score +distillation sampling, an alternating optimization procedure that blends +supervision signals from multiple pre-trained diffusion models and incorporates +benefits of each for high-fidelity text-to-4D generation. 
Using hybrid SDS, we +demonstrate synthesis of 4D scenes with compelling appearance, 3D structure, +and motion.",cs.CV,['cs.CV'] +Single-Model and Any-Modality for Video Object Tracking,Zongwei Wu · Jilai Zheng · Xiangxuan Ren · Florin-Alexandru Vasluianu · Chao Ma · Danda Paudel · Luc Van Gool · Radu Timofte,https://github.com/Zongwei97/UnTrack,https://arxiv.org/abs/2311.15851,,2311.15851.pdf,Single-Model and Any-Modality for Video Object Tracking,"In the realm of video object tracking, auxiliary modalities such as depth, +thermal, or event data have emerged as valuable assets to complement the RGB +trackers. In practice, most existing RGB trackers learn a single set of +parameters to use them across datasets and applications. However, a similar +single-model unification for multi-modality tracking presents several +challenges. These challenges stem from the inherent heterogeneity of inputs -- +each with modality-specific representations, the scarcity of multi-modal +datasets, and the absence of all the modalities at all times. In this work, we +introduce Un-Track, a Unified Tracker of a single set of parameters for any +modality. To handle any modality, our method learns their common latent space +through low-rank factorization and reconstruction techniques. More importantly, +we use only the RGB-X pairs to learn the common latent space. This unique +shared representation seamlessly binds all modalities together, enabling +effective unification and accommodating any missing modality, all within a +single transformer-based architecture. Our Un-Track achieves +8.1 absolute +F-score gain, on the DepthTrack dataset, by introducing only +2.14 (over 21.50) +GFLOPs with +6.6M (over 93M) parameters, through a simple yet efficient +prompting strategy. Extensive comparisons on five benchmark datasets with +different modalities show that Un-Track surpasses both SOTA unified trackers +and modality-specific counterparts, validating our effectiveness and +practicality. The source code is publicly available at +https://github.com/Zongwei97/UnTrack.",cs.CV,['cs.CV'] +Text-IF: Leveraging Semantic Text Guidance for Degradation-Aware and Interactive Image Fusion,Xunpeng Yi · Han Xu · HAO ZHANG · Linfeng Tang · Jiayi Ma, ,https://arxiv.org/abs/2403.16387,,2403.16387.pdf,Text-IF: Leveraging Semantic Text Guidance for Degradation-Aware and Interactive Image Fusion,"Image fusion aims to combine information from different source images to +create a comprehensively representative image. Existing fusion methods are +typically helpless in dealing with degradations in low-quality source images +and non-interactive to multiple subjective and objective needs. To solve them, +we introduce a novel approach that leverages semantic text guidance image +fusion model for degradation-aware and interactive image fusion task, termed as +Text-IF. It innovatively extends the classical image fusion to the text guided +image fusion along with the ability to harmoniously address the degradation and +interaction issues during fusion. Through the text semantic encoder and +semantic interaction fusion decoder, Text-IF is accessible to the all-in-one +infrared and visible image degradation-aware processing and the interactive +flexible fusion outcomes. In this way, Text-IF achieves not only multi-modal +image fusion, but also multi-modal information fusion. 
Extensive experiments +prove that our proposed text guided image fusion strategy has obvious +advantages over SOTA methods in the image fusion performance and degradation +treatment. The code is available at https://github.com/XunpengYi/Text-IF.",cs.CV,['cs.CV'] +TokenHMR: Advancing Human Mesh Recovery with a Tokenized Pose Representation,Sai Kumar Dwivedi · Yu Sun · Priyanka Patel · Yao Feng · Michael J. Black,https://tokenhmr.is.tue.mpg.de/,https://arxiv.org/abs/2404.16752,,2404.16752.pdf,TokenHMR: Advancing Human Mesh Recovery with a Tokenized Pose Representation,"We address the problem of regressing 3D human pose and shape from a single +image, with a focus on 3D accuracy. The current best methods leverage large +datasets of 3D pseudo-ground-truth (p-GT) and 2D keypoints, leading to robust +performance. With such methods, we observe a paradoxical decline in 3D pose +accuracy with increasing 2D accuracy. This is caused by biases in the p-GT and +the use of an approximate camera projection model. We quantify the error +induced by current camera models and show that fitting 2D keypoints and p-GT +accurately causes incorrect 3D poses. Our analysis defines the invalid +distances within which minimizing 2D and p-GT losses is detrimental. We use +this to formulate a new loss Threshold-Adaptive Loss Scaling (TALS) that +penalizes gross 2D and p-GT losses but not smaller ones. With such a loss, +there are many 3D poses that could equally explain the 2D evidence. To reduce +this ambiguity we need a prior over valid human poses but such priors can +introduce unwanted bias. To address this, we exploit a tokenized representation +of human pose and reformulate the problem as token prediction. This restricts +the estimated poses to the space of valid poses, effectively providing a +uniform prior. Extensive experiments on the EMDB and 3DPW datasets show that +our reformulated keypoint loss and tokenization allows us to train on +in-the-wild data while improving 3D accuracy over the state-of-the-art. Our +models and code are available for research at https://tokenhmr.is.tue.mpg.de.",cs.CV,['cs.CV'] +Unifying Top-down and Bottom-up Scanpath Prediction using Transformers,Zhibo Yang · Sounak Mondal · Seoyoung Ahn · Ruoyu Xue · Gregory Zelinsky · Minh Hoai · Dimitris Samaras,https://github.com/cvlab-stonybrook/HAT,https://arxiv.org/html/2303.09383v3,,2303.09383v3.pdf,Unifying Top-down and Bottom-up Scanpath Prediction Using Transformers,"Most models of visual attention aim at predicting either top-down or +bottom-up control, as studied using different visual search and free-viewing +tasks. In this paper we propose the Human Attention Transformer (HAT), a single +model that predicts both forms of attention control. HAT uses a novel +transformer-based architecture and a simplified foveated retina that +collectively create a spatio-temporal awareness akin to the dynamic visual +working memory of humans. HAT not only establishes a new state-of-the-art in +predicting the scanpath of fixations made during target-present and +target-absent visual search and ``taskless'' free viewing, but also makes human +gaze behavior interpretable. Unlike previous methods that rely on a coarse grid +of fixation cells and experience information loss due to fixation +discretization, HAT features a sequential dense prediction architecture and +outputs a dense heatmap for each fixation, thus avoiding discretizing +fixations. 
HAT sets a new standard in computational attention, which emphasizes +effectiveness, generality, and interpretability. HAT's demonstrated scope and +applicability will likely inspire the development of new attention models that +can better predict human behavior in various attention-demanding scenarios. +Code is available at https://github.com/cvlab-stonybrook/HAT.",cs.CV,"['cs.CV', 'cs.AI']" +Dr2Net: Dynamic Reversible Dual-Residual Networks for Memory-Efficient Finetuning,Chen Zhao · Shuming Liu · Karttikeya Mangalam · Guocheng Qian · Fatimah Zohra · Abdulmohsen Alghannam · Jitendra Malik · Bernard Ghanem, ,https://arxiv.org/abs/2401.04105,,2401.04105.pdf,Dr$^2$Net: Dynamic Reversible Dual-Residual Networks for Memory-Efficient Finetuning,"Large pretrained models are increasingly crucial in modern computer vision +tasks. These models are typically used in downstream tasks by end-to-end +finetuning, which is highly memory-intensive for tasks with high-resolution +data, e.g., video understanding, small object detection, and point cloud +analysis. In this paper, we propose Dynamic Reversible Dual-Residual Networks, +or Dr$^2$Net, a novel family of network architectures that acts as a surrogate +network to finetune a pretrained model with substantially reduced memory +consumption. Dr$^2$Net contains two types of residual connections, one +maintaining the residual structure in the pretrained models, and the other +making the network reversible. Due to its reversibility, intermediate +activations, which can be reconstructed from output, are cleared from memory +during training. We use two coefficients on either type of residual connections +respectively, and introduce a dynamic training strategy that seamlessly +transitions the pretrained model to a reversible network with much higher +numerical precision. We evaluate Dr$^2$Net on various pretrained models and +various tasks, and show that it can reach comparable performance to +conventional finetuning but with significantly less memory usage.",cs.CV,"['cs.CV', 'cs.AI']" +SparseOcc: Rethinking Sparse Latent Representation for Vision-Based Semantic Occupancy Prediction,Pin Tang · Zhongdao Wang · Guoqing Wang · Jilai Zheng · Xiangxuan Ren · Bailan Feng · Chao Ma, ,https://arxiv.org/abs/2404.09502,,2404.09502.pdf,SparseOcc: Rethinking Sparse Latent Representation for Vision-Based Semantic Occupancy Prediction,"Vision-based perception for autonomous driving requires an explicit modeling +of a 3D space, where 2D latent representations are mapped and subsequent 3D +operators are applied. However, operating on dense latent spaces introduces a +cubic time and space complexity, which limits scalability in terms of +perception range or spatial resolution. Existing approaches compress the dense +representation using projections like Bird's Eye View (BEV) or Tri-Perspective +View (TPV). Although efficient, these projections result in information loss, +especially for tasks like semantic occupancy prediction. To address this, we +propose SparseOcc, an efficient occupancy network inspired by sparse point +cloud processing. It utilizes a lossless sparse latent representation with +three key innovations. Firstly, a 3D sparse diffuser performs latent completion +using spatially decomposed 3D sparse convolutional kernels. Secondly, a feature +pyramid and sparse interpolation enhance scales with information from others. +Finally, the transformer head is redesigned as a sparse variant. 
SparseOcc +achieves a remarkable 74.9% reduction on FLOPs over the dense baseline. +Interestingly, it also improves accuracy, from 12.8% to 14.1% mIOU, which in +part can be attributed to the sparse representation's ability to avoid +hallucinations on empty voxels.",cs.CV,['cs.CV'] +4D Gaussian Splatting for Real-Time Dynamic Scene Rendering,Guanjun Wu · Taoran Yi · Jiemin Fang · Lingxi Xie · Xiaopeng Zhang · Wei Wei · Wenyu Liu · Qi Tian · Xinggang Wang,guanjunwu.github.io/4dgs,https://arxiv.org/abs/2310.08528,,2310.08528.pdf,4D Gaussian Splatting for Real-Time Dynamic Scene Rendering,"Representing and rendering dynamic scenes has been an important but +challenging task. Especially, to accurately model complex motions, high +efficiency is usually hard to guarantee. To achieve real-time dynamic scene +rendering while also enjoying high training and storage efficiency, we propose +4D Gaussian Splatting (4D-GS) as a holistic representation for dynamic scenes +rather than applying 3D-GS for each individual frame. In 4D-GS, a novel +explicit representation containing both 3D Gaussians and 4D neural voxels is +proposed. A decomposed neural voxel encoding algorithm inspired by HexPlane is +proposed to efficiently build Gaussian features from 4D neural voxels and then +a lightweight MLP is applied to predict Gaussian deformations at novel +timestamps. Our 4D-GS method achieves real-time rendering under high +resolutions, 82 FPS at an 800$\times$800 resolution on an RTX 3090 GPU while +maintaining comparable or better quality than previous state-of-the-art +methods. More demos and code are available at +https://guanjunwu.github.io/4dgs/.",cs.CV,"['cs.CV', 'cs.GR']" +Open-Set Domain Adaptation for Semantic Segmentation,Seun-An Choe · Ah-Hyung Shin · Keon Hee Park · Jinwoo Choi · Gyeong-Moon Park, ,https://arxiv.org/abs/2405.19899,,2405.19899.pdf,Open-Set Domain Adaptation for Semantic Segmentation,"Unsupervised domain adaptation (UDA) for semantic segmentation aims to +transfer the pixel-wise knowledge from the labeled source domain to the +unlabeled target domain. However, current UDA methods typically assume a shared +label space between source and target, limiting their applicability in +real-world scenarios where novel categories may emerge in the target domain. In +this paper, we introduce Open-Set Domain Adaptation for Semantic Segmentation +(OSDA-SS) for the first time, where the target domain includes unknown classes. +We identify two major problems in the OSDA-SS scenario as follows: 1) the +existing UDA methods struggle to predict the exact boundary of the unknown +classes, and 2) they fail to accurately predict the shape of the unknown +classes. To address these issues, we propose Boundary and Unknown Shape-Aware +open-set domain adaptation, coined BUS. Our BUS can accurately discern the +boundaries between known and unknown classes in a contrastive manner using a +novel dilation-erosion-based contrastive loss. In addition, we propose +OpenReMix, a new domain mixing augmentation method that guides our model to +effectively learn domain and size-invariant features for improving the shape +detection of the known and unknown classes. Through extensive experiments, we +demonstrate that our proposed BUS effectively detects unknown classes in the +challenging OSDA-SS scenario compared to the previous methods by a large +margin. 
The code is available at https://github.com/KHU-AGI/BUS.",cs.CV,"['cs.CV', 'cs.AI']" +Content-Adaptive Non-Local Convolution for Remote Sensing Pansharpening,Yule Duan · Xiao Wu · Haoyu Deng · Liang-Jian Deng,https://github.com/Duanyll/CANConv,https://arxiv.org/abs/2404.07543,,2404.07543.pdf,Content-Adaptive Non-Local Convolution for Remote Sensing Pansharpening,"Currently, machine learning-based methods for remote sensing pansharpening +have progressed rapidly. However, existing pansharpening methods often do not +fully exploit differentiating regional information in non-local spaces, thereby +limiting the effectiveness of the methods and resulting in redundant learning +parameters. In this paper, we introduce a so-called content-adaptive non-local +convolution (CANConv), a novel method tailored for remote sensing image +pansharpening. Specifically, CANConv employs adaptive convolution, ensuring +spatial adaptability, and incorporates non-local self-similarity through the +similarity relationship partition (SRP) and the partition-wise adaptive +convolution (PWAC) sub-modules. Furthermore, we also propose a corresponding +network architecture, called CANNet, which mainly utilizes the multi-scale +self-similarity. Extensive experiments demonstrate the superior performance of +CANConv, compared with recent promising fusion methods. Besides, we +substantiate the method's effectiveness through visualization, ablation +experiments, and comparison with existing methods on multiple test sets. The +source code is publicly available at https://github.com/duanyll/CANConv.",cs.CV,"['cs.CV', 'eess.IV']" +GSVA: Generalized Segmentation via Multimodal Large Language Models,Zhuofan Xia · Dongchen Han · Yizeng Han · Xuran Pan · Shiji Song · Gao Huang,https://github.com/LeapLabTHU/GSVA,https://arxiv.org/abs/2312.10103,,2312.10103.pdf,GSVA: Generalized Segmentation via Multimodal Large Language Models,"Generalized Referring Expression Segmentation (GRES) extends the scope of +classic RES to refer to multiple objects in one expression or identify the +empty targets absent in the image. GRES poses challenges in modeling the +complex spatial relationships of the instances in the image and identifying +non-existing referents. Multimodal Large Language Models (MLLMs) have recently +shown tremendous progress in these complicated vision-language tasks. +Connecting Large Language Models (LLMs) and vision models, MLLMs are proficient +in understanding contexts with visual inputs. Among them, LISA, as a +representative, adopts a special [SEG] token to prompt a segmentation mask +decoder, e.g., SAM, to enable MLLMs in the RES task. However, existing +solutions to GRES remain unsatisfactory since current segmentation MLLMs cannot +correctly handle the cases where users might reference multiple subjects in a +singular prompt or provide descriptions incongruent with any image target. In +this paper, we propose Generalized Segmentation Vision Assistant (GSVA) to +address this gap. Specifically, GSVA reuses the [SEG] token to prompt the +segmentation model towards supporting multiple mask references simultaneously +and innovatively learns to generate a [REJ] token to reject the null targets +explicitly. Experiments validate GSVA's efficacy in resolving the GRES issue, +marking a notable enhancement and setting a new record on the GRES benchmark +gRefCOCO dataset. 
GSVA also proves effective across various classic referring +segmentation and comprehension tasks.",cs.CV,['cs.CV'] +S2MAE: A Spatial-Spectral Pretraining Foundation Model for Spectral Remote Sensing Data,Xuyang Li · Danfeng Hong · Jocelyn Chanussot, ,https://arxiv.org/abs/2311.07113,,2311.07113.pdf,SpectralGPT: Spectral Remote Sensing Foundation Model,"The foundation model has recently garnered significant attention due to its +potential to revolutionize the field of visual representation learning in a +self-supervised manner. While most foundation models are tailored to +effectively process RGB images for various visual tasks, there is a noticeable +gap in research focused on spectral data, which offers valuable information for +scene understanding, especially in remote sensing (RS) applications. To fill +this gap, we created for the first time a universal RS foundation model, named +SpectralGPT, which is purpose-built to handle spectral RS images using a novel +3D generative pretrained transformer (GPT). Compared to existing foundation +models, SpectralGPT 1) accommodates input images with varying sizes, +resolutions, time series, and regions in a progressive training fashion, +enabling full utilization of extensive RS big data; 2) leverages 3D token +generation for spatial-spectral coupling; 3) captures spectrally sequential +patterns via multi-target reconstruction; 4) trains on one million spectral RS +images, yielding models with over 600 million parameters. Our evaluation +highlights significant performance improvements with pretrained SpectralGPT +models, signifying substantial potential in advancing spectral RS big data +applications within the field of geoscience across four downstream tasks: +single/multi-label scene classification, semantic segmentation, and change +detection.",cs.CV,['cs.CV'] +PointOBB: Learning Oriented Object Detection via Single Point Supervision,Junwei Luo · Xue Yang · Yi Yu · Qingyun Li · Junchi Yan · Yansheng Li, ,https://arxiv.org/abs/2311.14757,,2311.14757.pdf,PointOBB: Learning Oriented Object Detection via Single Point Supervision,"Single point-supervised object detection is gaining attention due to its +cost-effectiveness. However, existing approaches focus on generating horizontal +bounding boxes (HBBs) while ignoring oriented bounding boxes (OBBs) commonly +used for objects in aerial images. This paper proposes PointOBB, the first +single Point-based OBB generation method, for oriented object detection. +PointOBB operates through the collaborative utilization of three distinctive +views: an original view, a resized view, and a rotated/flipped (rot/flp) view. +Upon the original view, we leverage the resized and rot/flp views to build a +scale augmentation module and an angle acquisition module, respectively. In the +former module, a Scale-Sensitive Consistency (SSC) loss is designed to enhance +the deep network's ability to perceive the object scale. For accurate object +angle predictions, the latter module incorporates self-supervised learning to +predict angles, which is associated with a scale-guided Dense-to-Sparse (DS) +matching strategy for aggregating dense angles corresponding to sparse objects. +The resized and rot/flp views are switched using a progressive multi-view +switching strategy during training to achieve coupled optimization of scale and +angle. 
Experimental results on the DIOR-R and DOTA-v1.0 datasets demonstrate +that PointOBB achieves promising performance, and significantly outperforms +potential point-supervised baselines.",cs.CV,"['cs.CV', 'cs.AI']" +Long-Tail Class Incremental Learning via Independent Sub-prototype Construction,Xi Wang · Xu Yang · Jie Yin · Kun Wei · Cheng Deng, ,https://ar5iv.labs.arxiv.org/html/2210.00266,,2210.00266.pdf,Long-Tailed Class Incremental Learning,"In class incremental learning (CIL) a model must learn new classes in a +sequential manner without forgetting old ones. However, conventional CIL +methods consider a balanced distribution for each new task, which ignores the +prevalence of long-tailed distributions in the real world. In this work we +propose two long-tailed CIL scenarios, which we term ordered and shuffled +LT-CIL. Ordered LT-CIL considers the scenario where we learn from head classes +collected with more samples than tail classes which have few. Shuffled LT-CIL, +on the other hand, assumes a completely random long-tailed distribution for +each task. We systematically evaluate existing methods in both LT-CIL scenarios +and demonstrate very different behaviors compared to conventional CIL +scenarios. Additionally, we propose a two-stage learning baseline with a +learnable weight scaling layer for reducing the bias caused by long-tailed +distribution in LT-CIL and which in turn also improves the performance of +conventional CIL due to the limited exemplars. Our results demonstrate the +superior performance (up to 6.44 points in average incremental accuracy) of our +approach on CIFAR-100 and ImageNet-Subset. The code is available at +https://github.com/xialeiliu/Long-Tailed-CIL",cs.CV,['cs.CV'] +FlowerFormer: Empowering Neural Architecture Encoding using a Flow-aware Graph Transformer,Dongyeong Hwang · Hyunju Kim · Sunwoo Kim · Kijung Shin, ,https://arxiv.org/abs/2403.12821,,2403.12821.pdf,FlowerFormer: Empowering Neural Architecture Encoding using a Flow-aware Graph Transformer,"The success of a specific neural network architecture is closely tied to the +dataset and task it tackles; there is no one-size-fits-all solution. Thus, +considerable efforts have been made to quickly and accurately estimate the +performances of neural architectures, without full training or evaluation, for +given tasks and datasets. Neural architecture encoding has played a crucial +role in the estimation, and graphbased methods, which treat an architecture as +a graph, have shown prominent performance. For enhanced representation learning +of neural architectures, we introduce FlowerFormer, a powerful graph +transformer that incorporates the information flows within a neural +architecture. FlowerFormer consists of two key components: (a) bidirectional +asynchronous message passing, inspired by the flows; (b) global attention built +on flow-based masking. Our extensive experiments demonstrate the superiority of +FlowerFormer over existing neural encoding methods, and its effectiveness +extends beyond computer vision models to include graph neural networks and auto +speech recognition models. 
Our code is available at +http://github.com/y0ngjaenius/CVPR2024_FLOWERFormer.",cs.LG,"['cs.LG', 'cs.AI']" +Convolutional Prompting meets Language Models for Continual Learning,Anurag Roy · Riddhiman Moulick · Vinay Verma · Saptarshi Ghosh · Abir Das,https://cvir.github.io/projects/convprompt.html,https://arxiv.org/abs/2403.20317,,2403.20317.pdf,Convolutional Prompting meets Language Models for Continual Learning,"Continual Learning (CL) enables machine learning models to learn from +continuously shifting new training data in absence of data from old tasks. +Recently, pretrained vision transformers combined with prompt tuning have shown +promise for overcoming catastrophic forgetting in CL. These approaches rely on +a pool of learnable prompts which can be inefficient in sharing knowledge +across tasks leading to inferior performance. In addition, the lack of +fine-grained layer specific prompts does not allow these to fully express the +strength of the prompts for CL. We address these limitations by proposing +ConvPrompt, a novel convolutional prompt creation mechanism that maintains +layer-wise shared embeddings, enabling both layer-specific learning and better +concept transfer across tasks. The intelligent use of convolution enables us to +maintain a low parameter overhead without compromising performance. We further +leverage Large Language Models to generate fine-grained text descriptions of +each category which are used to get task similarity and dynamically decide the +number of prompts to be learned. Extensive experiments demonstrate the +superiority of ConvPrompt and improves SOTA by ~3% with significantly less +parameter overhead. We also perform strong ablation over various modules to +disentangle the importance of different components.",cs.CV,['cs.CV'] +As-Plausible-As-Possible: Plausibility-Aware Mesh Deformation Using 2D Diffusion Priors,Seungwoo Yoo · Kunho Kim · Vladimir G. Kim · Minhyuk Sung, ,https://arxiv.org/abs/2311.16739,,2311.16739.pdf,As-Plausible-As-Possible: Plausibility-Aware Mesh Deformation Using 2D Diffusion Priors,"We present As-Plausible-as-Possible (APAP) mesh deformation technique that +leverages 2D diffusion priors to preserve the plausibility of a mesh under +user-controlled deformation. Our framework uses per-face Jacobians to represent +mesh deformations, where mesh vertex coordinates are computed via a +differentiable Poisson Solve. The deformed mesh is rendered, and the resulting +2D image is used in the Score Distillation Sampling (SDS) process, which +enables extracting meaningful plausibility priors from a pretrained 2D +diffusion model. To better preserve the identity of the edited mesh, we +fine-tune our 2D diffusion model with LoRA. Gradients extracted by SDS and a +user-prescribed handle displacement are then backpropagated to the per-face +Jacobians, and we use iterative gradient descent to compute the final +deformation that balances between the user edit and the output plausibility. We +evaluate our method with 2D and 3D meshes and demonstrate qualitative and +quantitative improvements when using plausibility priors over +geometry-preservation or distortion-minimization priors used by previous +techniques. 
Our project page is at: https://as-plausible-aspossible.github.io/",cs.CV,"['cs.CV', 'cs.GR']" +MR-VNet: Media Restoration using Volterra Networks,Siddharth Roheda · Amit Unde · Loay Rashid, ,,https://ieeexplore.ieee.org/document/10251925,,,,,nan +Low-Latency Neural Stereo Streaming,Qiqi Hou · Farzad Farhadzadeh · Amir Said · Guillaume Sautiere · Hoang Le, ,https://arxiv.org/html/2403.17879v1,,2403.17879v1.pdf,Low-Latency Neural Stereo Streaming,"The rise of new video modalities like virtual reality or autonomous driving +has increased the demand for efficient multi-view video compression methods, +both in terms of rate-distortion (R-D) performance and in terms of delay and +runtime. While most recent stereo video compression approaches have shown +promising performance, they compress left and right views sequentially, leading +to poor parallelization and runtime performance. This work presents Low-Latency +neural codec for Stereo video Streaming (LLSS), a novel parallel stereo video +coding method designed for fast and efficient low-latency stereo video +streaming. Instead of using a sequential cross-view motion compensation like +existing methods, LLSS introduces a bidirectional feature shifting module to +directly exploit mutual information among views and encode them effectively +with a joint cross-view prior model for entropy coding. Thanks to this design, +LLSS processes left and right views in parallel, minimizing latency; all while +substantially improving R-D performance compared to both existing neural and +conventional codecs.",cs.CV,"['cs.CV', 'eess.IV']" +SPECAT: SPatial-spEctral Cumulative-Attention Transformer for High-Resolution Hyperspectral Image Reconstruction,Zhiyang Yao · Shuyang Liu · Xiaoyun Yuan · Lu Fang, ,,https://ieeexplore.ieee.org/document/10463068/,,,,,nan +MRFP: Learning Generalizable Semantic Segmentation from Sim-2-Real with Multi-Resolution Feature Perturbation,Sumanth Udupa · Prajwal Gurunath · Aniruddh Sikdar · Suresh Sundaram,https://arxiv.org/abs/2311.18331,https://arxiv.org/abs/2311.18331v1,,2311.18331v1.pdf,MRFP: Learning Generalizable Semantic Segmentation from Sim-2-Real with Multi-Resolution Feature Perturbation,"Deep neural networks have shown exemplary performance on semantic scene +understanding tasks on source domains, but due to the absence of style +diversity during training, enhancing performance on unseen target domains using +only single source domain data remains a challenging task. Generation of +simulated data is a feasible alternative to retrieving large style-diverse +real-world datasets as it is a cumbersome and budget-intensive process. +However, the large domain-specific inconsistencies between simulated and +real-world data pose a significant generalization challenge in semantic +segmentation. In this work, to alleviate this problem, we propose a novel +MultiResolution Feature Perturbation (MRFP) technique to randomize +domain-specific fine-grained features and perturb style of coarse features. Our +experimental results on various urban-scene segmentation datasets clearly +indicate that, along with the perturbation of style-information, perturbation +of fine-feature components is paramount to learn domain invariant robust +feature maps for semantic segmentation models. 
MRFP is a simple and +computationally efficient, transferable module with no additional learnable +parameters or objective functions, that helps state-of-the-art deep neural +networks to learn robust domain invariant features for simulation-to-real +semantic segmentation.",cs.CV,"['cs.CV', 'cs.AI']" +Theoretically Achieving Continuous Representation of Oriented Bounding Boxes,Zikai Xiao · Guo-Ye Yang · Xue Yang · Tai-Jiang Mu · Junchi Yan · Shi-Min Hu, ,https://arxiv.org/abs/2402.18975v1,,2402.18975v1.pdf,Theoretically Achieving Continuous Representation of Oriented Bounding Boxes,"Considerable efforts have been devoted to Oriented Object Detection (OOD). +However, one lasting issue regarding the discontinuity in Oriented Bounding Box +(OBB) representation remains unresolved, which is an inherent bottleneck for +extant OOD methods. This paper endeavors to completely solve this issue in a +theoretically guaranteed manner and puts an end to the ad-hoc efforts in this +direction. Prior studies typically can only address one of the two cases of +discontinuity: rotation and aspect ratio, and often inadvertently introduce +decoding discontinuity, e.g. Decoding Incompleteness (DI) and Decoding +Ambiguity (DA) as discussed in literature. Specifically, we propose a novel +representation method called Continuous OBB (COBB), which can be readily +integrated into existing detectors e.g. Faster-RCNN as a plugin. It can +theoretically ensure continuity in bounding box regression which to our best +knowledge, has not been achieved in literature for rectangle-based object +representation. For fairness and transparency of experiments, we have developed +a modularized benchmark based on the open-source deep learning framework +Jittor's detection toolbox JDet for OOD evaluation. On the popular DOTA +dataset, by integrating Faster-RCNN as the same baseline model, our new method +outperforms the peer method Gliding Vertex by 1.13% mAP50 (relative improvement +1.54%), and 2.46% mAP75 (relative improvement 5.91%), without any tricks.",cs.CV,"['cs.CV', 'cs.AI']" +MMVP: A Multimodal MoCap Dataset with Vision and Pressure Sensors,He Zhang · Shenghao Ren · Haolei Yuan · Jianhui Zhao · Fan Li · Shuangpeng Sun · Zhenghao Liang · Tao Yu · Qiu Shen · Xun Cao,https://metaverse-ai-lab-thu.github.io/MMVP-Dataset/,https://arxiv.org/abs/2403.17610,,2403.17610.pdf,MMVP: A Multimodal MoCap Dataset with Vision and Pressure Sensors,"Foot contact is an important cue for human motion capture, understanding, and +generation. Existing datasets tend to annotate dense foot contact using visual +matching with thresholding or incorporating pressure signals. However, these +approaches either suffer from low accuracy or are only designed for small-range +and slow motion. There is still a lack of a vision-pressure multimodal dataset +with large-range and fast human motion, as well as accurate and dense +foot-contact annotation. To fill this gap, we propose a Multimodal MoCap +Dataset with Vision and Pressure sensors, named MMVP. MMVP provides accurate +and dense plantar pressure signals synchronized with RGBD observations, which +is especially useful for both plausible shape estimation, robust pose fitting +without foot drifting, and accurate global translation tracking. To validate +the dataset, we propose an RGBD-P SMPL fitting method and also a +monocular-video-based baseline framework, VP-MoCap, for human motion capture. +Experiments demonstrate that our RGBD-P SMPL Fitting results significantly +outperform pure visual motion capture. 
Moreover, VP-MoCap outperforms SOTA +methods in foot-contact and global translation estimation accuracy. We believe +the configuration of the dataset and the baseline frameworks will stimulate the +research in this direction and also provide a good reference for MoCap +applications in various domains. Project page: +https://metaverse-ai-lab-thu.github.io/MMVP-Dataset/.",cs.CV,['cs.CV'] +Learning Correlation Structures for Vision Transformers,Manjin Kim · Paul Hongsuck Seo · Cordelia Schmid · Minsu Cho, ,https://arxiv.org/abs/2404.03924,,2404.03924.pdf,Learning Correlation Structures for Vision Transformers,"We introduce a new attention mechanism, dubbed structural self-attention +(StructSA), that leverages rich correlation patterns naturally emerging in +key-query interactions of attention. StructSA generates attention maps by +recognizing space-time structures of key-query correlations via convolution and +uses them to dynamically aggregate local contexts of value features. This +effectively leverages rich structural patterns in images and videos such as +scene layouts, object motion, and inter-object relations. Using StructSA as a +main building block, we develop the structural vision transformer (StructViT) +and evaluate its effectiveness on both image and video classification tasks, +achieving state-of-the-art results on ImageNet-1K, Kinetics-400, +Something-Something V1 & V2, Diving-48, and FineGym.",cs.CV,['cs.CV'] +Image Restoration by Denoising Diffusion Models With Iteratively Preconditioned Guidance,Tomer Garber · Tom Tirer,https://github.com/tirer-lab/DDPG,https://arxiv.org/abs/2312.16519,,2312.16519.pdf,Image Restoration by Denoising Diffusion Models with Iteratively Preconditioned Guidance,"Training deep neural networks has become a common approach for addressing +image restoration problems. An alternative for training a ""task-specific"" +network for each observation model is to use pretrained deep denoisers for +imposing only the signal's prior within iterative algorithms, without +additional training. Recently, a sampling-based variant of this approach has +become popular with the rise of diffusion/score-based generative models. Using +denoisers for general purpose restoration requires guiding the iterations to +ensure agreement of the signal with the observations. In low-noise settings, +guidance that is based on back-projection (BP) has been shown to be a promising +strategy (used recently also under the names ""pseudoinverse"" or +""range/null-space"" guidance). However, the presence of noise in the +observations hinders the gains from this approach. In this paper, we propose a +novel guidance technique, based on preconditioning that allows traversing from +BP-based guidance to least squares based guidance along the restoration scheme. +The proposed approach is robust to noise while still having much simpler +implementation than alternative methods (e.g., it does not require SVD or a +large number of iterations). 
We use it within both an optimization scheme and a +sampling-based scheme, and demonstrate its advantages over existing methods for +image deblurring and super-resolution.",eess.IV,"['eess.IV', 'cs.CV']" +Resource-Efficient Transformer Pruning for Finetuning of Large Models,Fatih Ilhan · Gong Su · Selim Tekin · Tiansheng Huang · Sihao Hu · Ling Liu, ,https://arxiv.org/abs/2403.14608,,2403.14608.pdf,Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey,"Large models represent a groundbreaking advancement in multiple application +fields, enabling remarkable achievements across various tasks. However, their +unprecedented scale comes with significant computational costs. These models, +often consisting of billions of parameters, require vast amounts of +computational resources for execution. Especially, the expansive scale and +computational demands pose considerable challenges when customizing them for +particular downstream tasks, particularly over the hardware platforms +constrained by computational capabilities. Parameter Efficient Fine-Tuning +(PEFT) provides a practical solution by efficiently adapt the large models over +the various downstream tasks. In particular, PEFT refers to the process of +adjusting the parameters of a pre-trained large models to adapt it to a +specific task while minimizing the number of additional parameters introduced +or computational resources required. This approach is particularly important +when dealing with large language models with high parameter counts, as +fine-tuning these models from scratch can be computationally expensive and +resource-intensive, posing considerable challenges in the supporting system +platform design. In this survey, we present comprehensive studies of various +PEFT algorithms, examining their performance and computational overhead. +Moreover, we provide an overview of applications developed using different PEFT +algorithms and discuss common techniques employed to mitigate computation costs +for PEFT. In addition to the algorithmic perspective, we overview various +real-world system designs to investigate the implementation costs associated +with different PEFT algorithms. This survey serves as an indispensable resource +for researchers aiming to understand both the PEFT algorithm and its system +implementation, offering detailed insights into recent advancements and +practical applications.",cs.LG,['cs.LG'] +HyperSDFusion: Bridging Hierarchical Structures in Language and Geometry for Enhanced 3D Text2Shape Generation,Zhiying Leng · Tolga Birdal · Xiaohui Liang · Federico Tombari, ,https://arxiv.org/abs/2403.00372,,2403.00372.pdf,HyperSDFusion: Bridging Hierarchical Structures in Language and Geometry for Enhanced 3D Text2Shape Generation,"3D shape generation from text is a fundamental task in 3D representation +learning. The text-shape pairs exhibit a hierarchical structure, where a +general text like ``chair"" covers all 3D shapes of the chair, while more +detailed prompts refer to more specific shapes. Furthermore, both text and 3D +shapes are inherently hierarchical structures. However, existing Text2Shape +methods, such as SDFusion, do not exploit that. In this work, we propose +HyperSDFusion, a dual-branch diffusion model that generates 3D shapes from a +given text. Since hyperbolic space is suitable for handling hierarchical data, +we propose to learn the hierarchical representations of text and 3D shapes in +hyperbolic space. 
First, we introduce a hyperbolic text-image encoder to learn +the sequential and multi-modal hierarchical features of text in hyperbolic +space. In addition, we design a hyperbolic text-graph convolution module to +learn the hierarchical features of text in hyperbolic space. In order to fully +utilize these text features, we introduce a dual-branch structure to embed text +features in 3D feature space. At last, to endow the generated 3D shapes with a +hierarchical structure, we devise a hyperbolic hierarchical loss. Our method is +the first to explore the hyperbolic hierarchical representation for +text-to-shape generation. Experimental results on the existing text-to-shape +paired dataset, Text2Shape, achieved state-of-the-art results. We release our +implementation under HyperSDFusion.github.io.",cs.CV,['cs.CV'] +Condition-Aware Neural Network for Controlled Image Generation,Han Cai · Muyang Li · Qinsheng Zhang · Ming-Yu Liu · Song Han, ,https://arxiv.org/abs/2404.01143,,2404.01143.pdf,Condition-Aware Neural Network for Controlled Image Generation,"We present Condition-Aware Neural Network (CAN), a new method for adding +control to image generative models. In parallel to prior conditional control +methods, CAN controls the image generation process by dynamically manipulating +the weight of the neural network. This is achieved by introducing a +condition-aware weight generation module that generates conditional weight for +convolution/linear layers based on the input condition. We test CAN on +class-conditional image generation on ImageNet and text-to-image generation on +COCO. CAN consistently delivers significant improvements for diffusion +transformer models, including DiT and UViT. In particular, CAN combined with +EfficientViT (CaT) achieves 2.78 FID on ImageNet 512x512, surpassing DiT-XL/2 +while requiring 52x fewer MACs per sampling step.",cs.CV,"['cs.CV', 'cs.AI']" +TULIP: Transformer for Upsampling of LiDAR Point Cloud,Bin Yang · Patrick Pfreundschuh · Roland Siegwart · Marco Hutter · Peyman Moghadam · Vaishakh Patil,https://github.com/ethz-asl/TULIP,https://arxiv.org/abs/2312.06733,,2312.06733.pdf,TULIP: Transformer for Upsampling of LiDAR Point Clouds,"LiDAR Upsampling is a challenging task for the perception systems of robots +and autonomous vehicles, due to the sparse and irregular structure of +large-scale scene contexts. Recent works propose to solve this problem by +converting LiDAR data from 3D Euclidean space into an image super-resolution +problem in 2D image space. Although their methods can generate high-resolution +range images with fine-grained details, the resulting 3D point clouds often +blur out details and predict invalid points. In this paper, we propose TULIP, a +new method to reconstruct high-resolution LiDAR point clouds from +low-resolution LiDAR input. We also follow a range image-based approach but +specifically modify the patch and window geometries of a Swin-Transformer-based +network to better fit the characteristics of range images. We conducted several +experiments on three public real-world and simulated datasets. 
TULIP +outperforms state-of-the-art methods in all relevant metrics and generates +robust and more realistic point clouds than prior works.",cs.CV,['cs.CV'] +PARA-Drive: Parallelized Architecture for Real-time Autonomous Driving,Xinshuo Weng · Boris Ivanovic · Yan Wang · Yue Wang · Marco Pavone, ,https://arxiv.org/abs/2311.02077,,2311.02077.pdf,EmerNeRF: Emergent Spatial-Temporal Scene Decomposition via Self-Supervision,"We present EmerNeRF, a simple yet powerful approach for learning +spatial-temporal representations of dynamic driving scenes. Grounded in neural +fields, EmerNeRF simultaneously captures scene geometry, appearance, motion, +and semantics via self-bootstrapping. EmerNeRF hinges upon two core components: +First, it stratifies scenes into static and dynamic fields. This decomposition +emerges purely from self-supervision, enabling our model to learn from general, +in-the-wild data sources. Second, EmerNeRF parameterizes an induced flow field +from the dynamic field and uses this flow field to further aggregate +multi-frame features, amplifying the rendering precision of dynamic objects. +Coupling these three fields (static, dynamic, and flow) enables EmerNeRF to +represent highly-dynamic scenes self-sufficiently, without relying on ground +truth object annotations or pre-trained models for dynamic object segmentation +or optical flow estimation. Our method achieves state-of-the-art performance in +sensor simulation, significantly outperforming previous methods when +reconstructing static (+2.93 PSNR) and dynamic (+3.70 PSNR) scenes. In +addition, to bolster EmerNeRF's semantic generalization, we lift 2D visual +foundation model features into 4D space-time and address a general positional +bias in modern Transformers, significantly boosting 3D perception performance +(e.g., 37.50% relative improvement in occupancy prediction accuracy on +average). Finally, we construct a diverse and challenging 120-sequence dataset +to benchmark neural fields under extreme and highly-dynamic settings.",cs.CV,['cs.CV'] +Driving Everywhere with Large Language Model Policy Adaptation,Boyi Li · Yue Wang · Jiageng Mao · Boris Ivanovic · Sushant Veer · Karen Leung · Marco Pavone, ,https://arxiv.org/abs/2402.05932,,2402.05932.pdf,Driving Everywhere with Large Language Model Policy Adaptation,"Adapting driving behavior to new environments, customs, and laws is a +long-standing problem in autonomous driving, precluding the widespread +deployment of autonomous vehicles (AVs). In this paper, we present LLaDA, a +simple yet powerful tool that enables human drivers and autonomous vehicles +alike to drive everywhere by adapting their tasks and motion plans to traffic +rules in new locations. LLaDA achieves this by leveraging the impressive +zero-shot generalizability of large language models (LLMs) in interpreting the +traffic rules in the local driver handbook. Through an extensive user study, we +show that LLaDA's instructions are useful in disambiguating in-the-wild +unexpected situations. We also demonstrate LLaDA's ability to adapt AV motion +planning policies in real-world datasets; LLaDA outperforms baseline planning +approaches on all our metrics. 
Please check our website for more details: +https://boyiliee.github.io/llada.",cs.RO,"['cs.RO', 'cs.AI', 'cs.CL']" +Semantically-Shifted Incremental Adapter-Tuning is A Continual ViTransformer,Yuwen Tan · Qinhao Zhou · Xiang Xiang · Ke Wang · Yuchuan Wu · Yongbin Li, ,https://arxiv.org/abs/2403.19979,,2403.19979.pdf,Semantically-Shifted Incremental Adapter-Tuning is A Continual ViTransformer,"Class-incremental learning (CIL) aims to enable models to continuously learn +new classes while overcoming catastrophic forgetting. The introduction of +pre-trained models has brought new tuning paradigms to CIL. In this paper, we +revisit different parameter-efficient tuning (PET) methods within the context +of continual learning. We observe that adapter tuning demonstrates superiority +over prompt-based methods, even without parameter expansion in each learning +session. Motivated by this, we propose incrementally tuning the shared adapter +without imposing parameter update constraints, enhancing the learning capacity +of the backbone. Additionally, we employ feature sampling from stored +prototypes to retrain a unified classifier, further improving its performance. +We estimate the semantic shift of old prototypes without access to past samples +and update stored prototypes session by session. Our proposed method eliminates +model expansion and avoids retaining any image samples. It surpasses previous +pre-trained model-based CIL methods and demonstrates remarkable continual +learning capabilities. Experimental results on five CIL benchmarks validate the +effectiveness of our approach, achieving state-of-the-art (SOTA) performance.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +BilevelPruning: Unified Dynamic and Static Channel Pruning for Convolutional Neural Networks,Shangqian Gao · Yanfu Zhang · Feihu Huang · Heng Huang, ,https://arxiv.org/abs/2402.17862v1,,2402.17862v1.pdf,REPrune: Channel Pruning via Kernel Representative Selection,"Channel pruning is widely accepted to accelerate modern convolutional neural +networks (CNNs). The resulting pruned model benefits from its immediate +deployment on general-purpose software and hardware resources. However, its +large pruning granularity, specifically at the unit of a convolution filter, +often leads to undesirable accuracy drops due to the inflexibility of deciding +how and where to introduce sparsity to the CNNs. In this paper, we propose +REPrune, a novel channel pruning technique that emulates kernel pruning, fully +exploiting the finer but structured granularity. REPrune identifies similar +kernels within each channel using agglomerative clustering. Then, it selects +filters that maximize the incorporation of kernel representatives while +optimizing the maximum cluster coverage problem. By integrating with a +simultaneous training-pruning paradigm, REPrune promotes efficient, progressive +pruning throughout training CNNs, avoiding the conventional +train-prune-finetune sequence. 
Experimental results highlight that REPrune +performs better in computer vision tasks than existing methods, effectively +achieving a balance between acceleration ratio and performance retention.",cs.CV,"['cs.CV', 'cs.AI']" +DynVideo-E: Harnessing Dynamic NeRF for Large-Scale Motion- and View-Change Human-Centric Video Editing,Jia-Wei Liu · Yan-Pei Cao · Jay Zhangjie Wu · Weijia Mao · Yuchao Gu · Rui Zhao · Jussi Keppo · Ying Shan · Mike Zheng Shou, ,https://arxiv.org/abs/2310.10624,,2310.10624.pdf,DynVideo-E: Harnessing Dynamic NeRF for Large-Scale Motion- and View-Change Human-Centric Video Editing,"Despite recent progress in diffusion-based video editing, existing methods +are limited to short-length videos due to the contradiction between long-range +consistency and frame-wise editing. Prior attempts to address this challenge by +introducing video-2D representations encounter significant difficulties with +large-scale motion- and view-change videos, especially in human-centric +scenarios. To overcome this, we propose to introduce the dynamic Neural +Radiance Fields (NeRF) as the innovative video representation, where the +editing can be performed in the 3D spaces and propagated to the entire video +via the deformation field. To provide consistent and controllable editing, we +propose the image-based video-NeRF editing pipeline with a set of innovative +designs, including multi-view multi-pose Score Distillation Sampling (SDS) from +both the 2D personalized diffusion prior and 3D diffusion prior, reconstruction +losses, text-guided local parts super-resolution, and style transfer. Extensive +experiments demonstrate that our method, dubbed as DynVideo-E, significantly +outperforms SOTA approaches on two challenging datasets by a large margin of +50% ~ 95% for human preference. Code will be released at +https://showlab.github.io/DynVideo-E/.",cs.CV,['cs.CV'] +X-Adapter: Adding Universal Compatibility of Plugins for Upgraded Diffusion Model,Lingmin Ran · Xiaodong Cun · Jia-Wei Liu · Rui Zhao · Song Zijie · Xintao Wang · Jussi Keppo · Mike Zheng Shou, ,https://arxiv.org/abs/2312.02238,,2312.02238.pdf,X-Adapter: Adding Universal Compatibility of Plugins for Upgraded Diffusion Model,"We introduce X-Adapter, a universal upgrader to enable the pretrained +plug-and-play modules (e.g., ControlNet, LoRA) to work directly with the +upgraded text-to-image diffusion model (e.g., SDXL) without further retraining. +We achieve this goal by training an additional network to control the frozen +upgraded model with the new text-image data pairs. In detail, X-Adapter keeps a +frozen copy of the old model to preserve the connectors of different plugins. +Additionally, X-Adapter adds trainable mapping layers that bridge the decoders +from models of different versions for feature remapping. The remapped features +will be used as guidance for the upgraded model. To enhance the guidance +ability of X-Adapter, we employ a null-text training strategy for the upgraded +model. After training, we also introduce a two-stage denoising strategy to +align the initial latents of X-Adapter and the upgraded model. Thanks to our +strategies, X-Adapter demonstrates universal compatibility with various plugins +and also enables plugins of different versions to work together, thereby +expanding the functionalities of diffusion community. 
To verify the +effectiveness of the proposed method, we conduct extensive experiments and the +results show that X-Adapter may facilitate wider application in the upgraded +foundational diffusion model.",cs.CV,"['cs.CV', 'cs.AI', 'cs.MM']" +MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model,Zhongcong Xu · Jianfeng Zhang · Jun Hao Liew · Hanshu Yan · Jia-Wei Liu · Chenxu Zhang · Jiashi Feng · Mike Zheng Shou, ,https://arxiv.org/abs/2311.16498,,2311.16498.pdf,MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model,"This paper studies the human image animation task, which aims to generate a +video of a certain reference identity following a particular motion sequence. +Existing animation works typically employ the frame-warping technique to +animate the reference image towards the target motion. Despite achieving +reasonable results, these approaches face challenges in maintaining temporal +consistency throughout the animation due to the lack of temporal modeling and +poor preservation of reference identity. In this work, we introduce +MagicAnimate, a diffusion-based framework that aims at enhancing temporal +consistency, preserving reference image faithfully, and improving animation +fidelity. To achieve this, we first develop a video diffusion model to encode +temporal information. Second, to maintain the appearance coherence across +frames, we introduce a novel appearance encoder to retain the intricate details +of the reference image. Leveraging these two innovations, we further employ a +simple video fusion technique to encourage smooth transitions for long video +animation. Empirical results demonstrate the superiority of our method over +baseline approaches on two benchmarks. Notably, our approach outperforms the +strongest baseline by over 38% in terms of video fidelity on the challenging +TikTok dancing dataset. Code and model will be made available.",cs.CV,"['cs.CV', 'cs.GR']" +VideoSwap: Customized Video Subject Swapping with Interactive Semantic Point Correspondence,Yuchao Gu · Yipin Zhou · Bichen Wu · Licheng Yu · Jia-Wei Liu · Rui Zhao · Jay Zhangjie Wu · David Junhao Zhang · Mike Zheng Shou · Kevin Tang, ,https://arxiv.org/abs/2312.02087,,2312.02087.pdf,VideoSwap: Customized Video Subject Swapping with Interactive Semantic Point Correspondence,"Current diffusion-based video editing primarily focuses on +structure-preserved editing by utilizing various dense correspondences to +ensure temporal consistency and motion alignment. However, these approaches are +often ineffective when the target edit involves a shape change. To embark on +video editing with shape change, we explore customized video subject swapping +in this work, where we aim to replace the main subject in a source video with a +target subject having a distinct identity and potentially different shape. In +contrast to previous methods that rely on dense correspondences, we introduce +the VideoSwap framework that exploits semantic point correspondences, inspired +by our observation that only a small number of semantic points are necessary to +align the subject's motion trajectory and modify its shape. We also introduce +various user-point interactions (\eg, removing points and dragging points) to +address various semantic point correspondence. 
Extensive experiments +demonstrate state-of-the-art video subject swapping results across a variety of +real-world videos.",cs.CV,['cs.CV'] +LIVE: Online Large Video-Language Model for Streaming Video,Joya Chen · Zhaoyang Lv · Shiwei Wu · Kevin Qinghong Lin · Chenan Song · Difei Gao · Jia-Wei Liu · Ziteng Gao · Dongxing Mao · Mike Zheng Shou, ,https://arxiv.org/abs/2405.16009,,2405.16009.pdf,Streaming Long Video Understanding with Large Language Models,"This paper presents VideoStreaming, an advanced vision-language large model +(VLLM) for video understanding, that capably understands arbitrary-length video +with a constant number of video tokens streamingly encoded and adaptively +selected. The challenge of video understanding in the vision language area +mainly lies in the significant computational burden caused by the great number +of tokens extracted from long videos. Previous works rely on sparse sampling or +frame compression to reduce tokens. However, such approaches either disregard +temporal information in a long time span or sacrifice spatial details, +resulting in flawed compression. To address these limitations, our +VideoStreaming has two core designs: Memory-Propagated Streaming Encoding and +Adaptive Memory Selection. The Memory-Propagated Streaming Encoding +architecture segments long videos into short clips and sequentially encodes +each clip with a propagated memory. In each iteration, we utilize the encoded +results of the preceding clip as historical memory, which is integrated with +the current clip to distill a condensed representation that encapsulates the +video content up to the current timestamp. After the encoding process, the +Adaptive Memory Selection strategy selects a constant number of +question-related memories from all the historical memories and feeds them into +the LLM to generate informative responses. The question-related selection +reduces redundancy within the memories, enabling efficient and precise video +understanding. Meanwhile, the disentangled video extraction and reasoning +design allows the LLM to answer different questions about a video by directly +selecting corresponding memories, without the need to encode the whole video +for each question. Our model achieves superior performance and higher +efficiency on long video benchmarks, showcasing precise temporal comprehension +for detailed question answering.",cs.CV,['cs.CV'] +Restoration by Generation with Constrained Priors,Zheng Ding · Xuaner Zhang · Zhuowen Tu · Zhihao Xia,https://gen2res.github.io,https://arxiv.org/abs/2312.17161,,2312.17161.pdf,Restoration by Generation with Constrained Priors,"The inherent generative power of denoising diffusion models makes them +well-suited for image restoration tasks where the objective is to find the +optimal high-quality image within the generative space that closely resembles +the input image. We propose a method to adapt a pretrained diffusion model for +image restoration by simply adding noise to the input image to be restored and +then denoise. Our method is based on the observation that the space of a +generative model needs to be constrained. We impose this constraint by +finetuning the generative model with a set of anchor images that capture the +characteristics of the input image. With the constrained space, we can then +leverage the sampling strategy used for generation to do image restoration. 
We +evaluate against previous methods and show superior performances on multiple +real-world restoration datasets in preserving identity and image quality. We +also demonstrate an important and practical application on personalized +restoration, where we use a personal album as the anchor images to constrain +the generative space. This approach allows us to produce results that +accurately preserve high-frequency details, which previous works are unable to +do. Project webpage: https://gen2res.github.io.",cs.CV,['cs.CV'] +3D Multi-frame Fusion for Video Stabilization,Zhan Peng · Xinyi Ye · Weiyue Zhao · TIANQI LIU · Huiqiang Sun · Baopu Li · Zhiguo Cao, ,https://arxiv.org/abs/2404.12887,,2404.12887.pdf,3D Multi-frame Fusion for Video Stabilization,"In this paper, we present RStab, a novel framework for video stabilization +that integrates 3D multi-frame fusion through volume rendering. Departing from +conventional methods, we introduce a 3D multi-frame perspective to generate +stabilized images, addressing the challenge of full-frame generation while +preserving structure. The core of our approach lies in Stabilized Rendering +(SR), a volume rendering module, which extends beyond the image fusion by +incorporating feature fusion. The core of our RStab framework lies in +Stabilized Rendering (SR), a volume rendering module, fusing multi-frame +information in 3D space. Specifically, SR involves warping features and colors +from multiple frames by projection, fusing them into descriptors to render the +stabilized image. However, the precision of warped information depends on the +projection accuracy, a factor significantly influenced by dynamic regions. In +response, we introduce the Adaptive Ray Range (ARR) module to integrate depth +priors, adaptively defining the sampling range for the projection process. +Additionally, we propose Color Correction (CC) assisting geometric constraints +with optical flow for accurate color aggregation. Thanks to the three modules, +our RStab demonstrates superior performance compared with previous stabilizers +in the field of view (FOV), image quality, and video stability across various +datasets.",cs.CV,"['cs.CV', 'eess.IV']" +3D Facial Expressions through Analysis-by-Neural-Synthesis,George Retsinas · Panagiotis Filntisis · Radek Danecek · Victoria Abrevaya · Anastasios Roussos · Timo Bolkart · Petros Maragos,https://georgeretsi.github.io/smirk/,https://arxiv.org/abs/2404.04104,,2404.04104.pdf,3D Facial Expressions through Analysis-by-Neural-Synthesis,"While existing methods for 3D face reconstruction from in-the-wild images +excel at recovering the overall face shape, they commonly miss subtle, extreme, +asymmetric, or rarely observed expressions. We improve upon these methods with +SMIRK (Spatial Modeling for Image-based Reconstruction of Kinesics), which +faithfully reconstructs expressive 3D faces from images. We identify two key +limitations in existing methods: shortcomings in their self-supervised training +formulation, and a lack of expression diversity in the training images. For +training, most methods employ differentiable rendering to compare a predicted +face mesh with the input image, along with a plethora of additional loss +functions. This differentiable rendering loss not only has to provide +supervision to optimize for 3D face geometry, camera, albedo, and lighting, +which is an ill-posed optimization problem, but the domain gap between +rendering and input image further hinders the learning process. 
Instead, SMIRK +replaces the differentiable rendering with a neural rendering module that, +given the rendered predicted mesh geometry, and sparsely sampled pixels of the +input image, generates a face image. As the neural rendering gets color +information from sampled image pixels, supervising with neural rendering-based +reconstruction loss can focus solely on the geometry. Further, it enables us to +generate images of the input identity with varying expressions while training. +These are then utilized as input to the reconstruction model and used as +supervision with ground truth geometry. This effectively augments the training +data and enhances the generalization for diverse expressions. Our qualitative, +quantitative and particularly our perceptual evaluations demonstrate that SMIRK +achieves the new state-of-the art performance on accurate expression +reconstruction. Project webpage: https://georgeretsi.github.io/smirk/.",cs.CV,['cs.CV'] +SVGDreamer: Text Guided SVG Generation with Diffusion Model,XiMing Xing · Chuang Wang · Haitao Zhou · Jing Zhang · Dong Xu · Qian Yu,https://github.com/ximinng/SVGDreamer,https://arxiv.org/abs/2312.16476,,2312.16476.pdf,SVGDreamer: Text Guided SVG Generation with Diffusion Model,"Recently, text-guided scalable vector graphics (SVGs) synthesis has shown +promise in domains such as iconography and sketch. However, existing +text-to-SVG generation methods lack editability and struggle with visual +quality and result diversity. To address these limitations, we propose a novel +text-guided vector graphics synthesis method called SVGDreamer. SVGDreamer +incorporates a semantic-driven image vectorization (SIVE) process that enables +the decomposition of synthesis into foreground objects and background, thereby +enhancing editability. Specifically, the SIVE process introduces +attention-based primitive control and an attention-mask loss function for +effective control and manipulation of individual elements. Additionally, we +propose a Vectorized Particle-based Score Distillation (VPSD) approach to +address issues of shape over-smoothing, color over-saturation, limited +diversity, and slow convergence of the existing text-to-SVG generation methods +by modeling SVGs as distributions of control points and colors. Furthermore, +VPSD leverages a reward model to re-weight vector particles, which improves +aesthetic appeal and accelerates convergence. Extensive experiments are +conducted to validate the effectiveness of SVGDreamer, demonstrating its +superiority over baseline methods in terms of editability, visual quality, and +diversity. Project page: +\href{https://ximinng.github.io/SVGDreamer-project/}{https://ximinng.github.io/SVGDreamer-project/}",cs.CV,"['cs.CV', 'cs.AI']" +ExMap: Leveraging Explainability Heatmaps for Unsupervised Group Robustness to Spurious Correlations,Rwiddhi Chakraborty · Adrian de Sena Sletten · Michael C. Kampffmeyer, ,https://arxiv.org/abs/2403.13870,,2403.13870.pdf,ExMap: Leveraging Explainability Heatmaps for Unsupervised Group Robustness to Spurious Correlations,"Group robustness strategies aim to mitigate learned biases in deep learning +models that arise from spurious correlations present in their training +datasets. However, most existing methods rely on the access to the label +distribution of the groups, which is time-consuming and expensive to obtain. As +a result, unsupervised group robustness strategies are sought. 
Based on the +insight that a trained model's classification strategies can be inferred +accurately based on explainability heatmaps, we introduce ExMap, an +unsupervised two stage mechanism designed to enhance group robustness in +traditional classifiers. ExMap utilizes a clustering module to infer +pseudo-labels based on a model's explainability heatmaps, which are then used +during training in lieu of actual labels. Our empirical studies validate the +efficacy of ExMap - We demonstrate that it bridges the performance gap with its +supervised counterparts and outperforms existing partially supervised and +unsupervised methods. Additionally, ExMap can be seamlessly integrated with +existing group robustness learning strategies. Finally, we demonstrate its +potential in tackling the emerging issue of multiple shortcut +mitigation\footnote{Code available at \url{https://github.com/rwchakra/exmap}}.",cs.CV,"['cs.CV', 'cs.LG']" +Learning Triangular Distribution in Visual World,Ping Chen · Xingpeng Zhang · Chengtao Zhou · dichao Fan · Peng Tu · Le Zhang · Yanlin Qian, ,https://arxiv.org/abs/2311.18605,,2311.18605.pdf,Learning Triangular Distribution in Visual World,"Convolution neural network is successful in pervasive vision tasks, including +label distribution learning, which usually takes the form of learning an +injection from the non-linear visual features to the well-defined labels. +However, how the discrepancy between features is mapped to the label +discrepancy is ambient, and its correctness is not guaranteed.To address these +problems, we study the mathematical connection between feature and its label, +presenting a general and simple framework for label distribution learning. We +propose a so-called Triangular Distribution Transform (TDT) to build an +injective function between feature and label, guaranteeing that any symmetric +feature discrepancy linearly reflects the difference between labels. The +proposed TDT can be used as a plug-in in mainstream backbone networks to +address different label distribution learning tasks. Experiments on Facial Age +Recognition, Illumination Chromaticity Estimation, and Aesthetics assessment +show that TDT achieves on-par or better results than the prior arts.",cs.CV,['cs.CV'] +ToonerGAN: Reinforcing GANs for Obfuscating Automated Facial Indexing,Kartik Thakral · Shashikant Prasad · Stuti Aswani · Mayank Vatsa · Richa Singh, ,,https://github.com/Kartik-3004/facexformer,,,,,nan +Boosting Adversarial Training via Fisher-Rao Norm-based Regularization,Xiangyu Yin · Wenjie Ruan, ,https://arxiv.org/abs/2403.17520,,2403.17520.pdf,Boosting Adversarial Training via Fisher-Rao Norm-based Regularization,"Adversarial training is extensively utilized to improve the adversarial +robustness of deep neural networks. Yet, mitigating the degradation of standard +generalization performance in adversarial-trained models remains an open +problem. This paper attempts to resolve this issue through the lens of model +complexity. First, We leverage the Fisher-Rao norm, a geometrically invariant +metric for model complexity, to establish the non-trivial bounds of the +Cross-Entropy Loss-based Rademacher complexity for a ReLU-activated Multi-Layer +Perceptron. Then we generalize a complexity-related variable, which is +sensitive to the changes in model width and the trade-off factors in +adversarial training. 
Moreover, intensive empirical evidence validates that +this variable highly correlates with the generalization gap of Cross-Entropy +loss between adversarial-trained and standard-trained models, especially during +the initial and final phases of the training process. Building upon this +observation, we propose a novel regularization framework, called Logit-Oriented +Adversarial Training (LOAT), which can mitigate the trade-off between +robustness and accuracy while imposing only a negligible increase in +computational overhead. Our extensive experiments demonstrate that the proposed +regularization strategy can boost the performance of the prevalent adversarial +training algorithms, including PGD-AT, TRADES, TRADES (LSE), MART, and DM-AT, +across various network architectures. Our code will be available at +https://github.com/TrustAI/LOAT.",cs.LG,"['cs.LG', 'cs.CV']" +CORES: Convolutional Response-based Score for Out-of-distribution Detection,Keke Tang · Chao Hou · Weilong Peng · Runnan Chen · Peican Zhu · Wenping Wang · Zhihong Tian, ,https://arxiv.org/abs/2405.01662,,2405.01662.pdf,Out-of-distribution detection based on subspace projection of high-dimensional features output by the last convolutional layer,"Out-of-distribution (OOD) detection, crucial for reliable pattern +classification, discerns whether a sample originates outside the training +distribution. This paper concentrates on the high-dimensional features output +by the final convolutional layer, which contain rich image features. Our key +idea is to project these high-dimensional features into two specific feature +subspaces, leveraging the dimensionality reduction capacity of the network's +linear layers, trained with Predefined Evenly-Distribution Class Centroids +(PEDCC)-Loss. This involves calculating the cosines of three projection angles +and the norm values of features, thereby identifying distinctive information +for in-distribution (ID) and OOD data, which assists in OOD detection. Building +upon this, we have modified the batch normalization (BN) and ReLU layer +preceding the fully connected layer, diminishing their impact on the output +feature distributions and thereby widening the distribution gap between ID and +OOD data features. Our method requires only the training of the classification +network model, eschewing any need for input pre-processing or specific OOD data +pre-tuning. Extensive experiments on several benchmark datasets demonstrates +that our approach delivers state-of-the-art performance. Our code is available +at https://github.com/Hewell0/ProjOOD.",cs.CV,['cs.CV'] +Higher-order Relational Reasoning for Pedestrian Trajectory Prediction,Sungjune Kim · Hyung-gun Chi · Hyerin Lim · Karthik Ramani · Jinkyu Kim · Sangpil Kim, ,https://arxiv.org/abs/2403.08032,,2403.08032.pdf,LG-Traj: LLM Guided Pedestrian Trajectory Prediction,"Accurate pedestrian trajectory prediction is crucial for various +applications, and it requires a deep understanding of pedestrian motion +patterns in dynamic environments. However, existing pedestrian trajectory +prediction methods still need more exploration to fully leverage these motion +patterns. This paper investigates the possibilities of using Large Language +Models (LLMs) to improve pedestrian trajectory prediction tasks by inducing +motion cues. We introduce LG-Traj, a novel approach incorporating LLMs to +generate motion cues present in pedestrian past/observed trajectories. 
Our +approach also incorporates motion cues present in pedestrian future +trajectories by clustering future trajectories of training data using a mixture +of Gaussians. These motion cues, along with pedestrian coordinates, facilitate +a better understanding of the underlying representation. Furthermore, we +utilize singular value decomposition to augment the observed trajectories, +incorporating them into the model learning process to further enhance +representation learning. Our method employs a transformer-based architecture +comprising a motion encoder to model motion patterns and a social decoder to +capture social interactions among pedestrians. We demonstrate the effectiveness +of our approach on popular pedestrian trajectory prediction benchmarks, namely +ETH-UCY and SDD, and present various ablation experiments to validate our +approach.",cs.CV,"['cs.CV', 'cs.AI']" +LED: A Large-scale Real-world Paired Dataset for Event Camera Denoising,Yuxing Duan, ,https://arxiv.org/abs/2405.19718,,2405.19718.pdf,LED: A Large-scale Real-world Paired Dataset for Event Camera Denoising,"Event camera has significant advantages in capturing dynamic scene +information while being prone to noise interference, particularly in +challenging conditions like low threshold and low illumination. However, most +existing research focuses on gentle situations, hindering event camera +applications in realistic complex scenarios. To tackle this limitation and +advance the field, we construct a new paired real-world event denoising dataset +(LED), including 3K sequences with 18K seconds of high-resolution (1200*680) +event streams and showing three notable distinctions compared to others: +diverse noise levels and scenes, larger-scale with high-resolution, and +high-quality GT. Specifically, it contains stepped parameters and varying +illumination with diverse scenarios. Moreover, based on the property of noise +events inconsistency and signal events consistency, we propose a novel +effective denoising framework(DED) using homogeneous dual events to generate +the GT with better separating noise from the raw. Furthermore, we design a +bio-inspired baseline leveraging Leaky-Integrate-and-Fire (LIF) neurons with +dynamic thresholds to realize accurate denoising. The experimental results +demonstrate that the remarkable performance of the proposed approach on +different datasets.The dataset and code are at https://github.com/Yee-Sing/led.",cs.CV,['cs.CV'] +Point2CAD: Reverse Engineering CAD Models from 3D Point Clouds,Yujia Liu · Anton Obukhov · Jan D. Wegner · Konrad Schindler, ,https://arxiv.org/abs/2312.04962,,2312.04962.pdf,Point2CAD: Reverse Engineering CAD Models from 3D Point Clouds,"Computer-Aided Design (CAD) model reconstruction from point clouds is an +important problem at the intersection of computer vision, graphics, and machine +learning; it saves the designer significant time when iterating on in-the-wild +objects. Recent advancements in this direction achieve relatively reliable +semantic segmentation but still struggle to produce an adequate topology of the +CAD model. In this work, we analyze the current state of the art for that +ill-posed task and identify shortcomings of existing methods. We propose a +hybrid analytic-neural reconstruction scheme that bridges the gap between +segmented point clouds and structured CAD models and can be readily combined +with different segmentation backbones. 
Moreover, to power the surface fitting +stage, we propose a novel implicit neural representation of freeform surfaces, +driving up the performance of our overall CAD reconstruction scheme. We +extensively evaluate our method on the popular ABC benchmark of CAD models and +set a new state-of-the-art for that dataset. Project page: +https://www.obukhov.ai/point2cad}{https://www.obukhov.ai/point2cad.",cs.CV,['cs.CV'] +Mining Supervision for Dynamic Regions in Self-Supervised Monocular Depth Estimation,Hoang Chuong Nguyen · Tianyu Wang · Jose M. Alvarez · Miaomiao Liu, ,https://arxiv.org/abs/2404.14908,,2404.14908.pdf,Mining Supervision for Dynamic Regions in Self-Supervised Monocular Depth Estimation,"This paper focuses on self-supervised monocular depth estimation in dynamic +scenes trained on monocular videos. Existing methods jointly estimate +pixel-wise depth and motion, relying mainly on an image reconstruction loss. +Dynamic regions1 remain a critical challenge for these methods due to the +inherent ambiguity in depth and motion estimation, resulting in inaccurate +depth estimation. This paper proposes a self-supervised training framework +exploiting pseudo depth labels for dynamic regions from training data. The key +contribution of our framework is to decouple depth estimation for static and +dynamic regions of images in the training data. We start with an unsupervised +depth estimation approach, which provides reliable depth estimates for static +regions and motion cues for dynamic regions and allows us to extract moving +object information at the instance level. In the next stage, we use an object +network to estimate the depth of those moving objects assuming rigid motions. +Then, we propose a new scale alignment module to address the scale ambiguity +between estimated depths for static and dynamic regions. We can then use the +depth labels generated to train an end-to-end depth estimation network and +improve its performance. Extensive experiments on the Cityscapes and KITTI +datasets show that our self-training strategy consistently outperforms existing +self/unsupervised depth estimation methods.",cs.CV,['cs.CV'] +Collaborative Learning of Anomalies with Privacy (CLAP) for Unsupervised Video Anomaly Detection: A New Baseline,Anas Al-lahham · Muhammad Zaigham Zaheer · Nurbek Tastan · Karthik Nandakumar,https://anasemad11.github.io/CLAP/,https://arxiv.org/abs/2404.00847,,2404.00847.pdf,Collaborative Learning of Anomalies with Privacy (CLAP) for Unsupervised Video Anomaly Detection: A New Baseline,"Unsupervised (US) video anomaly detection (VAD) in surveillance applications +is gaining more popularity recently due to its practical real-world +applications. As surveillance videos are privacy sensitive and the availability +of large-scale video data may enable better US-VAD systems, collaborative +learning can be highly rewarding in this setting. However, due to the extremely +challenging nature of the US-VAD task, where learning is carried out without +any annotations, privacy-preserving collaborative learning of US-VAD systems +has not been studied yet. In this paper, we propose a new baseline for anomaly +detection capable of localizing anomalous events in complex surveillance videos +in a fully unsupervised fashion without any labels on a privacy-preserving +participant-based distributed training configuration. Additionally, we propose +three new evaluation protocols to benchmark anomaly detection approaches on +various scenarios of collaborations and data availability. 
Based on these +protocols, we modify existing VAD datasets to extensively evaluate our approach +as well as existing US SOTA methods on two large-scale datasets including +UCF-Crime and XD-Violence. All proposed evaluation protocols, dataset splits, +and codes are available here: https://github.com/AnasEmad11/CLAP",cs.CV,['cs.CV'] +Unlocking Pretrained Image Backbones for Semantic Image Synthesis,Tariq Berrada · Jakob Verbeek · camille couprie · Karteek Alahari, ,https://arxiv.org/abs/2312.13314,,2312.13314.pdf,Unlocking Pre-trained Image Backbones for Semantic Image Synthesis,"Semantic image synthesis, i.e., generating images from user-provided semantic +label maps, is an important conditional image generation task as it allows to +control both the content as well as the spatial layout of generated images. +Although diffusion models have pushed the state of the art in generative image +modeling, the iterative nature of their inference process makes them +computationally demanding. Other approaches such as GANs are more efficient as +they only need a single feed-forward pass for generation, but the image quality +tends to suffer on large and diverse datasets. In this work, we propose a new +class of GAN discriminators for semantic image synthesis that generates highly +realistic images by exploiting feature backbone networks pre-trained for tasks +such as image classification. We also introduce a new generator architecture +with better context modeling and using cross-attention to inject noise into +latent variables, leading to more diverse generated images. Our model, which we +dub DP-SIMS, achieves state-of-the-art results in terms of image quality and +consistency with the input label maps on ADE-20K, COCO-Stuff, and Cityscapes, +surpassing recent diffusion models while requiring two orders of magnitude less +compute for inference.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +Point2RBox: Combine Knowledge from Synthetic Visual Patterns for End-to-end Oriented Object Detection with Single Point Supervision,Yi Yu · Xue Yang · Qingyun Li · Feipeng Da · Jifeng Dai · Yu Qiao · Junchi Yan, ,https://arxiv.org/abs/2311.14758,,2311.14758.pdf,Point2RBox: Combine Knowledge from Synthetic Visual Patterns for End-to-end Oriented Object Detection with Single Point Supervision,"With the rapidly increasing demand for oriented object detection (OOD), +recent research involving weakly-supervised detectors for learning rotated box +(RBox) from the horizontal box (HBox) has attracted more and more attention. In +this paper, we explore a more challenging yet label-efficient setting, namely +single point-supervised OOD, and present our approach called Point2RBox. +Specifically, we propose to leverage two principles: 1) Synthetic pattern +knowledge combination: By sampling around each labeled point on the image, we +spread the object feature to synthetic visual patterns with known boxes to +provide the knowledge for box regression. 2) Transform self-supervision: With a +transformed input image (e.g. scaled/rotated), the output RBoxes are trained to +follow the same transformation so that the network can perceive the relative +size/rotation between objects. The detector is further enhanced by a few +devised techniques to cope with peripheral issues, e.g. the anchor/layer +assignment as the size of the object is not available in our point supervision +setting. To our best knowledge, Point2RBox is the first end-to-end solution for +point-supervised OOD. 
In particular, our method uses a lightweight paradigm, +yet it achieves a competitive performance among point-supervised alternatives, +41.05%/27.62%/80.01% on DOTA/DIOR/HRSC datasets.",cs.CV,"['cs.CV', 'cs.AI']" +A Versatile Framework for Continual Test-Time Domain Adaptation: Balancing Discriminability and Generalizability,Xu Yang · Xuan chen · Moqi Li · Kun Wei · Cheng Deng, ,https://arxiv.org/abs/2405.14602,,2405.14602.pdf,Controllable Continual Test-Time Adaptation,"Continual Test-Time Adaptation (CTTA) is an emerging and challenging task +where a model trained in a source domain must adapt to continuously changing +conditions during testing, without access to the original source data. CTTA is +prone to error accumulation due to uncontrollable domain shifts, leading to +blurred decision boundaries between categories. Existing CTTA methods primarily +focus on suppressing domain shifts, which proves inadequate during the +unsupervised test phase. In contrast, we introduce a novel approach that guides +rather than suppresses these shifts. Specifically, we propose +$\textbf{C}$ontrollable $\textbf{Co}$ntinual $\textbf{T}$est-$\textbf{T}$ime +$\textbf{A}$daptation (C-CoTTA), which explicitly prevents any single category +from encroaching on others, thereby mitigating the mutual influence between +categories caused by uncontrollable shifts. Moreover, our method reduces the +sensitivity of model to domain transformations, thereby minimizing the +magnitude of category shifts. Extensive quantitative experiments demonstrate +the effectiveness of our method, while qualitative analyses, such as t-SNE +plots, confirm the theoretical validity of our approach.",cs.LG,['cs.LG'] +Spanning Training Progress: Temporal Dual-Depth Scoring (TDDS) for Enhanced Dataset Pruning,xin zhang · Jiawei Du · Weiying Xie · Yunsong Li · Joey Tianyi Zhou, ,https://arxiv.org/abs/2311.13613,,2311.13613.pdf,Spanning Training Progress: Temporal Dual-Depth Scoring (TDDS) for Enhanced Dataset Pruning,"Dataset pruning aims to construct a coreset capable of achieving performance +comparable to the original, full dataset. Most existing dataset pruning methods +rely on snapshot-based criteria to identify representative samples, often +resulting in poor generalization across various pruning and cross-architecture +scenarios. Recent studies have addressed this issue by expanding the scope of +training dynamics considered, including factors such as forgetting event and +probability change, typically using an averaging approach. However, these works +struggle to integrate a broader range of training dynamics without overlooking +well-generalized samples, which may not be sufficiently highlighted in an +averaging manner. In this study, we propose a novel dataset pruning method +termed as Temporal Dual-Depth Scoring (TDDS), to tackle this problem. TDDS +utilizes a dual-depth strategy to achieve a balance between incorporating +extensive training dynamics and identifying representative samples for dataset +pruning. In the first depth, we estimate the series of each sample's individual +contributions spanning the training progress, ensuring comprehensive +integration of training dynamics. In the second depth, we focus on the +variability of the sample-wise contributions identified in the first depth to +highlight well-generalized samples. Extensive experiments conducted on CIFAR +and ImageNet datasets verify the superiority of TDDS over previous SOTA +methods. 
Specifically on CIFAR-100, our method achieves 54.51% accuracy with +only 10% training data, surpassing random selection by 7.83% and other +comparison methods by at least 12.69%.",cs.CV,"['cs.CV', 'cs.LG']" +FlowDiffuser: Advancing Optical Flow Estimation with Diffusion Models,Ao Luo · XIN LI · Fan Yang · Jiangyu Liu · Haoqiang Fan · Shuaicheng Liu, ,https://arxiv.org/html/2312.01746v1,,2312.01746v1.pdf,Open-DDVM: A Reproduction and Extension of Diffusion Model for Optical Flow Estimation,"Recently, Google proposes DDVM which for the first time demonstrates that a +general diffusion model for image-to-image translation task works impressively +well on optical flow estimation task without any specific designs like RAFT. +However, DDVM is still a closed-source model with the expensive and private +Palette-style pretraining. In this technical report, we present the first +open-source DDVM by reproducing it. We study several design choices and find +those important ones. By training on 40k public data with 4 GPUs, our +reproduction achieves comparable performance to the closed-source DDVM. The +code and model have been released in +https://github.com/DQiaole/FlowDiffusion_pytorch.",cs.CV,['cs.CV'] +Probabilistic Sampling of Balanced K-Means using Adiabatic Quantum Computing,Jan-Nico Zaech · Martin Danelljan · Tolga Birdal · Luc Van Gool, ,https://arxiv.org/abs/2310.12153,,2310.12153.pdf,Probabilistic Sampling of Balanced K-Means using Adiabatic Quantum Computing,"Adiabatic quantum computing (AQC) is a promising approach for discrete and +often NP-hard optimization problems. Current AQCs allow to implement problems +of research interest, which has sparked the development of quantum +representations for many computer vision tasks. Despite requiring multiple +measurements from the noisy AQC, current approaches only utilize the best +measurement, discarding information contained in the remaining ones. In this +work, we explore the potential of using this information for probabilistic +balanced k-means clustering. Instead of discarding non-optimal solutions, we +propose to use them to compute calibrated posterior probabilities with little +additional compute cost. This allows us to identify ambiguous solutions and +data points, which we demonstrate on a D-Wave AQC on synthetic tasks and real +visual data.",cs.LG,"['cs.LG', 'cs.AI', 'cs.CV']" +OMG-Seg: Is One Model Good Enough For All Segmentation?,Xiangtai Li · Haobo Yuan · Wei Li · Henghui Ding · Size Wu · Wenwei Zhang · Yining Li · Kai Chen · Chen Change Loy, ,https://arxiv.org/abs/2401.10229,,2401.10229.pdf,OMG-Seg: Is One Model Good Enough For All Segmentation?,"In this work, we address various segmentation tasks, each traditionally +tackled by distinct or partially unified models. We propose OMG-Seg, One Model +that is Good enough to efficiently and effectively handle all the segmentation +tasks, including image semantic, instance, and panoptic segmentation, as well +as their video counterparts, open vocabulary settings, prompt-driven, +interactive segmentation like SAM, and video object segmentation. To our +knowledge, this is the first model to handle all these tasks in one model and +achieve satisfactory performance. We show that OMG-Seg, a transformer-based +encoder-decoder architecture with task-specific queries and outputs, can +support over ten distinct segmentation tasks and yet significantly reduce +computational and parameter overhead across various tasks and datasets. 
We +rigorously evaluate the inter-task influences and correlations during +co-training. Code and models are available at https://github.com/lxtGH/OMG-Seg.",cs.CV,['cs.CV'] +Towards Fairness-Aware Adversarial Learning,Yanghao Zhang · Tianle Zhang · Ronghui Mu · Xiaowei Huang · Wenjie Ruan, ,https://arxiv.org/abs/2402.17729,,2402.17729.pdf,Towards Fairness-Aware Adversarial Learning,"Although adversarial training (AT) has proven effective in enhancing the +model's robustness, the recently revealed issue of fairness in robustness has +not been well addressed, i.e. the robust accuracy varies significantly among +different categories. In this paper, instead of uniformly evaluating the +model's average class performance, we delve into the issue of robust fairness, +by considering the worst-case distribution across various classes. We propose a +novel learning paradigm, named Fairness-Aware Adversarial Learning (FAAL). As a +generalization of conventional AT, we re-define the problem of adversarial +training as a min-max-max framework, to ensure both robustness and fairness of +the trained model. Specifically, by taking advantage of distributional robust +optimization, our method aims to find the worst distribution among different +categories, and the solution is guaranteed to obtain the upper bound +performance with high probability. In particular, FAAL can fine-tune an unfair +robust model to be fair within only two epochs, without compromising the +overall clean and robust accuracies. Extensive experiments on various image +datasets validate the superior performance and efficiency of the proposed FAAL +compared to other state-of-the-art methods.",cs.CV,['cs.CV'] +Inter-X: Towards Versatile Human-Human Interaction Analysis,Liang Xu · Xintao Lv · Yichao Yan · Xin Jin · Wu Shuwen · Congsheng Xu · Yifan Liu · Yizhou Zhou · Fengyun Rao · Xingdong Sheng · Yunhui LIU · Wenjun Zeng · Xiaokang Yang, ,https://arxiv.org/abs/2312.16051,,2312.16051.pdf,Inter-X: Towards Versatile Human-Human Interaction Analysis,"The analysis of the ubiquitous human-human interactions is pivotal for +understanding humans as social beings. Existing human-human interaction +datasets typically suffer from inaccurate body motions, lack of hand gestures +and fine-grained textual descriptions. To better perceive and generate +human-human interactions, we propose Inter-X, a currently largest human-human +interaction dataset with accurate body movements and diverse interaction +patterns, together with detailed hand gestures. The dataset includes ~11K +interaction sequences and more than 8.1M frames. We also equip Inter-X with +versatile annotations of more than 34K fine-grained human part-level textual +descriptions, semantic interaction categories, interaction order, and the +relationship and personality of the subjects. Based on the elaborate +annotations, we propose a unified benchmark composed of 4 categories of +downstream tasks from both the perceptual and generative directions. Extensive +experiments and comprehensive analysis show that Inter-X serves as a testbed +for promoting the development of versatile human-human interaction analysis. 
+Our dataset and benchmark will be publicly available for research purposes.",cs.CV,['cs.CV'] +ReGenNet: Towards Human Action-Reaction Synthesis,Liang Xu · Yizhou Zhou · Yichao Yan · Xin Jin · Wenhan Zhu · Fengyun Rao · Xiaokang Yang · Wenjun Zeng, ,https://arxiv.org/abs/2403.11882,,2403.11882.pdf,ReGenNet: Towards Human Action-Reaction Synthesis,"Humans constantly interact with their surrounding environments. Current +human-centric generative models mainly focus on synthesizing humans plausibly +interacting with static scenes and objects, while the dynamic human +action-reaction synthesis for ubiquitous causal human-human interactions is +less explored. Human-human interactions can be regarded as asymmetric with +actors and reactors in atomic interaction periods. In this paper, we +comprehensively analyze the asymmetric, dynamic, synchronous, and detailed +nature of human-human interactions and propose the first multi-setting human +action-reaction synthesis benchmark to generate human reactions conditioned on +given human actions. To begin with, we propose to annotate the actor-reactor +order of the interaction sequences for the NTU120, InterHuman, and Chi3D +datasets. Based on them, a diffusion-based generative model with a Transformer +decoder architecture called ReGenNet together with an explicit distance-based +interaction loss is proposed to predict human reactions in an online manner, +where the future states of actors are unavailable to reactors. Quantitative and +qualitative results show that our method can generate instant and plausible +human reactions compared to the baselines, and can generalize to unseen actor +motions and viewpoint changes.",cs.CV,"['cs.CV', 'cs.AI']" +Universal Novelty Detection through Adaptive Contrastive Learning,Hossein Mirzaei · Mojtaba Nafez · Mohammad Jafari · Mohammad Soltani · Mohammad Azizmalayeri · Jafar Habibi · Mohammad Sabokrou · Mohammad Rohban, ,,https://oist.mlds.jp/2024/02/27/two-papers-have-been-accepted-by-cvpr-2024/,,,,,nan +Cross-dimension Affinity Distillation for 3D EM Neuron Segmentation,Xiaoyu Liu · Miaomiao Cai · Yinda Chen · Yueyi Zhang · Te Shi · Ruobing Zhang · Xuejin Chen · Zhiwei Xiong, ,https://arxiv.org/html/2401.03043v1,,2401.03043v1.pdf,Learning Multimodal Volumetric Features for Large-Scale Neuron Tracing,"The current neuron reconstruction pipeline for electron microscopy (EM) data +usually includes automatic image segmentation followed by extensive human +expert proofreading. In this work, we aim to reduce human workload by +predicting connectivity between over-segmented neuron pieces, taking both +microscopy image and 3D morphology features into account, similar to human +proofreading workflow. To this end, we first construct a dataset, named +FlyTracing, that contains millions of pairwise connections of segments +expanding the whole fly brain, which is three orders of magnitude larger than +existing datasets for neuron segment connection. To learn sophisticated +biological imaging features from the connectivity annotations, we propose a +novel connectivity-aware contrastive learning method to generate dense +volumetric EM image embedding. The learned embeddings can be easily +incorporated with any point or voxel-based morphological representations for +automatic neuron tracing. 
Extensive comparisons of different combination +schemes of image and morphological representation in identifying split errors +across the whole fly brain demonstrate the superiority of the proposed +approach, especially for the locations that contain severe imaging artifacts, +such as section missing and misalignment. The dataset and code are available at +https://github.com/Levishery/Flywire-Neuron-Tracing.",cs.CV,['cs.CV'] +PartDistill: 3D Shape Part Segmentation by Vision-Language Model Distillation,Ardian Umam · Cheng-Kun Yang · Min-Hung Chen · Jen-Hui Chuang · Yen-Yu Lin,https://ardianumam.github.io/partdistill/,https://arxiv.org/abs/2312.04016,,2312.04016.pdf,PartDistill: 3D Shape Part Segmentation by Vision-Language Model Distillation,"This paper proposes a cross-modal distillation framework, PartDistill, which +transfers 2D knowledge from vision-language models (VLMs) to facilitate 3D +shape part segmentation. PartDistill addresses three major challenges in this +task: the lack of 3D segmentation in invisible or undetected regions in the 2D +projections, inconsistent 2D predictions by VLMs, and the lack of knowledge +accumulation across different 3D shapes. PartDistill consists of a teacher +network that uses a VLM to make 2D predictions and a student network that +learns from the 2D predictions while extracting geometrical features from +multiple 3D shapes to carry out 3D part segmentation. A bi-directional +distillation, including forward and backward distillations, is carried out +within the framework, where the former forward distills the 2D predictions to +the student network, and the latter improves the quality of the 2D predictions, +which subsequently enhances the final 3D segmentation. Moreover, PartDistill +can exploit generative models that facilitate effortless 3D shape creation for +generating knowledge sources to be distilled. Through extensive experiments, +PartDistill boosts the existing methods with substantial margins on widely used +ShapeNetPart and PartNetE datasets, by more than 15% and 12% higher mIoU +scores, respectively. The code for this work is available at +https://github.com/ardianumam/PartDistill.",cs.CV,['cs.CV'] +Diffusion-FOF: Single-view Clothed Human Reconstruction via Diffusion-based Fourier Occupancy Field,Yuanzhen Li · Fei LUO · Chunxia Xiao,https://youtu.be/jm1CsLV_5XU,https://arxiv.org/abs/2311.15855,,,SiTH: Single-view Textured Human Reconstruction with Image-Conditioned Diffusion,"A long-standing goal of 3D human reconstruction is to create lifelike and +fully detailed 3D humans from single-view images. The main challenge lies in +inferring unknown body shapes, appearances, and clothing details in areas not +visible in the images. To address this, we propose SiTH, a novel pipeline that +uniquely integrates an image-conditioned diffusion model into a 3D mesh +reconstruction workflow. At the core of our method lies the decomposition of +the challenging single-view reconstruction problem into generative +hallucination and reconstruction subproblems. For the former, we employ a +powerful generative diffusion model to hallucinate unseen back-view appearance +based on the input images. For the latter, we leverage skinned body meshes as +guidance to recover full-body texture meshes from the input and back-view +images. SiTH requires as few as 500 3D human scans for training while +maintaining its generality and robustness to diverse images. 
Extensive
+evaluations on two 3D human benchmarks, including our newly created one,
+highlighted our method's superior accuracy and perceptual quality in 3D
+textured human reconstruction. Our code and evaluation benchmark are available
+at https://ait.ethz.ch/sith",cs.CV,['cs.CV']
+Distilling Semantic Priors from SAM to Efficient Image Restoration Models,Quan Zhang · Xiaoyu Liu · Wei Li · Hanting Chen · Junchao Liu · Jie Hu · Zhiwei Xiong · Chun Yuan · Yunhe Wang, ,https://arxiv.org/abs/2403.16368,,2403.16368.pdf,Distilling Semantic Priors from SAM to Efficient Image Restoration Models,"In image restoration (IR), leveraging semantic priors from segmentation
+models has been a common approach to improve performance. The recent segment
+anything model (SAM) has emerged as a powerful tool for extracting advanced
+semantic priors to enhance IR tasks. However, the computational cost of SAM is
+prohibitive for IR, compared to existing smaller IR models. The incorporation
+of SAM for extracting semantic priors considerably hampers the model inference
+efficiency. To address this issue, we propose a general framework to distill
+SAM's semantic knowledge to boost existing IR models without interfering with
+their inference process. Specifically, our proposed framework consists of the
+semantic priors fusion (SPF) scheme and the semantic priors distillation (SPD)
+scheme. SPF fuses two kinds of information between the restored image predicted
+by the original IR model and the semantic mask predicted by SAM for the refined
+restored image. SPD leverages a self-distillation manner to distill the fused
+semantic priors to boost the performance of original IR models. Additionally,
+we design a semantic-guided relation (SGR) module for SPD, which ensures
+semantic feature representation space consistency to fully distill the priors.
+We demonstrate the effectiveness of our framework across multiple IR models and
+tasks, including deraining, deblurring, and denoising.",cs.CV,['cs.CV']
+Diffusion 3D Features (Diff3F): Decorating Untextured Shapes with Distilled Semantic Features,Niladri Shekhar Dutt · Sanjeev Muralikrishnan · Niloy J. Mitra,https://diff3f.github.io/,https://arxiv.org/abs/2311.17024,,2311.17024.pdf,Diffusion 3D Features (Diff3F): Decorating Untextured Shapes with Distilled Semantic Features,"We present Diff3F as a simple, robust, and class-agnostic feature descriptor
+that can be computed for untextured input shapes (meshes or point clouds). Our
+method distills diffusion features from image foundational models onto input
+shapes. Specifically, we use the input shapes to produce depth and normal maps
+as guidance for conditional image synthesis. In the process, we produce
+(diffusion) features in 2D that we subsequently lift and aggregate on the
+original surface. Our key observation is that even if the conditional image
+generations obtained from multi-view rendering of the input shapes are
+inconsistent, the associated image features are robust and, hence, can be
+directly aggregated across views. This produces semantic features on the input
+shapes, without requiring additional data or training. We perform extensive
+experiments on multiple benchmarks (SHREC'19, SHREC'20, FAUST, and TOSCA) and
+demonstrate that our features, being semantic instead of geometric, produce
+reliable correspondence across both isometric and non-isometrically related
+shape families.
Code is available via the project page at +https://diff3f.github.io/",cs.CV,"['cs.CV', 'cs.GR']" +TutteNet: Injective 3D Deformations by Composition of 2D Mesh Deformations,Bo Sun · Thibault Groueix · Chen Song · Qixing Huang · Noam Aigerman, ,https://arxiv.org/abs/2307.09892,,2307.09892.pdf,3Deformer: A Common Framework for Image-Guided Mesh Deformation,"We propose 3Deformer, a general-purpose framework for interactive 3D shape +editing. Given a source 3D mesh with semantic materials, and a user-specified +semantic image, 3Deformer can accurately edit the source mesh following the +shape guidance of the semantic image, while preserving the source topology as +rigid as possible. Recent studies of 3D shape editing mostly focus on learning +neural networks to predict 3D shapes, which requires high-cost 3D training +datasets and is limited to handling objects involved in the datasets. Unlike +these studies, our 3Deformer is a non-training and common framework, which only +requires supervision of readily-available semantic images, and is compatible +with editing various objects unlimited by datasets. In 3Deformer, the source +mesh is deformed utilizing the differentiable renderer technique, according to +the correspondences between semantic images and mesh materials. However, +guiding complex 3D shapes with a simple 2D image incurs extra challenges, that +is, the deform accuracy, surface smoothness, geometric rigidity, and global +synchronization of the edited mesh should be guaranteed. To address these +challenges, we propose a hierarchical optimization architecture to balance the +global and local shape features, and propose further various strategies and +losses to improve properties of accuracy, smoothness, rigidity, and so on. +Extensive experiments show that our 3Deformer is able to produce impressive +results and reaches the state-of-the-art level.",cs.CV,['cs.CV'] +MAGICK: A Large-scale Captioned Dataset from Matting Generated Images using Chroma Keying,Ryan Burgert · Brian Price · Jason Kuen · Yijun Li · Michael Ryoo,https://ryanndagreat.github.io/MAGICK,https://arxiv.org/abs/2307.10350,,2307.10350.pdf,Improving Multimodal Datasets with Image Captioning,"Massive web datasets play a key role in the success of large vision-language +models like CLIP and Flamingo. However, the raw web data is noisy, and existing +filtering methods to reduce noise often come at the expense of data diversity. +Our work focuses on caption quality as one major source of noise, and studies +how generated captions can increase the utility of web-scraped datapoints with +nondescript text. Through exploring different mixing strategies for raw and +generated captions, we outperform the best filtering method proposed by the +DataComp benchmark by 2% on ImageNet and 4% on average across 38 tasks, given a +candidate pool of 128M image-text pairs. Our best approach is also 2x better at +Flickr and MS-COCO retrieval. We then analyze what makes synthetic captions an +effective source of text supervision. In experimenting with different image +captioning models, we also demonstrate that the performance of a model on +standard image captioning benchmarks (e.g., NoCaps CIDEr) is not a reliable +indicator of the utility of the captions it generates for multimodal training. +Finally, our experiments with using generated captions at DataComp's large +scale (1.28B image-text pairs) offer insights into the limitations of synthetic +text, as well as the importance of image curation with increasing training data +quantity. 
The synthetic captions used in our experiments are now available on +HuggingFace.",cs.LG,"['cs.LG', 'cs.CV']" +Generative Latent Coding for Ultra-Low Bitrate Image Compression,Zhaoyang Jia · Jiahao Li · Bin Li · Houqiang Li · Yan Lu, ,https://arxiv.org/abs/2403.03736,,2403.03736.pdf,Unifying Generation and Compression: Ultra-low bitrate Image Coding Via Multi-stage Transformer,"Recent progress in generative compression technology has significantly +improved the perceptual quality of compressed data. However, these advancements +primarily focus on producing high-frequency details, often overlooking the +ability of generative models to capture the prior distribution of image +content, thus impeding further bitrate reduction in extreme compression +scenarios (<0.05 bpp). Motivated by the capabilities of predictive language +models for lossless compression, this paper introduces a novel Unified Image +Generation-Compression (UIGC) paradigm, merging the processes of generation and +compression. A key feature of the UIGC framework is the adoption of +vector-quantized (VQ) image models for tokenization, alongside a multi-stage +transformer designed to exploit spatial contextual information for modeling the +prior distribution. As such, the dual-purpose framework effectively utilizes +the learned prior for entropy estimation and assists in the regeneration of +lost tokens. Extensive experiments demonstrate the superiority of the proposed +UIGC framework over existing codecs in perceptual quality and human perception, +particularly in ultra-low bitrate scenarios (<=0.03 bpp), pioneering a new +direction in generative compression.",cs.CV,"['cs.CV', 'cs.LG', 'eess.IV']" +Makeup Prior Models for 3D Facial Makeup Estimation and Applications,Xingchao Yang · Takafumi Taketomi · Yuki Endo · Yoshihiro Kanamori,https://yangxingchao.github.io/makeup-priors-page/,https://arxiv.org/abs/2403.17761,,2403.17761.pdf,Makeup Prior Models for 3D Facial Makeup Estimation and Applications,"In this work, we introduce two types of makeup prior models to extend +existing 3D face prior models: PCA-based and StyleGAN2-based priors. The +PCA-based prior model is a linear model that is easy to construct and is +computationally efficient. However, it retains only low-frequency information. +Conversely, the StyleGAN2-based model can represent high-frequency information +with relatively higher computational cost than the PCA-based model. Although +there is a trade-off between the two models, both are applicable to 3D facial +makeup estimation and related applications. By leveraging makeup prior models +and designing a makeup consistency module, we effectively address the +challenges that previous methods faced in robustly estimating makeup, +particularly in the context of handling self-occluded faces. In experiments, we +demonstrate that our approach reduces computational costs by several orders of +magnitude, achieving speeds up to 180 times faster. 
In addition, by improving +the accuracy of the estimated makeup, we confirm that our methods are highly +advantageous for various 3D facial makeup applications such as 3D makeup face +reconstruction, user-friendly makeup editing, makeup transfer, and +interpolation.",cs.CV,"['cs.CV', 'cs.GR']" +Asymmetric Masked Distillation for Pre-Training Small Foundation Models,Zhiyu Zhao · Bingkun Huang · Sen Xing · Gangshan Wu · Yu Qiao · Limin Wang, ,https://arxiv.org/abs/2311.03149,,,Asymmetric Masked Distillation for Pre-Training Small Foundation Models,"Self-supervised foundation models have shown great potential in computer +vision thanks to the pre-training paradigm of masked autoencoding. Scale is a +primary factor influencing the performance of these foundation models. However, +these large foundation models often result in high computational cost. This +paper focuses on pre-training relatively small vision transformer models that +could be efficiently adapted to downstream tasks. Specifically, taking +inspiration from knowledge distillation in model compression, we propose a new +asymmetric masked distillation (AMD) framework for pre-training relatively +small models with autoencoding. The core of AMD is to devise an asymmetric +masking strategy, where the teacher model is enabled to see more context +information with a lower masking ratio, while the student model is still +equipped with a high masking ratio. We design customized multi-layer feature +alignment between the teacher encoder and student encoder to regularize the +pre-training of student MAE. To demonstrate the effectiveness and versatility +of AMD, we apply it to both ImageMAE and VideoMAE for pre-training relatively +small ViT models. AMD achieved 84.6% classification accuracy on IN1K using the +ViT-B model. And AMD achieves 73.3% classification accuracy using the ViT-B +model on the Something-in-Something V2 dataset, a 3.7% improvement over the +original ViT-B model from VideoMAE. We also transfer AMD pre-trained models to +downstream tasks and obtain consistent performance improvement over the +original masked autoencoding. The code and models are available at +https://github.com/MCG-NJU/AMD.",cs.CV,['cs.CV'] +Uncertainty-Aware Source-Free Adaptive Image Super-Resolution with Wavelet Augmentation Transformer,Yuang Ai · Xiaoqiang Zhou · Huaibo Huang · Lei Zhang · Ran He, ,https://arxiv.org/abs/2404.11273,,2404.11273.pdf,Training Transformer Models by Wavelet Losses Improves Quantitative and Visual Performance in Single Image Super-Resolution,"Transformer-based models have achieved remarkable results in low-level vision +tasks including image super-resolution (SR). However, early Transformer-based +approaches that rely on self-attention within non-overlapping windows encounter +challenges in acquiring global information. To activate more input pixels +globally, hybrid attention models have been proposed. Moreover, training by +solely minimizing pixel-wise RGB losses, such as L1, have been found inadequate +for capturing essential high-frequency details. This paper presents two +contributions: i) We introduce convolutional non-local sparse attention (NLSA) +blocks to extend the hybrid transformer architecture in order to further +enhance its receptive field. ii) We employ wavelet losses to train Transformer +models to improve quantitative and subjective performance. While wavelet losses +have been explored previously, showing their power in training +Transformer-based SR models is novel. 
Our experimental results demonstrate that +the proposed model provides state-of-the-art PSNR results as well as superior +visual performance across various benchmark datasets.",eess.IV,"['eess.IV', 'cs.CV']" +AHIVE: Anatomy-aware Hierarchical Vision Encoding for Interactive Radiology Report Retrieval,Sixing Yan · William K. Cheung · Ivor Tsang · Wan Hang Keith Chiu · Tong Terence · Ka Chun Cheung · Simon See, ,,https://www.a-star.edu.sg/cfar/research/publications,,,,,nan +Unveiling the Unknown: Unleashing the Power of Unknown to Known in Open-Set Source-Free Domain Adaptation,Fuli Wan · Han Zhao · Xu Yang · Cheng Deng, ,https://arxiv.org/abs/2312.03767,,2312.03767.pdf,Unknown Sample Discovery for Source Free Open Set Domain Adaptation,"Open Set Domain Adaptation (OSDA) aims to adapt a model trained on a source +domain to a target domain that undergoes distribution shift and contains +samples from novel classes outside the source domain. Source-free OSDA +(SF-OSDA) techniques eliminate the need to access source domain samples, but +current SF-OSDA methods utilize only the known classes in the target domain for +adaptation, and require access to the entire target domain even during +inference after adaptation, to make the distinction between known and unknown +samples. In this paper, we introduce Unknown Sample Discovery (USD) as an +SF-OSDA method that utilizes a temporally ensembled teacher model to conduct +known-unknown target sample separation and adapts the student model to the +target domain over all classes using co-training and temporal consistency +between the teacher and the student. USD promotes Jensen-Shannon distance (JSD) +as an effective measure for known-unknown sample separation. Our +teacher-student framework significantly reduces error accumulation resulting +from imperfect known-unknown sample separation, while curriculum guidance helps +to reliably learn the distinction between target known and target unknown +subspaces. USD appends the target model with an unknown class node, thus +readily classifying a target sample into any of the known or unknown classes in +subsequent post-adaptation inference stages. Empirical results show that USD is +superior to existing SF-OSDA methods and is competitive with current OSDA +models that utilize both source and target domains during adaptation.",cs.CV,"['cs.CV', 'cs.AI']" +Classes Are Not Equal: An Empirical Study on Image Recognition Fairness,Jiequan Cui · Beier Zhu · Xin Wen · Xiaojuan Qi · Bei Yu · Hanwang Zhang, ,https://arxiv.org/abs/2402.18133,,2402.18133.pdf,Classes Are Not Equal: An Empirical Study on Image Recognition Fairness,"In this paper, we present an empirical study on image recognition fairness, +i.e., extreme class accuracy disparity on balanced data like ImageNet. We +experimentally demonstrate that classes are not equal and the fairness issue is +prevalent for image classification models across various datasets, network +architectures, and model capacities. Moreover, several intriguing properties of +fairness are identified. First, the unfairness lies in problematic +representation rather than classifier bias. Second, with the proposed concept +of Model Prediction Bias, we investigate the origins of problematic +representation during optimization. Our findings reveal that models tend to +exhibit greater prediction biases for classes that are more challenging to +recognize. It means that more other classes will be confused with harder +classes. 
Then the False Positives (FPs) will dominate the learning in +optimization, thus leading to their poor accuracy. Further, we conclude that +data augmentation and representation learning algorithms improve overall +performance by promoting fairness to some degree in image classification. The +Code is available at +https://github.com/dvlab-research/Parametric-Contrastive-Learning.",cs.LG,"['cs.LG', 'cs.CV']" +The devil is in the fine-grained details: Evaluating open-vocabulary object detectors for fine-grained understanding,Lorenzo Bianchi · Fabio Carrara · Nicola Messina · Claudio Gennaro · Fabrizio Falchi,https://lorebianchi98.github.io/FG-OVD/,https://arxiv.org/abs/2311.17518v2,,2311.17518v2.pdf,The devil is in the fine-grained details: Evaluating open-vocabulary object detectors for fine-grained understanding,"Recent advancements in large vision-language models enabled visual object +detection in open-vocabulary scenarios, where object classes are defined in +free-text formats during inference. In this paper, we aim to probe the +state-of-the-art methods for open-vocabulary object detection to determine to +what extent they understand fine-grained properties of objects and their parts. +To this end, we introduce an evaluation protocol based on dynamic vocabulary +generation to test whether models detect, discern, and assign the correct +fine-grained description to objects in the presence of hard-negative classes. +We contribute with a benchmark suite of increasing difficulty and probing +different properties like color, pattern, and material. We further enhance our +investigation by evaluating several state-of-the-art open-vocabulary object +detectors using the proposed protocol and find that most existing solutions, +which shine in standard open-vocabulary benchmarks, struggle to accurately +capture and distinguish finer object details. We conclude the paper by +highlighting the limitations of current methodologies and exploring promising +research directions to overcome the discovered drawbacks. Data and code are +available at https://lorebianchi98.github.io/FG-OVD/.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +Style Aligned Image Generation via Shared Attention,Amir Hertz · Andrey Voynov · Shlomi Fruchter · Daniel Cohen-Or,https://style-aligned-gen.github.io/,https://arxiv.org/abs/2312.02133v1,,2312.02133v1.pdf,Style Aligned Image Generation via Shared Attention,"Large-scale Text-to-Image (T2I) models have rapidly gained prominence across +creative fields, generating visually compelling outputs from textual prompts. +However, controlling these models to ensure consistent style remains +challenging, with existing methods necessitating fine-tuning and manual +intervention to disentangle content and style. In this paper, we introduce +StyleAligned, a novel technique designed to establish style alignment among a +series of generated images. By employing minimal `attention sharing' during the +diffusion process, our method maintains style consistency across images within +T2I models. This approach allows for the creation of style-consistent images +using a reference style through a straightforward inversion operation. 
Our +method's evaluation across diverse styles and text prompts demonstrates +high-quality synthesis and fidelity, underscoring its efficacy in achieving +consistent style across various inputs.",cs.CV,"['cs.CV', 'cs.GR', 'cs.LG']" +"Multimodal Prompt Perceiver: Empower Adaptiveness, Generalizability and Fidelity for All-in-One Image Restoration",Yuang Ai · Huaibo Huang · Xiaoqiang Zhou · Jiexiang Wang · Ran He, ,https://arxiv.org/abs/2312.02918v2,,2312.02918v2.pdf,"Multimodal Prompt Perceiver: Empower Adaptiveness, Generalizability and Fidelity for All-in-One Image Restoration","Despite substantial progress, all-in-one image restoration (IR) grapples with +persistent challenges in handling intricate real-world degradations. This paper +introduces MPerceiver: a novel multimodal prompt learning approach that +harnesses Stable Diffusion (SD) priors to enhance adaptiveness, +generalizability and fidelity for all-in-one image restoration. Specifically, +we develop a dual-branch module to master two types of SD prompts: textual for +holistic representation and visual for multiscale detail representation. Both +prompts are dynamically adjusted by degradation predictions from the CLIP image +encoder, enabling adaptive responses to diverse unknown degradations. Moreover, +a plug-in detail refinement module improves restoration fidelity via direct +encoder-to-decoder information transformation. To assess our method, MPerceiver +is trained on 9 tasks for all-in-one IR and outperforms state-of-the-art +task-specific methods across most tasks. Post multitask pre-training, +MPerceiver attains a generalized representation in low-level vision, exhibiting +remarkable zero-shot and few-shot capabilities in unseen tasks. Extensive +experiments on 16 IR tasks underscore the superiority of MPerceiver in terms of +adaptiveness, generalizability and fidelity.",cs.CV,['cs.CV'] +FMA-Net: Flow Guided Dynamic Filtering and Iterative Feature Refinement with Multi-Attention for Joint Video Super-Resolution and Deblurring,Geunhyuk Youk · Jihyong Oh · Munchurl Kim,https://kaist-viclab.github.io/fmanet-site,https://arxiv.org/abs/2401.03707,,2401.03707.pdf,FMA-Net: Flow-Guided Dynamic Filtering and Iterative Feature Refinement with Multi-Attention for Joint Video Super-Resolution and Deblurring,"We present a joint learning scheme of video super-resolution and deblurring, +called VSRDB, to restore clean high-resolution (HR) videos from blurry +low-resolution (LR) ones. This joint restoration problem has drawn much less +attention compared to single restoration problems. In this paper, we propose a +novel flow-guided dynamic filtering (FGDF) and iterative feature refinement +with multi-attention (FRMA), which constitutes our VSRDB framework, denoted as +FMA-Net. Specifically, our proposed FGDF enables precise estimation of both +spatio-temporally-variant degradation and restoration kernels that are aware of +motion trajectories through sophisticated motion representation learning. +Compared to conventional dynamic filtering, the FGDF enables the FMA-Net to +effectively handle large motions into the VSRDB. Additionally, the stacked FRMA +blocks trained with our novel temporal anchor (TA) loss, which temporally +anchors and sharpens features, refine features in a course-to-fine manner +through iterative updates. Extensive experiments demonstrate the superiority of +the proposed FMA-Net over state-of-the-art methods in terms of both +quantitative and qualitative quality. 
Codes and pre-trained models are +available at: https://kaist-viclab.github.io/fmanet-site",cs.CV,['cs.CV'] +Device-Wise Federated Network Pruning,Shangqian Gao · Junyi Li · Zeyu Zhang · Yanfu Zhang · Weidong Cai · Heng Huang, ,,https://lijunyi95.github.io/publications/,,,,,nan +Differentiable Display Photometric Stereo,Seokjun Choi · Seungwoo Yoon · Giljoo Nam · Seungyong Lee · Seung-Hwan Baek, ,https://arxiv.org/abs/2306.13325,,2306.13325.pdf,Differentiable Display Photometric Stereo,"Photometric stereo leverages variations in illumination conditions to +reconstruct surface normals. Display photometric stereo, which employs a +conventional monitor as an illumination source, has the potential to overcome +limitations often encountered in bulky and difficult-to-use conventional +setups. In this paper, we present differentiable display photometric stereo +(DDPS), addressing an often overlooked challenge in display photometric stereo: +the design of display patterns. Departing from using heuristic display +patterns, DDPS learns the display patterns that yield accurate normal +reconstruction for a target system in an end-to-end manner. To this end, we +propose a differentiable framework that couples basis-illumination image +formation with analytic photometric-stereo reconstruction. The differentiable +framework facilitates the effective learning of display patterns via +auto-differentiation. Also, for training supervision, we propose to use 3D +printing for creating a real-world training dataset, enabling accurate +reconstruction on the target real-world setup. Finally, we exploit that +conventional LCD monitors emit polarized light, which allows for the optical +separation of diffuse and specular reflections when combined with a +polarization camera, leading to accurate normal reconstruction. Extensive +evaluation of DDPS shows improved normal-reconstruction accuracy compared to +heuristic patterns and demonstrates compelling properties such as robustness to +pattern initialization, calibration errors, and simplifications in image +formation and reconstruction.",cs.CV,['cs.CV'] +"Slice3D: Multi-Slice, Occlusion-Revealing, Single View 3D Reconstruction",Yizhi Wang · Wallace Lira · Wenqi Wang · Ali Mahdavi Amiri · Hao Zhang,https://yizhiwang96.github.io/Slice3D/,https://arxiv.org/abs/2312.02221,,2312.02221.pdf,"Slice3D: Multi-Slice, Occlusion-Revealing, Single View 3D Reconstruction","We introduce multi-slice reasoning, a new notion for single-view 3D +reconstruction which challenges the current and prevailing belief that +multi-view synthesis is the most natural conduit between single-view and 3D. +Our key observation is that object slicing is more advantageous than altering +views to reveal occluded structures. Specifically, slicing is more +occlusion-revealing since it can peel through any occluders without +obstruction. In the limit, i.e., with infinitely many slices, it is guaranteed +to unveil all hidden object parts. We realize our idea by developing Slice3D, a +novel method for single-view 3D reconstruction which first predicts multi-slice +images from a single RGB image and then integrates the slices into a 3D model +using a coordinate-based transformer network for signed distance prediction. +The slice images can be regressed or generated, both through a U-Net based +network. 
For the former, we inject a learnable slice indicator code to +designate each decoded image into a spatial slice location, while the slice +generator is a denoising diffusion model operating on the entirety of slice +images stacked on the input channels. We conduct extensive evaluation against +state-of-the-art alternatives to demonstrate superiority of our method, +especially in recovering complex and severely occluded shape structures, amid +ambiguities. All Slice3D results were produced by networks trained on a single +Nvidia A40 GPU, with an inference time less than 20 seconds.",cs.CV,"['cs.CV', 'cs.GR']" +Cyclic Learning for Binaural Audio Generation and Localization,Zhaojian Li · Bin Zhao · Yuan Yuan, ,https://arxiv.org/abs/2311.07630,,2311.07630.pdf,Cross-modal Generative Model for Visual-Guided Binaural Stereo Generation,"Binaural stereo audio is recorded by imitating the way the human ear receives +sound, which provides people with an immersive listening experience. Existing +approaches leverage autoencoders and directly exploit visual spatial +information to synthesize binaural stereo, resulting in a limited +representation of visual guidance. For the first time, we propose a visually +guided generative adversarial approach for generating binaural stereo audio +from mono audio. Specifically, we develop a Stereo Audio Generation Model +(SAGM), which utilizes shared spatio-temporal visual information to guide the +generator and the discriminator to work separately. The shared visual +information is updated alternately in the generative adversarial stage, +allowing the generator and discriminator to deliver their respective guided +knowledge while visually sharing. The proposed method learns bidirectional +complementary visual information, which facilitates the expression of visual +guidance in generation. In addition, spatial perception is a crucial attribute +of binaural stereo audio, and thus the evaluation of stereo spatial perception +is essential. However, previous metrics failed to measure the spatial +perception of audio. To this end, a metric to measure the spatial perception of +audio is proposed for the first time. The proposed metric is capable of +measuring the magnitude and direction of spatial perception in the temporal +dimension. Further, considering its function, it is feasible to utilize it +instead of demanding user studies to some extent. The proposed method achieves +state-of-the-art performance on 2 datasets and 5 evaluation metrics. +Qualitative experiments and user studies demonstrate that the method generates +space-realistic stereo audio.",cs.SD,"['cs.SD', 'cs.CV', 'cs.LG', 'eess.AS']" +OST: Refining Text Knowledge with Optimal Spatio-Temporal Descriptor for General Video Recognition,Tongjia Chen · Hongshan Yu · Zhengeng Yang · Zechuan Li · Wei Sun · Chen Chen,https://tomchen-ctj.github.io/OST/,https://arxiv.org/abs/2312.00096,,2312.00096.pdf,OST: Refining Text Knowledge with Optimal Spatio-Temporal Descriptor for General Video Recognition,"Due to the resource-intensive nature of training vision-language models on +expansive video data, a majority of studies have centered on adapting +pre-trained image-language models to the video domain. Dominant pipelines +propose to tackle the visual discrepancies with additional temporal learners +while overlooking the substantial discrepancy for web-scaled descriptive +narratives and concise action category names, leading to less distinct semantic +space and potential performance limitations. 
In this work, we prioritize the +refinement of text knowledge to facilitate generalizable video recognition. To +address the limitations of the less distinct semantic space of category names, +we prompt a large language model (LLM) to augment action class names into +Spatio-Temporal Descriptors thus bridging the textual discrepancy and serving +as a knowledge base for general recognition. Moreover, to assign the best +descriptors with different video instances, we propose Optimal Descriptor +Solver, forming the video recognition problem as solving the optimal matching +flow across frame-level representations and descriptors. Comprehensive +evaluations in zero-shot, few-shot, and fully supervised video recognition +highlight the effectiveness of our approach. Our best model achieves a +state-of-the-art zero-shot accuracy of 75.1% on Kinetics-600.",cs.CV,['cs.CV'] +Visual Objectification in Films: Towards a New AI Task for Video Interpretation,Julie Tores · Lucile Sassatelli · Hui-Yin Wu · Clement Bergman · Léa Andolfi · Victor Ecrement · Frederic Precioso · Thierry Devars · Magali GUARESI · Virginie Julliard · Sarah Lécossais, ,https://arxiv.org/abs/2401.13296,,2401.13296.pdf,Visual Objectification in Films: Towards a New AI Task for Video Interpretation,"In film gender studies, the concept of 'male gaze' refers to the way the +characters are portrayed on-screen as objects of desire rather than subjects. +In this article, we introduce a novel video-interpretation task, to detect +character objectification in films. The purpose is to reveal and quantify the +usage of complex temporal patterns operated in cinema to produce the cognitive +perception of objectification. We introduce the ObyGaze12 dataset, made of 1914 +movie clips densely annotated by experts for objectification concepts +identified in film studies and psychology. We evaluate recent vision models, +show the feasibility of the task and where the challenges remain with concept +bottleneck models. Our new dataset and code are made available to the +community.",cs.CV,['cs.CV'] +Bilateral Event Mining and Complementary for Event Stream Super-Resolution,Zhilin Huang · Quanmin Liang · Yijie Yu · Chujun Qin · Xiawu Zheng · Kai Huang · Zikun Zhou · Wenming Yang, ,https://arxiv.org/abs/2405.10037v1,,2405.10037v1.pdf,Bilateral Event Mining and Complementary for Event Stream Super-Resolution,"Event Stream Super-Resolution (ESR) aims to address the challenge of +insufficient spatial resolution in event streams, which holds great +significance for the application of event cameras in complex scenarios. +Previous works for ESR often process positive and negative events in a mixed +paradigm. This paradigm limits their ability to effectively model the unique +characteristics of each event and mutually refine each other by considering +their correlations. In this paper, we propose a bilateral event mining and +complementary network (BMCNet) to fully leverage the potential of each event +and capture the shared information to complement each other simultaneously. +Specifically, we resort to a two-stream network to accomplish comprehensive +mining of each type of events individually. To facilitate the exchange of +information between two streams, we propose a bilateral information exchange +(BIE) module. This module is layer-wisely embedded between two streams, +enabling the effective propagation of hierarchical global information while +alleviating the impact of invalid information brought by inherent +characteristics of events. 
The experimental results demonstrate that our
+approach outperforms the previous state-of-the-art methods in ESR, achieving
+performance improvements of over 11% on both real and synthetic datasets.
+Moreover, our method significantly enhances the performance of event-based
+downstream tasks such as object recognition and video reconstruction. Our code
+is available at https://github.com/Lqm26/BMCNet-ESR.",cs.CV,['cs.CV']
+Instance-Aware Group Quantization for Vision Transformers,Jaehyeon Moon · Dohyung Kim · Jun Yong Cheon · Bumsub Ham,https://cvlab.yonsei.ac.kr/projects/IGQ-ViT/,https://arxiv.org/abs/2404.00928,,2404.00928.pdf,Instance-Aware Group Quantization for Vision Transformers,"Post-training quantization (PTQ) is an efficient model compression technique
+that quantizes a pretrained full-precision model using only a small calibration
+set of unlabeled samples without retraining. PTQ methods for convolutional
+neural networks (CNNs) provide quantization results comparable to
+full-precision counterparts. Directly applying them to vision transformers
+(ViTs), however, incurs severe performance degradation, mainly due to the
+differences in architectures between CNNs and ViTs. In particular, the
+distribution of activations for each channel vary drastically according to
+input instances, making PTQ methods for CNNs inappropriate for ViTs. To address
+this, we introduce instance-aware group quantization for ViTs (IGQ-ViT). To
+this end, we propose to split the channels of activation maps into multiple
+groups dynamically for each input instance, such that activations within each
+group share similar statistical properties. We also extend our scheme to
+quantize softmax attentions across tokens. In addition, the number of groups
+for each layer is adjusted to minimize the discrepancies between predictions
+from quantized and full-precision models, under a bit-operation (BOP)
+constraint. We show extensive experimental results on image classification,
+object detection, and instance segmentation, with various transformer
+architectures, demonstrating the effectiveness of our approach.",cs.CV,"['cs.CV', 'cs.LG']"
+MLP Can Be A Good Transformer Learner,Sihao Lin · Pumeng Lyu · Dongrui Liu · Tao Tang · Xiaodan Liang · Andy Song · Xiaojun Chang, ,https://arxiv.org/abs/2404.05657,,2404.05657.pdf,MLP Can Be A Good Transformer Learner,"Self-attention mechanism is the key of the Transformer but often criticized
+for its computation demands. Previous token pruning works motivate their
+methods from the view of computation redundancy but still need to load the full
+network and require same memory costs. This paper introduces a novel strategy
+that simplifies vision transformers and reduces computational load through the
+selective removal of non-essential attention layers, guided by entropy
+considerations. We identify that regarding the attention layer in bottom
+blocks, their subsequent MLP layers, i.e. two feed-forward layers, can elicit
+the same entropy quantity. Meanwhile, the accompanied MLPs are under-exploited
+since they exhibit smaller feature entropy compared to those MLPs in the top
+blocks. Therefore, we propose to integrate the uninformative attention layers
+into their subsequent counterparts by degenerating them into identical mapping,
+yielding only MLP in certain transformer blocks. Experimental results on
+ImageNet-1k show that the proposed method can remove 40% attention layer of
+DeiT-B, improving throughput and memory bound without performance compromise.
+Code is available at https://github.com/sihaoevery/lambda_vit.",cs.CV,['cs.CV'] +BiPer: Binary Neural Networks using a Periodic Function,Edwin Vargas · Claudia Correa · Carlos Hinojosa · Henry Arguello, ,https://arxiv.org/abs/2404.01278,,2404.01278.pdf,BiPer: Binary Neural Networks using a Periodic Function,"Quantized neural networks employ reduced precision representations for both +weights and activations. This quantization process significantly reduces the +memory requirements and computational complexity of the network. Binary Neural +Networks (BNNs) are the extreme quantization case, representing values with +just one bit. Since the sign function is typically used to map real values to +binary values, smooth approximations are introduced to mimic the gradients +during error backpropagation. Thus, the mismatch between the forward and +backward models corrupts the direction of the gradient, causing training +inconsistency problems and performance degradation. In contrast to current BNN +approaches, we propose to employ a binary periodic (BiPer) function during +binarization. Specifically, we use a square wave for the forward pass to obtain +the binary values and employ the trigonometric sine function with the same +period of the square wave as a differentiable surrogate during the backward +pass. We demonstrate that this approach can control the quantization error by +using the frequency of the periodic function and improves network performance. +Extensive experiments validate the effectiveness of BiPer in benchmark datasets +and network architectures, with improvements of up to 1% and 0.69% with respect +to state-of-the-art methods in the classification task over CIFAR-10 and +ImageNet, respectively. Our code is publicly available at +https://github.com/edmav4/BiPer.",cs.CV,['cs.CV'] +Task-aligned Part-aware Panoptic Segmentation through Joint Object-Part Representations,Daan de Geus · Gijs Dubbelman,https://www.tue-mps.org/tapps/,https://arxiv.org/abs/2311.18618,,2311.18618.pdf,JPPF: Multi-task Fusion for Consistent Panoptic-Part Segmentation,"Part-aware panoptic segmentation is a problem of computer vision that aims to +provide a semantic understanding of the scene at multiple levels of +granularity. More precisely, semantic areas, object instances, and semantic +parts are predicted simultaneously. In this paper, we present our Joint +Panoptic Part Fusion (JPPF) that combines the three individual segmentations +effectively to obtain a panoptic-part segmentation. Two aspects are of utmost +importance for this: First, a unified model for the three problems is desired +that allows for mutually improved and consistent representation learning. +Second, balancing the combination so that it gives equal importance to all +individual results during fusion. Our proposed JPPF is parameter-free and +dynamically balances its input. The method is evaluated and compared on the +Cityscapes Panoptic Parts (CPP) and Pascal Panoptic Parts (PPP) datasets in +terms of PartPQ and Part-Whole Quality (PWQ). 
In extensive experiments, we +verify the importance of our fair fusion, highlight its most significant impact +for areas that can be further segmented into parts, and demonstrate the +generalization capabilities of our design without fine-tuning on 5 additional +datasets.",cs.CV,['cs.CV'] +CaKDP: Category-aware Knowledge Distillation and Pruning Framework for Lightweight 3D Object Detection,Haonan Zhang · Longjun Liu · Yuqi Huang · YangZhao · Xinyu Lei · Bihan Wen, ,,https://github.com/zhnxjtu/CaKDP,,,,,nan +Bilateral Propagation Network for Depth Completion,Jie Tang · Fei-Peng Tian · Boshi An · Jian Li · Ping Tan, ,https://arxiv.org/abs/2403.11270,,2403.11270.pdf,Bilateral Propagation Network for Depth Completion,"Depth completion aims to derive a dense depth map from sparse depth +measurements with a synchronized color image. Current state-of-the-art (SOTA) +methods are predominantly propagation-based, which work as an iterative +refinement on the initial estimated dense depth. However, the initial depth +estimations mostly result from direct applications of convolutional layers on +the sparse depth map. In this paper, we present a Bilateral Propagation Network +(BP-Net), that propagates depth at the earliest stage to avoid directly +convolving on sparse data. Specifically, our approach propagates the target +depth from nearby depth measurements via a non-linear model, whose coefficients +are generated through a multi-layer perceptron conditioned on both +\emph{radiometric difference} and \emph{spatial distance}. By integrating +bilateral propagation with multi-modal fusion and depth refinement in a +multi-scale framework, our BP-Net demonstrates outstanding performance on both +indoor and outdoor scenes. It achieves SOTA on the NYUv2 dataset and ranks 1st +on the KITTI depth completion benchmark at the time of submission. Experimental +results not only show the effectiveness of bilateral propagation but also +emphasize the significance of early-stage propagation in contrast to the +refinement stage. Our code and trained models will be available on the project +page.",cs.CV,['cs.CV'] +SNIFFER: Multimodal Large Language Model for Explainable Out-of-Context Misinformation Detection,Peng Qi · Zehong Yan · Wynne Hsu · Mong Li Lee,https://pengqi.site/Sniffer/,https://arxiv.org/abs/2403.03170,,2403.03170.pdf,SNIFFER: Multimodal Large Language Model for Explainable Out-of-Context Misinformation Detection,"Misinformation is a prevalent societal issue due to its potential high risks. +Out-of-context (OOC) misinformation, where authentic images are repurposed with +false text, is one of the easiest and most effective ways to mislead audiences. +Current methods focus on assessing image-text consistency but lack convincing +explanations for their judgments, which is essential for debunking +misinformation. While Multimodal Large Language Models (MLLMs) have rich +knowledge and innate capability for visual reasoning and explanation +generation, they still lack sophistication in understanding and discovering the +subtle crossmodal differences. In this paper, we introduce SNIFFER, a novel +multimodal large language model specifically engineered for OOC misinformation +detection and explanation. SNIFFER employs two-stage instruction tuning on +InstructBLIP. The first stage refines the model's concept alignment of generic +objects with news-domain entities and the second stage leverages language-only +GPT-4 generated OOC-specific instruction data to fine-tune the model's +discriminatory powers. 
Enhanced by external tools and retrieval, SNIFFER not +only detects inconsistencies between text and image but also utilizes external +knowledge for contextual verification. Our experiments show that SNIFFER +surpasses the original MLLM by over 40% and outperforms state-of-the-art +methods in detection accuracy. SNIFFER also provides accurate and persuasive +explanations as validated by quantitative and human evaluations.",cs.MM,"['cs.MM', 'cs.AI', 'cs.CL', 'cs.CV', 'cs.CY']" +Semantic-aware SAM for Point-Prompted Instance Segmentation,Zhaoyang Wei · Pengfei Chen · Xuehui Yu · Guorong Li · Jianbin Jiao · Zhenjun Han, ,https://arxiv.org/abs/2312.15895,,2312.15895.pdf,Semantic-aware SAM for Point-Prompted Instance Segmentation,"Single-point annotation in visual tasks, with the goal of minimizing +labelling costs, is becoming increasingly prominent in research. Recently, +visual foundation models, such as Segment Anything (SAM), have gained +widespread usage due to their robust zero-shot capabilities and exceptional +annotation performance. However, SAM's class-agnostic output and high +confidence in local segmentation introduce 'semantic ambiguity', posing a +challenge for precise category-specific segmentation. In this paper, we +introduce a cost-effective category-specific segmenter using SAM. To tackle +this challenge, we have devised a Semantic-Aware Instance Segmentation Network +(SAPNet) that integrates Multiple Instance Learning (MIL) with matching +capability and SAM with point prompts. SAPNet strategically selects the most +representative mask proposals generated by SAM to supervise segmentation, with +a specific focus on object category information. Moreover, we introduce the +Point Distance Guidance and Box Mining Strategy to mitigate inherent +challenges: 'group' and 'local' issues in weakly supervised segmentation. These +strategies serve to further enhance the overall segmentation performance. The +experimental results on Pascal VOC and COCO demonstrate the promising +performance of our proposed SAPNet, emphasizing its semantic matching +capabilities and its potential to advance point-prompted instance segmentation. +The code will be made publicly available.",cs.CV,['cs.CV'] +Loopy-SLAM: Dense Neural SLAM with Loop Closures,Lorenzo Liso · Erik Sandström · Vladimir Yugay · Luc Van Gool · Martin R. Oswald, ,https://arxiv.org/abs/2402.09944,,2402.09944.pdf,Loopy-SLAM: Dense Neural SLAM with Loop Closures,"Neural RGBD SLAM techniques have shown promise in dense Simultaneous +Localization And Mapping (SLAM), yet face challenges such as error accumulation +during camera tracking resulting in distorted maps. In response, we introduce +Loopy-SLAM that globally optimizes poses and the dense 3D model. We use +frame-to-model tracking using a data-driven point-based submap generation +method and trigger loop closures online by performing global place recognition. +Robust pose graph optimization is used to rigidly align the local submaps. As +our representation is point based, map corrections can be performed efficiently +without the need to store the entire history of input frames used for mapping +as typically required by methods employing a grid based mapping structure. +Evaluation on the synthetic Replica and real-world TUM-RGBD and ScanNet +datasets demonstrate competitive or superior performance in tracking, mapping, +and rendering accuracy when compared to existing dense neural RGBD SLAM +methods. 
Project page: notchla.github.io/Loopy-SLAM.",cs.CV,['cs.CV'] +Aligning Logits Generatively for Principled Black-Box Knowledge Distillation,Jing Ma · Xiang Xiang · Ke Wang · Yuchuan Wu · Yongbin Li, ,https://arxiv.org/abs/2403.01427,,,Logit Standardization in Knowledge Distillation,"Knowledge distillation involves transferring soft labels from a teacher to a +student using a shared temperature-based softmax function. However, the +assumption of a shared temperature between teacher and student implies a +mandatory exact match between their logits in terms of logit range and +variance. This side-effect limits the performance of student, considering the +capacity discrepancy between them and the finding that the innate logit +relations of teacher are sufficient for student to learn. To address this +issue, we propose setting the temperature as the weighted standard deviation of +logit and performing a plug-and-play Z-score pre-process of logit +standardization before applying softmax and Kullback-Leibler divergence. Our +pre-process enables student to focus on essential logit relations from teacher +rather than requiring a magnitude match, and can improve the performance of +existing logit-based distillation methods. We also show a typical case where +the conventional setting of sharing temperature between teacher and student +cannot reliably yield the authentic distillation evaluation; nonetheless, this +challenge is successfully alleviated by our Z-score. We extensively evaluate +our method for various student and teacher models on CIFAR-100 and ImageNet, +showing its significant superiority. The vanilla knowledge distillation powered +by our pre-process can achieve favorable performance against state-of-the-art +methods, and other distillation variants can obtain considerable gain with the +assistance of our pre-process.",cs.CV,['cs.CV'] +Grid Diffusion Models for Text-to-Video Generation,Taegyeong Lee · Soyeong Kwon · Taehwan Kim,https://taegyeong-lee.github.io/text2video,https://arxiv.org/abs/2404.00234v1,,2404.00234v1.pdf,Grid Diffusion Models for Text-to-Video Generation,"Recent advances in the diffusion models have significantly improved +text-to-image generation. However, generating videos from text is a more +challenging task than generating images from text, due to the much larger +dataset and higher computational cost required. Most existing video generation +methods use either a 3D U-Net architecture that considers the temporal +dimension or autoregressive generation. These methods require large datasets +and are limited in terms of computational costs compared to text-to-image +generation. To tackle these challenges, we propose a simple but effective novel +grid diffusion for text-to-video generation without temporal dimension in +architecture and a large text-video paired dataset. We can generate a +high-quality video using a fixed amount of GPU memory regardless of the number +of frames by representing the video as a grid image. Additionally, since our +method reduces the dimensions of the video to the dimensions of the image, +various image-based methods can be applied to videos, such as text-guided video +manipulation from image manipulation. 
Our proposed method outperforms the +existing methods in both quantitative and qualitative evaluations, +demonstrating the suitability of our model for real-world video generation.",cs.CV,['cs.CV'] +Wonder3D: Single Image to 3D using Cross-Domain Diffusion,Xiaoxiao Long · Yuan-Chen Guo · Cheng Lin · Yuan Liu · Zhiyang Dou · Lingjie Liu · Yuexin Ma · Song-Hai Zhang · Marc Habermann · Christian Theobalt · Wenping Wang, ,https://arxiv.org/abs/2310.15008,,2310.15008.pdf,Wonder3D: Single Image to 3D using Cross-Domain Diffusion,"In this work, we introduce Wonder3D, a novel method for efficiently +generating high-fidelity textured meshes from single-view images.Recent methods +based on Score Distillation Sampling (SDS) have shown the potential to recover +3D geometry from 2D diffusion priors, but they typically suffer from +time-consuming per-shape optimization and inconsistent geometry. In contrast, +certain works directly produce 3D information via fast network inferences, but +their results are often of low quality and lack geometric details. To +holistically improve the quality, consistency, and efficiency of image-to-3D +tasks, we propose a cross-domain diffusion model that generates multi-view +normal maps and the corresponding color images. To ensure consistency, we +employ a multi-view cross-domain attention mechanism that facilitates +information exchange across views and modalities. Lastly, we introduce a +geometry-aware normal fusion algorithm that extracts high-quality surfaces from +the multi-view 2D representations. Our extensive evaluations demonstrate that +our method achieves high-quality reconstruction results, robust generalization, +and reasonably good efficiency compared to prior works.",cs.CV,['cs.CV'] +Towards High-fidelity Artistic Image Vectorization via Texture-Encapsulated Shape Parameterization,Ye Chen · Bingbing Ni · Jinfan Liu · Xiaoyang Huang · Xuanhong Chen, ,https://arxiv.org/abs/2308.13628,,2308.13628.pdf,HiFiHR: Enhancing 3D Hand Reconstruction from a Single Image via High-Fidelity Texture,"We present HiFiHR, a high-fidelity hand reconstruction approach that utilizes +render-and-compare in the learning-based framework from a single image, capable +of generating visually plausible and accurate 3D hand meshes while recovering +realistic textures. Our method achieves superior texture reconstruction by +employing a parametric hand model with predefined texture assets, and by +establishing a texture reconstruction consistency between the rendered and +input images during training. Moreover, based on pretraining the network on an +annotated dataset, we apply varying degrees of supervision using our pipeline, +i.e., self-supervision, weak supervision, and full supervision, and discuss the +various levels of contributions of the learned high-fidelity textures in +enhancing hand pose and shape estimation. Experimental results on public +benchmarks including FreiHAND and HO-3D demonstrate that our method outperforms +the state-of-the-art hand reconstruction methods in texture reconstruction +quality while maintaining comparable accuracy in pose and shape estimation. Our +code is available at https://github.com/viridityzhu/HiFiHR.",cs.CV,"['cs.CV', 'cs.AI']" +Diff-Plugin: Revitalizing Details for Diffusion-based Low-level Tasks,Yuhao Liu · Zhanghan Ke · Fang Liu · Nanxuan Zhao · Rynson W.H.
Lau, ,https://arxiv.org/abs/2403.00644,,2403.00644.pdf,Diff-Plugin: Revitalizing Details for Diffusion-based Low-level Tasks,"Diffusion models trained on large-scale datasets have achieved remarkable +progress in image synthesis. However, due to the randomness in the diffusion +process, they often struggle with handling diverse low-level tasks that require +details preservation. To overcome this limitation, we present a new Diff-Plugin +framework to enable a single pre-trained diffusion model to generate +high-fidelity results across a variety of low-level tasks. Specifically, we +first propose a lightweight Task-Plugin module with a dual branch design to +provide task-specific priors, guiding the diffusion process in preserving image +content. We then propose a Plugin-Selector that can automatically select +different Task-Plugins based on the text instruction, allowing users to edit +images by indicating multiple low-level tasks with natural language. We conduct +extensive experiments on 8 low-level vision tasks. The results demonstrate the +superiority of Diff-Plugin over existing methods, particularly in real-world +scenarios. Our ablations further validate that Diff-Plugin is stable, +schedulable, and supports robust training across different dataset sizes.",cs.CV,['cs.CV'] +Hunting Attributes: Context Prototype-Aware Learning for Weakly Supervised Semantic Segmentation,Feilong Tang · Zhongxing Xu · Zhaojun QU · Wei Feng · xingjian jiang · Zongyuan Ge, ,https://arxiv.org/abs/2403.07630,,2403.07630.pdf,Hunting Attributes: Context Prototype-Aware Learning for Weakly Supervised Semantic Segmentation,"Recent weakly supervised semantic segmentation (WSSS) methods strive to +incorporate contextual knowledge to improve the completeness of class +activation maps (CAM). In this work, we argue that the knowledge bias between +instances and contexts affects the capability of the prototype to sufficiently +understand instance semantics. Inspired by prototype learning theory, we +propose leveraging prototype awareness to capture diverse and fine-grained +feature attributes of instances. The hypothesis is that contextual prototypes +might erroneously activate similar and frequently co-occurring object +categories due to this knowledge bias. Therefore, we propose to enhance the +prototype representation ability by mitigating the bias to better capture +spatial coverage in semantic object regions. With this goal, we present a +Context Prototype-Aware Learning (CPAL) strategy, which leverages semantic +context to enrich instance comprehension. The core of this method is to +accurately capture intra-class variations in object features through +context-aware prototypes, facilitating the adaptation to the semantic +attributes of various instances. We design feature distribution alignment to +optimize prototype awareness, aligning instance feature distributions with +dense features. In addition, a unified training framework is proposed to +combine label-guided classification supervision and prototypes-guided +self-supervision. Experimental results on PASCAL VOC 2012 and MS COCO 2014 show +that CPAL significantly improves off-the-shelf methods and achieves +state-of-the-art performance. 
The project is available at +https://github.com/Barrett-python/CPAL.",cs.CV,"['cs.CV', 'cs.AI']" +Training-Free Open-Vocabulary Segmentation with Offline Diffusion-Augmented Prototype Generation,Luca Barsellotti · Roberto Amoroso · Marcella Cornia · Lorenzo Baraldi · Rita Cucchiara,https://aimagelab.github.io/freeda/,https://arxiv.org/abs/2404.06542,,2404.06542.pdf,Training-Free Open-Vocabulary Segmentation with Offline Diffusion-Augmented Prototype Generation,"Open-vocabulary semantic segmentation aims at segmenting arbitrary categories +expressed in textual form. Previous works have trained over large amounts of +image-caption pairs to enforce pixel-level multimodal alignments. However, +captions provide global information about the semantics of a given image but +lack direct localization of individual concepts. Further, training on +large-scale datasets inevitably brings significant computational costs. In this +paper, we propose FreeDA, a training-free diffusion-augmented method for +open-vocabulary semantic segmentation, which leverages the ability of diffusion +models to visually localize generated concepts and local-global similarities to +match class-agnostic regions with semantic classes. Our approach involves an +offline stage in which textual-visual reference embeddings are collected, +starting from a large set of captions and leveraging visual and semantic +contexts. At test time, these are queried to support the visual matching +process, which is carried out by jointly considering class-agnostic regions and +global semantic similarities. Extensive analyses demonstrate that FreeDA +achieves state-of-the-art performance on five datasets, surpassing previous +methods by more than 7.0 average points in terms of mIoU and without requiring +any training.",cs.CV,['cs.CV'] +TULIP: Multi-camera 3D Precision Assessment of Parkinson's Disease,Kyungdo Kim · Sihan Lyu · Sneha Mantri · Timothy DUNN, ,,https://www.nature.com/articles/s41746-023-00905-9,,,,,nan +ControlRoom3D: Room Generation using Semantic Controls,Jonas Schult · Sam Tsai · Lukas Höllein · Bichen Wu · Jialiang Wang · Chih-Yao Ma · Kunpeng Li · Xiaofang Wang · Felix Wimbauer · Zijian He · Peizhao Zhang · Bastian Leibe · Peter Vajda · Ji Hou,https://jonasschult.github.io/ControlRoom3D/,https://arxiv.org/abs/2312.05208,,2312.05208.pdf,ControlRoom3D: Room Generation using Semantic Proxy Rooms,"Manually creating 3D environments for AR/VR applications is a complex process +requiring expert knowledge in 3D modeling software. Pioneering works facilitate +this process by generating room meshes conditioned on textual style +descriptions. Yet, many of these automatically generated 3D meshes do not +adhere to typical room layouts, compromising their plausibility, e.g., by +placing several beds in one bedroom. To address these challenges, we present +ControlRoom3D, a novel method to generate high-quality room meshes. Central to +our approach is a user-defined 3D semantic proxy room that outlines a rough +room layout based on semantic bounding boxes and a textual description of the +overall room style. Our key insight is that when rendered to 2D, this 3D +representation provides valuable geometric and semantic information to control +powerful 2D models to generate 3D consistent textures and geometry that aligns +well with the proxy room. 
Backed up by an extensive study including +quantitative metrics and qualitative user evaluations, our method generates +diverse and globally plausible 3D room meshes, thus empowering users to design +3D rooms effortlessly without specialized knowledge.",cs.CV,['cs.CV'] +Passive Snapshot Coded Aperture Dual-Pixel RGB-D Imaging,Bhargav Ghanekar · Salman Siddique Khan · Pranav Sharma · Shreyas Singh · Vivek Boominathan · Kaushik Mitra · Ashok Veeraraghavan,https://shadowfax11.github.io/cads/,https://arxiv.org/abs/2402.18102,,2402.18102.pdf,Passive Snapshot Coded Aperture Dual-Pixel RGB-D Imaging,"Passive, compact, single-shot 3D sensing is useful in many application areas +such as microscopy, medical imaging, surgical navigation, and autonomous +driving where form factor, time, and power constraints can exist. Obtaining +RGB-D scene information over a short imaging distance, in an ultra-compact form +factor, and in a passive, snapshot manner is challenging. Dual-pixel (DP) +sensors are a potential solution to achieve the same. DP sensors collect light +rays from two different halves of the lens in two interleaved pixel arrays, +thus capturing two slightly different views of the scene, like a stereo camera +system. However, imaging with a DP sensor implies that the defocus blur size is +directly proportional to the disparity seen between the views. This creates a +trade-off between disparity estimation vs. deblurring accuracy. To improve this +trade-off effect, we propose CADS (Coded Aperture Dual-Pixel Sensing), in which +we use a coded aperture in the imaging lens along with a DP sensor. In our +approach, we jointly learn an optimal coded pattern and the reconstruction +algorithm in an end-to-end optimization setting. Our resulting CADS imaging +system demonstrates improvement of >1.5dB PSNR in all-in-focus (AIF) estimates +and 5-6% in depth estimation quality over naive DP sensing for a wide range of +aperture settings. Furthermore, we build the proposed CADS prototypes for DSLR +photography settings and in an endoscope and a dermoscope form factor. Our +novel coded dual-pixel sensing approach demonstrates accurate RGB-D +reconstruction results in simulations and real-world experiments in a passive, +snapshot, and compact manner.",eess.IV,"['eess.IV', 'cs.CV']" +Real-time 3D-aware Portrait Video Relighting,Ziqi Cai · Kaiwen Jiang · Shu-Yu Chen · Yu-Kun Lai · Hongbo Fu · Boxin Shi · Lin Gao,http://geometrylearning.com/VideoRelighting/,https://arxiv.org/html/2402.14000v1,,2402.14000v1.pdf,Real-time 3D-aware Portrait Editing from a Single Image,"This work presents 3DPE, a practical tool that can efficiently edit a face +image following given prompts, like reference images or text descriptions, in +the 3D-aware manner. To this end, a lightweight module is distilled from a 3D +portrait generator and a text-to-image model, which provide prior knowledge of +face geometry and open-vocabulary editing capability, respectively. Such a +design brings two compelling advantages over existing approaches. First, our +system achieves real-time editing with a feedforward network (i.e., ~0.04s per +image), over 100x faster than the second competitor. Second, thanks to the +powerful priors, our module could focus on the learning of editing-related +variations, such that it manages to handle various types of editing +simultaneously in the training phase and further supports fast adaptation to +user-specified novel types of editing during inference (e.g., with ~5min +fine-tuning per case). 
The code, the model, and the interface will be made +publicly available to facilitate future research.",cs.CV,['cs.CV'] +DrivingGaussian: Composite Gaussian Splatting for Surrounding Dynamic Autonomous Driving Scenes,Xiaoyu Zhou · Zhiwei Lin · Xiaojun Shan · Yongtao Wang · Deqing Sun · Ming-Hsuan Yang, ,https://arxiv.org/abs/2312.07920,,2312.07920.pdf,DrivingGaussian: Composite Gaussian Splatting for Surrounding Dynamic Autonomous Driving Scenes,"We present DrivingGaussian, an efficient and effective framework for +surrounding dynamic autonomous driving scenes. For complex scenes with moving +objects, we first sequentially and progressively model the static background of +the entire scene with incremental static 3D Gaussians. We then leverage a +composite dynamic Gaussian graph to handle multiple moving objects, +individually reconstructing each object and restoring their accurate positions +and occlusion relationships within the scene. We further use a LiDAR prior for +Gaussian Splatting to reconstruct scenes with greater details and maintain +panoramic consistency. DrivingGaussian outperforms existing methods in dynamic +driving scene reconstruction and enables photorealistic surround-view synthesis +with high-fidelity and multi-camera consistency. Our project page is at: +https://github.com/VDIGPKU/DrivingGaussian.",cs.CV,['cs.CV'] +A Study of Dropout-Induced Modality Bias on Robustness to Missing Video Frames for Audio-Visual Speech Recognition,Yusheng Dai · HangChen · Jun Du · Ruoyu Wang · shihao chen · Haotian Wang · Chin-Hui Lee,https://github.com/dalision/ModalBiasAVSR,https://arxiv.org/abs/2403.04245,,2403.04245.pdf,A Study of Dropout-Induced Modality Bias on Robustness to Missing Video Frames for Audio-Visual Speech Recognition,"Advanced Audio-Visual Speech Recognition (AVSR) systems have been observed to +be sensitive to missing video frames, performing even worse than +single-modality models. While applying the dropout technique to the video +modality enhances robustness to missing frames, it simultaneously results in a +performance loss when dealing with complete data input. In this paper, we +investigate this contrasting phenomenon from the perspective of modality bias +and reveal that an excessive modality bias on the audio caused by dropout is +the underlying reason. Moreover, we present the Modality Bias Hypothesis (MBH) +to systematically describe the relationship between modality bias and +robustness against missing modality in multimodal systems. Building on these +findings, we propose a novel Multimodal Distribution Approximation with +Knowledge Distillation (MDA-KD) framework to reduce over-reliance on the audio +modality and to maintain performance and robustness simultaneously. Finally, to +address an entirely missing modality, we adopt adapters to dynamically switch +decision strategies. The effectiveness of our proposed approach is evaluated +and validated through a series of comprehensive experiments using the MISP2021 +and MISP2022 datasets. 
Our code is available at +https://github.com/dalision/ModalBiasAVSR",cs.SD,"['cs.SD', 'cs.CV', 'cs.LG', 'cs.MM', 'eess.AS']" +GPLD3D: Latent Diffusion of 3D Shape Generative Models by Enforcing Geometric and Physical Priors,Yuan Dong · Qi Zuo · Xiaodong Gu · Weihao Yuan · zhengyi zhao · Zilong Dong · Liefeng Bo · Qixing Huang, ,https://arxiv.org/abs/2401.17603,,2401.17603.pdf,Topology-Aware Latent Diffusion for 3D Shape Generation,"We introduce a new generative model that combines latent diffusion with +persistent homology to create 3D shapes with high diversity, with a special +emphasis on their topological characteristics. Our method involves representing +3D shapes as implicit fields, then employing persistent homology to extract +topological features, including Betti numbers and persistence diagrams. The +shape generation process consists of two steps. Initially, we employ a +transformer-based autoencoding module to embed the implicit representation of +each 3D shape into a set of latent vectors. Subsequently, we navigate through +the learned latent space via a diffusion model. By strategically incorporating +topological features into the diffusion process, our generative module is able +to produce a richer variety of 3D shapes with different topological structures. +Furthermore, our framework is flexible, supporting generation tasks constrained +by a variety of inputs, including sparse and partial point clouds, as well as +sketches. By modifying the persistence diagrams, we can alter the topology of +the shapes generated from these input modalities.",cs.CV,"['cs.CV', 'I.3.5; I.2.10']" +Shallow-Deep Collaborative Learning for Unsupervised Visible-Infrared Person Re-Identification,Bin Yang · Jun Chen · Mang Ye, ,,https://dl.acm.org/doi/10.1145/3581783.3612077,,,,,nan +Benchmarking Implicit Neural Representation and Geometric Rendering in Real-Time RGB-D SLAM,"Tongyan Hua · Addison, Lin Wang",https://vlis2022.github.io/nerf-slam-benchmark/,https://arxiv.org/abs/2403.19473,,2403.19473.pdf,Benchmarking Implicit Neural Representation and Geometric Rendering in Real-Time RGB-D SLAM,"Implicit neural representation (INR), in combination with geometric +rendering, has recently been employed in real-time dense RGB-D SLAM. Despite +active research endeavors being made, there lacks a unified protocol for fair +evaluation, impeding the evolution of this area. In this work, we establish, to +our knowledge, the first open-source benchmark framework to evaluate the +performance of a wide spectrum of commonly used INRs and rendering functions +for mapping and localization. The goal of our benchmark is to 1) gain an +intuition of how different INRs and rendering functions impact mapping and +localization and 2) establish a unified evaluation protocol w.r.t. the design +choices that may impact the mapping and localization. With the framework, we +conduct a large suite of experiments, offering various insights in choosing the +INRs and geometric rendering functions: for example, the dense feature grid +outperforms other INRs (e.g. tri-plane and hash grid), even when geometric and +color features are jointly encoded for memory efficiency. To extend the +findings into the practical scenario, a hybrid encoding strategy is proposed to +bring the best of the accuracy and completion from the grid-based and +decomposition-based INRs. 
We further propose explicit hybrid encoding for +high-fidelity dense grid mapping to comply with the RGB-D SLAM system that puts +the premise on robustness and computation efficiency.",cs.CV,['cs.CV'] +Elite360D: Towards Efficient 360 Depth Estimation via Semantic- and Distance-Aware Bi-Projection Fusion,"Hao Ai · Addison, Lin Wang", ,http://export.arxiv.org/abs/2403.16376,,2403.16376.pdf,Elite360D: Towards Efficient 360 Depth Estimation via Semantic- and Distance-Aware Bi-Projection Fusion,"360 depth estimation has recently received great attention for 3D +reconstruction owing to its omnidirectional field of view (FoV). Recent +approaches are predominantly focused on cross-projection fusion with +geometry-based re-projection: they fuse 360 images with equirectangular +projection (ERP) and another projection type, e.g., cubemap projection to +estimate depth with the ERP format. However, these methods suffer from 1) +limited local receptive fields, making it hardly possible to capture large FoV +scenes, and 2) prohibitive computational cost, caused by the complex +cross-projection fusion module design. In this paper, we propose Elite360D, a +novel framework that inputs the ERP image and icosahedron projection (ICOSAP) +point set, which is undistorted and spatially continuous. Elite360D is superior +in its capacity in learning a representation from a local-with-global +perspective. With a flexible ERP image encoder, it includes an ICOSAP point +encoder, and a Bi-projection Bi-attention Fusion (B2F) module (totally ~1M +parameters). Specifically, the ERP image encoder can take various perspective +image-trained backbones (e.g., ResNet, Transformer) to extract local features. +The point encoder extracts the global features from the ICOSAP. Then, the B2F +module captures the semantic- and distance-aware dependencies between each +pixel of the ERP feature and the entire ICOSAP feature set. Without specific +backbone design and obvious computational cost increase, Elite360D outperforms +the prior arts on several benchmark datasets.",cs.CV,['cs.CV'] +EventDance: Unsupervised Cross-modal Source-free Adaptation for Event-based Object Recognition,"Xu Zheng · Addison, Lin Wang", ,https://arxiv.org/abs/2403.14082,,2403.14082.pdf,EventDance: Unsupervised Source-free Cross-modal Adaptation for Event-based Object Recognition,"In this paper, we make the first attempt at achieving the cross-modal (i.e., +image-to-events) adaptation for event-based object recognition without +accessing any labeled source image data owning to privacy and commercial +issues. Tackling this novel problem is non-trivial due to the novelty of event +cameras and the distinct modality gap between images and events. In particular, +as only the source model is available, a hurdle is how to extract the knowledge +from the source model by only using the unlabeled target event data while +achieving knowledge transfer. To this end, we propose a novel framework, dubbed +EventDance for this unsupervised source-free cross-modal adaptation problem. +Importantly, inspired by event-to-video reconstruction methods, we propose a +reconstruction-based modality bridging (RMB) module, which reconstructs +intensity frames from events in a self-supervised manner. This makes it +possible to build up the surrogate images to extract the knowledge (i.e., +labels) from the source model. 
We then propose a multi-representation knowledge +adaptation (MKA) module that transfers the knowledge to target models learning +events with multiple representation types for fully exploring the +spatiotemporal information of events. The two modules connecting the source and +target models are mutually updated so as to achieve the best performance. +Experiments on three benchmark datasets with two adaption settings show that +EventDance is on par with prior methods utilizing the source data.",cs.CV,['cs.CV'] +GoodSAM: Bridging Domain and Capacity Gaps via Segment Anything Model for Distortion-aware Panoramic Semantic Segmentation,"WEIMING ZHANG · Yexin Liu · Xu Zheng · Addison, Lin Wang", ,https://arxiv.org/abs/2403.16370,,2403.16370.pdf,GoodSAM: Bridging Domain and Capacity Gaps via Segment Anything Model for Distortion-aware Panoramic Semantic Segmentation,"This paper tackles a novel yet challenging problem: how to transfer knowledge +from the emerging Segment Anything Model (SAM) -- which reveals impressive +zero-shot instance segmentation capacity -- to learn a compact panoramic +semantic segmentation model, i.e., student, without requiring any labeled data. +This poses considerable challenges due to SAM's inability to provide semantic +labels and the large capacity gap between SAM and the student. To this end, we +propose a novel framework, called GoodSAM, that introduces a teacher assistant +(TA) to provide semantic information, integrated with SAM to generate ensemble +logits to achieve knowledge transfer. Specifically, we propose a +Distortion-Aware Rectification (DAR) module that first addresses the distortion +problem of panoramic images by imposing prediction-level consistency and +boundary enhancement. This subtly enhances TA's prediction capacity on +panoramic images. DAR then incorporates a cross-task complementary fusion block +to adaptively merge the predictions of SAM and TA to obtain more reliable +ensemble logits. Moreover, we introduce a Multi-level Knowledge Adaptation +(MKA) module to efficiently transfer the multi-level feature knowledge from TA +and ensemble logits to learn a compact student model. Extensive experiments on +two benchmarks show that our GoodSAM achieves a remarkable +3.75\% mIoU +improvement over the state-of-the-art (SOTA) domain adaptation methods. Also, +our most lightweight model achieves comparable performance to the SOTA methods +with only 3.7M parameters.",cs.CV,['cs.CV'] +ExACT: Language-guided Conceptual Reasoning and Uncertainty Estimation for Event-based Action Recognition and More,"Jiazhou Zhou · Xu Zheng · Yuanhuiyi Lyu · Addison, Lin Wang", ,https://arxiv.org/abs/2403.12534,,2403.12534.pdf,ExACT: Language-guided Conceptual Reasoning and Uncertainty Estimation for Event-based Action Recognition and More,"Event cameras have recently been shown beneficial for practical vision tasks, +such as action recognition, thanks to their high temporal resolution, power +efficiency, and reduced privacy concerns. However, current research is hindered +by 1) the difficulty in processing events because of their prolonged duration +and dynamic actions with complex and ambiguous semantics and 2) the redundant +action depiction of the event frame representation with fixed stacks. We find +language naturally conveys abundant semantic information, rendering it +stunningly superior in reducing semantic uncertainty. 
In light of this, we +propose ExACT, a novel approach that, for the first time, tackles event-based +action recognition from a cross-modal conceptualizing perspective. Our ExACT +brings two technical contributions. Firstly, we propose an adaptive +fine-grained event (AFE) representation to adaptively filter out the repeated +events for the stationary objects while preserving dynamic ones. This subtly +enhances the performance of ExACT without extra computational cost. Then, we +propose a conceptual reasoning-based uncertainty estimation module, which +simulates the recognition process to enrich the semantic representation. In +particular, conceptual reasoning builds the temporal relation based on the +action semantics, and uncertainty estimation tackles the semantic uncertainty +of actions based on the distributional representation. Experiments show that +our ExACT achieves superior recognition accuracy of 94.83%(+2.23%), +90.10%(+37.47%) and 67.24% on PAF, HARDVS and our SeAct datasets respectively.",cs.CV,['cs.CV'] +"Semantics, Distortion, and Style Matter: Towards Source-free UDA for Panoramic Segmentation","Xu Zheng · Pengyuan Zhou · ATHANASIOS · Addison, Lin Wang",https://vlislab22.github.io/360SFUDA/,https://arxiv.org/abs/2403.12505,,2403.12505.pdf,"Semantics, Distortion, and Style Matter: Towards Source-free UDA for Panoramic Segmentation","This paper addresses an interesting yet challenging problem -- source-free +unsupervised domain adaptation (SFUDA) for pinhole-to-panoramic semantic +segmentation -- given only a pinhole image-trained model (i.e., source) and +unlabeled panoramic images (i.e., target). Tackling this problem is nontrivial +due to the semantic mismatches, style discrepancies, and inevitable distortion +of panoramic images. To this end, we propose a novel method that utilizes +Tangent Projection (TP) as it has less distortion and meanwhile slits the +equirectangular projection (ERP) with a fixed FoV to mimic the pinhole images. +Both projections are shown effective in extracting knowledge from the source +model. However, the distinct projection discrepancies between source and target +domains impede the direct knowledge transfer; thus, we propose a panoramic +prototype adaptation module (PPAM) to integrate panoramic prototypes from the +extracted knowledge for adaptation. We then impose the loss constraints on both +predictions and prototypes and propose a cross-dual attention module (CDAM) at +the feature level to better align the spatial and channel characteristics +across the domains and projections. Both knowledge extraction and transfer +processes are synchronously updated to reach the best performance. Extensive +experiments on the synthetic and real-world benchmarks, including outdoor and +indoor scenarios, demonstrate that our method achieves significantly better +performance than prior SFUDA methods for pinhole-to-panoramic adaptation.",cs.CV,['cs.CV'] +UniBind: LLM-Augmented Unified and Balanced Representation Space to Bind Them All,"Yuanhuiyi Lyu · Xu Zheng · Jiazhou Zhou · Addison, Lin Wang", ,https://arxiv.org/abs/2405.16108,,2405.16108.pdf,OmniBind: Teach to Build Unequal-Scale Modality Interaction for Omni-Bind of All,"Research on multi-modal learning dominantly aligns the modalities in a +unified space at training, and only a single one is taken for prediction at +inference. However, for a real machine, e.g., a robot, sensors could be added +or removed at any time. 
Thus, it is crucial to enable the machine to tackle the +mismatch and unequal-scale problems of modality combinations between training +and inference. In this paper, we tackle these problems from a new perspective: +""Modalities Help Modalities"". Intuitively, we present OmniBind, a novel +two-stage learning framework that can achieve any modality combinations and +interaction. It involves teaching data-constrained, a.k.a, student, modalities +to be aligned with the well-trained data-abundant, a.k.a, teacher, modalities. +This subtly enables the adaptive fusion of any modalities to build a unified +representation space for any combinations. Specifically, we propose Cross-modal +Alignment Distillation (CAD) to address the unequal-scale problem between +student and teacher modalities and effectively align student modalities into +the teacher modalities' representation space in stage one. We then propose an +Adaptive Fusion (AF) module to fuse any modality combinations and learn a +unified representation space in stage two. To address the mismatch problem, we +aggregate existing datasets and combine samples from different modalities by +the same semantics. This way, we build the first dataset for training and +evaluation that consists of teacher (image, text) and student (touch, thermal, +event, point cloud, audio) modalities and enables omni-bind for any of them. +Extensive experiments on the recognition task show performance gains over prior +arts by an average of 4.05 % on the arbitrary modality combination setting. It +also achieves state-of-the-art performance for a single modality, e.g., touch, +with a 4.34 % gain.",cs.CV,['cs.CV'] +C3Net: Compound Conditioned ControlNet for Multimodal Content Generation,Juntao Zhang · Yuehuai LIU · Yu-Wing Tai · Chi-Keung Tang, ,https://arxiv.org/abs/2311.17951,,2311.17951.pdf,C3Net: Compound Conditioned ControlNet for Multimodal Content Generation,"We present Compound Conditioned ControlNet, C3Net, a novel generative neural +architecture taking conditions from multiple modalities and synthesizing +multimodal contents simultaneously (e.g., image, text, audio). C3Net adapts the +ControlNet architecture to jointly train and make inferences on a +production-ready diffusion model and its trainable copies. Specifically, C3Net +first aligns the conditions from multi-modalities to the same semantic latent +space using modality-specific encoders based on contrastive training. Then, it +generates multimodal outputs based on the aligned latent space, whose semantic +information is combined using a ControlNet-like architecture called Control +C3-UNet. Correspondingly, with this system design, our model offers an improved +solution for joint-modality generation through learning and explaining +multimodal conditions instead of simply taking linear interpolations on the +latent space. Meanwhile, as we align conditions to a unified latent space, +C3Net only requires one trainable Control C3-UNet to work on multimodal +semantic information. Furthermore, our model employs unimodal pretraining on +the condition alignment stage, outperforming the non-pretrained alignment even +on relatively scarce training data and thus demonstrating high-quality compound +condition generation. We contribute the first high-quality tri-modal validation +set to validate quantitatively that C3Net outperforms or is on par with first +and contemporary state-of-the-art multimodal generation. 
Our codes and +tri-modal dataset will be released.",cs.LG,['cs.LG'] +Towards Robust Event-guided Low-Light Image Enhancement: A Large-Scale Real-World Event-Image Dataset and Novel Approach,"Guoqiang Liang · Kanghao Chen · Hangyu Li · Yunfan Lu · Addison, Lin Wang",https://vlislab22.github.io/eg-lowlight/.,https://arxiv.org/abs/2404.00834v1,,2404.00834v1.pdf,Towards Robust Event-guided Low-Light Image Enhancement: A Large-Scale Real-World Event-Image Dataset and Novel Approach,"Event camera has recently received much attention for low-light image +enhancement (LIE) thanks to their distinct advantages, such as high dynamic +range. However, current research is prohibitively restricted by the lack of +large-scale, real-world, and spatial-temporally aligned event-image datasets. +To this end, we propose a real-world (indoor and outdoor) dataset comprising +over 30K pairs of images and events under both low and normal illumination +conditions. To achieve this, we utilize a robotic arm that traces a consistent +non-linear trajectory to curate the dataset with spatial alignment precision +under 0.03mm. We then introduce a matching alignment strategy, rendering 90% of +our dataset with errors less than 0.01s. Based on the dataset, we propose a +novel event-guided LIE approach, called EvLight, towards robust performance in +real-world low-light scenes. Specifically, we first design the multi-scale +holistic fusion branch to extract holistic structural and textural information +from both events and images. To ensure robustness against variations in the +regional illumination and noise, we then introduce a Signal-to-Noise-Ratio +(SNR)-guided regional feature selection to selectively fuse features of images +from regions with high SNR and enhance those with low SNR by extracting +regional structure information from events. Extensive experiments on our +dataset and the synthetic SDSD dataset demonstrate our EvLight significantly +surpasses the frame-based methods. Code and datasets are available at +https://vlislab22.github.io/eg-lowlight/.",cs.CV,['cs.CV'] +Flatten Long-Range Loss Landscapes for Cross-Domain Few-Shot Learning,Yixiong Zou · Yicong Liu · Yiman Hu · Yuhua Li · Ruixuan Li, ,https://arxiv.org/abs/2403.00567,,2403.00567.pdf,Flatten Long-Range Loss Landscapes for Cross-Domain Few-Shot Learning,"Cross-domain few-shot learning (CDFSL) aims to acquire knowledge from limited +training data in the target domain by leveraging prior knowledge transferred +from source domains with abundant training samples. CDFSL faces challenges in +transferring knowledge across dissimilar domains and fine-tuning models with +limited training data. To address these challenges, we initially extend the +analysis of loss landscapes from the parameter space to the representation +space, which allows us to simultaneously interpret the transferring and +fine-tuning difficulties of CDFSL models. We observe that sharp minima in the +loss landscapes of the representation space result in representations that are +hard to transfer and fine-tune. Moreover, existing flatness-based methods have +limited generalization ability due to their short-range flatness. To enhance +the transferability and facilitate fine-tuning, we introduce a simple yet +effective approach to achieve long-range flattening of the minima in the loss +landscape. 
This approach considers representations that are differently +normalized as minima in the loss landscape and flattens the high-loss region in +the middle by randomly sampling interpolated representations. We implement this +method as a new normalization layer that replaces the original one in both CNNs +and ViTs. This layer is simple and lightweight, introducing only a minimal +number of additional parameters. Experimental results on 8 datasets demonstrate +that our approach outperforms state-of-the-art methods in terms of average +accuracy. Moreover, our method achieves performance improvements of up to 9\% +compared to the current best approaches on individual datasets. Our code will +be released.",cs.CV,"['cs.CV', 'cs.AI']" +PortraitBooth: A Versatile Portrait Model for Fast Identity-preserved Personalization,Xu Peng · Junwei Zhu · Boyuan Jiang · Ying Tai · Donghao Luo · Jiangning Zhang · Wei Lin · Taisong Jin · Chengjie Wang · Rongrong Ji, ,https://arxiv.org/abs/2312.06354,,2312.06354.pdf,PortraitBooth: A Versatile Portrait Model for Fast Identity-preserved Personalization,"Recent advancements in personalized image generation using diffusion models +have been noteworthy. However, existing methods suffer from inefficiencies due +to the requirement for subject-specific fine-tuning. This computationally +intensive process hinders efficient deployment, limiting practical usability. +Moreover, these methods often grapple with identity distortion and limited +expression diversity. In light of these challenges, we propose PortraitBooth, +an innovative approach designed for high efficiency, robust identity +preservation, and expression-editable text-to-image generation, without the +need for fine-tuning. PortraitBooth leverages subject embeddings from a face +recognition model for personalized image generation without fine-tuning. It +eliminates computational overhead and mitigates identity distortion. The +introduced dynamic identity preservation strategy further ensures close +resemblance to the original image identity. Moreover, PortraitBooth +incorporates emotion-aware cross-attention control for diverse facial +expressions in generated images, supporting text-driven expression editing. Its +scalability enables efficient and high-quality image creation, including +multi-subject generation. Extensive results demonstrate superior performance +over other state-of-the-art methods in both single and multiple image +generation scenarios.",cs.CV,['cs.CV'] +Discriminability-Driven Channel Selection for Out-of-Distribution Detection,Yue Yuan · Rundong He · Yicong Dong · Zhongyi Han · Yilong Yin, ,,https://www.semanticscholar.org/paper/Exploring-Channel-Aware-Typical-Features-for-He-Yuan/755390c365c4a39445f73ed09fe673f2b823876d,,,,,nan +Unmixing Diffusion for Self-Supervised Hyperspectral Image Denoising,Haijin Zeng · Jiezhang Cao · Yongyong Chen · Kai Zhang · Hiep Luong · Wilfried Philips, ,https://arxiv.org/abs/2311.11417,,,DiffSCI: Zero-Shot Snapshot Compressive Imaging via Iterative Spectral Diffusion Model,"This paper endeavors to advance the precision of snapshot compressive imaging +(SCI) reconstruction for multispectral image (MSI). To achieve this, we +integrate the advantageous attributes of established SCI techniques and an +image generative model, propose a novel structured zero-shot diffusion model, +dubbed DiffSCI. 
DiffSCI leverages the structural insights from the deep prior +and optimization-based methodologies, complemented by the generative +capabilities offered by the contemporary denoising diffusion model. +Specifically, firstly, we employ a pre-trained diffusion model, which has been +trained on a substantial corpus of RGB images, as the generative denoiser +within the Plug-and-Play framework for the first time. This integration allows +for the successful completion of SCI reconstruction, especially in the case +that current methods struggle to address effectively. Secondly, we +systematically account for spectral band correlations and introduce a robust +methodology to mitigate wavelength mismatch, thus enabling seamless adaptation +of the RGB diffusion model to MSIs. Thirdly, an accelerated algorithm is +implemented to expedite the resolution of the data subproblem. This +augmentation not only accelerates the convergence rate but also elevates the +quality of the reconstruction process. We present extensive testing to show +that DiffSCI exhibits discernible performance enhancements over prevailing +self-supervised and zero-shot approaches, surpassing even supervised +transformer counterparts across both simulated and real datasets. Our code will +be available.",cs.CV,['cs.CV'] +Forgery-aware Adaptive Transformer for Generalizable Synthetic Image Detection,Huan Liu · Zichang Tan · Chuangchuang Tan · Yunchao Wei · Jingdong Wang · Yao Zhao,https://github.com/Michel-liu/FatFormer,https://arxiv.org/abs/2312.16649,,2312.16649.pdf,Forgery-aware Adaptive Transformer for Generalizable Synthetic Image Detection,"In this paper, we study the problem of generalizable synthetic image +detection, aiming to detect forgery images from diverse generative methods, +e.g., GANs and diffusion models. Cutting-edge solutions start to explore the +benefits of pre-trained models, and mainly follow the fixed paradigm of solely +training an attached classifier, e.g., combining frozen CLIP-ViT with a +learnable linear layer in UniFD. However, our analysis shows that such a fixed +paradigm is prone to yield detectors with insufficient learning regarding +forgery representations. We attribute the key challenge to the lack of forgery +adaptation, and present a novel forgery-aware adaptive transformer approach, +namely FatFormer. Based on the pre-trained vision-language spaces of CLIP, +FatFormer introduces two core designs for the adaption to build generalized +forgery representations. First, motivated by the fact that both image and +frequency analysis are essential for synthetic image detection, we develop a +forgery-aware adapter to adapt image features to discern and integrate local +forgery traces within image and frequency domains. Second, we find that +considering the contrastive objectives between adapted image features and text +prompt embeddings, a previously overlooked aspect, results in a nontrivial +generalization improvement. Accordingly, we introduce language-guided alignment +to supervise the forgery adaptation with image and text prompts in FatFormer. +Experiments show that, by coupling these two designs, our approach tuned on +4-class ProGAN data attains a remarkable detection performance, achieving an +average of 98% accuracy to unseen GANs, and surprisingly generalizes to unseen +diffusion models with 95% accuracy.",cs.CV,['cs.CV'] +"What, when, and where?
-- Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions",Brian Chen · Nina Shvetsova · Andrew Rouditchenko · Daniel Kondermann · Samuel Thomas · Shih-Fu Chang · Rogerio Feris · James Glass · Hilde Kuehne, ,,https://openreview.net/forum?id=eEtfBIjzWi,,,,,nan +CHAIN: Enhancing Generalization in Data-Efficient GANs via lipsCHitz continuity constrAIned Normalization,Yao Ni · Piotr Koniusz, ,https://arxiv.org/abs/2404.00521,,2404.00521.pdf,CHAIN: Enhancing Generalization in Data-Efficient GANs via lipsCHitz continuity constrAIned Normalization,"Generative Adversarial Networks (GANs) significantly advanced image +generation but their performance heavily depends on abundant training data. In +scenarios with limited data, GANs often struggle with discriminator overfitting +and unstable training. Batch Normalization (BN), despite being known for +enhancing generalization and training stability, has rarely been used in the +discriminator of Data-Efficient GANs. Our work addresses this gap by +identifying a critical flaw in BN: the tendency for gradient explosion during +the centering and scaling steps. To tackle this issue, we present CHAIN +(lipsCHitz continuity constrAIned Normalization), which replaces the +conventional centering step with zero-mean regularization and integrates a +Lipschitz continuity constraint in the scaling step. CHAIN further enhances GAN +training by adaptively interpolating the normalized and unnormalized features, +effectively avoiding discriminator overfitting. Our theoretical analyses firmly +establishes CHAIN's effectiveness in reducing gradients in latent features and +weights, improving stability and generalization in GAN training. Empirical +evidence supports our theory. CHAIN achieves state-of-the-art results in +data-limited scenarios on CIFAR-10/100, ImageNet, five low-shot and seven +high-resolution few-shot image datasets. Code: +https://github.com/MaxwellYaoNi/CHAIN",cs.LG,"['cs.LG', 'cs.CV']" +Improving Plasticity in Online Continual Learning via Collaborative Learning,Maorong Wang · Nicolas Michel · Ling Xiao · Toshihiko Yamasaki, ,https://arxiv.org/abs/2312.00600,,2312.00600.pdf,Improving Plasticity in Online Continual Learning via Collaborative Learning,"Online Continual Learning (CL) solves the problem of learning the +ever-emerging new classification tasks from a continuous data stream. Unlike +its offline counterpart, in online CL, the training data can only be seen once. +Most existing online CL research regards catastrophic forgetting (i.e., model +stability) as almost the only challenge. In this paper, we argue that the +model's capability to acquire new knowledge (i.e., model plasticity) is another +challenge in online CL. While replay-based strategies have been shown to be +effective in alleviating catastrophic forgetting, there is a notable gap in +research attention toward improving model plasticity. To this end, we propose +Collaborative Continual Learning (CCL), a collaborative learning based strategy +to improve the model's capability in acquiring new concepts. Additionally, we +introduce Distillation Chain (DC), a collaborative learning scheme to boost the +training of the models. We adapt CCL-DC to existing representative online CL +works. Extensive experiments demonstrate that even if the learners are +well-trained with state-of-the-art online CL methods, our strategy can still +improve model plasticity dramatically, and thereby improve the overall +performance by a large margin. 
The source code of our work is available at +https://github.com/maorong-wang/CCL-DC.",cs.LG,['cs.LG'] +Bi-SSC: Geometric-Semantic Bidirectional Fusion for Camera-based 3D Semantic Scene Completion,Yujie Xue · Ruihui Li · F anWu · Zhuo Tang · Kenli Li · Duan Mingxing, ,https://arxiv.org/abs/2312.05752,,2312.05752.pdf,Camera-based 3D Semantic Scene Completion with Sparse Guidance Network,"Semantic scene completion (SSC) aims to predict the semantic occupancy of +each voxel in the entire 3D scene from limited observations, which is an +emerging and critical task for autonomous driving. Recently, many studies have +turned to camera-based SSC solutions due to the richer visual cues and +cost-effectiveness of cameras. However, existing methods usually rely on +sophisticated and heavy 3D models to directly process the lifted 3D features +that are not discriminative enough for clear segmentation boundaries. In this +paper, we adopt the dense-sparse-dense design and propose an end-to-end +camera-based SSC framework, termed SGN, to diffuse semantics from the semantic- +and occupancy-aware seed voxels to the whole scene based on geometry prior and +occupancy information. By designing hybrid guidance (sparse semantic and +geometry guidance) and effective voxel aggregation for spatial occupancy and +geometry priors, we enhance the feature separation between different categories +and expedite the convergence of semantic diffusion. Extensive experimental +results on the SemanticKITTI dataset demonstrate the superiority of our SGN +over existing state-of-the-art methods.",cs.CV,['cs.CV'] +ImageNet-D: Benchmarking Neural Network Robustness on Diffusion Synthetic Object,Chenshuang Zhang · Fei Pan · Junmo Kim · In So Kweon · Chengzhi Mao,https://github.com/chenshuang-zhang/imagenet_d,https://arxiv.org/abs/2403.18775,,2403.18775.pdf,ImageNet-D: Benchmarking Neural Network Robustness on Diffusion Synthetic Object,"We establish rigorous benchmarks for visual perception robustness. Synthetic +images such as ImageNet-C, ImageNet-9, and Stylized ImageNet provide specific +type of evaluation over synthetic corruptions, backgrounds, and textures, yet +those robustness benchmarks are restricted in specified variations and have low +synthetic quality. In this work, we introduce generative model as a data source +for synthesizing hard images that benchmark deep models' robustness. Leveraging +diffusion models, we are able to generate images with more diversified +backgrounds, textures, and materials than any prior work, where we term this +benchmark as ImageNet-D. Experimental results show that ImageNet-D results in a +significant accuracy drop to a range of vision models, from the standard ResNet +visual classifier to the latest foundation models like CLIP and MiniGPT-4, +significantly reducing their accuracy by up to 60\%. Our work suggests that +diffusion models can be an effective source to test vision models. The code and +dataset are available at https://github.com/chenshuang-zhang/imagenet_d.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +Object Pose Estimation via the Aggregation of Diffusion Features,Tianfu Wang · Guosheng Hu · Hongguang Wang,https://github.com/Tianfu18/diff-feats-pose,https://arxiv.org/abs/2403.18791,,2403.18791.pdf,Object Pose Estimation via the Aggregation of Diffusion Features,"Estimating the pose of objects from images is a crucial task of 3D scene +understanding, and recent approaches have shown promising results on very large +benchmarks. 
However, these methods experience a significant performance drop +when dealing with unseen objects. We believe that it results from the limited +generalizability of image features. To address this problem, we have an +in-depth analysis on the features of diffusion models, e.g. Stable Diffusion, +which hold substantial potential for modeling unseen objects. Based on this +analysis, we then innovatively introduce these diffusion features for object +pose estimation. To achieve this, we propose three distinct architectures that +can effectively capture and aggregate diffusion features of different +granularity, greatly improving the generalizability of object pose estimation. +Our approach outperforms the state-of-the-art methods by a considerable margin +on three popular benchmark datasets, LM, O-LM, and T-LESS. In particular, our +method achieves higher accuracy than the previous best arts on unseen objects: +98.2% vs. 93.5% on Unseen LM, 85.9% vs. 76.3% on Unseen O-LM, showing the +strong generalizability of our method. Our code is released at +https://github.com/Tianfu18/diff-feats-pose.",cs.CV,['cs.CV'] +Efficient Meshflow and Optical Flow Estimation from Event Cameras,Xinglong Luo · Ao Luo · Zhengning Wang · Chunyu Lin · Bing Zeng · Shuaicheng Liu,https://github.com/boomluo02/EEMFlow,https://arxiv.org/abs/2307.05033,,2307.05033.pdf,Towards Anytime Optical Flow Estimation with Event Cameras,"Optical flow estimation is a fundamental task in the field of autonomous +driving. Event cameras are capable of responding to log-brightness changes in +microseconds. Its characteristic of producing responses only to the changing +region is particularly suitable for optical flow estimation. In contrast to the +super low-latency response speed of event cameras, existing datasets collected +via event cameras, however, only provide limited frame rate optical flow ground +truth, (e.g., at 10Hz), greatly restricting the potential of event-driven +optical flow. To address this challenge, we put forward a high-frame-rate, +low-latency event representation Unified Voxel Grid, sequentially fed into the +network bin by bin. We then propose EVA-Flow, an EVent-based Anytime Flow +estimation network to produce high-frame-rate event optical flow with only +low-frame-rate optical flow ground truth for supervision. The key component of +our EVA-Flow is the stacked Spatiotemporal Motion Refinement (SMR) module, +which predicts temporally dense optical flow and enhances the accuracy via +spatial-temporal motion refinement. The time-dense feature warping utilized in +the SMR module provides implicit supervision for the intermediate optical flow. +Additionally, we introduce the Rectified Flow Warp Loss (RFWL) for the +unsupervised evaluation of intermediate optical flow in the absence of ground +truth. This is, to the best of our knowledge, the first work focusing on +anytime optical flow estimation via event cameras. A comprehensive variety of +experiments on MVSEC, DESC, and our EVA-FlowSet demonstrates that EVA-Flow +achieves competitive performance, super-low-latency (5ms), fastest inference +(9.2ms), time-dense motion estimation (200Hz), and strong generalization. 
Our +code will be available at https://github.com/Yaozhuwa/EVA-Flow.",cs.CV,"['cs.CV', 'cs.RO', 'eess.IV']" +MVIP-NeRF: Multi-view 3D Inpainting on NeRF Scenes via Diffusion Prior,Honghua Chen · Chen Change Loy · Xingang Pan, ,https://arxiv.org/abs/2405.02859,,2405.02859.pdf,MVIP-NeRF: Multi-view 3D Inpainting on NeRF Scenes via Diffusion Prior,"Despite the emergence of successful NeRF inpainting methods built upon +explicit RGB and depth 2D inpainting supervisions, these methods are inherently +constrained by the capabilities of their underlying 2D inpainters. This is due +to two key reasons: (i) independently inpainting constituent images results in +view-inconsistent imagery, and (ii) 2D inpainters struggle to ensure +high-quality geometry completion and alignment with inpainted RGB images. + To overcome these limitations, we propose a novel approach called MVIP-NeRF +that harnesses the potential of diffusion priors for NeRF inpainting, +addressing both appearance and geometry aspects. MVIP-NeRF performs joint +inpainting across multiple views to reach a consistent solution, which is +achieved via an iterative optimization process based on Score Distillation +Sampling (SDS). Apart from recovering the rendered RGB images, we also extract +normal maps as a geometric representation and define a normal SDS loss that +motivates accurate geometry inpainting and alignment with the appearance. +Additionally, we formulate a multi-view SDS score function to distill +generative priors simultaneously from different view images, ensuring +consistent visual completion when dealing with large view variations. Our +experimental results show better appearance and geometry recovery than previous +NeRF inpainting methods.",cs.CV,['cs.CV'] +Functional Diffusion,Biao Zhang · Peter Wonka, ,https://arxiv.org/abs/2311.15435,,2311.15435.pdf,Functional Diffusion,"We propose a new class of generative diffusion models, called functional +diffusion. In contrast to previous work, functional diffusion works on samples +that are represented by functions with a continuous domain. Functional +diffusion can be seen as an extension of classical diffusion models to an +infinite-dimensional domain. Functional diffusion is very versatile as images, +videos, audio, 3D shapes, deformations, \etc, can be handled by the same +framework with minimal changes. In addition, functional diffusion is especially +suited for irregular data or data defined in non-standard domains. In our work, +we derive the necessary foundations for functional diffusion and propose a +first implementation based on the transformer architecture. We show generative +results on complicated signed distance functions and deformation functions +defined on 3D surfaces.",cs.CV,"['cs.CV', 'cs.GR', 'cs.LG']" +Improving Visual Recognition with Hyperbolical Visual Hierarchy Mapping,Hyeongjun Kwon · Jinhyun Jang · Jin Kim · Kwonyoung Kim · Kwanghoon Sohn, ,https://arxiv.org/abs/2404.00974v1,,2404.00974v1.pdf,Improving Visual Recognition with Hyperbolical Visual Hierarchy Mapping,"Visual scenes are naturally organized in a hierarchy, where a coarse semantic +is recursively comprised of several fine details. Exploring such a visual +hierarchy is crucial to recognize the complex relations of visual elements, +leading to a comprehensive scene understanding. In this paper, we propose a +Visual Hierarchy Mapper (Hi-Mapper), a novel approach for enhancing the +structured understanding of the pre-trained Deep Neural Networks (DNNs). 
+Hi-Mapper investigates the hierarchical organization of the visual scene by 1) +pre-defining a hierarchy tree through the encapsulation of probability +densities; and 2) learning the hierarchical relations in hyperbolic space with +a novel hierarchical contrastive loss. The pre-defined hierarchy tree +recursively interacts with the visual features of the pre-trained DNNs through +hierarchy decomposition and encoding procedures, thereby effectively +identifying the visual hierarchy and enhancing the recognition of an entire +scene. Extensive experiments demonstrate that Hi-Mapper significantly enhances +the representation capability of DNNs, leading to an improved performance on +various tasks, including image classification and dense prediction tasks.",cs.CV,['cs.CV'] +Neural Underwater Scene Representation,Yunkai Tang · Chengxuan Zhu · Renjie Wan · Chao Xu · Boxin Shi, ,,https://freebutuselesssoul.github.io/publications/cvpr2024a,,,,,nan +ViewFusion: Towards Multi-View Consistency via Interpolated Denoising,Xianghui Yang · Gil Avraham · Yan Zuo · Sameera Ramasinghe · Loris Bazzani · Anton van den Hengel, ,https://arxiv.org/abs/2402.18842,,2402.18842.pdf,ViewFusion: Towards Multi-View Consistency via Interpolated Denoising,"Novel-view synthesis through diffusion models has demonstrated remarkable +potential for generating diverse and high-quality images. Yet, the independent +process of image generation in these prevailing methods leads to challenges in +maintaining multiple-view consistency. To address this, we introduce +ViewFusion, a novel, training-free algorithm that can be seamlessly integrated +into existing pre-trained diffusion models. Our approach adopts an +auto-regressive method that implicitly leverages previously generated views as +context for the next view generation, ensuring robust multi-view consistency +during the novel-view generation process. Through a diffusion process that +fuses known-view information via interpolated denoising, our framework +successfully extends single-view conditioned models to work in multiple-view +conditional settings without any additional fine-tuning. Extensive experimental +results demonstrate the effectiveness of ViewFusion in generating consistent +and detailed novel views.",cs.CV,['cs.CV'] +Single-to-Dual-View Adaptation for Egocentric 3D Hand Pose Estimation,Ruicong Liu · Takehiko Ohkawa · Mingfang Zhang · Yoichi Sato, ,https://arxiv.org/abs/2403.04381,,2403.04381.pdf,Single-to-Dual-View Adaptation for Egocentric 3D Hand Pose Estimation,"The pursuit of accurate 3D hand pose estimation stands as a keystone for +understanding human activity in the realm of egocentric vision. The majority of +existing estimation methods still rely on single-view images as input, leading +to potential limitations, e.g., limited field-of-view and ambiguity in depth. +To address these problems, adding another camera to better capture the shape of +hands is a practical direction. However, existing multi-view hand pose +estimation methods suffer from two main drawbacks: 1) Requiring multi-view +annotations for training, which are expensive. 2) During testing, the model +becomes inapplicable if camera parameters/layout are not the same as those used +in training. In this paper, we propose a novel Single-to-Dual-view adaptation +(S2DHand) solution that adapts a pre-trained single-view estimator to dual +views. Compared with existing multi-view training methods, 1) our adaptation +process is unsupervised, eliminating the need for multi-view annotation. 
2) +Moreover, our method can handle arbitrary dual-view pairs with unknown camera +parameters, making the model applicable to diverse camera settings. +Specifically, S2DHand is built on certain stereo constraints, including +pair-wise cross-view consensus and invariance of transformation between both +views. These two stereo constraints are used in a complementary manner to +generate pseudo-labels, allowing reliable adaptation. Evaluation results reveal +that S2DHand achieves significant improvements on arbitrary camera pairs under +both in-dataset and cross-dataset settings, and outperforms existing adaptation +methods with leading performance. Project page: +https://github.com/MickeyLLG/S2DHand.",cs.CV,['cs.CV'] +Backpropagation-free Network for 3D Test-time Adaptation,YANSHUO WANG · Ali Cheraghian · Zeeshan Hayder · JIE HONG · Sameera Ramasinghe · Shafin Rahman · David Ahmedt-Aristizabal · Xuesong Li · Lars Petersson · Mehrtash Harandi, ,https://arxiv.org/abs/2403.18442,,2403.18442.pdf,Backpropagation-free Network for 3D Test-time Adaptation,"Real-world systems often encounter new data over time, which leads to +experiencing target domain shifts. Existing Test-Time Adaptation (TTA) methods +tend to apply computationally heavy and memory-intensive backpropagation-based +approaches to handle this. Here, we propose a novel method that uses a +backpropagation-free approach for TTA for the specific case of 3D data. Our +model uses a two-stream architecture to maintain knowledge about the source +domain as well as complementary target-domain-specific information. The +backpropagation-free property of our model helps address the well-known +forgetting problem and mitigates the error accumulation issue. The proposed +method also eliminates the need for the usually noisy process of +pseudo-labeling and reliance on costly self-supervised training. Moreover, our +method leverages subspace learning, effectively reducing the distribution +variance between the two domains. Furthermore, the source-domain-specific and +the target-domain-specific streams are aligned using a novel entropy-based +adaptive fusion strategy. Extensive experiments on popular benchmarks +demonstrate the effectiveness of our method. The code will be available at +\url{https://github.com/abie-e/BFTT3D}.",cs.CV,['cs.CV'] +Improving Single Domain-Generalized Object Detection: A Focus on Diversification and Alignment,Muhammad Sohail Danish · Muhammad Haris Khan · Muhammad Akhtar Munir · M. Sarfraz · Mohsen Ali, ,https://arxiv.org/abs/2405.14497,,2405.14497.pdf,Improving Single Domain-Generalized Object Detection: A Focus on Diversification and Alignment,"In this work, we tackle the problem of domain generalization for object +detection, specifically focusing on the scenario where only a single source +domain is available. We propose an effective approach that involves two key +steps: diversifying the source domain and aligning detections based on class +prediction confidence and localization. Firstly, we demonstrate that by +carefully selecting a set of augmentations, a base detector can outperform +existing methods for single domain generalization by a good margin. This +highlights the importance of domain diversification in improving the +performance of object detectors. Secondly, we introduce a method to align +detections from multiple views, considering both classification and +localization outputs. 
This alignment procedure leads to better generalized and +well-calibrated object detector models, which are crucial for accurate +decision-making in safety-critical applications. Our approach is +detector-agnostic and can be seamlessly applied to both single-stage and +two-stage detectors. To validate the effectiveness of our proposed methods, we +conduct extensive experiments and ablations on challenging domain-shift +scenarios. The results consistently demonstrate the superiority of our approach +compared to existing methods. Our code and models are available at: +https://github.com/msohaildanish/DivAlign",cs.CV,['cs.CV'] +Universal Segmentation at Arbitrary Granularity with Language Instruction,Yong Liu · Cairong Zhang · Yitong Wang · Jiahao Wang · Yujiu Yang · Yansong Tang, ,https://arxiv.org/abs/2312.01623,,2312.01623.pdf,Universal Segmentation at Arbitrary Granularity with Language Instruction,"This paper aims to achieve universal segmentation of arbitrary semantic +level. Despite significant progress in recent years, specialist segmentation +approaches are limited to specific tasks and data distribution. Retraining a +new model for adaptation to new scenarios or settings takes expensive +computation and time cost, which raises the demand for versatile and universal +segmentation model that can cater to various granularity. Although some +attempts have been made for unifying different segmentation tasks or +generalization to various scenarios, limitations in the definition of paradigms +and input-output spaces make it difficult for them to achieve accurate +understanding of content at arbitrary granularity. To this end, we present +UniLSeg, a universal segmentation model that can perform segmentation at any +semantic level with the guidance of language instructions. For training +UniLSeg, we reorganize a group of tasks from original diverse distributions +into a unified data format, where images with texts describing segmentation +targets as input and corresponding masks are output. Combined with a automatic +annotation engine for utilizing numerous unlabeled data, UniLSeg achieves +excellent performance on various tasks and settings, surpassing both specialist +and unified segmentation models.",cs.CV,['cs.CV'] +ScanFormer: Referring Expression Comprehension by Iteratively Scanning,Wei Su · Peihan Miao · Huanzhang Dou · Xi Li, ,http://export.arxiv.org/abs/2306.04451,,2306.04451.pdf,Referring Expression Comprehension Using Language Adaptive Inference,"Different from universal object detection, referring expression comprehension +(REC) aims to locate specific objects referred to by natural language +expressions. The expression provides high-level concepts of relevant visual and +contextual patterns, which vary significantly with different expressions and +account for only a few of those encoded in the REC model. This leads us to a +question: do we really need the entire network with a fixed structure for +various referring expressions? Ideally, given an expression, only +expression-relevant components of the REC model are required. These components +should be small in number as each expression only contains very few visual and +contextual clues. This paper explores the adaptation between expressions and +REC models for dynamic inference. Concretely, we propose a neat yet efficient +framework named Language Adaptive Dynamic Subnets (LADS), which can extract +language-adaptive subnets from the REC model conditioned on the referring +expressions. 
By using the compact subnet, the inference can be more economical +and efficient. Extensive experiments on RefCOCO, RefCOCO+, RefCOCOg, and +Referit show that the proposed method achieves faster inference speed and +higher accuracy against state-of-the-art approaches.",cs.CV,['cs.CV'] +SPIDeRS: Structured Polarization for Invisible Depth and Reflectance Sensing,Tomoki Ichikawa · Shohei Nobuhara · Ko Nishino,https://vision.ist.i.kyoto-u.ac.jp/research/spiders/,https://arxiv.org/abs/2312.04553,,2312.04553.pdf,SPIDeRS: Structured Polarization for Invisible Depth and Reflectance Sensing,"Can we capture shape and reflectance in stealth? Such capability would be +valuable for many application domains in vision, xR, robotics, and HCI. We +introduce structured polarization for invisible depth and reflectance sensing +(SPIDeRS), the first depth and reflectance sensing method using patterns of +polarized light. The key idea is to modulate the angle of linear polarization +(AoLP) of projected light at each pixel. The use of polarization makes it +invisible and lets us recover not only depth but also directly surface normals +and even reflectance. We implement SPIDeRS with a liquid crystal spatial light +modulator (SLM) and a polarimetric camera. We derive a novel method for +robustly extracting the projected structured polarization pattern from the +polarimetric object appearance. We evaluate the effectiveness of SPIDeRS by +applying it to a number of real-world objects. The results show that our method +successfully reconstructs object shapes of various materials and is robust to +diffuse reflection and ambient light. We also demonstrate relighting using +recovered surface normals and reflectance. We believe SPIDeRS opens a new +avenue of polarization use in visual sensing.",cs.CV,"['cs.CV', 'eess.IV']" +MM-Narrator: Narrating Long-form Videos with Multimodal In-Context Learning,Chaoyi Zhang · Kevin Lin · Zhengyuan Yang · Jianfeng Wang · Linjie Li · Chung-Ching Lin · Zicheng Liu · Lijuan Wang, ,https://arxiv.org/abs/2311.17435,,2311.17435.pdf,MM-Narrator: Narrating Long-form Videos with Multimodal In-Context Learning,"We present MM-Narrator, a novel system leveraging GPT-4 with multimodal +in-context learning for the generation of audio descriptions (AD). Unlike +previous methods that primarily focused on downstream fine-tuning with short +video clips, MM-Narrator excels in generating precise audio descriptions for +videos of extensive lengths, even beyond hours, in an autoregressive manner. +This capability is made possible by the proposed memory-augmented generation +process, which effectively utilizes both the short-term textual context and +long-term visual memory through an efficient register-and-recall mechanism. +These contextual memories compile pertinent past information, including +storylines and character identities, ensuring an accurate tracking and +depicting of story-coherent and character-centric audio descriptions. +Maintaining the training-free design of MM-Narrator, we further propose a +complexity-based demonstration selection strategy to largely enhance its +multi-step reasoning capability via few-shot multimodal in-context learning +(MM-ICL). Experimental results on MAD-eval dataset demonstrate that MM-Narrator +consistently outperforms both the existing fine-tuning-based approaches and +LLM-based approaches in most scenarios, as measured by standard evaluation +metrics. Additionally, we introduce the first segment-based evaluator for +recurrent text generation. 
Empowered by GPT-4, this evaluator comprehensively +reasons and marks AD generation performance in various extendable dimensions.",cs.CV,"['cs.CV', 'cs.AI']" +DisCo: Disentangled Control for Realistic Human Dance Generation,Tan Wang · Linjie Li · Kevin Lin · Yuanhao Zhai · Chung-Ching Lin · Zhengyuan Yang · Hanwang Zhang · Zicheng Liu · Lijuan Wang, ,https://arxiv.org/abs/2307.00040,,2307.00040.pdf,DisCo: Disentangled Control for Realistic Human Dance Generation,"Generative AI has made significant strides in computer vision, particularly +in text-driven image/video synthesis (T2I/T2V). Despite the notable +advancements, it remains challenging in human-centric content synthesis such as +realistic dance generation. Current methodologies, primarily tailored for human +motion transfer, encounter difficulties when confronted with real-world dance +scenarios (e.g., social media dance), which require to generalize across a wide +spectrum of poses and intricate human details. In this paper, we depart from +the traditional paradigm of human motion transfer and emphasize two additional +critical attributes for the synthesis of human dance content in social media +contexts: (i) Generalizability: the model should be able to generalize beyond +generic human viewpoints as well as unseen human subjects, backgrounds, and +poses; (ii) Compositionality: it should allow for the seamless composition of +seen/unseen subjects, backgrounds, and poses from different sources. To address +these challenges, we introduce DISCO, which includes a novel model architecture +with disentangled control to improve the compositionality of dance synthesis, +and an effective human attribute pre-training for better generalizability to +unseen humans. Extensive qualitative and quantitative results demonstrate that +DisCc can generate high-quality human dance images and videos with diverse +appearances and flexible motions. Code is available at +https://disco-dance.github.io/.",cs.CV,"['cs.CV', 'cs.AI']" +DeMatch: Deep Decomposition of Motion Field for Two-View Correspondence Learning,Shihua Zhang · Zizhuo Li · Yuan Gao · Jiayi Ma, ,,https://ojs.aaai.org/index.php/AAAI/article/view/25456,,,,,nan +EmoVIT: Revolutionizing Emotion Insights with Visual Instruction Tuning,Hongxia Xie · Chu-Jun Peng · Yu-Wen Tseng · Hung-Jen Chen · Chan-Feng Hsu · Hong-Han Shuai · Wen-Huang Cheng, ,https://arxiv.org/abs/2404.16670,,2404.16670.pdf,EmoVIT: Revolutionizing Emotion Insights with Visual Instruction Tuning,"Visual Instruction Tuning represents a novel learning paradigm involving the +fine-tuning of pre-trained language models using task-specific instructions. +This paradigm shows promising zero-shot results in various natural language +processing tasks but is still unexplored in vision emotion understanding. In +this work, we focus on enhancing the model's proficiency in understanding and +adhering to instructions related to emotional contexts. Initially, we identify +key visual clues critical to visual emotion recognition. Subsequently, we +introduce a novel GPT-assisted pipeline for generating emotion visual +instruction data, effectively addressing the scarcity of annotated instruction +data in this domain. Expanding on the groundwork established by InstructBLIP, +our proposed EmoVIT architecture incorporates emotion-specific instruction +data, leveraging the powerful capabilities of Large Language Models to enhance +performance. 
Through extensive experiments, our model showcases its proficiency +in emotion classification, adeptness in affective reasoning, and competence in +comprehending humor. The comparative analysis provides a robust benchmark for +Emotion Visual Instruction Tuning in the era of LLMs, providing valuable +insights and opening avenues for future exploration in this domain. Our code is +available at \url{https://github.com/aimmemotion/EmoVIT}.",cs.CV,"['cs.CV', 'cs.AI']" +PI3D: Efficient Text-to-3D Generation with Pseudo-Image Diffusion,Ying-Tian Liu · Yuan-Chen Guo · Guan Luo · Heyi Sun · Wei Yin · Song-Hai Zhang, ,https://arxiv.org/abs/2312.09069,,2312.09069.pdf,PI3D: Efficient Text-to-3D Generation with Pseudo-Image Diffusion,"Diffusion models trained on large-scale text-image datasets have demonstrated +a strong capability of controllable high-quality image generation from +arbitrary text prompts. However, the generation quality and generalization +ability of 3D diffusion models is hindered by the scarcity of high-quality and +large-scale 3D datasets. In this paper, we present PI3D, a framework that fully +leverages the pre-trained text-to-image diffusion models' ability to generate +high-quality 3D shapes from text prompts in minutes. The core idea is to +connect the 2D and 3D domains by representing a 3D shape as a set of Pseudo RGB +Images. We fine-tune an existing text-to-image diffusion model to produce such +pseudo-images using a small number of text-3D pairs. Surprisingly, we find that +it can already generate meaningful and consistent 3D shapes given complex text +descriptions. We further take the generated shapes as the starting point for a +lightweight iterative refinement using score distillation sampling to achieve +high-quality generation under a low budget. PI3D generates a single 3D shape +from text in only 3 minutes and the quality is validated to outperform existing +3D generative models by a large margin.",cs.CV,['cs.CV'] +VastGaussian: Vast 3D Gaussians for Large Scene Reconstruction,Jiaqi Lin · Zhihao Li · Xiao Tang · Jianzhuang Liu · Shiyong Liu · Jiayue Liu · Yangdi Lu · Xiaofei Wu · Songcen Xu · Youliang Yan · Wenming Yang, ,https://arxiv.org/abs/2402.17427,,2402.17427.pdf,VastGaussian: Vast 3D Gaussians for Large Scene Reconstruction,"Existing NeRF-based methods for large scene reconstruction often have +limitations in visual quality and rendering speed. While the recent 3D Gaussian +Splatting works well on small-scale and object-centric scenes, scaling it up to +large scenes poses challenges due to limited video memory, long optimization +time, and noticeable appearance variations. To address these challenges, we +present VastGaussian, the first method for high-quality reconstruction and +real-time rendering on large scenes based on 3D Gaussian Splatting. We propose +a progressive partitioning strategy to divide a large scene into multiple +cells, where the training cameras and point cloud are properly distributed with +an airspace-aware visibility criterion. These cells are merged into a complete +scene after parallel optimization. We also introduce decoupled appearance +modeling into the optimization process to reduce appearance variations in the +rendered images. 
Our approach outperforms existing NeRF-based methods and +achieves state-of-the-art results on multiple large scene datasets, enabling +fast optimization and high-fidelity real-time rendering.",cs.CV,['cs.CV'] +Open-Vocabulary Segmentation with Semantic-Assisted Calibration,Yong Liu · Sule Bai · Guanbin Li · Yitong Wang · Yansong Tang, ,https://arxiv.org/abs/2312.04089,,,Open-Vocabulary Segmentation with Semantic-Assisted Calibration,"This paper studies open-vocabulary segmentation (OVS) through calibrating +in-vocabulary and domain-biased embedding space with generalized contextual +prior of CLIP. As the core of open-vocabulary understanding, alignment of +visual content with the semantics of unbounded text has become the bottleneck +of this field. To address this challenge, recent works propose to utilize CLIP +as an additional classifier and aggregate model predictions with CLIP +classification results. Despite their remarkable progress, performance of OVS +methods in relevant scenarios is still unsatisfactory compared with supervised +counterparts. We attribute this to the in-vocabulary embedding and +domain-biased CLIP prediction. To this end, we present a Semantic-assisted +CAlibration Network (SCAN). In SCAN, we incorporate generalized semantic prior +of CLIP into proposal embedding to avoid collapsing on known categories. +Besides, a contextual shift strategy is applied to mitigate the lack of global +context and unnatural background noise. With above designs, SCAN achieves +state-of-the-art performance on all popular open-vocabulary segmentation +benchmarks. Furthermore, we also focus on the problem of existing evaluation +system that ignores semantic duplication across categories, and propose a new +metric called Semantic-Guided IoU (SG-IoU).",cs.CV,['cs.CV'] +GPT4Point: A Unified Framework for Point-Language Understanding and Generation,Zhangyang Qi · Ye Fang · Zeyi Sun · Xiaoyang Wu · Tong Wu · Jiaqi Wang · Dahua Lin · Hengshuang Zhao, ,https://arxiv.org/abs/2312.02980,,2312.02980.pdf,GPT4Point: A Unified Framework for Point-Language Understanding and Generation,"Multimodal Large Language Models (MLLMs) have excelled in 2D image-text +comprehension and image generation, but their understanding of the 3D world is +notably deficient, limiting progress in 3D language understanding and +generation. To solve this problem, we introduce GPT4Point, an innovative +groundbreaking point-language multimodal model designed specifically for +unified 3D object understanding and generation within the MLLM framework. +GPT4Point as a powerful 3D MLLM seamlessly can execute a variety of point-text +reference tasks such as point-cloud captioning and Q&A. Additionally, GPT4Point +is equipped with advanced capabilities for controllable 3D generation, it can +get high-quality results through a low-quality point-text feature maintaining +the geometric shapes and colors. To support the expansive needs of 3D +object-text pairs, we develop Pyramid-XL, a point-language dataset annotation +engine. It constructs a large-scale database over 1M objects of varied text +granularity levels from the Objaverse-XL dataset, essential for training +GPT4Point. A comprehensive benchmark has been proposed to evaluate 3D +point-language understanding capabilities. 
In extensive evaluations, GPT4Point +has demonstrated superior performance in understanding and generation.",cs.CV,['cs.CV'] +FaceChain-SuDe: Building Derived Class to Inherit Category Attributes for One-shot Subject-Driven Generation,Pengchong Qiao · Lei Shang · Chang Liu · Baigui Sun · Xiangyang Ji · Jie Chen, ,,https://paperswithcode.com/paper/facechain-sude-building-derived-class-to,,,,,nan +TACO: Benchmarking Generalizable Bimanual Tool-ACtion-Object Understanding,Yun Liu · Haolin Yang · Xu Si · Ling Liu · Zipeng Li · Yuxiang Zhang · Yebin Liu · Li Yi, ,https://arxiv.org/abs/2401.08399,,2401.08399.pdf,TACO: Benchmarking Generalizable Bimanual Tool-ACtion-Object Understanding,"Humans commonly work with multiple objects in daily life and can intuitively +transfer manipulation skills to novel objects by understanding object +functional regularities. However, existing technical approaches for analyzing +and synthesizing hand-object manipulation are mostly limited to handling a +single hand and object due to the lack of data support. To address this, we +construct TACO, an extensive bimanual hand-object-interaction dataset spanning +a large variety of tool-action-object compositions for daily human activities. +TACO contains 2.5K motion sequences paired with third-person and egocentric +views, precise hand-object 3D meshes, and action labels. To rapidly expand the +data scale, we present a fully automatic data acquisition pipeline combining +multi-view sensing with an optical motion capture system. With the vast +research fields provided by TACO, we benchmark three generalizable +hand-object-interaction tasks: compositional action recognition, generalizable +hand-object motion forecasting, and cooperative grasp synthesis. Extensive +experiments reveal new insights, challenges, and opportunities for advancing +the studies of generalizable hand-object motion analysis and synthesis. Our +data and code are available at https://taco2024.github.io.",cs.CV,['cs.CV'] +Alpha-CLIP: A CLIP Model Focusing on Wherever You Want,Zeyi Sun · Ye Fang · Tong Wu · Pan Zhang · Yuhang Zang · Shu Kong · Yuanjun Xiong · Dahua Lin · Jiaqi Wang,https://aleafy.github.io/alpha-clip/,https://arxiv.org/abs/2312.03818,,2312.03818.pdf,Alpha-CLIP: A CLIP Model Focusing on Wherever You Want,"Contrastive Language-Image Pre-training (CLIP) plays an essential role in +extracting valuable content information from images across diverse tasks. It +aligns textual and visual modalities to comprehend the entire image, including +all the details, even those irrelevant to specific tasks. However, for a finer +understanding and controlled editing of images, it becomes crucial to focus on +specific regions of interest, which can be indicated as points, masks, or boxes +by humans or perception models. To fulfill the requirements, we introduce +Alpha-CLIP, an enhanced version of CLIP with an auxiliary alpha channel to +suggest attentive regions and fine-tuned with constructed millions of RGBA +region-text pairs. Alpha-CLIP not only preserves the visual recognition ability +of CLIP but also enables precise control over the emphasis of image contents. +It demonstrates effectiveness in various tasks, including but not limited to +open-world recognition, multimodal large language models, and conditional 2D / +3D generation. 
It has a strong potential to serve as a versatile tool for +image-related tasks.",cs.CV,"['cs.CV', 'cs.AI', 'cs.CL', 'cs.LG']" +VCoder: Versatile Vision Encoders for Multimodal Large Language Models,Jitesh Jain · Jianwei Yang · Humphrey Shi,https://praeclarumjj3.github.io/vcoder/,https://arxiv.org/abs/2312.14233,,2312.14233.pdf,VCoder: Versatile Vision Encoders for Multimodal Large Language Models,"Humans possess the remarkable skill of Visual Perception, the ability to see +and understand the seen, helping them make sense of the visual world and, in +turn, reason. Multimodal Large Language Models (MLLM) have recently achieved +impressive performance on vision-language tasks ranging from visual +question-answering and image captioning to visual reasoning and image +generation. However, when prompted to identify or count (perceive) the entities +in a given image, existing MLLM systems fail. Working towards developing an +accurate MLLM system for perception and reasoning, we propose using Versatile +vision enCoders (VCoder) as perception eyes for Multimodal LLMs. We feed the +VCoder with perception modalities such as segmentation or depth maps, improving +the MLLM's perception abilities. Secondly, we leverage the images from COCO and +outputs from off-the-shelf vision perception models to create our COCO +Segmentation Text (COST) dataset for training and evaluating MLLMs on the +object perception task. Thirdly, we introduce metrics to assess the object +perception abilities in MLLMs on our COST dataset. Lastly, we provide extensive +experimental evidence proving the VCoder's improved object-level perception +skills over existing Multimodal LLMs, including GPT-4V. We open-source our +dataset, code, and models to promote research. We open-source our code at +https://github.com/SHI-Labs/VCoder",cs.CV,['cs.CV'] +Emotional Speech-Driven 3D Body Animation via Disentangled Latent Diffusion,Kiran Chhatre · Radek Danecek · Nikos Athanasiou · Giorgio Becherini · Christopher Peters · Michael J. Black · Timo Bolkart,https://amuse.is.tue.mpg.de/,https://arxiv.org/abs/2312.04466,,2312.04466.pdf,Emotional Speech-driven 3D Body Animation via Disentangled Latent Diffusion,"Existing methods for synthesizing 3D human gestures from speech have shown +promising results, but they do not explicitly model the impact of emotions on +the generated gestures. Instead, these methods directly output animations from +speech without control over the expressed emotion. To address this limitation, +we present AMUSE, an emotional speech-driven body animation model based on +latent diffusion. Our observation is that content (i.e., gestures related to +speech rhythm and word utterances), emotion, and personal style are separable. +To account for this, AMUSE maps the driving audio to three disentangled latent +vectors: one for content, one for emotion, and one for personal style. A latent +diffusion model, trained to generate gesture motion sequences, is then +conditioned on these latent vectors. Once trained, AMUSE synthesizes 3D human +gestures directly from speech with control over the expressed emotions and +style by combining the content from the driving speech with the emotion and +style of another speech sequence. Randomly sampling the noise of the diffusion +model further generates variations of the gesture with the same emotional +expressivity. Qualitative, quantitative, and perceptual evaluations demonstrate +that AMUSE outputs realistic gesture sequences. 
Compared to the state of the +art, the generated gestures are better synchronized with the speech content, +and better represent the emotion expressed by the input speech. Our code is +available at amuse.is.tue.mpg.de.",cs.CV,['cs.CV'] +Accept the Modality Gap: An Exploration in the Hyperbolic Space,Sameera Ramasinghe · Violetta Shevchenko · Gil Avraham · Thalaiyasingam Ajanthan, ,,https://openreview.net/forum?id=KiespDPaRH,,,,,nan +Transferable Structural Sparse Adversarial Attack Via Exact Group Sparsity Training,Di Ming · Peng Ren · Yunlong Wang · Xin Feng,https://github.com/MisterRpeng/EGS-TSSA,,https://midasdming.github.io/news/announcement_17/,,,,,nan +DreamComposer: Controllable 3D Object Generation via Multi-View Conditions,Yunhan Yang · Yukun Huang · Xiaoyang Wu · Yuan-Chen Guo · Song-Hai Zhang · Hengshuang Zhao · Tong He · Xihui Liu, ,https://arxiv.org/abs/2312.03611,,2312.03611.pdf,DreamComposer: Controllable 3D Object Generation via Multi-View Conditions,"Utilizing pre-trained 2D large-scale generative models, recent works are +capable of generating high-quality novel views from a single in-the-wild image. +However, due to the lack of information from multiple views, these works +encounter difficulties in generating controllable novel views. In this paper, +we present DreamComposer, a flexible and scalable framework that can enhance +existing view-aware diffusion models by injecting multi-view conditions. +Specifically, DreamComposer first uses a view-aware 3D lifting module to obtain +3D representations of an object from multiple views. Then, it renders the +latent features of the target view from 3D representations with the multi-view +feature fusion module. Finally the target view features extracted from +multi-view inputs are injected into a pre-trained diffusion model. Experiments +show that DreamComposer is compatible with state-of-the-art diffusion models +for zero-shot novel view synthesis, further enhancing them to generate +high-fidelity novel view images with multi-view conditions, ready for +controllable 3D object reconstruction and various other applications.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +Pose Adapted Shape Learning for Large-Pose Face Reenactment,Gee-Sern Hsu · Jie-Ying Zhang · Yu-Hsiang Huang · Wei-Jie Hong, ,,https://ieeexplore.ieee.org/abstract/document/10219601,,,,,nan +LayoutFormer: Hierarchical Text Detection Towards Scene Text Understanding,Min Liang · Jia-Wei Ma · Xiaobin Zhu · Jingyan Qin · Xu-Cheng Yin, ,https://ar5iv.labs.arxiv.org/html/2207.12955,,2207.12955.pdf,Contextual Text Block Detection towards Scene Text Understanding,"Most existing scene text detectors focus on detecting characters or words +that only capture partial text messages due to missing contextual information. +For a better understanding of text in scenes, it is more desired to detect +contextual text blocks (CTBs) which consist of one or multiple integral text +units (e.g., characters, words, or phrases) in natural reading order and +transmit certain complete text messages. This paper presents contextual text +detection, a new setup that detects CTBs for better understanding of texts in +scenes. We formulate the new setup by a dual detection task which first detects +integral text units and then groups them into a CTB. To this end, we design a +novel scene text clustering technique that treats integral text units as tokens +and groups them (belonging to the same CTB) into an ordered token sequence. 
In +addition, we create two datasets SCUT-CTW-Context and ReCTS-Context to +facilitate future research, where each CTB is well annotated by an ordered +sequence of integral text units. Further, we introduce three metrics that +measure contextual text detection in local accuracy, continuity, and global +accuracy. Extensive experiments show that our method accurately detects CTBs +which effectively facilitates downstream tasks such as text classification and +translation. The project is available at +https://sg-vilab.github.io/publication/xue2022contextual/.",cs.CV,['cs.CV'] +PNeRV: Enhancing Spatial Consistency via Pyramidal Neural Representation for Videos,Qi Zhao · M. Salman Asif · Zhan Ma, ,https://arxiv.org/abs/2404.08921,,2404.08921.pdf,PNeRV: Enhancing Spatial Consistency via Pyramidal Neural Representation for Videos,"The primary focus of Neural Representation for Videos (NeRV) is to +effectively model its spatiotemporal consistency. However, current NeRV systems +often face a significant issue of spatial inconsistency, leading to decreased +perceptual quality. To address this issue, we introduce the Pyramidal Neural +Representation for Videos (PNeRV), which is built on a multi-scale information +connection and comprises a lightweight rescaling operator, Kronecker +Fully-connected layer (KFc), and a Benign Selective Memory (BSM) mechanism. The +KFc, inspired by the tensor decomposition of the vanilla Fully-connected layer, +facilitates low-cost rescaling and global correlation modeling. BSM merges +high-level features with granular ones adaptively. Furthermore, we provide an +analysis based on the Universal Approximation Theory of the NeRV system and +validate the effectiveness of the proposed PNeRV.We conducted comprehensive +experiments to demonstrate that PNeRV surpasses the performance of contemporary +NeRV models, achieving the best results in video regression on UVG and DAVIS +under various metrics (PSNR, SSIM, LPIPS, and FVD). Compared to vanilla NeRV, +PNeRV achieves a +4.49 dB gain in PSNR and a 231% increase in FVD on UVG, along +with a +3.28 dB PSNR and 634% FVD increase on DAVIS.",cs.CV,['cs.CV'] +Bézier Everywhere All at Once: Learning Drivable Lanes as Bézier Graphs,Hugh Blayney · Hanlin Tian · Hamish Scott · Nils Goldbeck · Chess Stetson · Panagiotis Angeloudis, ,,https://screenrant.com/everything-everywhere-all-at-once-real-meaning-explained/,,,,,nan +Style Injection in Diffusion: A Training-free Approach for Adapting Large-scale Diffusion Models for Style Transfer,Jiwoo Chung · Sangeek Hyun · Jae-Pil Heo,https://jiwoogit.github.io/StyleID_site/,https://arxiv.org/abs/2312.09008,,2312.09008.pdf,Style Injection in Diffusion: A Training-free Approach for Adapting Large-scale Diffusion Models for Style Transfer,"Despite the impressive generative capabilities of diffusion models, existing +diffusion model-based style transfer methods require inference-stage +optimization (e.g. fine-tuning or textual inversion of style) which is +time-consuming, or fails to leverage the generative ability of large-scale +diffusion models. To address these issues, we introduce a novel artistic style +transfer method based on a pre-trained large-scale diffusion model without any +optimization. Specifically, we manipulate the features of self-attention layers +as the way the cross-attention mechanism works; in the generation process, +substituting the key and value of content with those of style image. 
This +approach provides several desirable characteristics for style transfer +including 1) preservation of content by transferring similar styles into +similar image patches and 2) transfer of style based on similarity of local +texture (e.g. edge) between content and style images. Furthermore, we introduce +query preservation and attention temperature scaling to mitigate the issue of +disruption of original content, and initial latent Adaptive Instance +Normalization (AdaIN) to deal with the disharmonious color (failure to transfer +the colors of style). Our experimental results demonstrate that our proposed +method surpasses state-of-the-art methods in both conventional and +diffusion-based style transfer baselines.",cs.CV,['cs.CV'] +Vanishing-Point-Guided Video Semantic Segmentation of Driving Scenes,Diandian Guo · Deng-Ping Fan · Tongyu Lu · Christos Sakaridis · Luc Van Gool,https://github.com/RascalGdd/VPSeg,https://arxiv.org/abs/2401.15261,,2401.15261.pdf,Vanishing-Point-Guided Video Semantic Segmentation of Driving Scenes,"The estimation of implicit cross-frame correspondences and the high +computational cost have long been major challenges in video semantic +segmentation (VSS) for driving scenes. Prior works utilize keyframes, feature +propagation, or cross-frame attention to address these issues. By contrast, we +are the first to harness vanishing point (VP) priors for more effective +segmentation. Intuitively, objects near VPs (i.e., away from the vehicle) are +less discernible. Moreover, they tend to move radially away from the VP over +time in the usual case of a forward-facing camera, a straight road, and linear +forward motion of the vehicle. Our novel, efficient network for VSS, named +VPSeg, incorporates two modules that utilize exactly this pair of static and +dynamic VP priors: sparse-to-dense feature mining (DenseVP) and VP-guided +motion fusion (MotionVP). MotionVP employs VP-guided motion estimation to +establish explicit correspondences across frames and help attend to the most +relevant features from neighboring frames, while DenseVP enhances weak dynamic +features in distant regions around VPs. These modules operate within a +context-detail framework, which separates contextual features from +high-resolution local features at different input resolutions to reduce +computational costs. Contextual and local features are integrated through +contextualized motion attention (CMA) for the final prediction. Extensive +experiments on two popular driving segmentation benchmarks, Cityscapes and +ACDC, demonstrate that VPSeg outperforms previous SOTA methods, with only +modest computational overhead.",cs.CV,['cs.CV'] +TransNeXt: Robust Foveal Visual Perception for Vision Transformers,Dai Shi, ,https://arxiv.org/abs/2311.17132,,2311.17132.pdf,TransNeXt: Robust Foveal Visual Perception for Vision Transformers,"Due to the depth degradation effect in residual connections, many efficient +Vision Transformers models that rely on stacking layers for information +exchange often fail to form sufficient information mixing, leading to unnatural +visual perception. To address this issue, in this paper, we propose Aggregated +Attention, a biomimetic design-based token mixer that simulates biological +foveal vision and continuous eye movement while enabling each token on the +feature map to have a global perception. 
Furthermore, we incorporate learnable +tokens that interact with conventional queries and keys, which further +diversifies the generation of affinity matrices beyond merely relying on the +similarity between queries and keys. Our approach does not rely on stacking for +information exchange, thus effectively avoiding depth degradation and achieving +natural visual perception. Additionally, we propose Convolutional GLU, a +channel mixer that bridges the gap between GLU and SE mechanism, which empowers +each token to have channel attention based on its nearest neighbor image +features, enhancing local modeling capability and model robustness. We combine +aggregated attention and convolutional GLU to create a new visual backbone +called TransNeXt. Extensive experiments demonstrate that our TransNeXt achieves +state-of-the-art performance across multiple model sizes. At a resolution of +$224^2$, TransNeXt-Tiny attains an ImageNet accuracy of 84.0%, surpassing +ConvNeXt-B with 69% fewer parameters. Our TransNeXt-Base achieves an ImageNet +accuracy of 86.2% and an ImageNet-A accuracy of 61.6% at a resolution of +$384^2$, a COCO object detection mAP of 57.1, and an ADE20K semantic +segmentation mIoU of 54.7.",cs.CV,"['cs.CV', 'cs.AI']" +Neural Implicit Representation for Building Digital Twins of Unknown Articulated Objects,Yijia Weng · Bowen Wen · Jonathan Tremblay · Valts Blukis · Dieter Fox · Leonidas Guibas · Stan Birchfield,https://nvlabs.github.io/DigitalTwinArt/,https://arxiv.org/abs/2404.01440,,2404.01440.pdf,Neural Implicit Representation for Building Digital Twins of Unknown Articulated Objects,"We address the problem of building digital twins of unknown articulated +objects from two RGBD scans of the object at different articulation states. We +decompose the problem into two stages, each addressing distinct aspects. Our +method first reconstructs object-level shape at each state, then recovers the +underlying articulation model including part segmentation and joint +articulations that associate the two states. By explicitly modeling point-level +correspondences and exploiting cues from images, 3D reconstructions, and +kinematics, our method yields more accurate and stable results compared to +prior work. It also handles more than one movable part and does not rely on any +object shape or structure priors. Project page: +https://github.com/NVlabs/DigitalTwinArt",cs.CV,"['cs.CV', 'cs.AI', 'cs.GR', 'cs.RO']" +MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI,Xiang Yue · Yuansheng Ni · Kai Zhang · Tianyu Zheng · Ruoqi Liu · Ge Zhang · Samuel Stevens · Dongfu Jiang · Weiming Ren · Yuxuan Sun · Cong Wei · Botao Yu · Ruibin Yuan · Renliang Sun · Ming Yin · Boyuan Zheng · Zhenzhu Yang · Yibo Liu · Wenhao Huang · Huan Sun · Yu Su · Wenhu Chen,https://mmmu-benchmark.github.io/,https://arxiv.org/abs/2311.16502,,2311.16502.pdf,MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI,"We introduce MMMU: a new benchmark designed to evaluate multimodal models on +massive multi-discipline tasks demanding college-level subject knowledge and +deliberate reasoning. MMMU includes 11.5K meticulously collected multimodal +questions from college exams, quizzes, and textbooks, covering six core +disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & +Social Science, and Tech & Engineering. 
These questions span 30 subjects and +183 subfields, comprising 30 highly heterogeneous image types, such as charts, +diagrams, maps, tables, music sheets, and chemical structures. Unlike existing +benchmarks, MMMU focuses on advanced perception and reasoning with +domain-specific knowledge, challenging models to perform tasks akin to those +faced by experts. The evaluation of 14 open-source LMMs as well as the +proprietary GPT-4V(ision) and Gemini highlights the substantial challenges +posed by MMMU. Even the advanced GPT-4V and Gemini Ultra only achieve +accuracies of 56% and 59% respectively, indicating significant room for +improvement. We believe MMMU will stimulate the community to build +next-generation multimodal foundation models towards expert artificial general +intelligence.",cs.CL,"['cs.CL', 'cs.AI', 'cs.CV']" +Intriguing Properties of Diffusion Models: An Empirical Study of the Natural Attack Capability in Text-to-Image Generative Models,Takami Sato · Justin Yue · Nanze Chen · Ningfei Wang · Alfred Chen, ,https://arxiv.org/abs/2308.15692,,2308.15692.pdf,Intriguing Properties of Diffusion Models: An Empirical Study of the Natural Attack Capability in Text-to-Image Generative Models,"Denoising probabilistic diffusion models have shown breakthrough performance +to generate more photo-realistic images or human-level illustrations than the +prior models such as GANs. This high image-generation capability has stimulated +the creation of many downstream applications in various areas. However, we find +that this technology is actually a double-edged sword: We identify a new type +of attack, called the Natural Denoising Diffusion (NDD) attack based on the +finding that state-of-the-art deep neural network (DNN) models still hold their +prediction even if we intentionally remove their robust features, which are +essential to the human visual system (HVS), through text prompts. The NDD +attack shows a significantly high capability to generate low-cost, +model-agnostic, and transferable adversarial attacks by exploiting the natural +attack capability in diffusion models. To systematically evaluate the risk of +the NDD attack, we perform a large-scale empirical study with our newly created +dataset, the Natural Denoising Diffusion Attack (NDDA) dataset. We evaluate the +natural attack capability by answering 6 research questions. Through a user +study, we find that it can achieve an 88% detection rate while being stealthy +to 93% of human subjects; we also find that the non-robust features embedded by +diffusion models contribute to the natural attack capability. To confirm the +model-agnostic and transferable attack capability, we perform the NDD attack +against the Tesla Model 3 and find that 73% of the physically printed attacks +can be detected as stop signs. Our hope is that the study and dataset can help +our community be aware of the risks in diffusion models and facilitate further +research toward robust DNN models.",cs.CV,"['cs.CV', 'cs.CR']" +SwitchLight: Co-design of Physics-driven Architecture and Pre-training Framework for Human Portrait Relighting,Hoon Kim · Minje Jang · Wonjun Yoon · Jisoo Lee · Donghyun Na · Sanghyun Woo, ,https://arxiv.org/abs/2402.18848,,2402.18848.pdf,SwitchLight: Co-design of Physics-driven Architecture and Pre-training Framework for Human Portrait Relighting,"We introduce a co-designed approach for human portrait relighting that +combines a physics-guided architecture with a pre-training framework. 
Drawing +on the Cook-Torrance reflectance model, we have meticulously configured the +architecture design to precisely simulate light-surface interactions. +Furthermore, to overcome the limitation of scarce high-quality lightstage data, +we have developed a self-supervised pre-training strategy. This novel +combination of accurate physical modeling and expanded training dataset +establishes a new benchmark in relighting realism.",cs.CV,['cs.CV'] +Context-Aware Integration of Language and Visual References for Natural Language Tracking,Yanyan Shao · Shuting He · Qi Ye · Yuchao Feng · Wenhan Luo · Jiming Chen,https://github.com/twotwo2/QueryNLT,https://arxiv.org/abs/2403.19975,,2403.19975.pdf,Context-Aware Integration of Language and Visual References for Natural Language Tracking,"Tracking by natural language specification (TNL) aims to consistently +localize a target in a video sequence given a linguistic description in the +initial frame. Existing methodologies perform language-based and template-based +matching for target reasoning separately and merge the matching results from +two sources, which suffer from tracking drift when language and visual +templates miss-align with the dynamic target state and ambiguity in the later +merging stage. To tackle the issues, we propose a joint multi-modal tracking +framework with 1) a prompt modulation module to leverage the complementarity +between temporal visual templates and language expressions, enabling precise +and context-aware appearance and linguistic cues, and 2) a unified target +decoding module to integrate the multi-modal reference cues and executes the +integrated queries on the search image to predict the target location in an +end-to-end manner directly. This design ensures spatio-temporal consistency by +leveraging historical visual information and introduces an integrated solution, +generating predictions in a single step. Extensive experiments conducted on +TNL2K, OTB-Lang, LaSOT, and RefCOCOg validate the efficacy of our proposed +approach. The results demonstrate competitive performance against +state-of-the-art methods for both tracking and grounding.",cs.CV,['cs.CV'] +Correlation-Decoupled Knowledge Distillation for Multimodal Sentiment Analysis with Incomplete Modalities,Mingcheng Li · Dingkang Yang · Xiao Zhao · Shuaibing Wang · Yan Wang · Kun Yang · Mingyang Sun · Dongliang Kou · Qian · Lihua Zhang, ,https://arxiv.org/abs/2404.16456,,2404.16456.pdf,Correlation-Decoupled Knowledge Distillation for Multimodal Sentiment Analysis with Incomplete Modalities,"Multimodal sentiment analysis (MSA) aims to understand human sentiment +through multimodal data. Most MSA efforts are based on the assumption of +modality completeness. However, in real-world applications, some practical +factors cause uncertain modality missingness, which drastically degrades the +model's performance. To this end, we propose a Correlation-decoupled Knowledge +Distillation (CorrKD) framework for the MSA task under uncertain missing +modalities. Specifically, we present a sample-level contrastive distillation +mechanism that transfers comprehensive knowledge containing cross-sample +correlations to reconstruct missing semantics. Moreover, a category-guided +prototype distillation mechanism is introduced to capture cross-category +correlations using category prototypes to align feature distributions and +generate favorable joint representations. 
Eventually, we design a +response-disentangled consistency distillation strategy to optimize the +sentiment decision boundaries of the student network through response +disentanglement and mutual information maximization. Comprehensive experiments +on three datasets indicate that our framework can achieve favorable +improvements compared with several baselines.",cs.CV,['cs.CV'] +Pose-Transformed Equivariant Network for 3D Point Trajectory Prediction,Ruixuan Yu · Jian Sun, ,https://arxiv.org/abs/2308.06564,,2308.06564.pdf,EquiDiff: A Conditional Equivariant Diffusion Model For Trajectory Prediction,"Accurate trajectory prediction is crucial for the safe and efficient +operation of autonomous vehicles. The growing popularity of deep learning has +led to the development of numerous methods for trajectory prediction. While +deterministic deep learning models have been widely used, deep generative +models have gained popularity as they learn data distributions from training +data and account for trajectory uncertainties. In this study, we propose +EquiDiff, a deep generative model for predicting future vehicle trajectories. +EquiDiff is based on the conditional diffusion model, which generates future +trajectories by incorporating historical information and random Gaussian noise. +The backbone model of EquiDiff is an SO(2)-equivariant transformer that fully +utilizes the geometric properties of location coordinates. In addition, we +employ Recurrent Neural Networks and Graph Attention Networks to extract social +interactions from historical trajectories. To evaluate the performance of +EquiDiff, we conduct extensive experiments on the NGSIM dataset. Our results +demonstrate that EquiDiff outperforms other baseline models in short-term +prediction, but has slightly higher errors for long-term prediction. +Furthermore, we conduct an ablation study to investigate the contribution of +each component of EquiDiff to the prediction accuracy. Additionally, we present +a visualization of the generation process of our diffusion model, providing +insights into the uncertainty of the prediction.",cs.LG,"['cs.LG', 'cs.RO']" +SC-Tune: Unleashing Self-Consistent Referential Comprehension in Large Vision Language Models,Tongtian Yue · Jie Cheng · Longteng Guo · Xingyuan Dai · Zijia Zhao · Xingjian He · Gang Xiong · Yisheng Lv · Jing Liu, ,https://arxiv.org/abs/2403.13263,,2403.13263.pdf,SC-Tune: Unleashing Self-Consistent Referential Comprehension in Large Vision Language Models,"Recent trends in Large Vision Language Models (LVLMs) research have been +increasingly focusing on advancing beyond general image understanding towards +more nuanced, object-level referential comprehension. In this paper, we present +and delve into the self-consistency capability of LVLMs, a crucial aspect that +reflects the models' ability to both generate informative captions for specific +objects and subsequently utilize these captions to accurately re-identify the +objects in a closed-loop process. This capability significantly mirrors the +precision and reliability of fine-grained visual-language understanding. Our +findings reveal that the self-consistency level of existing LVLMs falls short +of expectations, posing limitations on their practical applicability and +potential. To address this gap, we introduce a novel fine-tuning paradigm named +Self-Consistency Tuning (SC-Tune). It features the synergistic learning of a +cyclic describer-locator system. 
This paradigm is not only data-efficient but +also exhibits generalizability across multiple LVLMs. Through extensive +experiments, we demonstrate that SC-Tune significantly elevates performance +across a spectrum of object-level vision-language benchmarks and maintains +competitive or improved performance on image-level vision-language benchmarks. +Both our model and code will be publicly available at +https://github.com/ivattyue/SC-Tune.",cs.CV,['cs.CV'] +Going Beyond Multi-Task Dense Prediction with Synergy Embedding Models,Huimin Huang · Yawen Huang · Lanfen Lin · Ruofeng Tong · Yen-Wei Chen · Hao Zheng · Yuexiang Li · Yefeng Zheng, ,https://arxiv.org/abs/2405.14136,,,Efficient Multitask Dense Predictor via Binarization,"Multi-task learning for dense prediction has emerged as a pivotal area in +computer vision, enabling simultaneous processing of diverse yet interrelated +pixel-wise prediction tasks. However, the substantial computational demands of +state-of-the-art (SoTA) models often limit their widespread deployment. This +paper addresses this challenge by introducing network binarization to compress +resource-intensive multi-task dense predictors. Specifically, our goal is to +significantly accelerate multi-task dense prediction models via Binary Neural +Networks (BNNs) while maintaining and even improving model performance at the +same time. To reach this goal, we propose a Binary Multi-task Dense Predictor, +Bi-MTDP, and several variants of Bi-MTDP, in which a multi-task dense predictor +is constructed via specified binarized modules. Our systematical analysis of +this predictor reveals that performance drop from binarization is primarily +caused by severe information degradation. To address this issue, we introduce a +deep information bottleneck layer that enforces representations for downstream +tasks satisfying Gaussian distribution in forward propagation. Moreover, we +introduce a knowledge distillation mechanism to correct the direction of +information flow in backward propagation. Intriguingly, one variant of Bi-MTDP +outperforms full-precision (FP) multi-task dense prediction SoTAs, ARTC +(CNN-based) and InvPT (ViT-Based). This result indicates that Bi-MTDP is not +merely a naive trade-off between performance and efficiency, but is rather a +benefit of the redundant information flow thanks to the multi-task +architecture. Code is available at https://github.com/42Shawn/BiMTDP.",cs.CV,['cs.CV'] +Clustering Propagation for Universal Medical Image Segmentation,Yuhang Ding · Liulei Li · Wenguan Wang · Yi Yang, ,https://arxiv.org/abs/2403.16646,,2403.16646.pdf,Clustering Propagation for Universal Medical Image Segmentation,"Prominent solutions for medical image segmentation are typically tailored for +automatic or interactive setups, posing challenges in facilitating progress +achieved in one task to another. This also +necessitates separate models for each task, duplicating both +training time and parameters. To address above +issues, we introduce S2VNet, a +universal framework that leverages +Slice-to-Volume propagation to unify automatic/interactive +segmentation within a single model and one training session.
Inspired by +clustering-based segmentation techniques, S2VNet makes full use of the +slice-wise structure of volumetric data by initializing cluster centers from +the cluster results of previous slice. This +enables knowledge acquired from prior slices to assist in the segmentation of +the current slice, further efficiently bridging the communication between +remote slices using mere 2D networks. Moreover, such a framework readily +accommodates interactive segmentation with no architectural change, simply by +initializing centroids from user inputs. S2VNet distinguishes itself by swift +inference speeds and reduced memory consumption compared to prevailing 3D +solutions. It can also handle multi-class interactions with each of them +serving to initialize different centroids. Experiments on three benchmarks +demonstrate S2VNet surpasses task-specified solutions on both +automatic/interactive setups.",cs.CV,['cs.CV'] +Tri-Modal Motion Retrieval by Learning a Joint Embedding Space,Kangning Yin · Shihao Zou · Yuxuan Ge · Zheng Tian, ,https://arxiv.org/abs/2403.00691,,2403.00691.pdf,Tri-Modal Motion Retrieval by Learning a Joint Embedding Space,"Information retrieval is an ever-evolving and crucial research domain. The +substantial demand for high-quality human motion data especially in online +acquirement has led to a surge in human motion research works. Prior works have +mainly concentrated on dual-modality learning, such as text and motion tasks, +but three-modality learning has been rarely explored. Intuitively, an extra +introduced modality can enrich a model's application scenario, and more +importantly, an adequate choice of the extra modality can also act as an +intermediary and enhance the alignment between the other two disparate +modalities. In this work, we introduce LAVIMO (LAnguage-VIdeo-MOtion +alignment), a novel framework for three-modality learning integrating +human-centric videos as an additional modality, thereby effectively bridging +the gap between text and motion. Moreover, our approach leverages a specially +designed attention mechanism to foster enhanced alignment and synergistic +effects among text, video, and motion modalities. Empirically, our results on +the HumanML3D and KIT-ML datasets show that LAVIMO achieves state-of-the-art +performance in various motion-related cross-modal retrieval tasks, including +text-to-motion, motion-to-text, video-to-motion and motion-to-video.",cs.CV,"['cs.CV', 'cs.AI']" +Rethinking Human Motion Prediction with Symplectic Integral,Haipeng Chen · Kedi Lyu · Zhenguang Liu · Yifang Yin · Xun Yang · Yingda Lyu, ,https://arxiv.org/abs/2312.06184,,2312.06184.pdf,Recent Advances in Deterministic Human Motion Prediction: A Review,"In recent years, with the continuous advancement of deep learning and the +emergence of large-scale human motion datasets, human motion prediction +technology has gradually gained prominence in various fields such as +human-computer interaction, autonomous driving, sports analysis, and personnel +tracking. This article introduces common model architectures in this domain +along with their respective advantages and disadvantages. It also +systematically summarizes recent research innovations, focusing on in-depth +discussions of relevant papers in these areas, thereby highlighting +forward-looking insights into the field's development. Furthermore, this paper +provides a comprehensive overview of existing methods, commonly used datasets, +and evaluation metrics in this field.
Finally, it discusses some of the current +limitations in the field and proposes potential future research directions to +address these challenges and promote further advancements in human motion +prediction.",cs.CV,['cs.CV'] +UniPAD: A Universal Pre-training Paradigm for Autonomous Driving,Honghui Yang · Sha Zhang · Di Huang · Xiaoyang Wu · Haoyi Zhu · Tong He · SHIXIANG TANG · Hengshuang Zhao · Qibo Qiu · Binbin Lin · Xiaofei He · Wanli Ouyang,https://github.com/Nightmare-n/UniPAD,https://arxiv.org/abs/2310.08370,,2310.08370.pdf,UniPAD: A Universal Pre-training Paradigm for Autonomous Driving,"In the context of autonomous driving, the significance of effective feature +learning is widely acknowledged. While conventional 3D self-supervised +pre-training methods have shown widespread success, most methods follow the +ideas originally designed for 2D images. In this paper, we present UniPAD, a +novel self-supervised learning paradigm applying 3D volumetric differentiable +rendering. UniPAD implicitly encodes 3D space, facilitating the reconstruction +of continuous 3D shape structures and the intricate appearance characteristics +of their 2D projections. The flexibility of our method enables seamless +integration into both 2D and 3D frameworks, enabling a more holistic +comprehension of the scenes. We manifest the feasibility and effectiveness of +UniPAD by conducting extensive experiments on various downstream 3D tasks. Our +method significantly improves lidar-, camera-, and lidar-camera-based baseline +by 9.1, 7.7, and 6.9 NDS, respectively. Notably, our pre-training pipeline +achieves 73.2 NDS for 3D object detection and 79.4 mIoU for 3D semantic +segmentation on the nuScenes validation set, achieving state-of-the-art results +in comparison with previous methods. The code will be available at +https://github.com/Nightmare-n/UniPAD.",cs.CV,['cs.CV'] +Dual-Enhanced Coreset Selection with Class-wise Collaboration for Online Blurry Class Incremental Learning,Yutian Luo · Shiqi Zhao · Haoran Wu · Zhiwu Lu, ,https://arxiv.org/abs/2308.09303,,2308.09303.pdf,Online Class Incremental Learning on Stochastic Blurry Task Boundary via Mask and Visual Prompt Tuning,"Continual learning aims to learn a model from a continuous stream of data, +but it mainly assumes a fixed number of data and tasks with clear task +boundaries. However, in real-world scenarios, the number of input data and +tasks is constantly changing in a statistical way, not a static way. Although +recently introduced incremental learning scenarios having blurry task +boundaries somewhat address the above issues, they still do not fully reflect +the statistical properties of real-world situations because of the fixed ratio +of disjoint and blurry samples. In this paper, we propose a new Stochastic +incremental Blurry task boundary scenario, called Si-Blurry, which reflects the +stochastic properties of the real-world. We find that there are two major +challenges in the Si-Blurry scenario: (1) inter- and intra-task forgettings and +(2) class imbalance problem. To alleviate them, we introduce Mask and Visual +Prompt tuning (MVP). In MVP, to address the inter- and intra-task forgetting +issues, we propose a novel instance-wise logit masking and contrastive visual +prompt tuning loss. Both of them help our model discern the classes to be +learned in the current batch. It results in consolidating the previous +knowledge. 
In addition, to alleviate the class imbalance problem, we introduce +a new gradient similarity-based focal loss and adaptive feature scaling to ease +overfitting to the major classes and underfitting to the minor classes. +Extensive experiments show that our proposed MVP significantly outperforms the +existing state-of-the-art methods in our challenging Si-Blurry scenario.",cs.CV,"['cs.CV', 'cs.LG']" +Text Prompt with Normality Guidance for Weakly Supervised Video Anomaly Detection,Zhiwei Yang · Jing Liu · Peng Wu, ,https://arxiv.org/abs/2404.08531,,2404.08531.pdf,Text Prompt with Normality Guidance for Weakly Supervised Video Anomaly Detection,"Weakly supervised video anomaly detection (WSVAD) is a challenging task. +Generating fine-grained pseudo-labels based on weak-label and then +self-training a classifier is currently a promising solution. However, since +the existing methods use only RGB visual modality and the utilization of +category text information is neglected, thus limiting the generation of more +accurate pseudo-labels and affecting the performance of self-training. Inspired +by the manual labeling process based on the event description, in this paper, +we propose a novel pseudo-label generation and self-training framework based on +Text Prompt with Normality Guidance (TPWNG) for WSVAD. Our idea is to transfer +the rich language-visual knowledge of the contrastive language-image +pre-training (CLIP) model for aligning the video event description text and +corresponding video frames to generate pseudo-labels. Specifically, we first +fine-tune the CLIP for domain adaptation by designing two ranking losses and a +distributional inconsistency loss. Further, we propose a learnable text prompt +mechanism with the assist of a normality visual prompt to further improve the +matching accuracy of video event description text and video frames. Then, we +design a pseudo-label generation module based on the normality guidance to +infer reliable frame-level pseudo-labels. Finally, we introduce a temporal +context self-adaptive learning module to learn the temporal dependencies of +different video events more flexibly and accurately. Extensive experiments show +that our method achieves state-of-the-art performance on two benchmark +datasets, UCF-Crime and XD-Violence.",cs.CV,['cs.CV'] +Partial-to-Partial Shape Matching with Geometric Consistency,Viktoria Ehm · Maolin Gao · Paul Roetzer · Marvin Eisenberger · Daniel Cremers · Florian Bernard,https://vikiehm.github.io/publications/gcppsm/,https://arxiv.org/abs/2404.12209,,2404.12209.pdf,Partial-to-Partial Shape Matching with Geometric Consistency,"Finding correspondences between 3D shapes is an important and long-standing +problem in computer vision, graphics and beyond. A prominent challenge are +partial-to-partial shape matching settings, which occur when the shapes to +match are only observed incompletely (e.g. from 3D scanning). Although +partial-to-partial matching is a highly relevant setting in practice, it is +rarely explored. Our work bridges the gap between existing (rather artificial) +3D full shape matching and partial-to-partial real-world settings by exploiting +geometric consistency as a strong constraint. We demonstrate that it is indeed +possible to solve this challenging problem in a variety of settings.
For the +first time, we achieve geometric consistency for partial-to-partial matching, +which is realized by a novel integer non-linear program formalism building on +triangle product spaces, along with a new pruning algorithm based on linear +integer programming. Further, we generate a new inter-class dataset for +partial-to-partial shape-matching. We show that our method outperforms current +SOTA methods on both an established intra-class dataset and our novel +inter-class dataset.",cs.CV,['cs.CV'] +Beyond Image Super-Resolution for Image Recognition with Task-Driven Perceptual Loss,Jaeha Kim · Junghun Oh · Kyoung Mu Lee, ,https://arxiv.org/abs/2404.01692,,2404.01692.pdf,Beyond Image Super-Resolution for Image Recognition with Task-Driven Perceptual Loss,"In real-world scenarios, image recognition tasks, such as semantic +segmentation and object detection, often pose greater challenges due to the +lack of information available within low-resolution (LR) content. Image +super-resolution (SR) is one of the promising solutions for addressing the +challenges. However, due to the ill-posed property of SR, it is challenging for +typical SR methods to restore task-relevant high-frequency contents, which may +dilute the advantage of utilizing the SR method. Therefore, in this paper, we +propose Super-Resolution for Image Recognition (SR4IR) that effectively guides +the generation of SR images beneficial to achieving satisfactory image +recognition performance when processing LR images. The critical component of +our SR4IR is the task-driven perceptual (TDP) loss that enables the SR network +to acquire task-specific knowledge from a network tailored for a specific task. +Moreover, we propose a cross-quality patch mix and an alternate training +framework that significantly enhances the efficacy of the TDP loss by +addressing potential problems when employing the TDP loss. Through extensive +experiments, we demonstrate that our SR4IR achieves outstanding task +performance by generating SR images useful for a specific image recognition +task, including semantic segmentation, object detection, and image +classification. The implementation code is available at +https://github.com/JaehaKim97/SR4IR.",cs.CV,['cs.CV'] +FaceCom: Towards High-fidelity 3D Facial Shape Completion via Optimization and Inpainting Guidance,Yinglong Li · Hongyu Wu · Wang · Qingzhao Qin · yijiao zhao · Yong Wang · Aimin Hao, ,https://arxiv.org/abs/2308.16758,,2308.16758.pdf,Towards High-Fidelity Text-Guided 3D Face Generation and Manipulation Using only Images,"Generating 3D faces from textual descriptions has a multitude of +applications, such as gaming, movie, and robotics. Recent progresses have +demonstrated the success of unconditional 3D face generation and text-to-3D +shape generation. However, due to the limited text-3D face data pairs, +text-driven 3D face generation remains an open problem. In this paper, we +propose a text-guided 3D faces generation method, refer as TG-3DFace, for +generating realistic 3D faces using text guidance. Specifically, we adopt an +unconditional 3D face generation framework and equip it with text conditions, +which learns the text-guided 3D face generation with only text-2D face data. On +top of that, we propose two text-to-face cross-modal alignment techniques, +including the global contrastive learning and the fine-grained alignment +module, to facilitate high semantic consistency between generated 3D faces and +input texts. 
Besides, we present directional classifier guidance during the +inference process, which encourages creativity for out-of-domain generations. +Compared to the existing methods, TG-3DFace creates more realistic and +aesthetically pleasing 3D faces, boosting 9% multi-view consistency (MVIC) over +Latent3D. The rendered face images generated by TG-3DFace achieve higher FID +and CLIP score than text-to-2D face/image generation models, demonstrating our +superiority in generating realistic and semantic-consistent textures.",cs.CV,['cs.CV'] +4K4D: Real-Time 4D View Synthesis at 4K Resolution,Zhen Xu · Sida Peng · Haotong Lin · Guangzhao He · Jiaming Sun · Yujun Shen · Hujun Bao · Xiaowei Zhou,https://zju3dv.github.io/4k4d,https://arxiv.org/abs/2310.11448,,2310.11448.pdf,4K4D: Real-Time 4D View Synthesis at 4K Resolution,"This paper targets high-fidelity and real-time view synthesis of dynamic 3D +scenes at 4K resolution. Recently, some methods on dynamic view synthesis have +shown impressive rendering quality. However, their speed is still limited when +rendering high-resolution images. To overcome this problem, we propose 4K4D, a +4D point cloud representation that supports hardware rasterization and enables +unprecedented rendering speed. Our representation is built on a 4D feature grid +so that the points are naturally regularized and can be robustly optimized. In +addition, we design a novel hybrid appearance model that significantly boosts +the rendering quality while preserving efficiency. Moreover, we develop a +differentiable depth peeling algorithm to effectively learn the proposed model +from RGB videos. Experiments show that our representation can be rendered at +over 400 FPS on the DNA-Rendering dataset at 1080p resolution and 80 FPS on the +ENeRF-Outdoor dataset at 4K resolution using an RTX 4090 GPU, which is 30x +faster than previous methods and achieves the state-of-the-art rendering +quality. Our project page is available at https://zju3dv.github.io/4k4d/.",cs.CV,['cs.CV'] +VILA: On Pre-training for Visual Language Models,Ji Lin · Danny Yin · Wei Ping · Pavlo Molchanov · Mohammad Shoeybi · Song Han,https://github.com/NVlabs/VILA,https://arxiv.org/abs/2312.07533,,,VILA: On Pre-training for Visual Language Models,"Visual language models (VLMs) rapidly progressed with the recent success of +large language models. There have been growing efforts on visual instruction +tuning to extend the LLM with visual inputs, but lacks an in-depth study of the +visual language pre-training process, where the model learns to perform joint +modeling on both modalities. In this work, we examine the design options for +VLM pre-training by augmenting LLM towards VLM through step-by-step +controllable comparisons. We introduce three main findings: (1) freezing LLMs +during pre-training can achieve decent zero-shot performance, but lack +in-context learning capability, which requires unfreezing the LLM; (2) +interleaved pre-training data is beneficial whereas image-text pairs alone are +not optimal; (3) re-blending text-only instruction data to image-text data +during instruction fine-tuning not only remedies the degradation of text-only +tasks, but also boosts VLM task accuracy. With an enhanced pre-training recipe +we build VILA, a Visual Language model family that consistently outperforms the +state-of-the-art models, e.g., LLaVA-1.5, across main benchmarks without bells +and whistles. 
Multi-modal pre-training also helps unveil appealing properties +of VILA, including multi-image reasoning, enhanced in-context learning, and +better world knowledge.",cs.CV,['cs.CV'] +GS-SLAM: Dense Visual SLAM with 3D Gaussian Splatting,Chi Yan · Delin Qu · Dong Wang · Dan Xu · Zhigang Wang · Bin Zhao · Xuelong Li,https://gs-slam.github.io/,https://arxiv.org/abs/2311.11700,,2311.11700.pdf,GS-SLAM: Dense Visual SLAM with 3D Gaussian Splatting,"In this paper, we introduce \textbf{GS-SLAM} that first utilizes 3D Gaussian +representation in the Simultaneous Localization and Mapping (SLAM) system. It +facilitates a better balance between efficiency and accuracy. Compared to +recent SLAM methods employing neural implicit representations, our method +utilizes a real-time differentiable splatting rendering pipeline that offers +significant speedup to map optimization and RGB-D rendering. Specifically, we +propose an adaptive expansion strategy that adds new or deletes noisy 3D +Gaussians in order to efficiently reconstruct new observed scene geometry and +improve the mapping of previously observed areas. This strategy is essential to +extend 3D Gaussian representation to reconstruct the whole scene rather than +synthesize a static object in existing methods. Moreover, in the pose tracking +process, an effective coarse-to-fine technique is designed to select reliable +3D Gaussian representations to optimize camera pose, resulting in runtime +reduction and robust estimation. Our method achieves competitive performance +compared with existing state-of-the-art real-time methods on the Replica, +TUM-RGBD datasets. Project page: https://gs-slam.github.io/.",cs.CV,['cs.CV'] +Generating Content for HDR Deghosting from Frequency View,Tao Hu · Qingsen Yan · Yuankai Qi · Yanning Zhang, ,https://arxiv.org/abs/2404.00849,,2404.00849.pdf,Generating Content for HDR Deghosting from Frequency View,"Recovering ghost-free High Dynamic Range (HDR) images from multiple Low +Dynamic Range (LDR) images becomes challenging when the LDR images exhibit +saturation and significant motion. Recent Diffusion Models (DMs) have been +introduced in HDR imaging field, demonstrating promising performance, +particularly in achieving visually perceptible results compared to previous +DNN-based methods. However, DMs require extensive iterations with large models +to estimate entire images, resulting in inefficiency that hinders their +practical application. To address this challenge, we propose the Low-Frequency +aware Diffusion (LF-Diff) model for ghost-free HDR imaging. The key idea of +LF-Diff is implementing the DMs in a highly compacted latent space and +integrating it into a regression-based model to enhance the details of +reconstructed images. Specifically, as low-frequency information is closely +related to human visual perception we propose to utilize DMs to create compact +low-frequency priors for the reconstruction process. In addition, to take full +advantage of the above low-frequency priors, the Dynamic HDR Reconstruction +Network (DHRNet) is carried out in a regression-based manner to obtain final +HDR images. 
Extensive experiments conducted on synthetic and real-world +benchmark datasets demonstrate that our LF-Diff performs favorably against +several state-of-the-art methods and is 10$\times$ faster than previous +DM-based methods.",cs.CV,['cs.CV'] +Neural Sign Actors: A diffusion model for 3D sign language production from text,Vasileios Baltatzis · Rolandos Alexandros Potamias · Evangelos Ververas · Guanxiong Sun · Jiankang Deng · Stefanos Zafeiriou, ,https://arxiv.org/abs/2312.02702,,2312.02702.pdf,Neural Sign Actors: A diffusion model for 3D sign language production from text,"Sign Languages (SL) serve as the primary mode of communication for the Deaf +and Hard of Hearing communities. Deep learning methods for SL recognition and +translation have achieved promising results. However, Sign Language Production +(SLP) poses a challenge as the generated motions must be realistic and have +precise semantic meaning. Most SLP methods rely on 2D data, which hinders their +realism. In this work, a diffusion-based SLP model is trained on a curated +large-scale dataset of 4D signing avatars and their corresponding text +transcripts. The proposed method can generate dynamic sequences of 3D avatars +from an unconstrained domain of discourse using a diffusion process formed on a +novel and anatomically informed graph neural network defined on the SMPL-X body +skeleton. Through quantitative and qualitative experiments, we show that the +proposed method considerably outperforms previous methods of SLP. This work +makes an important step towards realistic neural sign avatars, bridging the +communication gap between Deaf and hearing communities.",cs.CV,['cs.CV'] +Steerers: A framework for rotation equivariant keypoint descriptors,Georg Bökman · Johan Edstedt · Michael Felsberg · Fredrik Kahl, ,https://arxiv.org/abs/2312.02152,,2312.02152.pdf,Steerers: A framework for rotation equivariant keypoint descriptors,"Image keypoint descriptions that are discriminative and matchable over large +changes in viewpoint are vital for 3D reconstruction. However, descriptions +output by learned descriptors are typically not robust to camera rotation. +While they can be made more robust by, e.g., data augmentation, this degrades +performance on upright images. Another approach is test-time augmentation, +which incurs a significant increase in runtime. Instead, we learn a linear +transform in description space that encodes rotations of the input image. We +call this linear transform a steerer since it allows us to transform the +descriptions as if the image was rotated. From representation theory, we know +all possible steerers for the rotation group. Steerers can be optimized (A) +given a fixed descriptor, (B) jointly with a descriptor or (C) we can optimize +a descriptor given a fixed steerer. We perform experiments in these three +settings and obtain state-of-the-art results on the rotation invariant image +matching benchmarks AIMS and Roto-360. We publish code and model weights at +https://github.com/georg-bn/rotation-steerers.",cs.CV,['cs.CV'] +LOTUS: Evasive and Resilient Backdoor Attacks through Sub-Partitioning,Siyuan Cheng · Guanhong Tao · Yingqi Liu · Guangyu Shen · Shengwei An · Shiwei Feng · Xiangzhe Xu · Kaiyuan Zhang · Shiqing Ma · Xiangyu Zhang,https://github.com/Megum1/LOTUS,https://arxiv.org/abs/2403.17188,,2403.17188.pdf,LOTUS: Evasive and Resilient Backdoor Attacks through Sub-Partitioning,"Backdoor attack poses a significant security threat to Deep Learning +applications. 
Existing attacks are often not evasive to established backdoor +detection techniques. This susceptibility primarily stems from the fact that +these attacks typically leverage a universal trigger pattern or transformation +function, such that the trigger can cause misclassification for any input. In +response to this, recent papers have introduced attacks using sample-specific +invisible triggers crafted through special transformation functions. While +these approaches manage to evade detection to some extent, they reveal +vulnerability to existing backdoor mitigation techniques. To address and +enhance both evasiveness and resilience, we introduce a novel backdoor attack +LOTUS. Specifically, it leverages a secret function to separate samples in the +victim class into a set of partitions and applies unique triggers to different +partitions. Furthermore, LOTUS incorporates an effective trigger focusing +mechanism, ensuring only the trigger corresponding to the partition can induce +the backdoor behavior. Extensive experimental results show that LOTUS can +achieve high attack success rate across 4 datasets and 7 model structures, and +effectively evading 13 backdoor detection and mitigation techniques. The code +is available at https://github.com/Megum1/LOTUS.",cs.CV,"['cs.CV', 'cs.CR']" +Language-only Training of Zero-shot Composed Image Retrieval,Geonmo Gu · Sanghyuk Chun · Wonjae Kim · Yoohoon Kang · Sangdoo Yun,https://github.com/navervision/lincir,https://arxiv.org/abs/2312.01998,,2312.01998.pdf,Language-only Efficient Training of Zero-shot Composed Image Retrieval,"Composed image retrieval (CIR) task takes a composed query of image and text, +aiming to search relative images for both conditions. Conventional CIR +approaches need a training dataset composed of triplets of query image, query +text, and target image, which is very expensive to collect. Several recent +works have worked on the zero-shot (ZS) CIR paradigm to tackle the issue +without using pre-collected triplets. However, the existing ZS-CIR methods show +limited backbone scalability and generalizability due to the lack of diversity +of the input texts during training. We propose a novel CIR framework, only +using language for its training. Our LinCIR (Language-only training for CIR) +can be trained only with text datasets by a novel self-supervision named +self-masking projection (SMP). We project the text latent embedding to the +token embedding space and construct a new text by replacing the keyword tokens +of the original text. Then, we let the new and original texts have the same +latent embedding vector. With this simple strategy, LinCIR is surprisingly +efficient and highly effective; LinCIR with CLIP ViT-G backbone is trained in +48 minutes and shows the best ZS-CIR performances on four different CIR +benchmarks, CIRCO, GeneCIS, FashionIQ, and CIRR, even outperforming supervised +method on FashionIQ. Code is available at https://github.com/navervision/lincir",cs.CV,"['cs.CV', 'cs.IR']" +"""Previously on ..."" From Recaps to Story Summarization",Aditya Kumar Singh · Dhruv Srivastava · Makarand Tapaswi, ,https://arxiv.org/abs/2405.11487,,2405.11487.pdf,"""Previously on ..."" From Recaps to Story Summarization","We introduce multimodal story summarization by leveraging TV episode recaps - +short video sequences interweaving key story moments from previous episodes to +bring viewers up to speed. We propose PlotSnap, a dataset featuring two crime +thriller TV shows with rich recaps and long episodes of 40 minutes. 
Story +summarization labels are unlocked by matching recap shots to corresponding +sub-stories in the episode. We propose a hierarchical model TaleSumm that +processes entire episodes by creating compact shot and dialog representations, +and predicts importance scores for each video shot and dialog utterance by +enabling interactions between local story groups. Unlike traditional +summarization, our method extracts multiple plot points from long videos. We +present a thorough evaluation on story summarization, including promising +cross-series generalization. TaleSumm also shows good results on classic video +summarization benchmarks.",cs.CV,['cs.CV'] +Learning Equi-angular Representations for Online Continual Learning,Minhyuk Seo · Hyunseo Koh · Wonje Jeung · Minjae Lee · San Kim · Hankook Lee · Sungjun Cho · Sungik Choi · Hyunwoo Kim · Jonghyun Choi, ,https://arxiv.org/abs/2404.01628,,2404.01628.pdf,Learning Equi-angular Representations for Online Continual Learning,"Online continual learning suffers from an underfitted solution due to +insufficient training for prompt model update (e.g., single-epoch training). To +address the challenge, we propose an efficient online continual learning method +using the neural collapse phenomenon. In particular, we induce neural collapse +to form a simplex equiangular tight frame (ETF) structure in the representation +space so that the continuously learned model with a single epoch can better fit +to the streamed data by proposing preparatory data training and residual +correction in the representation space. With an extensive set of empirical +validations using CIFAR-10/100, TinyImageNet, ImageNet-200, and ImageNet-1K, we +show that our proposed method outperforms state-of-the-art methods by a +noticeable margin in various online continual learning scenarios such as +disjoint and Gaussian scheduled continuous (i.e., boundary-free) data setups.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +Holodeck: Language Guided Generation of 3D Embodied AI Environments,Yue Yang · Fan-Yun Sun · Luca Weihs · Eli VanderBilt · Alvaro Herrasti · Winson Han · Jiajun Wu · Nick Haber · Ranjay Krishna · Lingjie Liu · Chris Callison-Burch · Mark Yatskar · Aniruddha Kembhavi · Christopher Clark,https://yueyang1996.github.io/holodeck/,https://arxiv.org/abs/2312.09067,,2312.09067.pdf,Holodeck: Language Guided Generation of 3D Embodied AI Environments,"3D simulated environments play a critical role in Embodied AI, but their +creation requires expertise and extensive manual effort, restricting their +diversity and scope. To mitigate this limitation, we present Holodeck, a system +that generates 3D environments to match a user-supplied prompt fully +automatedly. Holodeck can generate diverse scenes, e.g., arcades, spas, and +museums, adjust the designs for styles, and can capture the semantics of +complex queries such as ""apartment for a researcher with a cat"" and ""office of +a professor who is a fan of Star Wars"". Holodeck leverages a large language +model (i.e., GPT-4) for common sense knowledge about what the scene might look +like and uses a large collection of 3D assets from Objaverse to populate the +scene with diverse objects. To address the challenge of positioning objects +correctly, we prompt GPT-4 to generate spatial relational constraints between +objects and then optimize the layout to satisfy those constraints. 
Our +large-scale human evaluation shows that annotators prefer Holodeck over +manually designed procedural baselines in residential scenes and that Holodeck +can produce high-quality outputs for diverse scene types. We also demonstrate +an exciting application of Holodeck in Embodied AI, training agents to navigate +in novel scenes like music rooms and daycares without human-constructed data, +which is a significant step forward in developing general-purpose embodied +agents.",cs.CV,"['cs.CV', 'cs.AI', 'cs.CL', 'cs.RO']" +NeRF-HuGS: Improved Neural Radiance Fields in Non-static Scenes Using Heuristics-Guided Segmentation,Jiahao Chen · Yipeng Qin · Lingjie Liu · Jiangbo Lu · Guanbin Li, ,https://arxiv.org/abs/2403.17537,,2403.17537.pdf,NeRF-HuGS: Improved Neural Radiance Fields in Non-static Scenes Using Heuristics-Guided Segmentation,"Neural Radiance Field (NeRF) has been widely recognized for its excellence in +novel view synthesis and 3D scene reconstruction. However, their effectiveness +is inherently tied to the assumption of static scenes, rendering them +susceptible to undesirable artifacts when confronted with transient distractors +such as moving objects or shadows. In this work, we propose a novel paradigm, +namely ""Heuristics-Guided Segmentation"" (HuGS), which significantly enhances +the separation of static scenes from transient distractors by harmoniously +combining the strengths of hand-crafted heuristics and state-of-the-art +segmentation models, thus significantly transcending the limitations of +previous solutions. Furthermore, we delve into the meticulous design of +heuristics, introducing a seamless fusion of Structure-from-Motion (SfM)-based +heuristics and color residual heuristics, catering to a diverse range of +texture profiles. Extensive experiments demonstrate the superiority and +robustness of our method in mitigating transient distractors for NeRFs trained +in non-static scenes. Project page: https://cnhaox.github.io/NeRF-HuGS/.",cs.CV,['cs.CV'] +FLHetBench: Benchmarking Device and State Heterogeneity in Federated Learning,Junyuan Zhang · Shuang Zeng · Miao Zhang · Runxi Wang · Feifei Wang · Yuyin Zhou · Paul Pu Liang · Liangqiong Qu,https://carkham.github.io/FL_Het_Bench/,https://arxiv.org/abs/2306.05172,,2306.05172.pdf,FLEdge: Benchmarking Federated Machine Learning Applications in Edge Computing Systems,"Federated Machine Learning (FL) has received considerable attention in recent +years. FL benchmarks are predominantly explored in either simulated systems or +data center environments, neglecting the setups of real-world systems, which +are often closely linked to edge computing. We close this research gap by +introducing FLEdge, a benchmark targeting FL workloads in edge computing +systems. We systematically study hardware heterogeneity, energy efficiency +during training, and the effect of various differential privacy levels on +training in FL systems. To make this benchmark applicable to real-world +scenarios, we evaluate the impact of client dropouts on state-of-the-art FL +strategies with failure rates as high as 50%. 
FLEdge provides new insights, +such as that training state-of-the-art FL workloads on older GPU-accelerated +embedded devices is up to 3x more energy efficient than on modern server-grade +GPUs.",cs.LG,"['cs.LG', 'cs.DC', 'I.2.11; C.2.4; C.4; D.2.8']" +Hyper-MD: Mesh Denoising with Customized Parameters Aware of Noise Intensity and Geometric Characteristics,Xingtao Wang · Hongliang Wei · Xiaopeng Fan · Debin Zhao, ,https://arxiv.org/abs/2405.06536,,2405.06536.pdf,Mesh Denoising Transformer,"Mesh denoising, aimed at removing noise from input meshes while preserving +their feature structures, is a practical yet challenging task. Despite the +remarkable progress in learning-based mesh denoising methodologies in recent +years, their network designs often encounter two principal drawbacks: a +dependence on single-modal geometric representations, which fall short in +capturing the multifaceted attributes of meshes, and a lack of effective global +feature aggregation, hindering their ability to fully understand the mesh's +comprehensive structure. To tackle these issues, we propose SurfaceFormer, a +pioneering Transformer-based mesh denoising framework. Our first contribution +is the development of a new representation known as Local Surface Descriptor, +which is crafted by establishing polar systems on each mesh face, followed by +sampling points from adjacent surfaces using geodesics. The normals of these +points are organized into 2D patches, mimicking images to capture local +geometric intricacies, whereas the poles and vertex coordinates are +consolidated into a point cloud to embody spatial information. This advancement +surmounts the hurdles posed by the irregular and non-Euclidean characteristics +of mesh data, facilitating a smooth integration with Transformer architecture. +Next, we propose a dual-stream structure consisting of a Geometric Encoder +branch and a Spatial Encoder branch, which jointly encode local geometry +details and spatial information to fully explore multimodal information for +mesh denoising. A subsequent Denoising Transformer module receives the +multimodal information and achieves efficient global feature aggregation +through self-attention operators. Our experimental evaluations demonstrate that +this novel approach outperforms existing state-of-the-art methods in both +objective and subjective assessments, marking a significant leap forward in +mesh denoising.",cs.CV,['cs.CV'] +Boosting Diffusion Models with Moving Average Sampling in Frequency Domain,Yurui Qian · Qi Cai · Yingwei Pan · Yehao Li · Ting Yao · Qibin Sun · Tao Mei, ,https://arxiv.org/abs/2403.17870,,2403.17870.pdf,Boosting Diffusion Models with Moving Average Sampling in Frequency Domain,"Diffusion models have recently brought a powerful revolution in image +generation. Despite showing impressive generative capabilities, most of these +models rely on the current sample to denoise the next one, possibly resulting +in denoising instability. In this paper, we reinterpret the iterative denoising +process as model optimization and leverage a moving average mechanism to +ensemble all the prior samples. Instead of simply applying moving average to +the denoised samples at different timesteps, we first map the denoised samples +to data space and then perform moving average to avoid distribution shift +across timesteps. 
In view that diffusion models evolve the recovery from +low-frequency components to high-frequency details, we further decompose the +samples into different frequency components and execute moving average +separately on each component. We name the complete approach ""Moving Average +Sampling in Frequency domain (MASF)"". MASF could be seamlessly integrated into +mainstream pre-trained diffusion models and sampling schedules. Extensive +experiments on both unconditional and conditional diffusion models demonstrate +that our MASF leads to superior performances compared to the baselines, with +almost negligible additional complexity cost.",cs.CV,"['cs.CV', 'cs.MM']" +Task-Aware Encoder Control for Deep Video Compression,Xingtong Ge · Jixiang Luo · XINJIE ZHANG · Tongda Xu · Guo Lu · Dailan He · Jing Geng · Yan Wang · Jun Zhang · Hongwei Qin, ,https://arxiv.org/abs/2404.04848,,2404.04848.pdf,Task-Aware Encoder Control for Deep Video Compression,"Prior research on deep video compression (DVC) for machine tasks typically +necessitates training a unique codec for each specific task, mandating a +dedicated decoder per task. In contrast, traditional video codecs employ a +flexible encoder controller, enabling the adaptation of a single codec to +different tasks through mechanisms like mode prediction. Drawing inspiration +from this, we introduce an innovative encoder controller for deep video +compression for machines. This controller features a mode prediction and a +Group of Pictures (GoP) selection module. Our approach centralizes control at +the encoding stage, allowing for adaptable encoder adjustments across different +tasks, such as detection and tracking, while maintaining compatibility with a +standard pre-trained DVC decoder. Empirical evidence demonstrates that our +method is applicable across multiple tasks with various existing pre-trained +DVCs. Moreover, extensive experiments demonstrate that our method outperforms +previous DVC by about 25% bitrate for different tasks, with only one +pre-trained decoder.",eess.IV,"['eess.IV', 'cs.AI', 'cs.CV']" +NEAT: Distilling 3D Wireframes from Neural Attraction Fields,Nan Xue · Bin Tan · Yuxi Xiao · Liang Dong · Gui-Song Xia · Tianfu Wu · Yujun Shen,https://github.com/cherubicXN/neat,https://arxiv.org/abs/2307.10206,,2307.10206.pdf,NEAT: Distilling 3D Wireframes from Neural Attraction Fields,"This paper studies the problem of structured 3D reconstruction using +wireframes that consist of line segments and junctions, focusing on the +computation of structured boundary geometries of scenes. Instead of leveraging +matching-based solutions from 2D wireframes (or line segments) for 3D wireframe +reconstruction as done in prior arts, we present NEAT, a rendering-distilling +formulation using neural fields to represent 3D line segments with 2D +observations, and bipartite matching for perceiving and distilling of a sparse +set of 3D global junctions. The proposed {NEAT} enjoys the joint optimization +of the neural fields and the global junctions from scratch, using +view-dependent 2D observations without precomputed cross-view feature matching. +Comprehensive experiments on the DTU and BlendedMVS datasets demonstrate our +NEAT's superiority over state-of-the-art alternatives for 3D wireframe +reconstruction. Moreover, the distilled 3D global junctions by NEAT, are a +better initialization than SfM points, for the recently-emerged 3D Gaussian +Splatting for high-fidelity novel view synthesis using about 20 times fewer +initial 3D points. 
Project page: \url{https://xuenan.net/neat}.",cs.CV,"['cs.CV', 'cs.GR']" +Text Is MASS: Modeling as Stochastic Embedding for Text-Video Retrieval,Jiamian Wang · Guohao Sun · Pichao Wang · Dongfang Liu · Sohail Dianat · MAJID RABBANI · Raghuveer Rao · ZHIQIANG TAO, ,https://arxiv.org/abs/2403.17998,,2403.17998.pdf,Text Is MASS: Modeling as Stochastic Embedding for Text-Video Retrieval,"The increasing prevalence of video clips has sparked growing interest in +text-video retrieval. Recent advances focus on establishing a joint embedding +space for text and video, relying on consistent embedding representations to +compute similarity. However, the text content in existing datasets is generally +short and concise, making it hard to fully describe the redundant semantics of +a video. Correspondingly, a single text embedding may be less expressive to +capture the video embedding and empower the retrieval. In this study, we +propose a new stochastic text modeling method T-MASS, i.e., text is modeled as +a stochastic embedding, to enrich text embedding with a flexible and resilient +semantic range, yielding a text mass. To be specific, we introduce a +similarity-aware radius module to adapt the scale of the text mass upon the +given text-video pairs. Plus, we design and develop a support text +regularization to further control the text mass during the training. The +inference pipeline is also tailored to fully exploit the text mass for accurate +retrieval. Empirical evidence suggests that T-MASS not only effectively +attracts relevant text-video pairs while distancing irrelevant ones, but also +enables the determination of precise text embeddings for relevant pairs. Our +experimental results show a substantial improvement of T-MASS over baseline (3% +to 6.3% by R@1). Also, T-MASS achieves state-of-the-art performance on five +benchmark datasets, including MSRVTT, LSMDC, DiDeMo, VATEX, and Charades.",cs.CV,['cs.CV'] +Optimizing Diffusion Noise Can Serve As Universal Motion Priors,Korrawe Karunratanakul · Konpat Preechakul · Emre Aksan · Thabo Beeler · Supasorn Suwajanakorn · Siyu Tang,https://korrawe.github.io/dno-project/,https://arxiv.org/abs/2312.11994v1,,2312.11994v1.pdf,Optimizing Diffusion Noise Can Serve As Universal Motion Priors,"We propose Diffusion Noise Optimization (DNO), a new method that effectively +leverages existing motion diffusion models as motion priors for a wide range of +motion-related tasks. Instead of training a task-specific diffusion model for +each new task, DNO operates by optimizing the diffusion latent noise of an +existing pre-trained text-to-motion model. Given the corresponding latent noise +of a human motion, it propagates the gradient from the target criteria defined +on the motion space through the whole denoising process to update the diffusion +latent noise. As a result, DNO supports any use cases where criteria can be +defined as a function of motion. In particular, we show that, for motion +editing and control, DNO outperforms existing methods in both achieving the +objective and preserving the motion content. DNO accommodates a diverse range +of editing modes, including changing trajectory, pose, joint locations, or +avoiding newly added obstacles. In addition, DNO is effective in motion +denoising and completion, producing smooth and realistic motion from noisy and +partial inputs. 
DNO achieves these results at inference time without the need +for model retraining, offering great versatility for any defined reward or loss +function on the motion representation.",cs.CV,['cs.CV'] +Each Test Image Deserves A Specific Prompt: Continual Test-Time Adaptation for 2D Medical Image Segmentation,Ziyang Chen · Yongsheng Pan · Yiwen Ye · Mengkang Lu · Yong Xia,https://github.com/Chen-Ziyang/VPTTA,https://arxiv.org/abs/2311.18363,,2311.18363.pdf,Each Test Image Deserves A Specific Prompt: Continual Test-Time Adaptation for 2D Medical Image Segmentation,"Distribution shift widely exists in medical images acquired from different +medical centres and poses a significant obstacle to deploying the pre-trained +semantic segmentation model in real-world applications. Test-time adaptation +has proven its effectiveness in tackling the cross-domain distribution shift +during inference. However, most existing methods achieve adaptation by updating +the pre-trained models, rendering them susceptible to error accumulation and +catastrophic forgetting when encountering a series of distribution shifts +(i.e., under the continual test-time adaptation setup). To overcome these +challenges caused by updating the models, in this paper, we freeze the +pre-trained model and propose the Visual Prompt-based Test-Time Adaptation +(VPTTA) method to train a specific prompt for each test image to align the +statistics in the batch normalization layers. Specifically, we present the +low-frequency prompt, which is lightweight with only a few parameters and can +be effectively trained in a single iteration. To enhance prompt initialization, +we equip VPTTA with a memory bank to benefit the current prompt from previous +ones. Additionally, we design a warm-up mechanism, which mixes source and +target statistics to construct warm-up statistics, thereby facilitating the +training process. Extensive experiments demonstrate the superiority of our +VPTTA over other state-of-the-art methods on two medical image segmentation +benchmark tasks. The code and weights of pre-trained source models are +available at https://github.com/Chen-Ziyang/VPTTA.",cs.CV,['cs.CV'] +A Stealthy Wrongdoer: Feature-Oriented Reconstruction Attack against Split Learning,Xiaoyang Xu · Mengda Yang · Wenzhe Yi · Ziang Li · Juan Wang · Hongxin Hu · Yong ZHUANG · Yaxin Liu, ,https://arxiv.org/abs/2405.04115,,2405.04115.pdf,A Stealthy Wrongdoer: Feature-Oriented Reconstruction Attack against Split Learning,"Split Learning (SL) is a distributed learning framework renowned for its +privacy-preserving features and minimal computational requirements. Previous +research consistently highlights the potential privacy breaches in SL systems +by server adversaries reconstructing training data. However, these studies +often rely on strong assumptions or compromise system utility to enhance attack +performance. This paper introduces a new semi-honest Data Reconstruction Attack +on SL, named Feature-Oriented Reconstruction Attack (FORA). In contrast to +prior works, FORA relies on limited prior knowledge, specifically that the +server utilizes auxiliary samples from the public without knowing any client's +private information. This allows FORA to conduct the attack stealthily and +achieve robust performance. The key vulnerability exploited by FORA is the +revelation of the model representation preference in the smashed data output by +victim client. 
FORA constructs a substitute client through feature-level +transfer learning, aiming to closely mimic the victim client's representation +preference. Leveraging this substitute client, the server trains the attack +model to effectively reconstruct private data. Extensive experiments showcase +FORA's superior performance compared to state-of-the-art methods. Furthermore, +the paper systematically evaluates the proposed method's applicability across +diverse settings and advanced defense strategies.",cs.CR,['cs.CR'] +Physics-guided Shape-from-Template: Monocular Video Perception through Neural Surrogate Models,David Stotko · Nils Wandel · Reinhard Klein,https://cg.cs.uni-bonn.de/publication/stotko2024-Physics-guided-SfT,https://arxiv.org/abs/2311.12796,,2311.12796.pdf,Physics-guided Shape-from-Template: Monocular Video Perception through Neural Surrogate Models,"3D reconstruction of dynamic scenes is a long-standing problem in computer +graphics and increasingly difficult the less information is available. +Shape-from-Template (SfT) methods aim to reconstruct a template-based geometry +from RGB images or video sequences, often leveraging just a single monocular +camera without depth information, such as regular smartphone recordings. +Unfortunately, existing reconstruction methods are either unphysical and noisy +or slow in optimization. To solve this problem, we propose a novel SfT +reconstruction algorithm for cloth using a pre-trained neural surrogate model +that is fast to evaluate, stable, and produces smooth reconstructions due to a +regularizing physics simulation. Differentiable rendering of the simulated mesh +enables pixel-wise comparisons between the reconstruction and a target video +sequence that can be used for a gradient-based optimization procedure to +extract not only shape information but also physical parameters such as +stretching, shearing, or bending stiffness of the cloth. This allows to retain +a precise, stable, and smooth reconstructed geometry while reducing the runtime +by a factor of 400-500 compared to $\phi$-SfT, a state-of-the-art physics-based +SfT approach.",cs.CV,"['cs.CV', 'cs.LG']" +Part-aware Unified Representation of Language and Skeleton for Zero-shot Action Recognition,Anqi Zhu · Qiuhong Ke · Mingming Gong · James Bailey, ,https://arxiv.org/abs/2404.07487,,2404.07487.pdf,Fine-Grained Side Information Guided Dual-Prompts for Zero-Shot Skeleton Action Recognition,"Skeleton-based zero-shot action recognition aims to recognize unknown human +actions based on the learned priors of the known skeleton-based actions and a +semantic descriptor space shared by both known and unknown categories. However, +previous works focus on establishing the bridges between the known skeleton +representation space and semantic descriptions space at the coarse-grained +level for recognizing unknown action categories, ignoring the fine-grained +alignment of these two spaces, resulting in suboptimal performance in +distinguishing high-similarity action categories. To address these challenges, +we propose a novel method via Side information and dual-prompts learning for +skeleton-based zero-shot action recognition (STAR) at the fine-grained level. 
+Specifically, 1) we decompose the skeleton into several parts based on its +topology structure and introduce the side information concerning multi-part +descriptions of human body movements for alignment between the skeleton and the +semantic space at the fine-grained level; 2) we design the visual-attribute and +semantic-part prompts to improve the intra-class compactness within the +skeleton space and inter-class separability within the semantic space, +respectively, to distinguish the high-similarity actions. Extensive experiments +show that our method achieves state-of-the-art performance in ZSL and GZSL +settings on NTU RGB+D, NTU RGB+D 120, and PKU-MMD datasets.",cs.CV,['cs.CV'] +MICap: A Unified Model for Identity-aware Movie Descriptions,Haran Raajesh · Naveen Reddy Desanur · Zeeshan Khan · Makarand Tapaswi, ,https://arxiv.org/abs/2405.11483,,2405.11483.pdf,MICap: A Unified Model for Identity-aware Movie Descriptions,"Characters are an important aspect of any storyline and identifying and +including them in descriptions is necessary for story understanding. While +previous work has largely ignored identity and generated captions with someone +(anonymized names), recent work formulates id-aware captioning as a +fill-in-the-blanks (FITB) task, where, given a caption with blanks, the goal is +to predict person id labels. However, to predict captions with ids, a two-stage +approach is required: first predict captions with someone, then fill in +identities. In this work, we present a new single stage approach that can +seamlessly switch between id-aware caption generation or FITB when given a +caption with blanks. Our model, Movie-Identity Captioner (MICap), uses a shared +auto-regressive decoder that benefits from training with FITB and full-caption +generation objectives, while the encoder can benefit from or disregard captions +with blanks as input. Another challenge with id-aware captioning is the lack of +a metric to capture subtle differences between person ids. To this end, we +introduce iSPICE, a caption evaluation metric that focuses on identity tuples +created through intermediate scene graphs. We evaluate MICap on Large-Scale +Movie Description Challenge (LSMDC), where we show a 4.2% improvement in FITB +accuracy, and a 1-2% bump in classic captioning metrics.",cs.CV,['cs.CV'] +DetCLIPv3: Towards Versatile Generative Open-vocabulary Object Detection,Lewei Yao · Renjie Pi · Jianhua Han · Xiaodan Liang · Hang Xu · Wei Zhang · Zhenguo Li · Dan Xu, ,https://arxiv.org/abs/2404.09216,,,DetCLIPv3: Towards Versatile Generative Open-vocabulary Object Detection,"Existing open-vocabulary object detectors typically require a predefined set +of categories from users, significantly confining their application scenarios. +In this paper, we introduce DetCLIPv3, a high-performing detector that excels +not only at both open-vocabulary object detection, but also generating +hierarchical labels for detected objects. DetCLIPv3 is characterized by three +core designs: 1. Versatile model architecture: we derive a robust open-set +detection framework which is further empowered with generation ability via the +integration of a caption head. 2. High information density data: we develop an +auto-annotation pipeline leveraging visual large language model to refine +captions for large-scale image-text pairs, providing rich, multi-granular +object labels to enhance the training. 3. 
Efficient training strategy: we +employ a pre-training stage with low-resolution inputs that enables the object +captioner to efficiently learn a broad spectrum of visual concepts from +extensive image-text paired data. This is followed by a fine-tuning stage that +leverages a small number of high-resolution samples to further enhance +detection performance. With these effective designs, DetCLIPv3 demonstrates +superior open-vocabulary detection performance, \eg, our Swin-T backbone model +achieves a notable 47.0 zero-shot fixed AP on the LVIS minival benchmark, +outperforming GLIPv2, GroundingDINO, and DetCLIPv2 by 18.0/19.6/6.6 AP, +respectively. DetCLIPv3 also achieves a state-of-the-art 19.7 AP in dense +captioning task on VG dataset, showcasing its strong generative capability.",cs.CV,['cs.CV'] +Label Propagation for Zero-shot Classification with Vision-Language Models,Vladan Stojnić · Yannis Kalantidis · Giorgos Tolias,https://github.com/vladan-stojnic/ZLaP,https://arxiv.org/abs/2404.04072,,2404.04072.pdf,Label Propagation for Zero-shot Classification with Vision-Language Models,"Vision-Language Models (VLMs) have demonstrated impressive performance on +zero-shot classification, i.e. classification when provided merely with a list +of class names. In this paper, we tackle the case of zero-shot classification +in the presence of unlabeled data. We leverage the graph structure of the +unlabeled data and introduce ZLaP, a method based on label propagation (LP) +that utilizes geodesic distances for classification. We tailor LP to graphs +containing both text and image features and further propose an efficient method +for performing inductive inference based on a dual solution and a +sparsification step. We perform extensive experiments to evaluate the +effectiveness of our method on 14 common datasets and show that ZLaP +outperforms the latest related works. Code: +https://github.com/vladan-stojnic/ZLaP",cs.CV,"['cs.CV', 'cs.LG']" +KVQ: Kwai Video Quality Assessment for Short-form Videos,Yiting Lu · Xin Li · Yajing Pei · Kun Yuan · Qizhi Xie · Yunpeng Qu · Ming Sun · Chao Zhou · Zhibo Chen,https://github.com/lixinustc/KVQ-Challenge-CVPR-NTIRE2024,https://arxiv.org/abs/2402.07220,,2402.07220.pdf,KVQ: Kwai Video Quality Assessment for Short-form Videos,"Short-form UGC video platforms, like Kwai and TikTok, have been an emerging +and irreplaceable mainstream media form, thriving on user-friendly engagement, +and kaleidoscope creation, etc. However, the advancing content-generation +modes, e.g., special effects, and sophisticated processing workflows, e.g., +de-artifacts, have introduced significant challenges to recent UGC video +quality assessment: (i) the ambiguous contents hinder the identification of +quality-determined regions. (ii) the diverse and complicated hybrid distortions +are hard to distinguish. To tackle the above challenges and assist in the +development of short-form videos, we establish the first large-scale +Kaleidoscope short Video database for Quality assessment, termed KVQ, which +comprises 600 user-uploaded short videos and 3600 processed videos through the +diverse practical processing workflows, including pre-processing, transcoding, +and enhancement. Among them, the absolute quality score of each video and +partial ranking score among indistinguishable samples are provided by a team of +professional researchers specializing in image processing. 
Based on this +database, we propose the first short-form video quality evaluator, i.e., KSVQE, +which enables the quality evaluator to identify the quality-determined +semantics with the content understanding of large vision language models (i.e., +CLIP) and distinguish the distortions with the distortion understanding module. +Experimental results have shown the effectiveness of KSVQE on our KVQ database +and popular VQA databases.",eess.IV,"['eess.IV', 'cs.CV']" +StreamingFlow: Streaming Occupancy Forecasting with Asynchronous Multi-modal Data Streams via Neural Ordinary Differential Equation,Yining Shi · Kun JIANG · Ke Wang · Jiusi Li · Yunlong Wang · Mengmeng Yang · Diange Yang, ,,https://github.com/keithAND2020/awesome-Occupancy-research,,,,,nan +HumanRef: Single Image to 3D Human Generation via Reference-Guided Diffusion,Jingbo Zhang · Xiaoyu Li · Qi Zhang · Yan-Pei Cao · Ying Shan · Jing Liao, ,https://arxiv.org/abs/2311.16961v1,,2311.16961v1.pdf,HumanRef: Single Image to 3D Human Generation via Reference-Guided Diffusion,"Generating a 3D human model from a single reference image is challenging +because it requires inferring textures and geometries in invisible views while +maintaining consistency with the reference image. Previous methods utilizing 3D +generative models are limited by the availability of 3D training data. +Optimization-based methods that lift text-to-image diffusion models to 3D +generation often fail to preserve the texture details of the reference image, +resulting in inconsistent appearances in different views. In this paper, we +propose HumanRef, a 3D human generation framework from a single-view input. To +ensure the generated 3D model is photorealistic and consistent with the input +image, HumanRef introduces a novel method called reference-guided score +distillation sampling (Ref-SDS), which effectively incorporates image guidance +into the generation process. Furthermore, we introduce region-aware attention +to Ref-SDS, ensuring accurate correspondence between different body regions. +Experimental results demonstrate that HumanRef outperforms state-of-the-art +methods in generating 3D clothed humans with fine geometry, photorealistic +textures, and view-consistent appearances.",cs.CV,['cs.CV'] +SkySense: A Multi-Modal Remote Sensing Foundation Model Towards Universal Interpretation for Earth Observation Imagery,Xin Guo · Jiangwei Lao · Bo Dang · Yingying Zhang · Lei Yu · Lixiang Ru · Liheng Zhong · Ziyuan Huang · Kang Wu · Dingxiang Hu · HUIMEI HE · Jian Wang · Jingdong Chen · Ming Yang · Yongjun Zhang · Yansheng Li, ,https://arxiv.org/abs/2312.10115,,2312.10115.pdf,SkySense: A Multi-Modal Remote Sensing Foundation Model Towards Universal Interpretation for Earth Observation Imagery,"Prior studies on Remote Sensing Foundation Model (RSFM) reveal immense +potential towards a generic model for Earth Observation. Nevertheless, these +works primarily focus on a single modality without temporal and geo-context +modeling, hampering their capabilities for diverse tasks. In this study, we +present SkySense, a generic billion-scale model, pre-trained on a curated +multi-modal Remote Sensing Imagery (RSI) dataset with 21.5 million temporal +sequences. SkySense incorporates a factorized multi-modal spatiotemporal +encoder taking temporal sequences of optical and Synthetic Aperture Radar (SAR) +data as input. This encoder is pre-trained by our proposed Multi-Granularity +Contrastive Learning to learn representations across different modal and +spatial granularities. 
To further enhance the RSI representations by the +geo-context clue, we introduce Geo-Context Prototype Learning to learn +region-aware prototypes upon RSI's multi-modal spatiotemporal features. To our +best knowledge, SkySense is the largest Multi-Modal RSFM to date, whose modules +can be flexibly combined or used individually to accommodate various tasks. It +demonstrates remarkable generalization capabilities on a thorough evaluation +encompassing 16 datasets over 7 tasks, from single- to multi-modal, static to +temporal, and classification to localization. SkySense surpasses 18 recent +RSFMs in all test scenarios. Specifically, it outperforms the latest models +such as GFM, SatLas and Scale-MAE by a large margin, i.e., 2.76%, 3.67% and +3.61% on average respectively. We will release the pre-trained weights to +facilitate future research and Earth Observation applications.",cs.CV,['cs.CV'] +BioCLIP: A Vision Foundation Model for the Tree of Life,Samuel Stevens · Jiaman Wu · Matthew Thompson · Elizabeth Campolongo · Chan Hee Song · David Carlyn · Li Dong · Wasila Dahdul · Charles Stewart · Tanya Berger-Wolf · Wei-Lun Chao · Yu Su, ,https://arxiv.org/abs/2311.18803,,2311.18803.pdf,BioCLIP: A Vision Foundation Model for the Tree of Life,"Images of the natural world, collected by a variety of cameras, from drones +to individual phones, are increasingly abundant sources of biological +information. There is an explosion of computational methods and tools, +particularly computer vision, for extracting biologically relevant information +from images for science and conservation. Yet most of these are bespoke +approaches designed for a specific task and are not easily adaptable or +extendable to new questions, contexts, and datasets. A vision model for general +organismal biology questions on images is of timely need. To approach this, we +curate and release TreeOfLife-10M, the largest and most diverse ML-ready +dataset of biology images. We then develop BioCLIP, a foundation model for the +tree of life, leveraging the unique properties of biology captured by +TreeOfLife-10M, namely the abundance and variety of images of plants, animals, +and fungi, together with the availability of rich structured biological +knowledge. We rigorously benchmark our approach on diverse fine-grained biology +classification tasks and find that BioCLIP consistently and substantially +outperforms existing baselines (by 16% to 17% absolute). Intrinsic evaluation +reveals that BioCLIP has learned a hierarchical representation conforming to +the tree of life, shedding light on its strong generalizability. +https://imageomics.github.io/bioclip has models, data and code.",cs.CV,"['cs.CV', 'cs.CL', 'cs.LG']" +MAFA: Managing False Negatives for Vision-Language Pre-training,Jaeseok Byun · Dohoon Kim · Taesup Moon, ,https://arxiv.org/abs/2312.06112,,2312.06112.pdf,Converting and Smoothing False Negatives for Vision-Language Pre-training,"We consider the critical issue of false negatives in Vision-Language +Pre-training (VLP), a challenge that arises from the inherent many-to-many +correspondence of image-text pairs in large-scale web-crawled datasets. The +presence of false negatives can impede achieving optimal performance and even +lead to learning failures. To address this challenge, we propose a method +called COSMO (COnverting and SMOoothing false negatives) that manages the false +negative issues, especially powerful in hard negative sampling. 
Building upon +the recently developed GRouped mIni-baTch sampling (GRIT) strategy, our +approach consists of two pivotal components: 1) an efficient connection mining +process that identifies and converts false negatives into positives, and 2) +label smoothing for the image-text contrastive loss (ITC). Our comprehensive +experiments verify the effectiveness of COSMO across multiple downstream tasks, +emphasizing the crucial role of addressing false negatives in VLP, potentially +even surpassing the importance of addressing false positives. In addition, the +compatibility of COSMO with the recent BLIP-family model is also demonstrated.",cs.CV,"['cs.CV', 'cs.AI']" +General Object Foundation Model for Images and Videos at Scale,Junfeng Wu · Yi Jiang · Qihao Liu · Zehuan Yuan · Xiang Bai · Song Bai,https://glee-vision.github.io/,https://arxiv.org/abs/2312.09158,,2312.09158.pdf,General Object Foundation Model for Images and Videos at Scale,"We present GLEE in this work, an object-level foundation model for locating +and identifying objects in images and videos. Through a unified framework, GLEE +accomplishes detection, segmentation, tracking, grounding, and identification +of arbitrary objects in the open world scenario for various object perception +tasks. Adopting a cohesive learning strategy, GLEE acquires knowledge from +diverse data sources with varying supervision levels to formulate general +object representations, excelling in zero-shot transfer to new data and tasks. +Specifically, we employ an image encoder, text encoder, and visual prompter to +handle multi-modal inputs, enabling to simultaneously solve various +object-centric downstream tasks while maintaining state-of-the-art performance. +Demonstrated through extensive training on over five million images from +diverse benchmarks, GLEE exhibits remarkable versatility and improved +generalization performance, efficiently tackling downstream tasks without the +need for task-specific adaptation. By integrating large volumes of +automatically labeled data, we further enhance its zero-shot generalization +capabilities. Additionally, GLEE is capable of being integrated into Large +Language Models, serving as a foundational model to provide universal +object-level information for multi-modal tasks. We hope that the versatility +and universality of our method will mark a significant step in the development +of efficient visual foundation models for AGI systems. The model and code will +be released at https://glee-vision.github.io .",cs.CV,['cs.CV'] +Troika: Multi-Path Cross-Modal Traction for Compositional Zero-Shot Learning,Siteng Huang · Biao Gong · Yutong Feng · Zhang Min · Yiliang Lv · Donglin Wang, ,https://arxiv.org/abs/2311.14749,,2311.14749.pdf,Compositional Zero-shot Learning via Progressive Language-based Observations,"Compositional zero-shot learning aims to recognize unseen state-object +compositions by leveraging known primitives (state and object) during training. +However, effectively modeling interactions between primitives and generalizing +knowledge to novel compositions remains a perennial challenge. There are two +key factors: object-conditioned and state-conditioned variance, i.e., the +appearance of states (or objects) can vary significantly when combined with +different objects (or states). For instance, the state ""old"" can signify a +vintage design for a ""car"" or an advanced age for a ""cat"". 
In this paper, we +argue that these variances can be mitigated by predicting composition +categories based on pre-observed primitive. To this end, we propose Progressive +Language-based Observations (PLO), which can dynamically determine a better +observation order of primitives. These observations comprise a series of +concepts or languages that allow the model to understand image content in a +step-by-step manner. Specifically, PLO adopts pre-trained vision-language +models (VLMs) to empower the model with observation capabilities. We further +devise two variants: 1) PLO-VLM: a two-step method, where a pre-observing +classifier dynamically determines the observation order of two primitives. 2) +PLO-LLM: a multi-step scheme, which utilizes large language models (LLMs) to +craft composition-specific prompts for step-by-step observing. Extensive +ablations on three challenging datasets demonstrate the superiority of PLO +compared with state-of-the-art methods, affirming its abilities in +compositional recognition.",cs.CV,['cs.CV'] +EgoThink: Evaluating First-Person Perspective Thinking Capability of Vision-Language Models,Sijie Cheng · Zhicheng Guo · Jingwen Wu · Kechen Fang · Peng Li · Huaping Liu · Yang Liu,https://adacheng.github.io/EgoThink/,https://arxiv.org/abs/2311.15596,,2311.15596.pdf,EgoThink: Evaluating First-Person Perspective Thinking Capability of Vision-Language Models,"Vision-language models (VLMs) have recently shown promising results in +traditional downstream tasks. Evaluation studies have emerged to assess their +abilities, with the majority focusing on the third-person perspective, and only +a few addressing specific tasks from the first-person perspective. However, the +capability of VLMs to ""think"" from a first-person perspective, a crucial +attribute for advancing autonomous agents and robotics, remains largely +unexplored. To bridge this research gap, we introduce EgoThink, a novel visual +question-answering benchmark that encompasses six core capabilities with twelve +detailed dimensions. The benchmark is constructed using selected clips from +egocentric videos, with manually annotated question-answer pairs containing +first-person information. To comprehensively assess VLMs, we evaluate eighteen +popular VLMs on EgoThink. Moreover, given the open-ended format of the answers, +we use GPT-4 as the automatic judge to compute single-answer grading. +Experimental results indicate that although GPT-4V leads in numerous +dimensions, all evaluated VLMs still possess considerable potential for +improvement in first-person perspective tasks. Meanwhile, enlarging the number +of trainable parameters has the most significant impact on model performance on +EgoThink. In conclusion, EgoThink serves as a valuable addition to existing +evaluation benchmarks for VLMs, providing an indispensable resource for future +research in the realm of embodied artificial intelligence and robotics.",cs.CV,"['cs.CV', 'cs.CL']" +Inverse Rendering of Glossy Objects via the Neural Plenoptic Function and Radiance Fields,Haoyuan Wang · Wenbo Hu · Lei Zhu · Rynson W.H. Lau,https://www.whyy.site/paper/nep,https://arxiv.org/abs/2403.16224,,2403.16224.pdf,Inverse Rendering of Glossy Objects via the Neural Plenoptic Function and Radiance Fields,"Inverse rendering aims at recovering both geometry and materials of objects. +It provides a more compatible reconstruction for conventional rendering +engines, compared with the neural radiance fields (NeRFs). 
On the other hand, +existing NeRF-based inverse rendering methods cannot handle glossy objects with +local light interactions well, as they typically oversimplify the illumination +as a 2D environmental map, which assumes infinite lights only. Observing the +superiority of NeRFs in recovering radiance fields, we propose a novel 5D +Neural Plenoptic Function (NeP) based on NeRFs and ray tracing, such that more +accurate lighting-object interactions can be formulated via the rendering +equation. We also design a material-aware cone sampling strategy to efficiently +integrate lights inside the BRDF lobes with the help of pre-filtered radiance +fields. Our method has two stages: the geometry of the target object and the +pre-filtered environmental radiance fields are reconstructed in the first +stage, and materials of the target object are estimated in the second stage +with the proposed NeP and material-aware cone sampling strategy. Extensive +experiments on the proposed real-world and synthetic datasets demonstrate that +our method can reconstruct high-fidelity geometry/materials of challenging +glossy objects with complex lighting interactions from nearby objects. Project +webpage: https://whyy.site/paper/nep",cs.CV,['cs.CV'] +Collaborating Foundation models for Domain Generalized Semantic Segmentation,Yasser Benigmim · Subhankar Roy · Slim Essid · Vicky Kalogeiton · Stéphane Lathuilière,https://yasserben.github.io/CLOUDS/,https://arxiv.org/abs/2312.09788,,2312.09788.pdf,Collaborating Foundation Models for Domain Generalized Semantic Segmentation,"Domain Generalized Semantic Segmentation (DGSS) deals with training a model +on a labeled source domain with the aim of generalizing to unseen domains +during inference. Existing DGSS methods typically effectuate robust features by +means of Domain Randomization (DR). Such an approach is often limited as it can +only account for style diversification and not content. In this work, we take +an orthogonal approach to DGSS and propose to use an assembly of CoLlaborative +FOUndation models for Domain Generalized Semantic Segmentation (CLOUDS). In +detail, CLOUDS is a framework that integrates FMs of various kinds: (i) CLIP +backbone for its robust feature representation, (ii) generative models to +diversify the content, thereby covering various modes of the possible target +distribution, and (iii) Segment Anything Model (SAM) for iteratively refining +the predictions of the segmentation model. Extensive experiments show that our +CLOUDS excels in adapting from synthetic to real DGSS benchmarks and under +varying weather conditions, notably outperforming prior methods by 5.6% and +6.7% on averaged miou, respectively. 
The code is available at : +https://github.com/yasserben/CLOUDS",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +CausalPC: Improving the Robustness of Point Cloud Classification by Causal Effect Identification,Yuanmin Huang · Mi Zhang · Daizong Ding · Erling Jiang · Zhaoxiang Wang · Min Yang, ,,https://www.semanticscholar.org/paper/Deep-learning-for-large-scale-point-cloud-in-causal-Zhang-Ji/e1c76c0ba122201e813e3349dc0ebc8bde90eb34,,,,,nan +Multi-Scale Video Anomaly Detection by Multi-Grained Spatio-Temporal Representation Learning,Menghao Zhang · Jingyu Wang · Qi Qi · Haifeng Sun · Zirui Zhuang · Pengfei Ren · Ruilong Ma · Jianxin Liao, ,https://arxiv.org/abs/2306.10239,,2306.10239.pdf,Multi-scale Spatial-temporal Interaction Network for Video Anomaly Detection,"Video Anomaly Detection (VAD) is an essential yet challenging task in signal +processing. Since certain anomalies cannot be detected by isolated analysis of +either temporal or spatial information, the interaction between these two types +of data is considered crucial for VAD. However, current dual-stream +architectures either confine this integral interaction to the bottleneck of the +autoencoder or introduce anomaly-irrelevant background pixels into the +interactive process, hindering the accuracy of VAD. To address these +deficiencies, we propose a Multi-scale Spatial-Temporal Interaction Network +(MSTI-Net) for VAD. First, to prioritize the detection of moving objects in the +scene and harmonize the substantial semantic discrepancies between the two +types of data, we propose an Attention-based Spatial-Temporal Fusion Module +(ASTFM) as a substitute for the conventional direct fusion. Furthermore, we +inject multi-ASTFM-based connections that bridge the appearance and motion +streams of the dual-stream network, thus fostering multi-scale spatial-temporal +interaction. Finally, to bolster the delineation between normal and abnormal +activities, our system records the regular information in a memory module. +Experimental results on three benchmark datasets validate the effectiveness of +our approach, which achieves AUCs of 96.8%, 87.6%, and 73.9% on the UCSD Ped2, +CUHK Avenue, and ShanghaiTech datasets, respectively.",cs.CV,['cs.CV'] +Data-Free Quantization via Pseudo-label Filtering,Chunxiao Fan · Ziqi Wang · Dan Guo · Meng Wang, ,http://export.arxiv.org/abs/2403.11256,,2403.11256.pdf,Uncertainty-Aware Pseudo-Label Filtering for Source-Free Unsupervised Domain Adaptation,"Source-free unsupervised domain adaptation (SFUDA) aims to enable the +utilization of a pre-trained source model in an unlabeled target domain without +access to source data. Self-training is a way to solve SFUDA, where confident +target samples are iteratively selected as pseudo-labeled samples to guide +target model learning. However, prior heuristic noisy pseudo-label filtering +methods all involve introducing extra models, which are sensitive to model +assumptions and may introduce additional errors or mislabeling. In this work, +we propose a method called Uncertainty-aware Pseudo-label-filtering Adaptation +(UPA) to efficiently address this issue in a coarse-to-fine manner. Specially, +we first introduce a sample selection module named Adaptive Pseudo-label +Selection (APS), which is responsible for filtering noisy pseudo labels. The +APS utilizes a simple sample uncertainty estimation method by aggregating +knowledge from neighboring samples and confident samples are selected as clean +pseudo-labeled. 
Additionally, we incorporate Class-Aware Contrastive Learning +(CACL) to mitigate the memorization of pseudo-label noise by learning robust +pair-wise representation supervised by pseudo labels. Through extensive +experiments conducted on three widely used benchmarks, we demonstrate that our +proposed method achieves competitive performance on par with state-of-the-art +SFUDA methods. Code is available at https://github.com/chenxi52/UPA.",cs.CV,['cs.CV'] +Adaptive Softassign via Hadamard-Equipped Sinkhorn,Binrui Shen · Qiang Niu · Shengxin Zhu, ,https://arxiv.org/abs/2309.13855,,2309.13855.pdf,Adaptive Softassign via Hadamard-Equipped Sinkhorn,"Softassign is a pivotal method in graph matching and other learning tasks. +Many softassign-based algorithms exhibit performance sensitivity to a parameter +in the softassign. However, tuning the parameter is challenging and almost done +empirically. This paper proposes an adaptive softassign method for graph +matching by analyzing the relationship between the objective score and the +parameter. This method can automatically tune the parameter based on a given +error bound to guarantee accuracy. The Hadamard-Equipped Sinkhorn formulas +introduced in this study significantly enhance the efficiency and stability of +the adaptive softassign. Moreover, these formulas can also be used in optimal +transport problems. The resulting adaptive softassign graph matching algorithm +enjoys significantly higher accuracy than previous state-of-the-art large graph +matching algorithms while maintaining comparable efficiency.",math.OC,"['math.OC', 'math.CO']" +SIGNeRF: Scene Integrated Generation for Neural Radiance Fields,Jan-Niklas Dihlmann · Andreas Engelhardt · Hendrik Lensch,https://signerf.jdihlmann.com/,https://arxiv.org/abs/2401.01647,,2401.01647.pdf,SIGNeRF: Scene Integrated Generation for Neural Radiance Fields,"Advances in image diffusion models have recently led to notable improvements +in the generation of high-quality images. In combination with Neural Radiance +Fields (NeRFs), they enabled new opportunities in 3D generation. However, most +generative 3D approaches are object-centric and applying them to editing +existing photorealistic scenes is not trivial. We propose SIGNeRF, a novel +approach for fast and controllable NeRF scene editing and scene-integrated +object generation. A new generative update strategy ensures 3D consistency +across the edited images, without requiring iterative optimization. We find +that depth-conditioned diffusion models inherently possess the capability to +generate 3D consistent views by requesting a grid of images instead of single +views. Based on these insights, we introduce a multi-view reference sheet of +modified images. Our method updates an image collection consistently based on +the reference sheet and refines the original NeRF with the newly generated +image set in one go. By exploiting the depth conditioning mechanism of the +image diffusion model, we gain fine control over the spatial location of the +edit and enforce shape guidance by a selected region or an external mesh.",cs.CV,"['cs.CV', 'cs.GR']" +Putting the Object Back into Video Object Segmentation,Ho Kei Cheng · Seoung Wug Oh · Brian Price · Joon-Young Lee · Alexander G. 
Schwing,https://hkchengrex.com/Cutie/,https://arxiv.org/abs/2310.12982,,2310.12982.pdf,Putting the Object Back into Video Object Segmentation,"We present Cutie, a video object segmentation (VOS) network with object-level +memory reading, which puts the object representation from memory back into the +video object segmentation result. Recent works on VOS employ bottom-up +pixel-level memory reading which struggles due to matching noise, especially in +the presence of distractors, resulting in lower performance in more challenging +data. In contrast, Cutie performs top-down object-level memory reading by +adapting a small set of object queries. Via those, it interacts with the +bottom-up pixel features iteratively with a query-based object transformer (qt, +hence Cutie). The object queries act as a high-level summary of the target +object, while high-resolution feature maps are retained for accurate +segmentation. Together with foreground-background masked attention, Cutie +cleanly separates the semantics of the foreground object from the background. +On the challenging MOSE dataset, Cutie improves by 8.7 J&F over XMem with a +similar running time and improves by 4.2 J&F over DeAOT while being three times +faster. Code is available at: https://hkchengrex.github.io/Cutie",cs.CV,['cs.CV'] +Generalized Predictive Model for Autonomous Driving,Jiazhi Yang · Shenyuan Gao · Yihang Qiu · Li Chen · Tianyu Li · Bo Dai · Kashyap Chitta · Penghao Wu · Jia Zeng · Ping Luo · Jun Zhang · Andreas Geiger · Yu Qiao · Hongyang Li,https://github.com/OpenDriveLab/DriveAGI,https://arxiv.org/abs/2403.09630,,2403.09630.pdf,Generalized Predictive Model for Autonomous Driving,"In this paper, we introduce the first large-scale video prediction model in +the autonomous driving discipline. To eliminate the restriction of high-cost +data collection and empower the generalization ability of our model, we acquire +massive data from the web and pair it with diverse and high-quality text +descriptions. The resultant dataset accumulates over 2000 hours of driving +videos, spanning areas all over the world with diverse weather conditions and +traffic scenarios. Inheriting the merits from recent latent diffusion models, +our model, dubbed GenAD, handles the challenging dynamics in driving scenes +with novel temporal reasoning blocks. We showcase that it can generalize to +various unseen driving datasets in a zero-shot manner, surpassing general or +driving-specific video prediction counterparts. Furthermore, GenAD can be +adapted into an action-conditioned prediction model or a motion planner, +holding great potential for real-world driving applications.",cs.CV,['cs.CV'] +BlockGCN: Redefine Topology Awareness for Skeleton-Based Action Recognition,Yuxuan Zhou · Xudong Yan · Zhi-Qi Cheng · Yan Yan · Qi Dai · Xian-Sheng Hua,https://github.com/ZhouYuxuanYX/BlockGCN,https://arxiv.org/html/2305.11468v3,,2305.11468v3.pdf,Overcoming Topology Agnosticism: Enhancing Skeleton-Based Action Recognition through Redefined Skeletal Topology Awareness,"Graph Convolutional Networks (GCNs) have long defined the state-of-the-art in +skeleton-based action recognition, leveraging their ability to unravel the +complex dynamics of human joint topology through the graph's adjacency matrix. +However, an inherent flaw has come to light in these cutting-edge models: they +tend to optimize the adjacency matrix jointly with the model weights. 
This +process, while seemingly efficient, causes a gradual decay of bone connectivity +data, culminating in a model indifferent to the very topology it sought to map. +As a remedy, we propose a threefold strategy: (1) We forge an innovative +pathway that encodes bone connectivity by harnessing the power of graph +distances. This approach preserves the vital topological nuances often lost in +conventional GCNs. (2) We highlight an oft-overlooked feature - the temporal +mean of a skeletal sequence, which, despite its modest guise, carries highly +action-specific information. (3) Our investigation revealed strong variations +in joint-to-joint relationships across different actions. This finding exposes +the limitations of a single adjacency matrix in capturing the variations of +relational configurations emblematic of human movement, which we remedy by +proposing an efficient refinement to Graph Convolutions (GC) - the BlockGC. +This evolution slashes parameters by a substantial margin (above 40%), while +elevating performance beyond original GCNs. Our full model, the BlockGCN, +establishes new standards in skeleton-based action recognition for small model +sizes. Its high accuracy, notably on the large-scale NTU RGB+D 120 dataset, +stand as compelling proof of the efficacy of BlockGCN.",cs.CV,['cs.CV'] +MotionEditor: Editing Video Motion via Content-Aware Diffusion,Shuyuan Tu · Qi Dai · Zhi-Qi Cheng · Han Hu · Xintong Han · Zuxuan Wu · Yu-Gang Jiang, ,https://arxiv.org/abs/2311.18830,,2311.18830.pdf,MotionEditor: Editing Video Motion via Content-Aware Diffusion,"Existing diffusion-based video editing models have made gorgeous advances for +editing attributes of a source video over time but struggle to manipulate the +motion information while preserving the original protagonist's appearance and +background. To address this, we propose MotionEditor, a diffusion model for +video motion editing. MotionEditor incorporates a novel content-aware motion +adapter into ControlNet to capture temporal motion correspondence. While +ControlNet enables direct generation based on skeleton poses, it encounters +challenges when modifying the source motion in the inverted noise due to +contradictory signals between the noise (source) and the condition (reference). +Our adapter complements ControlNet by involving source content to transfer +adapted control signals seamlessly. Further, we build up a two-branch +architecture (a reconstruction branch and an editing branch) with a +high-fidelity attention injection mechanism facilitating branch interaction. +This mechanism enables the editing branch to query the key and value from the +reconstruction branch in a decoupled manner, making the editing branch retain +the original background and protagonist appearance. We also propose a skeleton +alignment algorithm to address the discrepancies in pose size and position. +Experiments demonstrate the promising motion editing ability of MotionEditor, +both qualitatively and quantitatively.",cs.CV,['cs.CV'] +ReconFusion: 3D Reconstruction with Diffusion Priors,Rundi Wu · Ben Mildenhall · Philipp Henzler · Ruiqi Gao · Keunhong Park · Daniel Watson · Pratul P. Srinivasan · Dor Verbin · Jonathan T. Barron · Ben Poole · Aleksander Holynski,https://reconfusion.github.io,https://arxiv.org/abs/2312.02981v1,,2312.02981v1.pdf,ReconFusion: 3D Reconstruction with Diffusion Priors,"3D reconstruction methods such as Neural Radiance Fields (NeRFs) excel at +rendering photorealistic novel views of complex scenes. 
However, recovering a +high-quality NeRF typically requires tens to hundreds of input images, +resulting in a time-consuming capture process. We present ReconFusion to +reconstruct real-world scenes using only a few photos. Our approach leverages a +diffusion prior for novel view synthesis, trained on synthetic and multiview +datasets, which regularizes a NeRF-based 3D reconstruction pipeline at novel +camera poses beyond those captured by the set of input images. Our method +synthesizes realistic geometry and texture in underconstrained regions while +preserving the appearance of observed regions. We perform an extensive +evaluation across various real-world datasets, including forward-facing and +360-degree scenes, demonstrating significant performance improvements over +previous few-view NeRF reconstruction approaches.",cs.CV,['cs.CV'] +Learning Vision from Models Rivals Learning Vision from Data,Yonglong Tian · Lijie Fan · Kaifeng Chen · Dina Katabi · Dilip Krishnan · Phillip Isola,https://github.com/google-research/syn-rep-learn/tree/main/SynCLR,https://arxiv.org/abs/2312.17742,,2312.17742.pdf,Learning Vision from Models Rivals Learning Vision from Data,"We introduce SynCLR, a novel approach for learning visual representations +exclusively from synthetic images and synthetic captions, without any real +data. We synthesize a large dataset of image captions using LLMs, then use an +off-the-shelf text-to-image model to generate multiple images corresponding to +each synthetic caption. We perform visual representation learning on these +synthetic images via contrastive learning, treating images sharing the same +caption as positive pairs. The resulting representations transfer well to many +downstream tasks, competing favorably with other general-purpose visual +representation learners such as CLIP and DINO v2 in image classification tasks. +Furthermore, in dense prediction tasks such as semantic segmentation, SynCLR +outperforms previous self-supervised methods by a significant margin, e.g., +improving over MAE and iBOT by 6.2 and 4.3 mIoU on ADE20k for ViT-B/16.",cs.CV,['cs.CV'] +"Unknown Prompt, the only Lacuna: Unveiling CLIP's Potential for Open Domain Generalization",Mainak Singha · Ankit Jha · Shirsha Bose · Ashwin Nair · Moloud Abdar · Biplab Banerjee, ,https://arxiv.org/abs/2404.00710,,2404.00710.pdf,"Unknown Prompt, the only Lacuna: Unveiling CLIP's Potential for Open Domain Generalization","We delve into Open Domain Generalization (ODG), marked by domain and category +shifts between training's labeled source and testing's unlabeled target +domains. Existing solutions to ODG face limitations due to constrained +generalizations of traditional CNN backbones and errors in detecting target +open samples in the absence of prior knowledge. Addressing these pitfalls, we +introduce ODG-CLIP, harnessing the semantic prowess of the vision-language +model, CLIP. Our framework brings forth three primary innovations: Firstly, +distinct from prevailing paradigms, we conceptualize ODG as a multi-class +classification challenge encompassing both known and novel categories. Central +to our approach is modeling a unique prompt tailored for detecting unknown +class samples, and to train this, we employ a readily accessible stable +diffusion model, elegantly generating proxy images for the open class. +Secondly, aiming for domain-tailored classification (prompt) weights while +ensuring a balance of precision and simplicity, we devise a novel visual +stylecentric prompt learning mechanism. 
Finally, we infuse images with +class-discriminative knowledge derived from the prompt space to augment the +fidelity of CLIP's visual embeddings. We introduce a novel objective to +safeguard the continuity of this infused semantic intel across domains, +especially for the shared classes. Through rigorous testing on diverse +datasets, covering closed and open-set DG contexts, ODG-CLIP demonstrates clear +supremacy, consistently outpacing peers with performance boosts between 8%-16%. +Code will be available at https://github.com/mainaksingha01/ODG-CLIP.",cs.CV,['cs.CV'] +Density-guided Translator Boosts Synthetic-to-Real Unsupervised Domain Adaptive Segmentation of 3D Point Clouds,Zhimin Yuan · Wankang Zeng · Yanfei Su · Weiquan Liu · Ming Cheng · Yulan Guo · Cheng Wang,https://github.com/yuan-zm/DGT-ST,https://arxiv.org/abs/2403.18469,,2403.18469.pdf,Density-guided Translator Boosts Synthetic-to-Real Unsupervised Domain Adaptive Segmentation of 3D Point Clouds,"3D synthetic-to-real unsupervised domain adaptive segmentation is crucial to +annotating new domains. Self-training is a competitive approach for this task, +but its performance is limited by different sensor sampling patterns (i.e., +variations in point density) and incomplete training strategies. In this work, +we propose a density-guided translator (DGT), which translates point density +between domains, and integrates it into a two-stage self-training pipeline +named DGT-ST. First, in contrast to existing works that simultaneously conduct +data generation and feature/output alignment within unstable adversarial +training, we employ the non-learnable DGT to bridge the domain gap at the input +level. Second, to provide a well-initialized model for self-training, we +propose a category-level adversarial network in stage one that utilizes the +prototype to prevent negative transfer. Finally, by leveraging the designs +above, a domain-mixed self-training method with source-aware consistency loss +is proposed in stage two to narrow the domain gap further. Experiments on two +synthetic-to-real segmentation tasks (SynLiDAR $\rightarrow$ semanticKITTI and +SynLiDAR $\rightarrow$ semanticPOSS) demonstrate that DGT-ST outperforms +state-of-the-art methods, achieving 9.4$\%$ and 4.3$\%$ mIoU improvements, +respectively. Code is available at \url{https://github.com/yuan-zm/DGT-ST}.",cs.CV,"['cs.CV', 'cs.AI']" +Absolute Pose from One or Two Scaled and Oriented Features,Jonathan Ventura · Zuzana Kukelova · Torsten Sattler · Daniel Barath,https://github.com/danini/absolute-pose-from-oriented-and-scaled-features,https://arxiv.org/abs/2404.16552,,,Efficient Solution of Point-Line Absolute Pose,"We revisit certain problems of pose estimation based on 3D--2D +correspondences between features which may be points or lines. Specifically, we +address the two previously-studied minimal problems of estimating camera +extrinsics from $p \in \{ 1, 2 \}$ point--point correspondences and $l=3-p$ +line--line correspondences. To the best of our knowledge, all of the +previously-known practical solutions to these problems required computing the +roots of degree $\ge 4$ (univariate) polynomials when $p=2$, or degree $\ge 8$ +polynomials when $p=1.$ We describe and implement two elementary solutions +which reduce the degrees of the needed polynomials from $4$ to $2$ and from $8$ +to $4$, respectively. 
We show experimentally that the resulting solvers are +numerically stable and fast: when compared to the previous state-of-the art, we +may obtain nearly an order of magnitude speedup. The code is available at +\url{https://github.com/petrhruby97/efficient\_absolute}",cs.CV,"['cs.CV', '68T45', 'I.4.5']" +IPoD: Implicit Field Learning with Point Diffusion for Generalizable 3D Object Reconstruction from Single RGB-D Images,Yushuang Wu · Luyue Shi · Junhao Cai · Weihao Yuan · Lingteng Qiu · Zilong Dong · Liefeng Bo · Shuguang Cui · Xiaoguang Han,https://yushuang-wu.github.io/IPoD/,https://arxiv.org/abs/2404.00269,,2404.00269.pdf,IPoD: Implicit Field Learning with Point Diffusion for Generalizable 3D Object Reconstruction from Single RGB-D Images,"Generalizable 3D object reconstruction from single-view RGB-D images remains +a challenging task, particularly with real-world data. Current state-of-the-art +methods develop Transformer-based implicit field learning, necessitating an +intensive learning paradigm that requires dense query-supervision uniformly +sampled throughout the entire space. We propose a novel approach, IPoD, which +harmonizes implicit field learning with point diffusion. This approach treats +the query points for implicit field learning as a noisy point cloud for +iterative denoising, allowing for their dynamic adaptation to the target object +shape. Such adaptive query points harness diffusion learning's capability for +coarse shape recovery and also enhances the implicit representation's ability +to delineate finer details. Besides, an additional self-conditioning mechanism +is designed to use implicit predictions as the guidance of diffusion learning, +leading to a cooperative system. Experiments conducted on the CO3D-v2 dataset +affirm the superiority of IPoD, achieving 7.8% improvement in F-score and 28.6% +in Chamfer distance over existing methods. The generalizability of IPoD is also +demonstrated on the MVImgNet dataset. Our project page is at +https://yushuang-wu.github.io/IPoD.",cs.CV,['cs.CV'] +MVHumanNet: A Large-scale Dataset of Multi-view Daily Dressing Human Captures,Zhangyang Xiong · Chenghong Li · Kenkun Liu · Hongjie Liao · Jianqiao HU · Junyi Zhu · Shuliang Ning · Lingteng Qiu · Chongjie Wang · Shijie Wang · Shuguang Cui · Xiaoguang Han, ,https://arxiv.org/abs/2312.02963,,2312.02963.pdf,MVHumanNet: A Large-scale Dataset of Multi-view Daily Dressing Human Captures,"In this era, the success of large language models and text-to-image models +can be attributed to the driving force of large-scale datasets. However, in the +realm of 3D vision, while remarkable progress has been made with models trained +on large-scale synthetic and real-captured object data like Objaverse and +MVImgNet, a similar level of progress has not been observed in the domain of +human-centric tasks partially due to the lack of a large-scale human dataset. +Existing datasets of high-fidelity 3D human capture continue to be mid-sized +due to the significant challenges in acquiring large-scale high-quality 3D +human data. To bridge this gap, we present MVHumanNet, a dataset that comprises +multi-view human action sequences of 4,500 human identities. The primary focus +of our work is on collecting human data that features a large number of diverse +identities and everyday clothing using a multi-view human capture system, which +facilitates easily scalable data collection. 
Our dataset contains 9,000 daily +outfits, 60,000 motion sequences and 645 million frames with extensive +annotations, including human masks, camera parameters, 2D and 3D keypoints, +SMPL/SMPLX parameters, and corresponding textual descriptions. To explore the +potential of MVHumanNet in various 2D and 3D visual tasks, we conducted pilot +studies on view-consistent action recognition, human NeRF reconstruction, +text-driven view-unconstrained human image generation, as well as 2D +view-unconstrained human image and 3D avatar generation. Extensive experiments +demonstrate the performance improvements and effective applications enabled by +the scale provided by MVHumanNet. As the current largest-scale 3D human +dataset, we hope that the release of MVHumanNet data with annotations will +foster further innovations in the domain of 3D human-centric tasks at scale.",cs.CV,['cs.CV'] +SCEdit: Efficient and Controllable Image Diffusion Generation via Skip Connection Editing,Zeyinzi Jiang · Chaojie Mao · Yulin Pan · Zhen Han · Jingfeng Zhang,https://scedit.github.io/,https://arxiv.org/abs/2312.11392,,2312.11392.pdf,SCEdit: Efficient and Controllable Image Diffusion Generation via Skip Connection Editing,"Image diffusion models have been utilized in various tasks, such as +text-to-image generation and controllable image synthesis. Recent research has +introduced tuning methods that make subtle adjustments to the original models, +yielding promising results in specific adaptations of foundational generative +diffusion models. Rather than modifying the main backbone of the diffusion +model, we delve into the role of skip connection in U-Net and reveal that +hierarchical features aggregating long-distance information across encoder and +decoder make a significant impact on the content and quality of image +generation. Based on the observation, we propose an efficient generative tuning +framework, dubbed SCEdit, which integrates and edits Skip Connection using a +lightweight tuning module named SC-Tuner. Furthermore, the proposed framework +allows for straightforward extension to controllable image synthesis by +injecting different conditions with Controllable SC-Tuner, simplifying and +unifying the network design for multi-condition inputs. Our SCEdit +substantially reduces training parameters, memory usage, and computational +expense due to its lightweight tuners, with backward propagation only passing +to the decoder blocks. Extensive experiments conducted on text-to-image +generation and controllable image synthesis tasks demonstrate the superiority +of our method in terms of efficiency and performance. Project page: +\url{https://scedit.github.io/}",cs.CV,['cs.CV'] +Revisiting Non-Autoregressive Transformers for Efficient Image Synthesis,Zanlin Ni · Yulin Wang · Renping Zhou · Jiayi Guo · Jinyi Hu · Zhiyuan Liu · Shiji Song · Yuan Yao · Gao Huang, ,https://arxiv.org/html/2312.14988v1,,2312.14988v1.pdf,Emage: Non-Autoregressive Text-to-Image Generation,"Autoregressive and diffusion models drive the recent breakthroughs on +text-to-image generation. Despite their huge success of generating +high-realistic images, a common shortcoming of these models is their high +inference latency - autoregressive models run more than a thousand times +successively to produce image tokens and diffusion models convert Gaussian +noise into images with many hundreds of denoising steps. In this work, we +explore non-autoregressive text-to-image models that efficiently generate +hundreds of image tokens in parallel. 
We develop many model variations with +different learning and inference strategies, initialized text encoders, etc. +Compared with autoregressive baselines that needs to run one thousand times, +our model only runs 16 times to generate images of competitive quality with an +order of magnitude lower inference latency. Our non-autoregressive model with +346M parameters generates an image of 256$\times$256 with about one second on +one V100 GPU.",cs.CV,['cs.CV'] +360DVD: Controllable Panorama Video Generation with 360-Degree Video Diffusion Model,Qian Wang · Weiqi Li · Chong Mou · Xinhua Cheng · Jian Zhang, ,https://arxiv.org/abs/2401.06578,,2401.06578.pdf,360DVD: Controllable Panorama Video Generation with 360-Degree Video Diffusion Model,"Panorama video recently attracts more interest in both study and application, +courtesy of its immersive experience. Due to the expensive cost of capturing +360-degree panoramic videos, generating desirable panorama videos by prompts is +urgently required. Lately, the emerging text-to-video (T2V) diffusion methods +demonstrate notable effectiveness in standard video generation. However, due to +the significant gap in content and motion patterns between panoramic and +standard videos, these methods encounter challenges in yielding satisfactory +360-degree panoramic videos. In this paper, we propose a pipeline named +360-Degree Video Diffusion model (360DVD) for generating 360-degree panoramic +videos based on the given prompts and motion conditions. Specifically, we +introduce a lightweight 360-Adapter accompanied by 360 Enhancement Techniques +to transform pre-trained T2V models for panorama video generation. We further +propose a new panorama dataset named WEB360 consisting of panoramic video-text +pairs for training 360DVD, addressing the absence of captioned panoramic video +datasets. Extensive experiments demonstrate the superiority and effectiveness +of 360DVD for panorama video generation. Our project page is at +https://akaneqwq.github.io/360DVD/.",cs.CV,['cs.CV'] +All in One Framework for Multimodal Re-identification in the Wild,He Li · Mang Ye · Ming Zhang · Bo Du, ,https://arxiv.org/abs/2405.04741,,2405.04741.pdf,All in One Framework for Multimodal Re-identification in the Wild,"In Re-identification (ReID), recent advancements yield noteworthy progress in +both unimodal and cross-modal retrieval tasks. However, the challenge persists +in developing a unified framework that could effectively handle varying +multimodal data, including RGB, infrared, sketches, and textual information. +Additionally, the emergence of large-scale models shows promising performance +in various vision tasks but the foundation model in ReID is still blank. In +response to these challenges, a novel multimodal learning paradigm for ReID is +introduced, referred to as All-in-One (AIO), which harnesses a frozen +pre-trained big model as an encoder, enabling effective multimodal retrieval +without additional fine-tuning. The diverse multimodal data in AIO are +seamlessly tokenized into a unified space, allowing the modality-shared frozen +encoder to extract identity-consistent features comprehensively across all +modalities. Furthermore, a meticulously crafted ensemble of cross-modality +heads is designed to guide the learning trajectory. AIO is the \textbf{first} +framework to perform all-in-one ReID, encompassing four commonly used +modalities. 
Experiments on cross-modal and multimodal ReID reveal that AIO not +only adeptly handles various modal data but also excels in challenging +contexts, showcasing exceptional performance in zero-shot and domain +generalization scenarios.",cs.CV,['cs.CV'] +TRIP: Temporal Residual Learning with Image Noise Prior for Image-to-Video Diffusion Models,Zhongwei Zhang · Fuchen Long · Yingwei Pan · Zhaofan Qiu · Ting Yao · Yang Cao · Tao Mei,https://trip-i2v.github.io/TRIP/,https://arxiv.org/abs/2403.17005v1,,2403.17005v1.pdf,TRIP: Temporal Residual Learning with Image Noise Prior for Image-to-Video Diffusion Models,"Recent advances in text-to-video generation have demonstrated the utility of +powerful diffusion models. Nevertheless, the problem is not trivial when +shaping diffusion models to animate static image (i.e., image-to-video +generation). The difficulty originates from the aspect that the diffusion +process of subsequent animated frames should not only preserve the faithful +alignment with the given image but also pursue temporal coherence among +adjacent frames. To alleviate this, we present TRIP, a new recipe of +image-to-video diffusion paradigm that pivots on image noise prior derived from +static image to jointly trigger inter-frame relational reasoning and ease the +coherent temporal modeling via temporal residual learning. Technically, the +image noise prior is first attained through one-step backward diffusion process +based on both static image and noised video latent codes. Next, TRIP executes a +residual-like dual-path scheme for noise prediction: 1) a shortcut path that +directly takes image noise prior as the reference noise of each frame to +amplify the alignment between the first frame and subsequent frames; 2) a +residual path that employs 3D-UNet over noised video and static image latent +codes to enable inter-frame relational reasoning, thereby easing the learning +of the residual noise for each frame. Furthermore, both reference and residual +noise of each frame are dynamically merged via attention mechanism for final +video generation. Extensive experiments on WebVid-10M, DTDB and MSR-VTT +datasets demonstrate the effectiveness of our TRIP for image-to-video +generation. Please see our project page at https://trip-i2v.github.io/TRIP/.",cs.CV,"['cs.CV', 'cs.MM']" +Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities,Yiyuan Zhang · Xiaohan Ding · Kaixiong Gong · Yixiao Ge · Ying Shan · Xiangyu Yue, ,https://arxiv.org/abs/2401.14405,,2401.14405.pdf,Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities,"We propose to improve transformers of a specific modality with irrelevant +data from other modalities, e.g., improve an ImageNet model with audio or point +cloud datasets. We would like to highlight that the data samples of the target +modality are irrelevant to the other modalities, which distinguishes our method +from other works utilizing paired (e.g., CLIP) or interleaved data of different +modalities. We propose a methodology named Multimodal Pathway - given a target +modality and a transformer designed for it, we use an auxiliary transformer +trained with data of another modality and construct pathways to connect +components of the two models so that data of the target modality can be +processed by both models. In this way, we utilize the universal +sequence-to-sequence modeling abilities of transformers obtained from two +modalities. 
As a concrete implementation, we use a modality-specific tokenizer +and task-specific head as usual but utilize the transformer blocks of the +auxiliary model via a proposed method named Cross-Modal Re-parameterization, +which exploits the auxiliary weights without any inference costs. On the image, +point cloud, video, and audio recognition tasks, we observe significant and +consistent performance improvements with irrelevant data from other modalities. +The code and models are available at https://github.com/AILab-CVC/M2PT.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +Geometry Transfer for Stylizing Radiance Fields,Hyunyoung Jung · Seonghyeon Nam · Nikolaos Sarafianos · Sungjoo Yoo · Alexander Sorkine-Hornung · Rakesh Ranjan,https://hyblue.github.io/geo-srf/,https://arxiv.org/abs/2402.00863,,2402.00863.pdf,Geometry Transfer for Stylizing Radiance Fields,"Shape and geometric patterns are essential in defining stylistic identity. +However, current 3D style transfer methods predominantly focus on transferring +colors and textures, often overlooking geometric aspects. In this paper, we +introduce Geometry Transfer, a novel method that leverages geometric +deformation for 3D style transfer. This technique employs depth maps to extract +a style guide, subsequently applied to stylize the geometry of radiance fields. +Moreover, we propose new techniques that utilize geometric cues from the 3D +scene, thereby enhancing aesthetic expressiveness and more accurately +reflecting intended styles. Our extensive experiments show that Geometry +Transfer enables a broader and more expressive range of stylizations, thereby +significantly expanding the scope of 3D style transfer.",cs.CV,['cs.CV'] +HRVDA: High-Resolution Visual Document Assistant,Chaohu Liu · Kun Yin · Haoyu Cao · Xinghua Jiang · Xin Li · Yinsong Liu · Deqiang Jiang · Xing Sun · Linli Xu, ,https://arxiv.org/abs/2404.06918,,2404.06918.pdf,HRVDA: High-Resolution Visual Document Assistant,"Leveraging vast training data, multimodal large language models (MLLMs) have +demonstrated formidable general visual comprehension capabilities and achieved +remarkable performance across various tasks. However, their performance in +visual document understanding still leaves much room for improvement. This +discrepancy is primarily attributed to the fact that visual document +understanding is a fine-grained prediction task. In natural scenes, MLLMs +typically use low-resolution images, leading to a substantial loss of visual +information. Furthermore, general-purpose MLLMs do not excel in handling +document-oriented instructions. In this paper, we propose a High-Resolution +Visual Document Assistant (HRVDA), which bridges the gap between MLLMs and +visual document understanding. This model employs a content filtering mechanism +and an instruction filtering module to separately filter out the +content-agnostic visual tokens and instruction-agnostic visual tokens, thereby +achieving efficient model training and inference for high-resolution images. In +addition, we construct a document-oriented visual instruction tuning dataset +and apply a multi-stage training strategy to enhance the model's document +modeling capabilities. 
Extensive experiments demonstrate that our model +achieves state-of-the-art performance across multiple document understanding +datasets, while maintaining training efficiency and inference speed comparable +to low-resolution models.",cs.CV,['cs.CV'] +TeMO: Towards Text-Driven 3D Stylization for Multi-Object Meshes,Xuying Zhang · Bo-Wen Yin · yuming chen · Zheng Lin · Yunheng Li · Qibin Hou · Ming-Ming Cheng, ,https://arxiv.org/abs/2312.04248,,2312.04248.pdf,TeMO: Towards Text-Driven 3D Stylization for Multi-Object Meshes,"Recent progress in the text-driven 3D stylization of a single object has been +considerably promoted by CLIP-based methods. However, the stylization of +multi-object 3D scenes is still impeded in that the image-text pairs used for +pre-training CLIP mostly consist of an object. Meanwhile, the local details of +multiple objects may be susceptible to omission due to the existing supervision +manner primarily relying on coarse-grained contrast of image-text pairs. To +overcome these challenges, we present a novel framework, dubbed TeMO, to parse +multi-object 3D scenes and edit their styles under the contrast supervision at +multiple levels. We first propose a Decoupled Graph Attention (DGA) module to +distinguishably reinforce the features of 3D surface points. Particularly, a +cross-modal graph is constructed to align the object points accurately and noun +phrases decoupled from the 3D mesh and textual description. Then, we develop a +Cross-Grained Contrast (CGC) supervision system, where a fine-grained loss +between the words in the textual description and the randomly rendered images +are constructed to complement the coarse-grained loss. Extensive experiments +show that our method can synthesize high-quality stylized content and +outperform the existing methods over a wide range of multi-object 3D meshes. +Our code and results will be made publicly available",cs.CV,['cs.CV'] +Revisiting Single Image Reflection Removal In the Wild,Yurui Zhu · Bo Li · Xueyang Fu · Peng-Tao Jiang · Hao Zhang · Qibin Sun · Zheng-Jun Zha · Jinwei Chen, ,https://arxiv.org/abs/2311.17320,,2311.17320.pdf,Revisiting Single Image Reflection Removal In the Wild,"This research focuses on the issue of single-image reflection removal (SIRR) +in real-world conditions, examining it from two angles: the collection pipeline +of real reflection pairs and the perception of real reflection locations. We +devise an advanced reflection collection pipeline that is highly adaptable to a +wide range of real-world reflection scenarios and incurs reduced costs in +collecting large-scale aligned reflection pairs. In the process, we develop a +large-scale, high-quality reflection dataset named Reflection Removal in the +Wild (RRW). RRW contains over 14,950 high-resolution real-world reflection +pairs, a dataset forty-five times larger than its predecessors. Regarding +perception of reflection locations, we identify that numerous virtual +reflection objects visible in reflection images are not present in the +corresponding ground-truth images. This observation, drawn from the aligned +pairs, leads us to conceive the Maximum Reflection Filter (MaxRF). The MaxRF +could accurately and explicitly characterize reflection locations from pairs of +images. Building upon this, we design a reflection location-aware cascaded +framework, specifically tailored for SIRR. Powered by these innovative +techniques, our solution achieves superior performance than current leading +methods across multiple real-world benchmarks. 
Codes and datasets will be +publicly available.",cs.CV,['cs.CV'] +Inlier Confidence Calibration for Point Cloud Registration,Yongzhe Yuan · Yue Wu · Xiaolong Fan · Maoguo Gong · Qiguang Miao · Wenping Ma, ,https://arxiv.org/abs/2307.14019,,2307.14019.pdf,One-Nearest Neighborhood Guides Inlier Estimation for Unsupervised Point Cloud Registration,"The precision of unsupervised point cloud registration methods is typically +limited by the lack of reliable inlier estimation and self-supervised signal, +especially in partially overlapping scenarios. In this paper, we propose an +effective inlier estimation method for unsupervised point cloud registration by +capturing geometric structure consistency between the source point cloud and +its corresponding reference point cloud copy. Specifically, to obtain a high +quality reference point cloud copy, an One-Nearest Neighborhood (1-NN) point +cloud is generated by input point cloud. This facilitates matching map +construction and allows for integrating dual neighborhood matching scores of +1-NN point cloud and input point cloud to improve matching confidence. +Benefiting from the high quality reference copy, we argue that the neighborhood +graph formed by inlier and its neighborhood should have consistency between +source point cloud and its corresponding reference copy. Based on this +observation, we construct transformation-invariant geometric structure +representations and capture geometric structure consistency to score the inlier +confidence for estimated correspondences between source point cloud and its +reference copy. This strategy can simultaneously provide the reliable +self-supervised signal for model optimization. Finally, we further calculate +transformation estimation by the weighted SVD algorithm with the estimated +correspondences and corresponding inlier confidence. We train the proposed +model in an unsupervised manner, and extensive experiments on synthetic and +real-world datasets illustrate the effectiveness of the proposed method.",cs.CV,"['cs.CV', 'cs.AI']" +Domain-Specific Block Selection and Paired-View Pseudo-Labeling for Online Test-Time Adaptation,Yeonguk Yu · Sungho Shin · Seunghyeok Back · Minhwan Ko · Sangjun Noh · Kyoobin Lee,https://github.com/gist-ailab/domain-specific-block-selection-and-paired-view-pseudo-labeling-for-online-TTA,https://arxiv.org/abs/2404.10966v2,,2404.10966v2.pdf,Domain-Specific Block Selection and Paired-View Pseudo-Labeling for Online Test-Time Adaptation,"Test-time adaptation (TTA) aims to adapt a pre-trained model to a new test +domain without access to source data after deployment. Existing approaches +typically rely on self-training with pseudo-labels since ground-truth cannot be +obtained from test data. Although the quality of pseudo labels is important for +stable and accurate long-term adaptation, it has not been previously addressed. +In this work, we propose DPLOT, a simple yet effective TTA framework that +consists of two components: (1) domain-specific block selection and (2) +pseudo-label generation using paired-view images. Specifically, we select +blocks that involve domain-specific feature extraction and train these blocks +by entropy minimization. After blocks are adjusted for current test domain, we +generate pseudo-labels by averaging given test images and corresponding flipped +counterparts. By simply using flip augmentation, we prevent a decrease in the +quality of the pseudo-labels, which can be caused by the domain gap resulting +from strong augmentation. 
Our experimental results demonstrate that DPLOT +outperforms previous TTA methods in CIFAR10-C, CIFAR100-C, and ImageNet-C +benchmarks, reducing error by up to 5.4%, 9.1%, and 2.9%, respectively. Also, +we provide an extensive analysis to demonstrate effectiveness of our framework. +Code is available at +https://github.com/gist-ailab/domain-specific-block-selection-and-paired-view-pseudo-labeling-for-online-TTA.",cs.CV,['cs.CV'] +BiTT: Bi-directional Texture Reconstruction of Interacting Two Hands from a Single Image,Minje Kim · Tae-Kyun Kim,https://yunminjin2.github.io/projects/bitt/,https://arxiv.org/abs/2403.08262,,2403.08262.pdf,BiTT: Bi-directional Texture Reconstruction of Interacting Two Hands from a Single Image,"Creating personalized hand avatars is important to offer a realistic +experience to users on AR / VR platforms. While most prior studies focused on +reconstructing 3D hand shapes, some recent work has tackled the reconstruction +of hand textures on top of shapes. However, these methods are often limited to +capturing pixels on the visible side of a hand, requiring diverse views of the +hand in a video or multiple images as input. In this paper, we propose a novel +method, BiTT(Bi-directional Texture reconstruction of Two hands), which is the +first end-to-end trainable method for relightable, pose-free texture +reconstruction of two interacting hands taking only a single RGB image, by +three novel components: 1) bi-directional (left $\leftrightarrow$ right) +texture reconstruction using the texture symmetry of left / right hands, 2) +utilizing a texture parametric model for hand texture recovery, and 3) the +overall coarse-to-fine stage pipeline for reconstructing personalized texture +of two interacting hands. BiTT first estimates the scene light condition and +albedo image from an input image, then reconstructs the texture of both hands +through the texture parametric model and bi-directional texture reconstructor. +In experiments using InterHand2.6M and RGB2Hands datasets, our method +significantly outperforms state-of-the-art hand texture reconstruction methods +quantitatively and qualitatively. The code is available at +https://github.com/yunminjin2/BiTT",cs.CV,['cs.CV'] +Action-slot: Visual Action-centric Representations for Multi-label Atomic Activity Recognition in Traffic Scenes,Chi-Hsi Kung · 書緯 呂 · Yi-Hsuan Tsai · Yi-Ting Chen, ,https://arxiv.org/abs/2311.17948,,2311.17948.pdf,Action-slot: Visual Action-centric Representations for Multi-label Atomic Activity Recognition in Traffic Scenes,"In this paper, we study multi-label atomic activity recognition. Despite the +notable progress in action recognition, it is still challenging to recognize +atomic activities due to a deficiency in a holistic understanding of both +multiple road users' motions and their contextual information. In this paper, +we introduce Action-slot, a slot attention-based approach that learns visual +action-centric representations, capturing both motion and contextual +information. Our key idea is to design action slots that are capable of paying +attention to regions where atomic activities occur, without the need for +explicit perception guidance. To further enhance slot attention, we introduce a +background slot that competes with action slots, aiding the training process in +avoiding unnecessary focus on background regions devoid of activities. Yet, the +imbalanced class distribution in the existing dataset hampers the assessment of +rare activities. 
To address the limitation, we collect a synthetic dataset +called TACO, which is four times larger than OATS and features a balanced +distribution of atomic activities. To validate the effectiveness of our method, +we conduct comprehensive experiments and ablation studies against various +action recognition baselines. We also show that the performance of multi-label +atomic activity recognition on real-world datasets can be improved by +pretraining representations on TACO. We will release our source code and +dataset. See the videos of visualization on the project page: +https://hcis-lab.github.io/Action-slot/",cs.CV,"['cs.CV', 'cs.LG']" +"Plug-and-Play, Dense-Label-Free Extraction of Open-Vocabulary Semantic Segmentation from Vision-Language Models",Luo Jiayun · Siddhesh Khandelwal · Leonid Sigal · Boyang Li, ,https://arxiv.org/abs/2311.17095v1,,2311.17095v1.pdf,"Plug-and-Play, Dense-Label-Free Extraction of Open-Vocabulary Semantic Segmentation from Vision-Language Models","From an enormous amount of image-text pairs, large-scale vision-language +models (VLMs) learn to implicitly associate image regions with words, which is +vital for tasks such as image captioning and visual question answering. +However, leveraging such pre-trained models for open-vocabulary semantic +segmentation remains a challenge. In this paper, we propose a simple, yet +extremely effective, training-free technique, Plug-and-Play Open-Vocabulary +Semantic Segmentation (PnP-OVSS) for this task. PnP-OVSS leverages a VLM with +direct text-to-image cross-attention and an image-text matching loss to produce +semantic segmentation. However, cross-attention alone tends to over-segment, +whereas cross-attention plus GradCAM tend to under-segment. To alleviate this +issue, we introduce Salience Dropout; by iteratively dropping patches that the +model is most attentive to, we are able to better resolve the entire extent of +the segmentation mask. Compared to existing techniques, the proposed method +does not require any neural network training and performs hyperparameter tuning +without the need for any segmentation annotations, even for a validation set. +PnP-OVSS demonstrates substantial improvements over a comparable baseline +(+29.4% mIoU on Pascal VOC, +13.2% mIoU on Pascal Context, +14.0% mIoU on MS +COCO, +2.4% mIoU on COCO Stuff) and even outperforms most baselines that +conduct additional network training on top of pretrained VLMs.",cs.CV,"['cs.CV', 'cs.AI']" +Mitigating Object Dependencies: Improving Point Cloud Self-Supervised Learning through Object Exchange,Yanhao Wu · Tong Zhang · Wei Ke · Congpei Qiu · Sabine Süsstrunk · Mathieu Salzmann, ,,https://www.semanticscholar.org/paper/Mitigating-Object-Dependencies:-Improving-Point-Wu-Zhang/1cafd8d79a0e2242cb1f8a2ce26db175785ebf88,,,,,nan +MCPNet: An Interpretable Classifier via Multi-Level Concept Prototypes,Bor Shiun Wang · Chien-Yi Wang · Wei-Chen Chiu,https://eddie221.github.io/MCPNet/,https://arxiv.org/abs/2404.08968,,2404.08968.pdf,MCPNet: An Interpretable Classifier via Multi-Level Concept Prototypes,"Recent advancements in post-hoc and inherently interpretable methods have +markedly enhanced the explanations of black box classifier models. These +methods operate either through post-analysis or by integrating concept learning +during model training. Although being effective in bridging the semantic gap +between a model's latent space and human interpretation, these explanation +methods only partially reveal the model's decision-making process. 
The outcome +is typically limited to high-level semantics derived from the last feature map. +We argue that the explanations lacking insights into the decision processes at +low and mid-level features are neither fully faithful nor useful. Addressing +this gap, we introduce the Multi-Level Concept Prototypes Classifier (MCPNet), +an inherently interpretable model. MCPNet autonomously learns meaningful +concept prototypes across multiple feature map levels using Centered Kernel +Alignment (CKA) loss and an energy-based weighted PCA mechanism, and it does so +without reliance on predefined concept labels. Further, we propose a novel +classifier paradigm that learns and aligns multi-level concept prototype +distributions for classification purposes via Class-aware Concept Distribution +(CCD) loss. Our experiments reveal that our proposed MCPNet while being +adaptable to various model architectures, offers comprehensive multi-level +explanations while maintaining classification accuracy. Additionally, its +concept distribution-based classification approach shows improved +generalization capabilities in few-shot classification scenarios.",cs.CV,"['cs.CV', 'cs.LG']" +OmniMotionGPT: Animal Motion Generation with Limited Data,Zhangsihao Yang · Mingyuan Zhou · Mengyi Shan · Bingbing Wen · Ziwei Xuan · Mitch Hill · Junjie Bai · Guo-Jun Qi · Yalin Wang, ,https://arxiv.org/abs/2311.18303,,2311.18303.pdf,OmniMotionGPT: Animal Motion Generation with Limited Data,"Our paper aims to generate diverse and realistic animal motion sequences from +textual descriptions, without a large-scale animal text-motion dataset. While +the task of text-driven human motion synthesis is already extensively studied +and benchmarked, it remains challenging to transfer this success to other +skeleton structures with limited data. In this work, we design a model +architecture that imitates Generative Pretraining Transformer (GPT), utilizing +prior knowledge learned from human data to the animal domain. We jointly train +motion autoencoders for both animal and human motions and at the same time +optimize through the similarity scores among human motion encoding, animal +motion encoding, and text CLIP embedding. Presenting the first solution to this +problem, we are able to generate animal motions with high diversity and +fidelity, quantitatively and qualitatively outperforming the results of +training human motion generation baselines on animal data. Additionally, we +introduce AnimalML3D, the first text-animal motion dataset with 1240 animation +sequences spanning 36 different animal identities. We hope this dataset would +mediate the data scarcity problem in text-driven animal motion generation, +providing a new playground for the research community.",cs.CV,['cs.CV'] +Noisy-Correspondence Learning for Text-to-Image Person Re-identification,Yang Qin · Yingke Chen · Dezhong Peng · Xi Peng · Joey Tianyi Zhou · Peng Hu,https://github.com/QinYang79/RDE,https://arxiv.org/abs/2308.09911,,2308.09911.pdf,Noisy-Correspondence Learning for Text-to-Image Person Re-identification,"Text-to-image person re-identification (TIReID) is a compelling topic in the +cross-modal community, which aims to retrieve the target person based on a +textual query. Although numerous TIReID methods have been proposed and achieved +promising performance, they implicitly assume the training image-text pairs are +correctly aligned, which is not always the case in real-world scenarios. 
In +practice, the image-text pairs inevitably exist under-correlated or even +false-correlated, a.k.a noisy correspondence (NC), due to the low quality of +the images and annotation errors. To address this problem, we propose a novel +Robust Dual Embedding method (RDE) that can learn robust visual-semantic +associations even with NC. Specifically, RDE consists of two main components: +1) A Confident Consensus Division (CCD) module that leverages the dual-grained +decisions of dual embedding modules to obtain a consensus set of clean training +data, which enables the model to learn correct and reliable visual-semantic +associations. 2) A Triplet Alignment Loss (TAL) relaxes the conventional +Triplet Ranking loss with the hardest negative samples to a log-exponential +upper bound over all negative ones, thus preventing the model collapse under NC +and can also focus on hard-negative samples for promising performance. We +conduct extensive experiments on three public benchmarks, namely CUHK-PEDES, +ICFG-PEDES, and RSTPReID, to evaluate the performance and robustness of our +RDE. Our method achieves state-of-the-art results both with and without +synthetic noisy correspondences on all three datasets. Code is available at +https://github.com/QinYang79/RDE.",cs.CV,"['cs.CV', 'cs.MM']" +SD-DiT: Unleashing the Power of Self-supervised Discrimination in Diffusion Transformer,Rui Zhu · Yingwei Pan · Yehao Li · Ting Yao · Zhenglong Sun · Tao Mei · Chang-Wen Chen, ,https://arxiv.org/abs/2403.17004,,2403.17004.pdf,SD-DiT: Unleashing the Power of Self-supervised Discrimination in Diffusion Transformer,"Diffusion Transformer (DiT) has emerged as the new trend of generative +diffusion models on image generation. In view of extremely slow convergence in +typical DiT, recent breakthroughs have been driven by mask strategy that +significantly improves the training efficiency of DiT with additional +intra-image contextual learning. Despite this progress, mask strategy still +suffers from two inherent limitations: (a) training-inference discrepancy and +(b) fuzzy relations between mask reconstruction & generative diffusion process, +resulting in sub-optimal training of DiT. In this work, we address these +limitations by novelly unleashing the self-supervised discrimination knowledge +to boost DiT training. Technically, we frame our DiT in a teacher-student +manner. The teacher-student discriminative pairs are built on the diffusion +noises along the same Probability Flow Ordinary Differential Equation (PF-ODE). +Instead of applying mask reconstruction loss over both DiT encoder and decoder, +we decouple DiT encoder and decoder to separately tackle discriminative and +generative objectives. In particular, by encoding discriminative pairs with +student and teacher DiT encoders, a new discriminative loss is designed to +encourage the inter-image alignment in the self-supervised embedding space. +After that, student samples are fed into student DiT decoder to perform the +typical generative diffusion task. 
Extensive experiments are conducted on +ImageNet dataset, and our method achieves a competitive balance between +training cost and generative capacity.",cs.CV,"['cs.CV', 'cs.MM']" +CoDeF: Content Deformation Fields for Temporally Consistent Video Processing,Hao Ouyang · Qiuyu Wang · Yuxi Xiao · Qingyan Bai · Juntao Zhang · Kecheng Zheng · Xiaowei Zhou · Qifeng Chen · Yujun Shen,https://qiuyu96.github.io/CoDeF/,https://arxiv.org/abs/2308.07926,,2308.07926.pdf,CoDeF: Content Deformation Fields for Temporally Consistent Video Processing,"We present the content deformation field CoDeF as a new type of video +representation, which consists of a canonical content field aggregating the +static contents in the entire video and a temporal deformation field recording +the transformations from the canonical image (i.e., rendered from the canonical +content field) to each individual frame along the time axis.Given a target +video, these two fields are jointly optimized to reconstruct it through a +carefully tailored rendering pipeline.We advisedly introduce some +regularizations into the optimization process, urging the canonical content +field to inherit semantics (e.g., the object shape) from the video.With such a +design, CoDeF naturally supports lifting image algorithms for video processing, +in the sense that one can apply an image algorithm to the canonical image and +effortlessly propagate the outcomes to the entire video with the aid of the +temporal deformation field.We experimentally show that CoDeF is able to lift +image-to-image translation to video-to-video translation and lift keypoint +detection to keypoint tracking without any training.More importantly, thanks to +our lifting strategy that deploys the algorithms on only one image, we achieve +superior cross-frame consistency in processed videos compared to existing +video-to-video translation approaches, and even manage to track non-rigid +objects like water and smog.Project page can be found at +https://qiuyu96.github.io/CoDeF/.",cs.CV,['cs.CV'] +Action Detection via an Image Diffusion Process,Lin Geng Foo · Tianjiao Li · Hossein Rahmani · Jun Liu, ,https://arxiv.org/abs/2404.01051,,2404.01051.pdf,Action Detection via an Image Diffusion Process,"Action detection aims to localize the starting and ending points of action +instances in untrimmed videos, and predict the classes of those instances. In +this paper, we make the observation that the outputs of the action detection +task can be formulated as images. Thus, from a novel perspective, we tackle +action detection via a three-image generation process to generate starting +point, ending point and action-class predictions as images via our proposed +Action Detection Image Diffusion (ADI-Diff) framework. Furthermore, since our +images differ from natural images and exhibit special properties, we further +explore a Discrete Action-Detection Diffusion Process and a Row-Column +Transformer design to better handle their processing. 
Our ADI-Diff framework +achieves state-of-the-art results on two widely-used datasets.",cs.CV,['cs.CV'] +T4P: Test-Time Training of Trajectory Prediction via Masked Autoencoder and Actor-specific Token Memory,Daehee Park · Jaeseok Jeong · Sung-Hoon Yoon · Jaewoo Jeong · Kuk-Jin Yoon, ,https://arxiv.org/abs/2403.10052,,2403.10052.pdf,T4P: Test-Time Training of Trajectory Prediction via Masked Autoencoder and Actor-specific Token Memory,"Trajectory prediction is a challenging problem that requires considering +interactions among multiple actors and the surrounding environment. While +data-driven approaches have been used to address this complex problem, they +suffer from unreliable predictions under distribution shifts during test time. +Accordingly, several online learning methods have been proposed using +regression loss from the ground truth of observed data leveraging the +auto-labeling nature of trajectory prediction task. We mainly tackle the +following two issues. First, previous works underfit and overfit as they only +optimize the last layer of the motion decoder. To this end, we employ the +masked autoencoder (MAE) for representation learning to encourage complex +interaction modeling in shifted test distribution for updating deeper layers. +Second, utilizing the sequential nature of driving data, we propose an +actor-specific token memory that enables the test-time learning of actor-wise +motion characteristics. Our proposed method has been validated across various +challenging cross-dataset distribution shift scenarios including nuScenes, +Lyft, Waymo, and Interaction. Our method surpasses the performance of existing +state-of-the-art online learning methods in terms of both prediction accuracy +and computational efficiency. The code is available at +https://github.com/daeheepark/T4P.",cs.CV,['cs.CV'] +Not All Voxels Are Equal: Hardness-Aware Semantic Scene Completion with Self-Distillation,Song Wang · Jiawei Yu · Wentong Li · Wenyu Liu · Xiaolu Liu · Junbo Chen · Jianke Zhu,https://github.com/songw-zju/HASSC,https://arxiv.org/abs/2404.11958,,2404.11958.pdf,Not All Voxels Are Equal: Hardness-Aware Semantic Scene Completion with Self-Distillation,"Semantic scene completion, also known as semantic occupancy prediction, can +provide dense geometric and semantic information for autonomous vehicles, which +attracts the increasing attention of both academia and industry. Unfortunately, +existing methods usually formulate this task as a voxel-wise classification +problem and treat each voxel equally in 3D space during training. As the hard +voxels have not been paid enough attention, the performance in some challenging +regions is limited. The 3D dense space typically contains a large number of +empty voxels, which are easy to learn but require amounts of computation due to +handling all the voxels uniformly for the existing models. Furthermore, the +voxels in the boundary region are more challenging to differentiate than those +in the interior. In this paper, we propose HASSC approach to train the semantic +scene completion model with hardness-aware design. The global hardness from the +network optimization process is defined for dynamical hard voxel selection. +Then, the local hardness with geometric anisotropy is adopted for voxel-wise +refinement. Besides, self-distillation strategy is introduced to make training +process stable and consistent. 
Extensive experiments show that our HASSC scheme +can effectively promote the accuracy of the baseline model without incurring +the extra inference cost. Source code is available at: +https://github.com/songw-zju/HASSC.",cs.CV,"['cs.CV', 'cs.RO']" +CogAgent: A Visual Language Model for GUI Agents,Wenyi Hong · Weihan Wang · Qingsong Lv · Jiazheng Xu · Wenmeng Yu · Junhui Ji · Yan Wang · Zihan Wang · Yuxiao Dong · Ming Ding · Jie Tang, ,https://arxiv.org/abs/2312.08914,,2312.08914.pdf,CogAgent: A Visual Language Model for GUI Agents,"People are spending an enormous amount of time on digital devices through +graphical user interfaces (GUIs), e.g., computer or smartphone screens. Large +language models (LLMs) such as ChatGPT can assist people in tasks like writing +emails, but struggle to understand and interact with GUIs, thus limiting their +potential to increase automation levels. In this paper, we introduce CogAgent, +an 18-billion-parameter visual language model (VLM) specializing in GUI +understanding and navigation. By utilizing both low-resolution and +high-resolution image encoders, CogAgent supports input at a resolution of +1120*1120, enabling it to recognize tiny page elements and text. As a +generalist visual language model, CogAgent achieves the state of the art on +five text-rich and four general VQA benchmarks, including VQAv2, OK-VQA, +Text-VQA, ST-VQA, ChartQA, infoVQA, DocVQA, MM-Vet, and POPE. CogAgent, using +only screenshots as input, outperforms LLM-based methods that consume extracted +HTML text on both PC and Android GUI navigation tasks -- Mind2Web and AITW, +advancing the state of the art. The model and codes are available at +https://github.com/THUDM/CogVLM .",cs.CV,['cs.CV'] +Representing Signs as Language: A New Method for Sign Language Translation from Videos,Jia Gong · Lin Geng Foo · Yixuan He · Hossein Rahmani · Jun Liu, ,https://arxiv.org/abs/2404.00925,,2404.00925.pdf,LLMs are Good Sign Language Translators,"Sign Language Translation (SLT) is a challenging task that aims to translate +sign videos into spoken language. Inspired by the strong translation +capabilities of large language models (LLMs) that are trained on extensive +multilingual text corpora, we aim to harness off-the-shelf LLMs to handle SLT. +In this paper, we regularize the sign videos to embody linguistic +characteristics of spoken language, and propose a novel SignLLM framework to +transform sign videos into a language-like representation for improved +readability by off-the-shelf LLMs. SignLLM comprises two key modules: (1) The +Vector-Quantized Visual Sign module converts sign videos into a sequence of +discrete character-level sign tokens, and (2) the Codebook Reconstruction and +Alignment module converts these character-level tokens into word-level sign +representations using an optimal transport formulation. A sign-text alignment +loss further bridges the gap between sign and text tokens, enhancing semantic +compatibility. We achieve state-of-the-art gloss-free results on two +widely-used SLT benchmarks.",cs.CV,"['cs.CV', 'cs.CL']" +Learning Spatial Adaptation and Temporal Coherence in Diffusion Models for Video Super-Resolution,Zhikai Chen · Fuchen Long · Zhaofan Qiu · Ting Yao · Wengang Zhou · Jiebo Luo · Tao Mei, ,https://arxiv.org/abs/2403.17000,,2403.17000.pdf,Learning Spatial Adaptation and Temporal Coherence in Diffusion Models for Video Super-Resolution,"Diffusion models are just at a tipping point for image super-resolution task. 
+Nevertheless, it is not trivial to capitalize on diffusion models for video +super-resolution which necessitates not only the preservation of visual +appearance from low-resolution to high-resolution videos, but also the temporal +consistency across video frames. In this paper, we propose a novel approach, +pursuing Spatial Adaptation and Temporal Coherence (SATeCo), for video +super-resolution. SATeCo pivots on learning spatial-temporal guidance from +low-resolution videos to calibrate both latent-space high-resolution video +denoising and pixel-space video reconstruction. Technically, SATeCo freezes all +the parameters of the pre-trained UNet and VAE, and only optimizes two +deliberately-designed spatial feature adaptation (SFA) and temporal feature +alignment (TFA) modules, in the decoder of UNet and VAE. SFA modulates frame +features via adaptively estimating affine parameters for each pixel, +guaranteeing pixel-wise guidance for high-resolution frame synthesis. TFA +delves into feature interaction within a 3D local window (tubelet) through +self-attention, and executes cross-attention between tubelet and its +low-resolution counterpart to guide temporal feature alignment. Extensive +experiments conducted on the REDS4 and Vid4 datasets demonstrate the +effectiveness of our approach.",cs.CV,"['cs.CV', 'cs.MM']" +Light the Night: A Multi-Condition Diffusion Framework for Unpaired Low-Light Enhancement in Autonomous Driving,JINLONG LI · Baolu Li · Zhengzhong Tu · XINYU LIU · Qing Guo · Felix Juefei Xu · Runsheng Xu · Hongkai Yu, ,https://arxiv.org/abs/2404.04804,,2404.04804.pdf,Light the Night: A Multi-Condition Diffusion Framework for Unpaired Low-Light Enhancement in Autonomous Driving,"Vision-centric perception systems for autonomous driving have gained +considerable attention recently due to their cost-effectiveness and +scalability, especially compared to LiDAR-based systems. However, these systems +often struggle in low-light conditions, potentially compromising their +performance and safety. To address this, our paper introduces LightDiff, a +domain-tailored framework designed to enhance the low-light image quality for +autonomous driving applications. Specifically, we employ a multi-condition +controlled diffusion model. LightDiff works without any human-collected paired +data, leveraging a dynamic data degradation process instead. It incorporates a +novel multi-condition adapter that adaptively controls the input weights from +different modalities, including depth maps, RGB images, and text captions, to +effectively illuminate dark scenes while maintaining context consistency. +Furthermore, to align the enhanced images with the detection model's knowledge, +LightDiff employs perception-specific scores as rewards to guide the diffusion +training process through reinforcement learning. 
Extensive experiments on the +nuScenes datasets demonstrate that LightDiff can significantly improve the +performance of several state-of-the-art 3D detectors in night-time conditions +while achieving high visual quality scores, highlighting its potential to +safeguard autonomous driving.",cs.CV,['cs.CV'] +Wavelet-based Fourier Information Interaction with Frequency Diffusion Adjustment for Underwater Image Restoration,Chen Zhao · Weiling Cai · Chenyu Dong · Chengwei Hu,https://github.com/zhihefang,https://arxiv.org/abs/2311.16845,,2311.16845.pdf,Wavelet-based Fourier Information Interaction with Frequency Diffusion Adjustment for Underwater Image Restoration,"Underwater images are subject to intricate and diverse degradation, +inevitably affecting the effectiveness of underwater visual tasks. However, +most approaches primarily operate in the raw pixel space of images, which +limits the exploration of the frequency characteristics of underwater images, +leading to an inadequate utilization of deep models' representational +capabilities in producing high-quality images. In this paper, we introduce a +novel Underwater Image Enhancement (UIE) framework, named WF-Diff, designed to +fully leverage the characteristics of frequency domain information and +diffusion models. WF-Diff consists of two detachable networks: Wavelet-based +Fourier information interaction network (WFI2-net) and Frequency Residual +Diffusion Adjustment Module (FRDAM). With our full exploration of the frequency +domain information, WFI2-net aims to achieve preliminary enhancement of +frequency information in the wavelet space. Our proposed FRDAM can further +refine the high- and low-frequency information of the initial enhanced images, +which can be viewed as a plug-and-play universal module to adjust the detail of +the underwater images. With the above techniques, our algorithm can show SOTA +performance on real-world underwater image datasets, and achieves competitive +performance in visual quality.",cs.CV,['cs.CV'] +Multi-agent Long-term 3D Human Pose Forecasting via Interaction-aware Trajectory Conditioning,Jaewoo Jeong · Daehee Park · Kuk-Jin Yoon, ,https://arxiv.org/abs/2404.05218v1,,2404.05218v1.pdf,Multi-agent Long-term 3D Human Pose Forecasting via Interaction-aware Trajectory Conditioning,"Human pose forecasting garners attention for its diverse applications. +However, challenges in modeling the multi-modal nature of human motion and +intricate interactions among agents persist, particularly with longer +timescales and more agents. In this paper, we propose an interaction-aware +trajectory-conditioned long-term multi-agent human pose forecasting model, +utilizing a coarse-to-fine prediction approach: multi-modal global trajectories +are initially forecasted, followed by respective local pose forecasts +conditioned on each mode. In doing so, our Trajectory2Pose model introduces a +graph-based agent-wise interaction module for a reciprocal forecast of local +motion-conditioned global trajectory and trajectory-conditioned local pose. Our +model effectively handles the multi-modality of human motion and the complexity +of long-term multi-agent interactions, improving performance in complex +environments. Furthermore, we address the lack of long-term (6s+) multi-agent +(5+) datasets by constructing a new dataset from real-world images and 2D +annotations, enabling a comprehensive evaluation of our proposed model. 
+State-of-the-art prediction performance on both complex and simpler datasets +confirms the generalized effectiveness of our method. The code is available at +https://github.com/Jaewoo97/T2P.",cs.CV,"['cs.CV', 'cs.AI']" +Unexplored Faces of Robustness and Out-of-Distribution: Covariate Shifts in Environment and Sensor Domains,Eunsu Baek · Keondo Park · Ji-yoon Kim · Hyung-Sin Kim,https://github.com/Edw2n/ImageNet-ES,https://arxiv.org/abs/2404.15882,,2404.15882.pdf,Unexplored Faces of Robustness and Out-of-Distribution: Covariate Shifts in Environment and Sensor Domains,"Computer vision applications predict on digital images acquired by a camera +from physical scenes through light. However, conventional robustness benchmarks +rely on perturbations in digitized images, diverging from distribution shifts +occurring in the image acquisition process. To bridge this gap, we introduce a +new distribution shift dataset, ImageNet-ES, comprising variations in +environmental and camera sensor factors by directly capturing 202k images with +a real camera in a controllable testbed. With the new dataset, we evaluate +out-of-distribution (OOD) detection and model robustness. We find that existing +OOD detection methods do not cope with the covariate shifts in ImageNet-ES, +implying that the definition and detection of OOD should be revisited to +embrace real-world distribution shifts. We also observe that the model becomes +more robust in both ImageNet-C and -ES by learning environment and sensor +variations in addition to existing digital augmentations. Lastly, our results +suggest that effective shift mitigation via camera sensor control can +significantly improve performance without increasing model size. With these +findings, our benchmark may aid future research on robustness, OOD, and camera +sensor control for computer vision. Our code and dataset are available at +https://github.com/Edw2n/ImageNet-ES.",cs.CV,"['cs.CV', 'cs.AI']" +Towards Large-scale 3D Representation Learning with Multi-dataset Point Prompt Training,Xiaoyang Wu · Zhuotao Tian · Xin Wen · Bohao Peng · Xihui Liu · Kaicheng Yu · Hengshuang Zhao,https://github.com/Pointcept/Pointcept,https://arxiv.org/abs/2308.09718,,2308.09718.pdf,Towards Large-scale 3D Representation Learning with Multi-dataset Point Prompt Training,"The rapid advancement of deep learning models often attributes to their +ability to leverage massive training data. In contrast, such privilege has not +yet fully benefited 3D deep learning, mainly due to the limited availability of +large-scale 3D datasets. Merging multiple available data sources and letting +them collaboratively train a single model is a potential solution. However, due +to the large domain gap between 3D point cloud datasets, such mixed supervision +could adversely affect the model's performance and lead to degenerated +performance (i.e., negative transfer) compared to single-dataset training. In +view of this challenge, we introduce Point Prompt Training (PPT), a novel +framework for multi-dataset synergistic learning in the context of 3D +representation learning that supports multiple pre-training paradigms. Based on +this framework, we propose Prompt-driven Normalization, which adapts the model +to different datasets with domain-specific prompts and Language-guided +Categorical Alignment that decently unifies the multiple-dataset label spaces +by leveraging the relationship between label text. 
Extensive experiments verify +that PPT can overcome the negative transfer associated with synergistic +learning and produce generalizable representations. Notably, it achieves +state-of-the-art performance on each dataset using a single weight-shared model +with supervised multi-dataset training. Moreover, when served as a pre-training +framework, it outperforms other pre-training approaches regarding +representation quality and attains remarkable state-of-the-art performance +across over ten diverse downstream tasks spanning both indoor and outdoor 3D +scenarios.",cs.CV,['cs.CV'] +How to Train Neural Field Representations: A Comprehensive Study and Benchmark,Samuele Papa · Riccardo Valperga · David Knigge · Miltiadis Kofinas · Phillip Lippe · Jan-Jakob Sonke · Efstratios Gavves,https://fit-a-nef.github.io/,https://arxiv.org/abs/2312.10531,,2312.10531.pdf,How to Train Neural Field Representations: A Comprehensive Study and Benchmark,"Neural fields (NeFs) have recently emerged as a versatile method for modeling +signals of various modalities, including images, shapes, and scenes. +Subsequently, a number of works have explored the use of NeFs as +representations for downstream tasks, e.g. classifying an image based on the +parameters of a NeF that has been fit to it. However, the impact of the NeF +hyperparameters on their quality as downstream representation is scarcely +understood and remains largely unexplored. This is in part caused by the large +amount of time required to fit datasets of neural fields. + In this work, we propose $\verb|fit-a-nef|$, a JAX-based library that +leverages parallelization to enable fast optimization of large-scale NeF +datasets, resulting in a significant speed-up. With this library, we perform a +comprehensive study that investigates the effects of different hyperparameters +-- including initialization, network architecture, and optimization strategies +-- on fitting NeFs for downstream tasks. Our study provides valuable insights +on how to train NeFs and offers guidance for optimizing their effectiveness in +downstream applications. Finally, based on the proposed library and our +analysis, we propose Neural Field Arena, a benchmark consisting of neural field +variants of popular vision datasets, including MNIST, CIFAR, variants of +ImageNet, and ShapeNetv2. Our library and the Neural Field Arena will be +open-sourced to introduce standardized benchmarking and promote further +research on neural fields.",cs.CV,['cs.CV'] +Towards Memorization-Free Diffusion Models,Chen Chen · Daochang Liu · Chang Xu,https://chenchen-usyd.github.io/AMG-Project-Page/,https://arxiv.org/abs/2404.00922,,2404.00922.pdf,Towards Memorization-Free Diffusion Models,"Pretrained diffusion models and their outputs are widely accessible due to +their exceptional capacity for synthesizing high-quality images and their +open-source nature. The users, however, may face litigation risks owing to the +models' tendency to memorize and regurgitate training data during inference. To +address this, we introduce Anti-Memorization Guidance (AMG), a novel framework +employing three targeted guidance strategies for the main causes of +memorization: image and caption duplication, and highly specific user prompts. +Consequently, AMG ensures memorization-free outputs while maintaining high +image quality and text alignment, leveraging the synergy of its guidance +methods, each indispensable in its own right. 
AMG also features an innovative +automatic detection system for potential memorization during each step of +inference process, allows selective application of guidance strategies, +minimally interfering with the original sampling process to preserve output +utility. We applied AMG to pretrained Denoising Diffusion Probabilistic Models +(DDPM) and Stable Diffusion across various generation tasks. The results +demonstrate that AMG is the first approach to successfully eradicates all +instances of memorization with no or marginal impacts on image quality and +text-alignment, as evidenced by FID and CLIP scores.",cs.CV,['cs.CV'] +Gradient Alignment for Cross-domain Face Anti-Spoofing,MINH BINH LE · Simon Woo,https://github.com/Leminhbinh0209/CVPR24-FAS,https://arxiv.org/abs/2402.18817,,2402.18817.pdf,Gradient Alignment for Cross-Domain Face Anti-Spoofing,"Recent advancements in domain generalization (DG) for face anti-spoofing +(FAS) have garnered considerable attention. Traditional methods have focused on +designing learning objectives and additional modules to isolate domain-specific +features while retaining domain-invariant characteristics in their +representations. However, such approaches often lack guarantees of consistent +maintenance of domain-invariant features or the complete removal of +domain-specific features. Furthermore, most prior works of DG for FAS do not +ensure convergence to a local flat minimum, which has been shown to be +advantageous for DG. In this paper, we introduce GAC-FAS, a novel learning +objective that encourages the model to converge towards an optimal flat minimum +without necessitating additional learning modules. Unlike conventional +sharpness-aware minimizers, GAC-FAS identifies ascending points for each domain +and regulates the generalization gradient updates at these points to align +coherently with empirical risk minimization (ERM) gradient updates. This unique +approach specifically guides the model to be robust against domain shifts. We +demonstrate the efficacy of GAC-FAS through rigorous testing on challenging +cross-domain FAS datasets, where it establishes state-of-the-art performance. +The code is available at https://github.com/leminhbinh0209/CVPR24-FAS.",cs.CV,['cs.CV'] +MANUS: Markerless Grasp Capture using Articulated 3D Gaussians,Chandradeep Pokhariya · Ishaan Shah · Angela Xing · Zekun Li · Kefan Chen · Avinash Sharma · Srinath Sridhar,https://ivl.cs.brown.edu/research/manus.html,https://arxiv.org/abs/2312.02137,,2312.02137.pdf,MANUS: Markerless Grasp Capture using Articulated 3D Gaussians,"Understanding how we grasp objects with our hands has important applications +in areas like robotics and mixed reality. However, this challenging problem +requires accurate modeling of the contact between hands and objects. To capture +grasps, existing methods use skeletons, meshes, or parametric models that does +not represent hand shape accurately resulting in inaccurate contacts. We +present MANUS, a method for Markerless Hand-Object Grasp Capture using +Articulated 3D Gaussians. We build a novel articulated 3D Gaussians +representation that extends 3D Gaussian splatting for high-fidelity +representation of articulating hands. Since our representation uses Gaussian +primitives, it enables us to efficiently and accurately estimate contacts +between the hand and the object. For the most accurate results, our method +requires tens of camera views that current datasets do not provide. 
We +therefore build MANUS-Grasps, a new dataset that contains hand-object grasps +viewed from 50+ cameras across 30+ scenes, 3 subjects, and comprising over 7M +frames. In addition to extensive qualitative results, we also show that our +method outperforms others on a quantitative contact evaluation method that uses +paint transfer from the object to the hand.",cs.CV,['cs.CV'] +Language-guided Image Reflection Separation,Haofeng Zhong · Yuchen Hong · Shuchen Weng · Jinxiu Liang · Boxin Shi, ,https://arxiv.org/abs/2402.11874,,2402.11874.pdf,Language-guided Image Reflection Separation,"This paper studies the problem of language-guided reflection separation, +which aims at addressing the ill-posed reflection separation problem by +introducing language descriptions to provide layer content. We propose a +unified framework to solve this problem, which leverages the cross-attention +mechanism with contrastive learning strategies to construct the correspondence +between language descriptions and image layers. A gated network design and a +randomized training strategy are employed to tackle the recognizable layer +ambiguity. The effectiveness of the proposed method is validated by the +significant performance advantage over existing reflection separation methods +on both quantitative and qualitative comparisons.",cs.CV,['cs.CV'] +A Subspace-Constrained Tyler's Estimator and its Applications to Structure from Motion,Feng Yu · Teng Zhang · Gilad Lerman, ,https://arxiv.org/abs/2404.11590,,2404.11590.pdf,A Subspace-Constrained Tyler's Estimator and its Applications to Structure from Motion,"We present the subspace-constrained Tyler's estimator (STE) designed for +recovering a low-dimensional subspace within a dataset that may be highly +corrupted with outliers. STE is a fusion of the Tyler's M-estimator (TME) and a +variant of the fast median subspace. Our theoretical analysis suggests that, +under a common inlier-outlier model, STE can effectively recover the underlying +subspace, even when it contains a smaller fraction of inliers relative to other +methods in the field of robust subspace recovery. We apply STE in the context +of Structure from Motion (SfM) in two ways: for robust estimation of the +fundamental matrix and for the removal of outlying cameras, enhancing the +robustness of the SfM pipeline. Numerical experiments confirm the +state-of-the-art performance of our method in these applications. This research +makes significant contributions to the field of robust subspace recovery, +particularly in the context of computer vision and 3D reconstruction.",cs.CV,['cs.CV'] +CNC-Net: Self-Supervised Learning for CNC Machining Operations,Mohsen Yavartanoo · Sangmin Hong · Reyhaneh Neshatavar · Kyoung Mu Lee,https://github.com/myavartanoo/CNC-Net_PyTorch,https://arxiv.org/abs/2312.09925,,2312.09925.pdf,CNC-Net: Self-Supervised Learning for CNC Machining Operations,"CNC manufacturing is a process that employs computer numerical control (CNC) +machines to govern the movements of various industrial tools and machinery, +encompassing equipment ranging from grinders and lathes to mills and CNC +routers. However, the reliance on manual CNC programming has become a +bottleneck, and the requirement for expert knowledge can result in significant +costs. Therefore, we introduce a pioneering approach named CNC-Net, +representing the use of deep neural networks (DNNs) to simulate CNC machines +and grasp intricate operations when supplied with raw materials. 
CNC-Net
+constitutes a self-supervised framework that exclusively takes an input 3D
+model and subsequently generates the essential operation parameters required by
+the CNC machine to construct the object. Our method has the potential to enable
+transformative automation in manufacturing by offering a cost-effective
+alternative to the high costs of manual CNC programming while maintaining
+exceptional precision in 3D object production. Our experiments underscore the
+effectiveness of our CNC-Net in constructing the desired 3D objects through the
+utilization of CNC operations. Notably, it excels in preserving finer local
+details, exhibiting a marked enhancement in precision compared to the
+state-of-the-art 3D CAD reconstruction approaches.",cs.CV,['cs.CV']
+Morphable Diffusion: 3D-Consistent Diffusion for Single-image Avatar Creation,Xiyi Chen · Marko Mihajlovic · Shaofei Wang · Sergey Prokudin · Siyu Tang, ,https://arxiv.org/abs/2401.04728,,2401.04728.pdf,Morphable Diffusion: 3D-Consistent Diffusion for Single-image Avatar Creation,"Recent advances in generative diffusion models have enabled the previously
+unfeasible capability of generating 3D assets from a single input image or a
+text prompt. In this work, we aim to enhance the quality and functionality of
+these models for the task of creating controllable, photorealistic human
+avatars. We achieve this by integrating a 3D morphable model into the
+state-of-the-art multi-view-consistent diffusion approach. We demonstrate that
+accurate conditioning of a generative pipeline on the articulated 3D model
+enhances the baseline model performance on the task of novel view synthesis
+from a single image. More importantly, this integration facilitates a seamless
+and accurate incorporation of facial expression and body pose control into the
+generation process. To the best of our knowledge, our proposed framework is the
+first diffusion model to enable the creation of fully 3D-consistent,
+animatable, and photorealistic human avatars from a single image of an unseen
+subject; extensive quantitative and qualitative evaluations demonstrate the
+advantages of our approach over existing state-of-the-art avatar creation
+models on both novel view and novel expression synthesis tasks. The code for
+our project is publicly available.",cs.CV,"['cs.CV', 'cs.AI']"
+3DGS-Avatar: Animatable Avatars via Deformable 3D Gaussian Splatting,Zhiyin Qian · Shaofei Wang · Marko Mihajlovic · Andreas Geiger · Siyu Tang, ,https://arxiv.org/abs/2312.09228,,2312.09228.pdf,3DGS-Avatar: Animatable Avatars via Deformable 3D Gaussian Splatting,"We introduce an approach that creates animatable human avatars from monocular
+videos using 3D Gaussian Splatting (3DGS). Existing methods based on neural
+radiance fields (NeRFs) achieve high-quality novel-view/novel-pose image
+synthesis but often require days of training, and are extremely slow at
+inference time. Recently, the community has explored fast grid structures for
+efficient training of clothed avatars. Albeit being extremely fast at training,
+these methods can barely achieve an interactive rendering frame rate with
+around 15 FPS. In this paper, we use 3D Gaussian Splatting and learn a
+non-rigid deformation network to reconstruct animatable clothed human avatars
+that can be trained within 30 minutes and rendered at real-time frame rates
+(50+ FPS).
Given the explicit nature of our representation, we further +introduce as-isometric-as-possible regularizations on both the Gaussian mean +vectors and the covariance matrices, enhancing the generalization of our model +on highly articulated unseen poses. Experimental results show that our method +achieves comparable and even better performance compared to state-of-the-art +approaches on animatable avatar creation from a monocular input, while being +400x and 250x faster in training and inference, respectively.",cs.CV,['cs.CV'] +TFMQ-DM: Temporal Feature Maintenance Quantization for Diffusion Models,Yushi Huang · Ruihao Gong · Jing Liu · Tianlong Chen · Xianglong Liu,https://github.com/ModelTC/TFMQ-DM,https://arxiv.org/abs/2311.16503,,2311.16503.pdf,TFMQ-DM: Temporal Feature Maintenance Quantization for Diffusion Models,"The Diffusion model, a prevalent framework for image generation, encounters +significant challenges in terms of broad applicability due to its extended +inference times and substantial memory requirements. Efficient Post-training +Quantization (PTQ) is pivotal for addressing these issues in traditional +models. Different from traditional models, diffusion models heavily depend on +the time-step $t$ to achieve satisfactory multi-round denoising. Usually, $t$ +from the finite set $\{1, \ldots, T\}$ is encoded to a temporal feature by a +few modules totally irrespective of the sampling data. However, existing PTQ +methods do not optimize these modules separately. They adopt inappropriate +reconstruction targets and complex calibration methods, resulting in a severe +disturbance of the temporal feature and denoising trajectory, as well as a low +compression efficiency. To solve these, we propose a Temporal Feature +Maintenance Quantization (TFMQ) framework building upon a Temporal Information +Block which is just related to the time-step $t$ and unrelated to the sampling +data. Powered by the pioneering block design, we devise temporal information +aware reconstruction (TIAR) and finite set calibration (FSC) to align the +full-precision temporal features in a limited time. Equipped with the +framework, we can maintain the most temporal information and ensure the +end-to-end generation quality. Extensive experiments on various datasets and +diffusion models prove our state-of-the-art results. Remarkably, our +quantization approach, for the first time, achieves model performance nearly on +par with the full-precision model under 4-bit weight quantization. +Additionally, our method incurs almost no extra computational cost and +accelerates quantization time by $2.0 \times$ on LSUN-Bedrooms $256 \times 256$ +compared to previous works. Our code is publicly available at +https://github.com/ModelTC/TFMQ-DM.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +"Point Transformer V3: Simpler, Faster, Stronger",Xiaoyang Wu · Li Jiang · Peng-Shuai Wang · Zhijian Liu · Xihui Liu · Yu Qiao · Wanli Ouyang · Tong He · Hengshuang Zhao,https://github.com/Pointcept/PointTransformerV3,https://arxiv.org/abs/2312.10035,,2312.10035.pdf,"Point Transformer V3: Simpler, Faster, Stronger","This paper is not motivated to seek innovation within the attention +mechanism. Instead, it focuses on overcoming the existing trade-offs between +accuracy and efficiency within the context of point cloud processing, +leveraging the power of scale. Drawing inspiration from recent advances in 3D +large-scale representation learning, we recognize that model performance is +more influenced by scale than by intricate design. 
Therefore, we present Point +Transformer V3 (PTv3), which prioritizes simplicity and efficiency over the +accuracy of certain mechanisms that are minor to the overall performance after +scaling, such as replacing the precise neighbor search by KNN with an efficient +serialized neighbor mapping of point clouds organized with specific patterns. +This principle enables significant scaling, expanding the receptive field from +16 to 1024 points while remaining efficient (a 3x increase in processing speed +and a 10x improvement in memory efficiency compared with its predecessor, +PTv2). PTv3 attains state-of-the-art results on over 20 downstream tasks that +span both indoor and outdoor scenarios. Further enhanced with multi-dataset +joint training, PTv3 pushes these results to a higher level.",cs.CV,['cs.CV'] +Efficient Stitchable Task Adaptation,Haoyu He · Zizheng Pan · Jing Liu · Jianfei Cai · Bohan Zhuang, ,https://arxiv.org/abs/2311.17352,,2311.17352.pdf,Efficient Stitchable Task Adaptation,"The paradigm of pre-training and fine-tuning has laid the foundation for +deploying deep learning models. However, most fine-tuning methods are designed +to meet a specific resource budget. Recently, considering diverse deployment +scenarios with various resource budgets, stitchable neural network (SN-Net) is +introduced to quickly obtain numerous new networks (stitches) from the +pre-trained models (anchors) in a model family via model stitching. Although +promising, SN-Net confronts new challenges when adapting it to new target +domains, including huge memory and storage requirements and a long and +sub-optimal multistage adaptation process. In this work, we present a novel +framework, Efficient Stitchable Task Adaptation (ESTA), to efficiently produce +a palette of fine-tuned models that adhere to diverse resource constraints. +Specifically, we first tailor parameter-efficient fine-tuning to share low-rank +updates among the stitches while maintaining independent bias terms. In this +way, we largely reduce fine-tuning memory burdens and mitigate the interference +among stitches that arises in task adaptation. Furthermore, we streamline a +simple yet effective one-stage deployment pipeline, which estimates the +important stitches to deploy with training-time gradient statistics. By +assigning higher sampling probabilities to important stitches, we also get a +boosted Pareto frontier. Extensive experiments on 25 downstream visual +recognition tasks demonstrate that our ESTA is capable of generating stitches +with smooth accuracy-efficiency trade-offs and surpasses the direct SN-Net +adaptation by remarkable margins with significantly lower training time and +fewer trainable parameters. Furthermore, we demonstrate the flexibility and +scalability of our ESTA framework by stitching LLMs from LLaMA family, +obtaining chatbot stitches of assorted sizes.",cs.LG,"['cs.LG', 'cs.CL', 'cs.CV']" +CPGA: Coding Priors-Guided Aggregation Network for Compressed Video Quality Enhancement,Qiang Zhu · Jinhua Hao · Yukang Ding · Yu Liu · Qiao Mo · Ming Sun · Chao Zhou · Shuyuan Zhu, ,https://arxiv.org/abs/2403.10362,,2403.10362.pdf,CPGA: Coding Priors-Guided Aggregation Network for Compressed Video Quality Enhancement,"Recently, numerous approaches have achieved notable success in compressed +video quality enhancement (VQE). 
However, these methods usually ignore the +utilization of valuable coding priors inherently embedded in compressed videos, +such as motion vectors and residual frames, which carry abundant temporal and +spatial information. To remedy this problem, we propose the Coding +Priors-Guided Aggregation (CPGA) network to utilize temporal and spatial +information from coding priors. The CPGA mainly consists of an inter-frame +temporal aggregation (ITA) module and a multi-scale non-local aggregation (MNA) +module. Specifically, the ITA module aggregates temporal information from +consecutive frames and coding priors, while the MNA module globally captures +spatial information guided by residual frames. In addition, to facilitate +research in VQE task, we newly construct the Video Coding Priors (VCP) dataset, +comprising 300 videos with various coding priors extracted from corresponding +bitstreams. It remedies the shortage of previous datasets on the lack of coding +information. Experimental results demonstrate the superiority of our method +compared to existing state-of-the-art methods. The code and dataset will be +released at https://github.com/CPGA/CPGA.git.",eess.IV,"['eess.IV', 'cs.CV']" +WonderJourney: Going from Anywhere to Everywhere,Hong-Xing Yu · Haoyi Duan · Junhwa Hur · Kyle Sargent · Michael Rubinstein · William Freeman · Forrester Cole · Deqing Sun · Noah Snavely · Jiajun Wu · Charles Herrmann, ,https://arxiv.org/abs/2312.03884,,,WonderJourney: Going from Anywhere to Everywhere,"We introduce WonderJourney, a modularized framework for perpetual 3D scene +generation. Unlike prior work on view generation that focuses on a single type +of scenes, we start at any user-provided location (by a text description or an +image) and generate a journey through a long sequence of diverse yet coherently +connected 3D scenes. We leverage an LLM to generate textual descriptions of the +scenes in this journey, a text-driven point cloud generation pipeline to make a +compelling and coherent sequence of 3D scenes, and a large VLM to verify the +generated scenes. We show compelling, diverse visual results across various +scene types and styles, forming imaginary ""wonderjourneys"". Project website: +https://kovenyu.com/WonderJourney/",cs.CV,"['cs.CV', 'cs.GR']" +Learning to Visually Localize Sound Sources from Mixtures without Prior Source Knowledge,Dongjin Kim · Sung Jin Um · Sangmin Lee · Jung Uk Kim, ,https://arxiv.org/abs/2403.17420,,2403.17420.pdf,Learning to Visually Localize Sound Sources from Mixtures without Prior Source Knowledge,"The goal of the multi-sound source localization task is to localize sound +sources from the mixture individually. While recent multi-sound source +localization methods have shown improved performance, they face challenges due +to their reliance on prior information about the number of objects to be +separated. In this paper, to overcome this limitation, we present a novel +multi-sound source localization method that can perform localization without +prior knowledge of the number of sound sources. To achieve this goal, we +propose an iterative object identification (IOI) module, which can recognize +sound-making objects in an iterative manner. After finding the regions of +sound-making objects, we devise object similarity-aware clustering (OSC) loss +to guide the IOI module to effectively combine regions of the same object but +also distinguish between different objects and backgrounds. 
It enables our +method to perform accurate localization of sound-making objects without any +prior knowledge. Extensive experimental results on the MUSIC and VGGSound +benchmarks show the significant performance improvements of the proposed method +over the existing methods for both single and multi-source. Our code is +available at: https://github.com/VisualAIKHU/NoPrior_MultiSSL",cs.CV,"['cs.CV', 'cs.MM', 'cs.SD', 'eess.AS']" +Curriculum Point Prompting for Weakly-Supervised Referring Image Segmentation,Qiyuan Dai · Sibei Yang, ,,https://paperswithcode.com/paper/curriculum-point-prompting-for-weakly,,,,,nan +Osprey: Pixel Understanding with Visual Instruction Tuning,Yuqian Yuan · Wentong Li · Jian liu · Dongqi Tang · Xinjie Luo · Chi Qin · Lei Zhang · Jianke Zhu,https://github.com/CircleRadon/Osprey,https://arxiv.org/abs/2312.10032,,2312.10032.pdf,Osprey: Pixel Understanding with Visual Instruction Tuning,"Multimodal large language models (MLLMs) have recently achieved impressive +general-purpose vision-language capabilities through visual instruction tuning. +However, current MLLMs primarily focus on image-level or box-level +understanding, falling short in achieving fine-grained vision-language +alignment at pixel level. Besides, the lack of mask-based instruction data +limits their advancements. In this paper, we propose Osprey, a mask-text +instruction tuning approach, to extend MLLMs by incorporating fine-grained mask +regions into language instruction, aiming at achieving pixel-wise visual +understanding. To achieve this goal, we first meticulously curate a mask-based +region-text dataset with 724K samples, and then design a vision-language model +by injecting pixel-level representation into LLM. Specifically, Osprey adopts a +convolutional CLIP backbone as the vision encoder and employs a mask-aware +visual extractor to extract precise visual mask features from high resolution +input. Experimental results demonstrate Osprey's superiority in various region +understanding tasks, showcasing its new capability for pixel-level instruction +tuning. In particular, Osprey can be integrated with Segment Anything Model +(SAM) seamlessly to obtain multi-granularity semantics. The source code, +dataset and demo can be found at https://github.com/CircleRadon/Osprey.",cs.CV,['cs.CV'] +Dysen-VDM: Empowering Dynamics-aware Text-to-Video Diffusion with LLMs,Hao Fei · Shengqiong Wu · Wei Ji · Hanwang Zhang · Tat-seng Chua,http://haofei.vip/Dysen-VDM/,https://arxiv.org/abs/2308.13812,,2308.13812.pdf,Dysen-VDM: Empowering Dynamics-aware Text-to-Video Diffusion with LLMs,"Text-to-video (T2V) synthesis has gained increasing attention in the +community, in which the recently emerged diffusion models (DMs) have +promisingly shown stronger performance than the past approaches. While existing +state-of-the-art DMs are competent to achieve high-resolution video generation, +they may largely suffer from key limitations (e.g., action occurrence +disorders, crude video motions) with respect to the intricate temporal dynamics +modeling, one of the crux of video synthesis. In this work, we investigate +strengthening the awareness of video dynamics for DMs, for high-quality T2V +generation. 
Inspired by human intuition, we design an innovative dynamic scene +manager (dubbed as Dysen) module, which includes (step-1) extracting from input +text the key actions with proper time-order arrangement, (step-2) transforming +the action schedules into the dynamic scene graph (DSG) representations, and +(step-3) enriching the scenes in the DSG with sufficient and reasonable +details. Taking advantage of the existing powerful LLMs (e.g., ChatGPT) via +in-context learning, Dysen realizes (nearly) human-level temporal dynamics +understanding. Finally, the resulting video DSG with rich action scene details +is encoded as fine-grained spatio-temporal features, integrated into the +backbone T2V DM for video generating. Experiments on popular T2V datasets +suggest that our Dysen-VDM consistently outperforms prior arts with significant +margins, especially in scenarios with complex actions. Codes at +https://haofei.vip/Dysen-VDM",cs.AI,"['cs.AI', 'cs.CV']" +RichDreamer: A Generalizable Normal-Depth Diffusion Model for Detail Richness in Text-to-3D,Lingteng Qiu · Guanying Chen · Xiaodong Gu · Qi Zuo · Mutian Xu · Yushuang Wu · Weihao Yuan · Zilong Dong · Liefeng Bo · Xiaoguang Han,https://aigc3d.github.io/richdreamer/,https://arxiv.org/abs/2311.16918v1,,2311.16918v1.pdf,RichDreamer: A Generalizable Normal-Depth Diffusion Model for Detail Richness in Text-to-3D,"Lifting 2D diffusion for 3D generation is a challenging problem due to the +lack of geometric prior and the complex entanglement of materials and lighting +in natural images. Existing methods have shown promise by first creating the +geometry through score-distillation sampling (SDS) applied to rendered surface +normals, followed by appearance modeling. However, relying on a 2D RGB +diffusion model to optimize surface normals is suboptimal due to the +distribution discrepancy between natural images and normals maps, leading to +instability in optimization. In this paper, recognizing that the normal and +depth information effectively describe scene geometry and be automatically +estimated from images, we propose to learn a generalizable Normal-Depth +diffusion model for 3D generation. We achieve this by training on the +large-scale LAION dataset together with the generalizable image-to-depth and +normal prior models. In an attempt to alleviate the mixed illumination effects +in the generated materials, we introduce an albedo diffusion model to impose +data-driven constraints on the albedo component. Our experiments show that when +integrated into existing text-to-3D pipelines, our models significantly enhance +the detail richness, achieving state-of-the-art results. Our project page is +https://lingtengqiu.github.io/RichDreamer/.",cs.CV,"['cs.CV', 'cs.AI']" +Towards Generalizing to Unseen Domains with Few Labels,Chamuditha Jayanga Galappaththige · Sanoojan Baliah · Malitha Gunawardhana · Muhammad Haris Khan, ,https://arxiv.org/abs/2403.11674,,2403.11674.pdf,Towards Generalizing to Unseen Domains with Few Labels,"We approach the challenge of addressing semi-supervised domain generalization +(SSDG). Specifically, our aim is to obtain a model that learns +domain-generalizable features by leveraging a limited subset of labelled data +alongside a substantially larger pool of unlabeled data. Existing domain +generalization (DG) methods which are unable to exploit unlabeled data perform +poorly compared to semi-supervised learning (SSL) methods under SSDG setting. 
+Nevertheless, SSL methods have considerable room for performance improvement +when compared to fully-supervised DG training. To tackle this underexplored, +yet highly practical problem of SSDG, we make the following core contributions. +First, we propose a feature-based conformity technique that matches the +posterior distributions from the feature space with the pseudo-label from the +model's output space. Second, we develop a semantics alignment loss to learn +semantically-compatible representations by regularizing the semantic structure +in the feature space. Our method is plug-and-play and can be readily integrated +with different SSL-based SSDG baselines without introducing any additional +parameters. Extensive experimental results across five challenging DG +benchmarks with four strong SSL baselines suggest that our method provides +consistent and notable gains in two different SSDG settings.",cs.CV,['cs.CV'] +SynFog: A Photo-realistic Synthetic Fog Dataset based on End-to-end Imaging Simulation for Advancing Real-World Defogging in Autonomous Driving,Yiming Xie · Henglu Wei · Zhenyi Liu · Xiaoyu Wang · Xiangyang Ji, ,https://arxiv.org/abs/2403.17094,,2403.17094.pdf,SynFog: A Photo-realistic Synthetic Fog Dataset based on End-to-end Imaging Simulation for Advancing Real-World Defogging in Autonomous Driving,"To advance research in learning-based defogging algorithms, various synthetic +fog datasets have been developed. However, existing datasets created using the +Atmospheric Scattering Model (ASM) or real-time rendering engines often +struggle to produce photo-realistic foggy images that accurately mimic the +actual imaging process. This limitation hinders the effective generalization of +models from synthetic to real data. In this paper, we introduce an end-to-end +simulation pipeline designed to generate photo-realistic foggy images. This +pipeline comprehensively considers the entire physically-based foggy scene +imaging process, closely aligning with real-world image capture methods. Based +on this pipeline, we present a new synthetic fog dataset named SynFog, which +features both sky light and active lighting conditions, as well as three levels +of fog density. Experimental results demonstrate that models trained on SynFog +exhibit superior performance in visual perception and detection accuracy +compared to others when applied to real-world foggy images.",cs.CV,"['cs.CV', 'cs.LG']" +FairDeDup: Detecting and Mitigating Vision-Language Fairness Disparities in Semantic Dataset Deduplication,Eric Slyman · Stefan Lee · Scott Cohen · Kushal Kafle,https://ericslyman.com/fairdedup/,https://arxiv.org/abs/2404.16123,,2404.16123.pdf,FairDeDup: Detecting and Mitigating Vision-Language Fairness Disparities in Semantic Dataset Deduplication,"Recent dataset deduplication techniques have demonstrated that content-aware +dataset pruning can dramatically reduce the cost of training Vision-Language +Pretrained (VLP) models without significant performance losses compared to +training on the original dataset. These results have been based on pruning +commonly used image-caption datasets collected from the web -- datasets that +are known to harbor harmful social biases that may then be codified in trained +models. In this work, we evaluate how deduplication affects the prevalence of +these biases in the resulting trained models and introduce an easy-to-implement +modification to the recent SemDeDup algorithm that can reduce the negative +effects that we observe. 
When examining CLIP-style models trained on +deduplicated variants of LAION-400M, we find our proposed FairDeDup algorithm +consistently leads to improved fairness metrics over SemDeDup on the FairFace +and FACET datasets while maintaining zero-shot performance on CLIP benchmarks.",cs.CV,"['cs.CV', 'cs.AI', 'cs.CL', 'I.4.10; I.2.7; E.0']" +ESCAPE: Encoding Super-keypoints for Category-Agnostic Pose Estimation,Khoi D Nguyen · Chen Li · Gim Hee Lee, ,https://arxiv.org/abs/2403.13647,,,Meta-Point Learning and Refining for Category-Agnostic Pose Estimation,"Category-agnostic pose estimation (CAPE) aims to predict keypoints for +arbitrary classes given a few support images annotated with keypoints. Existing +methods only rely on the features extracted at support keypoints to predict or +refine the keypoints on query image, but a few support feature vectors are +local and inadequate for CAPE. Considering that human can quickly perceive +potential keypoints of arbitrary objects, we propose a novel framework for CAPE +based on such potential keypoints (named as meta-points). Specifically, we +maintain learnable embeddings to capture inherent information of various +keypoints, which interact with image feature maps to produce meta-points +without any support. The produced meta-points could serve as meaningful +potential keypoints for CAPE. Due to the inevitable gap between inherency and +annotation, we finally utilize the identities and details offered by support +keypoints to assign and refine meta-points to desired keypoints in query image. +In addition, we propose a progressive deformable point decoder and a slacked +regression loss for better prediction and supervision. Our novel framework not +only reveals the inherency of keypoints but also outperforms existing methods +of CAPE. Comprehensive experiments and in-depth studies on large-scale MP-100 +dataset demonstrate the effectiveness of our framework.",cs.CV,['cs.CV'] +SAM-6D: Segment Anything Model Meets Zero-Shot 6D Object Pose Estimation,Jiehong Lin · lihua liu · Dekun Lu · Kui Jia, ,https://arxiv.org/abs/2311.15707,,2311.15707.pdf,SAM-6D: Segment Anything Model Meets Zero-Shot 6D Object Pose Estimation,"Zero-shot 6D object pose estimation involves the detection of novel objects +with their 6D poses in cluttered scenes, presenting significant challenges for +model generalizability. Fortunately, the recent Segment Anything Model (SAM) +has showcased remarkable zero-shot transfer performance, which provides a +promising solution to tackle this task. Motivated by this, we introduce SAM-6D, +a novel framework designed to realize the task through two steps, including +instance segmentation and pose estimation. Given the target objects, SAM-6D +employs two dedicated sub-networks, namely Instance Segmentation Model (ISM) +and Pose Estimation Model (PEM), to perform these steps on cluttered RGB-D +images. ISM takes SAM as an advanced starting point to generate all possible +object proposals and selectively preserves valid ones through meticulously +crafted object matching scores in terms of semantics, appearance and geometry. +By treating pose estimation as a partial-to-partial point matching problem, PEM +performs a two-stage point matching process featuring a novel design of +background tokens to construct dense 3D-3D correspondence, ultimately yielding +the pose estimates. 
Without bells and whistles, SAM-6D outperforms the existing
+methods on the seven core datasets of the BOP Benchmark for both instance
+segmentation and pose estimation of novel objects.",cs.CV,['cs.CV']
+Test-Time Zero-Shot Temporal Action Localization,Benedetta Liberatori · Alessandro Conti · Paolo Rota · Yiming Wang · Elisa Ricci, ,https://arxiv.org/abs/2404.05426,,2404.05426.pdf,Test-Time Zero-Shot Temporal Action Localization,"Zero-Shot Temporal Action Localization (ZS-TAL) seeks to identify and locate
+actions in untrimmed videos unseen during training. Existing ZS-TAL methods
+involve fine-tuning a model on a large amount of annotated training data. While
+effective, training-based ZS-TAL approaches assume the availability of labeled
+data for supervised learning, which can be impractical in some applications.
+Furthermore, the training process naturally induces a domain bias into the
+learned model, which may adversely affect the model's generalization ability to
+arbitrary videos. These considerations prompt us to approach the ZS-TAL problem
+from a radically novel perspective, relaxing the requirement for training data.
+To this aim, we introduce a novel method that performs Test-Time adaptation for
+Temporal Action Localization (T3AL). In a nutshell, T3AL adapts a pre-trained
+Vision and Language Model (VLM). T3AL operates in three steps. First, a
+video-level pseudo-label of the action category is computed by aggregating
+information from the entire video. Then, action localization is performed
+adopting a novel procedure inspired by self-supervised learning. Finally,
+frame-level textual descriptions extracted with a state-of-the-art captioning
+model are employed for refining the action region proposals. We validate the
+effectiveness of T3AL by conducting experiments on the THUMOS14 and the
+ActivityNet-v1.3 datasets. Our results demonstrate that T3AL significantly
+outperforms zero-shot baselines based on state-of-the-art VLMs, confirming the
+benefit of a test-time adaptation approach.",cs.CV,['cs.CV']
+De-confounded Data-free Knowledge Distillation for Handling Distribution Shifts,Yuzheng Wang · Dingkang Yang · Zhaoyu Chen · Yang Liu · Siao Liu · Wenqiang Zhang · Lihua Zhang · Lizhe Qi, ,https://arxiv.org/abs/2403.19539,,,De-confounded Data-free Knowledge Distillation for Handling Distribution Shifts,"Data-Free Knowledge Distillation (DFKD) is a promising task to train
+high-performance small models to enhance actual deployment without relying on
+the original training data. Existing methods commonly avoid relying on private
+data by utilizing synthetic or sampled data. However, a long-overlooked issue
+is the severe distribution shift between their substitution and the original
+data, which manifests as huge differences in the quality of images and class
+proportions. The harmful shifts are essentially the confounder that
+significantly causes performance bottlenecks. To tackle the issue, this paper
+proposes a novel perspective with causal inference to disentangle the student
+models from the impact of such shifts. By designing a customized causal graph,
+we first reveal the causalities among the variables in the DFKD task.
+Subsequently, we propose a Knowledge Distillation Causal Intervention (KDCI)
+framework based on the backdoor adjustment to de-confound the confounder. KDCI
+can be flexibly combined with most existing state-of-the-art baselines.
+Experiments in combination with six representative DFKD methods demonstrate the +effectiveness of our KDCI, which can obviously help existing methods under +almost all settings, \textit{e.g.}, improving the baseline by up to 15.54\% +accuracy on the CIFAR-100 dataset.",cs.CV,['cs.CV'] +DiVa-360: The Dynamic Visual Dataset for Immersive Neural Fields,Cheng-You Lu · Peisen Zhou · Angela Xing · Chandradeep Pokhariya · Arnab Dey · Ishaan Shah · Rugved Mavidipalli · Dylan Hu · Andrew Comport · Kefan Chen · Srinath Sridhar, ,https://arxiv.org/abs/2307.16897,,2307.16897.pdf,DiVa-360: The Dynamic Visual Dataset for Immersive Neural Fields,"Advances in neural fields are enabling high-fidelity capture of the shape and +appearance of dynamic 3D scenes. However, their capabilities lag behind those +offered by conventional representations such as 2D videos because of +algorithmic challenges and the lack of large-scale multi-view real-world +datasets. We address the dataset limitation with DiVa-360, a real-world 360 +dynamic visual dataset that contains synchronized high-resolution and +long-duration multi-view video sequences of table-scale scenes captured using a +customized low-cost system with 53 cameras. It contains 21 object-centric +sequences categorized by different motion types, 25 intricate hand-object +interaction sequences, and 8 long-duration sequences for a total of 17.4 M +image frames. In addition, we provide foreground-background segmentation masks, +synchronized audio, and text descriptions. We benchmark the state-of-the-art +dynamic neural field methods on DiVa-360 and provide insights about existing +methods and future challenges on long-duration neural field capture.",cs.CV,"['cs.CV', 'cs.AI']" +When StyleGAN Meets Stable Diffusion: a ${\mathcal{W}_+}$ Adapter for Personalized Image Generation,Xiaoming Li · Xinyu Hou · Chen Change Loy,https://github.com/csxmli2016/w-plus-adapter,https://arxiv.org/abs/2311.17461v1,,2311.17461v1.pdf,When StyleGAN Meets Stable Diffusion: a $\mathscr{W}_+$ Adapter for Personalized Image Generation,"Text-to-image diffusion models have remarkably excelled in producing diverse, +high-quality, and photo-realistic images. This advancement has spurred a +growing interest in incorporating specific identities into generated content. +Most current methods employ an inversion approach to embed a target visual +concept into the text embedding space using a single reference image. However, +the newly synthesized faces either closely resemble the reference image in +terms of facial attributes, such as expression, or exhibit a reduced capacity +for identity preservation. Text descriptions intended to guide the facial +attributes of the synthesized face may fall short, owing to the intricate +entanglement of identity information with identity-irrelevant facial attributes +derived from the reference image. To address these issues, we present the novel +use of the extended StyleGAN embedding space $\mathcal{W}_+$, to achieve +enhanced identity preservation and disentanglement for diffusion models. By +aligning this semantically meaningful human face latent space with +text-to-image diffusion models, we succeed in maintaining high fidelity in +identity preservation, coupled with the capacity for semantic editing. +Additionally, we propose new training objectives to balance the influences of +both prompt and identity conditions, ensuring that the identity-irrelevant +background remains unaffected during facial attribute modifications. 
Extensive +experiments reveal that our method adeptly generates personalized text-to-image +outputs that are not only compatible with prompt descriptions but also amenable +to common StyleGAN editing directions in diverse settings. Our source code will +be available at \url{https://github.com/csxmli2016/w-plus-adapter}.",cs.CV,['cs.CV'] +Spectral and Polarization Vision: Spectro-polarimetric Real-world Dataset,Yujin Jeon · Eunsue Choi · Youngchan Kim · Yunseong Moon · Khalid Omer · Felix Heide · Seung-Hwan Baek, ,https://arxiv.org/abs/2311.17396,,2311.17396.pdf,Spectral and Polarization Vision: Spectro-polarimetric Real-world Dataset,"Image datasets are essential not only in validating existing methods in +computer vision but also in developing new methods. Most existing image +datasets focus on trichromatic intensity images to mimic human vision. However, +polarization and spectrum, the wave properties of light that animals in harsh +environments and with limited brain capacity often rely on, remain +underrepresented in existing datasets. Although spectro-polarimetric datasets +exist, these datasets have insufficient object diversity, limited illumination +conditions, linear-only polarization data, and inadequate image count. Here, we +introduce two spectro-polarimetric datasets: trichromatic Stokes images and +hyperspectral Stokes images. These novel datasets encompass both linear and +circular polarization; they introduce multiple spectral channels; and they +feature a broad selection of real-world scenes. With our dataset in hand, we +analyze the spectro-polarimetric image statistics, develop efficient +representations of such high-dimensional data, and evaluate spectral dependency +of shape-from-polarization methods. As such, the proposed dataset promises a +foundation for data-driven spectro-polarimetric imaging and vision research. +Dataset and code will be publicly available.",cs.CV,"['cs.CV', 'eess.IV']" +LoSh: Long-Short Text Joint Prediction Network for Referring Video Object Segmentation,Linfeng Yuan · Miaojing Shi · Zijie Yue · Qijun Chen, ,https://arxiv.org/abs/2306.08736,,2306.08736.pdf,LoSh: Long-Short Text Joint Prediction Network for Referring Video Object Segmentation,"Referring video object segmentation (RVOS) aims to segment the target +instance referred by a given text expression in a video clip. The text +expression normally contains sophisticated description of the instance's +appearance, action, and relation with others. It is therefore rather difficult +for a RVOS model to capture all these attributes correspondingly in the video; +in fact, the model often favours more on the action- and relation-related +visual attributes of the instance. This can end up with partial or even +incorrect mask prediction of the target instance. We tackle this problem by +taking a subject-centric short text expression from the original long text +expression. The short one retains only the appearance-related information of +the target instance so that we can use it to focus the model's attention on the +instance's appearance. We let the model make joint predictions using both long +and short text expressions; and insert a long-short cross-attention module to +interact the joint features and a long-short predictions intersection loss to +regulate the joint predictions. 
Besides the improvement on the linguistic part,
+we also introduce a forward-backward visual consistency loss, which utilizes
+optical flows to warp visual features between the annotated frames and their
+temporal neighbors for consistency. We build our method on top of two
+state-of-the-art pipelines. Extensive experiments on A2D-Sentences, Refer-YouTube-VOS,
+JHMDB-Sentences and Refer-DAVIS17 show impressive improvements of our
+method. Code is available at https://github.com/LinfengYuan1997/Losh.",cs.CV,['cs.CV']
+FISBe: A real-world benchmark dataset for instance segmentation of long-range thin filamentous structures,Lisa Mais · Peter Hirsch · Claire Managan · Ramya Kandarpa · Josef Rumberger · Annika Reinke · Lena Maier-Hein · Gudrun Ihrke · Dagmar Kainmueller, ,https://arxiv.org/abs/2404.00130,,2404.00130.pdf,FISBe: A real-world benchmark dataset for instance segmentation of long-range thin filamentous structures,"Instance segmentation of neurons in volumetric light microscopy images of
+nervous systems enables groundbreaking research in neuroscience by facilitating
+joint functional and morphological analyses of neural circuits at cellular
+resolution. Yet said multi-neuron light microscopy data exhibits extremely
+challenging properties for the task of instance segmentation: Individual
+neurons have long-ranging, thin filamentous and widely branching morphologies,
+multiple neurons are tightly inter-weaved, and partial volume effects, uneven
+illumination and noise inherent to light microscopy severely impede local
+disentangling as well as long-range tracing of individual neurons. These
+properties reflect a current key challenge in machine learning research, namely
+to effectively capture long-range dependencies in the data. While respective
+methodological research is buzzing, to date methods are typically benchmarked
+on synthetic datasets. To address this gap, we release the FlyLight Instance
+Segmentation Benchmark (FISBe) dataset, the first publicly available
+multi-neuron light microscopy dataset with pixel-wise annotations. In addition,
+we define a set of instance segmentation metrics for benchmarking that we
+designed to be meaningful with regard to downstream analyses. Lastly, we
+provide three baselines to kick off a competition that we envision to both
+advance the field of machine learning regarding methodology for capturing
+long-range data dependencies, and facilitate scientific discovery in basic
+neuroscience.",cs.CV,"['cs.CV', 'cs.LG']"
+Learning Spatial Features from Audio-Visual Correspondence in Egocentric Videos,Sagnik Majumder · Ziad Al-Halah · Kristen Grauman, ,https://arxiv.org/abs/2307.04760,,2307.04760.pdf,Learning Spatial Features from Audio-Visual Correspondence in Egocentric Videos,"We propose a self-supervised method for learning representations based on
+spatial audio-visual correspondences in egocentric videos. Our method uses a
+masked auto-encoding framework to synthesize masked binaural (multi-channel)
+audio through the synergy of audio and vision, thereby learning useful spatial
+relationships between the two modalities. We use our pretrained features to
+tackle two downstream video tasks requiring spatial understanding in social
+scenarios: active speaker detection and spatial audio denoising. Through
+extensive experiments, we show that our features are generic enough to improve
+over multiple state-of-the-art baselines on both tasks on two challenging
+egocentric video datasets that offer binaural audio, EgoCom and EasyCom.
+Project: http://vision.cs.utexas.edu/projects/ego_av_corr.",cs.CV,"['cs.CV', 'cs.SD', 'eess.AS']" +Motion Blur Decomposition with Cross-shutter Guidance,Xiang Ji · Haiyang Jiang · Yinqiang Zheng,https://jixiang2016.github.io/dualBR_site/,https://arxiv.org/abs/2404.01120,,2404.01120.pdf,Motion Blur Decomposition with Cross-shutter Guidance,"Motion blur is a frequently observed image artifact, especially under +insufficient illumination where exposure time has to be prolonged so as to +collect more photons for a bright enough image. Rather than simply removing +such blurring effects, recent researches have aimed at decomposing a blurry +image into multiple sharp images with spatial and temporal coherence. Since +motion blur decomposition itself is highly ambiguous, priors from neighbouring +frames or human annotation are usually needed for motion disambiguation. In +this paper, inspired by the complementary exposure characteristics of a global +shutter (GS) camera and a rolling shutter (RS) camera, we propose to utilize +the ordered scanline-wise delay in a rolling shutter image to robustify motion +decomposition of a single blurry image. To evaluate this novel dual imaging +setting, we construct a triaxial system to collect realistic data, as well as a +deep network architecture that explicitly addresses temporal and contextual +information through reciprocal branches for cross-shutter motion blur +decomposition. Experiment results have verified the effectiveness of our +proposed algorithm, as well as the validity of our dual imaging setting.",cs.CV,['cs.CV'] +LAFS: Landmark-based Facial Self-supervised Learning for Face Recognition,Zhonglin Sun · Chen Feng · Ioannis Patras · Georgios Tzimiropoulos, ,https://arxiv.org/abs/2403.08161,,2403.08161.pdf,LAFS: Landmark-based Facial Self-supervised Learning for Face Recognition,"In this work we focus on learning facial representations that can be adapted +to train effective face recognition models, particularly in the absence of +labels. Firstly, compared with existing labelled face datasets, a vastly larger +magnitude of unlabeled faces exists in the real world. We explore the learning +strategy of these unlabeled facial images through self-supervised pretraining +to transfer generalized face recognition performance. Moreover, motivated by +one recent finding, that is, the face saliency area is critical for face +recognition, in contrast to utilizing random cropped blocks of images for +constructing augmentations in pretraining, we utilize patches localized by +extracted facial landmarks. This enables our method - namely LAndmark-based +Facial Self-supervised learning LAFS), to learn key representation that is more +critical for face recognition. We also incorporate two landmark-specific +augmentations which introduce more diversity of landmark information to further +regularize the learning. With learned landmark-based facial representations, we +further adapt the representation for face recognition with regularization +mitigating variations in landmark positions. 
Our method achieves significant +improvement over the state-of-the-art on multiple face recognition benchmarks, +especially on more challenging few-shot scenarios.",cs.CV,"['cs.CV', 'cs.AI']" +Unveiling Parts Beyond Objects: Towards Finer-Granularity Referring Expression Segmentation,Wenxuan Wang · Tongtian Yue · Yisi Zhang · Longteng Guo · Xingjian He · Xinlong Wang · Jing Liu, ,https://arxiv.org/abs/2312.08007,,2312.08007.pdf,Unveiling Parts Beyond Objects:Towards Finer-Granularity Referring Expression Segmentation,"Referring expression segmentation (RES) aims at segmenting the foreground +masks of the entities that match the descriptive natural language expression. +Previous datasets and methods for classic RES task heavily rely on the prior +assumption that one expression must refer to object-level targets. In this +paper, we take a step further to finer-grained part-level RES task. To promote +the object-level RES task towards finer-grained vision-language understanding, +we put forward a new multi-granularity referring expression segmentation (MRES) +task and construct an evaluation benchmark called RefCOCOm by manual +annotations. By employing our automatic model-assisted data engine, we build +the largest visual grounding dataset namely MRES-32M, which comprises over +32.2M high-quality masks and captions on the provided 1M images. Besides, a +simple yet strong model named UniRES is designed to accomplish the unified +object-level and part-level grounding task. Extensive experiments on our +RefCOCOm for MRES and three datasets (i.e., RefCOCO(+/g) for classic RES task +demonstrate the superiority of our method over previous state-of-the-art +methods. To foster future research into fine-grained visual grounding, our +benchmark RefCOCOm, the MRES-32M dataset and model UniRES will be publicly +available at https://github.com/Rubics-Xuan/MRES",cs.CV,['cs.CV'] +Event-based Visible and Infrared Fusion via Multi-task Collaboration,Mengyue Geng · Lin Zhu · Lizhi Wang · Wei Zhang · Ruiqin Xiong · Yonghong Tian, ,https://arxiv.org/abs/2312.04328,,2312.04328.pdf,A Multi-scale Information Integration Framework for Infrared and Visible Image Fusion,"Infrared and visible image fusion aims at generating a fused image containing +the intensity and detail information of source images, and the key issue is +effectively measuring and integrating the complementary information of +multi-modality images from the same scene. Existing methods mostly adopt a +simple weight in the loss function to decide the information retention of each +modality rather than adaptively measuring complementary information for +different image pairs. In this study, we propose a multi-scale dual attention +(MDA) framework for infrared and visible image fusion, which is designed to +measure and integrate complementary information in both structure and loss +function at the image and patch level. In our method, the residual downsample +block decomposes source images into three scales first. Then, dual attention +fusion block integrates complementary information and generates a spatial and +channel attention map at each scale for feature fusion. Finally, the output +image is reconstructed by the residual reconstruction block. Loss function +consists of image-level, feature-level and patch-level three parts, of which +the calculation of the image-level and patch-level two parts are based on the +weights generated by the complementary information measurement. 
Indeed, to
+constrain the pixel intensity distribution between the output and infrared
+image, a style loss is added. Our fusion results are robust and informative
+across different scenarios. Qualitative and quantitative results on two
+datasets illustrate that our method is able to preserve both thermal radiation
+and detailed information from two modalities and achieve comparable results
+compared with the other state-of-the-art methods. Ablation experiments show the
+effectiveness of our information integration architecture and adaptively
+measure complementary information retention in the loss function.",cs.CV,['cs.CV']
+PFStorer: Personalized Face Restoration and Super-Resolution,Tuomas Varanka · Tapani Toivonen · Soumya Tripathy · Guoying Zhao · Erman Acar, ,https://arxiv.org/abs/2403.08436,,2403.08436.pdf,PFStorer: Personalized Face Restoration and Super-Resolution,"Recent developments in face restoration have achieved remarkable results in
+producing high-quality and lifelike outputs. The stunning results however often
+fail to be faithful with respect to the identity of the person as the models
+lack necessary context. In this paper, we explore the potential of personalized
+face restoration with diffusion models. In our approach a restoration model is
+personalized using a few images of the identity, leading to tailored
+restoration with respect to the identity while retaining fine-grained details.
+By using independent trainable blocks for personalization, the rich prior of a
+base restoration model can be exploited to its fullest. To avoid the model
+relying on parts of identity left in the conditioning low-quality images, a
+generative regularizer is employed. With a learnable parameter, the model
+learns to balance between the details generated based on the input image and
+the degree of personalization. Moreover, we improve the training pipeline of
+face restoration models to enable an alignment-free approach. We showcase the
+robust capabilities of our approach in several real-world scenarios with
+multiple identities, demonstrating our method's ability to generate
+fine-grained details with faithful restoration. In the user study we evaluate
+the perceptual quality and faithfulness of the generated details, with our
+method being voted best 61% of the time compared to the second best with 25% of
+the votes.",cs.CV,['cs.CV']
+UFORecon: Generalizable Sparse-View Surface Reconstruction from Arbitrary and Unfavorable Sets,Youngju Na · Woo Jae Kim · Kyu Han · Suhyeon Ha · Sung-Eui Yoon, ,https://arxiv.org/abs/2403.05086,,2403.05086.pdf,UFORecon: Generalizable Sparse-View Surface Reconstruction from Arbitrary and UnFavOrable Sets,"Generalizable neural implicit surface reconstruction aims to obtain an
+accurate underlying geometry given a limited number of multi-view images from
+unseen scenes. However, existing methods select only informative and relevant
+views using predefined scores for training and testing phases. This constraint
+renders the model impractical in real-world scenarios, where the availability
+of favorable combinations cannot always be ensured. We introduce and validate a
+view-combination score to indicate the effectiveness of the input view
+combination. We observe that previous methods output degenerate solutions under
+arbitrary and unfavorable sets. Building upon this finding, we propose
+UFORecon, a robust view-combination generalizable surface reconstruction
+framework.
To achieve this, we apply cross-view matching transformers to model +interactions between source images and build correlation frustums to capture +global correlations. Additionally, we explicitly encode pairwise feature +similarities as view-consistent priors. Our proposed framework significantly +outperforms previous methods in terms of view-combination generalizability and +also in the conventional generalizable protocol trained with favorable +view-combinations. The code is available at +https://github.com/Youngju-Na/UFORecon.",cs.CV,['cs.CV'] +Generalizable Face Landmarking Guided by Conditional Face Warping,Jiayi Liang · Haotian Liu · Hongteng Xu · Dixin Luo,https://plustwo0.github.io/project-face-landmarker/,https://arxiv.org/abs/2404.12322,,2404.12322.pdf,Generalizable Face Landmarking Guided by Conditional Face Warping,"As a significant step for human face modeling, editing, and generation, face +landmarking aims at extracting facial keypoints from images. A generalizable +face landmarker is required in practice because real-world facial images, e.g., +the avatars in animations and games, are often stylized in various ways. +However, achieving generalizable face landmarking is challenging due to the +diversity of facial styles and the scarcity of labeled stylized faces. In this +study, we propose a simple but effective paradigm to learn a generalizable face +landmarker based on labeled real human faces and unlabeled stylized faces. Our +method learns the face landmarker as the key module of a conditional face +warper. Given a pair of real and stylized facial images, the conditional face +warper predicts a warping field from the real face to the stylized one, in +which the face landmarker predicts the ending points of the warping field and +provides us with high-quality pseudo landmarks for the corresponding stylized +facial images. Applying an alternating optimization strategy, we learn the face +landmarker to minimize $i)$ the discrepancy between the stylized faces and the +warped real ones and $ii)$ the prediction errors of both real and pseudo +landmarks. Experiments on various datasets show that our method outperforms +existing state-of-the-art domain adaptation methods in face landmarking tasks, +leading to a face landmarker with better generalizability. Code is available at +https://plustwo0.github.io/project-face-landmarker.",cs.CV,"['cs.CV', 'cs.AI']" +PIN: Positional Insert Unlocks Object Localisation Abilities in VLMs,Michael Dorkenwald · Nimrod Barazani · Cees G. M. Snoek · Yuki Asano,https://quva-lab.github.io/PIN/,https://arxiv.org/abs/2402.08657,,2402.08657.pdf,PIN: Positional Insert Unlocks Object Localisation Abilities in VLMs,"Vision-Language Models (VLMs), such as Flamingo and GPT-4V, have shown +immense potential by integrating large language models with vision systems. +Nevertheless, these models face challenges in the fundamental computer vision +task of object localisation, due to their training on multimodal data +containing mostly captions without explicit spatial grounding. While it is +possible to construct custom, supervised training pipelines with bounding box +annotations that integrate with VLMs, these result in specialized and +hard-to-scale models. In this paper, we aim to explore the limits of +caption-based VLMs and instead propose to tackle the challenge in a simpler +manner by i) keeping the weights of a caption-based VLM frozen and ii) not +using any supervised detection data. 
To this end, we introduce an +input-agnostic Positional Insert (PIN), a learnable spatial prompt, containing +a minimal set of parameters that are slid inside the frozen VLM, unlocking +object localisation capabilities. Our PIN module is trained with a simple +next-token prediction task on synthetic data without requiring the introduction +of new output heads. Our experiments demonstrate strong zero-shot localisation +performances on a variety of images, including Pascal VOC, COCO, LVIS, and +diverse images like paintings or cartoons.",cs.CV,['cs.CV'] +SelfPose3d: Self-Supervised Multi-Person Multi-View 3d Pose Estimation,Keqi Chen · vinkle srivastav · Nicolas Padoy,https://github.com/CAMMA-public/SelfPose3d/,https://arxiv.org/abs/2404.02041,,,SelfPose3d: Self-Supervised Multi-Person Multi-View 3d Pose Estimation,"We present a new self-supervised approach, SelfPose3d, for estimating 3d +poses of multiple persons from multiple camera views. Unlike current +state-of-the-art fully-supervised methods, our approach does not require any 2d +or 3d ground-truth poses and uses only the multi-view input images from a +calibrated camera setup and 2d pseudo poses generated from an off-the-shelf 2d +human pose estimator. We propose two self-supervised learning objectives: +self-supervised person localization in 3d space and self-supervised 3d pose +estimation. We achieve self-supervised 3d person localization by training the +model on synthetically generated 3d points, serving as 3d person root +positions, and on the projected root-heatmaps in all the views. We then model +the 3d poses of all the localized persons with a bottleneck representation, map +them onto all views obtaining 2d joints, and render them using 2d Gaussian +heatmaps in an end-to-end differentiable manner. Afterwards, we use the +corresponding 2d joints and heatmaps from the pseudo 2d poses for learning. To +alleviate the intrinsic inaccuracy of the pseudo labels, we propose an adaptive +supervision attention mechanism to guide the self-supervision. Our experiments +and analysis on three public benchmark datasets, including Panoptic, Shelf, and +Campus, show the effectiveness of our approach, which is comparable to +fully-supervised methods. Code is available at +\url{https://github.com/CAMMA-public/SelfPose3D}",cs.CV,['cs.CV'] +PEM: Prototype-based Efficient MaskFormer for Image Segmentation,Niccolò Cavagnero · Gabriele Rosi · Claudia Cuttano · Francesca Pistilli · Marco Ciccone · Giuseppe Averta · Fabio Cermelli,https://niccolocavagnero.github.io/PEM/,https://arxiv.org/abs/2402.19422,,2402.19422.pdf,PEM: Prototype-based Efficient MaskFormer for Image Segmentation,"Recent transformer-based architectures have shown impressive results in the +field of image segmentation. Thanks to their flexibility, they obtain +outstanding performance in multiple segmentation tasks, such as semantic and +panoptic, under a single unified framework. To achieve such impressive +performance, these architectures employ intensive operations and require +substantial computational resources, which are often not available, especially +on edge devices. To fill this gap, we propose Prototype-based Efficient +MaskFormer (PEM), an efficient transformer-based architecture that can operate +in multiple segmentation tasks. PEM proposes a novel prototype-based +cross-attention which leverages the redundancy of visual features to restrict +the computation and improve the efficiency without harming the performance. 
In +addition, PEM introduces an efficient multi-scale feature pyramid network, +capable of extracting features that have high semantic content in an efficient +way, thanks to the combination of deformable convolutions and context-based +self-modulation. We benchmark the proposed PEM architecture on two tasks, +semantic and panoptic segmentation, evaluated on two different datasets, +Cityscapes and ADE20K. PEM demonstrates outstanding performance on every task +and dataset, outperforming task-specific architectures while being comparable +and even better than computationally-expensive baselines.",cs.CV,"['cs.CV', 'cs.AI']" +Improving Distant 3D Object Detection Using 2D Box Supervision,Zetong Yang · Zhiding Yu · Christopher Choy · Renhao Wang · Anima Anandkumar · Jose M. Alvarez, ,https://arxiv.org/abs/2403.09230,,2403.09230.pdf,Improving Distant 3D Object Detection Using 2D Box Supervision,"Improving the detection of distant 3d objects is an important yet challenging +task. For camera-based 3D perception, the annotation of 3d bounding relies +heavily on LiDAR for accurate depth information. As such, the distance of +annotation is often limited due to the sparsity of LiDAR points on distant +objects, which hampers the capability of existing detectors for long-range +scenarios. We address this challenge by considering only 2D box supervision for +distant objects since they are easy to annotate. We propose LR3D, a framework +that learns to recover the missing depth of distant objects. LR3D adopts an +implicit projection head to learn the generation of mapping between 2D boxes +and depth using the 3D supervision on close objects. This mapping allows the +depth estimation of distant objects conditioned on their 2D boxes, making +long-range 3D detection with 2D supervision feasible. Experiments show that +without distant 3D annotations, LR3D allows camera-based methods to detect +distant objects (over 200m) with comparable accuracy to full 3D supervision. +Our framework is general, and could widely benefit 3D detection methods to a +large extent.",cs.CV,['cs.CV'] +Visual Point Cloud Forecasting enables Scalable Autonomous Driving,Zetong Yang · Li Chen · Yanan Sun · Hongyang Li,https://github.com/OpenDriveLab/ViDAR,https://arxiv.org/abs/2312.17655,,2312.17655.pdf,Visual Point Cloud Forecasting enables Scalable Autonomous Driving,"In contrast to extensive studies on general vision, pre-training for scalable +visual autonomous driving remains seldom explored. Visual autonomous driving +applications require features encompassing semantics, 3D geometry, and temporal +information simultaneously for joint perception, prediction, and planning, +posing dramatic challenges for pre-training. To resolve this, we bring up a new +pre-training task termed as visual point cloud forecasting - predicting future +point clouds from historical visual input. The key merit of this task captures +the synergic learning of semantics, 3D structures, and temporal dynamics. Hence +it shows superiority in various downstream tasks. To cope with this new +problem, we present ViDAR, a general model to pre-train downstream visual +encoders. It first extracts historical embeddings by the encoder. These +representations are then transformed to 3D geometric space via a novel Latent +Rendering operator for future point cloud prediction. 
Experiments show +significant gain in downstream tasks, e.g., 3.1% NDS on 3D detection, ~10% +error reduction on motion forecasting, and ~15% less collision rate on +planning.",cs.CV,['cs.CV'] +Structure Matters: Tackling the Semantic Discrepancy in Diffusion Models for Image Inpainting,Haipeng Liu · Yang Wang · Biao Qian · Meng Wang · Yong Rui,https://github.com/htyjers/StrDiffusion,https://arxiv.org/abs/2403.19898,,2403.19898.pdf,Structure Matters: Tackling the Semantic Discrepancy in Diffusion Models for Image Inpainting,"Denoising diffusion probabilistic models for image inpainting aim to add the +noise to the texture of image during the forward process and recover masked +regions with unmasked ones of the texture via the reverse denoising process. +Despite the meaningful semantics generation, the existing arts suffer from the +semantic discrepancy between masked and unmasked regions, since the +semantically dense unmasked texture fails to be completely degraded while the +masked regions turn to the pure noise in diffusion process, leading to the +large discrepancy between them. In this paper, we aim to answer how unmasked +semantics guide texture denoising process;together with how to tackle the +semantic discrepancy, to facilitate the consistent and meaningful semantics +generation. To this end, we propose a novel structure-guided diffusion model +named StrDiffusion, to reformulate the conventional texture denoising process +under structure guidance to derive a simplified denoising objective for image +inpainting, while revealing: 1) the semantically sparse structure is beneficial +to tackle semantic discrepancy in early stage, while dense texture generates +reasonable semantics in late stage; 2) the semantics from unmasked regions +essentially offer the time-dependent structure guidance for the texture +denoising process, benefiting from the time-dependent sparsity of the structure +semantics. For the denoising process, a structure-guided neural network is +trained to estimate the simplified denoising objective by exploiting the +consistency of the denoised structure between masked and unmasked regions. +Besides, we devise an adaptive resampling strategy as a formal criterion as +whether structure is competent to guide the texture denoising process, while +regulate their semantic correlations. Extensive experiments validate the merits +of StrDiffusion over the state-of-the-arts. Our code is available at +https://github.com/htyjers/StrDiffusion.",cs.CV,['cs.CV'] +Predicated Diffusion: Predicate Logic-Based Attention Guidance for Text-to-Image Diffusion Models,Kota Sueyoshi · Takashi Matsubara, ,https://arxiv.org/abs/2311.16117,,2311.16117.pdf,Predicated Diffusion: Predicate Logic-Based Attention Guidance for Text-to-Image Diffusion Models,"Diffusion models have achieved remarkable results in generating high-quality, +diverse, and creative images. However, when it comes to text-based image +generation, they often fail to capture the intended meaning presented in the +text. For instance, a specified object may not be generated, an unnecessary +object may be generated, and an adjective may alter objects it was not intended +to modify. Moreover, we found that relationships indicating possession between +objects are often overlooked. While users' intentions in text are diverse, +existing methods tend to specialize in only some aspects of these. In this +paper, we propose Predicated Diffusion, a unified framework to express users' +intentions. 
We consider that the root of the above issues lies in the text +encoder, which often focuses only on individual words and neglects the logical +relationships between them. The proposed method does not solely rely on the +text encoder, but instead, represents the intended meaning in the text as +propositions using predicate logic and treats the pixels in the attention maps +as the fuzzy predicates. This enables us to obtain a differentiable loss +function that makes the image fulfill the proposition by minimizing it. When +compared to several existing methods, we demonstrated that Predicated Diffusion +can generate images that are more faithful to various text prompts, as verified +by human evaluators and pretrained image-text models.",cs.CV,['cs.CV'] +Probabilistic Human Mesh Estimation with Hypothesis Scoring,Yuan Xu · Xiaoxuan Ma · Jiajun Su · Wentao Zhu · Yu Qiao · Yizhou Wang, ,https://arxiv.org/abs/2308.02963,,2308.02963.pdf,Generative Approach for Probabilistic Human Mesh Recovery using Diffusion Models,"This work focuses on the problem of reconstructing a 3D human body mesh from +a given 2D image. Despite the inherent ambiguity of the task of human mesh +recovery, most existing works have adopted a method of regressing a single +output. In contrast, we propose a generative approach framework, called +""Diffusion-based Human Mesh Recovery (Diff-HMR)"" that takes advantage of the +denoising diffusion process to account for multiple plausible outcomes. During +the training phase, the SMPL parameters are diffused from ground-truth +parameters to random distribution, and Diff-HMR learns the reverse process of +this diffusion. In the inference phase, the model progressively refines the +given random SMPL parameters into the corresponding parameters that align with +the input image. Diff-HMR, being a generative approach, is capable of +generating diverse results for the same input image as the input noise varies. +We conduct validation experiments, and the results demonstrate that the +proposed framework effectively models the inherent ambiguity of the task of +human mesh recovery in a probabilistic manner. The code is available at +https://github.com/hanbyel0105/Diff-HMR",cs.CV,['cs.CV'] +TexVocab: Texture Vocabulary-conditioned Human Avatars,Yuxiao Liu · Zhe Li · Yebin Liu · Haoqian Wang, ,https://arxiv.org/abs/2404.00524,,2404.00524.pdf,TexVocab: Texture Vocabulary-conditioned Human Avatars,"To adequately utilize the available image evidence in multi-view video-based +avatar modeling, we propose TexVocab, a novel avatar representation that +constructs a texture vocabulary and associates body poses with texture maps for +animation. Given multi-view RGB videos, our method initially back-projects all +the available images in the training videos to the posed SMPL surface, +producing texture maps in the SMPL UV domain. Then we construct pairs of human +poses and texture maps to establish a texture vocabulary for encoding dynamic +human appearances under various poses. Unlike the commonly used joint-wise +manner, we further design a body-part-wise encoding strategy to learn the +structural effects of the kinematic chain. Given a driving pose, we query the +pose feature hierarchically by decomposing the pose vector into several body +parts and interpolating the texture features for synthesizing fine-grained +human dynamics. 
Overall, our method is able to create animatable human avatars +with detailed and dynamic appearances from RGB videos, and the experiments show +that our method outperforms state-of-the-art approaches. The project page can +be found at https://texvocab.github.io/.",cs.CV,['cs.CV'] +LDP: Language-driven Dual-Pixel Image Defocus Deblurring Network,Hao Yang · Liyuan Pan · Yan Yang · Richard Hartley · Miaomiao Liu, ,https://arxiv.org/abs/2307.09815,,2307.09815.pdf,LDP: Language-driven Dual-Pixel Image Defocus Deblurring Network,"Recovering sharp images from dual-pixel (DP) pairs with disparity-dependent +blur is a challenging task.~Existing blur map-based deblurring methods have +demonstrated promising results. In this paper, we propose, to the best of our +knowledge, the first framework that introduces the contrastive language-image +pre-training framework (CLIP) to accurately estimate the blur map from a DP +pair unsupervisedly. To achieve this, we first carefully design text prompts to +enable CLIP to understand blur-related geometric prior knowledge from the DP +pair. Then, we propose a format to input a stereo DP pair to CLIP without any +fine-tuning, despite the fact that CLIP is pre-trained on monocular images. +Given the estimated blur map, we introduce a blur-prior attention block, a +blur-weighting loss, and a blur-aware loss to recover the all-in-focus image. +Our method achieves state-of-the-art performance in extensive experiments (see +Fig.~\ref{fig:teaser}).",cs.CV,['cs.CV'] +VSRD: Instance-Aware Volumetric Silhouette Rendering for Weakly Supervised 3D Object Detection,Zihua Liu · Hiroki Sakuma · Masatoshi Okutomi,http://www.ok.sc.e.titech.ac.jp/res/VSRD/index.html,https://arxiv.org/abs/2404.00149,,2404.00149.pdf,VSRD: Instance-Aware Volumetric Silhouette Rendering for Weakly Supervised 3D Object Detection,"Monocular 3D object detection poses a significant challenge in 3D scene +understanding due to its inherently ill-posed nature in monocular depth +estimation. Existing methods heavily rely on supervised learning using abundant +3D labels, typically obtained through expensive and labor-intensive annotation +on LiDAR point clouds. To tackle this problem, we propose a novel weakly +supervised 3D object detection framework named VSRD (Volumetric Silhouette +Rendering for Detection) to train 3D object detectors without any 3D +supervision but only weak 2D supervision. VSRD consists of multi-view 3D +auto-labeling and subsequent training of monocular 3D object detectors using +the pseudo labels generated in the auto-labeling stage. In the auto-labeling +stage, we represent the surface of each instance as a signed distance field +(SDF) and render its silhouette as an instance mask through our proposed +instance-aware volumetric silhouette rendering. To directly optimize the 3D +bounding boxes through rendering, we decompose the SDF of each instance into +the SDF of a cuboid and the residual distance field (RDF) that represents the +residual from the cuboid. This mechanism enables us to optimize the 3D bounding +boxes in an end-to-end manner by comparing the rendered instance masks with the +ground truth instance masks. The optimized 3D bounding boxes serve as effective +training data for 3D object detection. We conduct extensive experiments on the +KITTI-360 dataset, demonstrating that our method outperforms the existing +weakly supervised 3D object detection methods. 
The code is available at +https://github.com/skmhrk1209/VSRD.",cs.CV,['cs.CV'] +Real-World Mobile Image Denoising Dataset with Efficient Baselines,Roman Flepp · Andrey Ignatov · Radu Timofte · Luc Van Gool, ,https://arxiv.org/html/2404.08514v2,,2404.08514v2.pdf,NIR-Assisted Image Denoising: A Selective Fusion Approach and A Real-World Benchmark Datase,"Despite the significant progress in image denoising, it is still challenging +to restore fine-scale details while removing noise, especially in extremely +low-light environments. Leveraging near-infrared (NIR) images to assist visible +RGB image denoising shows the potential to address this issue, becoming a +promising technology. Nonetheless, existing works still struggle with taking +advantage of NIR information effectively for real-world image denoising, due to +the content inconsistency between NIR-RGB images and the scarcity of real-world +paired datasets. To alleviate the problem, we propose an efficient Selective +Fusion Module (SFM), which can be plug-and-played into the advanced denoising +networks to merge the deep NIR-RGB features. Specifically, we sequentially +perform the global and local modulation for NIR and RGB features, and then +integrate the two modulated features. Furthermore, we present a Real-world +NIR-Assisted Image Denoising (Real-NAID) dataset, which covers diverse +scenarios as well as various noise levels. Extensive experiments on both +synthetic and our real-world datasets demonstrate that the proposed method +achieves better results than state-of-the-art ones.",cs.CV,['cs.CV'] +Exploiting Style Latent Flows for Generalizing Video Deepfake Detection,Jongwook Choi · Taehoon Kim · Yonghyun Jeong · Seungryul Baek · Jongwon Choi, ,https://arxiv.org/abs/2403.06592v1,,2403.06592v1.pdf,Exploiting Style Latent Flows for Generalizing Deepfake Detection Video Detection,"This paper presents a new approach for the detection of fake videos, based on +the analysis of style latent vectors and their abnormal behavior in temporal +changes in the generated videos. We discovered that the generated facial videos +suffer from the temporal distinctiveness in the temporal changes of style +latent vectors, which are inevitable during the generation of temporally stable +videos with various facial expressions and geometric transformations. Our +framework utilizes the StyleGRU module, trained by contrastive learning, to +represent the dynamic properties of style latent vectors. Additionally, we +introduce a style attention module that integrates StyleGRU-generated features +with content-based features, enabling the detection of visual and temporal +artifacts. We demonstrate our approach across various benchmark scenarios in +deepfake detection, showing its superiority in cross-dataset and +cross-manipulation scenarios. Through further analysis, we also validate the +importance of using temporal changes of style latent vectors to improve the +generality of deepfake video detection.",cs.CV,"['cs.CV', 'cs.AI']" +3D Face Reconstruction with the Geometric Guidance of Facial Part Segmentation,Zidu Wang · Xiangyu Zhu · Tianshuo Zhang · baiqin wang · Zhen Lei,https://github.com/wang-zidu/3DDFA-V3,https://arxiv.org/abs/2312.00311,,2312.00311.pdf,3D Face Reconstruction with the Geometric Guidance of Facial Part Segmentation,"3D Morphable Models (3DMMs) provide promising 3D face reconstructions in +various applications. 
However, existing methods struggle to reconstruct faces +with extreme expressions due to deficiencies in supervisory signals, such as +sparse or inaccurate landmarks. Segmentation information contains effective +geometric contexts for face reconstruction. Certain attempts intuitively depend +on differentiable renderers to compare the rendered silhouettes of +reconstruction with segmentation, which is prone to issues like local optima +and gradient instability. In this paper, we fully utilize the facial part +segmentation geometry by introducing Part Re-projection Distance Loss (PRDL). +Specifically, PRDL transforms facial part segmentation into 2D points and +re-projects the reconstruction onto the image plane. Subsequently, by +introducing grid anchors and computing different statistical distances from +these anchors to the point sets, PRDL establishes geometry descriptors to +optimize the distribution of the point sets for face reconstruction. PRDL +exhibits a clear gradient compared to the renderer-based methods and presents +state-of-the-art reconstruction performance in extensive quantitative and +qualitative experiments. Our project is available at +https://github.com/wang-zidu/3DDFA-V3 .",cs.CV,['cs.CV'] +PerceptionGPT: Effectively Fusing Visual Perception into LLM,Renjie Pi · Lewei Yao · Jiahui Gao · Jipeng Zhang · Tong Zhang, ,https://arxiv.org/abs/2311.06612,,2311.06612.pdf,PerceptionGPT: Effectively Fusing Visual Perception into LLM,"The integration of visual inputs with large language models (LLMs) has led to +remarkable advancements in multi-modal capabilities, giving rise to visual +large language models (VLLMs). However, effectively harnessing VLLMs for +intricate visual perception tasks remains a challenge. In this paper, we +present a novel end-to-end framework named PerceptionGPT, which efficiently and +effectively equips the VLLMs with visual perception abilities by leveraging the +representation power of LLMs' token embedding. Our proposed method treats the +token embedding of the LLM as the carrier of spatial information, then leverage +lightweight visual task encoders and decoders to perform visual perception +tasks (e.g., detection, segmentation). Our approach significantly alleviates +the training difficulty suffered by previous approaches that formulate the +visual outputs as discrete tokens, and enables achieving superior performance +with fewer trainable parameters, less training data and shorted training time. +Moreover, as only one token embedding is required to decode the visual outputs, +the resulting sequence length during inference is significantly reduced. +Consequently, our approach enables accurate and flexible representations, +seamless integration of visual perception tasks, and efficient handling of a +multiple of visual outputs. We validate the effectiveness and efficiency of our +approach through extensive experiments. The results demonstrate significant +improvements over previous methods with much fewer trainable parameters and GPU +hours, which facilitates future research in enabling LLMs with visual +perception abilities.",cs.CV,"['cs.CV', 'cs.CL']" +In Search of a Data Transformation That Accelerates Neural Field Training,Junwon Seo · Sangyoon Lee · Kwang In Kim · Jaeho Lee, ,https://arxiv.org/abs/2311.17094,,2311.17094.pdf,In Search of a Data Transformation That Accelerates Neural Field Training,"Neural field is an emerging paradigm in data representation that trains a +neural network to approximate the given signal. 
A key obstacle that prevents +its widespread adoption is the encoding speed-generating neural fields requires +an overfitting of a neural network, which can take a significant number of SGD +steps to reach the desired fidelity level. In this paper, we delve into the +impacts of data transformations on the speed of neural field training, +specifically focusing on how permuting pixel locations affect the convergence +speed of SGD. Counterintuitively, we find that randomly permuting the pixel +locations can considerably accelerate the training. To explain this phenomenon, +we examine the neural field training through the lens of PSNR curves, loss +landscapes, and error patterns. Our analyses suggest that the random pixel +permutations remove the easy-to-fit patterns, which facilitate easy +optimization in the early stage but hinder capturing fine details of the +signal.",cs.LG,"['cs.LG', 'cs.CV']" +Multi-view Aggregation Network for Dichotomous Image Segmentation,Qian Yu · Xiaoqi Zhao · Youwei Pang · Lihe Zhang · Huchuan Lu, ,https://arxiv.org/abs/2404.07445,,2404.07445.pdf,Multi-view Aggregation Network for Dichotomous Image Segmentation,"Dichotomous Image Segmentation (DIS) has recently emerged towards +high-precision object segmentation from high-resolution natural images. + When designing an effective DIS model, the main challenge is how to balance +the semantic dispersion of high-resolution targets in the small receptive field +and the loss of high-precision details in the large receptive field. Existing +methods rely on tedious multiple encoder-decoder streams and stages to +gradually complete the global localization and local refinement. + Human visual system captures regions of interest by observing them from +multiple views. Inspired by it, we model DIS as a multi-view object perception +problem and provide a parsimonious multi-view aggregation network (MVANet), +which unifies the feature fusion of the distant view and close-up view into a +single stream with one encoder-decoder structure. With the help of the proposed +multi-view complementary localization and refinement modules, our approach +established long-range, profound visual interactions across multiple views, +allowing the features of the detailed close-up view to focus on highly slender +structures.Experiments on the popular DIS-5K dataset show that our MVANet +significantly outperforms state-of-the-art methods in both accuracy and speed. +The source code and datasets will be publicly available at +\href{https://github.com/qianyu-dlut/MVANet}{MVANet}.",cs.CV,['cs.CV'] +Three Pillars improving Vision Foundation Model Distillation for Lidar,Gilles Puy · Spyros Gidaris · Alexandre Boulch · Oriane Siméoni · Corentin Sautier · Patrick Pérez · Andrei Bursuc · Renaud Marlet,https://github.com/valeoai/ScaLR,https://arxiv.org/abs/2310.17504,,2310.17504.pdf,Three Pillars improving Vision Foundation Model Distillation for Lidar,"Self-supervised image backbones can be used to address complex 2D tasks +(e.g., semantic segmentation, object discovery) very efficiently and with +little or no downstream supervision. Ideally, 3D backbones for lidar should be +able to inherit these properties after distillation of these powerful 2D +features. The most recent methods for image-to-lidar distillation on autonomous +driving data show promising results, obtained thanks to distillation methods +that keep improving. 
Yet, we still notice a large performance gap when +measuring the quality of distilled and fully supervised features by linear +probing. In this work, instead of focusing only on the distillation method, we +study the effect of three pillars for distillation: the 3D backbone, the +pretrained 2D backbones, and the pretraining dataset. In particular, thanks to +our scalable distillation method named ScaLR, we show that scaling the 2D and +3D backbones and pretraining on diverse datasets leads to a substantial +improvement of the feature quality. This allows us to significantly reduce the +gap between the quality of distilled and fully-supervised 3D features, and to +improve the robustness of the pretrained backbones to domain gaps and +perturbations.",cs.CV,['cs.CV'] +Cloud-Device Collaborative Learning for Multimodal Large Language Models,Guanqun Wang · Jiaming Liu · Chenxuan Li · Yuan Zhang · Ma Junpeng · Xinyu Wei · Kevin Zhang · Maurice Chong · Renrui Zhang · Yijiang Liu · Shanghang Zhang,https://github.com/2644521362/Cdcca/tree/main,https://arxiv.org/abs/2312.16279,,2312.16279.pdf,Cloud-Device Collaborative Learning for Multimodal Large Language Models,"The burgeoning field of Multimodal Large Language Models (MLLMs) has +exhibited remarkable performance in diverse tasks such as captioning, +commonsense reasoning, and visual scene understanding. However, the deployment +of these large-scale MLLMs on client devices is hindered by their extensive +model parameters, leading to a notable decline in generalization capabilities +when these models are compressed for device deployment. Addressing this +challenge, we introduce a Cloud-Device Collaborative Continual Adaptation +framework, designed to enhance the performance of compressed, device-deployed +MLLMs by leveraging the robust capabilities of cloud-based, larger-scale MLLMs. +Our framework is structured into three key components: a device-to-cloud uplink +for efficient data transmission, cloud-based knowledge adaptation, and an +optimized cloud-to-device downlink for model deployment. In the uplink phase, +we employ an Uncertainty-guided Token Sampling (UTS) strategy to effectively +filter out-of-distribution tokens, thereby reducing transmission costs and +improving training efficiency. On the cloud side, we propose Adapter-based +Knowledge Distillation (AKD) method to transfer refined knowledge from +large-scale to compressed, pocket-size MLLMs. Furthermore, we propose a Dynamic +Weight update Compression (DWC) strategy for the downlink, which adaptively +selects and quantizes updated weight parameters, enhancing transmission +efficiency and reducing the representational disparity between cloud and device +models. Extensive experiments on several multimodal benchmarks demonstrate the +superiority of our proposed framework over prior Knowledge Distillation and +device-cloud collaboration methods. Notably, we also validate the feasibility +of our approach to real-world experiments.",cs.CV,['cs.CV'] +Unlocking the Potential of Pre-trained Vision Transformers for Few-Shot Semantic Segmentation through Relationship Descriptors,Ziqin Zhou · Hai-Ming Xu · Yangyang Shu · Lingqiao Liu, ,https://arxiv.org/abs/2404.02117,,2404.02117.pdf,Pre-trained Vision and Language Transformers Are Few-Shot Incremental Learners,"Few-Shot Class Incremental Learning (FSCIL) is a task that requires a model +to learn new classes incrementally without forgetting when only a few samples +for each class are given. 
FSCIL encounters two significant challenges: +catastrophic forgetting and overfitting, and these challenges have driven prior +studies to primarily rely on shallow models, such as ResNet-18. Even though +their limited capacity can mitigate both forgetting and overfitting issues, it +leads to inadequate knowledge transfer during few-shot incremental sessions. In +this paper, we argue that large models such as vision and language transformers +pre-trained on large datasets can be excellent few-shot incremental learners. +To this end, we propose a novel FSCIL framework called PriViLege, Pre-trained +Vision and Language transformers with prompting functions and knowledge +distillation. Our framework effectively addresses the challenges of +catastrophic forgetting and overfitting in large models through new pre-trained +knowledge tuning (PKT) and two losses: entropy-based divergence loss and +semantic knowledge distillation loss. Experimental results show that the +proposed PriViLege significantly outperforms the existing state-of-the-art +methods with a large margin, e.g., +9.38% in CUB200, +20.58% in CIFAR-100, and ++13.36% in miniImageNet. Our implementation code is available at +https://github.com/KHU-AGI/PriViLege.",cs.CV,['cs.CV'] +LSK3DNet: Towards Effective and Efficient 3D Perception with Large Sparse Kernels,Tuo Feng · Wenguan Wang · Fan Ma · Yi Yang,https://github.com/FengZicai/LSK3DNet,https://arxiv.org/abs/2403.15173,,2403.15173.pdf,LSK3DNet: Towards Effective and Efficient 3D Perception with Large Sparse Kernels,"Autonomous systems need to process large-scale, sparse, and irregular point +clouds with limited compute resources. Consequently, it is essential to develop +LiDAR perception methods that are both efficient and effective. Although +naively enlarging 3D kernel size can enhance performance, it will also lead to +a cubically-increasing overhead. Therefore, it is crucial to develop +streamlined 3D large kernel designs that eliminate redundant weights and work +effectively with larger kernels. In this paper, we propose an efficient and +effective Large Sparse Kernel 3D Neural Network (LSK3DNet) that leverages +dynamic pruning to amplify the 3D kernel size. Our method comprises two core +components: Spatial-wise Dynamic Sparsity (SDS) and Channel-wise Weight +Selection (CWS). SDS dynamically prunes and regrows volumetric weights from the +beginning to learn a large sparse 3D kernel. It not only boosts performance but +also significantly reduces model size and computational cost. Moreover, CWS +selects the most important channels for 3D convolution during training and +subsequently prunes the redundant channels to accelerate inference for 3D +vision tasks. We demonstrate the effectiveness of LSK3DNet on three benchmark +datasets and five tracks compared with classical models and large kernel +designs. Notably, LSK3DNet achieves the state-of-the-art performance on +SemanticKITTI (i.e., 75.6% on single-scan and 63.4% on multi-scan), with +roughly 40% model size reduction and 60% computing operations reduction +compared to the naive large 3D kernel model.",cs.CV,['cs.CV'] +On the Robustness of Large Multimodal Models Against Image Adversarial Attacks,Xuanming Cui · Alejandro Aparcedo · Young Kyun Jang · Ser-Nam Lim, ,https://arxiv.org/abs/2312.03777,,2312.03777.pdf,On the Robustness of Large Multimodal Models Against Image Adversarial Attacks,"Recent advances in instruction tuning have led to the development of +State-of-the-Art Large Multimodal Models (LMMs). 
Given the novelty of these +models, the impact of visual adversarial attacks on LMMs has not been +thoroughly examined. We conduct a comprehensive study of the robustness of +various LMMs against different adversarial attacks, evaluated across tasks +including image classification, image captioning, and Visual Question Answer +(VQA). We find that in general LMMs are not robust to visual adversarial +inputs. However, our findings suggest that context provided to the model via +prompts, such as questions in a QA pair helps to mitigate the effects of visual +adversarial inputs. Notably, the LMMs evaluated demonstrated remarkable +resilience to such attacks on the ScienceQA task with only an 8.10% drop in +performance compared to their visual counterparts which dropped 99.73%. We also +propose a new approach to real-world image classification which we term query +decomposition. By incorporating existence queries into our input prompt we +observe diminished attack effectiveness and improvements in image +classification accuracy. This research highlights a previously under-explored +facet of LMM robustness and sets the stage for future work aimed at +strengthening the resilience of multimodal systems in adversarial environments.",cs.CV,['cs.CV'] +Amodal Ground Truth and Completion in the Wild,Guanqi Zhan · Chuanxia Zheng · Weidi Xie · Andrew Zisserman,https://www.robots.ox.ac.uk/~vgg/research/amodal/,https://arxiv.org/abs/2312.17247,,2312.17247.pdf,Amodal Ground Truth and Completion in the Wild,"This paper studies amodal image segmentation: predicting entire object +segmentation masks including both visible and invisible (occluded) parts. In +previous work, the amodal segmentation ground truth on real images is usually +predicted by manual annotaton and thus is subjective. In contrast, we use 3D +data to establish an automatic pipeline to determine authentic ground truth +amodal masks for partially occluded objects in real images. This pipeline is +used to construct an amodal completion evaluation benchmark, MP3D-Amodal, +consisting of a variety of object categories and labels. To better handle the +amodal completion task in the wild, we explore two architecture variants: a +two-stage model that first infers the occluder, followed by amodal mask +completion; and a one-stage model that exploits the representation power of +Stable Diffusion for amodal segmentation across many categories. Without bells +and whistles, our method achieves a new state-of-the-art performance on Amodal +segmentation datasets that cover a large variety of objects, including COCOA +and our new MP3D-Amodal dataset. The dataset, model, and code are available at +https://www.robots.ox.ac.uk/~vgg/research/amodal/.",cs.CV,['cs.CV'] +MGMap: Mask-Guided Learning for Online Vectorized HD Map Construction,Xiaolu Liu · Song Wang · Wentong Li · Ruizi Yang · Junbo Chen · Jianke Zhu,https://github.com/xiaolul2/MGMap,https://arxiv.org/abs/2404.00876,,2404.00876.pdf,MGMap: Mask-Guided Learning for Online Vectorized HD Map Construction,"Currently, high-definition (HD) map construction leans towards a lightweight +online generation tendency, which aims to preserve timely and reliable road +scene information. However, map elements contain strong shape priors. Subtle +and sparse annotations make current detection-based frameworks ambiguous in +locating relevant feature scopes and cause the loss of detailed structures in +prediction. 
To alleviate these problems, we propose MGMap, a mask-guided +approach that effectively highlights the informative regions and achieves +precise map element localization by introducing the learned masks. +Specifically, MGMap employs learned masks based on the enhanced multi-scale BEV +features from two perspectives. At the instance level, we propose the +Mask-activated instance (MAI) decoder, which incorporates global instance and +structural information into instance queries by the activation of instance +masks. At the point level, a novel position-guided mask patch refinement +(PG-MPR) module is designed to refine point locations from a finer-grained +perspective, enabling the extraction of point-specific patch information. +Compared to the baselines, our proposed MGMap achieves a notable improvement of +around 10 mAP for different input modalities. Extensive experiments also +demonstrate that our approach showcases strong robustness and generalization +capabilities. Our code can be found at https://github.com/xiaolul2/MGMap.",cs.CV,['cs.CV'] +SeNM-VAE: Semi-Supervised Noise Modeling with Hierarchical Variational Autoencoder,Dihan Zheng · Yihang Zou · Xiaowen Zhang · Chenglong Bao, ,https://arxiv.org/abs/2403.17502,,2403.17502.pdf,SeNM-VAE: Semi-Supervised Noise Modeling with Hierarchical Variational Autoencoder,"The data bottleneck has emerged as a fundamental challenge in learning based +image restoration methods. Researchers have attempted to generate synthesized +training data using paired or unpaired samples to address this challenge. This +study proposes SeNM-VAE, a semi-supervised noise modeling method that leverages +both paired and unpaired datasets to generate realistic degraded data. Our +approach is based on modeling the conditional distribution of degraded and +clean images with a specially designed graphical model. Under the variational +inference framework, we develop an objective function for handling both paired +and unpaired data. We employ our method to generate paired training samples for +real-world image denoising and super-resolution tasks. Our approach excels in +the quality of synthetic degraded images compared to other unpaired and paired +noise modeling methods. Furthermore, our approach demonstrates remarkable +performance in downstream image restoration tasks, even with limited paired +data. With more paired data, our method achieves the best performance on the +SIDD dataset.",cs.CV,['cs.CV'] +What Do You See in Vehicle? Comprehensive Vision Solution for In-Vehicle Gaze Estimation,Yihua Cheng · Yaning Zhu · Zongji Wang · hongquan hao · Liu wei · Shiqing Cheng · Xi Wang · Hyung Jin Chang,https://yihua.zone/work/ivgaze/,https://arxiv.org/abs/2403.15664,,2403.15664.pdf,What Do You See in Vehicle? Comprehensive Vision Solution for In-Vehicle Gaze Estimation,"Driver's eye gaze holds a wealth of cognitive and intentional cues crucial +for intelligent vehicles. Despite its significance, research on in-vehicle gaze +estimation remains limited due to the scarcity of comprehensive and +well-annotated datasets in real driving scenarios. In this paper, we present +three novel elements to advance in-vehicle gaze research. Firstly, we introduce +IVGaze, a pioneering dataset capturing in-vehicle gaze, collected from 125 +subjects and covering a large range of gaze and head poses within vehicles. +Conventional gaze collection systems are inadequate for in-vehicle use. 
In this +dataset, we propose a new vision-based solution for in-vehicle gaze collection, +introducing a refined gaze target calibration method to tackle annotation +challenges. Second, our research focuses on in-vehicle gaze estimation +leveraging the IVGaze. In-vehicle face images often suffer from low resolution, +prompting our introduction of a gaze pyramid transformer that leverages +transformer-based multilevel features integration. Expanding upon this, we +introduce the dual-stream gaze pyramid transformer (GazeDPTR). Employing +perspective transformation, we rotate virtual cameras to normalize images, +utilizing camera pose to merge normalized and original images for accurate gaze +estimation. GazeDPTR shows state-of-the-art performance on the IVGaze dataset. +Thirdly, we explore a novel strategy for gaze zone classification by extending +the GazeDPTR. A foundational tri-plane and project gaze onto these planes are +newly defined. Leveraging both positional features from the projection points +and visual attributes from images, we achieve superior performance compared to +relying solely on visual features, substantiating the advantage of gaze +estimation. Our project is available at https://yihua.zone/work/ivgaze.",cs.CV,['cs.CV'] +LangSplat: 3D Language Gaussian Splatting,Minghan Qin · Wanhua Li · Jiawei ZHOU · Haoqian Wang · Hanspeter Pfister,https://langsplat.github.io/,https://arxiv.org/abs/2312.16084,,2312.16084.pdf,LangSplat: 3D Language Gaussian Splatting,"Humans live in a 3D world and commonly use natural language to interact with +a 3D scene. Modeling a 3D language field to support open-ended language queries +in 3D has gained increasing attention recently. This paper introduces +LangSplat, which constructs a 3D language field that enables precise and +efficient open-vocabulary querying within 3D spaces. Unlike existing methods +that ground CLIP language embeddings in a NeRF model, LangSplat advances the +field by utilizing a collection of 3D Gaussians, each encoding language +features distilled from CLIP, to represent the language field. By employing a +tile-based splatting technique for rendering language features, we circumvent +the costly rendering process inherent in NeRF. Instead of directly learning +CLIP embeddings, LangSplat first trains a scene-wise language autoencoder and +then learns language features on the scene-specific latent space, thereby +alleviating substantial memory demands imposed by explicit modeling. Existing +methods struggle with imprecise and vague 3D language fields, which fail to +discern clear boundaries between objects. We delve into this issue and propose +to learn hierarchical semantics using SAM, thereby eliminating the need for +extensively querying the language field across various scales and the +regularization of DINO features. Extensive experimental results show that +LangSplat significantly outperforms the previous state-of-the-art method LERF +by a large margin. Notably, LangSplat is extremely efficient, achieving a 199 +$\times$ speedup compared to LERF at the resolution of 1440 $\times$ 1080. 
We +strongly recommend readers to check out our video results at +https://langsplat.github.io/",cs.CV,['cs.CV'] +DGC-GNN: Leveraging Geometry and Color Cues for Visual Descriptor-Free 2D-3D Matching,Shuzhe Wang · Juho Kannala · Daniel Barath, ,https://arxiv.org/abs/2306.12547,,2306.12547.pdf,DGC-GNN: Leveraging Geometry and Color Cues for Visual Descriptor-Free 2D-3D Matching,"Matching 2D keypoints in an image to a sparse 3D point cloud of the scene +without requiring visual descriptors has garnered increased interest due to its +low memory requirements, inherent privacy preservation, and reduced need for +expensive 3D model maintenance compared to visual descriptor-based methods. +However, existing algorithms often compromise on performance, resulting in a +significant deterioration compared to their descriptor-based counterparts. In +this paper, we introduce DGC-GNN, a novel algorithm that employs a +global-to-local Graph Neural Network (GNN) that progressively exploits +geometric and color cues to represent keypoints, thereby improving matching +accuracy. Our procedure encodes both Euclidean and angular relations at a +coarse level, forming the geometric embedding to guide the point matching. We +evaluate DGC-GNN on both indoor and outdoor datasets, demonstrating that it not +only doubles the accuracy of the state-of-the-art visual descriptor-free +algorithm but also substantially narrows the performance gap between +descriptor-based and descriptor-free methods.",cs.CV,['cs.CV'] +DiffForensics: Leveraging Diffusion Prior to Image Forgery Detection and Localization,Zeqin Yu · Jiangqun Ni · Yuzhen Lin · Haoyi Deng · Bin Li, ,https://arxiv.org/abs/2401.15859,,2401.15859.pdf,Diffusion Facial Forgery Detection,"Detecting diffusion-generated images has recently grown into an emerging +research area. Existing diffusion-based datasets predominantly focus on general +image generation. However, facial forgeries, which pose a more severe social +risk, have remained less explored thus far. To address this gap, this paper +introduces DiFF, a comprehensive dataset dedicated to face-focused +diffusion-generated images. DiFF comprises over 500,000 images that are +synthesized using thirteen distinct generation methods under four conditions. +In particular, this dataset leverages 30,000 carefully collected textual and +visual prompts, ensuring the synthesis of images with both high fidelity and +semantic consistency. We conduct extensive experiments on the DiFF dataset via +a human test and several representative forgery detection methods. The results +demonstrate that the binary detection accuracy of both human observers and +automated detectors often falls below 30%, shedding light on the challenges in +detecting diffusion-generated facial forgeries. 
Furthermore, we propose an edge +graph regularization approach to effectively enhance the generalization +capability of existing detectors.",cs.CV,"['cs.CV', 'cs.AI']" +Affine Equivariant Networks Based on Differential Invariants,Yikang Li · Yeqing Qiu · Yuxuan Chen · Lingshen He · Zhouchen Lin, ,,https://www.semanticscholar.org/paper/Lie-Group-Decompositions-for-Equivariant-Neural-Mironenco-Forr'e/5302620834b3969b11097f66375cadbf9ee9c817,,,,,nan +EgoExoLearn: A Dataset for Bridging Asynchronous Ego- and Exo-centric View of Procedural Activities in Real World,Yifei Huang · Guo Chen · Jilan Xu · Mingfang Zhang · Lijin Yang · Baoqi Pei · Hongjie Zhang · Lu Dong · Yali Wang · Limin Wang · Yu Qiao,https://github.com/OpenGVLab/EgoExoLearn,https://arxiv.org/abs/2403.16182,,2403.16182.pdf,EgoExoLearn: A Dataset for Bridging Asynchronous Ego- and Exo-centric View of Procedural Activities in Real World,"Being able to map the activities of others into one's own point of view is +one fundamental human skill even from a very early age. Taking a step toward +understanding this human ability, we introduce EgoExoLearn, a large-scale +dataset that emulates the human demonstration following process, in which +individuals record egocentric videos as they execute tasks guided by +demonstration videos. Focusing on the potential applications in daily +assistance and professional support, EgoExoLearn contains egocentric and +demonstration video data spanning 120 hours captured in daily life scenarios +and specialized laboratories. Along with the videos we record high-quality gaze +data and provide detailed multimodal annotations, formulating a playground for +modeling the human ability to bridge asynchronous procedural actions from +different viewpoints. To this end, we present benchmarks such as cross-view +association, cross-view action planning, and cross-view referenced skill +assessment, along with detailed analysis. We expect EgoExoLearn can serve as an +important resource for bridging the actions across views, thus paving the way +for creating AI agents capable of seamlessly learning by observing humans in +the real world. Code and data can be found at: +https://github.com/OpenGVLab/EgoExoLearn",cs.CV,['cs.CV'] +Learning from Observer Gaze: Zero-shot Attention Prediction Oriented by Human-Object Interaction Recognition,Yuchen Zhou · Linkai Liu · Chao Gou,https://yuchen2199.github.io/Interactive-Gaze/,https://arxiv.org/abs/2405.09931,,2405.09931.pdf,Learning from Observer Gaze:Zero-Shot Attention Prediction Oriented by Human-Object Interaction Recognition,"Most existing attention prediction research focuses on salient instances like +humans and objects. However, the more complex interaction-oriented attention, +arising from the comprehension of interactions between instances by human +observers, remains largely unexplored. This is equally crucial for advancing +human-machine interaction and human-centered artificial intelligence. To bridge +this gap, we first collect a novel gaze fixation dataset named IG, comprising +530,000 fixation points across 740 diverse interaction categories, capturing +visual attention during human observers cognitive processes of interactions. +Subsequently, we introduce the zero-shot interaction-oriented attention +prediction task ZeroIA, which challenges models to predict visual cues for +interactions not encountered during training. Thirdly, we present the +Interactive Attention model IA, designed to emulate human observers cognitive +processes to tackle the ZeroIA problem. 
Extensive experiments demonstrate that +the proposed IA outperforms other state-of-the-art approaches in both ZeroIA +and fully supervised settings. Lastly, we endeavor to apply +interaction-oriented attention to the interaction recognition task itself. +Further experimental results demonstrate the promising potential to enhance the +performance and interpretability of existing state-of-the-art HOI models by +incorporating real human attention data from IG and attention labels generated +by IA.",cs.CV,['cs.CV'] +EFHQ: Multi-purpose ExtremePose-Face-HQ dataset,Trung Dao · Duc H Vu · Cuong Pham · Anh Tran,https://bomcon123456.github.io/efhq/,https://arxiv.org/abs/2312.17205,,2312.17205.pdf,EFHQ: Multi-purpose ExtremePose-Face-HQ dataset,"The existing facial datasets, while having plentiful images at near frontal +views, lack images with extreme head poses, leading to the downgraded +performance of deep learning models when dealing with profile or pitched faces. +This work aims to address this gap by introducing a novel dataset named Extreme +Pose Face High-Quality Dataset (EFHQ), which includes a maximum of 450k +high-quality images of faces at extreme poses. To produce such a massive +dataset, we utilize a novel and meticulous dataset processing pipeline to +curate two publicly available datasets, VFHQ and CelebV-HQ, which contain many +high-resolution face videos captured in various settings. Our dataset can +complement existing datasets on various facial-related tasks, such as facial +synthesis with 2D/3D-aware GAN, diffusion-based text-to-image face generation, +and face reenactment. Specifically, training with EFHQ helps models generalize +well across diverse poses, significantly improving performance in scenarios +involving extreme views, confirmed by extensive experiments. Additionally, we +utilize EFHQ to define a challenging cross-view face verification benchmark, in +which the performance of SOTA face recognition models drops 5-37% compared to +frontal-to-frontal scenarios, aiming to stimulate studies on face recognition +under severe pose conditions in the wild.",cs.CV,['cs.CV'] +Gear-NeRF: Free-Viewpoint Rendering and Tracking with Motion-aware Spatio-Temporal Sampling,Xinhang Liu · Yu-Wing Tai · Chi-Keung Tang · Pedro Miraldo · Suhas Lohit · Moitreya Chatterjee, ,https://arxiv.org/abs/2405.06214,,2405.06214.pdf,Aerial-NeRF: Adaptive Spatial Partitioning and Sampling for Large-Scale Aerial Rendering,"Recent progress in large-scale scene rendering has yielded Neural Radiance +Fields (NeRF)-based models with an impressive ability to synthesize scenes +across small objects and indoor scenes. Nevertheless, extending this idea to +large-scale aerial rendering poses two critical problems. Firstly, a single +NeRF cannot render the entire scene with high-precision for complex large-scale +aerial datasets since the sampling range along each view ray is insufficient to +cover buildings adequately. Secondly, traditional NeRFs are infeasible to train +on one GPU to enable interactive fly-throughs for modeling massive images. +Instead, existing methods typically separate the whole scene into multiple +regions and train a NeRF on each region, which are unaccustomed to different +flight trajectories and difficult to achieve fast rendering. 
To that end, we +propose Aerial-NeRF with three innovative modifications for jointly adapting +NeRF in large-scale aerial rendering: (1) Designing an adaptive spatial +partitioning and selection method based on drones' poses to adapt different +flight trajectories; (2) Using similarity of poses instead of (expert) network +for rendering speedup to determine which region a new viewpoint belongs to; (3) +Developing an adaptive sampling approach for rendering performance improvement +to cover the entire buildings at different heights. Extensive experiments have +conducted to verify the effectiveness and efficiency of Aerial-NeRF, and new +state-of-the-art results have been achieved on two public large-scale aerial +datasets and presented SCUTic dataset. Note that our model allows us to perform +rendering over 4 times as fast as compared to multiple competitors. Our +dataset, code, and model are publicly available at https://drliuqi.github.io/.",cs.CV,['cs.CV'] +PIA: Your Personalized Image Animator via Plug-and-Play Modules in Text-to-Image Models,Yiming Zhang · Zhening Xing · Yanhong Zeng · Youqing Fang · Kai Chen, ,https://arxiv.org/abs/2312.13964,,2312.13964.pdf,PIA: Your Personalized Image Animator via Plug-and-Play Modules in Text-to-Image Models,"Recent advancements in personalized text-to-image (T2I) models have +revolutionized content creation, empowering non-experts to generate stunning +images with unique styles. While promising, adding realistic motions into these +personalized images by text poses significant challenges in preserving distinct +styles, high-fidelity details, and achieving motion controllability by text. In +this paper, we present PIA, a Personalized Image Animator that excels in +aligning with condition images, achieving motion controllability by text, and +the compatibility with various personalized T2I models without specific tuning. +To achieve these goals, PIA builds upon a base T2I model with well-trained +temporal alignment layers, allowing for the seamless transformation of any +personalized T2I model into an image animation model. A key component of PIA is +the introduction of the condition module, which utilizes the condition frame +and inter-frame affinity as input to transfer appearance information guided by +the affinity hint for individual frame synthesis in the latent space. This +design mitigates the challenges of appearance-related image alignment within +and allows for a stronger focus on aligning with motion-related guidance.",cs.CV,"['cs.CV', 'cs.AI']" +Weakly Supervised Video Individual Counting,Xinyan Liu · Guorong Li · Yuankai Qi · Ziheng Yan · Zhenjun Han · Anton van den Hengel · Ming-Hsuan Yang · Qingming Huang, ,https://arxiv.org/abs/2312.05923,,2312.05923.pdf,Weakly Supervised Video Individual CountingWeakly Supervised Video Individual Counting,"Video Individual Counting (VIC) aims to predict the number of unique +individuals in a single video. % Existing methods learn representations based +on trajectory labels for individuals, which are annotation-expensive. % To +provide a more realistic reflection of the underlying practical challenge, we +introduce a weakly supervised VIC task, wherein trajectory labels are not +provided. Instead, two types of labels are provided to indicate traffic +entering the field of view (inflow) and leaving the field view (outflow). % We +also propose the first solution as a baseline that formulates the task as a +weakly supervised contrastive learning problem under group-level matching. 
In +doing so, we devise an end-to-end trainable soft contrastive loss to drive the +network to distinguish inflow, outflow, and the remaining. % To facilitate +future study in this direction, we generate annotations from the existing VIC +datasets SenseCrowd and CroHD and also build a new dataset, UAVVIC. % Extensive +results show that our baseline weakly supervised method outperforms supervised +methods, and thus, little information is lost in the transition to the more +practically relevant weakly supervised task. The code and trained model will be +public at \href{https://github.com/streamer-AP/CGNet}{CGNet}",cs.CV,['cs.CV'] +Model Inversion Robustness: Can Transfer Learning Help?,Sy-Tuyen Ho · Koh Jun Hao · Keshigeyan Chandrasegaran · Ngoc-Bao Nguyen · Ngai-Man Cheung, ,https://arxiv.org/abs/2405.05588,,2405.05588.pdf,Model Inversion Robustness: Can Transfer Learning Help?,"Model Inversion (MI) attacks aim to reconstruct private training data by +abusing access to machine learning models. Contemporary MI attacks have +achieved impressive attack performance, posing serious threats to privacy. +Meanwhile, all existing MI defense methods rely on regularization that is in +direct conflict with the training objective, resulting in noticeable +degradation in model utility. In this work, we take a different perspective, +and propose a novel and simple Transfer Learning-based Defense against Model +Inversion (TL-DMI) to render MI-robust models. Particularly, by leveraging TL, +we limit the number of layers encoding sensitive information from private +training dataset, thereby degrading the performance of MI attack. We conduct an +analysis using Fisher Information to justify our method. Our defense is +remarkably simple to implement. Without bells and whistles, we show in +extensive experiments that TL-DMI achieves state-of-the-art (SOTA) MI +robustness. Our code, pre-trained models, demo and inverted data are available +at: https://hosytuyen.github.io/projects/TL-DMI",cs.LG,"['cs.LG', 'cs.CR', 'cs.CV']" +$M^3$-UDA: A New Benchmark for Unsupervised Domain Adaptive Fetal Cardiac Structure Detection,Bin Pu · Liwen Wang · Jiewen Yang · He Guannan · Xingbo Dong · Shengli Li · Ying Tan · Ming Chen · Zhe Jin · Kenli Li · Xiaomeng Li, ,https://arxiv.org/abs/2310.14172,,2310.14172.pdf,ASC: Appearance and Structure Consistency for Unsupervised Domain Adaptation in Fetal Brain MRI Segmentation,"Automatic tissue segmentation of fetal brain images is essential for the +quantitative analysis of prenatal neurodevelopment. However, producing +voxel-level annotations of fetal brain imaging is time-consuming and expensive. +To reduce labeling costs, we propose a practical unsupervised domain adaptation +(UDA) setting that adapts the segmentation labels of high-quality fetal brain +atlases to unlabeled fetal brain MRI data from another domain. To address the +task, we propose a new UDA framework based on Appearance and Structure +Consistency, named ASC. We adapt the segmentation model to the appearances of +different domains by constraining the consistency before and after a +frequency-based image transformation, which is to swap the appearance between +brain MRI data and atlases. Consider that even in the same domain, the fetal +brain images of different gestational ages could have significant variations in +the anatomical structures. To make the model adapt to the structural variations +in the target domain, we further encourage prediction consistency under +different structural perturbations. 
Extensive experiments on FeTA 2021 +benchmark demonstrate the effectiveness of our ASC in comparison to +registration-based, semi-supervised learning-based, and existing UDA-based +methods.",eess.IV,"['eess.IV', 'cs.CV']" +A noisy elephant in the room: Is your out-of-distribution detector robust to label noise?,Galadrielle Humblot-Renaux · Sergio Escalera · Thomas B. Moeslund, ,https://arxiv.org/abs/2404.01775,,2404.01775.pdf,A noisy elephant in the room: Is your out-of-distribution detector robust to label noise?,"The ability to detect unfamiliar or unexpected images is essential for safe +deployment of computer vision systems. In the context of classification, the +task of detecting images outside of a model's training domain is known as +out-of-distribution (OOD) detection. While there has been a growing research +interest in developing post-hoc OOD detection methods, there has been +comparably little discussion around how these methods perform when the +underlying classifier is not trained on a clean, carefully curated dataset. In +this work, we take a closer look at 20 state-of-the-art OOD detection methods +in the (more realistic) scenario where the labels used to train the underlying +classifier are unreliable (e.g. crowd-sourced or web-scraped labels). Extensive +experiments across different datasets, noise types & levels, architectures and +checkpointing strategies provide insights into the effect of class label noise +on OOD detection, and show that poor separation between incorrectly classified +ID samples vs. OOD samples is an overlooked yet important limitation of +existing methods. Code: https://github.com/glhr/ood-labelnoise",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +HHMR: Holistic Hand Mesh Recovery by Enhancing the Multimodal Controllability of Graph Diffusion Models,Mengcheng Li · Hongwen Zhang · Yuxiang Zhang · Ruizhi Shao · Tao Yu · Yebin Liu,https://www.liuyebin.com/HHMR/HHMR.html,https://arxiv.org/abs/2402.14654,,2402.14654.pdf,Multi-HMR: Multi-Person Whole-Body Human Mesh Recovery in a Single Shot,"We present Multi-HMR, a strong single-shot model for multi-person 3D human +mesh recovery from a single RGB image. Predictions encompass the whole body, +i.e, including hands and facial expressions, using the SMPL-X parametric model +and spatial location in the camera coordinate system. Our model detects people +by predicting coarse 2D heatmaps of person centers, using features produced by +a standard Vision Transformer (ViT) backbone. It then predicts their whole-body +pose, shape and spatial location using a new cross-attention module called the +Human Prediction Head (HPH), with one query per detected center token, +attending to the entire set of features. As direct prediction of SMPL-X +parameters yields suboptimal results, we introduce CUFFS; the Close-Up Frames +of Full-Body Subjects dataset, containing humans close to the camera with +diverse hand poses. We show that incorporating this dataset into training +further enhances predictions, particularly for hands, enabling us to achieve +state-of-the-art performance. Multi-HMR also optionally accounts for camera +intrinsics, if available, by encoding camera ray directions for each image +token. This simple design achieves strong performance on whole-body and +body-only benchmarks simultaneously. We train models with various backbone +sizes and input resolutions. 
In particular, using a ViT-S backbone and +$448\times448$ input images already yields a fast and competitive model with +respect to state-of-the-art methods, while considering larger models and higher +resolutions further improve performance.",cs.CV,['cs.CV'] +C$^\text{2}$RV: Cross-Regional and Cross-View Learning for Sparse-View CBCT Reconstruction,Yiqun Lin · Jiewen Yang · hualiang wang · Xinpeng Ding · Wei Zhao · Xiaomeng Li,https://github.com/xmed-lab/C2RV-CBCT,https://arxiv.org/abs/2312.01689,,2312.01689.pdf,Fast and accurate sparse-view CBCT reconstruction using meta-learned neural attenuation field and hash-encoding regularization,"Cone beam computed tomography (CBCT) is an emerging medical imaging technique +to visualize the internal anatomical structures of patients. During a CBCT +scan, several projection images of different angles or views are collectively +utilized to reconstruct a tomographic image. However, reducing the number of +projections in a CBCT scan while preserving the quality of a reconstructed +image is challenging due to the nature of an ill-posed inverse problem. +Recently, a neural attenuation field (NAF) method was proposed by adopting a +neural radiance field algorithm as a new way for CBCT reconstruction, +demonstrating fast and promising results using only 50 views. However, +decreasing the number of projections is still preferable to reduce potential +radiation exposure, and a faster reconstruction time is required considering a +typical scan time. In this work, we propose a fast and accurate sparse-view +CBCT reconstruction (FACT) method to provide better reconstruction quality and +faster optimization speed in the minimal number of view acquisitions ($<$ 50 +views). In the FACT method, we meta-trained a neural network and a hash-encoder +using a few scans (= 15), and a new regularization technique is utilized to +reconstruct the details of an anatomical structure. In conclusion, we have +shown that the FACT method produced better, and faster reconstruction results +over the other conventional algorithms based on CBCT scans of different body +parts (chest, head, and abdomen) and CT vendors (Siemens, Phillips, and GE).",eess.IV,"['eess.IV', 'cs.CV']" +Explaining the Implicit Neural Canvas: Connecting Pixels to Neurons by Tracing their Contributions,Namitha Padmanabhan · Matthew A Gwilliam · Pulkit Kumar · Shishira R Maiya · Max Ehrlich · Abhinav Shrivastava,https://namithap10.github.io/xinc/,https://arxiv.org/abs/2401.10217,,2401.10217.pdf,Explaining the Implicit Neural Canvas: Connecting Pixels to Neurons by Tracing their Contributions,"The many variations of Implicit Neural Representations (INRs), where a neural +network is trained as a continuous representation of a signal, have tremendous +practical utility for downstream tasks including novel view synthesis, video +compression, and image superresolution. Unfortunately, the inner workings of +these networks are seriously under-studied. Our work, eXplaining the Implicit +Neural Canvas (XINC), is a unified framework for explaining properties of INRs +by examining the strength of each neuron's contribution to each output pixel. +We call the aggregate of these contribution maps the Implicit Neural Canvas and +we use this concept to demonstrate that the INRs which we study learn to +''see'' the frames they represent in surprising ways. For example, INRs tend to +have highly distributed representations. 
While lacking high-level object +semantics, they have a significant bias for color and edges, and are almost +entirely space-agnostic. We arrive at our conclusions by examining how objects +are represented across time in video INRs, using clustering to visualize +similar neurons across layers and architectures, and show that this is +dominated by motion. These insights demonstrate the general usefulness of our +analysis framework. Our project page is available at +https://namithap10.github.io/xinc.",cs.CV,['cs.CV'] +Posterior Distillation Sampling,Juil Koo · Chanho Park · Minhyuk Sung,https://posterior-distillation-sampling.github.io/,https://arxiv.org/abs/2311.13831,,2311.13831.pdf,Posterior Distillation Sampling,"We introduce Posterior Distillation Sampling (PDS), a novel optimization +method for parametric image editing based on diffusion models. Existing +optimization-based methods, which leverage the powerful 2D prior of diffusion +models to handle various parametric images, have mainly focused on generation. +Unlike generation, editing requires a balance between conforming to the target +attribute and preserving the identity of the source content. Recent 2D image +editing methods have achieved this balance by leveraging the stochastic latent +encoded in the generative process of diffusion models. To extend the editing +capabilities of diffusion models shown in pixel space to parameter space, we +reformulate the 2D image editing method into an optimization form named PDS. +PDS matches the stochastic latents of the source and the target, enabling the +sampling of targets in diverse parameter spaces that align with a desired +attribute while maintaining the source's identity. We demonstrate that this +optimization resembles running a generative process with the target attribute, +but aligning this process with the trajectory of the source's generative +process. Extensive editing results in Neural Radiance Fields and Scalable +Vector Graphics representations demonstrate that PDS is capable of sampling +targets to fulfill the aforementioned balance across various parameter spaces.",cs.CV,['cs.CV'] +Mixed-Precision Quantization for Federated Learning on Resource-Constrained Heterogeneous Devices,Huancheng Chen · Haris Vikalo, ,https://arxiv.org/abs/2311.18129,,2311.18129.pdf,Mixed-Precision Quantization for Federated Learning on Resource-Constrained Heterogeneous Devices,"While federated learning (FL) systems often utilize quantization to battle +communication and computational bottlenecks, they have heretofore been limited +to deploying fixed-precision quantization schemes. Meanwhile, the concept of +mixed-precision quantization (MPQ), where different layers of a deep learning +model are assigned varying bit-width, remains unexplored in the FL settings. We +present a novel FL algorithm, FedMPQ, which introduces mixed-precision +quantization to resource-heterogeneous FL systems. Specifically, local models, +quantized so as to satisfy bit-width constraint, are trained by optimizing an +objective function that includes a regularization term which promotes reduction +of precision in some of the layers without significant performance degradation. +The server collects local model updates, de-quantizes them into full-precision +models, and then aggregates them into a global model. To initialize the next +round of local training, the server relies on the information learned in the +previous training round to customize bit-width assignments of the models +delivered to different clients. 
In extensive benchmarking experiments on +several model architectures and different datasets in both iid and non-iid +settings, FedMPQ outperformed the baseline FL schemes that utilize +fixed-precision quantization while incurring only a minor computational +overhead on the participating devices.",cs.LG,"['cs.LG', 'cs.DC']" +Coherent Temporal Synthesis for Incremental Action Segmentation,Guodong Ding · Hans Golong · Angela Yao,https://guodongding.cn/projects/itas/itas.html,https://arxiv.org/abs/2403.06102,,2403.06102.pdf,Coherent Temporal Synthesis for Incremental Action Segmentation,"Data replay is a successful incremental learning technique for images. It +prevents catastrophic forgetting by keeping a reservoir of previous data, +original or synthesized, to ensure the model retains past knowledge while +adapting to novel concepts. However, its application in the video domain is +rudimentary, as it simply stores frame exemplars for action recognition. This +paper presents the first exploration of video data replay techniques for +incremental action segmentation, focusing on action temporal modeling. We +propose a Temporally Coherent Action (TCA) model, which represents actions +using a generative model instead of storing individual frames. The integration +of a conditioning variable that captures temporal coherence allows our model to +understand the evolution of action features over time. Therefore, action +segments generated by TCA for replay are diverse and temporally coherent. In a +10-task incremental setup on the Breakfast dataset, our approach achieves +significant increases in accuracy for up to 22% compared to the baselines.",cs.CV,['cs.CV'] +GLACE: Global Local Accelerated Coordinate Encoding,Fangjinhua Wang · Xudong Jiang · Silvano Galliani · Christoph Vogel · Marc Pollefeys, ,,https://ieeexplore.ieee.org/document/10204902/figures,,,,,nan +Text-Guided Variational Image Generation for Industrial Anomaly Detection and Segmentation,Mingyu Lee · Jongwon Choi,https://github.com/MingyuLee82/TGI_AD_v1,https://arxiv.org/abs/2403.06247,,2403.06247.pdf,Text-Guided Variational Image Generation for Industrial Anomaly Detection and Segmentation,"We propose a text-guided variational image generation method to address the +challenge of getting clean data for anomaly detection in industrial +manufacturing. Our method utilizes text information about the target object, +learned from extensive text library documents, to generate non-defective data +images resembling the input image. The proposed framework ensures that the +generated non-defective images align with anticipated distributions derived +from textual and image-based knowledge, ensuring stability and generality. +Experimental results demonstrate the effectiveness of our approach, surpassing +previous methods even with limited non-defective data. Our approach is +validated through generalization tests across four baseline models and three +distinct datasets. 
We present an additional analysis to enhance the +effectiveness of anomaly detection models by utilizing the generated images.",cs.CV,"['cs.CV', 'cs.AI']" +Generate Like Experts: Multi-Stage Font Generation by Incorporating Font Transfer Process into Diffusion Models,Bin Fu · Fanghua Yu · Anran Liu · Zixuan Wang · Jie Wen · Junjun He · Yu Qiao, ,https://arxiv.org/abs/2312.12142,,2312.12142.pdf,FontDiffuser: One-Shot Font Generation via Denoising Diffusion with Multi-Scale Content Aggregation and Style Contrastive Learning,"Automatic font generation is an imitation task, which aims to create a font +library that mimics the style of reference images while preserving the content +from source images. Although existing font generation methods have achieved +satisfactory performance, they still struggle with complex characters and large +style variations. To address these issues, we propose FontDiffuser, a +diffusion-based image-to-image one-shot font generation method, which +innovatively models the font imitation task as a noise-to-denoise paradigm. In +our method, we introduce a Multi-scale Content Aggregation (MCA) block, which +effectively combines global and local content cues across different scales, +leading to enhanced preservation of intricate strokes of complex characters. +Moreover, to better manage the large variations in style transfer, we propose a +Style Contrastive Refinement (SCR) module, which is a novel structure for style +representation learning. It utilizes a style extractor to disentangle styles +from images, subsequently supervising the diffusion model via a meticulously +designed style contrastive loss. Extensive experiments demonstrate +FontDiffuser's state-of-the-art performance in generating diverse characters +and styles. It consistently excels on complex characters and large style +changes compared to previous methods. The code is available at +https://github.com/yeungchenwa/FontDiffuser.",cs.CV,"['cs.CV', 'cs.AI']" +PTQ4SAM: Post-Training Quantization for Segment Anything,Chengtao Lv · Hong Chen · Jinyang Guo · Yifu Ding · Xianglong Liu, ,https://arxiv.org/abs/2405.03144,,2405.03144.pdf,PTQ4SAM: Post-Training Quantization for Segment Anything,"Segment Anything Model (SAM) has achieved impressive performance in many +computer vision tasks. However, as a large-scale model, the immense memory and +computation costs hinder its practical deployment. In this paper, we propose a +post-training quantization (PTQ) framework for Segment Anything Model, namely +PTQ4SAM. First, we investigate the inherent bottleneck of SAM quantization +attributed to the bimodal distribution in post-Key-Linear activations. We +analyze its characteristics from both per-tensor and per-channel perspectives, +and propose a Bimodal Integration strategy, which utilizes a mathematically +equivalent sign operation to transform the bimodal distribution into a +relatively easy-quantized normal distribution offline. Second, SAM encompasses +diverse attention mechanisms (i.e., self-attention and two-way +cross-attention), resulting in substantial variations in the post-Softmax +distributions. Therefore, we introduce an Adaptive Granularity Quantization for +Softmax through searching the optimal power-of-two base, which is +hardware-friendly. Extensive experimental results across various vision tasks +(instance segmentation, semantic segmentation and object detection), datasets +and model variants show the superiority of PTQ4SAM. 
For example, when +quantizing SAM-L to 6-bit, we achieve lossless accuracy for instance +segmentation, about 0.5\% drop with theoretical 3.9$\times$ acceleration. The +code is available at \url{https://github.com/chengtao-lv/PTQ4SAM}.",cs.CV,"['cs.CV', 'cs.LG']" +Visual Prompting for Generalized Few-shot Segmentation: A Multi-scale Approach,Mir Hossain Hossain · Mennatullah Siam · Leonid Sigal · Jim Little, ,https://arxiv.org/abs/2404.11732,,2404.11732.pdf,Visual Prompting for Generalized Few-shot Segmentation: A Multi-scale Approach,"The emergence of attention-based transformer models has led to their +extensive use in various tasks, due to their superior generalization and +transfer properties. Recent research has demonstrated that such models, when +prompted appropriately, are excellent for few-shot inference. However, such +techniques are under-explored for dense prediction tasks like semantic +segmentation. In this work, we examine the effectiveness of prompting a +transformer-decoder with learned visual prompts for the generalized few-shot +segmentation (GFSS) task. Our goal is to achieve strong performance not only on +novel categories with limited examples, but also to retain performance on base +categories. We propose an approach to learn visual prompts with limited +examples. These learned visual prompts are used to prompt a multiscale +transformer decoder to facilitate accurate dense predictions. Additionally, we +introduce a unidirectional causal attention mechanism between the novel +prompts, learned with limited examples, and the base prompts, learned with +abundant data. This mechanism enriches the novel prompts without deteriorating +the base class performance. Overall, this form of prompting helps us achieve +state-of-the-art performance for GFSS on two different benchmark datasets: +COCO-$20^i$ and Pascal-$5^i$, without the need for test-time optimization (or +transduction). Furthermore, test-time optimization leveraging unlabelled test +data can be used to improve the prompts, which we refer to as transductive +prompt tuning.",cs.CV,['cs.CV'] +Precise Image Editing via Recognition and Generation Tasks,Shelly Sheynin · Adam Polyak · Uriel Singer · Yuval Kirstain · Amit Zohar · Oron Ashual · Devi Parikh · Yaniv Taigman,https://emu-edit.metademolab.com/,https://arxiv.org/abs/2311.10089,,2311.10089.pdf,Emu Edit: Precise Image Editing via Recognition and Generation Tasks,"Instruction-based image editing holds immense potential for a variety of +applications, as it enables users to perform any editing operation using a +natural language instruction. However, current models in this domain often +struggle with accurately executing user instructions. We present Emu Edit, a +multi-task image editing model which sets state-of-the-art results in +instruction-based image editing. To develop Emu Edit we train it to multi-task +across an unprecedented range of tasks, such as region-based editing, free-form +editing, and Computer Vision tasks, all of which are formulated as generative +tasks. Additionally, to enhance Emu Edit's multi-task learning abilities, we +provide it with learned task embeddings which guide the generation process +towards the correct edit type. Both these elements are essential for Emu Edit's +outstanding performance. Furthermore, we show that Emu Edit can generalize to +new tasks, such as image inpainting, super-resolution, and compositions of +editing tasks, with just a few labeled examples. 
This capability offers a +significant advantage in scenarios where high-quality samples are scarce. +Lastly, to facilitate a more rigorous and informed assessment of instructable +image editing models, we release a new challenging and versatile benchmark that +includes seven different image editing tasks.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +UniVS: Unified and Universal Video Segmentation with Prompts as Queries,Minghan LI · Shuai Li · Xindong Zhang · Lei Zhang, ,https://arxiv.org/abs/2402.18115,,2402.18115.pdf,UniVS: Unified and Universal Video Segmentation with Prompts as Queries,"Despite the recent advances in unified image segmentation (IS), developing a +unified video segmentation (VS) model remains a challenge. This is mainly +because generic category-specified VS tasks need to detect all objects and +track them across consecutive frames, while prompt-guided VS tasks require +re-identifying the target with visual/text prompts throughout the entire video, +making it hard to handle the different tasks with the same architecture. We +make an attempt to address these issues and present a novel unified VS +architecture, namely UniVS, by using prompts as queries. UniVS averages the +prompt features of the target from previous frames as its initial query to +explicitly decode masks, and introduces a target-wise prompt cross-attention +layer in the mask decoder to integrate prompt features in the memory pool. By +taking the predicted masks of entities from previous frames as their visual +prompts, UniVS converts different VS tasks into prompt-guided target +segmentation, eliminating the heuristic inter-frame matching process. Our +framework not only unifies the different VS tasks but also naturally achieves +universal training and testing, ensuring robust performance across different +scenarios. UniVS shows a commendable balance between performance and +universality on 10 challenging VS benchmarks, covering video instance, +semantic, panoptic, object, and referring segmentation tasks. Code can be found +at \url{https://github.com/MinghanLi/UniVS}.",cs.CV,"['cs.CV', 'cs.CL']" +A-Teacher: Asymmetric Network for 3D Semi-Supervised Object Detection,Hanshi Wang · Zhipeng Zhang · Jin Gao · Weiming Hu, ,https://arxiv.org/abs/2401.05011,,2401.05011.pdf,Dual-Perspective Knowledge Enrichment for Semi-Supervised 3D Object Detection,"Semi-supervised 3D object detection is a promising yet under-explored +direction to reduce data annotation costs, especially for cluttered indoor +scenes. A few prior works, such as SESS and 3DIoUMatch, attempt to solve this +task by utilizing a teacher model to generate pseudo-labels for unlabeled +samples. However, the availability of unlabeled samples in the 3D domain is +relatively limited compared to its 2D counterpart due to the greater effort +required to collect 3D data. Moreover, the loose consistency regularization in +SESS and restricted pseudo-label selection strategy in 3DIoUMatch lead to +either low-quality supervision or a limited amount of pseudo labels. To address +these issues, we present a novel Dual-Perspective Knowledge Enrichment approach +named DPKE for semi-supervised 3D object detection. Our DPKE enriches the +knowledge of limited training data, particularly unlabeled data, from two +perspectives: data-perspective and feature-perspective. 
Specifically, from the +data-perspective, we propose a class-probabilistic data augmentation method +that augments the input data with additional instances based on the varying +distribution of class probabilities. Our DPKE achieves feature-perspective +knowledge enrichment by designing a geometry-aware feature matching method that +regularizes feature-level similarity between object proposals from the student +and teacher models. Extensive experiments on the two benchmark datasets +demonstrate that our DPKE achieves superior performance over existing +state-of-the-art approaches under various label ratio conditions. The source +code will be made available to the public.",cs.CV,['cs.CV'] +MRFS: Mutually Reinforcing Image Fusion and Segmentation,HAO ZHANG · Xuhui Zuo · Jie Jiang · Chunchao Guo · Jiayi Ma, ,,https://ojs.aaai.org/index.php/AAAI/article/view/28536,,,,,nan +OmniSeg3D: Omniversal 3D Segmentation via Hierarchical Contrastive Learning,Haiyang Ying · Yixuan Yin · Jinzhi Zhang · Fan Wang · Tao Yu · Ruqi Huang · Lu Fang,https://oceanying.github.io/OmniSeg3D/,https://arxiv.org/abs/2311.11666,,2311.11666.pdf,OmniSeg3D: Omniversal 3D Segmentation via Hierarchical Contrastive Learning,"Towards holistic understanding of 3D scenes, a general 3D segmentation method +is needed that can segment diverse objects without restrictions on object +quantity or categories, while also reflecting the inherent hierarchical +structure. To achieve this, we propose OmniSeg3D, an omniversal segmentation +method aims for segmenting anything in 3D all at once. The key insight is to +lift multi-view inconsistent 2D segmentations into a consistent 3D feature +field through a hierarchical contrastive learning framework, which is +accomplished by two steps. Firstly, we design a novel hierarchical +representation based on category-agnostic 2D segmentations to model the +multi-level relationship among pixels. Secondly, image features rendered from +the 3D feature field are clustered at different levels, which can be further +drawn closer or pushed apart according to the hierarchical relationship between +different levels. In tackling the challenges posed by inconsistent 2D +segmentations, this framework yields a global consistent 3D feature field, +which further enables hierarchical segmentation, multi-object selection, and +global discretization. Extensive experiments demonstrate the effectiveness of +our method on high-quality 3D segmentation and accurate hierarchical structure +understanding. A graphical user interface further facilitates flexible +interaction for omniversal 3D segmentation.",cs.CV,['cs.CV'] +Improving Subject-Driven Image Synthesis with Subject-Agnostic Guidance,Kelvin C.K. Chan · Yang Zhao · Xuhui Jia · Ming-Hsuan Yang · Huisheng Wang, ,https://arxiv.org/abs/2405.01356,,2405.01356.pdf,Improving Subject-Driven Image Synthesis with Subject-Agnostic Guidance,"In subject-driven text-to-image synthesis, the synthesis process tends to be +heavily influenced by the reference images provided by users, often overlooking +crucial attributes detailed in the text prompt. In this work, we propose +Subject-Agnostic Guidance (SAG), a simple yet effective solution to remedy the +problem. We show that through constructing a subject-agnostic condition and +applying our proposed dual classifier-free guidance, one could obtain outputs +consistent with both the given subject and input text prompts. We validate the +efficacy of our approach through both optimization-based and encoder-based +methods. 
Additionally, we demonstrate its applicability in second-order +customization methods, where an encoder-based model is fine-tuned with +DreamBooth. Our approach is conceptually simple and requires only minimal code +modifications, but leads to substantial quality improvements, as evidenced by +our evaluations and user studies.",cs.CV,['cs.CV'] +DIBS: Enhancing Dense Video Captioning with Unlabeled Videos via Pseudo Boundary Enrichment and Online Refinement,Hao Wu · Huabin Liu · Yu Qiao · Xiao Sun, ,https://arxiv.org/abs/2404.02755,,2404.02755.pdf,DIBS: Enhancing Dense Video Captioning with Unlabeled Videos via Pseudo Boundary Enrichment and Online Refinement,"We present Dive Into the BoundarieS (DIBS), a novel pretraining framework for +dense video captioning (DVC), that elaborates on improving the quality of the +generated event captions and their associated pseudo event boundaries from +unlabeled videos. By leveraging the capabilities of diverse large language +models (LLMs), we generate rich DVC-oriented caption candidates and optimize +the corresponding pseudo boundaries under several meticulously designed +objectives, considering diversity, event-centricity, temporal ordering, and +coherence. Moreover, we further introduce a novel online boundary refinement +strategy that iteratively improves the quality of pseudo boundaries during +training. Comprehensive experiments have been conducted to examine the +effectiveness of the proposed technique components. By leveraging a substantial +amount of unlabeled video data, such as HowTo100M, we achieve a remarkable +advancement on standard DVC datasets like YouCook2 and ActivityNet. We +outperform the previous state-of-the-art Vid2Seq across a majority of metrics, +achieving this with just 0.4% of the unlabeled video data used for pre-training +by Vid2Seq.",cs.CV,"['cs.CV', 'cs.AI', 'cs.MM']" +AdaBM: On-the-Fly Adaptive Bit Mapping for Image Super-Resolution,Cheeun Hong · Kyoung Mu Lee, ,https://arxiv.org/abs/2404.03296,,2404.03296.pdf,AdaBM: On-the-Fly Adaptive Bit Mapping for Image Super-Resolution,"Although image super-resolution (SR) problem has experienced unprecedented +restoration accuracy with deep neural networks, it has yet limited versatile +applications due to the substantial computational costs. Since different input +images for SR face different restoration difficulties, adapting computational +costs based on the input image, referred to as adaptive inference, has emerged +as a promising solution to compress SR networks. Specifically, adapting the +quantization bit-widths has successfully reduced the inference and memory cost +without sacrificing the accuracy. However, despite the benefits of the +resultant adaptive network, existing works rely on time-intensive +quantization-aware training with full access to the original training pairs to +learn the appropriate bit allocation policies, which limits its ubiquitous +usage. To this end, we introduce the first on-the-fly adaptive quantization +framework that accelerates the processing time from hours to seconds. We +formulate the bit allocation problem with only two bit mapping modules: one to +map the input image to the image-wise bit adaptation factor and one to obtain +the layer-wise adaptation factors. These bit mappings are calibrated and +fine-tuned using only a small number of calibration images. We achieve +competitive performance with the previous adaptive quantization methods, while +the processing time is accelerated by x2000. 
Codes are available at +https://github.com/Cheeun/AdaBM.",cs.CV,"['cs.CV', 'eess.IV']" +Residual Denoising Diffusion Models,Jiawei Liu · Qiang Wang · Huijie Fan · Yinong Wang · Yandong Tang · Liangqiong Qu,https://github.com/nachifur/RDDM,https://arxiv.org/abs/2308.13712,,,Residual Denoising Diffusion Models,"We propose residual denoising diffusion models (RDDM), a novel dual diffusion +process that decouples the traditional single denoising diffusion process into +residual diffusion and noise diffusion. This dual diffusion framework expands +the denoising-based diffusion models, initially uninterpretable for image +restoration, into a unified and interpretable model for both image generation +and restoration by introducing residuals. Specifically, our residual diffusion +represents directional diffusion from the target image to the degraded input +image and explicitly guides the reverse generation process for image +restoration, while noise diffusion represents random perturbations in the +diffusion process. The residual prioritizes certainty, while the noise +emphasizes diversity, enabling RDDM to effectively unify tasks with varying +certainty or diversity requirements, such as image generation and restoration. +We demonstrate that our sampling process is consistent with that of DDPM and +DDIM through coefficient transformation, and propose a partially +path-independent generation process to better understand the reverse process. +Notably, our RDDM enables a generic UNet, trained with only an L1 loss and a +batch size of 1, to compete with state-of-the-art image restoration methods. We +provide code and pre-trained models to encourage further exploration, +application, and development of our innovative framework +(https://github.com/nachifur/RDDM).",cs.CV,"['cs.CV', 'cs.LG']" +Navigate Beyond Shortcuts: Debiased Learning through the Lens of Neural Collapse,Yining Wang · Junjie Sun · Chenyue Wang · Mi Zhang · Min Yang, ,https://arxiv.org/abs/2405.05587,,2405.05587.pdf,Navigate Beyond Shortcuts: Debiased Learning through the Lens of Neural Collapse,"Recent studies have noted an intriguing phenomenon termed Neural Collapse, +that is, when the neural networks establish the right correlation between +feature spaces and the training targets, their last-layer features, together +with the classifier weights, will collapse into a stable and symmetric +structure. In this paper, we extend the investigation of Neural Collapse to the +biased datasets with imbalanced attributes. We observe that models will easily +fall into the pitfall of shortcut learning and form a biased, non-collapsed +feature space at the early period of training, which is hard to reverse and +limits the generalization capability. To tackle the root cause of biased +classification, we follow the recent inspiration of prime training, and propose +an avoid-shortcut learning framework without additional training complexity. +With well-designed shortcut primes based on Neural Collapse structure, the +models are encouraged to skip the pursuit of simple shortcuts and naturally +capture the intrinsic correlations. 
Experimental results demonstrate that our +method induces better convergence properties during training, and achieves +state-of-the-art generalization performance on both synthetic and real-world +biased datasets.",cs.CV,"['cs.CV', 'cs.LG']" +TIGER: Time-Varying Denoising Model for 3D Point Cloud Generation with Diffusion Process,Zhiyuan Ren · Minchul Kim · Feng Liu · Xiaoming Liu, ,,https://link.springer.com/article/10.1007/s00371-024-03370-x,,,,,nan +IMPRINT: Generative Object Compositing by Learning Identity-Preserving Representation,Yizhi Song · Zhifei Zhang · Zhe Lin · Scott Cohen · Brian Price · Jianming Zhang · Soo Ye Kim · He Zhang · Wei Xiong · Daniel Aliaga,https://song630.github.io/IMPRINT-Project-Page/,https://arxiv.org/abs/2403.10701,,2403.10701.pdf,IMPRINT: Generative Object Compositing by Learning Identity-Preserving Representation,"Generative object compositing emerges as a promising new avenue for +compositional image editing. However, the requirement of object identity +preservation poses a significant challenge, limiting practical usage of most +existing methods. In response, this paper introduces IMPRINT, a novel +diffusion-based generative model trained with a two-stage learning framework +that decouples learning of identity preservation from that of compositing. The +first stage is targeted for context-agnostic, identity-preserving pretraining +of the object encoder, enabling the encoder to learn an embedding that is both +view-invariant and conducive to enhanced detail preservation. The subsequent +stage leverages this representation to learn seamless harmonization of the +object composited to the background. In addition, IMPRINT incorporates a +shape-guidance mechanism offering user-directed control over the compositing +process. Extensive experiments demonstrate that IMPRINT significantly +outperforms existing methods and various baselines on identity preservation and +composition quality.",cs.CV,['cs.CV'] +Differentiable Micro-Mesh Construction,Yishun Dou · Zhong Zheng · Qiaoqiao Jin · Rui Shi · Yuhan Li · Bingbing Ni, ,http://export.arxiv.org/abs/2310.08332v1,,2310.08332v1.pdf,Real-Time Neural BRDF with Spherically Distributed Primitives,"We propose a novel compact and efficient neural BRDF offering highly +versatile material representation, yet with very-light memory and neural +computation consumption towards achieving real-time rendering. The results in +Figure 1, rendered at full HD resolution on a current desktop machine, show +that our system achieves real-time rendering with a wide variety of +appearances, which is approached by the following two designs. On the one hand, +noting that bidirectional reflectance is distributed in a very sparse +high-dimensional subspace, we propose to project the BRDF into two +low-dimensional components, i.e., two hemisphere feature-grids for incoming and +outgoing directions, respectively. On the other hand, learnable neural +reflectance primitives are distributed on our highly-tailored spherical surface +grid, which offer informative features for each component and alleviate the +conventional heavy feature learning network to a much smaller one, leading to +very fast evaluation. These primitives are centrally stored in a codebook and +can be shared across multiple grids and even across materials, based on the +low-cost indices stored in material-specific spherical surface grids. 
Our +neural BRDF, which is agnostic to the material, provides a unified framework +that can represent a variety of materials in consistent manner. Comprehensive +experimental results on measured BRDF compression, Monte Carlo simulated BRDF +acceleration, and extension to spatially varying effect demonstrate the +superior quality and generalizability achieved by the proposed scheme.",cs.CV,['cs.CV'] +FreGS: 3D Gaussian Splatting with Progressive Frequency Regularization,Jiahui Zhang · Fangneng Zhan · MUYU XU · Shijian Lu · Eric P. Xing, ,https://arxiv.org/abs/2403.06908v1,,2403.06908v1.pdf,FreGS: 3D Gaussian Splatting with Progressive Frequency Regularization,"3D Gaussian splatting has achieved very impressive performance in real-time +novel view synthesis. However, it often suffers from over-reconstruction during +Gaussian densification where high-variance image regions are covered by a few +large Gaussians only, leading to blur and artifacts in the rendered images. We +design a progressive frequency regularization (FreGS) technique to tackle the +over-reconstruction issue within the frequency space. Specifically, FreGS +performs coarse-to-fine Gaussian densification by exploiting low-to-high +frequency components that can be easily extracted with low-pass and high-pass +filters in the Fourier space. By minimizing the discrepancy between the +frequency spectrum of the rendered image and the corresponding ground truth, it +achieves high-quality Gaussian densification and alleviates the +over-reconstruction of Gaussian splatting effectively. Experiments over +multiple widely adopted benchmarks (e.g., Mip-NeRF360, Tanks-and-Temples and +Deep Blending) show that FreGS achieves superior novel view synthesis and +outperforms the state-of-the-art consistently.",cs.CV,['cs.CV'] +Parameter Efficient Self-Supervised Geospatial Domain Adaptation,Linus Scheibenreif · Michael Mommert · Damian Borth, ,https://arxiv.org/abs/2312.13066,,2312.13066.pdf,PPEA-Depth: Progressive Parameter-Efficient Adaptation for Self-Supervised Monocular Depth Estimation,"Self-supervised monocular depth estimation is of significant importance with +applications spanning across autonomous driving and robotics. However, the +reliance on self-supervision introduces a strong static-scene assumption, +thereby posing challenges in achieving optimal performance in dynamic scenes, +which are prevalent in most real-world situations. To address these issues, we +propose PPEA-Depth, a Progressive Parameter-Efficient Adaptation approach to +transfer a pre-trained image model for self-supervised depth estimation. The +training comprises two sequential stages: an initial phase trained on a dataset +primarily composed of static scenes, succeeded by an expansion to more +intricate datasets involving dynamic scenes. To facilitate this process, we +design compact encoder and decoder adapters to enable parameter-efficient +tuning, allowing the network to adapt effectively. They not only uphold +generalized patterns from pre-trained image models but also retain knowledge +gained from the preceding phase into the subsequent one. 
Extensive experiments +demonstrate that PPEA-Depth achieves state-of-the-art performance on KITTI, +CityScapes and DDAD datasets.",cs.CV,['cs.CV'] +Question Aware Vision Transformer for Multimodal Reasoning,Roy Ganz · Yair Kittenplon · Aviad Aberdam · Elad Ben Avraham · Oren Nuriel · Shai Mazor · Ron Litman, ,https://arxiv.org/abs/2402.05472,,2402.05472.pdf,Question Aware Vision Transformer for Multimodal Reasoning,"Vision-Language (VL) models have gained significant research focus, enabling +remarkable advances in multimodal reasoning. These architectures typically +comprise a vision encoder, a Large Language Model (LLM), and a projection +module that aligns visual features with the LLM's representation space. Despite +their success, a critical limitation persists: the vision encoding process +remains decoupled from user queries, often in the form of image-related +questions. Consequently, the resulting visual features may not be optimally +attuned to the query-specific elements of the image. To address this, we +introduce QA-ViT, a Question Aware Vision Transformer approach for multimodal +reasoning, which embeds question awareness directly within the vision encoder. +This integration results in dynamic visual features focusing on relevant image +aspects to the posed question. QA-ViT is model-agnostic and can be incorporated +efficiently into any VL architecture. Extensive experiments demonstrate the +effectiveness of applying our method to various multimodal architectures, +leading to consistent improvement across diverse tasks and showcasing its +potential for enhancing visual and scene-text understanding.",cs.CV,['cs.CV'] +Real-Time Neural BRDF with Spherically Distributed Primitives,Yishun Dou · Zhong Zheng · Qiaoqiao Jin · Bingbing Ni · Yugang Chen · Junxiang Ke, ,https://arxiv.org/abs/2310.08332,,2310.08332.pdf,Real-Time Neural BRDF with Spherically Distributed Primitives,"We propose a novel compact and efficient neural BRDF offering highly +versatile material representation, yet with very-light memory and neural +computation consumption towards achieving real-time rendering. The results in +Figure 1, rendered at full HD resolution on a current desktop machine, show +that our system achieves real-time rendering with a wide variety of +appearances, which is approached by the following two designs. On the one hand, +noting that bidirectional reflectance is distributed in a very sparse +high-dimensional subspace, we propose to project the BRDF into two +low-dimensional components, i.e., two hemisphere feature-grids for incoming and +outgoing directions, respectively. On the other hand, learnable neural +reflectance primitives are distributed on our highly-tailored spherical surface +grid, which offer informative features for each component and alleviate the +conventional heavy feature learning network to a much smaller one, leading to +very fast evaluation. These primitives are centrally stored in a codebook and +can be shared across multiple grids and even across materials, based on the +low-cost indices stored in material-specific spherical surface grids. Our +neural BRDF, which is agnostic to the material, provides a unified framework +that can represent a variety of materials in consistent manner. 
Comprehensive +experimental results on measured BRDF compression, Monte Carlo simulated BRDF +acceleration, and extension to spatially varying effect demonstrate the +superior quality and generalizability achieved by the proposed scheme.",cs.CV,['cs.CV'] +Dispel Darkness for Better Fusion: A Controllable Visual Enhancer based on Cross-modal Conditional Adversarial Learning,HAO ZHANG · Linfeng Tang · Xinyu Xiang · Xuhui Zuo · Jiayi Ma, ,,https://github.com/HaoZhang1018/DDBF,,,,,nan +HPNet: Dynamic Trajectory Forecasting with Historical Prediction Attention,Xiaolong Tang · Meina Kan · Shiguang Shan · Zhilong Ji · Jinfeng Bai · Xilin Chen, ,https://arxiv.org/abs/2404.06351,,2404.06351.pdf,HPNet: Dynamic Trajectory Forecasting with Historical Prediction Attention,"Predicting the trajectories of road agents is essential for autonomous +driving systems. The recent mainstream methods follow a static paradigm, which +predicts the future trajectory by using a fixed duration of historical frames. +These methods make the predictions independently even at adjacent time steps, +which leads to potential instability and temporal inconsistency. As successive +time steps have largely overlapping historical frames, their forecasting should +have intrinsic correlation, such as overlapping predicted trajectories should +be consistent, or be different but share the same motion goal depending on the +road situation. Motivated by this, in this work, we introduce HPNet, a novel +dynamic trajectory forecasting method. Aiming for stable and accurate +trajectory forecasting, our method leverages not only historical frames +including maps and agent states, but also historical predictions. Specifically, +we newly design a Historical Prediction Attention module to automatically +encode the dynamic relationship between successive predictions. Besides, it +also extends the attention range beyond the currently visible window +benefitting from the use of historical predictions. The proposed Historical +Prediction Attention together with the Agent Attention and Mode Attention is +further formulated as the Triple Factorized Attention module, serving as the +core design of HPNet.Experiments on the Argoverse and INTERACTION datasets show +that HPNet achieves state-of-the-art performance, and generates accurate and +stable future trajectories. Our code are available at +https://github.com/XiaolongTang23/HPNet.",cs.CV,['cs.CV'] +Scene-adaptive and Region-aware Multi-modal Prompt for Open Vocabulary Object Detection,Xiaowei Zhao · Xianglong Liu · Duorui Wang · Yajun Gao · Zhide Liu, ,https://arxiv.org/abs/2306.05493,,,Multi-Modal Classifiers for Open-Vocabulary Object Detection,"The goal of this paper is open-vocabulary object detection (OVOD) +$\unicode{x2013}$ building a model that can detect objects beyond the set of +categories seen at training, thus enabling the user to specify categories of +interest at inference without the need for model retraining. We adopt a +standard two-stage object detector architecture, and explore three ways for +specifying novel categories: via language descriptions, via image exemplars, or +via a combination of the two. 
We make three contributions: first, we prompt a +large language model (LLM) to generate informative language descriptions for +object classes, and construct powerful text-based classifiers; second, we +employ a visual aggregator on image exemplars that can ingest any number of +images as input, forming vision-based classifiers; and third, we provide a +simple method to fuse information from language descriptions and image +exemplars, yielding a multi-modal classifier. When evaluating on the +challenging LVIS open-vocabulary benchmark we demonstrate that: (i) our +text-based classifiers outperform all previous OVOD works; (ii) our +vision-based classifiers perform as well as text-based classifiers in prior +work; (iii) using multi-modal classifiers perform better than either modality +alone; and finally, (iv) our text-based and multi-modal classifiers yield +better performance than a fully-supervised detector.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG', 'I.4.6; I.4.8; I.4.9; I.2.10']" +CAPE: CAM as a Probabilistic Ensemble for Enhanced DNN Interpretation,Townim Chowdhury · Kewen Liao · Vu Minh Hieu Phan · Minh-Son To · Yutong Xie · Kevin Hung · David Ross · Anton van den Hengel · Johan Verjans · Zhibin Liao, ,https://arxiv.org/abs/2404.02388,,2404.02388.pdf,CAPE: CAM as a Probabilistic Ensemble for Enhanced DNN Interpretation,"Deep Neural Networks (DNNs) are widely used for visual classification tasks, +but their complex computation process and black-box nature hinder decision +transparency and interpretability. Class activation maps (CAMs) and recent +variants provide ways to visually explain the DNN decision-making process by +displaying 'attention' heatmaps of the DNNs. Nevertheless, the CAM explanation +only offers relative attention information, that is, on an attention heatmap, +we can interpret which image region is more or less important than the others. +However, these regions cannot be meaningfully compared across classes, and the +contribution of each region to the model's class prediction is not revealed. To +address these challenges that ultimately lead to better DNN Interpretation, in +this paper, we propose CAPE, a novel reformulation of CAM that provides a +unified and probabilistically meaningful assessment of the contributions of +image regions. We quantitatively and qualitatively compare CAPE with +state-of-the-art CAM methods on CUB and ImageNet benchmark datasets to +demonstrate enhanced interpretability. We also test on a cytology imaging +dataset depicting a challenging Chronic Myelomonocytic Leukemia (CMML) +diagnosis problem. Code is available at: https://github.com/AIML-MED/CAPE.",cs.CV,['cs.CV'] +Focus on Hiders: Exploring Hidden Threats for Enhancing Adversarial Training,Qian Li · Yuxiao Hu · Yinpeng Dong · Dongxiao Zhang · Yuntian Chen, ,https://arxiv.org/abs/2312.07067,,2312.07067.pdf,Focus on Hiders: Exploring Hidden Threats for Enhancing Adversarial Training,"Adversarial training is often formulated as a min-max problem, however, +concentrating only on the worst adversarial examples causes alternating +repetitive confusion of the model, i.e., previously defended or correctly +classified samples are not defensible or accurately classifiable in subsequent +adversarial training. We characterize such non-ignorable samples as ""hiders"", +which reveal the hidden high-risk regions within the secure area obtained +through adversarial training and prevent the model from finding the real worst +cases. 
We demand the model to prevent hiders when defending against adversarial +examples for improving accuracy and robustness simultaneously. By rethinking +and redefining the min-max optimization problem for adversarial training, we +propose a generalized adversarial training algorithm called Hider-Focused +Adversarial Training (HFAT). HFAT introduces the iterative evolution +optimization strategy to simplify the optimization problem and employs an +auxiliary model to reveal hiders, effectively combining the optimization +directions of standard adversarial training and prevention hiders. Furthermore, +we introduce an adaptive weighting mechanism that facilitates the model in +adaptively adjusting its focus between adversarial examples and hiders during +different training periods. We demonstrate the effectiveness of our method +based on extensive experiments, and ensure that HFAT can provide higher +robustness and accuracy.",cs.LG,"['cs.LG', 'cs.CR', 'cs.CV', 'stat.AP']" +Multi-Space Alignments Towards Universal LiDAR Segmentation,Youquan Liu · Lingdong Kong · Xiaoyang Wu · Runnan Chen · Xin Li · Liang Pan · Ziwei Liu · Yuexin Ma, ,https://arxiv.org/abs/2405.01538,,2405.01538.pdf,Multi-Space Alignments Towards Universal LiDAR Segmentation,"A unified and versatile LiDAR segmentation model with strong robustness and +generalizability is desirable for safe autonomous driving perception. This work +presents M3Net, a one-of-a-kind framework for fulfilling multi-task, +multi-dataset, multi-modality LiDAR segmentation in a universal manner using +just a single set of parameters. To better exploit data volume and diversity, +we first combine large-scale driving datasets acquired by different types of +sensors from diverse scenes and then conduct alignments in three spaces, namely +data, feature, and label spaces, during the training. As a result, M3Net is +capable of taming heterogeneous data for training state-of-the-art LiDAR +segmentation models. Extensive experiments on twelve LiDAR segmentation +datasets verify our effectiveness. Notably, using a shared set of parameters, +M3Net achieves 75.1%, 83.1%, and 72.4% mIoU scores, respectively, on the +official benchmarks of SemanticKITTI, nuScenes, and Waymo Open.",cs.CV,"['cs.CV', 'cs.LG', 'cs.RO']" +Fast ODE-based Sampling for Diffusion Models in Around 5 Steps,Zhenyu Zhou · Defang Chen · Can Wang · Chun Chen, ,https://arxiv.org/abs/2312.00094,,2312.00094.pdf,Fast ODE-based Sampling for Diffusion Models in Around 5 Steps,"Sampling from diffusion models can be treated as solving the corresponding +ordinary differential equations (ODEs), with the aim of obtaining an accurate +solution with as few number of function evaluations (NFE) as possible. +Recently, various fast samplers utilizing higher-order ODE solvers have emerged +and achieved better performance than the initial first-order one. However, +these numerical methods inherently result in certain approximation errors, +which significantly degrades sample quality with extremely small NFE (e.g., +around 5). In contrast, based on the geometric observation that each sampling +trajectory almost lies in a two-dimensional subspace embedded in the ambient +space, we propose Approximate MEan-Direction Solver (AMED-Solver) that +eliminates truncation errors by directly learning the mean direction for fast +diffusion sampling. Besides, our method can be easily used as a plugin to +further improve existing ODE-based samplers. 
Extensive experiments on image +synthesis with the resolution ranging from 32 to 512 demonstrate the +effectiveness of our method. With only 5 NFE, we achieve 6.61 FID on CIFAR-10, +10.74 FID on ImageNet 64$\times$64, and 13.20 FID on LSUN Bedroom. Our code is +available at https://github.com/zju-pi/diff-sampler.",cs.CV,"['cs.CV', 'cs.AI']" +OpenESS: Event-based Semantic Scene Understanding with Open Vocabularies,Lingdong Kong · Youquan Liu · Lai Xing Ng · Benoit Cottereau · Wei Tsang Ooi,https://github.com/ldkong1205/OpenESS,http://export.arxiv.org/abs/2405.05259,,2405.05259.pdf,OpenESS: Event-based Semantic Scene Understanding with Open Vocabularies,"Event-based semantic segmentation (ESS) is a fundamental yet challenging task +for event camera sensing. The difficulties in interpreting and annotating event +data limit its scalability. While domain adaptation from images to event data +can help to mitigate this issue, there exist data representational differences +that require additional effort to resolve. In this work, for the first time, we +synergize information from image, text, and event-data domains and introduce +OpenESS to enable scalable ESS in an open-world, annotation-efficient manner. +We achieve this goal by transferring the semantically rich CLIP knowledge from +image-text pairs to event streams. To pursue better cross-modality adaptation, +we propose a frame-to-event contrastive distillation and a text-to-event +semantic consistency regularization. Experimental results on popular ESS +benchmarks showed our approach outperforms existing methods. Notably, we +achieve 53.93% and 43.31% mIoU on DDD17 and DSEC-Semantic without using either +event or frame labels.",cs.CV,"['cs.CV', 'cs.RO']" +Consistent3D: Towards Consistent High-Fidelity Text-to-3D Generation with Deterministic Sampling Prior,Zike Wu · Pan Zhou · YI Xuanyu · Xiaoding Yuan · Hanwang Zhang, ,,https://paperswithcode.com/paper/consistent3d-towards-consistent-high-fidelity,,,,,nan +VMINer: Versatile Multi-view Inverse Rendering with Near- and Far-field Light Sources,Fan Fei · Jiajun Tang · Ping Tan · Boxin Shi,https://costrice.github.io/vminer/,https://arxiv.org/abs/2402.06136,,2402.06136.pdf,SIR: Multi-view Inverse Rendering with Decomposable Shadow for Indoor Scenes,"We propose SIR, an efficient method to decompose differentiable shadows for +inverse rendering on indoor scenes using multi-view data, addressing the +challenges in accurately decomposing the materials and lighting conditions. +Unlike previous methods that struggle with shadow fidelity in complex lighting +environments, our approach explicitly learns shadows for enhanced realism in +material estimation under unknown light positions. Utilizing posed HDR images +as input, SIR employs an SDF-based neural radiance field for comprehensive +scene representation. Then, SIR integrates a shadow term with a three-stage +material estimation approach to improve SVBRDF quality. Specifically, SIR is +designed to learn a differentiable shadow, complemented by BRDF regularization, +to optimize inverse rendering accuracy. Extensive experiments on both synthetic +and real-world indoor scenes demonstrate the superior performance of SIR over +existing methods in both quantitative metrics and qualitative analysis. The +significant decomposing ability of SIR enables sophisticated editing +capabilities like free-view relighting, object insertion, and material +replacement. 
The code and data are available at +https://xiaokangwei.github.io/SIR/.",cs.CV,['cs.CV'] +Weak-to-Strong 3D Object Detection with X-Ray Distillation,Alexander Gambashidze · Aleksandr Dadukin · Maksim Golyadkin · Maria Razzhivina · Ilya Makarov, ,https://arxiv.org/abs/2404.00679,,2404.00679.pdf,Weak-to-Strong 3D Object Detection with X-Ray Distillation,"This paper addresses the critical challenges of sparsity and occlusion in +LiDAR-based 3D object detection. Current methods often rely on supplementary +modules or specific architectural designs, potentially limiting their +applicability to new and evolving architectures. To our knowledge, we are the +first to propose a versatile technique that seamlessly integrates into any +existing framework for 3D Object Detection, marking the first instance of +Weak-to-Strong generalization in 3D computer vision. We introduce a novel +framework, X-Ray Distillation with Object-Complete Frames, suitable for both +supervised and semi-supervised settings, that leverages the temporal aspect of +point cloud sequences. This method extracts crucial information from both +previous and subsequent LiDAR frames, creating Object-Complete frames that +represent objects from multiple viewpoints, thus addressing occlusion and +sparsity. Given the limitation of not being able to generate Object-Complete +frames during online inference, we utilize Knowledge Distillation within a +Teacher-Student framework. This technique encourages the strong Student model +to emulate the behavior of the weaker Teacher, which processes simple and +informative Object-Complete frames, effectively offering a comprehensive view +of objects as if seen through X-ray vision. Our proposed methods surpass +state-of-the-art in semi-supervised learning by 1-1.5 mAP and enhance the +performance of five established supervised models by 1-2 mAP on standard +autonomous driving datasets, even with default hyperparameters. Code for +Object-Complete frames is available here: +https://github.com/sakharok13/X-Ray-Teacher-Patching-Tools.",cs.CV,['cs.CV'] +AnySkill: Learning Open-Vocabulary Physical Skill for Interactive Agents,Jieming Cui · Tengyu Liu · Nian Liu · Yaodong Yang · Yixin Zhu · Siyuan Huang, ,https://arxiv.org/abs/2403.12835,,2403.12835.pdf,AnySkill: Learning Open-Vocabulary Physical Skill for Interactive Agents,"Traditional approaches in physics-based motion generation, centered around +imitation learning and reward shaping, often struggle to adapt to new +scenarios. To tackle this limitation, we propose AnySkill, a novel hierarchical +method that learns physically plausible interactions following open-vocabulary +instructions. Our approach begins by developing a set of atomic actions via a +low-level controller trained via imitation learning. Upon receiving an +open-vocabulary textual instruction, AnySkill employs a high-level policy that +selects and integrates these atomic actions to maximize the CLIP similarity +between the agent's rendered images and the text. An important feature of our +method is the use of image-based rewards for the high-level policy, which +allows the agent to learn interactions with objects without manual reward +engineering. 
We demonstrate AnySkill's capability to generate realistic and +natural motion sequences in response to unseen instructions of varying lengths, +marking it the first method capable of open-vocabulary physical skill learning +for interactive humanoid agents.",cs.CV,"['cs.CV', 'cs.RO']" +Learning Continuous 3D Words for Text-to-Image Generation,Ta-Ying Cheng · Matheus Gadelha · Thibault Groueix · Matthew Fisher · Radomir Mech · Andrew Markham · Niki Trigoni,https://ttchengab.github.io/continuous_3d_words/,https://arxiv.org/abs/2402.08654,,2402.08654.pdf,Learning Continuous 3D Words for Text-to-Image Generation,"Current controls over diffusion models (e.g., through text or ControlNet) for +image generation fall short in recognizing abstract, continuous attributes like +illumination direction or non-rigid shape change. In this paper, we present an +approach for allowing users of text-to-image models to have fine-grained +control of several attributes in an image. We do this by engineering special +sets of input tokens that can be transformed in a continuous manner -- we call +them Continuous 3D Words. These attributes can, for example, be represented as +sliders and applied jointly with text prompts for fine-grained control over +image generation. Given only a single mesh and a rendering engine, we show that +our approach can be adopted to provide continuous user control over several +3D-aware attributes, including time-of-day illumination, bird wing orientation, +dollyzoom effect, and object poses. Our method is capable of conditioning image +creation with multiple Continuous 3D Words and text descriptions simultaneously +while adding no overhead to the generative process. Project Page: +https://ttchengab.github.io/continuous_3d_words",cs.CV,['cs.CV'] +Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model,Xu He · Qiaochu Huang · Zhensong Zhang · Zhiwei Lin · Zhiyong Wu · Sicheng Yang · Minglei Li · Zhiyi Chen · Songcen Xu · Xiaofei Wu, ,https://arxiv.org/abs/2404.01862v1,,2404.01862v1.pdf,Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model,"Co-speech gestures, if presented in the lively form of videos, can achieve +superior visual effects in human-machine interaction. While previous works +mostly generate structural human skeletons, resulting in the omission of +appearance information, we focus on the direct generation of audio-driven +co-speech gesture videos in this work. There are two main challenges: 1) A +suitable motion feature is needed to describe complex human movements with +crucial appearance information. 2) Gestures and speech exhibit inherent +dependencies and should be temporally aligned even of arbitrary length. To +solve these problems, we present a novel motion-decoupled framework to generate +co-speech gesture videos. Specifically, we first introduce a well-designed +nonlinear TPS transformation to obtain latent motion features preserving +essential appearance information. Then a transformer-based diffusion model is +proposed to learn the temporal correlation between gestures and speech, and +performs generation in the latent motion space, followed by an optimal motion +selection module to produce long-term coherent and consistent gesture videos. +For better visual perception, we further design a refinement network focusing +on missing details of certain areas. Extensive experimental results show that +our proposed framework significantly outperforms existing approaches in both +motion and video-related evaluations. 
Our code, demos, and more resources are +available at https://github.com/thuhcsi/S2G-MDDiffusion.",cs.CV,"['cs.CV', 'cs.HC', 'cs.MM']" +Leveraging Camera Triplets for Efficient and Accurate Structure-from-Motion,Lalit Manam · Venu Madhav Govindu,https://ee.iisc.ac.in/cvlab/research/camtripsfm/,,,,,,,nan +TextCraftor: Your Text Encoder Can be Image Quality Controller,Yanyu Li · Xian Liu · Anil Kag · Ju Hu · Yerlan Idelbayev · Dhritiman Sagar · Yanzhi Wang · Sergey Tulyakov · Jian Ren, ,https://arxiv.org/abs/2403.18978,,2403.18978.pdf,TextCraftor: Your Text Encoder Can be Image Quality Controller,"Diffusion-based text-to-image generative models, e.g., Stable Diffusion, have +revolutionized the field of content generation, enabling significant +advancements in areas like image editing and video synthesis. Despite their +formidable capabilities, these models are not without their limitations. It is +still challenging to synthesize an image that aligns well with the input text, +and multiple runs with carefully crafted prompts are required to achieve +satisfactory results. To mitigate these limitations, numerous studies have +endeavored to fine-tune the pre-trained diffusion models, i.e., UNet, utilizing +various technologies. Yet, amidst these efforts, a pivotal question of +text-to-image diffusion model training has remained largely unexplored: Is it +possible and feasible to fine-tune the text encoder to improve the performance +of text-to-image diffusion models? Our findings reveal that, instead of +replacing the CLIP text encoder used in Stable Diffusion with other large +language models, we can enhance it through our proposed fine-tuning approach, +TextCraftor, leading to substantial improvements in quantitative benchmarks and +human assessments. Interestingly, our technique also empowers controllable +image generation through the interpolation of different text encoders +fine-tuned with various rewards. We also demonstrate that TextCraftor is +orthogonal to UNet finetuning, and can be combined to further improve +generative quality.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +Cross-Domain Few-Shot Segmentation via Iterative Support-Query Correspondence Mining,Jiahao Nie · Yun Xing · Gongjie Zhang · Pei Yan · Aoran Xiao · Yap-peng Tan · Alex C. Kot · Shijian Lu, ,https://arxiv.org/abs/2401.08407,,,Cross-Domain Few-Shot Segmentation via Iterative Support-Query Correspondence Mining,"Cross-Domain Few-Shot Segmentation (CD-FSS) poses the challenge of segmenting +novel categories from a distinct domain using only limited exemplars. In this +paper, we undertake a comprehensive study of CD-FSS and uncover two crucial +insights: (i) the necessity of a fine-tuning stage to effectively transfer the +learned meta-knowledge across domains, and (ii) the overfitting risk during the +na\""ive fine-tuning due to the scarcity of novel category examples. With these +insights, we propose a novel cross-domain fine-tuning strategy that addresses +the challenging CD-FSS tasks. We first design Bi-directional Few-shot +Prediction (BFP), which establishes support-query correspondence in a +bi-directional manner, crafting augmented supervision to reduce the overfitting +risk. Then we further extend BFP into Iterative Few-shot Adaptor (IFA), which +is a recursive framework to capture the support-query correspondence +iteratively, targeting maximal exploitation of supervisory signals from the +sparse novel category samples. 
Extensive empirical evaluations show that our +method significantly outperforms the state-of-the-arts (+7.8\%), which verifies +that IFA tackles the cross-domain challenges and mitigates the overfitting +simultaneously. The code is available at: https://github.com/niejiahao1998/IFA.",cs.CV,['cs.CV'] +Learning Large-Factor EM Image Super-Resolution with Generative Priors,Jiateng Shou · Zeyu Xiao · Shiyu Deng · Wei Huang · ShiPeiyao · Ruobing Zhang · Zhiwei Xiong · Feng Wu,https://github.com/jtshou/GPEMSR,https://arxiv.org/html/2405.07044v1,,2405.07044v1.pdf,Semantic Guided Large Scale Factor Remote Sensing Image Super-resolution with Generative Diffusion Prior,"Remote sensing images captured by different platforms exhibit significant +disparities in spatial resolution. Large scale factor super-resolution (SR) +algorithms are vital for maximizing the utilization of low-resolution (LR) +satellite data captured from orbit. However, existing methods confront +challenges in recovering SR images with clear textures and correct ground +objects. We introduce a novel framework, the Semantic Guided Diffusion Model +(SGDM), designed for large scale factor remote sensing image super-resolution. +The framework exploits a pre-trained generative model as a prior to generate +perceptually plausible SR images. We further enhance the reconstruction by +incorporating vector maps, which carry structural and semantic cues. Moreover, +pixel-level inconsistencies in paired remote sensing images, stemming from +sensor-specific imaging characteristics, may hinder the convergence of the +model and diversity in generated results. To address this problem, we propose +to extract the sensor-specific imaging characteristics and model the +distribution of them, allowing diverse SR images generation based on imaging +characteristics provided by reference images or sampled from the imaging +characteristic probability distributions. To validate and evaluate our +approach, we create the Cross-Modal Super-Resolution Dataset (CMSRD). +Qualitative and quantitative experiments on CMSRD showcase the superiority and +broad applicability of our method. Experimental results on downstream vision +tasks also demonstrate the utilitarian of the generated SR images. The dataset +and code will be publicly available at https://github.com/wwangcece/SGDM",cs.CV,['cs.CV'] +FairCLIP: Harnessing Fairness in Vision-Language Learning,Yan Luo · MIN SHI · Muhammad Osama Khan · Muhammad Muneeb Afzal · Hao Huang · Shuaihang Yuan · Yu Tian · Luo Song · Ava Kouhana · Tobias Elze · Yi Fang · Mengyu Wang, ,https://arxiv.org/abs/2403.19949,,2403.19949.pdf,FairCLIP: Harnessing Fairness in Vision-Language Learning,"Fairness is a critical concern in deep learning, especially in healthcare, +where these models influence diagnoses and treatment decisions. Although +fairness has been investigated in the vision-only domain, the fairness of +medical vision-language (VL) models remains unexplored due to the scarcity of +medical VL datasets for studying fairness. To bridge this research gap, we +introduce the first fair vision-language medical dataset Harvard-FairVLMed that +provides detailed demographic attributes, ground-truth labels, and clinical +notes to facilitate an in-depth examination of fairness within VL foundation +models. Using Harvard-FairVLMed, we conduct a comprehensive fairness analysis +of two widely-used VL models (CLIP and BLIP2), pre-trained on both natural and +medical domains, across four different protected attributes. 
Our results +highlight significant biases in all VL models, with Asian, Male, Non-Hispanic, +and Spanish being the preferred subgroups across the protected attributes of +race, gender, ethnicity, and language, respectively. In order to alleviate +these biases, we propose FairCLIP, an optimal-transport-based approach that +achieves a favorable trade-off between performance and fairness by reducing the +Sinkhorn distance between the overall sample distribution and the distributions +corresponding to each demographic group. As the first VL dataset of its kind, +Harvard-FairVLMed holds the potential to catalyze advancements in the +development of machine learning models that are both ethically aware and +clinically effective. Our dataset and code are available at +https://ophai.hms.harvard.edu/datasets/harvard-fairvlmed10k.",cs.CV,['cs.CV'] +Distributionally Generative Augmentation for Fair Facial Attribute Classification,Fengda Zhang · Qianpei He · Kun Kuang · Jiashuo Liu · Long Chen · Chao Wu · Jun Xiao · Hanwang Zhang,https://github.com/heqianpei/DiGA,https://arxiv.org/abs/2403.06606,,2403.06606.pdf,Distributionally Generative Augmentation for Fair Facial Attribute Classification,"Facial Attribute Classification (FAC) holds substantial promise in widespread +applications. However, FAC models trained by traditional methodologies can be +unfair by exhibiting accuracy inconsistencies across varied data +subpopulations. This unfairness is largely attributed to bias in data, where +some spurious attributes (e.g., Male) statistically correlate with the target +attribute (e.g., Smiling). Most of existing fairness-aware methods rely on the +labels of spurious attributes, which may be unavailable in practice. This work +proposes a novel, generation-based two-stage framework to train a fair FAC +model on biased data without additional annotation. Initially, we identify the +potential spurious attributes based on generative models. Notably, it enhances +interpretability by explicitly showing the spurious attributes in image space. +Following this, for each image, we first edit the spurious attributes with a +random degree sampled from a uniform distribution, while keeping target +attribute unchanged. Then we train a fair FAC model by fostering model +invariance to these augmentation. Extensive experiments on three common +datasets demonstrate the effectiveness of our method in promoting fairness in +FAC without compromising accuracy. Codes are in +https://github.com/heqianpei/DiGA.",cs.CV,"['cs.CV', 'cs.LG']" +RobustSAM: Segment Anything Robustly on Degraded Images,Wei-Ting Chen · Yu Jiet Vong · Sy-Yen Kuo · Sizhuo Ma · Jian Wang, ,https://arxiv.org/abs/2306.07713,,2306.07713.pdf,Robustness of SAM: Segment Anything Under Corruptions and Beyond,"Segment anything model (SAM), as the name suggests, is claimed to be capable +of cutting out any object and demonstrates impressive zero-shot transfer +performance with the guidance of prompts. However, there is currently a lack of +comprehensive evaluation regarding its robustness under various corruptions. +Understanding the robustness of SAM across different corruption scenarios is +crucial for its real-world deployment. Prior works show that SAM is biased +towards texture (style) rather than shape, motivated by which we start by +investigating its robustness against style transfer, which is synthetic +corruption. 
Following by interpreting the effects of synthetic corruption as +style changes, we proceed to conduct a comprehensive evaluation for its +robustness against 15 types of common corruption. These corruptions mainly fall +into categories such as digital, noise, weather, and blur, and within each +corruption category, we explore 5 severity levels to simulate real-world +corruption scenarios. Beyond the corruptions, we further assess the robustness +of SAM against local occlusion and local adversarial patch attacks. To the best +of our knowledge, our work is the first of its kind to evaluate the robustness +of SAM under style change, local occlusion, and local adversarial patch +attacks. Given that patch attacks visible to human eyes are easily detectable, +we further assess its robustness against global adversarial attacks that are +imperceptible to human eyes. Overall, this work provides a comprehensive +empirical study of the robustness of SAM, evaluating its performance under +various corruptions and extending the assessment to critical aspects such as +local occlusion, local adversarial patch attacks, and global adversarial +attacks. These evaluations yield valuable insights into the practical +applicability and effectiveness of SAM in addressing real-world challenges.",cs.CV,['cs.CV'] +ArtAdapter: Text-to-Image Style Transfer using Multi-Level Style Encoder and Explicit Adaptation,Dar-Yen Chen · Hamish Tennent · Ching-Wen Hsu,https://cardinalblue.github.io/artadapter.github.io/,https://arxiv.org/abs/2312.02109v1,,2312.02109v1.pdf,ArtAdapter: Text-to-Image Style Transfer using Multi-Level Style Encoder and Explicit Adaptation,"This work introduces ArtAdapter, a transformative text-to-image (T2I) style +transfer framework that transcends traditional limitations of color, +brushstrokes, and object shape, capturing high-level style elements such as +composition and distinctive artistic expression. The integration of a +multi-level style encoder with our proposed explicit adaptation mechanism +enables ArtAdapte to achieve unprecedented fidelity in style transfer, ensuring +close alignment with textual descriptions. Additionally, the incorporation of +an Auxiliary Content Adapter (ACA) effectively separates content from style, +alleviating the borrowing of content from style references. Moreover, our novel +fast finetuning approach could further enhance zero-shot style representation +while mitigating the risk of overfitting. Comprehensive evaluations confirm +that ArtAdapter surpasses current state-of-the-art methods.",cs.CV,['cs.CV'] +NAPGuard: Towards Detecting Naturalistic Adversarial Patches,Siyang Wu · Jiakai Wang · Jiejie Zhao · Yazhe Wang · Xianglong Liu,https://github.com/wsynuiag/NAPGaurd,https://arxiv.org/abs/2307.08076,,2307.08076.pdf,Diffusion to Confusion: Naturalistic Adversarial Patch Generation Based on Diffusion Model for Object Detector,"Many physical adversarial patch generation methods are widely proposed to +protect personal privacy from malicious monitoring using object detectors. +However, they usually fail to generate satisfactory patch images in terms of +both stealthiness and attack performance without making huge efforts on careful +hyperparameter tuning. To address this issue, we propose a novel naturalistic +adversarial patch generation method based on the diffusion models (DM). 
Through +sampling the optimal image from the DM model pretrained upon natural images, it +allows us to stably craft high-quality and naturalistic physical adversarial +patches to humans without suffering from serious mode collapse problems as +other deep generative models. To the best of our knowledge, we are the first to +propose DM-based naturalistic adversarial patch generation for object +detectors. With extensive quantitative, qualitative, and subjective +experiments, the results demonstrate the effectiveness of the proposed approach +to generate better-quality and more naturalistic adversarial patches while +achieving acceptable attack performance than other state-of-the-art patch +generation methods. We also show various generation trade-offs under different +conditions.",cs.CV,['cs.CV'] +DSL-FIQA: Assessing Facial Image Quality via Dual-Set Degradation Learning and Landmark-Guided Transformer,Wei-Ting Chen · Gurunandan Krishnan · Qiang Gao · Sy-Yen Kuo · Sizhuo Ma · Jian Wang, ,,https://ieeexplore.ieee.org/abstract/document/10381809/authors,,,,,nan +PACER+: On-Demand Pedestrian Animation Controller in Driving Scenarios,Jingbo Wang · Zhengyi Luo · Ye Yuan · Yixuan LI · Bo Dai, ,https://arxiv.org/html/2404.19722v1,,2404.19722v1.pdf,PACER+: On-Demand Pedestrian Animation Controller in Driving Scenarios,"We address the challenge of content diversity and controllability in +pedestrian simulation for driving scenarios. Recent pedestrian animation +frameworks have a significant limitation wherein they primarily focus on either +following trajectory [46] or the content of the reference video [57], +consequently overlooking the potential diversity of human motion within such +scenarios. This limitation restricts the ability to generate pedestrian +behaviors that exhibit a wider range of variations and realistic motions and +therefore restricts its usage to provide rich motion content for other +components in the driving simulation system, e.g., suddenly changed motion to +which the autonomous vehicle should respond. In our approach, we strive to +surpass the limitation by showcasing diverse human motions obtained from +various sources, such as generated human motions, in addition to following the +given trajectory. The fundamental contribution of our framework lies in +combining the motion tracking task with trajectory following, which enables the +tracking of specific motion parts (e.g., upper body) while simultaneously +following the given trajectory by a single policy. This way, we significantly +enhance both the diversity of simulated human motion within the given scenario +and the controllability of the content, including language-based control. Our +framework facilitates the generation of a wide range of human motions, +contributing to greater realism and adaptability in pedestrian simulations for +driving scenarios. 
More information is on our project page +https://wangjingbo1219.github.io/papers/CVPR2024_PACER_PLUS/PACERPLUSPage.html .",cs.CV,['cs.CV'] +Cache Me if You Can: Accelerating Diffusion Models through Block Caching,Felix Wimbauer · Bichen Wu · Edgar Schoenfeld · Xiaoliang Dai · Ji Hou · Zijian He · Artsiom Sanakoyeu · Peizhao Zhang · Sam Tsai · Jonas Kohler · Christian Rupprecht · Daniel Cremers · Peter Vajda · Jialiang Wang, ,https://arxiv.org/abs/2312.03209,,2312.03209.pdf,Cache Me if You Can: Accelerating Diffusion Models through Block Caching,"Diffusion models have recently revolutionized the field of image synthesis +due to their ability to generate photorealistic images. However, one of the +major drawbacks of diffusion models is that the image generation process is +costly. A large image-to-image network has to be applied many times to +iteratively refine an image from random noise. While many recent works propose +techniques to reduce the number of required steps, they generally treat the +underlying denoising network as a black box. In this work, we investigate the +behavior of the layers within the network and find that 1) the layers' output +changes smoothly over time, 2) the layers show distinct patterns of change, and +3) the change from step to step is often very small. We hypothesize that many +layer computations in the denoising network are redundant. Leveraging this, we +introduce block caching, in which we reuse outputs from layer blocks of +previous steps to speed up inference. Furthermore, we propose a technique to +automatically determine caching schedules based on each block's changes over +timesteps. In our experiments, we show through FID, human evaluation and +qualitative analysis that Block Caching allows to generate images with higher +visual quality at the same computational cost. We demonstrate this for +different state-of-the-art models (LDM and EMU) and solvers (DDIM and DPM).",cs.CV,['cs.CV'] +Multi-Modal Hallucination Control by Visual Information Grounding,Alessandro Favero · Luca Zancato · Matthew Trager · Siddharth Choudhary · Pramuditha Perera · Alessandro Achille · Ashwin Swaminathan · Stefano Soatto, ,https://arxiv.org/abs/2403.14003,,2403.14003.pdf,Multi-Modal Hallucination Control by Visual Information Grounding,"Generative Vision-Language Models (VLMs) are prone to generate +plausible-sounding textual answers that, however, are not always grounded in +the input image. We investigate this phenomenon, usually referred to as +""hallucination"" and show that it stems from an excessive reliance on the +language prior. In particular, we show that as more tokens are generated, the +reliance on the visual prompt decreases, and this behavior strongly correlates +with the emergence of hallucinations. To reduce hallucinations, we introduce +Multi-Modal Mutual-Information Decoding (M3ID), a new sampling method for +prompt amplification. M3ID amplifies the influence of the reference image over +the language prior, hence favoring the generation of tokens with higher mutual +information with the visual prompt. M3ID can be applied to any pre-trained +autoregressive VLM at inference time without necessitating further training and +with minimal computational overhead. If training is an option, we show that +M3ID can be paired with Direct Preference Optimization (DPO) to improve the +model's reliance on the prompt image without requiring any labels. 
Our +empirical findings show that our algorithms maintain the fluency and linguistic +capabilities of pre-trained VLMs while reducing hallucinations by mitigating +visually ungrounded answers. Specifically, for the LLaVA 13B model, M3ID and +M3ID+DPO reduce the percentage of hallucinated objects in captioning tasks by +25% and 28%, respectively, and improve the accuracy on VQA benchmarks such as +POPE by 21% and 24%.",cs.CV,"['cs.CV', 'cs.CL', 'cs.LG']" +Diffusion Time-step Curriculum for One Image to 3D Generation,YI Xuanyu · Zike Wu · Qingshan Xu · Pan Zhou · Joo Lim · Hanwang Zhang, ,https://arxiv.org/abs/2404.04562,,2404.04562.pdf,Diffusion Time-step Curriculum for One Image to 3D Generation,"Score distillation sampling~(SDS) has been widely adopted to overcome the +absence of unseen views in reconstructing 3D objects from a \textbf{single} +image. It leverages pre-trained 2D diffusion models as teacher to guide the +reconstruction of student 3D models. Despite their remarkable success, +SDS-based methods often encounter geometric artifacts and texture saturation. +We find out the crux is the overlooked indiscriminate treatment of diffusion +time-steps during optimization: it unreasonably treats the student-teacher +knowledge distillation to be equal at all time-steps and thus entangles +coarse-grained and fine-grained modeling. Therefore, we propose the Diffusion +Time-step Curriculum one-image-to-3D pipeline (DTC123), which involves both the +teacher and student models collaborating with the time-step curriculum in a +coarse-to-fine manner. Extensive experiments on NeRF4, RealFusion15, GSO and +Level50 benchmark demonstrate that DTC123 can produce multi-view consistent, +high-quality, and diverse 3D assets. Codes and more generation demos will be +released in https://github.com/yxymessi/DTC123.",cs.CV,['cs.CV'] +3DToonify: Creating Your High-Fidelity 3D Stylized Avatar Easily from 2D Portrait Images,Yifang Men · Hanxi Liu · Yuan Yao · Miaomiao Cui · Xuansong Xie · Zhouhui Lian, ,https://arxiv.org/abs/2311.17917,,2311.17917.pdf,AvatarStudio: High-fidelity and Animatable 3D Avatar Creation from Text,"We study the problem of creating high-fidelity and animatable 3D avatars from +only textual descriptions. Existing text-to-avatar methods are either limited +to static avatars which cannot be animated or struggle to generate animatable +avatars with promising quality and precise pose control. To address these +limitations, we propose AvatarStudio, a coarse-to-fine generative model that +generates explicit textured 3D meshes for animatable human avatars. +Specifically, AvatarStudio begins with a low-resolution NeRF-based +representation for coarse generation, followed by incorporating SMPL-guided +articulation into the explicit mesh representation to support avatar animation +and high resolution rendering. To ensure view consistency and pose +controllability of the resulting avatars, we introduce a 2D diffusion model +conditioned on DensePose for Score Distillation Sampling supervision. By +effectively leveraging the synergy between the articulated mesh representation +and the DensePose-conditional diffusion model, AvatarStudio can create +high-quality avatars from text that are ready for animation, significantly +outperforming previous methods. Moreover, it is competent for many +applications, e.g., multimodal avatar animations and style-guided avatar +creation. 
For more results, please refer to our project page: +http://jeff95.me/projects/avatarstudio.html",cs.GR,"['cs.GR', 'cs.CV']" +Unleashing Network Potentials for Semantic Scene Completion,Fengyun Wang · Qianru Sun · Dong Zhang · Jinhui Tang,https://github.com/fereenwong/AMMNet,https://arxiv.org/abs/2403.07560v2,,2403.07560v2.pdf,Unleashing Network Potentials for Semantic Scene Completion,"Semantic scene completion (SSC) aims to predict complete 3D voxel occupancy +and semantics from a single-view RGB-D image, and recent SSC methods commonly +adopt multi-modal inputs. However, our investigation reveals two limitations: +ineffective feature learning from single modalities and overfitting to limited +datasets. To address these issues, this paper proposes a novel SSC framework - +Adversarial Modality Modulation Network (AMMNet) - with a fresh perspective of +optimizing gradient updates. The proposed AMMNet introduces two core modules: a +cross-modal modulation enabling the interdependence of gradient flows between +modalities, and a customized adversarial training scheme leveraging dynamic +gradient competition. Specifically, the cross-modal modulation adaptively +re-calibrates the features to better excite representation potentials from each +single modality. The adversarial training employs a minimax game of evolving +gradients, with customized guidance to strengthen the generator's perception of +visual fidelity from both geometric completeness and semantic correctness. +Extensive experimental results demonstrate that AMMNet outperforms +state-of-the-art SSC methods by a large margin, providing a promising direction +for improving the effectiveness and generalization of SSC methods.",cs.CV,['cs.CV'] +NeRF Director: Revisiting View Selection in Neural Volume Rendering,Wenhui Xiao · Rodrigo Santa Cruz · David Ahmedt-Aristizabal · Olivier Salvado · Clinton Fookes · Leo Lebrat,https://wenwhx.github.io/nerfdirector/,https://arxiv.org/abs/2310.20685,,2310.20685.pdf,NeRF Revisited: Fixing Quadrature Instability in Volume Rendering,"Neural radiance fields (NeRF) rely on volume rendering to synthesize novel +views. Volume rendering requires evaluating an integral along each ray, which +is numerically approximated with a finite sum that corresponds to the exact +integral along the ray under piecewise constant volume density. As a +consequence, the rendered result is unstable w.r.t. the choice of samples along +the ray, a phenomenon that we dub quadrature instability. We propose a +mathematically principled solution by reformulating the sample-based rendering +equation so that it corresponds to the exact integral under piecewise linear +volume density. This simultaneously resolves multiple issues: conflicts between +samples along different rays, imprecise hierarchical sampling, and +non-differentiability of quantiles of ray termination distances w.r.t. model +parameters. We demonstrate several benefits over the classical sample-based +rendering equation, such as sharper textures, better geometric reconstruction, +and stronger depth supervision. Our proposed formulation can be also be used as +a drop-in replacement to the volume rendering equation of existing NeRF-based +methods. 
Our project page can be found at pl-nerf.github.io.",cs.CV,['cs.CV'] +Exploring Efficient Asymmetric Blind-Spots for Self-Supervised Denoising in Real-World Scenarios,Shiyan Chen · Jiyuan Zhang · Zhaofei Yu · Tiejun Huang, ,https://ar5iv.labs.arxiv.org/html/2303.16783,,2303.16783.pdf,Exploring Efficient Asymmetric Blind-Spots for Self-Supervised Denoising in Real-World Scenarios,"Self-supervised denoising has attracted widespread attention due to its +ability to train without clean images. However, noise in real-world scenarios +is often spatially correlated, which causes many self-supervised algorithms +that assume pixel-wise independent noise to perform poorly. Recent works have +attempted to break noise correlation with downsampling or neighborhood masking. +However, denoising on downsampled subgraphs can lead to aliasing effects and +loss of details due to a lower sampling rate. Furthermore, the neighborhood +masking methods either come with high computational complexity or do not +consider local spatial preservation during inference. Through the analysis of +existing methods, we point out that the key to obtaining high-quality and +texture-rich results in real-world self-supervised denoising tasks is to train +at the original input resolution structure and use asymmetric operations during +training and inference. Based on this, we propose Asymmetric Tunable Blind-Spot +Network (AT-BSN), where the blind-spot size can be freely adjusted, thus better +balancing noise correlation suppression and image local spatial destruction +during training and inference. In addition, we regard the pre-trained AT-BSN as +a meta-teacher network capable of generating various teacher networks by +sampling different blind-spots. We propose a blind-spot based multi-teacher +distillation strategy to distill a lightweight network, significantly improving +performance. Experimental results on multiple datasets prove that our method +achieves state-of-the-art, and is superior to other self-supervised algorithms +in terms of computational overhead and visual effects.",cs.CV,['cs.CV'] +Interpretable Measures of Conceptual Similarity by Complexity-Constrained Descriptive Auto-Encoding,Alessandro Achille · Greg Ver Steeg · Tian Yu Liu · Matthew Trager · Carson Klingenberg · Stefano Soatto, ,https://arxiv.org/abs/2402.08919v1,,2402.08919v1.pdf,Interpretable Measures of Conceptual Similarity by Complexity-Constrained Descriptive Auto-Encoding,"Quantifying the degree of similarity between images is a key copyright issue +for image-based machine learning. In legal doctrine however, determining the +degree of similarity between works requires subjective analysis, and +fact-finders (judges and juries) can demonstrate considerable variability in +these subjective judgement calls. Images that are structurally similar can be +deemed dissimilar, whereas images of completely different scenes can be deemed +similar enough to support a claim of copying. We seek to define and compute a +notion of ""conceptual similarity"" among images that captures high-level +relations even among images that do not share repeated elements or visually +similar components. The idea is to use a base multi-modal model to generate +""explanations"" (captions) of visual data at increasing levels of complexity. 
+Then, similarity can be measured by the length of the caption needed to +discriminate between the two images: Two highly dissimilar images can be +discriminated early in their description, whereas conceptually dissimilar ones +will need more detail to be distinguished. We operationalize this definition +and show that it correlates with subjective (averaged human evaluation) +assessment, and beats existing baselines on both image-to-image and +text-to-text similarity benchmarks. Beyond just providing a number, our method +also offers interpretability by pointing to the specific level of granularity +of the description where the source data are differentiated.",cs.CV,"['cs.CV', 'cs.LG']" +Attack To Defend: Exploiting Adversarial Attacks for Detecting Poisoned Models,Samar Fares · Karthik Nandakumar, ,https://arxiv.org/abs/2312.06230,,2312.06230.pdf,Activation Gradient based Poisoned Sample Detection Against Backdoor Attacks,"This work studies the task of poisoned sample detection for defending against +data poisoning based backdoor attacks. Its core challenge is finding a +generalizable and discriminative metric to distinguish between clean and +various types of poisoned samples (e.g., various triggers, various poisoning +ratios). Inspired by a common phenomenon in backdoor attacks that the +backdoored model tend to map significantly different poisoned and clean samples +within the target class to similar activation areas, we introduce a novel +perspective of the circular distribution of the gradients w.r.t. sample +activation, dubbed gradient circular distribution (GCD). And, we find two +interesting observations based on GCD. One is that the GCD of samples in the +target class is much more dispersed than that in the clean class. The other is +that in the GCD of target class, poisoned and clean samples are clearly +separated. Inspired by above two observations, we develop an innovative +three-stage poisoned sample detection approach, called Activation Gradient +based Poisoned sample Detection (AGPD). First, we calculate GCDs of all classes +from the model trained on the untrustworthy dataset. Then, we identify the +target class(es) based on the difference on GCD dispersion between target and +clean classes. Last, we filter out poisoned samples within the identified +target class(es) based on the clear separation between poisoned and clean +samples. Extensive experiments under various settings of backdoor attacks +demonstrate the superior detection performance of the proposed method to +existing poisoned detection approaches according to sample activation-based +metrics.",cs.CR,['cs.CR'] +YOLO-World: Real-Time Open-Vocabulary Object Detection,Tianheng Cheng · Lin Song · Yixiao Ge · Wenyu Liu · Xinggang Wang · Ying Shan,https://github.com/AILab-CVC/YOLO-World,https://arxiv.org/abs/2401.17270,,2401.17270.pdf,YOLO-World: Real-Time Open-Vocabulary Object Detection,"The You Only Look Once (YOLO) series of detectors have established themselves +as efficient and practical tools. However, their reliance on predefined and +trained object categories limits their applicability in open scenarios. +Addressing this limitation, we introduce YOLO-World, an innovative approach +that enhances YOLO with open-vocabulary detection capabilities through +vision-language modeling and pre-training on large-scale datasets. 
+Specifically, we propose a new Re-parameterizable Vision-Language Path +Aggregation Network (RepVL-PAN) and region-text contrastive loss to facilitate +the interaction between visual and linguistic information. Our method excels in +detecting a wide range of objects in a zero-shot manner with high efficiency. +On the challenging LVIS dataset, YOLO-World achieves 35.4 AP with 52.0 FPS on +V100, which outperforms many state-of-the-art methods in terms of both accuracy +and speed. Furthermore, the fine-tuned YOLO-World achieves remarkable +performance on several downstream tasks, including object detection and +open-vocabulary instance segmentation.",cs.CV,['cs.CV'] +Neural Parametric Gaussians for Monocular Non-Rigid Object Reconstruction,Devikalyan Das · Christopher Wewer · Raza Yunus · Eddy Ilg · Jan Lenssen,https://geometric-rl.mpi-inf.mpg.de/npg/,https://arxiv.org/abs/2312.01196,,2312.01196.pdf,Neural Parametric Gaussians for Monocular Non-Rigid Object Reconstruction,"Reconstructing dynamic objects from monocular videos is a severely +underconstrained and challenging problem, and recent work has approached it in +various directions. However, owing to the ill-posed nature of this problem, +there has been no solution that can provide consistent, high-quality novel +views from camera positions that are significantly different from the training +views. In this work, we introduce Neural Parametric Gaussians (NPGs) to take on +this challenge by imposing a two-stage approach: first, we fit a low-rank +neural deformation model, which then is used as regularization for non-rigid +reconstruction in the second stage. The first stage learns the object's +deformations such that it preserves consistency in novel views. The second +stage obtains high reconstruction quality by optimizing 3D Gaussians that are +driven by the coarse model. To this end, we introduce a local 3D Gaussian +representation, where temporally shared Gaussians are anchored in and deformed +by local oriented volumes. The resulting combined model can be rendered as +radiance fields, resulting in high-quality photo-realistic reconstructions of +the non-rigidly deforming objects. We demonstrate that NPGs achieve superior +results compared to previous works, especially in challenging scenarios with +few multi-view cues.",cs.CV,['cs.CV'] +AttriHuman-3D: Editable 3D Human Avatar Generation with Attribute Decomposition and Indexing,Fan Yang · Tianyi Chen · XIAOSHENG HE · Zhongang Cai · Lei Yang · Si Wu · Guosheng Lin, ,https://arxiv.org/abs/2312.02209,,2312.02209.pdf,AttriHuman-3D: Editable 3D Human Avatar Generation with Attribute Decomposition and Indexing,"Editable 3D-aware generation, which supports user-interacted editing, has +witnessed rapid development recently. However, existing editable 3D GANs either +fail to achieve high-accuracy local editing or suffer from huge computational +costs. We propose AttriHuman-3D, an editable 3D human generation model, which +address the aforementioned problems with attribute decomposition and indexing. +The core idea of the proposed model is to generate all attributes (e.g. human +body, hair, clothes and so on) in an overall attribute space with six feature +planes, which are then decomposed and manipulated with different attribute +indexes. To precisely extract features of different attributes from the +generated feature planes, we propose a novel attribute indexing method as well +as an orthogonal projection regularization to enhance the disentanglement. 
We +also introduce a hyper-latent training strategy and an attribute-specific +sampling strategy to avoid style entanglement and misleading punishment from +the discriminator. Our method allows users to interactively edit selected +attributes in the generated 3D human avatars while keeping others fixed. Both +qualitative and quantitative experiments demonstrate that our model provides a +strong disentanglement between different attributes, allows fine-grained image +editing and generates high-quality 3D human avatars.",cs.CV,['cs.CV'] +GaussianEditor: Swift and Controllable 3D Editing with Gaussian Splatting,Yiwen Chen · Zilong Chen · Chi Zhang · Feng Wang · Xiaofeng Yang · Yikai Wang · Zhongang Cai · Lei Yang · Huaping Liu · Guosheng Lin, ,https://arxiv.org/abs/2311.14521,,2311.14521.pdf,GaussianEditor: Swift and Controllable 3D Editing with Gaussian Splatting,"3D editing plays a crucial role in many areas such as gaming and virtual +reality. Traditional 3D editing methods, which rely on representations like +meshes and point clouds, often fall short in realistically depicting complex +scenes. On the other hand, methods based on implicit 3D representations, like +Neural Radiance Field (NeRF), render complex scenes effectively but suffer from +slow processing speeds and limited control over specific scene areas. In +response to these challenges, our paper presents GaussianEditor, an innovative +and efficient 3D editing algorithm based on Gaussian Splatting (GS), a novel 3D +representation. GaussianEditor enhances precision and control in editing +through our proposed Gaussian semantic tracing, which traces the editing target +throughout the training process. Additionally, we propose Hierarchical Gaussian +splatting (HGS) to achieve stabilized and fine results under stochastic +generative guidance from 2D diffusion models. We also develop editing +strategies for efficient object removal and integration, a challenging task for +existing methods. Our comprehensive experiments demonstrate GaussianEditor's +superior control, efficacy, and rapid performance, marking a significant +advancement in 3D editing. Project Page: +https://buaacyw.github.io/gaussian-editor/",cs.CV,['cs.CV'] +AiOS: All-in-One-Stage Expressive Human Pose and Shape Estimation,Qingping SUN · Yanjun Wang · Ailing Zeng · Wanqi Yin · Chen Wei · Wenjia Wang · Haiy Mei · Chi LEUNG · Ziwei Liu · Lei Yang · Zhongang Cai, ,https://arxiv.org/abs/2403.17934,,2403.17934.pdf,AiOS: All-in-One-Stage Expressive Human Pose and Shape Estimation,"Expressive human pose and shape estimation (a.k.a. 3D whole-body mesh +recovery) involves the human body, hand, and expression estimation. Most +existing methods have tackled this task in a two-stage manner, first detecting +the human body part with an off-the-shelf detection model and inferring the +different human body parts individually. Despite the impressive results +achieved, these methods suffer from 1) loss of valuable contextual information +via cropping, 2) introducing distractions, and 3) lacking inter-association +among different persons and body parts, inevitably causing performance +degradation, especially for crowded scenes. To address these issues, we +introduce a novel all-in-one-stage framework, AiOS, for multiple expressive +human pose and shape recovery without an additional human detection step. +Specifically, our method is built upon DETR, which treats multi-person +whole-body mesh recovery task as a progressive set prediction problem with +various sequential detection. 
We devise the decoder tokens and extend them to +our task. Specifically, we first employ a human token to probe a human location +in the image and encode global features for each instance, which provides a +coarse location for the later transformer block. Then, we introduce a +joint-related token to probe the human joint in the image and encoder a +fine-grained local feature, which collaborates with the global feature to +regress the whole-body mesh. This straightforward but effective model +outperforms previous state-of-the-art methods by a 9% reduction in NMVE on +AGORA, a 30% reduction in PVE on EHF, a 10% reduction in PVE on ARCTIC, and a +3% reduction in PVE on EgoBody.",cs.CV,['cs.CV'] +Edge-Aware 3D Instance Segmentation Network with Intelligent Semantic Prior,Wonseok Roh · Hwanhee Jung · Giljoo Nam · Jinseop Yeom · Hyunje Park · Sang Ho Yoon · Sangpil Kim, ,https://arxiv.org/abs/2311.12291,,2311.12291.pdf,Instance-aware 3D Semantic Segmentation powered by Shape Generators and Classifiers,"Existing 3D semantic segmentation methods rely on point-wise or voxel-wise +feature descriptors to output segmentation predictions. However, these +descriptors are often supervised at point or voxel level, leading to +segmentation models that can behave poorly at instance-level. In this paper, we +proposed a novel instance-aware approach for 3D semantic segmentation. Our +method combines several geometry processing tasks supervised at instance-level +to promote the consistency of the learned feature representation. Specifically, +our methods use shape generators and shape classifiers to perform shape +reconstruction and classification tasks for each shape instance. This enforces +the feature representation to faithfully encode both structural and local shape +information, with an awareness of shape instances. In the experiments, our +method significantly outperform existing approaches in 3D semantic segmentation +on several public benchmarks, such as Waymo Open Dataset, SemanticKITTI and +ScanNetV2.",cs.CV,['cs.CV'] +ECoDepth: Effective Conditioning of Diffusion Models for Monocular Depth Estimation,Suraj Patni · Aradhye Agarwal · Chetan Arora,https://ecodepth-iitd.github.io/,https://arxiv.org/abs/2403.18807,,2403.18807.pdf,ECoDepth: Effective Conditioning of Diffusion Models for Monocular Depth Estimation,"In the absence of parallax cues, a learning-based single image depth +estimation (SIDE) model relies heavily on shading and contextual cues in the +image. While this simplicity is attractive, it is necessary to train such +models on large and varied datasets, which are difficult to capture. It has +been shown that using embeddings from pre-trained foundational models, such as +CLIP, improves zero shot transfer in several applications. Taking inspiration +from this, in our paper we explore the use of global image priors generated +from a pre-trained ViT model to provide more detailed contextual information. +We argue that the embedding vector from a ViT model, pre-trained on a large +dataset, captures greater relevant information for SIDE than the usual route of +generating pseudo image captions, followed by CLIP based text embeddings. Based +on this idea, we propose a new SIDE model using a diffusion backbone which is +conditioned on ViT embeddings. Our proposed design establishes a new +state-of-the-art (SOTA) for SIDE on NYUv2 dataset, achieving Abs Rel error of +0.059 (14% improvement) compared to 0.069 by the current SOTA (VPD). 
And on +KITTI dataset, achieving Sq Rel error of 0.139 (2% improvement) compared to +0.142 by the current SOTA (GEDepth). For zero-shot transfer with a model +trained on NYUv2, we report mean relative improvement of (20%, 23%, 81%, 25%) +over NeWCRFs on (Sun-RGBD, iBims1, DIODE, HyperSim) datasets, compared to (16%, +18%, 45%, 9%) by ZoeDepth. The project page is available at +https://ecodepth-iitd.github.io",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +Low-Rank Approximation for Sparse Attention in Multi-Modal LLMs,Lin Song · Yukang Chen · Shuai Yang · Xiaohan Ding · Yixiao Ge · Ying-Cong Chen · Ying Shan, ,https://arxiv.org/abs/2405.18572,,2405.18572.pdf,Low-rank finetuning for LLMs: A fairness perspective,"Low-rank approximation techniques have become the de facto standard for +fine-tuning Large Language Models (LLMs) due to their reduced computational and +memory requirements. This paper investigates the effectiveness of these methods +in capturing the shift of fine-tuning datasets from the initial pre-trained +data distribution. Our findings reveal that there are cases in which low-rank +fine-tuning falls short in learning such shifts. This, in turn, produces +non-negligible side effects, especially when fine-tuning is adopted for +toxicity mitigation in pre-trained models, or in scenarios where it is +important to provide fair models. Through comprehensive empirical evidence on +several models, datasets, and tasks, we show that low-rank fine-tuning +inadvertently preserves undesirable biases and toxic behaviors. We also show +that this extends to sequential decision-making tasks, emphasizing the need for +careful evaluation to promote responsible LLMs development.",cs.LG,"['cs.LG', 'cs.AI', 'cs.CL']" +CG-HOI: Contact-Guided 3D Human-Object Interaction Generation,Christian Diller · Angela Dai,https://cg-hoi.christian-diller.de/#main,https://arxiv.org/abs/2311.16097v2,,2311.16097v2.pdf,CG-HOI: Contact-Guided 3D Human-Object Interaction Generation,"We propose CG-HOI, the first method to address the task of generating dynamic +3D human-object interactions (HOIs) from text. We model the motion of both +human and object in an interdependent fashion, as semantically rich human +motion rarely happens in isolation without any interactions. Our key insight is +that explicitly modeling contact between the human body surface and object +geometry can be used as strong proxy guidance, both during training and +inference. Using this guidance to bridge human and object motion enables +generating more realistic and physically plausible interaction sequences, where +the human body and corresponding object move in a coherent manner. Our method +first learns to model human motion, object motion, and contact in a joint +diffusion process, inter-correlated through cross-attention. We then leverage +this learned contact for guidance during inference to synthesize realistic and +coherent HOIs. Extensive evaluation shows that our joint contact-based +human-object interaction approach generates realistic and physically plausible +sequences, and we show two applications highlighting the capabilities of our +method. Conditioned on a given object trajectory, we can generate the +corresponding human motion without re-training, demonstrating strong +human-object interdependency learning. 
Our approach is also flexible, and can +be applied to static real-world 3D scene scans.",cs.CV,"['cs.CV', 'I.2.10; I.4.8; I.5.1; I.5.4']" +Digital Life Project: Autonomous 3D Characters with Social Intelligence,Zhongang Cai · Jianping Jiang · Zhongfei Qing · Xinying Guo · Mingyuan Zhang · Zhengyu Lin · Haiy Mei · Chen Wei · Wang Ruisi · Wanqi Yin · Liang Pan · Xiangyu Fan · Han Du · Peng Gao · Zhitao Yang · Yang Gao · Jiaqi Li · Tianxiang Ren · YuKun Wei · Xiaogang Wang · Chen Change Loy · Lei Yang · Ziwei Liu,https://digital-life-project.com/,https://arxiv.org/abs/2312.04547,,2312.04547.pdf,Digital Life Project: Autonomous 3D Characters with Social Intelligence,"In this work, we present Digital Life Project, a framework utilizing language +as the universal medium to build autonomous 3D characters, who are capable of +engaging in social interactions and expressing with articulated body motions, +thereby simulating life in a digital environment. Our framework comprises two +primary components: 1) SocioMind: a meticulously crafted digital brain that +models personalities with systematic few-shot exemplars, incorporates a +reflection process based on psychology principles, and emulates autonomy by +initiating dialogue topics; 2) MoMat-MoGen: a text-driven motion synthesis +paradigm for controlling the character's digital body. It integrates motion +matching, a proven industry technique to ensure motion quality, with +cutting-edge advancements in motion generation for diversity. Extensive +experiments demonstrate that each module achieves state-of-the-art performance +in its respective domain. Collectively, they enable virtual characters to +initiate and sustain dialogues autonomously, while evolving their +socio-psychological states. Concurrently, these characters can perform +contextually relevant bodily movements. Additionally, a motion captioning +module further allows the virtual character to recognize and appropriately +respond to human players' actions. Homepage: https://digital-life-project.com/",cs.CV,"['cs.CV', 'cs.AI', 'cs.GR', 'cs.HC']" +From Isolated Islands to Pangea: Unifying Semantic Space for Human Action Understanding,Yonglu Li · Xiaoqian Wu · Xinpeng Liu · Zehao Wang · Yiming Dou · Yikun Ji · Junyi Zhang · Yixing Li · Xudong LU · Jingru Tan · Cewu Lu, ,,https://synthical.com/article/a412be8a-adaa-450f-81ea-957ce0f2d0e4,,,,,nan +FutureHuman3D: Forecasting Complex Long-Term 3D Human Behavior from Video Observations,Christian Diller · Thomas Funkhouser · Angela Dai,https://future-human-3d.christian-diller.de/#main,https://arxiv.org/abs/2312.11972,,,Expressive Forecasting of 3D Whole-body Human Motions,"Human motion forecasting, with the goal of estimating future human behavior +over a period of time, is a fundamental task in many real-world applications. +However, existing works typically concentrate on predicting the major joints of +the human body without considering the delicate movements of the human hands. +In practical applications, hand gesture plays an important role in human +communication with the real world, and expresses the primary intention of human +beings. In this work, we are the first to formulate a whole-body human pose +forecasting task, which jointly predicts the future body and hand activities. 
+Correspondingly, we propose a novel Encoding-Alignment-Interaction (EAI) +framework that aims to predict both coarse (body joints) and fine-grained +(gestures) activities collaboratively, enabling expressive and +cross-facilitated forecasting of 3D whole-body human motions. Specifically, our +model involves two key constituents: cross-context alignment (XCA) and +cross-context interaction (XCI). Considering the heterogeneous information +within the whole-body, XCA aims to align the latent features of various human +components, while XCI focuses on effectively capturing the context interaction +among the human components. We conduct extensive experiments on a +newly-introduced large-scale benchmark and achieve state-of-the-art +performance. The code is public for research purposes at +https://github.com/Dingpx/EAI.",cs.CV,['cs.CV'] +"UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition",Xiaohan Ding · Yiyuan Zhang · Yixiao Ge · Sijie Zhao · Lin Song · Xiangyu Yue · Ying Shan, ,https://arxiv.org/abs/2311.15599,,2311.15599.pdf,"UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition","Large-kernel convolutional neural networks (ConvNets) have recently received +extensive research attention, but two unresolved and critical issues demand +further investigation. 1) The architectures of existing large-kernel ConvNets +largely follow the design principles of conventional ConvNets or transformers, +while the architectural design for large-kernel ConvNets remains +under-addressed. 2) As transformers have dominated multiple modalities, it +remains to be investigated whether ConvNets also have a strong universal +perception ability in domains beyond vision. In this paper, we contribute from +two aspects. 1) We propose four architectural guidelines for designing +large-kernel ConvNets, the core of which is to exploit the essential +characteristics of large kernels that distinguish them from small kernels - +they can see wide without going deep. Following such guidelines, our proposed +large-kernel ConvNet shows leading performance in image recognition (ImageNet +accuracy of 88.0%, ADE20K mIoU of 55.6%, and COCO box AP of 56.4%), +demonstrating better performance and higher speed than the recent powerful +competitors. 2) We discover large kernels are the key to unlocking the +exceptional performance of ConvNets in domains where they were originally not +proficient. With certain modality-related preprocessing approaches, the +proposed model achieves state-of-the-art performance on time-series forecasting +and audio recognition tasks even without modality-specific customization to the +architecture. All the code and models are publicly available on GitHub and +Huggingface.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +ProxyCap: Real-time Monocular Full-body Capture in World Space via Human-Centric Proxy-to-Motion Learning,Yuxiang Zhang · Hongwen Zhang · Liangxiao Hu · Jiajun Zhang · Hongwei Yi · Shengping Zhang · Yebin Liu,https://zhangyux15.github.io/ProxyCapV2,https://arxiv.org/abs/2307.01200,,2307.01200.pdf,ProxyCap: Real-time Monocular Full-body Capture in World Space via Human-Centric Proxy-to-Motion Learning,"Learning-based approaches to monocular motion capture have recently shown +promising results by learning to regress in a data-driven manner. 
However, due +to the challenges in data collection and network designs, it remains +challenging for existing solutions to achieve real-time full-body capture while +being accurate in world space. In this work, we introduce ProxyCap, a +human-centric proxy-to-motion learning scheme to learn world-space motions from +a proxy dataset of 2D skeleton sequences and 3D rotational motions. Such proxy +data enables us to build a learning-based network with accurate world-space +supervision while also mitigating the generalization issues. For more accurate +and physically plausible predictions in world space, our network is designed to +learn human motions from a human-centric perspective, which enables the +understanding of the same motion captured with different camera trajectories. +Moreover, a contact-aware neural motion descent module is proposed in our +network so that it can be aware of foot-ground contact and motion misalignment +with the proxy observations. With the proposed learning-based solution, we +demonstrate the first real-time monocular full-body capture system with +plausible foot-ground contact in world space even using hand-held moving +cameras. Our project page is https://zhangyux15.github.io/ProxyCapV2.",cs.CV,['cs.CV'] +DiffPerformer: Iterative Learning of Consistent Latent Guidance for Diffusion-based Human Video Generation,Chenyang Wang · Zerong Zheng · Tao Yu · Xiaoqian Lv · Bineng Zhong · Shengping Zhang · Liqiang Nie, ,https://arxiv.org/abs/2312.00853,,2312.00853.pdf,Motion-Guided Latent Diffusion for Temporally Consistent Real-world Video Super-resolution,"Real-world low-resolution (LR) videos have diverse and complex degradations, +imposing great challenges on video super-resolution (VSR) algorithms to +reproduce their high-resolution (HR) counterparts with high quality. Recently, +the diffusion models have shown compelling performance in generating realistic +details for image restoration tasks. However, the diffusion process has +randomness, making it hard to control the contents of restored images. This +issue becomes more serious when applying diffusion models to VSR tasks because +temporal consistency is crucial to the perceptual quality of videos. In this +paper, we propose an effective real-world VSR algorithm by leveraging the +strength of pre-trained latent diffusion models. To ensure the content +consistency among adjacent frames, we exploit the temporal dynamics in LR +videos to guide the diffusion process by optimizing the latent sampling path +with a motion-guided loss, ensuring that the generated HR video maintains a +coherent and continuous visual flow. To further mitigate the discontinuity of +generated details, we insert temporal module to the decoder and fine-tune it +with an innovative sequence-oriented loss. 
The proposed motion-guided latent +diffusion (MGLD) based VSR algorithm achieves significantly better perceptual +quality than state-of-the-arts on real-world VSR benchmark datasets, validating +the effectiveness of the proposed model design and training strategies.",cs.CV,['cs.CV'] +DiffPortrait3D: Controllable Diffusion for Zero-Shot Portrait View Synthesis,Yuming Gu · Hongyi Xu · You Xie · Guoxian Song · Yichun Shi · Di Chang · Jing Yang · Linjie Luo,https://freedomgu.github.io/DiffPortrait3D/,https://arxiv.org/abs/2312.13016,,2312.13016.pdf,DiffPortrait3D: Controllable Diffusion for Zero-Shot Portrait View Synthesis,"We present DiffPortrait3D, a conditional diffusion model that is capable of +synthesizing 3D-consistent photo-realistic novel views from as few as a single +in-the-wild portrait. Specifically, given a single RGB input, we aim to +synthesize plausible but consistent facial details rendered from novel camera +views with retained both identity and facial expression. In lieu of +time-consuming optimization and fine-tuning, our zero-shot method generalizes +well to arbitrary face portraits with unposed camera views, extreme facial +expressions, and diverse artistic depictions. At its core, we leverage the +generative prior of 2D diffusion models pre-trained on large-scale image +datasets as our rendering backbone, while the denoising is guided with +disentangled attentive control of appearance and camera pose. To achieve this, +we first inject the appearance context from the reference image into the +self-attention layers of the frozen UNets. The rendering view is then +manipulated with a novel conditional control module that interprets the camera +pose by watching a condition image of a crossed subject from the same view. +Furthermore, we insert a trainable cross-view attention module to enhance view +consistency, which is further strengthened with a novel 3D-aware noise +generation process during inference. We demonstrate state-of-the-art results +both qualitatively and quantitatively on our challenging in-the-wild and +multi-view benchmarks.",cs.CV,['cs.CV'] +Tyche: Stochastic in Context Learning for Medical Image Segmentation,Marianne Rakic · Hallee Wong · Jose Javier Gonzalez Ortiz · Beth Cimini · John Guttag · Adrian V. Dalca, ,https://arxiv.org/abs/2401.13650,,2401.13650.pdf,Tyche: Stochastic In-Context Learning for Medical Image Segmentation,"Existing learning-based solutions to medical image segmentation have two +important shortcomings. First, for most new segmentation task, a new model has +to be trained or fine-tuned. This requires extensive resources and machine +learning expertise, and is therefore often infeasible for medical researchers +and clinicians. Second, most existing segmentation methods produce a single +deterministic segmentation mask for a given image. In practice however, there +is often considerable uncertainty about what constitutes the correct +segmentation, and different expert annotators will often segment the same image +differently. We tackle both of these problems with Tyche, a model that uses a +context set to generate stochastic predictions for previously unseen tasks +without the need to retrain. Tyche differs from other in-context segmentation +methods in two important ways. (1) We introduce a novel convolution block +architecture that enables interactions among predictions. (2) We introduce +in-context test-time augmentation, a new mechanism to provide prediction +stochasticity. 
When combined with appropriate model design and loss functions, +Tyche can predict a set of plausible diverse segmentation candidates for new or +unseen medical images and segmentation tasks without the need to retrain.",eess.IV,"['eess.IV', 'cs.CV']" +Incremental Residual Concept Bottleneck Models,Chenming Shang · Shiji Zhou · Hengyuan Zhang · Xinzhe Ni · Yujiu Yang · Yuwang Wang, ,https://arxiv.org/abs/2404.08978,,2404.08978.pdf,Incremental Residual Concept Bottleneck Models,"Concept Bottleneck Models (CBMs) map the black-box visual representations +extracted by deep neural networks onto a set of interpretable concepts and use +the concepts to make predictions, enhancing the transparency of the +decision-making process. Multimodal pre-trained models can match visual +representations with textual concept embeddings, allowing for obtaining the +interpretable concept bottleneck without the expertise concept annotations. +Recent research has focused on the concept bank establishment and the +high-quality concept selection. However, it is challenging to construct a +comprehensive concept bank through humans or large language models, which +severely limits the performance of CBMs. In this work, we propose the +Incremental Residual Concept Bottleneck Model (Res-CBM) to address the +challenge of concept completeness. Specifically, the residual concept +bottleneck model employs a set of optimizable vectors to complete missing +concepts, then the incremental concept discovery module converts the +complemented vectors with unclear meanings into potential concepts in the +candidate concept bank. Our approach can be applied to any user-defined concept +bank, as a post-hoc processing method to enhance the performance of any CBMs. +Furthermore, to measure the descriptive efficiency of CBMs, the Concept +Utilization Efficiency (CUE) metric is proposed. Experiments show that the +Res-CBM outperforms the current state-of-the-art methods in terms of both +accuracy and efficiency and achieves comparable performance to black-box models +across multiple datasets.",cs.LG,"['cs.LG', 'cs.AI']" +RadarDistill: Boosting Radar-based Object Detection Performance via Knowledge Distillation from LiDAR Features,Geonho Bang · Kwangjin Choi · Jisong Kim · Dongsuk Kum · Jun Won Choi, ,https://arxiv.org/abs/2403.05061,,2403.05061.pdf,RadarDistill: Boosting Radar-based Object Detection Performance via Knowledge Distillation from LiDAR Features,"The inherent noisy and sparse characteristics of radar data pose challenges +in finding effective representations for 3D object detection. In this paper, we +propose RadarDistill, a novel knowledge distillation (KD) method, which can +improve the representation of radar data by leveraging LiDAR data. RadarDistill +successfully transfers desirable characteristics of LiDAR features into radar +features using three key components: Cross-Modality Alignment (CMA), +Activation-based Feature Distillation (AFD), and Proposal-based Feature +Distillation (PFD). CMA enhances the density of radar features by employing +multiple layers of dilation operations, effectively addressing the challenge of +inefficient knowledge transfer from LiDAR to radar. AFD selectively transfers +knowledge based on regions of the LiDAR features, with a specific focus on +areas where activation intensity exceeds a predefined threshold. PFD similarly +guides the radar network to selectively mimic features from the LiDAR network +within the object proposals. 
Our comparative analyses conducted on the nuScenes +datasets demonstrate that RadarDistill achieves state-of-the-art (SOTA) +performance for radar-only object detection task, recording 20.5% in mAP and +43.7% in NDS. Also, RadarDistill significantly improves the performance of the +camera-radar fusion model.",cs.CV,['cs.CV'] +Video Prediction by Modeling Videos as Continuous Multi-Dimensional Processes,Gaurav Shrivastava · Abhinav Shrivastava,https://www.cs.umd.edu/~gauravsh/cvp/supp/website.html,https://arxiv.org/abs/2401.14718,,2401.14718.pdf,A Survey on Video Prediction: From Deterministic to Generative Approaches,"Video prediction, a fundamental task in computer vision, aims to enable +models to generate sequences of future frames based on existing video content. +This task has garnered widespread application across various domains. In this +paper, we comprehensively survey both historical and contemporary works in this +field, encompassing the most widely used datasets and algorithms. Our survey +scrutinizes the challenges and evolving landscape of video prediction within +the realm of computer vision. We propose a novel taxonomy centered on the +stochastic nature of video prediction algorithms. This taxonomy accentuates the +gradual transition from deterministic to generative prediction methodologies, +underlining significant advancements and shifts in approach.",cs.CV,['cs.CV'] +Efficient Multi-scale Network with Learnable Discrete Wavelet Transform for Blind Motion Deblurring,Xin Gao · Tianheng Qiu · Xinyu Zhang · Hanlin Bai · Kang Liu · xuan huang · Hu Wei · Guoying Zhang · Huaping Liu, ,https://arxiv.org/abs/2401.00027,,2401.00027.pdf,Efficient Multi-scale Network with Learnable Discrete Wavelet Transform for Blind Motion Deblurring,"Coarse-to-fine schemes are widely used in traditional single-image motion +deblur; however, in the context of deep learning, existing multi-scale +algorithms not only require the use of complex modules for feature fusion of +low-scale RGB images and deep semantics, but also manually generate +low-resolution pairs of images that do not have sufficient confidence. In this +work, we propose a multi-scale network based on single-input and +multiple-outputs(SIMO) for motion deblurring. This simplifies the complexity of +algorithms based on a coarse-to-fine scheme. To alleviate restoration defects +impacting detail information brought about by using a multi-scale architecture, +we combine the characteristics of real-world blurring trajectories with a +learnable wavelet transform module to focus on the directional continuity and +frequency features of the step-by-step transitions between blurred images to +sharp images. In conclusion, we propose a multi-scale network with a learnable +discrete wavelet transform (MLWNet), which exhibits state-of-the-art +performance on multiple real-world deblurred datasets, in terms of both +subjective and objective quality as well as computational efficiency.",cs.CV,['cs.CV'] +Pre-trained Vision and Language Transformers Are Few-Shot Incremental Learners,Keon Hee Park · Kyungwoo Song · Gyeong-Moon Park, ,https://arxiv.org/abs/2404.02117,,,Pre-trained Vision and Language Transformers Are Few-Shot Incremental Learners,"Few-Shot Class Incremental Learning (FSCIL) is a task that requires a model +to learn new classes incrementally without forgetting when only a few samples +for each class are given. 
FSCIL encounters two significant challenges: +catastrophic forgetting and overfitting, and these challenges have driven prior +studies to primarily rely on shallow models, such as ResNet-18. Even though +their limited capacity can mitigate both forgetting and overfitting issues, it +leads to inadequate knowledge transfer during few-shot incremental sessions. In +this paper, we argue that large models such as vision and language transformers +pre-trained on large datasets can be excellent few-shot incremental learners. +To this end, we propose a novel FSCIL framework called PriViLege, Pre-trained +Vision and Language transformers with prompting functions and knowledge +distillation. Our framework effectively addresses the challenges of +catastrophic forgetting and overfitting in large models through new pre-trained +knowledge tuning (PKT) and two losses: entropy-based divergence loss and +semantic knowledge distillation loss. Experimental results show that the +proposed PriViLege significantly outperforms the existing state-of-the-art +methods with a large margin, e.g., +9.38% in CUB200, +20.58% in CIFAR-100, and ++13.36% in miniImageNet. Our implementation code is available at +https://github.com/KHU-AGI/PriViLege.",cs.CV,['cs.CV'] +PDF: A Probability-Driven Framework for Open World 3D Point Cloud Semantic Segmentation,Jinfeng Xu · Siyuan Yang · Xianzhi Li · Yuan Tang · yixue Hao · Long Hu · Min Chen, ,https://arxiv.org/abs/2404.00979,,2404.00979.pdf,PDF: A Probability-Driven Framework for Open World 3D Point Cloud Semantic Segmentation,"Existing point cloud semantic segmentation networks cannot identify unknown +classes and update their knowledge, due to a closed-set and static perspective +of the real world, which would induce the intelligent agent to make bad +decisions. To address this problem, we propose a Probability-Driven Framework +(PDF) for open world semantic segmentation that includes (i) a lightweight +U-decoder branch to identify unknown classes by estimating the uncertainties, +(ii) a flexible pseudo-labeling scheme to supply geometry features along with +probability distribution features of unknown classes by generating pseudo +labels, and (iii) an incremental knowledge distillation strategy to incorporate +novel classes into the existing knowledge base gradually. Our framework enables +the model to behave like human beings, which could recognize unknown objects +and incrementally learn them with the corresponding knowledge. Experimental +results on the S3DIS and ScanNetv2 datasets demonstrate that the proposed PDF +outperforms other methods by a large margin in both important tasks of open +world semantic segmentation.",cs.CV,['cs.CV'] +Gaussian-Flow: 4D Reconstruction with Dynamic 3D Gaussian Particle,Youtian Lin · Zuozhuo Dai · Siyu Zhu · Yao Yao, ,https://arxiv.org/abs/2312.03431,,2312.03431.pdf,Gaussian-Flow: 4D Reconstruction with Dynamic 3D Gaussian Particle,"We introduce Gaussian-Flow, a novel point-based approach for fast dynamic +scene reconstruction and real-time rendering from both multi-view and monocular +videos. In contrast to the prevalent NeRF-based approaches hampered by slow +training and rendering speeds, our approach harnesses recent advancements in +point-based 3D Gaussian Splatting (3DGS). 
Specifically, a novel Dual-Domain +Deformation Model (DDDM) is proposed to explicitly model attribute deformations +of each Gaussian point, where the time-dependent residual of each attribute is +captured by a polynomial fitting in the time domain, and a Fourier series +fitting in the frequency domain. The proposed DDDM is capable of modeling +complex scene deformations across long video footage, eliminating the need for +training separate 3DGS for each frame or introducing an additional implicit +neural field to model 3D dynamics. Moreover, the explicit deformation modeling +for discretized Gaussian points ensures ultra-fast training and rendering of a +4D scene, which is comparable to the original 3DGS designed for static 3D +reconstruction. Our proposed approach showcases a substantial efficiency +improvement, achieving a $5\times$ faster training speed compared to the +per-frame 3DGS modeling. In addition, quantitative results demonstrate that the +proposed Gaussian-Flow significantly outperforms previous leading methods in +novel view rendering quality. Project page: +https://nju-3dv.github.io/projects/Gaussian-Flow",cs.CV,['cs.CV'] +Boosting Object Detection with Zero-Shot Day-Night Domain Adaptation,Zhipeng Du · Miaojing Shi · Jiankang Deng,https://github.com/ZPDu/Boosting-Object-Detection-with-Zero-Shot-Day-Night-Domain-Adaptation,https://arxiv.org/abs/2312.01220,,2312.01220.pdf,Boosting Object Detection with Zero-Shot Day-Night Domain Adaptation,"Detecting objects in low-light scenarios presents a persistent challenge, as +detectors trained on well-lit data exhibit significant performance degradation +on low-light data due to low visibility. Previous methods mitigate this issue +by exploring image enhancement or object detection techniques with real +low-light image datasets. However, the progress is impeded by the inherent +difficulties about collecting and annotating low-light images. To address this +challenge, we propose to boost low-light object detection with zero-shot +day-night domain adaptation, which aims to generalize a detector from well-lit +scenarios to low-light ones without requiring real low-light data. Revisiting +Retinex theory in the low-level vision, we first design a reflectance +representation learning module to learn Retinex-based illumination invariance +in images with a carefully designed illumination invariance reinforcement +strategy. Next, an interchange-redecomposition-coherence procedure is +introduced to improve over the vanilla Retinex image decomposition process by +performing two sequential image decompositions and introducing a +redecomposition cohering loss. Extensive experiments on ExDark, DARK FACE, and +CODaN datasets show strong low-light generalizability of our method. Our code +is available at https://github.com/ZPDu/DAI-Net.",cs.CV,['cs.CV'] +Clockwork Diffusion: Efficient Generation With Model-Step Distillation,Amirhossein Habibian · Amir Ghodrati · Noor Fathima · Guillaume Sautiere · Risheek Garrepalli · Fatih Porikli · Jens Petersen, ,https://arxiv.org/abs/2312.08128,,2312.08128.pdf,Clockwork Diffusion: Efficient Generation With Model-Step Distillation,"This work aims to improve the efficiency of text-to-image diffusion models. +While diffusion models use computationally expensive UNet-based denoising +operations in every generation step, we identify that not all operations are +equally relevant for the final output quality. 
In particular, we observe that +UNet layers operating on high-res feature maps are relatively sensitive to +small perturbations. In contrast, low-res feature maps influence the semantic +layout of the final image and can often be perturbed with no noticeable change +in the output. Based on this observation, we propose Clockwork Diffusion, a +method that periodically reuses computation from preceding denoising steps to +approximate low-res feature maps at one or more subsequent steps. For multiple +baselines, and for both text-to-image generation and image editing, we +demonstrate that Clockwork leads to comparable or improved perceptual scores +with drastically reduced computational complexity. As an example, for Stable +Diffusion v1.5 with 8 DPM++ steps we save 32% of FLOPs with negligible FID and +CLIP change.",cs.CV,['cs.CV'] +BEVSpread: Spread Voxel Pooling for Bird’s-Eye-View Representation in Vision-based Roadside 3D Object Detection,Wenjie Wang · Yehao Lu · Guangcong Zheng · Shuigenzhan · Xiaoqing Ye · Zichang Tan · Jingdong Wang · Gaoang Wang · Xi Li,https://github.com/DaTongjie/BEVSpread,https://arxiv.org/abs/2312.00633,,2312.00633.pdf,Towards Efficient 3D Object Detection in Bird's-Eye-View Space for Autonomous Driving: A Convolutional-Only Approach,"3D object detection in Bird's-Eye-View (BEV) space has recently emerged as a +prevalent approach in the field of autonomous driving. Despite the demonstrated +improvements in accuracy and velocity estimation compared to perspective view +methods, the deployment of BEV-based techniques in real-world autonomous +vehicles remains challenging. This is primarily due to their reliance on +vision-transformer (ViT) based architectures, which introduce quadratic +complexity with respect to the input resolution. To address this issue, we +propose an efficient BEV-based 3D detection framework called BEVENet, which +leverages a convolutional-only architectural design to circumvent the +limitations of ViT models while maintaining the effectiveness of BEV-based +methods. Our experiments show that BEVENet is 3$\times$ faster than +contemporary state-of-the-art (SOTA) approaches on the NuScenes challenge, +achieving a mean average precision (mAP) of 0.456 and a nuScenes detection +score (NDS) of 0.555 on the NuScenes validation dataset, with an inference +speed of 47.6 frames per second. To the best of our knowledge, this study +stands as the first to achieve such significant efficiency improvements for +BEV-based methods, highlighting their enhanced feasibility for real-world +autonomous driving applications.",cs.CV,"['cs.CV', 'cs.AI']" +GARField: Group Anything with Radiance Fields,Chung Min Kim · Mingxuan Wu · Justin Kerr · Ken Goldberg · Matthew Tancik · Angjoo Kanazawa, ,https://arxiv.org/abs/2401.09419,,2401.09419.pdf,GARField: Group Anything with Radiance Fields,"Grouping is inherently ambiguous due to the multiple levels of granularity in +which one can decompose a scene -- should the wheels of an excavator be +considered separate or part of the whole? We present Group Anything with +Radiance Fields (GARField), an approach for decomposing 3D scenes into a +hierarchy of semantically meaningful groups from posed image inputs. To do this +we embrace group ambiguity through physical scale: by optimizing a +scale-conditioned 3D affinity feature field, a point in the world can belong to +different groups of different sizes. 
We optimize this field from a set of 2D +masks provided by Segment Anything (SAM) in a way that respects coarse-to-fine +hierarchy, using scale to consistently fuse conflicting masks from different +viewpoints. From this field we can derive a hierarchy of possible groupings via +automatic tree construction or user interaction. We evaluate GARField on a +variety of in-the-wild scenes and find it effectively extracts groups at many +levels: clusters of objects, objects, and various subparts. GARField inherently +represents multi-view consistent groupings and produces higher fidelity groups +than the input SAM masks. GARField's hierarchical grouping could have exciting +downstream applications such as 3D asset extraction or dynamic scene +understanding. See the project website at https://www.garfield.studio/",cs.CV,"['cs.CV', 'cs.GR']" +General Point Model Pretraining with Autoencoding and Autoregressive,Zhe Li · Zhangyang Gao · Cheng Tan · Bocheng Ren · Laurence Yang · Stan Z. Li, ,https://arxiv.org/abs/2310.16861,,2310.16861.pdf,General Point Model with Autoencoding and Autoregressive,"The pre-training architectures of large language models encompass various +types, including autoencoding models, autoregressive models, and +encoder-decoder models. We posit that any modality can potentially benefit from +a large language model, as long as it undergoes vector quantization to become +discrete tokens. Inspired by GLM, we propose a General Point Model (GPM) which +seamlessly integrates autoencoding and autoregressive tasks in point cloud +transformer. This model is versatile, allowing fine-tuning for downstream point +cloud representation tasks, as well as unconditional and conditional generation +tasks. GPM enhances masked prediction in autoencoding through various forms of +mask padding tasks, leading to improved performance in point cloud +understanding. Additionally, GPM demonstrates highly competitive results in +unconditional point cloud generation tasks, even exhibiting the potential for +conditional generation tasks by modifying the input's conditional information. +Compared to models like Point-BERT, MaskPoint and PointMAE, our GPM achieves +superior performance in point cloud understanding tasks. Furthermore, the +integration of autoregressive and autoencoding within the same transformer +underscores its versatility across different downstream tasks.",cs.LG,"['cs.LG', 'cs.CV']" +NAYER: Noisy Layer Data Generation for Efficient and Effective Data-free Knowledge Distillation,Minh-Tuan Tran · Trung Le · Xuan-May Le · Mehrtash Harandi · Quan Tran · Dinh Phung,https://github.com/tmtuan1307/NAYER,https://arxiv.org/abs/2310.00258,,2310.00258.pdf,NAYER: Noisy Layer Data Generation for Efficient and Effective Data-free Knowledge Distillation,"Data-Free Knowledge Distillation (DFKD) has made significant recent strides +by transferring knowledge from a teacher neural network to a student neural +network without accessing the original data. Nonetheless, existing approaches +encounter a significant challenge when attempting to generate samples from +random noise inputs, which inherently lack meaningful information. +Consequently, these models struggle to effectively map this noise to the +ground-truth sample distribution, resulting in prolonging training times and +low-quality outputs. 
In this paper, we propose a novel Noisy Layer Generation +method (NAYER) which relocates the random source from the input to a noisy +layer and utilizes the meaningful constant label-text embedding (LTE) as the +input. LTE is generated by using the language model once, and then it is stored +in memory for all subsequent training processes. The significance of LTE lies +in its ability to contain substantial meaningful inter-class information, +enabling the generation of high-quality samples with only a few training steps. +Simultaneously, the noisy layer plays a key role in addressing the issue of +diversity in sample generation by preventing the model from overemphasizing the +constrained label information. By reinitializing the noisy layer in each +iteration, we aim to facilitate the generation of diverse samples while still +retaining the method's efficiency, thanks to the ease of learning provided by +LTE. Experiments carried out on multiple datasets demonstrate that our NAYER +not only outperforms the state-of-the-art methods but also achieves speeds 5 to +15 times faster than previous approaches. The code is available at +https://github.com/tmtuan1307/nayer.",cs.CV,['cs.CV'] +MLIP: Enhancing Medical Visual Representation with Divergence Encoder and Knowledge-guided Contrastive Learning,Zhe Li · Laurence Yang · Bocheng Ren · Xin Nie · Zhangyang Gao · Cheng Tan · Stan Z. Li, ,https://arxiv.org/abs/2402.02045,,2402.02045.pdf,MLIP: Enhancing Medical Visual Representation with Divergence Encoder and Knowledge-guided Contrastive Learning,"The scarcity of annotated data has sparked significant interest in +unsupervised pre-training methods that leverage medical reports as auxiliary +signals for medical visual representation learning. However, existing research +overlooks the multi-granularity nature of medical visual representation and +lacks suitable contrastive learning techniques to improve the models' +generalizability across different granularities, leading to the +underutilization of image-text information. To address this, we propose MLIP, a +novel framework leveraging domain-specific medical knowledge as guiding signals +to integrate language information into the visual domain through image-text +contrastive learning. Our model includes global contrastive learning with our +designed divergence encoder, local token-knowledge-patch alignment contrastive +learning, and knowledge-guided category-level contrastive learning with expert +knowledge. Experimental evaluations reveal the efficacy of our model in +enhancing transfer performance for tasks such as image classification, object +detection, and semantic segmentation. Notably, MLIP surpasses state-of-the-art +methods even with limited annotated data, highlighting the potential of +multimodal pre-training in advancing medical representation learning.",cs.CV,['cs.CV'] +Inversion-Free Image Editing with Language-Guided Diffusion Models,Sihan Xu · Yidong Huang · Jiayi Pan · Ziqiao Ma · Joyce Chai,https://sled-group.github.io/InfEdit/,https://arxiv.org/abs/2312.04965,,2312.04965.pdf,Inversion-Free Image Editing with Natural Language,"Despite recent advances in inversion-based editing, text-guided image +manipulation remains challenging for diffusion models. The primary bottlenecks +include 1) the time-consuming nature of the inversion process; 2) the struggle +to balance consistency with accuracy; 3) the lack of compatibility with +efficient consistency sampling methods used in consistency models. 
To address +the above issues, we start by asking ourselves if the inversion process can be +eliminated for editing. We show that when the initial sample is known, a +special variance schedule reduces the denoising step to the same form as the +multi-step consistency sampling. We name this Denoising Diffusion Consistent +Model (DDCM), and note that it implies a virtual inversion strategy without +explicit inversion in sampling. We further unify the attention control +mechanisms in a tuning-free framework for text-guided editing. Combining them, +we present inversion-free editing (InfEdit), which allows for consistent and +faithful editing for both rigid and non-rigid semantic changes, catering to +intricate modifications without compromising on the image's integrity and +explicit inversion. Through extensive experiments, InfEdit shows strong +performance in various editing tasks and also maintains a seamless workflow +(less than 3 seconds on one single A40), demonstrating the potential for +real-time applications. Project Page: https://sled-group.github.io/InfEdit/",cs.CV,"['cs.CV', 'cs.AI', 'cs.CL']" +DUDF: Differentiable Unsigned Distance Fields with Hyperbolic Scaling,Miguel Fainstein · Viviana Siless · Emmanuel Iarussi,https://lia-ditella.github.io/DUDF/,https://arxiv.org/abs/2402.08876,,2402.08876.pdf,DUDF: Differentiable Unsigned Distance Fields with Hyperbolic Scaling,"In recent years, there has been a growing interest in training Neural +Networks to approximate Unsigned Distance Fields (UDFs) for representing open +surfaces in the context of 3D reconstruction. However, UDFs are +non-differentiable at the zero level set which leads to significant errors in +distances and gradients, generally resulting in fragmented and discontinuous +surfaces. In this paper, we propose to learn a hyperbolic scaling of the +unsigned distance field, which defines a new Eikonal problem with distinct +boundary conditions. This allows our formulation to integrate seamlessly with +state-of-the-art continuously differentiable implicit neural representation +networks, largely applied in the literature to represent signed distance +fields. Our approach not only addresses the challenge of open surface +representation but also demonstrates significant improvement in reconstruction +quality and training performance. Moreover, the unlocked field's +differentiability allows the accurate computation of essential topological +properties such as normal directions and curvatures, pervasive in downstream +tasks such as rendering. Through extensive experiments, we validate our +approach across various data sets and against competitive baselines. The +results demonstrate enhanced accuracy and up to an order of magnitude increase +in speed compared to previous methods.",cs.CV,"['cs.CV', 'cs.AI', 'cs.GR', 'I.2.10; I.4.10; I.3.7']" +RoMa: Robust Dense Feature Matching,Johan Edstedt · Qiyu Sun · Georg Bökman · Mårten Wadenbäck · Michael Felsberg,https://parskatt.github.io/RoMa/,https://arxiv.org/html/2305.15404v2,,2305.15404v2.pdf,RoMa: Robust Dense Feature Matching,"Feature matching is an important computer vision task that involves +estimating correspondences between two images of a 3D scene, and dense methods +estimate all such correspondences. The aim is to learn a robust model, i.e., a +model able to match under challenging real-world changes. In this work, we +propose such a model, leveraging frozen pretrained features from the foundation +model DINOv2. 
Although these features are significantly more robust than local +features trained from scratch, they are inherently coarse. We therefore combine +them with specialized ConvNet fine features, creating a precisely localizable +feature pyramid. To further improve robustness, we propose a tailored +transformer match decoder that predicts anchor probabilities, which enables it +to express multimodality. Finally, we propose an improved loss formulation +through regression-by-classification with subsequent robust regression. We +conduct a comprehensive set of experiments that show that our method, RoMa, +achieves significant gains, setting a new state-of-the-art. In particular, we +achieve a 36% improvement on the extremely challenging WxBS benchmark. Code is +provided at https://github.com/Parskatt/RoMa",cs.CV,['cs.CV'] +Harnessing Large Language Models for Training-free Video Anomaly Detection,Luca Zanella · Willi Menapace · Massimiliano Mancini · Yiming Wang · Elisa Ricci, ,,https://paperswithcode.com/paper/harnessing-large-language-models-for-training,,,,,nan +Your Student is Better Than Expected: Adaptive Teacher-Student Collaboration for Text-Conditional Diffusion Models,Nikita Starodubcev · Dmitry Baranchuk · Artem Fedorov · Artem Babenko, ,https://arxiv.org/abs/2312.10835,,2312.10835.pdf,Your Student is Better Than Expected: Adaptive Teacher-Student Collaboration for Text-Conditional Diffusion Models,"Knowledge distillation methods have recently shown to be a promising +direction to speedup the synthesis of large-scale diffusion models by requiring +only a few inference steps. While several powerful distillation methods were +recently proposed, the overall quality of student samples is typically lower +compared to the teacher ones, which hinders their practical usage. In this +work, we investigate the relative quality of samples produced by the teacher +text-to-image diffusion model and its distilled student version. As our main +empirical finding, we discover that a noticeable portion of student samples +exhibit superior fidelity compared to the teacher ones, despite the +""approximate"" nature of the student. Based on this finding, we propose an +adaptive collaboration between student and teacher diffusion models for +effective text-to-image synthesis. Specifically, the distilled model produces +the initial sample, and then an oracle decides whether it needs further +improvements with a slow teacher model. Extensive experiments demonstrate that +the designed pipeline surpasses state-of-the-art text-to-image alternatives for +various inference budgets in terms of human preference. Furthermore, the +proposed approach can be naturally used in popular applications such as +text-guided image editing and controllable generation.",cs.CV,['cs.CV'] +Align and Aggregate: Compositional Reasoning with Video Alignment and Answer Aggregation for Video Question-Answering,Zhaohe Liao · Jiangtong Li · Li Niu · Liqing Zhang, ,,https://dl.acm.org/doi/abs/10.1145/3581783.3613909,,,,,nan +$360+x$: A Panoptic Multi-modal Scene Understanding Dataset,Hao Chen · Yuqi Hou · Chenyuan Qu · Irene Testini · Xiaohan Hong · Jianbo Jiao,https://x360dataset.github.io/,https://arxiv.org/abs/2404.00989,,2404.00989.pdf,360+x: A Panoptic Multi-modal Scene Understanding Dataset,"Human perception of the world is shaped by a multitude of viewpoints and +modalities. While many existing datasets focus on scene understanding from a +certain perspective (e.g. 
egocentric or third-person views), our dataset offers +a panoptic perspective (i.e. multiple viewpoints with multiple data +modalities). Specifically, we encapsulate third-person panoramic and front +views, as well as egocentric monocular/binocular views with rich modalities +including video, multi-channel audio, directional binaural delay, location data +and textual scene descriptions within each scene captured, presenting +comprehensive observation of the world. Figure 1 offers a glimpse of all 28 +scene categories of our 360+x dataset. To the best of our knowledge, this is +the first database that covers multiple viewpoints with multiple data +modalities to mimic how daily information is accessed in the real world. +Through our benchmark analysis, we presented 5 different scene understanding +tasks on the proposed 360+x dataset to evaluate the impact and benefit of each +data modality and perspective in panoptic scene understanding. We hope this +unique dataset could broaden the scope of comprehensive scene understanding and +encourage the community to approach these problems from more diverse +perspectives.",cs.CV,"['cs.CV', 'cs.AI', 'cs.MM', 'cs.SD', 'eess.AS']" +Text-Enhanced Data-free Approach for Federated Class-Incremental Learning,Minh-Tuan Tran · Trung Le · Xuan-May Le · Mehrtash Harandi · Dinh Phung,https://github.com/tmtuan1307/LANDER,https://arxiv.org/abs/2403.14101,,2403.14101.pdf,Text-Enhanced Data-free Approach for Federated Class-Incremental Learning,"Federated Class-Incremental Learning (FCIL) is an underexplored yet pivotal +issue, involving the dynamic addition of new classes in the context of +federated learning. In this field, Data-Free Knowledge Transfer (DFKT) plays a +crucial role in addressing catastrophic forgetting and data privacy problems. +However, prior approaches lack the crucial synergy between DFKT and the model +training phases, causing DFKT to encounter difficulties in generating +high-quality data from a non-anchored latent space of the old task model. In +this paper, we introduce LANDER (Label Text Centered Data-Free Knowledge +Transfer) to address this issue by utilizing label text embeddings (LTE) +produced by pretrained language models. Specifically, during the model training +phase, our approach treats LTE as anchor points and constrains the feature +embeddings of corresponding training samples around them, enriching the +surrounding area with more meaningful information. In the DFKT phase, by using +these LTE anchors, LANDER can synthesize more meaningful samples, thereby +effectively addressing the forgetting problem. Additionally, instead of tightly +constraining embeddings toward the anchor, the Bounding Loss is introduced to +encourage sample embeddings to remain flexible within a defined radius. This +approach preserves the natural differences in sample embeddings and mitigates +the embedding overlap caused by heterogeneous federated settings. Extensive +experiments conducted on CIFAR100, Tiny-ImageNet, and ImageNet demonstrate that +LANDER significantly outperforms previous methods and achieves state-of-the-art +performance in FCIL. 
The code is available at +https://github.com/tmtuan1307/lander.",cs.CV,"['cs.CV', 'cs.CL', 'cs.LG']" +Rethinking Boundary Discontinuity Problem for Oriented Object Detection,Hang Xu · Xinyuan Liu · Haonan Xu · Yike Ma · Zunjie Zhu · Chenggang Yan · Feng Dai,https://github.com/hangxu-cv/cvpr24acm,,https://ieeexplore.ieee.org/abstract/document/10475581,,,,,nan +GPT-4V(ision) is a Human-Aligned Evaluator for Text-to-3D Generation,Tong Wu · Guandao Yang · Zhibing Li · Kai Zhang · Ziwei Liu · Leonidas Guibas · Dahua Lin · Gordon Wetzstein, ,https://arxiv.org/abs/2401.04092,,2401.04092.pdf,GPT-4V(ision) is a Human-Aligned Evaluator for Text-to-3D Generation,"Despite recent advances in text-to-3D generative methods, there is a notable +absence of reliable evaluation metrics. Existing metrics usually focus on a +single criterion each, such as how well the asset aligned with the input text. +These metrics lack the flexibility to generalize to different evaluation +criteria and might not align well with human preferences. Conducting user +preference studies is an alternative that offers both adaptability and +human-aligned results. User studies, however, can be very expensive to scale. +This paper presents an automatic, versatile, and human-aligned evaluation +metric for text-to-3D generative models. To this end, we first develop a prompt +generator using GPT-4V to generate evaluating prompts, which serve as input to +compare text-to-3D models. We further design a method instructing GPT-4V to +compare two 3D assets according to user-defined criteria. Finally, we use these +pairwise comparison results to assign these models Elo ratings. Experimental +results suggest our metric strongly align with human preference across +different evaluation criteria.",cs.CV,['cs.CV'] +Adversarial Text to Continuous Image Generation,Kilichbek Haydarov · Aashiq Muhamed · Xiaoqian Shen · Jovana Lazarevic · Ivan Skorokhodov · Chamuditha Jayanga Galappaththige · Mohamed Elhoseiny, ,https://arxiv.org/abs/2312.14440,,2312.14440.pdf,Asymmetric Bias in Text-to-Image Generation with Adversarial Attacks,"The widespread use of Text-to-Image (T2I) models in content generation +requires careful examination of their safety, including their robustness to +adversarial attacks. Despite extensive research on adversarial attacks, the +reasons for their effectiveness remain underexplored. This paper presents an +empirical study on adversarial attacks against T2I models, focusing on +analyzing factors associated with attack success rates (ASR). We introduce a +new attack objective - entity swapping using adversarial suffixes and two +gradient-based attack algorithms. Human and automatic evaluations reveal the +asymmetric nature of ASRs on entity swap: for example, it is easier to replace +""human"" with ""robot"" in the prompt ""a human dancing in the rain."" with an +adversarial suffix, but the reverse replacement is significantly harder. We +further propose probing metrics to establish indicative signals from the +model's beliefs to the adversarial ASR. 
We identify conditions that result in a +success probability of 60% for adversarial attacks and others where this +likelihood drops below 5%.",cs.LG,"['cs.LG', 'cs.CR']" +Contextrast: Contextual Contrastive Learning for Semantic Segmentation,Changki Sung · Wanhee Kim · Jungho An · WooJu Lee · Hyungtae Lim · Hyun Myung, ,https://arxiv.org/abs/2404.10633,,2404.10633.pdf,Contextrast: Contextual Contrastive Learning for Semantic Segmentation,"Despite great improvements in semantic segmentation, challenges persist +because of the lack of local/global contexts and the relationship between them. +In this paper, we propose Contextrast, a contrastive learning-based semantic +segmentation method that allows to capture local/global contexts and comprehend +their relationships. Our proposed method comprises two parts: a) contextual +contrastive learning (CCL) and b) boundary-aware negative (BANE) sampling. +Contextual contrastive learning obtains local/global context from multi-scale +feature aggregation and inter/intra-relationship of features for better +discrimination capabilities. Meanwhile, BANE sampling selects embedding +features along the boundaries of incorrectly predicted regions to employ them +as harder negative samples on our contrastive learning, resolving segmentation +issues along the boundary region by exploiting fine-grained details. We +demonstrate that our Contextrast substantially enhances the performance of +semantic segmentation networks, outperforming state-of-the-art contrastive +learning approaches on diverse public datasets, e.g. Cityscapes, CamVid, +PASCAL-C, COCO-Stuff, and ADE20K, without an increase in computational cost +during inference.",cs.CV,['cs.CV'] +DetDiffusion: Synergizing Generative and Perceptive Models for Enhanced Data Generation and Perception,Yibo Wang · Ruiyuan Gao · Kai Chen · Kaiqiang Zhou · Yingjie CAI · Lanqing Hong · Zhenguo Li · Lihui Jiang · Dit-Yan Yeung · Qiang Xu · Kai Zhang, ,https://arxiv.org/abs/2403.13304,,2403.13304.pdf,DetDiffusion: Synergizing Generative and Perceptive Models for Enhanced Data Generation and Perception,"Current perceptive models heavily depend on resource-intensive datasets, +prompting the need for innovative solutions. Leveraging recent advances in +diffusion models, synthetic data, by constructing image inputs from various +annotations, proves beneficial for downstream tasks. While prior methods have +separately addressed generative and perceptive models, DetDiffusion, for the +first time, harmonizes both, tackling the challenges in generating effective +data for perceptive models. To enhance image generation with perceptive models, +we introduce perception-aware loss (P.A. loss) through segmentation, improving +both quality and controllability. To boost the performance of specific +perceptive models, our method customizes data augmentation by extracting and +utilizing perception-aware attribute (P.A. Attr) during generation. +Experimental results from the object detection task highlight DetDiffusion's +superior performance, establishing a new state-of-the-art in layout-guided +generation. 
Furthermore, image syntheses from DetDiffusion can effectively +augment training data, significantly enhancing downstream detection +performance.",cs.CV,['cs.CV'] +Boosting Image Quality Assessment through Efficient Transformer Adaptation with Local Feature Enhancement,Kangmin Xu · Liang Liao · Jing Xiao · Chaofeng Chen · Haoning Wu · Qiong Yan · Weisi Lin, ,https://arxiv.org/abs/2308.12001,,2308.12001.pdf,Local Distortion Aware Efficient Transformer Adaptation for Image Quality Assessment,"Image Quality Assessment (IQA) constitutes a fundamental task within the +field of computer vision, yet it remains an unresolved challenge, owing to the +intricate distortion conditions, diverse image contents, and limited +availability of data. Recently, the community has witnessed the emergence of +numerous large-scale pretrained foundation models, which greatly benefit from +dramatically increased data and parameter capacities. However, it remains an +open problem whether the scaling law in high-level tasks is also applicable to +IQA task which is closely related to low-level clues. In this paper, we +demonstrate that with proper injection of local distortion features, a larger +pretrained and fixed foundation model performs better in IQA tasks. +Specifically, for the lack of local distortion structure and inductive bias of +vision transformer (ViT), alongside the large-scale pretrained ViT, we use +another pretrained convolution neural network (CNN), which is well known for +capturing the local structure, to extract multi-scale image features. Further, +we propose a local distortion extractor to obtain local distortion features +from the pretrained CNN and a local distortion injector to inject the local +distortion features into ViT. By only training the extractor and injector, our +method can benefit from the rich knowledge in the powerful foundation models +and achieve state-of-the-art performance on popular IQA datasets, indicating +that IQA is not only a low-level problem but also benefits from stronger +high-level features drawn from large-scale pretrained models.",cs.CV,['cs.CV'] +Robust Distillation via Untargeted and Targeted Intermediate Adversarial Samples,Junhao Dong · Piotr Koniusz · Junxi Chen · Z. Wang · Yew-Soon Ong, ,,https://www.a-star.edu.sg/cfar/news/news/features/10-papers-accepted-at-cvpr-2024,,,,,nan +Adversarially Robust Few-shot Learning via Parameter Co-distillation of Similarity and Class Concept Learners,Junhao Dong · Piotr Koniusz · Junxi Chen · Xiaohua Xie · Yew-Soon Ong, ,,https://openreview.net/forum?id=h9TTpQdGKJ,,,,,nan +Incorporating Geo-Diverse Knowledge into Prompting for Increased Geographical Robustness in Object Recognition,Kyle Buettner · Sina Malakouti · Xiang Li · Adriana Kovashka,https://krbuettner.github.io/GeoKnowledgePrompting/,https://arxiv.org/abs/2401.01482,,2401.01482.pdf,Incorporating Geo-Diverse Knowledge into Prompting for Increased Geographical Robustness in Object Recognition,"Existing object recognition models have been shown to lack robustness in +diverse geographical scenarios due to domain shifts in design and context. +Class representations need to be adapted to more accurately reflect an object +concept under these shifts. In the absence of training data from target +geographies, we hypothesize that geographically diverse descriptive knowledge +of categories can enhance robustness. 
For this purpose, we explore the +feasibility of probing a large language model for geography-based object +knowledge, and we examine the effects of integrating knowledge into zero-shot +and learnable soft prompting with CLIP. Within this exploration, we propose +geography knowledge regularization to ensure that soft prompts trained on a +source set of geographies generalize to an unseen target set. Accuracy gains +over prompting baselines on DollarStreet while training only on Europe data are +up to +2.8/1.2/1.6 on target data from Africa/Asia/Americas, and +4.6 overall +on the hardest classes. Competitive performance is shown vs. few-shot target +training, and analysis is provided to direct future study of geographical +robustness.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG']" +Clustering for Protein Representation Learning,Ruijie Quan · Wenguan Wang · Fan Ma · Hehe Fan · Yi Yang, ,https://arxiv.org/abs/2404.00254,,2404.00254.pdf,Clustering for Protein Representation Learning,"Protein representation learning is a challenging task that aims to capture +the structure and function of proteins from their amino acid sequences. +Previous methods largely ignored the fact that not all amino acids are equally +important for protein folding and activity. In this article, we propose a +neural clustering framework that can automatically discover the critical +components of a protein by considering both its primary and tertiary structure +information. Our framework treats a protein as a graph, where each node +represents an amino acid and each edge represents a spatial or sequential +connection between amino acids. We then apply an iterative clustering strategy +to group the nodes into clusters based on their 1D and 3D positions and assign +scores to each cluster. We select the highest-scoring clusters and use their +medoid nodes for the next iteration of clustering, until we obtain a +hierarchical and informative representation of the protein. We evaluate on four +protein-related tasks: protein fold classification, enzyme reaction +classification, gene ontology term prediction, and enzyme commission number +prediction. Experimental results demonstrate that our method achieves +state-of-the-art performance.",cs.LG,"['cs.LG', 'cs.CE', 'q-bio.BM', 'q-bio.QM']" +Semantic Shield: Defending Vision-Language Models Against Backdooring and Poisoning via Fine-grained Knowledge Alignment,Alvi Md Ishmam · Chris Thomas, ,https://arxiv.org/abs/2402.06659,,2402.06659.pdf,Shadowcast: Stealthy Data Poisoning Attacks Against Vision-Language Models,"Vision-Language Models (VLMs) excel in generating textual responses from +visual inputs, yet their versatility raises significant security concerns. This +study takes the first step in exposing VLMs' susceptibility to data poisoning +attacks that can manipulate responses to innocuous, everyday prompts. We +introduce Shadowcast, a stealthy data poisoning attack method where poison +samples are visually indistinguishable from benign images with matching texts. +Shadowcast demonstrates effectiveness in two attack types. The first is Label +Attack, tricking VLMs into misidentifying class labels, such as confusing +Donald Trump for Joe Biden. The second is Persuasion Attack, which leverages +VLMs' text generation capabilities to craft narratives, such as portraying junk +food as health food, through persuasive and seemingly rational descriptions. We +show that Shadowcast are highly effective in achieving attacker's intentions +using as few as 50 poison samples. 
Moreover, these poison samples remain +effective across various prompts and are transferable across different VLM +architectures in the black-box setting. This work reveals how poisoned VLMs can +generate convincing yet deceptive misinformation and underscores the importance +of data quality for responsible deployments of VLMs. Our code is available at: +https://github.com/umd-huang-lab/VLM-Poisoning.",cs.CR,"['cs.CR', 'cs.AI', 'cs.LG']" +Structured Model Probing: Empowering Efficient Transfer Learning by Structured Regularization,Zhi-Fan Wu · Chaojie Mao · Xue Wang · Jianwen Jiang · Yiliang Lv · Rong Jin, ,https://arxiv.org/abs/2403.10799,,2403.10799.pdf,Efficient Pruning of Large Language Model with Adaptive Estimation Fusion,"Large language models (LLMs) have become crucial for many generative +downstream tasks, leading to an inevitable trend and significant challenge to +deploy them efficiently on resource-constrained devices. Structured pruning is +a widely used method to address this challenge. However, when dealing with the +complex structure of the multiple decoder layers, general methods often employ +common estimation approaches for pruning. These approaches lead to a decline in +accuracy for specific downstream tasks. In this paper, we introduce a simple +yet efficient method that adaptively models the importance of each +substructure. Meanwhile, it can adaptively fuse coarse-grained and fine-grained +estimations based on the results from complex and multilayer structures. All +aspects of our design seamlessly integrate into the end-to-end pruning +framework. Our experimental results, compared with state-of-the-art methods on +mainstream datasets, demonstrate average accuracy improvements of 1.1%, 1.02%, +2.0%, and 1.2% for LLaMa-7B, Vicuna-7B, Baichuan-7B, and Bloom-7b1, +respectively.",cs.CL,"['cs.CL', 'cs.AI', 'cs.LG']" +Artist-Friendly Relightable and Animatable Neural Heads,Yingyan Xu · Prashanth Chandran · Sebastian Weiss · Markus Gross · Gaspard Zoss · Derek Bradley,https://studios.disneyresearch.com/2024/06/03/artist-friendly-relightable-and-animatable-neural-heads/,https://arxiv.org/abs/2312.03420,,2312.03420.pdf,Artist-Friendly Relightable and Animatable Neural Heads,"An increasingly common approach for creating photo-realistic digital avatars +is through the use of volumetric neural fields. The original neural radiance +field (NeRF) allowed for impressive novel view synthesis of static heads when +trained on a set of multi-view images, and follow up methods showed that these +neural representations can be extended to dynamic avatars. Recently, new +variants also surpassed the usual drawback of baked-in illumination in neural +representations, showing that static neural avatars can be relit in any +environment. In this work we simultaneously tackle both the motion and +illumination problem, proposing a new method for relightable and animatable +neural heads.
Our method builds on a proven dynamic avatar approach based on a +mixture of volumetric primitives, combined with a recently-proposed lightweight +hardware setup for relightable neural fields, and includes a novel architecture +that allows relighting dynamic neural avatars performing unseen expressions in +any environment, even with nearfield illumination and viewpoints.",cs.CV,"['cs.CV', 'cs.GR']" +Psychometry: An Omnifit Model for Image Reconstruction from Human Brain Activity,Ruijie Quan · Wenguan Wang · Zhibo Tian · Fan Ma · Yi Yang, ,https://arxiv.org/abs/2403.20022,,2403.20022.pdf,Psychometry: An Omnifit Model for Image Reconstruction from Human Brain Activity,"Reconstructing the viewed images from human brain activity bridges human and +computer vision through the Brain-Computer Interface. The inherent variability +in brain function between individuals leads existing literature to focus on +acquiring separate models for each individual using their respective brain +signal data, ignoring commonalities between these data. In this article, we +devise Psychometry, an omnifit model for reconstructing images from functional +Magnetic Resonance Imaging (fMRI) obtained from different subjects. Psychometry +incorporates an omni mixture-of-experts (Omni MoE) module where all the experts +work together to capture the inter-subject commonalities, while each expert +associated with subject-specific parameters copes with the individual +differences. Moreover, Psychometry is equipped with a retrieval-enhanced +inference strategy, termed Ecphory, which aims to enhance the learned fMRI +representation via retrieving from prestored subject-specific memories. These +designs collectively render Psychometry omnifit and efficient, enabling it to +capture both inter-subject commonality and individual specificity across +subjects. As a result, the enhanced fMRI representations serve as conditional +signals to guide a generation model to reconstruct high-quality and realistic +images, establishing Psychometry as state-of-the-art in terms of both +high-level and low-level metrics.",cs.CV,['cs.CV'] +JointSQ: Joint Sparsification-Quantization for Distributed Learning,Weiying Xie · Haowei Li · Ma Jitao · Yunsong Li · Jie Lei · donglai Liu · Leyuan Fang, ,,https://www.semanticscholar.org/paper/Joint-Sparsification-and-Quantization-for-Wireless-Su-Wang/f940a77cd570b121a727d59cd249513930cd830a,,,,,nan +PAPR in Motion: Seamless Point-level 3D Scene Interpolation,Shichong Peng · Yanshu Zhang · Ke Li, ,https://arxiv.org/abs/2307.11086,,2307.11086.pdf,PAPR: Proximity Attention Point Rendering,"Learning accurate and parsimonious point cloud representations of scene +surfaces from scratch remains a challenge in 3D representation learning. +Existing point-based methods often suffer from the vanishing gradient problem +or require a large number of points to accurately model scene geometry and +texture. To address these limitations, we propose Proximity Attention Point +Rendering (PAPR), a novel method that consists of a point-based scene +representation and a differentiable renderer. Our scene representation uses a +point cloud where each point is characterized by its spatial position, +influence score, and view-independent feature vector. The renderer selects the +relevant points for each ray and produces accurate colours using their +associated features. PAPR effectively learns point cloud positions to represent +the correct scene geometry, even when the initialization drastically differs +from the target geometry. 
Notably, our method captures fine texture details +while using only a parsimonious set of points. We also demonstrate four +practical applications of our method: zero-shot geometry editing, object +manipulation, texture transfer, and exposure control. More results and code are +available on our project website at https://zvict.github.io/papr/.",cs.CV,"['cs.CV', 'cs.AI', 'cs.GR', 'cs.LG', 'cs.NE']" +Anatomically Constrained Implicit Face Models,Prashanth Chandran · Gaspard Zoss, ,https://arxiv.org/abs/2312.07538,,2312.07538.pdf,Anatomically Constrained Implicit Face Models,"Coordinate based implicit neural representations have gained rapid popularity +in recent years as they have been successfully used in image, geometry and +scene modeling tasks. In this work, we present a novel use case for such +implicit representations in the context of learning anatomically constrained +face models. Actor specific anatomically constrained face models are the state +of the art in both facial performance capture and performance retargeting. +Despite their practical success, these anatomical models are slow to evaluate +and often require extensive data capture to be built. We propose the anatomical +implicit face model; an ensemble of implicit neural networks that jointly learn +to model the facial anatomy and the skin surface with high-fidelity, and can +readily be used as a drop in replacement to conventional blendshape models. +Given an arbitrary set of skin surface meshes of an actor and only a neutral +shape with estimated skull and jaw bones, our method can recover a dense +anatomical substructure which constrains every point on the facial surface. We +demonstrate the usefulness of our approach in several tasks ranging from shape +fitting, shape editing, and performance retargeting.",cs.GR,"['cs.GR', 'cs.CV']" +EscherNet: A Generative Model for Scalable View Synthesis,Xin Kong · Shikun Liu · Xiaoyang Lyu · Marwan Taher · Xiaojuan Qi · Andrew J. Davison,https://kxhit.github.io/EscherNet,https://arxiv.org/abs/2402.03908,,2402.03908.pdf,EscherNet: A Generative Model for Scalable View Synthesis,"We introduce EscherNet, a multi-view conditioned diffusion model for view +synthesis. EscherNet learns implicit and generative 3D representations coupled +with a specialised camera positional encoding, allowing precise and continuous +relative control of the camera transformation between an arbitrary number of +reference and target views. EscherNet offers exceptional generality, +flexibility, and scalability in view synthesis -- it can generate more than 100 +consistent target views simultaneously on a single consumer-grade GPU, despite +being trained with a fixed number of 3 reference views to 3 target views. As a +result, EscherNet not only addresses zero-shot novel view synthesis, but also +naturally unifies single- and multi-image 3D reconstruction, combining these +diverse tasks into a single, cohesive framework. Our extensive experiments +demonstrate that EscherNet achieves state-of-the-art performance in multiple +benchmarks, even when compared to methods specifically tailored for each +individual problem. This remarkable versatility opens up new directions for +designing scalable neural architectures for 3D vision. 
Project page: +https://kxhit.github.io/EscherNet.",cs.CV,['cs.CV'] +Revisiting Adversarial Training under Long-Tailed Distributions,Xinli Yue · Ningping Mou · Qian Wang · Lingchen Zhao,https://github.com/NISPLab/AT-BSL,https://arxiv.org/abs/2403.10073,,2403.10073.pdf,Revisiting Adversarial Training under Long-Tailed Distributions,"Deep neural networks are vulnerable to adversarial attacks, often leading to +erroneous outputs. Adversarial training has been recognized as one of the most +effective methods to counter such attacks. However, existing adversarial +training techniques have predominantly been tested on balanced datasets, +whereas real-world data often exhibit a long-tailed distribution, casting doubt +on the efficacy of these methods in practical scenarios. + In this paper, we delve into adversarial training under long-tailed +distributions. Through an analysis of the previous work ""RoBal"", we discover +that utilizing Balanced Softmax Loss alone can achieve performance comparable +to the complete RoBal approach while significantly reducing training overheads. +Additionally, we reveal that, similar to uniform distributions, adversarial +training under long-tailed distributions also suffers from robust overfitting. +To address this, we explore data augmentation as a solution and unexpectedly +discover that, unlike results obtained with balanced data, data augmentation +not only effectively alleviates robust overfitting but also significantly +improves robustness. We further investigate the reasons behind the improvement +of robustness through data augmentation and identify that it is attributable to +the increased diversity of examples. Extensive experiments further corroborate +that data augmentation alone can significantly improve robustness. Finally, +building on these findings, we demonstrate that compared to RoBal, the +combination of BSL and data augmentation leads to a +6.66% improvement in model +robustness under AutoAttack on CIFAR-10-LT. Our code is available at +https://github.com/NISPLab/AT-BSL .",cs.CV,['cs.CV'] +UniGS: Unified Representation for Image Generation and Segmentation,Lu Qi · Lehan Yang · Weidong Guo · Yu Xu · Bo Du · Varun Jampani · Ming-Hsuan Yang, ,https://arxiv.org/abs/2312.01985,,2312.01985.pdf,UniGS: Unified Representation for Image Generation and Segmentation,"This paper introduces a novel unified representation of diffusion models for +image generation and segmentation. Specifically, we use a colormap to represent +entity-level masks, addressing the challenge of varying entity numbers while +aligning the representation closely with the image RGB domain. Two novel +modules, including the location-aware color palette and progressive dichotomy +module, are proposed to support our mask representation. On the one hand, a +location-aware palette guarantees the colors' consistency to entities' +locations. On the other hand, the progressive dichotomy module can efficiently +decode the synthesized colormap to high-quality entity-level masks in a +depth-first binary search without knowing the cluster numbers. To tackle the +issue of lacking large-scale segmentation training data, we employ an +inpainting pipeline and then improve the flexibility of diffusion models across +various tasks, including inpainting, image synthesis, referring segmentation, +and entity segmentation. Comprehensive experiments validate the efficiency of +our approach, demonstrating comparable segmentation mask quality to +state-of-the-art and adaptability to multiple tasks. 
The code will be released +at \href{https://github.com/qqlu/Entity}{https://github.com/qqlu/Entity}.",cs.CV,['cs.CV'] +Focus on Your Instruction: Fine-grained and Multi-instruction Image Editing by Attention Modulation,guo · Tianwei Lin, ,https://arxiv.org/abs/2312.10113,,2312.10113.pdf,Focus on Your Instruction: Fine-grained and Multi-instruction Image Editing by Attention Modulation,"Recently, diffusion-based methods, like InstructPix2Pix (IP2P), have achieved +effective instruction-based image editing, requiring only natural language +instructions from the user. However, these methods often inadvertently alter +unintended areas and struggle with multi-instruction editing, resulting in +compromised outcomes. To address these issues, we introduce the Focus on Your +Instruction (FoI), a method designed to ensure precise and harmonious editing +across multiple instructions without extra training or test-time optimization. +In the FoI, we primarily emphasize two aspects: (1) precisely extracting +regions of interest for each instruction and (2) guiding the denoising process +to concentrate within these regions of interest. For the first objective, we +identify the implicit grounding capability of IP2P from the cross-attention +between instruction and image, then develop an effective mask extraction +method. For the second objective, we introduce a cross attention modulation +module for rough isolation of target editing regions and unrelated regions. +Additionally, we introduce a mask-guided disentangle sampling strategy to +further ensure clear region isolation. Experimental results demonstrate that +FoI surpasses existing methods in both quantitative and qualitative +evaluations, especially excelling in multi-instruction editing task.",cs.CV,['cs.CV'] +MorpheuS: Neural Dynamic 360$^{\circ}$ Surface Reconstruction from Monocular RGB-D Video,Hengyi Wang · Jingwen Wang · Lourdes Agapito,https://hengyiwang.github.io/projects/morpheus.html,https://arxiv.org/abs/2312.00778,,2312.00778.pdf,MorpheuS: Neural Dynamic 360° Surface Reconstruction from Monocular RGB-D Video,"Neural rendering has demonstrated remarkable success in dynamic scene +reconstruction. Thanks to the expressiveness of neural representations, prior +works can accurately capture the motion and achieve high-fidelity +reconstruction of the target object. Despite this, real-world video scenarios +often feature large unobserved regions where neural representations struggle to +achieve realistic completion. To tackle this challenge, we introduce MorpheuS, +a framework for dynamic 360{\deg} surface reconstruction from a casually +captured RGB-D video. Our approach models the target scene as a canonical field +that encodes its geometry and appearance, in conjunction with a deformation +field that warps points from the current frame to the canonical space. We +leverage a view-dependent diffusion prior and distill knowledge from it to +achieve realistic completion of unobserved regions. 
Experimental results on +various real-world and synthetic datasets show that our method can achieve +high-fidelity 360{\deg} surface reconstruction of a deformable object from a +monocular RGB-D video.",cs.CV,['cs.CV'] +DiffusionLight: Light Probes for Free by Painting a Chrome Ball,Pakkapon Phongthawee · Worameth Chinchuthakun · Nontaphat Sinsunthithet · Varun Jampani · Amit Raj · Pramook Khungurn · Supasorn Suwajanakorn,https://diffusionlight.github.io/,https://arxiv.org/abs/2312.09168v2,,2312.09168v2.pdf,DiffusionLight: Light Probes for Free by Painting a Chrome Ball,"We present a simple yet effective technique to estimate lighting in a single +input image. Current techniques rely heavily on HDR panorama datasets to train +neural networks to regress an input with limited field-of-view to a full +environment map. However, these approaches often struggle with real-world, +uncontrolled settings due to the limited diversity and size of their datasets. +To address this problem, we leverage diffusion models trained on billions of +standard images to render a chrome ball into the input image. Despite its +simplicity, this task remains challenging: the diffusion models often insert +incorrect or inconsistent objects and cannot readily generate images in HDR +format. Our research uncovers a surprising relationship between the appearance +of chrome balls and the initial diffusion noise map, which we utilize to +consistently generate high-quality chrome balls. We further fine-tune an LDR +diffusion model (Stable Diffusion XL) with LoRA, enabling it to perform exposure +bracketing for HDR light estimation. Our method produces convincing light +estimates across diverse settings and demonstrates superior generalization to +in-the-wild scenarios.",cs.CV,"['cs.CV', 'cs.GR', 'cs.LG', 'I.3.3; I.4.8']" +JRDB-PanoTrack: An Open-world Panoptic Segmentation and Tracking Robotic Dataset in Crowded Human Environments,Duy Tho Le · Chenhui Gou · Stavya Datta · Hengcan Shi · Ian Reid · Jianfei Cai · Hamid Rezatofighi, ,https://arxiv.org/abs/2404.01686,,2404.01686.pdf,JRDB-PanoTrack: An Open-world Panoptic Segmentation and Tracking Robotic Dataset in Crowded Human Environments,"Autonomous robot systems have attracted increasing research attention in +recent years, where environment understanding is a crucial step for robot +navigation, human-robot interaction, and decision. Real-world robot systems +usually collect visual data from multiple sensors and are required to recognize +numerous objects and their movements in complex human-crowded settings. +Traditional benchmarks, with their reliance on single sensors and limited +object classes and scenarios, fail to provide the comprehensive environmental +understanding robots need for accurate navigation, interaction, and +decision-making. As an extension of JRDB dataset, we unveil JRDB-PanoTrack, a +novel open-world panoptic segmentation and tracking benchmark, towards more +comprehensive environmental perception. JRDB-PanoTrack includes (1) various +data involving indoor and outdoor crowded scenes, as well as comprehensive 2D +and 3D synchronized data modalities; (2) high-quality 2D spatial panoptic +segmentation and temporal tracking annotations, with additional 3D label +projections for further spatial understanding; (3) diverse object classes for +closed- and open-world recognition benchmarks, with OSPA-based metrics for +evaluation.
Extensive evaluation of leading methods shows significant +challenges posed by our dataset.",cs.CV,['cs.CV'] +Adaptive VIO: Deep Visual-Inertial Odometry with Online Continual Learning,Youqi Pan · Wugen Zhou · Yingdian Cao · Hongbin Zha, ,https://arxiv.org/html/2308.11228v2,,2308.11228v2.pdf,VIO-DualProNet: Visual-Inertial Odometry with Learning Based Process Noise Covariance,"Visual-inertial odometry (VIO) is a vital technique used in robotics, +augmented reality, and autonomous vehicles. It combines visual and inertial +measurements to accurately estimate position and orientation. Existing VIO +methods assume a fixed noise covariance for the inertial uncertainty. However, +accurately determining in real-time the noise variance of the inertial sensors +presents a significant challenge as the uncertainty changes throughout the +operation leading to suboptimal performance and reduced accuracy. To circumvent +this, we propose VIO-DualProNet, a novel approach that utilizes deep learning +methods to dynamically estimate the inertial noise uncertainty in real-time. By +designing and training a deep neural network to predict inertial noise +uncertainty using only inertial sensor measurements, and integrating it into +the VINS-Mono algorithm, we demonstrate a substantial improvement in accuracy +and robustness, enhancing VIO performance and potentially benefiting other +VIO-based systems for precise localization and mapping across diverse +conditions.",cs.RO,"['cs.RO', 'cs.SY', 'eess.SY']" +ZERO-IG: Zero-Shot Illumination-Guided Joint Denoising and Adaptive Enhancement for Low-Light Images,Yiqi Shi · Duo Liu · Liguo Zhang · Ye Tian · Xuezhi Xia · fuxiaojing,https://github.com/Doyle59217/ZeroIG,https://arxiv.org/abs/2311.02995,,2311.02995.pdf,Zero-Shot Enhancement of Low-Light Image Based on Retinex Decomposition,"Two difficulties here make low-light image enhancement a challenging task; +firstly, it needs to consider not only luminance restoration but also image +contrast, image denoising and color distortion issues simultaneously. Second, +the effectiveness of existing low-light enhancement methods depends on paired +or unpaired training data with poor generalization performance. + To solve these difficult problems, we propose in this paper a new +learning-based Retinex decomposition of zero-shot low-light enhancement method, +called ZERRINNet. To this end, we first designed the N-Net network, together +with the noise loss term, to be used for denoising the original low-light image +by estimating the noise of the low-light image. Moreover, RI-Net is used to +estimate the reflection component and illumination component, and in order to +solve the color distortion and contrast, we use the texture loss term and +segmented smoothing loss to constrain the reflection component and illumination +component. Finally, our method is a zero-reference enhancement method that is +not affected by the training data of paired and unpaired datasets, so our +generalization performance is greatly improved, and in the paper, we have +effectively validated it with a homemade real-life low-light dataset and +additionally with advanced vision tasks, such as face detection, target +recognition, and instance segmentation. We conducted comparative experiments on +a large number of public datasets and the results show that the performance of +our method is competitive compared to the current state-of-the-art methods. 
The +code is available at:https://github.com/liwenchao0615/ZERRINNet",cs.CV,"['cs.CV', 'cs.GR']" +Make Me a BNN: A Simple Strategy for Estimating Bayesian Uncertainty from Pre-trained Models,Gianni Franchi · Olivier Laurent · Maxence Leguéry · Andrei Bursuc · Andrea Pilzer · Angela Yao,https://ensta-u2is-ai.github.io/ABNN-Make-me-a-BNN/,https://arxiv.org/abs/2312.15297,,2312.15297.pdf,Make Me a BNN: A Simple Strategy for Estimating Bayesian Uncertainty from Pre-trained Models,"Deep Neural Networks (DNNs) are powerful tools for various computer vision +tasks, yet they often struggle with reliable uncertainty quantification - a +critical requirement for real-world applications. Bayesian Neural Networks +(BNN) are equipped for uncertainty estimation but cannot scale to large DNNs +that are highly unstable to train. To address this challenge, we introduce the +Adaptable Bayesian Neural Network (ABNN), a simple and scalable strategy to +seamlessly transform DNNs into BNNs in a post-hoc manner with minimal +computational and training overheads. ABNN preserves the main predictive +properties of DNNs while enhancing their uncertainty quantification abilities +through simple BNN adaptation layers (attached to normalization layers) and a +few fine-tuning steps on pre-trained models. We conduct extensive experiments +across multiple datasets for image classification and semantic segmentation +tasks, and our results demonstrate that ABNN achieves state-of-the-art +performance without the computational budget typically associated with ensemble +methods.",cs.LG,"['cs.LG', 'cs.CV', 'stat.ML']" +OpenBias: Open-set Bias Detection in Text-to-Image Generative Models,Moreno D'Incà · Elia Peruzzo · Massimiliano Mancini · Dejia Xu · Vidit Goel · Xingqian Xu · Zhangyang Wang · Humphrey Shi · Nicu Sebe,https://github.com/Picsart-AI-Research/OpenBias,https://arxiv.org/abs/2404.07990v1,,2404.07990v1.pdf,OpenBias: Open-set Bias Detection in Text-to-Image Generative Models,"Text-to-image generative models are becoming increasingly popular and +accessible to the general public. As these models see large-scale deployments, +it is necessary to deeply investigate their safety and fairness to not +disseminate and perpetuate any kind of biases. However, existing works focus on +detecting closed sets of biases defined a priori, limiting the studies to +well-known concepts. In this paper, we tackle the challenge of open-set bias +detection in text-to-image generative models presenting OpenBias, a new +pipeline that identifies and quantifies the severity of biases agnostically, +without access to any precompiled set. OpenBias has three stages. In the first +phase, we leverage a Large Language Model (LLM) to propose biases given a set +of captions. Secondly, the target generative model produces images using the +same set of captions. Lastly, a Vision Question Answering model recognizes the +presence and extent of the previously proposed biases. We study the behavior of +Stable Diffusion 1.5, 2, and XL emphasizing new biases, never investigated +before. 
Via quantitative experiments, we demonstrate that OpenBias agrees with +current closed-set bias detection methods and human judgement.",cs.CV,"['cs.CV', 'cs.AI']" +Depth-Aware Concealed Crop Detection in Dense Agricultural Scenes,Liqiong Wang · Jinyu Yang · Yanfu Zhang · Fangyi Wang · Feng Zheng,https://github.com/Kki2Eve/RISNet,,https://www.mdpi.com/1424-8220/24/6/1942,,,,,nan +GeoReF: Geometric Alignment Across Shape Variation for Category-level Object Pose Refinement,Linfang Zheng · Tze Ho Elden Tse · Chen Wang · Yinghan Sun · Hua Chen · Aleš Leonardis · Wei Zhang · Hyung Jin Chang,https://lynne-zheng-linfang.github.io/georef.github.io/,https://arxiv.org/abs/2404.11139v1,,2404.11139v1.pdf,GeoReF: Geometric Alignment Across Shape Variation for Category-level Object Pose Refinement,"Object pose refinement is essential for robust object pose estimation. +Previous work has made significant progress towards instance-level object pose +refinement. Yet, category-level pose refinement is a more challenging problem +due to large shape variations within a category and the discrepancies between +the target object and the shape prior. To address these challenges, we +introduce a novel architecture for category-level object pose refinement. Our +approach integrates an HS-layer and learnable affine transformations, which +aims to enhance the extraction and alignment of geometric information. +Additionally, we introduce a cross-cloud transformation mechanism that +efficiently merges diverse data sources. Finally, we push the limits of our +model by incorporating the shape prior information for translation and size +error prediction. We conducted extensive experiments to demonstrate the +effectiveness of the proposed framework. Through extensive quantitative +experiments, we demonstrate significant improvement over the baseline method by +a large margin across all metrics.",cs.CV,['cs.CV'] +Learning to Control Camera Exposure via Reinforcement Learning,Kyunghyun Lee · Ukcheol Shin · Byeong-Uk Lee,https://sites.google.com/view/drl-ae,https://arxiv.org/abs/2404.01636,,2404.01636.pdf,Learning to Control Camera Exposure via Reinforcement Learning,"Adjusting camera exposure in arbitrary lighting conditions is the first step +to ensure the functionality of computer vision applications. Poorly adjusted +camera exposure often leads to critical failure and performance degradation. +Traditional camera exposure control methods require multiple convergence steps +and time-consuming processes, making them unsuitable for dynamic lighting +conditions. In this paper, we propose a new camera exposure control framework +that rapidly controls camera exposure while performing real-time processing by +exploiting deep reinforcement learning. The proposed framework consists of four +contributions: 1) a simplified training ground to simulate real-world's diverse +and dynamic lighting changes, 2) flickering and image attribute-aware reward +design, along with lightweight state design for real-time processing, 3) a +static-to-dynamic lighting curriculum to gradually improve the agent's +exposure-adjusting capability, and 4) domain randomization techniques to +alleviate the limitation of the training ground and achieve seamless +generalization in the wild.As a result, our proposed method rapidly reaches a +desired exposure level within five steps with real-time processing (1 ms). 
+Also, the acquired images are well-exposed and show superiority in various +computer vision tasks, such as feature extraction and object detection.",cs.CV,"['cs.CV', 'cs.AI', 'cs.LG', 'cs.RO', 'cs.SY', 'eess.SY']" +Differentiable Neural Surface Refinement for Transparent Objects,Weijian Deng · Dylan Campbell · Chunyi Sun · Shubham Kanitkar · Matthew Shaffer · Stephen Gould,https://weijiandeng.xyz/nsr,,https://dl.acm.org/doi/abs/10.1145/3610548.3618236,,,,,nan +Discovering and Mitigating Visual Biases through Keyword Explanation,Younghyun Kim · Sangwoo Mo · Minkyu Kim · Kyungmin Lee · Jaeho Lee · Jinwoo Shin, ,,https://effl.postech.ac.kr/docs/research/papers/,,,,,nan +MiKASA: Multi-Key-Anchor Scene-Aware Transformer for 3D Visual Grounding,Chun-Peng Chang · Shaoxiang Wang · Alain Pagani · Didier Stricker, ,https://arxiv.org/abs/2403.03077,,,MiKASA: Multi-Key-Anchor & Scene-Aware Transformer for 3D Visual Grounding,"3D visual grounding involves matching natural language descriptions with +their corresponding objects in 3D spaces. Existing methods often face +challenges with accuracy in object recognition and struggle in interpreting +complex linguistic queries, particularly with descriptions that involve +multiple anchors or are view-dependent. In response, we present the MiKASA +(Multi-Key-Anchor Scene-Aware) Transformer. Our novel end-to-end trained model +integrates a self-attention-based scene-aware object encoder and an original +multi-key-anchor technique, enhancing object recognition accuracy and the +understanding of spatial relationships. Furthermore, MiKASA improves the +explainability of decision-making, facilitating error diagnosis. Our model +achieves the highest overall accuracy in the Referit3D challenge for both the +Sr3D and Nr3D datasets, particularly excelling by a large margin in categories +that require viewpoint-dependent descriptions.",cs.CV,['cs.CV'] +Confronting Ambiguity in 6D Object Pose Estimation via Score-Based Diffusion on SE(3),Tsu-Ching Hsiao · Hao-Wei Chen · Hsuan-Kung Yang · Chun-Yi Lee, ,https://arxiv.org/abs/2401.00029,,,6D-Diff: A Keypoint Diffusion Framework for 6D Object Pose Estimation,"Estimating the 6D object pose from a single RGB image often involves noise +and indeterminacy due to challenges such as occlusions and cluttered +backgrounds. Meanwhile, diffusion models have shown appealing performance in +generating high-quality images from random noise with high indeterminacy +through step-by-step denoising. Inspired by their denoising capability, we +propose a novel diffusion-based framework (6D-Diff) to handle the noise and +indeterminacy in object pose estimation for better performance. In our +framework, to establish accurate 2D-3D correspondence, we formulate 2D +keypoints detection as a reverse diffusion (denoising) process. To facilitate +such a denoising process, we design a Mixture-of-Cauchy-based forward diffusion +process and condition the reverse process on the object features. Extensive +experiments on the LM-O and YCB-V datasets demonstrate the effectiveness of our +framework.",cs.CV,['cs.CV'] +Text-guided Explorable Image Super-resolution,Kanchana Vaishnavi Gandikota · Paramanand Chandramouli, ,https://arxiv.org/abs/2403.01124,,2403.01124.pdf,Text-guided Explorable Image Super-resolution,"In this paper, we introduce the problem of zero-shot text-guided exploration +of the solutions to open-domain image super-resolution. 
Our goal is to allow +users to explore diverse, semantically accurate reconstructions that preserve +data consistency with the low-resolution inputs for different large +downsampling factors without explicitly training for these specific +degradations. We propose two approaches for zero-shot text-guided +super-resolution - i) modifying the generative process of text-to-image +\textit{T2I} diffusion models to promote consistency with low-resolution +inputs, and ii) incorporating language guidance into zero-shot diffusion-based +restoration methods. We show that the proposed approaches result in diverse +solutions that match the semantic meaning provided by the text prompt while +preserving data consistency with the degraded inputs. We evaluate the proposed +baselines for the task of extreme super-resolution and demonstrate advantages +in terms of restoration quality, diversity, and explorability of solutions.",cs.CV,['cs.CV'] +$CrowdDiff$: Multi-hypothesis Crowd Density Estimation using Diffusion Models,Yasiru Ranasinghe · Nithin Gopalakrishnan Nair · Wele Gedara Chaminda Bandara · Vishal M. Patel, ,,https://jarxiv.com/2024/04/05/crowddiff-multi-hypothesis-crowd-density-estimation-using-diffusion-models/,,,,,nan +Instruct-Imagen: Image Generation with Multi-modal Instruction,Hexiang Hu · Kelvin C.K. Chan · Yu-Chuan Su · Wenhu Chen · Yandong Li · Kihyuk Sohn · Yang Zhao · Xue Ben · William Cohen · Ming-Wei Chang · Xuhui Jia,https://instruct-imagen.github.io/,https://arxiv.org/abs/2401.01952,,2401.01952.pdf,Instruct-Imagen: Image Generation with Multi-modal Instruction,"This paper presents instruct-imagen, a model that tackles heterogeneous image +generation tasks and generalizes across unseen tasks. We introduce *multi-modal +instruction* for image generation, a task representation articulating a range +of generation intents with precision. It uses natural language to amalgamate +disparate modalities (e.g., text, edge, style, subject, etc.), such that +abundant generation intents can be standardized in a uniform format. + We then build instruct-imagen by fine-tuning a pre-trained text-to-image +diffusion model with a two-stage framework. First, we adapt the model using the +retrieval-augmented training, to enhance model's capabilities to ground its +generation on external multimodal context. Subsequently, we fine-tune the +adapted model on diverse image generation tasks that requires vision-language +understanding (e.g., subject-driven generation, etc.), each paired with a +multi-modal instruction encapsulating the task's essence. Human evaluation on +various image generation datasets reveals that instruct-imagen matches or +surpasses prior task-specific models in-domain and demonstrates promising +generalization to unseen and more complex tasks.",cs.CV,"['cs.CV', 'cs.AI', 'cs.CL']" +$MonoDiff$: Monocular 3D Object Detection and Pose Estimation with Diffusion Models,Yasiru Ranasinghe · Deepti Hegde · Vishal M. Patel, ,https://arxiv.org/abs/2403.18791,,,Object Pose Estimation via the Aggregation of Diffusion Features,"Estimating the pose of objects from images is a crucial task of 3D scene +understanding, and recent approaches have shown promising results on very large +benchmarks. However, these methods experience a significant performance drop +when dealing with unseen objects. We believe that it results from the limited +generalizability of image features. To address this problem, we have an +in-depth analysis on the features of diffusion models, e.g. 
Stable Diffusion, +which hold substantial potential for modeling unseen objects. Based on this +analysis, we then innovatively introduce these diffusion features for object +pose estimation. To achieve this, we propose three distinct architectures that +can effectively capture and aggregate diffusion features of different +granularity, greatly improving the generalizability of object pose estimation. +Our approach outperforms the state-of-the-art methods by a considerable margin +on three popular benchmark datasets, LM, O-LM, and T-LESS. In particular, our +method achieves higher accuracy than the previous best arts on unseen objects: +98.2% vs. 93.5% on Unseen LM, 85.9% vs. 76.3% on Unseen O-LM, showing the +strong generalizability of our method. Our code is released at +https://github.com/Tianfu18/diff-feats-pose.",cs.CV,['cs.CV'] +Towards More Unified In-context Visual Understanding,Dianmo Sheng · Dongdong Chen · Zhentao Tan · Qiankun Liu · Qi Chu · Jianmin Bao · Tao Gong · Bin Liu · Shengwei Xu · Nenghai Yu, ,https://arxiv.org/abs/2312.02520v2,,2312.02520v2.pdf,Towards More Unified In-context Visual Understanding,"The rapid advancement of large language models (LLMs) has accelerated the +emergence of in-context learning (ICL) as a cutting-edge approach in the +natural language processing domain. Recently, ICL has been employed in visual +understanding tasks, such as semantic segmentation and image captioning, +yielding promising results. However, existing visual ICL framework can not +enable producing content across multiple modalities, which limits their +potential usage scenarios. To address this issue, we present a new ICL +framework for visual understanding with multi-modal output enabled. First, we +quantize and embed both text and visual prompt into a unified representational +space, structured as interleaved in-context sequences. Then a decoder-only +sparse transformer architecture is employed to perform generative modeling on +them, facilitating in-context learning. Thanks to this design, the model is +capable of handling in-context vision understanding tasks with multimodal +output in a unified pipeline.Experimental results demonstrate that our model +achieves competitive performance compared with specialized models and previous +ICL baselines. Overall, our research takes a further step toward unified +multimodal in-context learning.",cs.CV,['cs.CV'] +Compositional Chain-of-Thought Prompting for Large Multimodal Models,Chancharik Mitra · Brandon Huang · Trevor Darrell · Roei Herzig, ,https://arxiv.org/abs/2311.17076,,2311.17076.pdf,Compositional Chain-of-Thought Prompting for Large Multimodal Models,"The combination of strong visual backbones and Large Language Model (LLM) +reasoning has led to Large Multimodal Models (LMMs) becoming the current +standard for a wide range of vision and language (VL) tasks. However, recent +research has shown that even the most advanced LMMs still struggle to capture +aspects of compositional visual reasoning, such as attributes and relationships +between objects. One solution is to utilize scene graphs (SGs)--a formalization +of objects and their relations and attributes that has been extensively used as +a bridge between the visual and textual domains. Yet, scene graph data requires +scene graph annotations, which are expensive to collect and thus not easily +scalable. Moreover, finetuning an LMM based on SG data can lead to catastrophic +forgetting of the pretraining objective. 
To overcome this, inspired by +chain-of-thought methods, we propose Compositional Chain-of-Thought (CCoT), a +novel zero-shot Chain-of-Thought prompting method that utilizes SG +representations in order to extract compositional knowledge from an LMM. +Specifically, we first generate an SG using the LMM, and then use that SG in +the prompt to produce a response. Through extensive experiments, we find that +the proposed CCoT approach not only improves LMM performance on several vision +and language VL compositional benchmarks but also improves the performance of +several popular LMMs on general multimodal benchmarks, without the need for +fine-tuning or annotated ground-truth SGs. Code: +https://github.com/chancharikmitra/CCoT",cs.CV,"['cs.CV', 'cs.AI', 'cs.CL', 'cs.LG']" +"CAMixerSR: Only Details Need More ""Attention""",Yan Wang · Yi Liu · Shijie Zhao · Junlin Li · Li zhang,https://github.com/icandle/CAMixerSR,https://arxiv.org/abs/2402.19289v2,,2402.19289v2.pdf,"CAMixerSR: Only Details Need More ""Attention""","To satisfy the rapidly increasing demands on the large image (2K-8K) +super-resolution (SR), prevailing methods follow two independent tracks: 1) +accelerate existing networks by content-aware routing, and 2) design better +super-resolution networks via token mixer refining. Despite directness, they +encounter unavoidable defects (e.g., inflexible route or non-discriminative +processing) limiting further improvements of quality-complexity trade-off. To +erase the drawbacks, we integrate these schemes by proposing a content-aware +mixer (CAMixer), which assigns convolution for simple contexts and additional +deformable window-attention for sparse textures. Specifically, the CAMixer uses +a learnable predictor to generate multiple bootstraps, including offsets for +windows warping, a mask for classifying windows, and convolutional attentions +for endowing convolution with the dynamic property, which modulates attention +to include more useful textures self-adaptively and improves the representation +capability of convolution. We further introduce a global classification loss to +improve the accuracy of predictors. By simply stacking CAMixers, we obtain +CAMixerSR which achieves superior performance on large-image SR, lightweight +SR, and omnidirectional-image SR.",eess.IV,"['eess.IV', 'cs.CV']" +Geometrically-informed aggregation for zero-shot point cloud understanding,Guofeng Mei · Luigi Riz · Yiming Wang · Fabio Poiesi, ,https://arxiv.org/abs/2312.02244,,2312.02244.pdf,Geometrically-driven Aggregation for Zero-shot 3D Point Cloud Understanding,"Zero-shot 3D point cloud understanding can be achieved via 2D Vision-Language +Models (VLMs). Existing strategies directly map Vision-Language Models from 2D +pixels of rendered or captured views to 3D points, overlooking the inherent and +expressible point cloud geometric structure. Geometrically similar or close +regions can be exploited for bolstering point cloud understanding as they are +likely to share semantic information. To this end, we introduce the first +training-free aggregation technique that leverages the point cloud's 3D +geometric structure to improve the quality of the transferred Vision-Language +Models. Our approach operates iteratively, performing local-to-global +aggregation based on geometric and semantic point-level reasoning. 
We benchmark +our approach on three downstream tasks, including classification, part +segmentation, and semantic segmentation, with a variety of datasets +representing both synthetic/real-world, and indoor/outdoor scenarios. Our +approach achieves new state-of-the-art results in all benchmarks. Our approach +operates iteratively, performing local-to-global aggregation based on geometric +and semantic point-level reasoning. Code and dataset are available at +https://luigiriz.github.io/geoze-website/",cs.CV,['cs.CV'] +CrossKD: Cross-Head Knowledge Distillation for Dense Object Detection,JiaBao Wang · yuming chen · Zhaohui Zheng · Xiang Li · Ming-Ming Cheng · Qibin Hou,https://github.com/jbwang1997/CrossKD,https://arxiv.org/abs/2306.11369,,2306.11369.pdf,CrossKD: Cross-Head Knowledge Distillation for Object Detection,"Knowledge Distillation (KD) has been validated as an effective model +compression technique for learning compact object detectors. Existing +state-of-the-art KD methods for object detection are mostly based on feature +imitation. In this paper, we present a general and effective prediction +mimicking distillation scheme, called CrossKD, which delivers the intermediate +features of the student's detection head to the teacher's detection head. The +resulting cross-head predictions are then forced to mimic the teacher's +predictions. This manner relieves the student's head from receiving +contradictory supervision signals from the annotations and the teacher's +predictions, greatly improving the student's detection performance. Moreover, +as mimicking the teacher's predictions is the target of KD, CrossKD offers more +task-oriented information in contrast with feature imitation. On MS COCO, with +only prediction mimicking losses applied, our CrossKD boosts the average +precision of GFL ResNet-50 with 1x training schedule from 40.2 to 43.7, +outperforming all existing KD methods. In addition, our method also works well +when distilling detectors with heterogeneous backbones. Code is available at +https://github.com/jbwang1997/CrossKD.",cs.CV,['cs.CV'] +DriveTrack: A Benchmark for Long-Range Point Tracking in Real-World Videos,Arjun Balasingam · Joseph Chandler · Chenning Li · Zhoutong Zhang · Hari Balakrishnan,https://drivetrack.csail.mit.edu/,https://arxiv.org/abs/2312.09523,,2312.09523.pdf,DriveTrack: A Benchmark for Long-Range Point Tracking in Real-World Videos,"This paper presents DriveTrack, a new benchmark and data generation framework +for long-range keypoint tracking in real-world videos. DriveTrack is motivated +by the observation that the accuracy of state-of-the-art trackers depends +strongly on visual attributes around the selected keypoints, such as texture +and lighting. The problem is that these artifacts are especially pronounced in +real-world videos, but these trackers are unable to train on such scenes due to +a dearth of annotations. DriveTrack bridges this gap by building a framework to +automatically annotate point tracks on autonomous driving datasets. We release +a dataset consisting of 1 billion point tracks across 24 hours of video, which +is seven orders of magnitude greater than prior real-world benchmarks and on +par with the scale of synthetic benchmarks. DriveTrack unlocks new use cases +for point tracking in real-world videos. First, we show that fine-tuning +keypoint trackers on DriveTrack improves accuracy on real-world scenes by up to +7%. 
Second, we analyze the sensitivity of trackers to visual artifacts in real +scenes and motivate the idea of running assistive keypoint selectors alongside +trackers.",cs.CV,['cs.CV'] +CAT-DM: Controllable Accelerated Virtual Try-on with Diffusion Model,Jianhao Zeng · Dan Song · Weizhi Nie · Hongshuo Tian · Tongtong Wang · An-An Liu,https://zengjianhao.github.io/CAT-DM,https://arxiv.org/abs/2311.18405,,2311.18405.pdf,CAT-DM: Controllable Accelerated Virtual Try-on with Diffusion Model,"Generative Adversarial Networks (GANs) dominate the research field in +image-based virtual try-on, but have not resolved problems such as unnatural +deformation of garments and the blurry generation quality. While the generative +quality of diffusion models is impressive, achieving controllability poses a +significant challenge when applying it to virtual try-on and multiple denoising +iterations limit its potential for real-time applications. In this paper, we +propose Controllable Accelerated virtual Try-on with Diffusion Model (CAT-DM). +To enhance the controllability, a basic diffusion-based virtual try-on network +is designed, which utilizes ControlNet to introduce additional control +conditions and improves the feature extraction of garment images. In terms of +acceleration, CAT-DM initiates a reverse denoising process with an implicit +distribution generated by a pre-trained GAN-based model. Compared with previous +try-on methods based on diffusion models, CAT-DM not only retains the pattern +and texture details of the in-shop garment but also reduces the sampling steps +without compromising generation quality. Extensive experiments demonstrate the +superiority of CAT-DM against both GAN-based and diffusion-based methods in +producing more realistic images and accurately reproducing garment patterns.",cs.CV,['cs.CV'] +Free3D: Consistent Novel View Synthesis without 3D Representation,Chuanxia Zheng · Andrea Vedaldi,https://chuanxiaz.com/free3d/,https://arxiv.org/abs/2312.04551,,2312.04551.pdf,Free3D: Consistent Novel View Synthesis without 3D Representation,"We introduce Free3D, a simple accurate method for monocular open-set novel +view synthesis (NVS). Similar to Zero-1-to-3, we start from a pre-trained 2D +image generator for generalization, and fine-tune it for NVS. Compared to other +works that took a similar approach, we obtain significant improvements without +resorting to an explicit 3D representation, which is slow and memory-consuming, +and without training an additional network for 3D reconstruction. Our key +contribution is to improve the way the target camera pose is encoded in the +network, which we do by introducing a new ray conditioning normalization (RCN) +layer. The latter injects pose information in the underlying 2D image generator +by telling each pixel its viewing direction. We further improve multi-view +consistency by using light-weight multi-view attention layers and by sharing +generation noise between the different views. We train Free3D on the Objaverse +dataset and demonstrate excellent generalization to new categories in new +datasets, including OmniObject3D and GSO.
The project page is available at +https://chuanxiaz.com/free3d/.",cs.CV,['cs.CV'] +InNeRF360: Text-Guided 3D-Consistent Object Inpainting on 360-degree Neural Radiance Fields,Dongqing Wang · Tong Zhang · Alaa Abboud · Sabine Süsstrunk, ,https://arxiv.org/html/2401.05335v1,,2401.05335v1.pdf,InseRF: Text-Driven Generative Object Insertion in Neural 3D Scenes,"We introduce InseRF, a novel method for generative object insertion in the +NeRF reconstructions of 3D scenes. Based on a user-provided textual description +and a 2D bounding box in a reference viewpoint, InseRF generates new objects in +3D scenes. Recently, methods for 3D scene editing have been profoundly +transformed, owing to the use of strong priors of text-to-image diffusion +models in 3D generative modeling. Existing methods are mostly effective in +editing 3D scenes via style and appearance changes or removing existing +objects. Generating new objects, however, remains a challenge for such methods, +which we address in this study. Specifically, we propose grounding the 3D +object insertion to a 2D object insertion in a reference view of the scene. The +2D edit is then lifted to 3D using a single-view object reconstruction method. +The reconstructed object is then inserted into the scene, guided by the priors +of monocular depth estimation methods. We evaluate our method on various 3D +scenes and provide an in-depth analysis of the proposed components. Our +experiments with generative insertion of objects in several 3D scenes indicate +the effectiveness of our method compared to the existing methods. InseRF is +capable of controllable and 3D-consistent object insertion without requiring +explicit 3D information as input. Please visit our project page at +https://mohamad-shahbazi.github.io/inserf.",cs.CV,"['cs.CV', 'cs.GR', 'cs.LG']" +Decompose-and-Compose: A Compositional Approach to Mitigating Spurious Correlation,Fahimeh Hosseini Noohdani · Parsa Hosseini · Aryan Yazdan Parast · Hamidreza Araghi · Mahdieh Baghshah, ,https://arxiv.org/abs/2402.18919,,2402.18919.pdf,Decompose-and-Compose: A Compositional Approach to Mitigating Spurious Correlation,"While standard Empirical Risk Minimization (ERM) training is proven effective +for image classification on in-distribution data, it fails to perform well on +out-of-distribution samples. One of the main sources of distribution shift for +image classification is the compositional nature of images. Specifically, in +addition to the main object or component(s) determining the label, some other +image components usually exist, which may lead to the shift of input +distribution between train and test environments. More importantly, these +components may have spurious correlations with the label. To address this +issue, we propose Decompose-and-Compose (DaC), which improves robustness to +correlation shift by a compositional approach based on combining elements of +images. Based on our observations, models trained with ERM usually highly +attend to either the causal components or the components having a high spurious +correlation with the label (especially in datapoints on which models have a +high confidence). In fact, according to the amount of spurious correlation and +the easiness of classification based on the causal or non-causal components, +the model usually attends to one of these more (on samples with high +confidence). Following this, we first try to identify the causal components of +images using class activation maps of models trained with ERM. 
Afterward, we +intervene on images by combining them and retraining the model on the augmented +data, including the counterfactual ones. Along with its high interpretability, +this work proposes a group-balancing method by intervening on images without +requiring group labels or information regarding the spurious features during +training. The method has an overall better worst group accuracy compared to +previous methods with the same amount of supervision on the group labels in +correlation shift.",cs.CV,"['cs.CV', 'cs.LG']"