LG - Machine Learning CV - Computer Vision CL - Computation and Language AS - Audio and Speech RO - Robotics
1. [CV] Skeleton-free Pose Transfer for Stylized 3D Characters
Z Liao, J Yang, J Saito, G Pons-Moll, Y Zhou
[Saarland University & Adobe Research & University of Tübingen]
We present the first method that automatically transfers poses between stylized 3D characters without skeletal rigging. In contrast to previous attempts to learn pose transformations on fixed or topology-equivalent skeleton templates, our method focuses on a novel scenario to handle skeleton-free characters with diverse shapes, topologies, and mesh connectivities. The key idea of our method is to represent the characters in a unified articulation model so that the pose can be transferred through the correspondent parts. To achieve this, we propose a novel pose transfer network that predicts the character skinning weights and deformation transformations jointly to articulate the target character to match the desired pose. Our method is trained in a semi-supervised manner absorbing all existing character data with paired/unpaired poses and stylized shapes. It generalizes well to unseen stylized characters and inanimate objects. We conduct extensive experiments and demonstrate the effectiveness of our method on this novel task.
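The abstract says the network predicts skinning weights and per-part deformation transformations, then articulates the target character with them. A minimal sketch of that final articulation step is shown below, assuming standard linear blend skinning in 2D with rigid transforms. The abstract does not state the exact formulation, so the function names (`apply_transform`, `blend_skinning`) and the 2D simplification are illustrative assumptions, and the weights and transforms here are hand-written rather than predicted by a network.

```python
import math
from typing import List, Sequence, Tuple

Vec2 = Tuple[float, float]
# Hypothetical rigid 2D transform: (rotation angle in radians, translation).
Transform = Tuple[float, Vec2]


def apply_transform(t: Transform, p: Vec2) -> Vec2:
    """Rotate point p about the origin, then translate it."""
    angle, (tx, ty) = t
    c, s = math.cos(angle), math.sin(angle)
    return (c * p[0] - s * p[1] + tx, s * p[0] + c * p[1] + ty)


def blend_skinning(
    vertices: Sequence[Vec2],
    weights: Sequence[Sequence[float]],
    transforms: Sequence[Transform],
) -> List[Vec2]:
    """Pose each vertex as the weighted sum of every part's transform applied to it.

    weights[i][k] is the influence of part k on vertex i; each row must sum to 1.
    """
    posed = []
    for v, w in zip(vertices, weights):
        if len(w) != len(transforms):
            raise ValueError("one weight per part is required for every vertex")
        if abs(sum(w) - 1.0) > 1e-6:
            raise ValueError("skinning weights of a vertex must sum to 1")
        x = y = 0.0
        for w_k, t_k in zip(w, transforms):
            px, py = apply_transform(t_k, v)
            x += w_k * px
            y += w_k * py
        posed.append((x, y))
    return posed


if __name__ == "__main__":
    # Two parts: part 0 stays fixed, part 1 shifts by 2 along x.
    parts = [(0.0, (0.0, 0.0)), (0.0, (2.0, 0.0))]
    print(blend_skinning([(1.0, 0.0)], [[0.5, 0.5]], parts))
```

Because the pose is expressed per part rather than per bone of a fixed skeleton, the same set of part transforms can in principle be applied to any mesh for which per-vertex weights are available, which is the property the abstract relies on.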
2. [CV] VolTeMorph: Realtime, Controllable and Generalisable Animation of Volumetric Representations
S J. Garbin, M Kowalski, V Estellers, S Szymanowicz, S Rezaeifar, J Shen, M Johnson, J Valentin
The recent increase in popularity of volumetric representations for scene reconstruction and novel view synthesis has put renewed focus on animating volumetric content at high visual quality and in real-time. While implicit deformation methods based on learned functions can produce impressive results, they are "black boxes" to artists and content creators, they require large amounts of training data to generalise meaningfully, and they do not produce realistic extrapolations outside the training data. In this work we solve these issues by introducing a volume deformation method which is real-time, easy to edit with off-the-shelf software and can extrapolate convincingly. To demonstrate the versatility of our method, we apply it in two scenarios: physics-based object deformation and telepresence where avatars are controlled using blendshapes. We also perform thorough experiments showing that our method compares favourably to both volumetric approaches combined with implicit deformation and methods based on mesh deformation.
3. [CV] Learning to Solve Hard Minimal Problems
P Hruby, T Duff, A Leykin, T Pajdla
[ETH Zurich & University of Washington & Georgia Institute of Technology & Czech Technical University in Prague]
We present an approach to solving hard geometric optimization problems in the RANSAC framework. The hard minimal problems arise from relaxing the original geometric optimization problem into a minimal problem with many spurious solutions. Our approach avoids computing large numbers of spurious solutions. We design a learning strategy for selecting a starting problem-solution pair that can be numerically continued to the problem and the solution of interest. We demonstrate our approach by developing a RANSAC solver for the problem of computing the relative pose of three calibrated cameras, via a minimal relaxation using four points in each view. On average, we can solve a single problem in under 70 μs. We also benchmark and study our engineering choices on the very familiar problem of computing the relative pose of two calibrated cameras, via the minimal case of five points in two views.
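The idea of numerically continuing a known problem-solution pair to the problem of interest can be illustrated on a toy equation. The sketch below is not the authors' solver: it assumes a single real-valued equation f(x; p) = x² − p, a straight-line path in the parameter p, a fixed number of steps, and a plain Newton corrector, whereas practical homotopy continuation typically works over complex numbers with adaptive step sizes. The function name `track_root` and its parameters are hypothetical.

```python
from typing import Callable


def track_root(
    f: Callable[[float, float], float],
    dfdx: Callable[[float, float], float],
    x0: float,
    p0: float,
    p1: float,
    steps: int = 50,
    newton_iters: int = 5,
) -> float:
    """Continue a known root x0 of f(x, p0) = 0 to a root of f(x, p1) = 0.

    The parameter moves from p0 to p1 in equal steps; after each step the
    previous root is corrected with a few Newton iterations.
    """
    x = x0
    for i in range(1, steps + 1):
        t = i / steps
        p = (1.0 - t) * p0 + t * p1  # straight-line path in parameter space
        for _ in range(newton_iters):
            x -= f(x, p) / dfdx(x, p)
    return x


def f(x: float, p: float) -> float:
    """Toy problem: x^2 - p = 0 has two roots, +sqrt(p) and -sqrt(p)."""
    return x * x - p


def dfdx(x: float, p: float) -> float:
    """Derivative of f with respect to x."""
    return 2.0 * x


if __name__ == "__main__":
    # Start pair: p0 = 1 with known root x0 = 1. Target problem: p1 = 4.
    print(track_root(f, dfdx, x0=1.0, p0=1.0, p1=4.0))
```

Only the root connected to the chosen start pair is tracked; starting from x0 = −1 instead leads to the other root. This is why the choice of start pair matters: a good choice reaches the solution of interest without computing the spurious ones.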
4. [CV] AvatarGen: a 3D Generative Model for Animatable Human Avatars
J Zhang, Z Jiang, D Yang, H Xu, Y Shi, G Song, Z Xu, X Wang, J Feng
[National University of Singapore & ByteDance]
AvatarGen: a 3D generative model for animatable human avatars. Trained on 2D images only, it generates non-rigid clothed humans with diverse appearance and full control over pose and viewpoint. It uses a coarse human body model as a proxy to warp the observation space into a canonical space, a deformation network to learn pose-dependent deformations, and a signed distance field as the geometric representation. Reported applications include single-view reconstruction, reanimation, and text-guided synthesis.
Unsupervised generation of clothed virtual humans with various appearance and animatable poses is important for creating 3D human avatars and other AR/VR applications. Existing methods are either limited to rigid object modeling, or not generative and thus unable to synthesize high-quality virtual humans and animate them. In this work, we propose AvatarGen, the first method that enables not only non-rigid human generation with diverse appearance but also full control over poses and viewpoints, while only requiring 2D images for training. Specifically, it extends the recent 3D GANs to clothed human generation by utilizing a coarse human body model as a proxy to warp the observation space into a standard avatar under a canonical space. To model non-rigid dynamics, it introduces a deformation network to learn pose-dependent deformations in the canonical space. To improve geometry quality of the generated human avatars, it leverages signed distance field as geometric representation, which allows more direct regularization from the body model on the geometry learning. Benefiting from these designs, our method can generate animatable human avatars with high-quality appearance and geometry modeling, significantly outperforming previous 3D GANs. Furthermore, it is competent for many applications, e.g., single-view reconstruction, reanimation, and text-guided synthesis. Code and pre-trained model will be available.
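For readers unfamiliar with signed distance fields, the following sketch shows what such a representation encodes, using a sphere as a stand-in for a body surface. It is only an illustration: the sign convention (negative inside, positive outside) and the function name `sdf_sphere` are assumptions, and AvatarGen learns its field with a network rather than using a closed-form shape.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]


def sdf_sphere(point: Vec3, center: Vec3, radius: float) -> float:
    """Signed distance from point to a sphere's surface.

    Negative inside the sphere, zero on the surface, positive outside.
    """
    return math.dist(point, center) - radius


if __name__ == "__main__":
    origin = (0.0, 0.0, 0.0)
    for p in [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]:
        print(p, sdf_sphere(p, origin, 1.0))
```

The surface is the zero level set of the field, so a coarse body model can constrain the geometry directly, for example by penalizing field values that disagree with the distance to the body model.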
5. [CL] Efficient Long-Text Understanding with Short-Text Models
M Ivgi, U Shaham, J Berant
[Tel-Aviv University]
SLED: efficient long-text understanding with short-text models. SLED (SLiding-Encoder and Decoder) reuses pretrained short-text LMs for long inputs: it partitions the input into overlapping chunks, encodes each chunk with a short-text LM encoder, and lets the pretrained decoder fuse information across chunks (fusion-in-decoder). On SCROLLS, a benchmark of seven datasets, it is competitive with specialized models that are up to 50x larger and require dedicated, expensive pretraining.
Transformer-based pretrained language models (LMs) are ubiquitous across natural language understanding, but cannot be applied to long sequences such as stories, scientific articles and long documents, due to their quadratic complexity. While a myriad of efficient transformer variants have been proposed, they are typically based on custom implementations that require expensive pretraining from scratch. In this work, we propose SLED: SLiding-Encoder and Decoder, a simple approach for processing long sequences that re-uses and leverages battle-tested short-text pretrained LMs. Specifically, we partition the input into overlapping chunks, encode each with a short-text LM encoder and use the pretrained decoder to fuse information across chunks (fusion-in-decoder). We illustrate through controlled experiments that SLED offers a viable strategy for long text understanding and evaluate our approach on SCROLLS, a benchmark with seven datasets across a wide range of language understanding tasks. We find that SLED is competitive with specialized models that are up to 50x larger and require a dedicated and expensive pretraining step.
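The chunking step described above can be sketched as follows. This is a minimal illustration, not the SLED implementation: the function name `overlapping_chunks` and the parameters `chunk_len` and `overlap` are hypothetical, tokens are plain integers, and the encoder and fusion-in-decoder stages are not shown.

```python
from typing import List, Sequence


def overlapping_chunks(
    tokens: Sequence[int], chunk_len: int, overlap: int
) -> List[List[int]]:
    """Split tokens into windows of at most chunk_len tokens.

    Adjacent windows share `overlap` tokens, so each window starts
    chunk_len - overlap positions after the previous one.
    """
    if chunk_len <= 0:
        raise ValueError("chunk_len must be positive")
    if not 0 <= overlap < chunk_len:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_len")
    if len(tokens) == 0:
        return []

    stride = chunk_len - overlap
    chunks: List[List[int]] = []
    start = 0
    while True:
        chunks.append(list(tokens[start:start + chunk_len]))
        if start + chunk_len >= len(tokens):
            break  # this window already reaches the end of the input
        start += stride
    return chunks


if __name__ == "__main__":
    for chunk in overlapping_chunks(list(range(10)), chunk_len=4, overlap=2):
        print(chunk)
```

Each chunk fits within the short-text encoder's context window, so the encoding cost grows linearly with the number of chunks instead of quadratically with the full input length. The overlap gives tokens near a chunk boundary some surrounding context on both sides.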
[CV] Depth Field Networks for Generalizable Multi-view Scene Representation
V Guizilini, I Vasiljevic, J Fang, R Ambrus, G Shakhnarovich...
[Toyota Research Institute & Toyota Technological Institute at Chicago]
[CV] 3D Cartoon Face Generation with Controllable Expressions from a Single GAN Image
H Wang, G Lin, S C. H. Hoi, C Miao
[Nanyang Technological University & Singapore Management University]
[CL] GTrans: Grouping and Fusing Transformer Layers for Neural Machine Translation
J Yang, Y Yin, S Ma, H Huang, D Zhang, F Wei, Z Li
[Beihang University & Microsoft Research Asia]
[CL] An Interpretability Evaluation Benchmark for Pre-trained Language Models
Y Shen, L Wang, Y Chen, X Xiao, J Liu, H Wu
[Baidu Inc]