LG - Machine Learning   CV - Computer Vision   CL - Computation and Language   AS - Audio and Speech   RO - Robotics

Reposted from 爱可可爱生活

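Summary: skeleton-free pose transfer for 3D characters; real-time, controllable and generalisable animation of volumetric representations; learning to solve hard minimal problems; a 3D generative model for animatable human avatars; efficient long-text understanding with short-text models; depth field networks for generalizable multi-view scene representation; 3D cartoon face generation with controllable expressions from a single GAN image; grouping and fusing Transformer layers for neural machine translation; an interpretability evaluation benchmark for pre-trained language models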

 

1. [CV] Skeleton-free Pose Transfer for Stylized 3D Characters

Z Liao, J Yang, J Saito, G Pons-Moll, Y Zhou

[Saarland University & Adobe Research & University of Tübingen]

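This paper proposes a method for automatically transferring poses between stylized 3D characters without skeletal rigging. Unlike earlier attempts that learn pose transfer on fixed or topology-equivalent skeleton templates, the method targets a new setting: skeleton-free characters with differing shapes, topologies, and mesh connectivity. The key idea is to represent characters in a unified articulation model so that pose can be transferred through corresponding parts. To this end, a new pose transfer network jointly predicts the character's skinning weights and deformation transformations, articulating the target character to match the desired pose. The method is trained in a semi-supervised manner, using all available character data with paired or unpaired poses and stylized shapes, and it generalizes well to unseen stylized characters and inanimate objects. Extensive experiments demonstrate its effectiveness on this new task.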

We present the first method that automatically transfers poses between stylized 3D characters without skeletal rigging. In contrast to previous attempts to learn pose transformations on fixed or topology-equivalent skeleton templates, our method focuses on a novel scenario to handle skeleton-free characters with diverse shapes, topologies, and mesh connectivities. The key idea of our method is to represent the characters in a unified articulation model so that the pose can be transferred through the correspondent parts. To achieve this, we propose a novel pose transfer network that predicts the character skinning weights and deformation transformations jointly to articulate the target character to match the desired pose. Our method is trained in a semi-supervised manner absorbing all existing character data with paired/unpaired poses and stylized shapes. It generalizes well to unseen stylized characters and inanimate objects. We conduct extensive experiments and demonstrate the effectiveness of our method on this novel task.

https://arxiv.org/abs/2208.00790
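
How predicted skinning weights and per-part transformations can articulate a mesh may be easier to see in code. The following is a minimal sketch using standard linear blend skinning; the abstract does not state which blending formula the authors use, so the formula, the 3x3-matrix-plus-translation layout, and the names `apply_transform` and `blend_skinning` are assumptions for illustration only.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
# Assumed layout for a part transform: a 3x3 matrix plus a translation.
Transform = Tuple[Sequence[Sequence[float]], Vec3]


def apply_transform(tf: Transform, v: Vec3) -> Vec3:
    """Apply one part transform (matrix, then translation) to a vertex."""
    mat, trans = tf
    return tuple(
        sum(mat[r][c] * v[c] for c in range(3)) + trans[r] for r in range(3)
    )


def blend_skinning(
    vertices: Sequence[Vec3],
    weights: Sequence[Sequence[float]],
    transforms: Sequence[Transform],
) -> List[Vec3]:
    """Pose each vertex as the weight-blended result of all part transforms.

    weights[i][k] is the influence of part k on vertex i; each row is
    expected to sum to 1.
    """
    posed = []
    for v, w_row in zip(vertices, weights):
        acc = [0.0, 0.0, 0.0]
        for w, tf in zip(w_row, transforms):
            tv = apply_transform(tf, v)
            for i in range(3):
                acc[i] += w * tv[i]
        posed.append(tuple(acc))
    return posed


# Two hypothetical parts: one stays put, one rotates 90 degrees about z.
IDENTITY = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
ROT_Z_90 = ((0.0, -1.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.0, 1.0))
transforms = [(IDENTITY, (0.0, 0.0, 0.0)), (ROT_Z_90, (0.0, 0.0, 0.0))]

vertices = [(1.0, 0.0, 0.0)] * 3
weights = [(1.0, 0.0), (0.0, 1.0), (0.5, 0.5)]  # fully part 0, fully part 1, shared
posed = blend_skinning(vertices, weights, transforms)
```

In this toy setup the first vertex follows the static part, the second follows the rotated part, and the third lands halfway between the two. Because the weights are defined per vertex rather than per bone, nothing in the blending step requires a skeleton or a fixed mesh topology.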

 

2. [CV] VolTeMorph: Realtime, Controllable and Generalisable Animation of Volumetric Representations

S J. Garbin, M Kowalski, V Estellers, S Szymanowicz, S Rezaeifar, J Shen, M Johnson, J Valentin

[Microsoft]

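Volumetric representations for scene reconstruction and novel view synthesis have grown in popularity, renewing interest in animating volumetric content at high visual quality and in real time. Implicit deformation methods based on learned functions can produce impressive results, but they are "black boxes" to artists and content creators, need large amounts of training data to generalise meaningfully, and do not extrapolate realistically beyond the training data. This paper addresses these issues with a volume deformation method that runs in real time, is easy to edit with off-the-shelf software, and extrapolates convincingly. To show its versatility, the method is applied in two scenarios: physics-based object deformation, and telepresence in which avatars are controlled with blendshapes. Thorough experiments show that it compares favourably with both volumetric approaches combined with implicit deformation and methods based on mesh deformation.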

The recent increase in popularity of volumetric representations for scene reconstruction and novel view synthesis has put renewed focus on animating volumetric content at high visual quality and in real-time. While implicit deformation methods based on learned functions can produce impressive results, they are 'black boxes' to artists and content creators, they require large amounts of training data to generalise meaningfully, and they do not produce realistic extrapolations outside the training data. In this work we solve these issues by introducing a volume deformation method which is real-time, easy to edit with off-the-shelf software and can extrapolate convincingly. To demonstrate the versatility of our method, we apply it in two scenarios: physics-based object deformation and telepresence where avatars are controlled using blendshapes. We also perform thorough experiments showing that our method compares favourably to both volumetric approaches combined with implicit deformation and methods based on mesh deformation.

https://arxiv.org/abs/2208.00949

 

3. [CV] Learning to Solve Hard Minimal Problems

P Hruby, T Duff, A Leykin, T Pajdla

[ETH Zurich & University of Washington & Georgia Institute of Technology & Czech Technical University in Prague]

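This paper presents an approach to solving hard geometric optimization problems within the RANSAC framework. Hard minimal problems arise when the original geometric optimization problem is relaxed into a minimal problem with many spurious solutions. The approach avoids computing large numbers of spurious solutions: a learned strategy selects a starting problem-solution pair that can be numerically continued to the problem and solution of interest. It is demonstrated with a RANSAC solver for the relative pose of three calibrated cameras, via a minimal relaxation using four points in each view, which solves a single problem in under 70 μs on average. The engineering choices are also benchmarked and studied on the familiar problem of computing the relative pose of two calibrated cameras, via the minimal case of five points in two views.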

We present an approach to solving hard geometric optimization problems in the RANSAC framework. The hard minimal problems arise from relaxing the original geometric optimization problem into a minimal problem with many spurious solutions. Our approach avoids computing large numbers of spurious solutions. We design a learning strategy for selecting a starting problem-solution pair that can be numerically continued to the problem and the solution of interest. We demonstrate our approach by developing a RANSAC solver for the problem of computing the relative pose of three calibrated cameras, via a minimal relaxation using four points in each view. On average, we can solve a single problem in under 70 μs. We also benchmark and study our engineering choices on the very familiar problem of computing the relative pose of two calibrated cameras, via the minimal case of five points in two views.

https://arxiv.org/abs/2112.03424
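
The idea of numerically continuing a known problem-solution pair to the problem of interest can be illustrated on a toy one-parameter family. This is a minimal sketch, not the authors' solver: the family x^2 - p = 0, the straight-line parameter path, and the names `track_solution`, `steps`, and `newton_iters` are all assumptions chosen for illustration.

```python
from typing import Callable


def track_solution(
    f: Callable[[float, float], float],
    dfdx: Callable[[float, float], float],
    x_start: float,
    p_start: float,
    p_target: float,
    steps: int = 50,
    newton_iters: int = 5,
) -> float:
    """Follow one root of f(x, p) = 0 as p moves from p_start to p_target.

    x_start must already solve the start problem f(x, p_start) = 0.
    """
    x = x_start
    for k in range(1, steps + 1):
        t = k / steps
        p = (1.0 - t) * p_start + t * p_target  # straight-line parameter path
        for _ in range(newton_iters):  # Newton corrector at the current p
            x = x - f(x, p) / dfdx(x, p)
    return x


# Toy problem family: x^2 - p = 0, which has two roots for every p > 0.
def f(x: float, p: float) -> float:
    return x * x - p


def dfdx(x: float, p: float) -> float:
    return 2.0 * x


# Start pair (p = 4, x = 2) continued to the target problem p = 9.
positive_root = track_solution(f, dfdx, x_start=2.0, p_start=4.0, p_target=9.0)
# A different start pair (p = 4, x = -2) leads to the other root.
negative_root = track_solution(f, dfdx, x_start=-2.0, p_start=4.0, p_target=9.0)
```

Each call tracks a single solution rather than enumerating all of them, and the choice of start pair determines which target solution is reached. That is why picking a good start pair matters when most solutions of the target problem are spurious.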

 

4. [CV] AvatarGen: a 3D Generative Model for Animatable Human Avatars

J Zhang, Z Jiang, D Yang, H Xu, Y Shi, G Song, Z Xu, X Wang, J Feng

[National University of Singapore & ByteDance]

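Unsupervised generation of clothed virtual humans with varied appearance and animatable poses is important for creating 3D human avatars and for other AR/VR applications. Existing methods are either limited to rigid object modeling or are not generative, so they cannot synthesize high-quality virtual humans and animate them. This paper proposes AvatarGen, a method that generates non-rigid humans with diverse appearance, gives full control over pose and viewpoint, and needs only 2D images for training. Specifically, it extends recent 3D GANs to clothed human generation by using a coarse human body model as a proxy to warp the observation space into a standard avatar in a canonical space. To model non-rigid dynamics, a deformation network learns pose-dependent deformations in the canonical space. To improve the geometry quality of generated avatars, it uses a signed distance field as the geometric representation, which lets the body model regularize geometry learning more directly. With these designs, the method generates animatable human avatars with high-quality appearance and geometry, significantly outperforming previous 3D GANs. It also supports applications such as single-view reconstruction, reanimation, and text-guided synthesis.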

Unsupervised generation of clothed virtual humans with various appearance and animatable poses is important for creating 3D human avatars and other AR/VR applications. Existing methods are either limited to rigid object modeling, or not generative and thus unable to synthesize high-quality virtual humans and animate them. In this work, we propose AvatarGen, the first method that enables not only non-rigid human generation with diverse appearance but also full control over poses and viewpoints, while only requiring 2D images for training. Specifically, it extends the recent 3D GANs to clothed human generation by utilizing a coarse human body model as a proxy to warp the observation space into a standard avatar under a canonical space. To model non-rigid dynamics, it introduces a deformation network to learn pose-dependent deformations in the canonical space. To improve geometry quality of the generated human avatars, it leverages signed distance field as geometric representation, which allows more direct regularization from the body model on the geometry learning. Benefiting from these designs, our method can generate animatable human avatars with high-quality appearance and geometry modeling, significantly outperforming previous 3D GANs. Furthermore, it is competent for many applications, e.g., single-view reconstruction, reanimation, and text-guided synthesis. Code and pre-trained model will be available.

https://arxiv.org/abs/2208.00561

 

5. [CL] Efficient Long-Text Understanding with Short-Text Models

M Ivgi, U Shaham, J Berant

[Tel-Aviv University]

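Transformer-based pretrained language models (LMs) are ubiquitous in natural language understanding, but their quadratic complexity prevents them from being applied to long sequences such as stories, scientific articles, and long documents. Many efficient Transformer variants have been proposed, but they typically rely on custom implementations that require expensive pretraining from scratch. This paper proposes SLED (SLiding-Encoder and Decoder), a simple approach to processing long sequences that reuses battle-tested short-text pretrained LMs. The input is partitioned into overlapping chunks, each chunk is encoded with a short-text LM encoder, and the pretrained decoder fuses information across chunks (fusion-in-decoder). Controlled experiments show that SLED is a viable strategy for long-text understanding. Evaluated on SCROLLS, a benchmark of seven datasets spanning a wide range of language understanding tasks, SLED is competitive with specialized models that are up to 50x larger and require a dedicated, expensive pretraining step.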

Transformer-based pretrained language models (LMs) are ubiquitous across natural language understanding, but cannot be applied to long sequences such as stories, scientific articles and long documents, due to their quadratic complexity. While a myriad of efficient transformer variants have been proposed, they are typically based on custom implementations that require expensive pretraining from scratch. In this work, we propose SLED: SLiding-Encoder and Decoder, a simple approach for processing long sequences that re-uses and leverages battle-tested short-text pretrained LMs. Specifically, we partition the input into overlapping chunks, encode each with a short-text LM encoder and use the pretrained decoder to fuse information across chunks (fusion-in-decoder). We illustrate through controlled experiments that SLED offers a viable strategy for long text understanding and evaluate our approach on SCROLLS, a benchmark with seven datasets across a wide range of language understanding tasks. We find that SLED is competitive with specialized models that are up to 50x larger and require a dedicated and expensive pretraining step.

https://arxiv.org/abs/2208.00748
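
The chunk-then-fuse pattern described in the abstract can be sketched as follows. This is a rough illustration under stated assumptions, not the SLED implementation: `overlapping_chunks` and `toy_encode` are hypothetical names, the chunk length and overlap are arbitrary, and the stand-in encoder only tags tokens where a real short-text LM encoder would produce hidden states.

```python
from typing import List, Sequence


def overlapping_chunks(
    tokens: Sequence[int], chunk_len: int, overlap: int
) -> List[List[int]]:
    """Split tokens into windows of chunk_len that share `overlap` tokens."""
    if chunk_len <= 0 or not 0 <= overlap < chunk_len:
        raise ValueError("need chunk_len > 0 and 0 <= overlap < chunk_len")
    stride = chunk_len - overlap
    chunks = []
    start = 0
    while True:
        chunks.append(list(tokens[start:start + chunk_len]))
        if start + chunk_len >= len(tokens):  # this window reached the end
            break
        start += stride
    return chunks


def toy_encode(chunk: Sequence[int]) -> List[str]:
    """Stand-in for a short-text encoder: one 'state' per input token."""
    return [f"enc({tok})" for tok in chunk]


tokens = list(range(10))  # a 'long' input of 10 token ids
chunks = overlapping_chunks(tokens, chunk_len=4, overlap=2)

# Each chunk is encoded independently, so encoder cost grows linearly with
# the number of chunks instead of quadratically with the full input length.
encoded = [toy_encode(chunk) for chunk in chunks]

# Fusion-in-decoder style input: the decoder would attend over the
# concatenation of all encoded chunks.
decoder_input = [state for chunk_states in encoded for state in chunk_states]
```

With these toy settings the ten tokens become four chunks of four tokens each, and every pair of neighbouring chunks shares two tokens, so no token is ever encoded without some surrounding context.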

 

Other papers worth noting:

 

[CV] Depth Field Networks for Generalizable Multi-view Scene Representation

V Guizilini, I Vasiljevic, J Fang, R Ambrus, G Shakhnarovich...

[Toyota Research Institute & Toyota Technological Institute at Chicago]

https://arxiv.org/abs/2207.14287

 

[CV] 3D Cartoon Face Generation with Controllable Expressions from a Single GAN Image

H Wang, G Lin, S C. H. Hoi, C Miao

[Nanyang Technological University & Singapore Management University]

https://arxiv.org/abs/2207.14425

 

[CL] GTrans: Grouping and Fusing Transformer Layers for Neural Machine Translation

J Yang, Y Yin, S Ma, H Huang, D Zhang, F Wei, Z Li

[Beihang University & Microsoft Research Asia]

https://arxiv.org/abs/2207.14467

 

[CL] An Interpretability Evaluation Benchmark for Pre-trained Language Models

Y Shen, L Wang, Y Chen, X Xiao, J Liu, H Wu

[Baidu Inc]

https://arxiv.org/abs/2207.13948