Implementation of NÜWA, state of the art attention network for text to video synthesis, in Pytorch

Last update: Dec 28, 2022

Overview

NÜWA - Pytorch (wip)

Implementation of NÜWA, a state-of-the-art attention network for text-to-video synthesis, in PyTorch. This repository will be populated if Microsoft does not open-source the code by the end of December. It may also contain an extension into video and audio, using a dual decoder approach.


Citations

@misc{wu2021nuwa,
    title   = {N\"UWA: Visual Synthesis Pre-training for Neural visUal World creAtion}, 
    author  = {Chenfei Wu and Jian Liang and Lei Ji and Fan Yang and Yuejian Fang and Daxin Jiang and Nan Duan},
    year    = {2021},
    eprint  = {2111.12417},
    archivePrefix = {arXiv},
    primaryClass = {cs.CV}
}

Comments

Question about generated videos?

The generated videos contain a lot of negative numbers and very small decimals (like 5e-1), even though the loss decreases normally during training. Is that normal? How can I make the result visible?
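One generic way to make raw float outputs viewable is to rescale them to the 0..255 pixel range before saving frames. The sketch below is only an illustration: the value range the model actually emits is not stated in this thread, so it uses plain min-max scaling, and `to_uint8` and the sample values are hypothetical names and data, not part of the repository.

```python
def to_uint8(values):
    """Min-max rescale a flat list of floats to integers in [0, 255]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant input: map everything to mid-gray to avoid dividing by zero.
        return [128 for _ in values]
    scale = 255.0 / (hi - lo)
    return [int(round((v - lo) * scale)) for v in values]


# Hypothetical raw outputs for a handful of pixels (assumption, not real data).
raw_pixels = [-1.0, -0.4, 0.0, 0.5, 1.0]
print(to_uint8(raw_pixels))
```

If the decoder is known to emit values in a fixed range (for example after a tanh), scaling with that fixed range instead of the per-video min and max keeps brightness consistent across frames.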

opened by Fitzwong 0

Why the video does not pass through the encoder?

Hi, lucidrains! Thanks for providing a great repo that makes the NUWA paper easier to understand.

I have the following question. In the NUWA paper, the inputs of the Encoder are the caption tokens (caption condition) and the video tokens (3DNA condition). So, in my view, the video token sequence should fully self-attend in the Encoder, and the outputs should then condition the Decoder, right?

The Decoder in your repo has causal self-attention and text conditioning, as expected. But by the definition in the paper, the condition contains both the text condition and the 3DNA condition, and both of them condition the Decoder. Is my understanding right? I am just curious about the conditioning in the NUWA paper. The Encoder in your repo is only the text Encoder, and the video does not pass through the Encoder to condition the Decoder.

Looking forward to your reply! Thanks!

opened by Wang-Xiaodong1899 0

Questions about function forward() in NUWA please.

I'm confused that, in function forward() of class NUWA, the ground-truth video is fed to the transformer to calculate the output video, which is different from function generate().

    frame_embeddings = self.video_transformer(
        frame_embeddings,  # calculated from ground-truth video
        context = text_embeds,
        context_mask = text_mask
    )

So when training NUWA, the loss comes from the logits. But the logits depend not only on the text but also on the ground-truth video (a single transformer call, unlike the auto-regressive loop in the generate function). Is that some kind of cheating during training? Or should I generate the logits in the same way as in generate(), and then calculate the loss to train?
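Feeding the ground-truth sequence in one parallel pass is the usual teacher-forcing setup for autoregressive decoders, and it is not cheating as long as the self-attention is causal and the inputs are shifted so that position i only sees tokens before i. The toy sketch below illustrates that property with scalar "tokens" in plain Python. It is not the repository's code: `causal_attention`, `BOS` and the sample values are hypothetical, and whether the repository shifts and masks exactly this way is an assumption.

```python
import math


def causal_attention(xs):
    """Toy single-head causal self-attention over a list of scalars.

    Position i attends only to positions 0..i, so its output cannot
    depend on later tokens.
    """
    outputs = []
    for i in range(len(xs)):
        # Scores against the visible prefix only (this is the causal mask).
        scores = [xs[i] * xs[j] for j in range(i + 1)]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append(sum(w * xs[j] for j, w in enumerate(weights)) / total)
    return outputs


# Hypothetical ground-truth token values for one short sequence.
ground_truth = [0.3, -1.2, 0.8, 2.0]

# Teacher forcing: shift right so position i sees tokens < i and predicts token i.
BOS = 0.0
decoder_input = [BOS] + ground_truth[:-1]
out_original = causal_attention(decoder_input)

# Perturb the token at input position 2: earlier outputs must not change.
perturbed = list(decoder_input)
perturbed[2] = 9.9
out_perturbed = causal_attention(perturbed)

print(out_original[:2] == out_perturbed[:2])  # earlier positions are unaffected
print(out_original[2] != out_perturbed[2])    # the perturbed position changes
```

At sampling time the ground truth is unavailable, so generate() has to produce tokens one step at a time; the training pass computes the same per-position predictions in parallel.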

opened by Fitzwong 1

Type of dataset for training VQ-GAN

Hi,

First, thanks a lot for the amazing work! I have one question regarding the training of the VQ-GAN: do you recommend training it on a dataset similar to the one the NUWA model will be trained on? What I mean is, if I want to train NUWA to generate sports videos based on text, do I also need to train the VQ-GAN on a sports dataset?

Thanks a lot

opened by antonibigata 0

Pseudocode for 3DNA?

me no comprendai le complex einops 😢

Can someone give the 3DNA pseudocode to illustrate what's going on 🤗

(Also how did lucidrains bang out thousands of lines of code in a few weeks - is he confirmed to be human? 🤔)
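As a rough illustration of the idea behind 3D nearby attention, each video token at grid position (t, h, w) attends only to tokens inside a small local window around it instead of the whole video. The sketch below only enumerates which positions fall in that window. It is not the repository's einops implementation: `nearby_positions`, `causal_nearby_positions`, the grid size and the window extents are hypothetical, and treating "causal" as raster order over (t, h, w) is an assumption.

```python
from itertools import product


def nearby_positions(pos, grid, extent):
    """Return the (t, h, w) positions inside a local 3D window around `pos`.

    pos    -- (t, h, w) of the query token
    grid   -- (T, H, W) size of the video token grid
    extent -- (et, eh, ew) window size along each axis (assumed odd)
    """
    ranges = []
    for p, size, e in zip(pos, grid, extent):
        half = e // 2
        # Clip the window at the grid boundary.
        ranges.append(range(max(0, p - half), min(size, p + half + 1)))
    return list(product(*ranges))


def causal_nearby_positions(pos, grid, extent):
    """Keep only neighbours at or before `pos` in raster (t, h, w) order."""
    return [q for q in nearby_positions(pos, grid, extent) if q <= pos]


grid = (4, 8, 8)    # hypothetical: 4 frames of 8x8 tokens
window = (3, 3, 3)  # hypothetical window extents

center = nearby_positions((1, 4, 4), grid, window)
corner = nearby_positions((0, 0, 0), grid, window)
print(len(center), len(corner))  # prints: 27 8
```

Attention for a query token is then an ordinary softmax attention restricted to the keys and values at these positions, which keeps the cost per token proportional to the window volume rather than to the full video length.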

opened by neel04 4

Releases (0.7.7a)

0.7.7a (Aug 14, 2022)
0.7.7 (Aug 14, 2022)
0.7.6 (Apr 28, 2022)
0.7.5 (Apr 28, 2022)
0.7.4 (Apr 27, 2022)
0.7.3 (Apr 22, 2022)
0.7.2 (Apr 7, 2022)
0.7.1 (Mar 24, 2022)
0.7.0 (Mar 24, 2022)
0.6.4 (Mar 15, 2022)
0.6.3 (Mar 15, 2022)
0.6.2 (Mar 15, 2022)
0.6.1 (Mar 15, 2022)
0.6.0 (Mar 15, 2022)
0.5.15 (Mar 12, 2022)
0.5.14 (Mar 12, 2022)
0.5.12 (Mar 12, 2022)
0.5.11 (Mar 12, 2022)
0.5.10 (Mar 11, 2022)
0.5.9 (Mar 11, 2022)
0.5.8 (Mar 11, 2022)
0.5.7 (Mar 11, 2022)
0.5.6 (Mar 11, 2022)
0.5.5 (Mar 11, 2022)
0.5.4 (Mar 11, 2022)
0.5.3 (Mar 10, 2022)
0.5.2 (Mar 10, 2022)
0.5.1 (Mar 10, 2022)
0.5.0 (Mar 10, 2022)
0.4.33 (Mar 10, 2022)

Owner

Phil Wang

Working with Attention. It's all we need

