![]() |
AsCAN: Asymmetric Convolution-Attention Networks for Efficient Recognition and Generation
|
![]() |
SF-V: Single Forward Video Generation Model
|
![]() |
BitsFusion: 1.99 bits Weight Quantization of Diffusion Model
|
![]() |
Efficient Training with Denoised Neural Weights
|
![]() |
E2GAN: Efficient Training of Efficient GANs for Image-to-Image Translation
|
![]() |
TextCraftor: Your Text Encoder Can be Image Quality Controller
|
![]() |
Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis
|
![]() |
Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
|
![]() |
SPAD: Spatially Aware Multiview Diffusers
|
![]() |
HyperHuman: Hyper-Realistic Human Generation with Latent Structural Diffusion
|
![]() |
Magic123: One Image to High-Quality 3D Object Generation Using Both 2D and 3D Diffusion Priors
|
![]() |
SnapFusion: Text-to-Image Diffusion Model on Mobile Devices within Two Seconds
|
![]() |
LightSpeed: Light and Fast Neural Light Fields on Mobile Devices
|
![]() |
iNVS: Repurposing Diffusion Inpainters for Novel View Synthesis
|
![]() |
Rethinking Vision Transformers for
MobileNet Size and Speed
|
![]() |
COMCAT: Towards Efficient
Compression and Customization of
Attention-Based Vision Models
|
![]() |
Real-Time Neural Light Field on Mobile Devices
|
![]() |
SINE: SINgle Image Editing with Text-to-Image Diffusion Models
|
![]() |
Unsupervised
Volumetric
Animation
|
![]() |
Make-A-Story: Visual Memory
Conditioned Consistent Story Generation
|
![]() |
Invertible Neural Skinning
|
![]() |
3D Generation on ImageNet
|
![]() |
Discrete Contrastive Diffusion for
Cross-Modal and Conditional Generation
|
![]() |
EfficientFormer: Vision
Transformers at
MobileNet Speed
|
![]() |
Layer Freezing & Data Sieving: Missing
Pieces
of a
Generic Framework for Sparse Training
|
![]() |
R2L: Distilling Neural Radiance Field to
Neural Light Field for Efficient Novel View Synthesis |
![]() |
Cross-Modal 3D Shape Generation
and
Manipulation |
![]() |
Show Me What and Tell Me How: Video
Synthesis
via Multimodal Conditioning |
![]() |
Are Multimodal Transformers Robust to
Missing Modality?
|
![]() |
In&Out: Diverse Image Outpainting via
GAN
Inversion
|
![]() |
F8Net: Fixed-Point 8-bit Only
Multiplication for Network
Quantization |
![]() |
Lottery Ticket Implies Accuracy Degradation, Is
It
a Desirable
Phenomenon? |
![]() |
Flow Guided Transformable Bottleneck Networks
for
Motion Retargeting |
![]() |
Motion Representations
for
Articulated
Animation |
![]() |
Teachers Do More Than Teach: Compressing
Image-to-Image Models |
![]() |
A Good Image Generator Is What You Need
for
High-Resolution Video Synthesis |
![]() |
SMIL: Multimodal learning with severely missing
modality |