Diffusion Knows Transparency: Repurposing Video Diffusion for Transparent Object Depth and Normal Estimation


Shaocong Xu1   Songlin Wei2   Qizhe Wei1   Zheng Geng1   Hong Li1,4   Licheng Shen3   Qianpu Sun3   Shu Han5  
Bin Ma3   Bohan Li6,7   Chongjie Ye8   Yuhang Zheng9   Nan Wang1   Saining Zhang1   Hao Zhao1,3*

1BAAI     2USC     3THU     4BUAA     5WHU     6SJTU     7EIT(Ningbo)     8FNii, CUHKSZ     9NUS

TL;DR


DKT is a foundation model for transparent-object 🫙, in-the-wild 🌎, arbitrary-length ⏳ video depth and normal estimation, facilitating downstream applications such as robotic manipulation and policy learning.



In-the-wild performance




Dynamic Scenes




Transparent Objects in Robotic Scenes




Transparent Objects in Fabric Scenes




Small Transparent Objects




In-the-wild Video Normal Estimation




Comparisons for Video Depth Estimation




Comparisons for Video Normal Estimation




Robot Grasping Experiments




Abstract


Transparent objects remain notoriously hard for perception systems: refraction, reflection and transmission break the assumptions behind stereo, ToF and purely discriminative monocular depth, causing holes and temporally unstable estimates. Our key observation is that modern video diffusion models already synthesize convincing transparent phenomena, suggesting they have internalized the optical rules. We build TransPhy3D, a synthetic video corpus of transparent/reflective scenes: 11k sequences (1.32M frames) rendered with Blender/Cycles. Scenes are assembled from a curated bank of category-rich static assets and shape-rich procedural assets paired with glass/plastic/metal materials. We render RGB + depth + normals with physically based ray tracing and OptiX denoising. Starting from a large video diffusion model, we learn a video-to-video translator for depth (and normals) via lightweight LoRA adapters. During training we concatenate RGB and (noisy) depth latents in the DiT backbone and co-train on TransPhy3D and existing frame-wise synthetic datasets, yielding temporally consistent predictions for arbitrary-length input videos. The resulting model, DKT, achieves zero-shot SOTA on real and synthetic video benchmarks involving transparency: ClearPose, DREDS (CatKnown/CatNovel), and TransPhy3D-Test. It improves accuracy and temporal consistency over strong image/video baselines (e.g., Depth-Anything-v2, DepthCrafter), and a normal variant (DKT-Normal) sets the best video normal estimation results on ClearPose. A compact 1.3B version runs at ~0.17 s/frame (832×480). Integrated into a grasping stack, DKT’s depth boosts success rates across translucent, reflective and diffuse surfaces, outperforming prior estimators. Together, these results support a broader claim: “Diffusion knows transparency.” Generative video priors can be repurposed, efficiently and label-free, into robust, temporally coherent perception for challenging real-world manipulation.
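The abstract outlines the training recipe: encode RGB and depth videos into latents, add noise to the depth latents, concatenate the two along the channel dimension before the DiT backbone, and fine-tune only lightweight LoRA adapters. The sketch below illustrates that recipe in PyTorch. It is not the released implementation: vae, dit, and scheduler are placeholder interfaces standing in for the video VAE, DiT backbone, and diffusion noise schedule, and all names and signatures here are assumptions for illustration only.

  import torch
  import torch.nn as nn

  class LoRALinear(nn.Module):
      """Frozen pretrained linear layer plus a trainable low-rank update:
      y = W x + scale * B(A(x)). In practice this would wrap the DiT's
      attention/projection layers; shown standalone here for clarity."""
      def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 16.0):
          super().__init__()
          self.base = base
          for p in self.base.parameters():
              p.requires_grad = False            # pretrained weights stay frozen
          self.lora_a = nn.Linear(base.in_features, rank, bias=False)
          self.lora_b = nn.Linear(rank, base.out_features, bias=False)
          nn.init.zeros_(self.lora_b.weight)     # adapter starts as a no-op
          self.scale = alpha / rank

      def forward(self, x):
          return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

  def training_step(dit, vae, scheduler, rgb_video, depth_video, num_train_timesteps=1000):
      """One denoising-training step for video-to-video depth translation
      (illustrative sketch; vae/dit/scheduler are assumed interfaces)."""
      with torch.no_grad():
          rgb_latents = vae.encode(rgb_video)      # clean conditioning latents
          depth_latents = vae.encode(depth_video)  # target latents to be denoised

      noise = torch.randn_like(depth_latents)
      t = torch.randint(0, num_train_timesteps, (depth_latents.shape[0],),
                        device=depth_latents.device)
      noisy_depth = scheduler.add_noise(depth_latents, noise, t)

      # Concatenate RGB and noisy depth latents along the channel dimension
      # before the DiT backbone; only LoRA parameters receive gradients.
      dit_input = torch.cat([rgb_latents, noisy_depth], dim=1)
      pred = dit(dit_input, timestep=t)

      return nn.functional.mse_loss(pred, noise)

The same structure applies to the normal-estimation variant (DKT-Normal) by swapping the depth target for surface-normal maps; co-training mixes batches from TransPhy3D with existing frame-wise synthetic datasets, as described in the abstract.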





Citation



@article{dkt2025,
  title   = {Diffusion Knows Transparency: Repurposing Video Diffusion for Transparent Object Depth and Normal Estimation},
  author  = {Shaocong Xu and Songlin Wei and Qizhe Wei and Zheng Geng and Hong Li and Licheng Shen and Qianpu Sun and Shu Han and Bin Ma and Bohan Li and Chongjie Ye and Yuhang Zheng and Nan Wang and Saining Zhang and Hao Zhao},
  journal = {arXiv preprint arXiv:2512.23705},
  year    = {2025}
}