<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>CVPR on Yida Wang</title>
    <link>https://wangyida.github.io/categories/cvpr/</link>
    <description>Recent content in CVPR on Yida Wang</description>
    <image>
      <title>Yida Wang</title>
      <url>https://wangyida.github.io/logos/android-chrome-512x512.png</url>
      <link>https://wangyida.github.io/logos/android-chrome-512x512.png</link>
    </image>
    <generator>Hugo</generator>
    <language>en</language>
    <lastBuildDate>Thu, 09 Apr 2026 10:15:01 +0200</lastBuildDate>
    <atom:link href="https://wangyida.github.io/categories/cvpr/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Arbitrary-Resolution and Fine-Grained Depth Estimation with Neural Implicit Fields (InfiniDepth)</title>
      <link>https://wangyida.github.io/posts/infinidepth/</link>
      <pubDate>Thu, 09 Apr 2026 10:15:01 +0200</pubDate>
      <guid>https://wangyida.github.io/posts/infinidepth/</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Re-direct to the full &lt;a href=&#34;https://zju3dv.github.io/InfiniDepth/&#34;&gt;&lt;strong&gt;PAPER&lt;/strong&gt;&lt;/a&gt; and &lt;a href=&#34;https://github.com/RitianYu/InfiniDepth&#34;&gt;&lt;strong&gt;CODE&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;video controls loop muted playsinline style=&#34;width: 100%; height: auto; border-radius: 4px;&#34;&gt;
    &lt;source src=&#34;images/demo.mov&#34; type=&#34;video/mp4&#34;&gt;
&lt;/video&gt;
&lt;h1 id=&#34;abstract&#34;&gt;Abstract&lt;/h1&gt;
&lt;p&gt;Existing depth estimation methods are fundamentally limited to predicting depth on discrete image grids. Such representations restrict their scalability to arbitrary output resolutions and hinder geometric detail recovery. This paper introduces &lt;strong&gt;InfiniDepth&lt;/strong&gt;, which represents depth as neural implicit fields. Through a simple yet effective local implicit decoder, we can query depth at continuous 2D coordinates, enabling arbitrary-resolution and fine-grained depth estimation. To better assess our method&amp;rsquo;s capabilities, we curate a high-quality 4K synthetic benchmark from five different games, spanning diverse scenes with rich geometric and appearance details. Experiments demonstrate that InfiniDepth achieves state-of-the-art performance on both synthetic and real-world benchmarks across relative and metric depth estimation tasks, particularly excelling in fine-detail regions. It also benefits the task of novel view synthesis under large viewpoint shifts, producing high-quality results with fewer holes and artifacts.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Crafting World Models for Driving Scene Reconstruction via Online Restoration (ReconDreamer)</title>
      <link>https://wangyida.github.io/posts/recondreamer/</link>
      <pubDate>Sun, 11 May 2025 10:15:01 +0200</pubDate>
      <guid>https://wangyida.github.io/posts/recondreamer/</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Re-direct to the full &lt;a href=&#34;https://arxiv.org/abs/2411.19548&#34;&gt;&lt;strong&gt;PAPER&lt;/strong&gt;&lt;/a&gt;, &lt;a href=&#34;https://recondreamer.github.io/&#34;&gt;&lt;strong&gt;PROJECT PAGE&lt;/strong&gt;&lt;/a&gt; and &lt;a href=&#34;https://github.com/GigaAI-research/ReconDreamer&#34;&gt;&lt;strong&gt;CODE&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Closed-loop simulation is crucial for end-to-end autonomous driving. Existing sensor simulation methods (e.g., NeRF and 3DGS) reconstruct driving scenes based on conditions that closely mirror training data distributions. However, these methods struggle with rendering novel trajectories, such as lane changes. Recent works have demonstrated that integrating world model knowledge alleviates these issues. Despite their efficiency, these approaches still have difficulty accurately representing more complex maneuvers, with multi-lane shifts being a notable example. Therefore, we introduce ReconDreamer, which enhances driving scene reconstruction through incremental integration of world model knowledge. Specifically, DriveRestorer is proposed to mitigate artifacts via online restoration. This is complemented by a progressive data update strategy designed to ensure high-quality rendering for more complex maneuvers. To the best of our knowledge, ReconDreamer is the first method to effectively render large maneuvers. Experimental results demonstrate that ReconDreamer outperforms Street Gaussians in NTA-IoU, NTL-IoU, and FID, with relative improvements of 24.87%, 6.72%, and 29.97%, respectively. Furthermore, ReconDreamer surpasses DriveDreamer4D with PVG during large maneuver rendering, as verified by a relative improvement of 195.87% in the NTA-IoU metric and a comprehensive user study.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Street View Synthesis with Controllable Video Diffusion Models (StreetCrafter)</title>
      <link>https://wangyida.github.io/posts/streetcrafter/</link>
      <pubDate>Sun, 11 May 2025 10:15:01 +0200</pubDate>
      <guid>https://wangyida.github.io/posts/streetcrafter/</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Re-direct to the full &lt;a href=&#34;https://arxiv.org/abs/2412.13188&#34;&gt;&lt;strong&gt;PAPER&lt;/strong&gt;&lt;/a&gt; and &lt;a href=&#34;https://github.com/zju3dv/street_crafter&#34;&gt;&lt;strong&gt;CODE&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This paper aims to tackle the problem of photorealistic view synthesis from vehicle sensor data. Recent advancements in neural scene representation have achieved notable success in rendering high-quality autonomous driving scenes, but the performance significantly degrades as the viewpoint deviates from the training trajectory. To mitigate this problem, we introduce StreetCrafter, a novel controllable video diffusion model that utilizes LiDAR point cloud renderings as pixel-level conditions, which fully exploits the generative prior for novel view synthesis while preserving precise camera control. Moreover, the pixel-level LiDAR conditions allow us to make accurate pixel-level edits to target scenes. In addition, the generative prior of StreetCrafter can be effectively incorporated into dynamic scene representations to achieve real-time rendering. Experiments on the Waymo Open and PandaSet datasets demonstrate that our model enables flexible control over viewpoint changes, enlarges the region in which view synthesis yields satisfactory renderings, and outperforms existing methods.&lt;/p&gt;</description>
    </item>
    <item>
      <title>World Models Are Effective Data Machines for 4D Driving Scene Representation (DriveDreamer4D)</title>
      <link>https://wangyida.github.io/posts/drivedreamer4d/</link>
      <pubDate>Sun, 11 May 2025 10:15:01 +0200</pubDate>
      <guid>https://wangyida.github.io/posts/drivedreamer4d/</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Re-direct to the full &lt;a href=&#34;https://arxiv.org/abs/2410.13571&#34;&gt;&lt;strong&gt;PAPER&lt;/strong&gt;&lt;/a&gt;, &lt;a href=&#34;https://drivedreamer4d.github.io/&#34;&gt;&lt;strong&gt;PROJECT PAGE&lt;/strong&gt;&lt;/a&gt; and &lt;a href=&#34;https://github.com/GigaAI-research/DriveDreamer4D&#34;&gt;&lt;strong&gt;CODE&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Closed-loop simulation is essential for advancing end-to-end autonomous driving systems. Contemporary sensor simulation methods, such as NeRF and 3DGS, rely predominantly on conditions closely aligned with training data distributions, which are largely confined to forward-driving scenarios. Consequently, these methods face limitations when rendering complex maneuvers (e.g., lane changes, acceleration, deceleration). Recent advancements in autonomous-driving world models have demonstrated the potential to generate diverse driving videos. However, these approaches remain constrained to 2D video generation, inherently lacking the spatiotemporal coherence required to capture the intricacies of dynamic driving environments. In this paper, we introduce DriveDreamer4D, which enhances 4D driving scene representation by leveraging world model priors. Specifically, we utilize the world model as a data machine to synthesize novel trajectory videos, where structured conditions are explicitly leveraged to control the spatiotemporal consistency of traffic elements. In addition, a cousin data training strategy is proposed to facilitate merging real and synthetic data for optimizing 4DGS. To our knowledge, DriveDreamer4D is the first to utilize video generation models for improving 4D reconstruction in driving scenarios. Experimental results reveal that DriveDreamer4D significantly enhances generation quality under novel trajectory views, achieving relative improvements in FID of 32.1%, 46.4%, and 16.3% compared to PVG, S3Gaussian, and Deformable-GS, respectively. Moreover, DriveDreamer4D markedly enhances the spatiotemporal coherence of driving agents, which is verified by a comprehensive user study and relative increases of 22.6%, 43.5%, and 15.6% in the NTA-IoU metric.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Learning Local Displacements for Point Cloud Completion</title>
      <link>https://wangyida.github.io/posts/disp3d/</link>
      <pubDate>Sat, 19 Feb 2022 10:15:01 +0200</pubDate>
      <guid>https://wangyida.github.io/posts/disp3d/</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Re-direct to the full &lt;a href=&#34;https://arxiv.org/pdf/2203.16600v1.pdf&#34;&gt;&lt;strong&gt;PAPER&lt;/strong&gt;&lt;/a&gt; and &lt;a href=&#34;https://github.com/wangyida/disp3d&#34;&gt;&lt;strong&gt;CODE&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;div style=&#34;position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;&#34;&gt;
      &lt;iframe allow=&#34;accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share; fullscreen&#34; loading=&#34;eager&#34; referrerpolicy=&#34;strict-origin-when-cross-origin&#34; src=&#34;https://www.youtube.com/embed/-rSLpHYO78M?autoplay=0&amp;amp;controls=1&amp;amp;end=0&amp;amp;loop=0&amp;amp;mute=0&amp;amp;start=0&#34; style=&#34;position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;&#34; title=&#34;YouTube video&#34;&gt;&lt;/iframe&gt;
    &lt;/div&gt;

&lt;h1 id=&#34;abstract&#34;&gt;Abstract&lt;/h1&gt;
&lt;div style=&#34;display: flex; flex-wrap: wrap; gap: 20px; align-items: flex-start; margin-bottom: 30px;&#34;&gt;
    &lt;div style=&#34;flex: 1; width: 50%; min-width: 300px;&#34;&gt;
        &lt;img src=&#34;images/CVPR_teaser.png&#34; style=&#34;width: 100%; height: auto; border-radius: 4px;&#34;&gt;
    &lt;/div&gt;
    &lt;div style=&#34;flex: 1; width: 50%; min-width: 300px;&#34;&gt;
        &lt;h3 style=&#34;margin-top: 0;&#34;&gt;Completing a car&lt;/h3&gt;
        &lt;p&gt;From the input partial scan to our object completion, we visualize the amount of detail in our reconstruction.&lt;/p&gt;</description>
    </item>
  </channel>
</rss>
