Open Positions
We have a limited number of research intern positions in Shanghai. Research topics include 3D vision, AIGC, and related areas. These positions target publications in top-tier CV conferences and are open only to graduate students who are currently enrolled at a university. If you are interested, please send your resume to my email.
Research Interests
My research interests lie in the fields of Computer Vision, Machine Learning, and Robotics. My current focus is 3D-AIGC. Before that, I worked on developing algorithms for crowd analysis problems, including counting, localization, and motion estimation. I also worked on video understanding, action recognition, semantic segmentation, domain adaptation, and learning with less supervision.
LAM3D: Large Image-Point-Cloud Alignment Model for 3D Reconstruction from Single Image
Ruikai Cui, Xibin Song, Weixuan Sun, Senbo Wang, Weizhe Liu, Shenzhou Chen, Taizhang Shang, Yang Li, Nick Barnes, Hongdong Li, Pan Ji
Neural Information Processing Systems (NeurIPS), 2024
pdf
Large Reconstruction Models have made significant strides in the realm of automated 3D content generation from single or multiple input images. Despite their success, these models often produce 3D meshes with geometric inaccuracies, stemming from the inherent challenges of deducing 3D shapes solely from image data. In this work, we introduce a novel framework, the Large Image and Point Cloud Alignment Model (LAM3D), which utilizes 3D point cloud data to enhance the fidelity of generated 3D meshes.
Frankenstein: Generating Semantic-Compositional 3D Scenes in One Tri-Plane
Han Yan, Yang Li, Zhennan Wu, Shenzhou Chen, Weixuan Sun, Taizhang Shang, Weizhe Liu, Tian Chen, Xiaqiang Dai, Chao Ma, Hongdong Li, Pan Ji
SIGGRAPH Asia, 2024
video / pdf
We present Frankenstein, a diffusion-based framework that can generate semantic-compositional 3D scenes in a single pass. Unlike existing methods that output a single, unified 3D shape, Frankenstein simultaneously generates multiple separated shapes, each corresponding to a semantically meaningful part. The 3D scene information is encoded in a single tri-plane tensor, from which multiple Signed Distance Function (SDF) fields can be decoded to represent the compositional shapes.
NeuSDFusion: A Spatial-Aware Generative Model for 3D Shape Completion, Reconstruction, and Generation
Ruikai Cui, Weizhe Liu †, Weixuan Sun, Senbo Wang, Taizhang Shang, Yang Li, Xibin Song, Han Yan, Zhennan Wu, Shenzhou Chen, Hongdong Li, Pan Ji
The European Conference on Computer Vision (ECCV), 2024
project page / pdf
3D shape generation aims to produce innovative 3D content adhering to specific conditions and constraints. Existing methods often decompose 3D shapes into a sequence of localized components, treating each element in isolation without considering spatial consistency. As a result, these approaches exhibit limited versatility in 3D data representation and shape generation, hindering their ability to generate highly diverse 3D shapes that comply with the specified constraints. In this paper, we introduce a novel spatial-aware 3D shape generation framework that leverages 2D plane representations for enhanced 3D shape modeling.
BlockFusion: Expandable 3D Scene Generation using Latent Tri-plane Extrapolation
Zhennan Wu, Yang Li, Han Yan, Taizhang Shang, Weixuan Sun, Senbo Wang, Ruikai Cui, Weizhe Liu, Hiroyuki Sato, Hongdong Li, Pan Ji
ACM Transactions on Graphics (SIGGRAPH), 2024
project page / pdf / code
We present BlockFusion, a diffusion-based model that generates 3D scenes as unit blocks and seamlessly incorporates new blocks to extend the scene. BlockFusion is trained using datasets of 3D blocks that are randomly cropped from complete 3D scene meshes. Through per-block fitting, all training blocks are converted into hybrid neural fields: a tri-plane containing the geometry features, followed by a Multi-layer Perceptron (MLP) that decodes the signed distance values.
RGB-based Category-level Object Pose Estimation via Decoupled Metric Scale Recovery
Jiaxin Wei, Xibin Song, Weizhe Liu, Laurent Kneip, Hongdong Li, Pan Ji
IEEE International Conference on Robotics and Automation (ICRA), 2023
pdf / code
While showing promising results, recent RGB-D camera-based category-level object pose estimation methods have restricted applications due to the heavy reliance on depth sensors. RGB-only methods provide an alternative to this problem yet suffer from inherent scale ambiguity stemming from monocular observations. In this paper, we propose a novel pipeline that decouples the 6D pose and size estimation to mitigate the influence of imperfect scales on rigid transformations.
Multi-view Tracking Using Weakly Supervised Human Motion Prediction
Martin Engilberge, Weizhe Liu, Pascal Fua
IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2023
pdf / code
Multi-view approaches to people-tracking have the potential to better handle occlusions than single-view ones in crowded scenes. They often rely on the tracking-by-detection paradigm, which involves detecting people first and then connecting the detections. In this paper, we argue that an even more effective approach is to predict people's motion over time and infer their presence in individual frames from these predictions. This makes it possible to enforce consistency both over time and across views of a single temporal frame. We validate our approach on the PETS2009 and WILDTRACK datasets and demonstrate that it outperforms state-of-the-art methods.
Learning to Align Sequential Actions in the Wild
Weizhe Liu, Bugra Tekin, Huseyin Coskun, Vibhav Vineet, Pascal Fua, Marc Pollefeys
The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022
pdf / code
In this paper, we propose an approach to align sequential actions in the wild that involve diverse temporal variations. To this end, we propose an approach to enforce temporal priors on the optimal transport matrix, which leverages temporal consistency, while allowing for variations in the order of actions. Our model accounts for both monotonic and non-monotonic sequences and handles background frames that should not be aligned. We demonstrate that our approach consistently outperforms the state-of-the-art in self-supervised sequential action representation learning on four different benchmark datasets.
Leveraging Self-Supervision for Cross-Domain Crowd Counting
Weizhe Liu, Nikita Durasov, Pascal Fua
The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022 (Oral)
pdf / code
In this paper, we train with both synthetic images, along with their associated labels, and unlabeled real images. To this end, we force our network to learn perspective-aware features by training it to recognize upside-down real images from regular ones and incorporate into it the ability to predict its own uncertainty so that it can generate useful pseudo labels for fine-tuning purposes. This yields an algorithm that consistently outperforms state-of-the-art cross-domain crowd counting ones without any extra computation at inference time.
Counting People by Estimating People Flows
Weizhe Liu, Mathieu Salzmann, Pascal Fua
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2021
pdf / press / video / code
In this paper, we advocate estimating people flows across image locations between consecutive images and inferring the people densities from these flows instead of directly regressing them. This enables us to impose much stronger constraints encoding the conservation of the number of people. As a result, it significantly boosts performance without requiring a more complex architecture. Furthermore, it allows us to exploit the correlation between people flow and optical flow to further improve the results. We also show that leveraging people conservation constraints in both a spatial and temporal manner makes it possible to train a deep crowd counting model in an active learning setting with far fewer annotations. This significantly reduces the annotation cost while still achieving performance similar to that of full supervision.
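The conservation idea can be illustrated with a minimal sketch. This is not the implementation from the paper: the grid layout, the nine-direction flow encoding, and the names `OFFSETS`, `density_from_outgoing`, and `density_from_incoming` are assumptions made for illustration, and the sketch simply ignores people who cross the image border. Here `flows[y, x, d]` is the number of people moving from cell `(y, x)` at time t-1 to the neighbouring cell given by `OFFSETS[d]` at time t.

```python
import numpy as np

# Hypothetical encoding: (dy, dx) offsets of the 3x3 neighbourhood,
# including (0, 0) for people who stay in the same cell.
OFFSETS = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]


def density_from_outgoing(flows):
    """People in each cell at time t-1: the sum of all flows leaving it."""
    return flows.sum(axis=-1)


def density_from_incoming(flows):
    """People in each cell at time t: the sum of all flows arriving in it."""
    h, w, _ = flows.shape
    density = np.zeros((h, w), dtype=flows.dtype)
    for d, (dy, dx) in enumerate(OFFSETS):
        # Source cells whose target (y + dy, x + dx) lies inside the grid.
        ys = slice(max(0, -dy), min(h, h - dy))
        xs = slice(max(0, -dx), min(w, w - dx))
        # Matching target cells.
        yt = slice(max(0, dy), min(h, h + dy))
        xt = slice(max(0, dx), min(w, w + dx))
        density[yt, xt] += flows[ys, xs, d]
    return density


# Two people start in the centre of a 3x3 grid: one stays, one moves right.
flows = np.zeros((3, 3, 9))
flows[1, 1, OFFSETS.index((0, 0))] = 1.0
flows[1, 1, OFFSETS.index((0, 1))] = 1.0

before = density_from_outgoing(flows)  # 2 people in the centre cell
after = density_from_incoming(flows)   # 1 in the centre, 1 to its right

# With nobody crossing the border, the total count is conserved.
assert np.isclose(before.sum(), after.sum())
```

Because both density maps are derived from the same flow tensor, the total count is conserved by construction whenever no flow leaves the grid, which is the kind of constraint that direct density regression does not provide.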
Estimating People Flows to Better Count Them in Crowded Scenes
Weizhe Liu, Mathieu Salzmann, Pascal Fua
The European Conference on Computer Vision (ECCV), 2020
pdf / press / video / code
In this paper, we advocate estimating people flows across image locations between consecutive images and inferring the people densities from these flows instead of directly regressing them. This enables us to impose much stronger constraints encoding the conservation of the number of people. As a result, it significantly boosts performance without requiring a more complex architecture. Furthermore, it also enables us to exploit the correlation between people flow and optical flow to further improve the results.
Geometric and Physical Constraints for Drone-Based Head Plane Crowd Density Estimation
Weizhe Liu, Krzysztof Lis, Mathieu Salzmann, Pascal Fua
The IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2019
pdf / press / video
In this paper, we explicitly model the scale changes and reason in terms of people per square meter. We show that feeding the perspective model to the network allows us to enforce global scale consistency and that this model can be obtained on the fly from the drone sensors. In addition, it also enables us to enforce physically-inspired temporal consistency constraints that do not have to be learned. This yields an algorithm that outperforms state-of-the-art methods in inferring crowd density from a moving drone camera, especially when perspective effects are strong.
Context-Aware Crowd Counting
Weizhe Liu, Mathieu Salzmann, Pascal Fua
The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019
pdf / press / video / code
In this paper, we introduce an end-to-end trainable deep architecture that combines features obtained using multiple receptive field sizes and learns the importance of each such feature at each image location. In other words, our approach adaptively encodes the scale of the contextual information required to accurately predict crowd density. This yields an algorithm that outperforms state-of-the-art crowd counting methods, especially when perspective effects are strong.
Professional Services
Reviewer for major computer vision conferences (CVPR, ICCV, ECCV) and journals (TPAMI, IJCV, TIP).