Weizhe Liu

I am a Senior Research Scientist at Tencent, working on 3D reconstruction and AIGC.

Before that, I defended my Ph.D. thesis on Human-Centered Scene Understanding via Crowd Counting in November 2021. From June 2017 to January 2022, I worked at CVLab, EPFL with Prof. Pascal Fua. I received a Master of Science degree from École Polytechnique Fédérale de Lausanne (EPFL) in 2017 and a Bachelor of Engineering degree from the University of Electronic Science and Technology of China (UESTC) in 2014.

Email  /  CV  /  Google Scholar  /  LinkedIn  /  Twitter

Open Positions

We have a limited number of research intern positions in Shanghai; the research topics include 3D vision, AIGC, and related areas. These positions target publications at top-tier computer vision conferences and are open only to graduate students currently enrolled at a university. If you are interested, please send your resume to my email.

Research Interests

My research interests lie in the fields of computer vision, machine learning, and robotics. My current focus is 3D reconstruction and AIGC. Before that, I worked on developing algorithms for crowd analysis problems, including counting, localization, and motion estimation. I have also worked on video understanding, action recognition, semantic segmentation, domain adaptation, and learning with less supervision.

Preprints
Domain Adaptation for Semantic Segmentation via Patch-Wise Contrastive Learning
Weizhe Liu, David Ferstl, Samuel Schulter, Lukas Zebedin, Pascal Fua, Christian Leistner
arXiv
pdf

We introduce a novel approach to unsupervised and semi-supervised domain adaptation for semantic segmentation. Unlike many earlier methods that rely on adversarial learning for feature alignment, we leverage contrastive learning to bridge the domain gap by aligning the features of structurally similar label patches across domains. As a result, the networks are easier to train and deliver better performance. Our approach consistently outperforms state-of-the-art unsupervised and semi-supervised methods on two challenging domain adaptive segmentation tasks, particularly with a small number of target domain annotations. It can also be naturally extended to weakly-supervised domain adaptation, where accepting a minor drop in accuracy can save up to 75% of the annotation cost.

Using Depth for Pixel-Wise Detection of Adversarial Attacks in Crowd Counting
Weizhe Liu, Mathieu Salzmann, Pascal Fua
arXiv
pdf

In this paper, we investigate the effectiveness of existing attack strategies on crowd-counting networks and introduce a simple yet effective pixel-wise detection mechanism. It builds on the intuition that, when attacking a multitask network, in our case one estimating crowd density and scene depth, both outputs will be perturbed, so the second one can be used for detection purposes. We demonstrate that this significantly outperforms heuristic and uncertainty-based strategies.

Publications
Multi-view Tracking Using Weakly Supervised Human Motion Prediction
Martin Engilberge, Weizhe Liu, Pascal Fua
IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2023
pdf / code

Multi-view approaches to people-tracking have the potential to better handle occlusions than single-view ones in crowded scenes. They often rely on the tracking-by-detection paradigm, which involves detecting people first and then connecting the detections. In this paper, we argue that an even more effective approach is to predict people's motion over time and infer their presence in individual frames from these predictions. This enables us to enforce consistency both over time and across views of a single temporal frame. We validate our approach on the PETS2009 and WILDTRACK datasets and demonstrate that it outperforms state-of-the-art methods.

Learning to Align Sequential Actions in the Wild
Weizhe Liu, Bugra Tekin, Huseyin Coskun, Vibhav Vineet, Pascal Fua, Marc Pollefeys
The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022
pdf / code

In this paper, we propose an approach to align sequential actions in the wild that involve diverse temporal variations. To this end, we enforce temporal priors on the optimal transport matrix, which leverages temporal consistency while allowing for variations in the order of actions. Our model accounts for both monotonic and non-monotonic sequences and handles background frames that should not be aligned. We demonstrate that our approach consistently outperforms the state-of-the-art in self-supervised sequential action representation learning on four different benchmark datasets.

Leveraging Self-Supervision for Cross-Domain Crowd Counting
Weizhe Liu, Nikita Durasov, Pascal Fua
The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022 (Oral)
pdf / code

In this paper, we train with synthetic images and their associated labels, along with unlabeled real images. To this end, we force our network to learn perspective-aware features by training it to distinguish upside-down real images from regular ones, and incorporate into it the ability to predict its own uncertainty so that it can generate useful pseudo labels for fine-tuning. This yields an algorithm that consistently outperforms state-of-the-art cross-domain crowd counting methods without any extra computation at inference time.

Counting People by Estimating People Flows
Weizhe Liu, Mathieu Salzmann, Pascal Fua
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2021
pdf / press / video / code

In this paper, we advocate estimating people flows across image locations between consecutive images and inferring the people densities from these flows, instead of directly regressing them. This enables us to impose much stronger constraints encoding the conservation of the number of people. As a result, it significantly boosts performance without requiring a more complex architecture. Furthermore, it allows us to exploit the correlation between people flow and optical flow to further improve the results. We also show that leveraging people conservation constraints in both a spatial and temporal manner makes it possible to train a deep crowd counting model in an active learning setting with far fewer annotations. This significantly reduces the annotation cost while still yielding performance similar to the fully supervised case.
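The conservation idea behind inferring densities from flows can be sketched in a few lines of NumPy. The 9-channel flow layout and function name below are illustrative assumptions for exposition, not the paper's exact formulation: each cell's density is recovered as the sum of the flows entering it from its 8 neighbours plus the flow of people who stay put, so people can only move between adjacent cells, never appear or vanish.

```python
import numpy as np

def density_from_flows(flows):
    """Recover a density map from per-direction people flows.

    flows: array of shape (9, H, W); channel k holds the flow of people
    moving INTO each cell from its k-th neighbour (8 neighbours plus the
    "stay in place" channel). Summing incoming flows per cell encodes
    conservation of the number of people between consecutive frames.
    """
    return flows.sum(axis=0)

# toy example: one person who stays in cell (1, 1) between two frames
flows = np.zeros((9, 3, 3))
flows[4, 1, 1] = 1.0          # channel 4: people remaining in the same cell
density = density_from_flows(flows)
print(density.sum())          # total count is preserved: 1.0
```

Because the density is a deterministic function of the flows, a loss on densities plus a conservation penalty on the flows can be enforced without any architectural change, which is what makes the constraint cheap.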

Estimating People Flows to Better Count Them in Crowded Scenes
Weizhe Liu, Mathieu Salzmann, Pascal Fua
The European Conference on Computer Vision (ECCV), 2020
pdf / press / video / code

In this paper, we advocate estimating people flows across image locations between consecutive images and inferring the people densities from these flows, instead of directly regressing them. This enables us to impose much stronger constraints encoding the conservation of the number of people. As a result, it significantly boosts performance without requiring a more complex architecture. Furthermore, it also enables us to exploit the correlation between people flow and optical flow to further improve the results.

Geometric and Physical Constraints for Drone-Based Head Plane Crowd Density Estimation
Weizhe Liu, Krzysztof Lis, Mathieu Salzmann, Pascal Fua
The IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2019
pdf / press / video

In this paper, we explicitly model the scale changes and reason in terms of people per square meter. We show that feeding the perspective model to the network allows us to enforce global scale consistency, and that this model can be obtained on the fly from the drone sensors. In addition, it enables us to enforce physically-inspired temporal consistency constraints that do not have to be learned. This yields an algorithm that outperforms state-of-the-art methods in inferring crowd density from a moving drone camera, especially when perspective effects are strong.

Context-Aware Crowd Counting
Weizhe Liu, Mathieu Salzmann, Pascal Fua
The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019
pdf / press / video / code

In this paper, we introduce an end-to-end trainable deep architecture that combines features obtained using multiple receptive field sizes and learns the importance of each such feature at each image location. In other words, our approach adaptively encodes the scale of the contextual information required to accurately predict crowd density. This yields an algorithm that outperforms state-of-the-art crowd counting methods, especially when perspective effects are strong.
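The per-location scale weighting can be illustrated with a toy NumPy sketch. Everything here is a simplified stand-in: box-filter average pooling plays the role of the multiple receptive fields, and a softmax over contrast maps (how much each pooled map differs from the original feature) replaces the learned per-pixel importance weights of the actual network.

```python
import numpy as np

def avg_pool(feat, k):
    """Average-pool a (H, W) map with a k x k box filter, same-size output."""
    H, W = feat.shape
    pad = k // 2
    padded = np.pad(feat, pad, mode="edge")
    out = np.empty((H, W), dtype=float)
    for i in range(H):
        for j in range(W):
            out[i, j] = padded[i:i + k, j:j + k].mean()
    return out

def context_aware_features(feat, scales=(1, 3, 5)):
    """Combine features pooled at several receptive-field sizes using
    per-pixel weights derived from contrast maps (|pooled - original|),
    so each location adaptively picks its contextual scale."""
    pooled = np.stack([avg_pool(feat, k) for k in scales])       # (S, H, W)
    contrast = np.abs(pooled - feat)                             # (S, H, W)
    weights = np.exp(contrast) / np.exp(contrast).sum(axis=0)    # softmax over scales
    return (weights * pooled).sum(axis=0)                        # (H, W)
```

On a perfectly uniform feature map all scales agree, so the weighted combination returns the input unchanged; the weighting only matters where different receptive fields see different context, which is exactly the perspective-distorted case.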

Professional Services

Reviewer of major computer vision conferences (CVPR, ICCV, ECCV) and journals (TPAMI, IJCV, TIP).

Teaching


website template credit