Abstract

A key challenge in model-free category-level pose estimation is the extraction of contextual object features that generalize across varying instances within a specific category. Recent approaches leverage foundational features to capture semantic and geometry cues from data. However, these approaches fail under partial visibility. We overcome this with a first-complete-then-aggregate strategy for feature extraction utilizing class priors. In this paper, we present GCE-Pose, a method that enhances pose estimation for novel instances by integrating a category-level global context prior. GCE-Pose performs semantic shape reconstruction with a proposed Semantic Shape Reconstruction (SSR) module. Given an unseen partial RGB-D object instance, our SSR module reconstructs the instance’s global geometry and semantics by deforming category-specific 3D semantic prototypes through a learned deep Linear Shape Model. We further introduce a Global Context Enhanced (GCE) feature fusion module that effectively fuses features from partial RGB-D observations and the reconstructed global context. Extensive experiments validate the impact of our global context prior and the effectiveness of the GCE fusion module, demonstrating that GCE-Pose significantly outperforms existing methods on the challenging real-world datasets HouseCat6D and NOCS-REAL275. Our project page is available at https://colin-de.github.io/GCEPose/.

1. Introduction

Figure 1. Overview of the category-level Pose Estimation Pipeline: (A) Previous methods, e.g. AG-Pose and SecondPose, rely on partial features extracted by a neural network (NN) to regress object poses. (B) We introduce a novel approach that leverages a semantic shape reconstruction (SSR) module for global feature extraction. This global context enhances (GCE) the mapping from partial features to NOCS features.

Most existing methods extract semantic and geometric cues from the input, but they fail when occlusion leaves only part of the object visible; GCE-Pose is proposed to address this.

GCE-Pose tackles this with a first-complete-then-aggregate strategy. Concretely, it performs semantic shape reconstruction with the proposed Semantic Shape Reconstruction (SSR) module. Given an unseen, partially visible RGB-D object instance, the SSR module reconstructs the instance's global geometry and semantics by learning a deep linear shape network. On top of this, a Global Context Enhanced (GCE) fusion module is introduced that effectively fuses the partial RGB-D observation features with the reconstructed context features.

Performance:

| GCE-Pose | $IoU_{25}$ | $IoU_{50}$ | $IoU_{75}$ | $5^\circ2\text{cm}$ | $5^\circ5\text{cm}$ | $10^\circ2\text{cm}$ | $10^\circ5\text{cm}$ |
|---|---|---|---|---|---|---|---|
| REAL275 | - | 84.1 | 79.8 | 57.0 | 65.1 | 75.6 | 86.3 |
| CAMERA25 | - | - | - | - | - | - | - |
| HouseCat6D | - | 79.2 | 60.6 | 24.8 | 25.7 | 55.4 | 58.4 |

Our main contributions are as follows:

  • We propose GCE-Pose, a Global Context Enhancement (GCE) approach that integrates global context with both geometric and semantic cues for category-level object pose estimation.
  • We introduce a Semantic Shape Reconstruction (SSR) strategy that addresses partially observed inputs by reconstructing both object geometry and semantics through learned categorical deformation prototypes.
  • Extensive experiments demonstrate that our method achieves robust pose estimation even under significant shape variations and occlusions, improving generalization to unseen instances.

2. Related Works

2.1. Object Reconstruction for Pose Estimation

2.2. Representation Learning for Pose Estimation

2.3. Generalizing Object Pose Estimators

3. Method

Figure 2. Illustration of GCE-Pose: (A) Semantic and geometric features are extracted from an RGB-D input. A keypoint feature detector identifies robust keypoints and extracts their corresponding features. (B) An instance-specific and category-level semantic global feature is reconstructed using our SSR module. (C) The global features are fused with the keypoint features to predict the keypoint NOCS coordinates. (D) The predicted keypoints, NOCS coordinates, and fused keypoint features are utilized for pose and size estimation.

Given an RGB-D frame and its corresponding mask, we obtain $\mathbf{I}_\text{partial}$ and $\mathbf{P}_\text{partial}$.

3.1. Robust Partial Feature Extraction

This part follows the method of AG-Pose unchanged.

For the point cloud $\mathbf{P}_\text{partial} \in \mathbb{R}^{N \times 3}$, PointNet++ extracts features $\mathbf{F}_\text{P} \in \mathbb{R}^{N \times C_1}$. For the RGB image $\mathbf{I}_\text{partial}$, DINOv2 extracts features $\mathbf{F}_\text{I} \in \mathbb{R}^{N \times C_2}$. Concatenating $\mathbf{F}_\text{P}$ and $\mathbf{F}_\text{I}$ yields $\mathbf{F}_\text{partial} \in \mathbb{R}^{N \times C}$.

GCE-Pose extracts keypoints with the scheme from AG-Pose.

First, a learnable embedding $\mathbf{F}_\text{emb} \in \mathbb{R}^{M \times C}$ is used to extract $M$ keypoint features: the embedding attends to key regions of $\mathbf{P}_\text{partial}$ via cross-attention with $\mathbf{F}_\text{partial}$, producing a query feature matrix $\mathbf{F}_\text{q} = \text{CrossAttention}(\mathbf{F}_\text{emb}, \mathbf{F}_\text{partial})$. Correspondences are then computed by cosine similarity, forming a matrix $\mathbf{A} \in \mathbb{R}^{M \times N}$, and $M$ keypoints are selected from $\mathbf{P}_\text{partial}$ as $\mathbf{P}_\text{kpt} = \text{softmax}(\mathbf{A}) \mathbf{P}_\text{partial}$.
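The selection step can be sketched in NumPy (a minimal sketch; the cross-attention producing $\mathbf{F}_\text{q}$ is assumed to have run already, and the row-wise softmax is written out explicitly):

```python
import numpy as np

def select_keypoints(F_q, F_partial, P_partial):
    """F_q: (M, C) query features, F_partial: (N, C) fused point features,
    P_partial: (N, 3) partial point cloud. Returns P_kpt (M, 3)."""
    # Cosine-similarity correspondence matrix A (M, N)
    Fq = F_q / (np.linalg.norm(F_q, axis=1, keepdims=True) + 1e-8)
    Fp = F_partial / (np.linalg.norm(F_partial, axis=1, keepdims=True) + 1e-8)
    A = Fq @ Fp.T
    # Row-wise softmax turns similarities into weights over the N observed points
    W = np.exp(A - A.max(axis=1, keepdims=True))
    W = W / W.sum(axis=1, keepdims=True)
    # P_kpt = softmax(A) @ P_partial: each keypoint is a convex combination of points
    return W @ P_partial
```

Because each row of the softmax sums to one, every selected keypoint lies inside the per-axis bounds of the observed cloud.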

To keep keypoints on the instance surface and minimize outliers, GCE-Pose uses an object-aware Chamfer distance loss $\mathcal{L}_\text{ocd}$. Using the ground-truth pose $\mathbf{T}_\text{gt}$, each point $x \in \mathbf{P}_\text{partial}$ is compared against the instance model $\mathbf{M}_\text{obj}$ to filter outliers:

$$ \begin{equation}\tag{1} \min_{y \in \mathbf{M}_\text{obj}} \lVert \mathbf{T}_\text{gt}(x) - y \rVert_2 < \tau_1, \end{equation} $$

where $\tau_1$ is the outlier threshold; the surviving inliers form $\mathbf{P}_\text{partial}^*$. The object-aware Chamfer distance loss is then:

$$ \begin{equation}\tag{2} \mathcal{L}_\text{ocd} = \frac{1}{\lvert\mathbf{P}_\text{kpt}\rvert} \sum_{x \in \mathbf{P}_\text{kpt}} \min_{y \in \mathbf{P}_\text{partial}^*} \lVert x - y \rVert_2. \end{equation} $$
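Equations 1 and 2 together amount to "filter, then nearest-neighbor average", which a brute-force NumPy sketch makes concrete (the `T_gt` callable and the threshold value are illustrative placeholders, not the paper's implementation):

```python
import numpy as np

def object_aware_cd(P_kpt, P_partial, M_obj, T_gt, tau1=0.02):
    """P_kpt: (M, 3) keypoints, P_partial: (N, 3) observed cloud,
    M_obj: (K, 3) instance model points, T_gt: camera-to-model transform."""
    # Eq. (1): map observed points into model space and keep those near the model
    X = np.array([T_gt(x) for x in P_partial])
    d_model = np.linalg.norm(X[:, None] - M_obj[None], axis=-1).min(axis=1)
    P_star = P_partial[d_model < tau1]          # inlier set P*_partial
    # Eq. (2): mean nearest-neighbor distance from each keypoint to the inliers
    d = np.linalg.norm(P_kpt[:, None] - P_star[None], axis=-1).min(axis=1)
    return d.mean()
```

With keypoints that coincide with inlier surface points, the loss is exactly zero, which is the behavior the filtering is meant to enforce.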

To prevent keypoints from clustering, a diversity regularization loss is added:

$$ \begin{equation}\tag{3} \mathcal{L}_\text{div} = \sum_{x \ne y \in \mathbf{P}_\text{kpt}} \max\{0, \tau_2 - \lVert x - y \rVert_2\}, \end{equation} $$

where $\tau_2$ controls the keypoint spread. To enrich the features with geometric context, GCE-Pose directly adopts the Geometric-Aware Feature Aggregation (GAFA) module from AG-Pose. GAFA enhances each keypoint with (1) local geometric detail from its K nearest neighbors and (2) global information from all keypoints, improving feature discriminability for correspondence estimation.
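Equation 3 is a simple pairwise hinge; a short sketch (summing over ordered pairs $x \ne y$, so each unordered pair counts twice — an assumption about how the sum is read):

```python
import numpy as np

def diversity_loss(P_kpt, tau2=0.05):
    """Hinge penalty on every ordered pair of distinct keypoints closer than tau2."""
    D = np.linalg.norm(P_kpt[:, None] - P_kpt[None], axis=-1)  # (M, M) distances
    mask = ~np.eye(len(P_kpt), dtype=bool)                     # exclude x == y
    return np.maximum(0.0, tau2 - D[mask]).sum()
```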

3.2. Semantic Shape Reconstruction

Figure 3. Illustration of the Deep Linear Semantic Shape Model. The model is composed of a prototype shape $c$, a scale network $\mathcal{S}$, a deformation network $\mathcal{D}$, a deformation field $\mathcal{V}$, and category-level semantic features $c_\text{sem}^k$. At stage 1, we build a Deep Linear Shape (DLS) model using sampled point clouds from all ground-truth instances within each category, training a linear parameterization network to represent each instance. At stage 2, we retrain the DLS model to regress the corresponding DLS parameters from partial point cloud inputs using a deformation and scale network. During testing, the network predicts DLS parameters for unseen objects and reconstructs their point clouds based on the learned deformation field to obtain the semantic reconstruction.

Intra-class variation is one of the key challenges in category-level object 6D pose estimation. To address it, shape priors have been widely adopted: by modeling the target shape as a mean shape plus a deformation, or by learning an implicit neural representation for geometric reconstruction, a model can learn correspondences in NOCS space more effectively and fully exploit the shape prior. Yet while geometric shape reconstruction provides important prior information, it still struggles to capture the rich semantic information carried by object parts.

Motivated by this, GCE-Pose proposes the Semantic Shape Reconstruction (SSR) module, which learns a per-category linear shape model that describes an object with instance-specific geometry and category-level semantic features.

Deep Linear Semantic Shape Model

To cope with the partial observations produced by depth sensors, e.g. occlusion and incomplete geometry, GCE-Pose adopts a variant of the Deep Linear Semantic Shape Model: object shape is parameterized by shape parameters, so a complete 3D representation can be generated even from limited input data. Each point of the model is represented as a tuple $(x, f)$, where $x \in \mathbb{R}^3$ is the spatial coordinate and $f \in \mathbb{R}^C$ its semantic feature vector. For the $I$ points of the object instances within category $k$, GCE-Pose learns a Linear Semantic Shape Model consisting of three parts: (i) a geometric prototype $c^k \in \mathbb{R}^{I \times 3}$ with corresponding semantic features $c_\text{sem}^k \in \mathbb{R}^{I \times C}$; (ii) a set of geometric deformation basis vectors $v^k = \{v_1^k, \cdots, v_D^k\}$ with $v_i^k \in \mathbb{R}^3$; and (iii) a scale parameter vector $s^k \in \mathbb{R}^3$. The key point is that the semantic features stay coupled to their corresponding points throughout the geometric deformation. The semantic shape $\mathbf{U}_\text{k}$ of the model is thus defined as:

$$ \begin{equation}\tag{4} \mathbf{U}_\text{k} = (\mathbf{X}_\text{k}, \mathbf{F}_\text{k}) = \left(\mathbf{s}^k \odot \left(c^k + \sum_{i = 1}^D a_i^kv_i^k\right), c_\text{sem}^k\right), \end{equation} $$

where $\mathbf{X}_\text{k} \in \mathbb{R}^{I \times 3}$ are the $I$ points of shape prior $k$ and $\mathbf{F}_\text{k} \in \mathbb{R}^{I \times C}$ the semantic features attached to them. The shape parameter vector is $a^k = (a_1^k, \cdots, a_D^k) \in \mathbb{R}^D$, the scale parameter vector is $s^k \in \mathbb{R}^3$, and $\odot$ denotes the element-wise Hadamard product. For each category $k$, GCE-Pose trains two neural networks: $\mathcal{D}^k$ predicts the shape parameters $a^k$, and $\mathcal{S}^k$ predicts the scale parameters $s^k$.
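Under the assumption that each basis element $v_i^k$ is stored as a per-point offset field, stacked into a $(D, I, 3)$ array, Eq. 4 is a few lines of NumPy:

```python
import numpy as np

def semantic_shape(c_k, v_k, c_sem_k, a_k, s_k):
    """Eq. (4): U_k = (s ⊙ (c + Σ a_i v_i), c_sem).
    c_k: (I, 3) geometric prototype; v_k: (D, I, 3) deformation basis
    (per-point offsets, an assumed layout); c_sem_k: (I, C) semantic prototype;
    a_k: (D,) shape parameters; s_k: (3,) anisotropic scale."""
    # Deform the prototype, scale each axis, and inherit the semantics unchanged
    X_k = s_k[None, :] * (c_k + np.tensordot(a_k, v_k, axes=1))  # (I, 3)
    return X_k, c_sem_k
```

With $a^k = 0$ and $s^k = \mathbf{1}$ the model reduces to the prototype itself, which is a useful sanity check.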

To optimize the model, GCE-Pose minimizes a Chamfer distance loss $\mathcal{L}_\text{CD}$ to ensure accurate shape reconstruction, defined as:

$$ \begin{equation}\tag{5} \mathcal{L}_\text{CD} = \sum_{x \in \mathbf{P}}\min_k d(x, \mathbf{U}_\text{k}), \end{equation} $$

where $d$ is the Chamfer distance, $\mathbf{P}$ the ground-truth point cloud of category $k$, and $\mathbf{U}_\text{k}$ the reconstructed shape defined in Eq. 4.

Training with ground truth yields the optimal parameters $\bar{a}^k$, $\bar{s}^k$, $c^k$ and $v^k$. GCE-Pose then freezes $c^k$ and $v^k$ and adds a loss on the partial observation $\mathbf{P}_\text{partial}$ to further refine the shape reconstruction:

$$ \begin{equation}\tag{6} \mathcal{L}_\text{para} = \sum_{x^\prime \in \mathbf{P}_\text{partial}}\lambda_1\lvert\mathcal{D}^k(x^\prime) - \bar{a}^k\rvert + \lambda_2\lvert\mathcal{S}^k(x^\prime) - \bar{s}^k\rvert. \end{equation} $$

Finally, GCE-Pose combines the reconstruction and parameter losses into the overall loss:

$$ \begin{equation}\tag{7} \mathcal{L}_\text{rec} = \lambda_\text{CD} \cdot \mathcal{L}_\text{CD} + \lambda_\text{para} \cdot \mathcal{L}_\text{para}. \end{equation} $$

where $\lambda_\text{CD}$ and $\lambda_\text{para}$ are hyperparameters weighting the contributions of $\mathcal{L}_\text{CD}$ and $\mathcal{L}_\text{para}$, respectively.

Semantic Prototype Construction

To inject rich semantic information into the 3D shape reconstruction, GCE-Pose first uses DINOv2 to extract dense semantic features from multi-view RGB images of each object instance. Concretely, for each textureless object instance, multiple virtual cameras are placed around the object to capture RGB images and depth maps from different viewpoints. This setup covers the object surface as completely as possible and mitigates the effect of occlusion. The captured RGB images are fed to DINOv2 to obtain dense 2D semantic feature maps. Using the corresponding depth maps and the known camera intrinsics and extrinsics, the 2D features are then lifted into 3D: for each pixel $(u, v)$ with depth value $z$, the 3D position is computed as $P = zK^{-1}[u, v, 1]^T$, where $K$ is the camera intrinsic matrix, and the associated feature $\mathbf{f}_\text{2D}(u, v)$ is attached to that 3D point. This yields a dense semantic point cloud $\mathbf{F}_\text{sem}$. For computational efficiency and to establish point-wise correspondence, the dense cloud is further downsampled to the $I$ points aligned with the geometric reconstruction. For each point $P_i$ of the deep linear shape reconstruction, GCE-Pose aggregates the semantic features of its $k$ nearest neighbors in the dense cloud:

$$ \begin{equation}\tag{8} \mathbf{F}_\text{instance}(P_i) = \frac{1}{k} \sum_{P_j \in N_k(P_i)} \mathbf{F}_\text{sem}(P_j). \end{equation} $$

The category-level semantic prototype $c_\text{sem}^k$ is then built by averaging the features of the $N$ instances in category $k$, while preserving the point-wise correspondence to the geometric prototype $c^k$:

$$ \begin{equation}\tag{9} c_\text{sem}^k = \frac{1}{N} \sum_{i = 1}^N \mathbf{F}_\text{instance}(P_i). \end{equation} $$
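Equations 8 and 9 can be sketched with brute-force nearest neighbors (the `k` value and array layouts here are illustrative assumptions):

```python
import numpy as np

def aggregate_instance_semantics(P_recon, P_dense, F_dense, k=3):
    """Eq. (8): mean semantic feature of the k nearest dense points
    for each reconstructed point. P_recon: (I, 3), P_dense: (J, 3),
    F_dense: (J, C). Returns (I, C)."""
    d = np.linalg.norm(P_recon[:, None] - P_dense[None], axis=-1)  # (I, J)
    idx = np.argsort(d, axis=1)[:, :k]                             # k nearest per point
    return F_dense[idx].mean(axis=1)

def category_semantic_prototype(instance_features):
    """Eq. (9): point-wise average over the N instance feature arrays, each (I, C)."""
    return np.stack(instance_features).mean(axis=0)
```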

Semantic Reconstruction

A key advantage of GCE-Pose is that once the semantic prototype is built, semantic reconstruction becomes straightforward: for a given partial point cloud $x^\prime$, GCE-Pose first reconstructs the geometry and then directly inherits the corresponding semantic features from the prototype.

3.3. Global Context Enhanced Feature Fusion

Conventional pose estimation methods rely mainly on partial observations and therefore degrade under occlusion, viewpoint changes, and other hard conditions. To overcome these limitations, GCE-Pose proposes the Global Context Enhancement (GCE) feature fusion module, which effectively fuses the complete semantic shape reconstruction with the partial observation to establish more robust feature correspondences.

First, GCE-Pose extracts global features from the semantic reconstruction. Given the partial point cloud $\mathbf{P}_\text{partial}$, the complete shape $\mathbf{P}_\text{global}$ is reconstructed via Eq. 4. PointNet++ then extracts geometric features from $\mathbf{P}_\text{global}$, which are concatenated with the category-level semantic features $c_\text{sem}$ to form the global feature $\mathbf{F}_\text{global} \in \mathbb{R}^{I \times C}$.

Given the keypoint features $\mathbf{F}_\text{kpt} \in \mathbb{R}^{M \times C}$ and the global features $\mathbf{F}_\text{global} \in \mathbb{R}^{I \times C}$, the goal is to enhance the keypoint features with global semantic context. The main challenge is bridging the domain gap between the two representations: the partial observation is acquired in the noisy camera coordinate space, whereas the global reconstruction is expressed in the normalized object coordinate space with complete shape information.

To fuse the partial features $\mathbf{F}_\text{kpt}$ with the global features $\mathbf{F}_\text{global}$, GCE-Pose feeds the keypoint positions $\mathbf{P}_\text{kpt}$ and the global point positions $\mathbf{P}_\text{global}$ through learnable positional embedding networks and concatenates the resulting encodings with the corresponding features, mapping both into high-dimensional positional tokens:

$$ \begin{equation}\tag{10} \begin{aligned} \mathbf{F}_\text{kpt}^\prime &= \text{concat}(\mathbf{F}_\text{kpt}, \text{PE}_\text{kpt}), \\ \mathbf{F}_\text{global}^\prime &= \text{concat}(\mathbf{F}_\text{global}, \text{PE}_\text{global}), \end{aligned} \end{equation} $$

where $\text{PE}$ denotes the positional encoding network. An attention mechanism then fuses the two sources, with the global features providing semantic context for refining the keypoint features. Specifically, both are first projected into a shared embedding space:

$$ \begin{equation}\tag{11} \mathbf{F}_\text{kpt}^{\prime\prime} = \text{LayerNorm}(\text{MLP}_\text{proj}(\mathbf{F}_\text{kpt}^\prime)), \end{equation} $$

$$ \begin{equation}\tag{12} \mathbf{F}_\text{global}^{\prime\prime} = \text{LayerNorm}(\text{MLP}_\text{proj}(\mathbf{F}_\text{global}^\prime)), \end{equation} $$

Global context enhancement is then aggregated via cross-attention with a residual connection:

$$ \begin{equation}\tag{13} \mathbf{F}_\text{context} = \text{CrossAttn}(\mathbf{F}_\text{kpt}^{\prime\prime}, \mathbf{F}_\text{global}^{\prime\prime}), \end{equation} $$

$$ \begin{equation}\tag{14} \mathbf{F}_\text{gce} = \mathbf{F}_\text{kpt} + \mathbf{F}_\text{context}. \end{equation} $$

Fusing $\mathbf{F}_\text{kpt}$ with $\mathbf{F}_\text{global}$ yields the globally context-enhanced keypoint features $\mathbf{F}_\text{gce}$. Following the setup of DPDN, $\mathbf{F}_\text{gce}$ is then passed through a self-attention module and a multi-layer perceptron (MLP) to predict the keypoint NOCS coordinates $\mathbf{P}_\text{kpt}^\text{nocs}$.
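A minimal single-head sketch of Eqs. 13 and 14, assuming the projected tokens share the keypoint feature dimension so the residual adds cleanly (the multi-head and projection details of the paper are omitted):

```python
import numpy as np

def softmax_rows(A):
    """Numerically stable row-wise softmax."""
    W = np.exp(A - A.max(axis=1, keepdims=True))
    return W / W.sum(axis=1, keepdims=True)

def gce_fuse(F_kpt, F_kpt_pp, F_global_pp):
    """F_kpt: (M, C) raw keypoint features; F_kpt_pp: (M, C) and
    F_global_pp: (I, C) projected tokens from Eqs. (11)-(12)."""
    # Eq. (13): keypoints query the global tokens (scaled dot-product attention)
    A = F_kpt_pp @ F_global_pp.T / np.sqrt(F_kpt_pp.shape[-1])
    F_context = softmax_rows(A) @ F_global_pp
    # Eq. (14): residual connection back onto the raw keypoint features
    return F_kpt + F_context
```

When all global tokens are identical, every keypoint receives exactly that token as context — a quick way to verify the attention weights sum to one.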

To ensure that the extracted keypoints and their features faithfully represent the partial point cloud $\mathbf{P}_\text{partial}$, GCE-Pose further adds a reconstruction module to recover its 3D geometry. Taking the keypoint positions and features as input, the module first positionally encodes the keypoint locations and refines their features with an MLP. From the aggregated encodings and refined features, a shape decoder predicts reconstruction increments that recover the object geometry. The reconstruction loss is the object-aware Chamfer distance (CD) between the partial cloud $\mathbf{P}_\text{partial}$ and the reconstructed cloud $\mathbf{P}_\text{recon}$, using the filtering of Eq. 1:

$$ \begin{equation}\tag{15} \mathcal{L}_\text{recon} = \frac{1}{\lvert\mathbf{P}_\text{recon}\rvert} \sum_{x \in \mathbf{P}_\text{recon}} \min_{y \in \mathbf{P}_\text{partial}^*} \lVert x - y \rVert_2. \end{equation} $$

3.4. Pose and Size Estimator

Given the keypoint NOCS coordinates $\mathbf{P}_\text{kpt}^\text{nocs} \in \mathbb{R}^{M \times 3}$, the enhanced keypoint features $\mathbf{F}_\text{gce}$, and the keypoint positions $\mathbf{P}_\text{kpt}$, GCE-Pose establishes keypoint-level correspondences and regresses the final pose and size parameters $\mathbf{R}$, $\mathbf{t}$ and $\mathbf{s}$:

$$ \begin{equation}\tag{16} \mathbf{f}_\text{pose} = \text{concat}[\mathbf{P}_\text{kpt}, \mathbf{F}_\text{gce}, \mathbf{P}_\text{kpt}^\text{nocs}], \end{equation} $$

$$ \begin{equation}\tag{17} (\mathbf{R}, \mathbf{t}, \mathbf{s}) = \text{MLP}_R(\mathbf{f}_\text{pose}), \text{MLP}_t(\mathbf{f}_\text{pose}), \text{MLP}_s(\mathbf{f}_\text{pose}). \end{equation} $$

For the rotation $\mathbf{R}$, GCE-Pose adopts the 6D continuous representation. For the translation $\mathbf{t}$, it follows the strategy of HS-Pose, predicting the residual between the ground truth and the point cloud center.
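The 6D continuous representation maps two predicted 3-vectors to a rotation matrix via Gram-Schmidt; a standard sketch of that mapping (this is the general construction from Zhou et al., not code from GCE-Pose):

```python
import numpy as np

def rotation_from_6d(r6):
    """Map a 6D vector (two predicted column vectors) to a rotation matrix
    by Gram-Schmidt orthonormalization."""
    a1, a2 = r6[:3], r6[3:]
    b1 = a1 / np.linalg.norm(a1)          # first column: normalized a1
    b2 = a2 - (b1 @ a2) * b1              # remove the component along b1
    b2 = b2 / np.linalg.norm(b2)
    b3 = np.cross(b1, b2)                 # third column completes a right-handed frame
    return np.stack([b1, b2, b3], axis=1)
```

The output is always a valid rotation ($\mathbf{R}^T\mathbf{R} = I$, $\det \mathbf{R} = 1$), which is why this representation is continuous and regression-friendly.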

3.5. Overall Loss Function

The overall loss function is:

$$ \begin{equation}\tag{18} \mathcal{L}_\text{all} = \lambda_1\mathcal{L}_\text{ocd} + \lambda_2\mathcal{L}_\text{div} + \lambda_3\mathcal{L}_\text{rec} + \lambda_4\mathcal{L}_\text{nocs} + \lambda_5\mathcal{L}_\text{pose}, \end{equation} $$

where $\lambda_1$, $\lambda_2$, $\lambda_3$, $\lambda_4$ and $\lambda_5$ are hyperparameters balancing the terms. For $\mathcal{L}_\text{pose}$ we use:

$$ \begin{equation}\tag{19} \mathcal{L}_\text{pose} = \lVert \mathbf{R}_\text{gt} - \mathbf{R} \rVert_2 + \lVert \mathbf{t}_\text{gt} - \mathbf{t} \rVert_2 + \lVert \mathbf{s}_\text{gt} - \mathbf{s} \rVert_2. \end{equation} $$

Using the ground-truth pose $\mathbf{T}_\text{gt} = (\mathbf{R}_\text{gt}, \mathbf{t}_\text{gt}, \mathbf{s}_\text{gt})$, GCE-Pose projects the camera-space keypoint coordinates $\mathbf{P}_\text{kpt}$ into NOCS space to generate the ground-truth keypoint NOCS coordinates $\mathbf{P}_\text{kpt}^\text{gt}$. The loss term $\mathcal{L}_\text{nocs}$ is then supervised with a Smooth $L_1$ loss:

$$ \begin{equation}\tag{20} \mathcal{L}_\text{nocs} = \lVert\mathbf{P}_\text{kpt}^\text{gt} - \mathbf{P}_\text{kpt}^\text{nocs}\rVert_\text{SL1}. \end{equation} $$
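For reference, a Smooth $L_1$ sketch with the common $\beta$ threshold (the note does not state the $\beta$ used; 1.0 here is an assumption):

```python
import numpy as np

def smooth_l1(pred, target, beta=1.0):
    """Smooth L1: quadratic below beta, linear above — less sensitive to
    outlier correspondences than a plain L2 loss."""
    d = np.abs(pred - target)
    return np.where(d < beta, 0.5 * d ** 2 / beta, d - 0.5 * beta).mean()
```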

4. Experiment

4.1. Implementation Details

4.2. Evaluation Benchmarks

4.3. Comparison with the State-of-the-Art

Figure 4. Visualization of category-level object pose estimation results on HouseCat6D dataset. Predicted 3D bounding boxes are shown in red, with ground truth in green. Challenging cases are highlighted in pink side squares. Leveraging our global context-enhanced pose prediction pipeline, GCE-Pose outperforms the SOTA AG-Pose (DINO), demonstrating robustness to occlusions and strong generalization to novel instances.

4.4. Ablation Studies

Figure 5. (A) shows the visualization of input partial points and the output semantic shape reconstructions; (B) visualizes the semantic prototypes of different categories and the aggregated instance semantics.

5. Conclusions and Limitations

Supplementary Material

Figure 6. Visualization of feature point cloud using PCA. Upper row: Key feature. Bottom row: Value feature. The zoom-in visualizations indicate embedding changes in the key feature around symmetric areas, which are negligible for the value feature.

Figure 7. Semantics consistency for the "camera" class despite geometry changes in the lens area.

Figure 8. NOCS dataset bounding box visualization. Green indicates GT, and red indicates prediction results.

Figure 9. HouseCat6D bounding box visualization. Green indicates GT, and red indicates prediction results.

Figure 10. Visualization of HouseCat6D Keypoint NOCS Error. Red indicates a high error; green indicates a low error.

Figure 11. Visualization of Semantic prototypes and in-class semantic transfer results in HouseCat6D dataset.
