6-DOF Grasp Pose Evaluation and Optimization via Transfer Learning from NeRFs

We address the problem of robotic grasping of known and unknown objects using implicit behavior cloning. From a small number of demonstrations, we train a grasp evaluation model that outputs higher values for grasp candidates that are more likely to succeed. This evaluation model serves as an objective function that we maximize to identify successful grasps. Key to our approach is the use of learned implicit representations of visual and geometric features derived from a pre-trained NeRF. Though trained exclusively in a simulated environment with simplified objects and 4-DoF top-down grasps, our evaluation model and optimization procedure generalize to 6-DoF grasps and novel objects, both in simulation and in real-world settings, without the need for additional data.
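The idea of treating the evaluation model as an objective to maximize can be illustrated with a minimal, framework-free sketch. Everything in it is an assumption made for illustration: `toy_grasp_score` is a made-up stand-in for the learned model, the pose layout (x, y, z, roll, pitch, yaw) is hypothetical, and the sample-then-refine search is a generic derivative-free strategy, not necessarily the optimization procedure used in this work.

```python
import random


def toy_grasp_score(pose):
    """Hypothetical stand-in for the learned evaluation model.

    Returns higher values for poses closer to a made-up target grasp.
    """
    target = (0.4, 0.0, 0.1, 0.0, 1.57, 0.0)  # x, y, z, roll, pitch, yaw (invented)
    return -sum((p - t) ** 2 for p, t in zip(pose, target))


def optimize_grasp(score_fn, n_candidates=32, n_steps=200, step_size=0.05, seed=0):
    """Maximize score_fn over 6-DoF poses by sampling and local refinement."""
    rng = random.Random(seed)
    # Stage 1: sample random 6-DoF candidates and keep the best-scoring one.
    candidates = [
        tuple(rng.uniform(-1.0, 1.0) for _ in range(6)) for _ in range(n_candidates)
    ]
    best = max(candidates, key=score_fn)
    best_score = initial_score = score_fn(best)
    # Stage 2: perturb the current best pose and accept only improvements.
    for _ in range(n_steps):
        proposal = tuple(p + rng.gauss(0.0, step_size) for p in best)
        proposal_score = score_fn(proposal)
        if proposal_score > best_score:
            best, best_score = proposal, proposal_score
    return best, best_score, initial_score


best_pose, best_score, initial_score = optimize_grasp(toy_grasp_score)
```

Because proposals are accepted only when they improve the score, the returned score is never lower than that of the best initial candidate. With a differentiable model, a gradient-based optimizer could replace the random perturbations.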

Supplementary Material

Network Architectures

We implemented our models in TensorFlow 2.9.1. In the figures, blocks with rounded corners represent utilities, activation functions, or batch normalization and are not trainable. For these, we either use available implementations from tensorflow, keras, or tensorflow-addons, or implement them ourselves. Blocks with sharp corners represent inputs or trainable variables. Ocher blocks are layers available in keras or tensorflow.
We use two main pipelines in our work: the rendering pipeline and the grasp evaluation pipeline. The rendering pipeline renders novel views of the scene given some input observations with known camera poses. We mainly use it to pre-train the NeRF model, which then serves as a feature extractor in the grasp evaluation pipeline. The following figures show the basic architecture of both pipelines.

Rendering pipeline

Grasp evaluation pipeline

When training using the rendering pipeline, all trainable variables are trained. When training using the grasp evaluation pipeline, only the trainable variables of the GraspReadout are trained.
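The selective training described above can be sketched without any framework. In the following minimal example, the module names mirror the blocks named in the text, while the scalar parameters, the gradients, and the `sgd_step` helper are hypothetical and serve only to show that frozen modules are left unchanged by an update.

```python
# Modules that receive updates in each pipeline, following the description above.
# The grouping into named modules is a simplification for illustration.
TRAINABLE_MODULES = {
    "rendering": {"VisualFeatures", "MVResNetMLP", "RenderReadout"},
    "grasp_evaluation": {"GraspReadout"},
}


def sgd_step(params, grads, trainable, lr=0.1):
    """Apply one SGD update, touching only the modules listed in `trainable`."""
    updated = {}
    for name, values in params.items():
        if name in trainable:
            updated[name] = [v - lr * g for v, g in zip(values, grads[name])]
        else:
            updated[name] = list(values)  # frozen: copied through unchanged
    return updated


# Toy parameters and gradients for the grasp evaluation pipeline (invented values).
params = {"VisualFeatures": [0.5, -0.2], "MVResNetMLP": [0.1], "GraspReadout": [0.3, 0.7]}
grads = {"VisualFeatures": [1.0, 1.0], "MVResNetMLP": [1.0], "GraspReadout": [1.0, -1.0]}

new_params = sgd_step(params, grads, TRAINABLE_MODULES["grasp_evaluation"])
```

After this step, only the `GraspReadout` entries differ from their initial values. In Keras, the same effect is usually obtained by setting `trainable = False` on the layers or models that should stay fixed.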
Both pipelines employ VisualFeatures, a module that extracts visual features from the input images. The following figure shows its architecture.


Its implementation is analogous to the feature extraction module in VisionNeRF from the paper "Vision Transformer for NeRF-Based View Synthesis from a Single Input Image" by Lin et al. (https://cseweb.ucsd.edu/~viscomp/projects/VisionNeRF). The model relies on a Vision Transformer (ViT), which is based on the implementation in Ross Wightman's timm library (https://github.com/rwightman/pytorch-image-models). The following figures show the components used in the VisualFeatures module.












For novel view synthesis using the rendering pipeline, we need to process the individual 5-DoF poses that are sampled along a camera ray before we can apply volumetric rendering. As in VisionNeRF, we use a two-stage approach consisting of a coarse and a fine stage. The following figure shows the processing of a single 5-DoF pose.

Process single 5-DoF pose - render
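To make the sampling and rendering steps concrete, the following minimal sketch samples 5-DoF poses along one camera ray and composites per-sample densities and colors with the standard NeRF quadrature. It rests on several assumptions: the 5-DoF pose is taken to be a 3D position plus a viewing direction encoded as azimuth and elevation, depths are placed at evenly spaced bin midpoints, and the densities and colors are supplied by hand instead of by a network. The function names are hypothetical.

```python
import math


def sample_ray_poses(origin, direction, near, far, n_samples):
    """Return depths and 5-DoF poses (x, y, z, azimuth, elevation) along one ray."""
    norm = math.sqrt(sum(d * d for d in direction))
    dx, dy, dz = (d / norm for d in direction)
    azimuth = math.atan2(dy, dx)
    elevation = math.asin(max(-1.0, min(1.0, dz)))  # clamp against rounding error
    step = (far - near) / n_samples
    depths = [near + (i + 0.5) * step for i in range(n_samples)]  # bin midpoints
    poses = [
        (origin[0] + t * dx, origin[1] + t * dy, origin[2] + t * dz, azimuth, elevation)
        for t in depths
    ]
    return depths, poses


def composite(densities, colors, depths, far):
    """Volume rendering: weight_i = T_i * (1 - exp(-sigma_i * delta_i))."""
    transmittance = 1.0  # fraction of light that reaches the current sample
    pixel = [0.0, 0.0, 0.0]
    weights = []
    for i, (sigma, color) in enumerate(zip(densities, colors)):
        next_depth = depths[i + 1] if i + 1 < len(depths) else far
        alpha = 1.0 - math.exp(-sigma * (next_depth - depths[i]))
        weight = transmittance * alpha
        weights.append(weight)
        pixel = [p + weight * c for p, c in zip(pixel, color)]
        transmittance *= 1.0 - alpha
    return pixel, weights


depths, poses = sample_ray_poses((0.0, 0.0, 0.0), (0.0, 0.0, 2.0), 1.0, 3.0, 4)
# Hand-picked values: an opaque green surface at the second sample.
pixel, weights = composite(
    [0.0, 1e9, 0.0, 0.0],
    [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0), (1.0, 1.0, 1.0)],
    depths,
    far=3.0,
)
```

In the original NeRF formulation, the weights from the coarse stage guide where the fine stage places additional samples; whether the pipeline described here does the same is not stated in this section.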

The MVResNetMLP fuses the information from multiple views into a single feature vector, or into the scene activations. For rendering, the model uses two MVResNetMLPs: one for the coarse stage and one for the fine stage. For grasp evaluation we use a single MVResNetMLP that is initialized with the weights of the fine stage. The following figure shows the architecture of the MVResNetMLP and its main building block, the ResNetMLPBlock.
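As a rough illustration of the two ideas named above, the following sketch shows a residual MLP block and a fusion of per-view features into one vector. It is not the architecture from the figure: the pre-activation layout of the block, the use of mean pooling for fusion, and all function names are assumptions, and plain Python lists stand in for tensors.

```python
def linear(x, weight, bias):
    """Dense layer on a plain list; `weight` holds one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def relu(x):
    """Element-wise rectified linear unit."""
    return [max(0.0, xi) for xi in x]


def resnet_mlp_block(x, w1, b1, w2, b2):
    """Residual block (assumed pre-activation): x + W2 relu(W1 relu(x) + b1) + b2."""
    hidden = linear(relu(x), w1, b1)
    update = linear(relu(hidden), w2, b2)
    return [xi + ui for xi, ui in zip(x, update)]


def fuse_views(per_view_features):
    """Fuse per-view feature vectors by averaging (assumed fusion operator)."""
    n_views = len(per_view_features)
    return [sum(column) / n_views for column in zip(*per_view_features)]


# Two hypothetical views, each passed through the same block, then fused.
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [0.0, 0.0]
views = [[1.0, -2.0], [3.0, 4.0]]
processed = [resnet_mlp_block(v, identity, zeros, identity, zeros) for v in views]
fused = fuse_views(processed)
```

The skip connection means that a block with all-zero weights acts as the identity, which is one reason residual blocks are convenient when a network is initialized from pre-trained weights.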



The MVResNetMLP extracts the features from the input images and the poses. These are then further processed by the readouts. The RenderReadout uses the output embeddings to render the scene, and the GraspReadout uses the output scene activations to evaluate the grasp candidates. The following figure shows the architecture of both readout modules and the GraspReadout's main building blocks.





Additional content coming soon ...


Gergely Sóti
Institute for Robotics and Autonomous Systems, Karlsruhe University of Applied Sciences, 76133 Karlsruhe, Germany