We implemented our models in TensorFlow 2.9.1. Blocks with rounded corners represent utilities,
activation functions, or batch normalization and are not trainable. For these, we either use available
implementations from TensorFlow, Keras, or TensorFlow Addons, or we implement them ourselves.
Blocks with sharp corners represent inputs or trainable variables.
Ocher blocks are layers available in Keras or TensorFlow.
We use two main pipelines in our work. The first is the rendering pipeline. Its purpose
is to render novel views of the scene given input observations with known camera poses. We mainly
use it to pre-train the NeRF model and as a feature extractor for grasp evaluation in the second
pipeline, the grasp evaluation pipeline. The following figures show the basic
architecture of both pipelines.
Figure: Grasp evaluation pipeline
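As an illustration of the data flow just described, the sketch below wires the modules introduced later in this section (VisualFeatures, MVResNetMLP, RenderReadout, GraspReadout) into the two pipelines. The call signatures, the dictionary of modules, the way query poses enter each pipeline, and the omission of the resampling step between the coarse and fine stage are assumptions for illustration, not the exact interfaces of our implementation.

```python
# Hypothetical high-level data flow of both pipelines, using the module names
# introduced later in this section. "modules" is an assumed dict of callables;
# the call signatures and how query poses enter each pipeline are assumptions.

def rendering_pipeline(modules, images, input_poses, ray_sample_poses):
    """Render colour/density for 5-DoF poses sampled along camera rays."""
    feats = modules["visual_features"](images)                          # per-view features
    coarse = modules["mv_resnet_mlp_coarse"]((feats, input_poses, ray_sample_poses))
    fine = modules["mv_resnet_mlp_fine"]((feats, input_poses, ray_sample_poses))
    return modules["render_readout"]((coarse, fine))                    # rendered outputs

def grasp_evaluation_pipeline(modules, images, input_poses, grasp_poses):
    """Score grasp candidates from the fine-stage scene activations."""
    feats = modules["visual_features"](images)
    scene_activations = modules["mv_resnet_mlp_fine"]((feats, input_poses, grasp_poses))
    return modules["grasp_readout"](scene_activations)                  # grasp scores
```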
When training with the rendering pipeline, all trainable variables are updated. When training
with the grasp evaluation pipeline, only the trainable variables of the GraspReadout
are updated.
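A minimal sketch of how these two training regimes could be realized in Keras is shown below. It assumes the full pipeline is a tf.keras.Model whose sub-modules carry the names from the figures; the helper name and the layer name "grasp_readout" are assumptions.

```python
import tensorflow as tf

def set_training_regime(pipeline: tf.keras.Model, regime: str):
    """Toggle which variables are trained (hypothetical helper)."""
    if regime == "rendering":
        # Rendering pre-training: every trainable variable is updated.
        for layer in pipeline.layers:
            layer.trainable = True
    elif regime == "grasp":
        # Grasp training: freeze everything except the GraspReadout.
        for layer in pipeline.layers:
            layer.trainable = (layer.name == "grasp_readout")
    else:
        raise ValueError(f"Unknown regime: {regime}")
```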
Both pipelines employ VisualFeatures, a module that extracts visual features from the input
images. The following figure shows its architecture.
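The exact architecture of VisualFeatures is defined by the figure; as a stand-in, the sketch below shows a small convolutional encoder that maps each input view to a spatial feature map. Layer counts, widths, and the input resolution are illustrative assumptions.

```python
import tensorflow as tf

def build_visual_features(height=128, width=128, channels=3, feat_dim=64):
    """Hypothetical stand-in for VisualFeatures: a small convolutional encoder."""
    inputs = tf.keras.Input(shape=(height, width, channels))
    x = inputs
    for filters in (32, 64, feat_dim):
        # Downsample while increasing the channel count.
        x = tf.keras.layers.Conv2D(filters, 3, strides=2, padding="same")(x)
        x = tf.keras.layers.BatchNormalization()(x)
        x = tf.keras.layers.ReLU()(x)
    return tf.keras.Model(inputs, x, name="visual_features")
```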
For novel view synthesis with the rendering pipeline, we need to process individual 5-DoF poses
sampled along a camera ray before we can apply volumetric rendering. As in VisionNeRF,
we use a two-stage approach consisting of a coarse and a fine stage. The following figure shows the
processing of a single 5-DoF pose.
Figure: Processing a single 5-DoF pose (rendering)
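To make the input of this step concrete, the sketch below generates 5-DoF poses (a 3D sample position plus a 2D viewing direction) along a single camera ray for the coarse stage. Uniformly spaced depth samples are assumed here; the fine stage would instead resample depths according to the coarse-stage weights. The function name and signature are assumptions.

```python
import tensorflow as tf

def sample_poses_along_ray(origin, direction, near, far, n_samples):
    """Return positions [n_samples, 3] and viewing angles (theta, phi) [n_samples, 2]
    for uniformly spaced depths along one camera ray (illustrative sketch)."""
    t = tf.linspace(near, far, n_samples)                       # depths along the ray
    positions = origin[None, :] + t[:, None] * direction[None, :]
    d = direction / tf.norm(direction)                          # unit viewing direction
    theta = tf.acos(d[2])                                       # polar angle
    phi = tf.atan2(d[1], d[0])                                  # azimuth
    angles = tf.tile(tf.stack([theta, phi])[None, :], [n_samples, 1])
    return positions, angles
```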
The MVResNetMLP fuses the information from multiple views into a single feature vector or, in the
grasp evaluation pipeline, into the scene activations. For rendering, the model uses two MVResNetMLPs:
one for the coarse stage and one for the fine stage. For grasp evaluation, we use a single MVResNetMLP
that is initialized with the weights of the fine stage. The following figure shows the architecture of the
MVResNetMLP and its main building block, the ResNetMLPBlock.
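The sketch below shows one common way such blocks are built: a residual MLP block of two dense layers with a skip connection, and a fusion module that mean-pools over the view axis. The widths, the depth, the choice of mean pooling as the fusion operator, and the omission of the pose inputs are assumptions; the figure defines the actual architecture.

```python
import tensorflow as tf

class ResNetMLPBlock(tf.keras.layers.Layer):
    """Illustrative residual MLP block: two dense layers plus a skip connection."""
    def __init__(self, units):
        super().__init__()
        self.dense1 = tf.keras.layers.Dense(units, activation="relu")
        self.dense2 = tf.keras.layers.Dense(units)

    def call(self, x):
        return x + self.dense2(self.dense1(x))    # residual connection

class MVResNetMLP(tf.keras.layers.Layer):
    """Illustrative multi-view fusion: per-view MLP followed by mean pooling."""
    def __init__(self, units=256, n_blocks=4):
        super().__init__()
        self.proj = tf.keras.layers.Dense(units)
        self.blocks = [ResNetMLPBlock(units) for _ in range(n_blocks)]

    def call(self, per_view_features):
        # per_view_features: [batch, n_views, feat_dim]
        x = self.proj(per_view_features)
        for block in self.blocks:
            x = block(x)
        return tf.reduce_mean(x, axis=1)           # fuse views into one vector
```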
The MVResNetMLP extracts features from the input images and poses; these are then further
processed by the readouts. The RenderReadout uses the output embeddings to render the scene,
and the GraspReadout uses the output scene activations to evaluate the grasp candidates. The
following figure shows the architecture of both readout modules and the GraspReadout's main
building blocks.
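As a sketch of the two readout heads, the code below assumes the RenderReadout maps an embedding to a per-sample colour and density and the GraspReadout maps the scene activations to a single grasp-success logit. These output conventions, layer sizes, and builder names are assumptions; the figure defines the actual layers.

```python
import tensorflow as tf

def build_render_readout(embed_dim=256):
    """Hypothetical RenderReadout: embedding -> per-sample colour and density."""
    emb = tf.keras.Input(shape=(embed_dim,))
    rgb = tf.keras.layers.Dense(3, activation="sigmoid")(emb)    # per-sample colour
    sigma = tf.keras.layers.Dense(1, activation="relu")(emb)     # per-sample density
    return tf.keras.Model(emb, [rgb, sigma], name="render_readout")

def build_grasp_readout(activation_dim=256, hidden=128):
    """Hypothetical GraspReadout: scene activations -> grasp success logit."""
    act = tf.keras.Input(shape=(activation_dim,))
    x = tf.keras.layers.Dense(hidden, activation="relu")(act)
    score = tf.keras.layers.Dense(1)(x)                          # grasp success logit
    return tf.keras.Model(act, score, name="grasp_readout")
```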