Fifth Update
Mateo de Mayo
17 April 2021
Summary
This update gives an overview of the architecture and data structures that compose the Kimera-VIO project. The solution seems very complete, but it appears to lack the performance needed for real-time applications, although I have not explored it thoroughly enough to assert that just yet.
Building Kimera
Kimera provides a dockerfile so the project can be previewed in a Docker container. In my case that worked well, but for development I preferred to install everything directly on my system instead of using Docker. The repository provides an installation guide, but it is not specific enough about which snapshots of the dependencies the project should use, and as such the build breaks in many dependency setups even though they are compatible with the guide. After reading many GitHub issues related to building problems, I figured it would be better to just replicate what the dockerfile does, and indeed, that finally worked.
For future reference, here is a list of the dependencies used:
C++ Refresher
While I quite like C++ and I’ve read some books about the language, I’ve never had the luck to work on a project that uses it for a decent amount of time. This lack of practice shows when reading new code and makes the process a bit slower than one would prefer.
Kimera in particular seems to be mainly implemented in C++11. I had to take some time to remember value categories and learn how type deduction of lvalues and rvalues interacts with templates for move semantics, in order to understand why std::move and std::forward exist (they are basically static_cast<T&&> with extra steps). There is also std::bind, which works like the bind functions found in many other languages: it partially applies arguments to functions. Additionally, it can be used to bind class methods to instances of those classes. To this end, it uses pointers to member functions, which had flown under my radar along with the .* and ->* operators; though one should rarely use those directly.
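To cement these ideas, here is a small self-contained toy (not Kimera code) that exercises std::forward, std::move, std::bind, and a raw pointer to member function:

#include <functional>
#include <iostream>
#include <string>
#include <utility>

// std::forward preserves the value category of the argument: lvalues stay
// lvalues and rvalues stay rvalues, so the right overload of sink is chosen.
void sink(const std::string&) { std::cout << "lvalue\n"; }
void sink(std::string&&) { std::cout << "rvalue\n"; }

template <typename T>
void relay(T&& arg) {
  sink(std::forward<T>(arg));  // essentially static_cast<T&&>(arg)
}

struct Greeter {
  std::string name;
  void greet(const std::string& to) const {
    std::cout << name << " greets " << to << "\n";
  }
};

int main() {
  std::string s = "hello";
  relay(s);             // prints "lvalue"
  relay(std::move(s));  // prints "rvalue"

  // std::bind partially applies arguments; binding a method to an instance
  // goes through a pointer to member function (&Greeter::greet).
  Greeter g{"kimera"};
  auto greet_world = std::bind(&Greeter::greet, &g, std::placeholders::_1);
  greet_world("world");  // prints "kimera greets world"

  // The .* operator mentioned above, for completeness only:
  void (Greeter::*pm)(const std::string&) const = &Greeter::greet;
  (g.*pm)("again");
  return 0;
}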
Kimera Architecture
Kimera-VIO is implemented as a pipeline (with some caveats outlined in its TODOs) which can run in parallel. It defines PipelineModules, which are configured via YAML parameter files and compose the Pipeline. Between those modules sit thread-safe queues on which one can put and get different amounts of Packets (single-input multiple-output SIMO queues, MIMO, MISO, and so on). Here is a diagram of this pipeline from the Kimera repo:
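To make those queues more concrete, here is a heavily simplified, single-input single-output sketch of the idea, together with the kind of "spin" loop a module could run on top of it. This is my mental model, not Kimera’s actual ThreadsafeQueue or PipelineModule code:

#include <condition_variable>
#include <mutex>
#include <queue>
#include <utility>

// Minimal blocking queue; Kimera generalizes this idea to SIMO/MIMO/MISO
// variants and adds shutdown handling, timeouts, and logging.
template <typename T>
class ThreadsafeQueueSketch {
 public:
  void push(T value) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      queue_.push(std::move(value));
    }
    cv_.notify_one();
  }

  // Blocks until an element is available.
  T popBlocking() {
    std::unique_lock<std::mutex> lock(mutex_);
    cv_.wait(lock, [this] { return !queue_.empty(); });
    T value = std::move(queue_.front());
    queue_.pop();
    return value;
  }

 private:
  std::mutex mutex_;
  std::condition_variable cv_;
  std::queue<T> queue_;
};

// A module's "spin" then boils down to: pop an input packet, run the
// wrapped algorithm on it, and push the result to the next module's queue.
template <typename Input, typename Output>
void spinOnce(ThreadsafeQueueSketch<Input>& in,
              ThreadsafeQueueSketch<Output>& out,
              Output (*process)(const Input&)) {
  Input packet = in.popBlocking();
  out.push(process(packet));
}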
Each pipeline module is a wrapper around a class that implements its main algorithm. The wrapped classes are:
1. EurocDataProvider: Reads the EuRoC dataset from disk; it is currently the only supported dataset provider, and it could be interesting to extend it with TUM VI. Its output is similar to:
class StereoImuSyncPacket {
  StereoFrame stereo_frame;  // Sanitized left and right images
  ImuStampS imu_stamps;      // IMU timestamps
  ImuAccGyrS imu_accgyrs;    // IMU accelerometer and gyroscope samples
};
2. StereoVisionFrontEnd: Takes the left and right camera images together with the IMU samples, detects and tracks the frames’ landmarks, and calculates the view pose relative to them. Its output is along the lines of:
class FrontendOutput {
  StereoMeasurements stereo_meas;          // Tracker status, body pose and 2D landmarks
  gtsam::Pose3 relative_pose_body_stereo;  // Pose difference with previous frame
  StereoFrame stereo_frame_lkf;            // Similar to this module's input
  ImuFrontEnd::PimPtr pim;                 // IMU fusion (preintegration) data
  ImuAccGyrS imu_acc_gyrs;                 // Accelerometer-gyroscope samples
  cv::Mat feature_tracks;                  // Display image with the landmark path for this frame
};
3. VioBackEnd: This class uses gtsam extensively and is responsible for “adding up” the visual-inertial states (the landmarks and viewer relative poses) on top of each other to build the graph of absolute pose predictions (see the small gtsam example after this list). It can also estimate the initial pose and velocity state based on IMU data. It produces an output similar to:
class BackendOutput {
  // Estimation of pose, velocity, and IMU bias at a certain timestamp
  VioNavStateTimestamped W_State_Blkf;
  // Probabilistic state to be reused in next "spins" of the pipeline.
  // This portion of Kimera uses a lot of gtsam algorithms and structures.
  gtsam::Values state;
  gtsam::Matrix state_covariance_lkf;  // lkf = last key frame
  gtsam::NonlinearFactorGraph factor_graph;
  // Landmarks data
  int landmark_count;
  PointsWithIdMap landmarks_with_id_map;     // Map of id -> 3D position (vector3)
  LmkIdToLmkTypeMap lmk_id_to_lmk_type_map;  // Map of id -> landmark type
};
4. Mesher: It builds both a per-frame and a multi-frame (slower) mesh of the environment. This one will not be necessary at first for what I want, but it is nice to have.
class MesherOutput {
  Mesh2D mesh_2d_;  // Mesh of screen-space 2D landmarks
  Mesh3D mesh_3d_;  // Mesh of world-space 3D landmarks
};
5. Visualizer: Finally, it produces a visualization of the EuRoC footage with the detected landmarks, plus a 3D view of the reconstructed trajectory and meshes.
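Since the backend output is essentially gtsam data, here is a toy example of the factor-graph primitives it exposes (gtsam::NonlinearFactorGraph, gtsam::Values). This is just me getting familiar with the library, not Kimera’s backend code:

#include <gtsam/geometry/Pose3.h>
#include <gtsam/inference/Symbol.h>
#include <gtsam/nonlinear/LevenbergMarquardtOptimizer.h>
#include <gtsam/nonlinear/NonlinearFactorGraph.h>
#include <gtsam/nonlinear/Values.h>
#include <gtsam/slam/BetweenFactor.h>
#include <gtsam/slam/PriorFactor.h>

using gtsam::symbol_shorthand::X;  // X(i) names the i-th pose variable

int main() {
  gtsam::NonlinearFactorGraph graph;
  auto noise = gtsam::noiseModel::Isotropic::Sigma(6, 0.1);

  // Anchor the first pose at the origin, then "add up" a relative pose
  // between consecutive frames, similar to what the backend does with the
  // frontend's relative_pose_body_stereo.
  graph.add(gtsam::PriorFactor<gtsam::Pose3>(X(0), gtsam::Pose3(), noise));
  gtsam::Pose3 step(gtsam::Rot3(), gtsam::Point3(1.0, 0.0, 0.0));
  graph.add(gtsam::BetweenFactor<gtsam::Pose3>(X(0), X(1), step, noise));

  // Initial guesses for the absolute poses; the optimizer refines them.
  gtsam::Values initial;
  initial.insert(X(0), gtsam::Pose3());
  initial.insert(X(1), gtsam::Pose3(gtsam::Rot3(), gtsam::Point3(0.9, 0.0, 0.0)));

  gtsam::Values result =
      gtsam::LevenbergMarquardtOptimizer(graph, initial).optimize();
  result.print("optimized poses:\n");
  return 0;
}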
Tools and Performance
There are other tools in the Kimera ecosystem, like the Kimera-VIO-Evaluation repository, which uses Jupyter notebooks with a modified version of evo, a Python package for evaluating SLAM algorithms. There is also Kimera-Semantics, which builds on top of Voxblox (paper) to offer another kind of mesh reconstruction of the environment; it is separate from the Mesher but supposedly slower. Finally, the Kimera project also developed the uHumans and uHumans2 datasets, which are built on top of a Unity simulation. I think the idea is quite nice considering how expensive mocap systems can be, with the additional benefit of being able to try out any camera parameters you could want. However, I do wonder if the not-so-photorealistic lighting or the too-perfect cameras could be a problem when translating a system that works well on these kinds of datasets to a real-world application.
By default, I’m not getting very good performance from Kimera-VIO (about five frames per second), and while that may very well be me compiling the project in a suboptimal manner, I do think there might be a performance problem in Kimera itself. “Kimera is heavily parallelized and uses four threads” is a phrase from their paper, but I don’t think “heavily parallelized” and “four threads” are a good match. In the paper they describe these threads as being used for: the frontend; the backend and mesher; RPGO (loop closure); and Kimera-Semantics. This means each of these algorithms is sequential and runs on the CPU. While I’m no expert in computer vision, not using a GPU for image-related problems looks odd to me. With ORB-SLAM3 I only found mentions of OpenMP but nothing related to GPU parallelism, so it may have the same problem. I like the idea of eventually porting these algorithms to run on GPUs, but on the other hand, that effort would surely not be trivial.
What’s Next
As I’ve dedicated some time to understanding the Kimera codebase and my time for this is limited, I will probably continue with Kimera and set aside ORB-SLAM3 for the time being. With this better understanding of the project architecture, and in order to run a SLAM solution on some custom-made footage, I will need to dive deeper into both the frontend and backend modules to understand their interactions and some of their theoretical bases.