Fifth Update
Mateo de Mayo
17 April 2021
Summary
This update gives an overview of the architecture and data structures that compose the Kimera-VIO project. The solution seems very complete, but it appears to lack the performance needed for real-time applications, although I have not explored it thoroughly enough to assert that just yet.
Building Kimera
Kimera provides a dockerfile so the project can be previewed in a Docker container. In my case that worked well, but for development I preferred to install everything directly on my system instead of using Docker. The repository provides an installation guide, but it is not specific enough about which snapshots of the dependencies the project should use, and as such the build breaks in many dependency setups even though they are compatible with the guide. After reading many GitHub issues related to building problems, I figured it would be better to just replicate what the dockerfile does, and indeed, that finally worked.
For future reference, here is a list of the dependencies used:
C++ Refresher
While I quite like C++ and I’ve read some books about the language, I’ve never had the luck to work on a project that uses it for a decent amount of time. This lack of practice shows when reading new code and makes the process a bit slower than one would prefer.
Kimera in particular seems to be mainly implemented in C++11. I had to take some time to remember value categories and learn how type deduction of lvalues and rvalues interacts with templates for move semantics, in order to understand why std::move and std::forward exist (they are basically static_cast<T&&> with extra steps). There is also std::bind, which works like the bind functions found in many other languages: it partially applies arguments to functions. Additionally, it can be used to bind class methods to instances of those classes. To this end, it uses pointers to member functions, which had flown under my radar along with the .* and ->* operators; though one should rarely use those directly.
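To cement these ideas, here is a small self-contained toy (not Kimera code) that exercises std::forward, std::move, std::bind, and a raw pointer to member function:

#include <functional>
#include <iostream>
#include <string>
#include <utility>

// std::forward preserves the value category of the argument: lvalues stay
// lvalues and rvalues stay rvalues, so the right overload of sink is chosen.
void sink(const std::string&) { std::cout << "lvalue\n"; }
void sink(std::string&&) { std::cout << "rvalue\n"; }

template <typename T>
void relay(T&& arg) {
  sink(std::forward<T>(arg));  // essentially static_cast<T&&>(arg)
}

struct Greeter {
  std::string name;
  void greet(const std::string& to) const {
    std::cout << name << " greets " << to << "\n";
  }
};

int main() {
  std::string s = "hello";
  relay(s);             // prints "lvalue"
  relay(std::move(s));  // prints "rvalue"

  // std::bind partially applies arguments; binding a method to an instance
  // goes through a pointer to member function (&Greeter::greet).
  Greeter g{"kimera"};
  auto greet_world = std::bind(&Greeter::greet, &g, std::placeholders::_1);
  greet_world("world");  // prints "kimera greets world"

  // The .* operator mentioned above, for completeness only:
  void (Greeter::*pm)(const std::string&) const = &Greeter::greet;
  (g.*pm)("again");
  return 0;
}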
Kimera Architecture
Kimera-VIO is implemented as a pipeline (with some caveats outlined in its TODOs) which can run in parallel. It defines PipelineModules, which are configured via YAML parameter files and compose the Pipeline. Between those modules sit thread-safe queues on which one can put and get different amounts of Packets (single-input multiple-output SIMO queues, MIMO, MISO, and so on). Here is a diagram of this pipeline from the Kimera repo:
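To make those queues more concrete, here is a heavily simplified, single-input single-output sketch of the idea, together with the kind of "spin" loop a module could run on top of it. This is my mental model, not Kimera’s actual ThreadsafeQueue or PipelineModule code:

#include <condition_variable>
#include <mutex>
#include <queue>
#include <utility>

// Minimal blocking queue; Kimera generalizes this idea to SIMO/MIMO/MISO
// variants and adds shutdown handling, timeouts, and logging.
template <typename T>
class ThreadsafeQueueSketch {
 public:
  void push(T value) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      queue_.push(std::move(value));
    }
    cv_.notify_one();
  }

  // Blocks until an element is available.
  T popBlocking() {
    std::unique_lock<std::mutex> lock(mutex_);
    cv_.wait(lock, [this] { return !queue_.empty(); });
    T value = std::move(queue_.front());
    queue_.pop();
    return value;
  }

 private:
  std::mutex mutex_;
  std::condition_variable cv_;
  std::queue<T> queue_;
};

// A module's "spin" then boils down to: pop an input packet, run the
// wrapped algorithm on it, and push the result to the next module's queue.
template <typename Input, typename Output>
void spinOnce(ThreadsafeQueueSketch<Input>& in,
              ThreadsafeQueueSketch<Output>& out,
              Output (*process)(const Input&)) {
  Input packet = in.popBlocking();
  out.push(process(packet));
}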
Each pipeline module is a wrapper around a class that implements its main algorithm. The wrapped classes are:
1. EurocDataProvider: Reads the EuRoC dataset from disk; it is currently the only supported dataset provider, and it could be interesting to extend it with TUM VI. Its output is similar to:
class StereoImuSyncPacket {
  StereoFrame stereo_frame;  // Sanitized left and right images
  ImuStampS imu_stamps;      // IMU timestamps
  ImuAccGyrS imu_accgyrs;    // IMU accelerometer and gyroscope samples
};
2. StereoVisionFrontEnd: Takes the left and right camera images together with the IMU samples, detects and tracks the frames’ landmarks, and calculates the view pose relative to them. Its output is along the lines of:
class FrontendOutput {
  StereoMeasurements stereo_meas;          // Tracker status, body pose and 2D landmarks
  gtsam::Pose3 relative_pose_body_stereo;  // Pose difference with previous frame
  StereoFrame stereo_frame_lkf;            // Similar to this module's input
  ImuFrontEnd::PimPtr pim;                 // IMU fusion (preintegration) data
  ImuAccGyrS imu_acc_gyrs;                 // Accelerometer-gyroscope samples
  cv::Mat feature_tracks;                  // Display image with the landmark path for this frame
};
3. VioBackEnd: This class uses gtsam extensively and is responsible for “adding up” the visual-inertial states (the landmarks and viewer relative poses) on top of each other to build the graph of absolute pose predictions (see the small gtsam example after this list). It can also estimate the initial pose and velocity state based on IMU data. It produces an output similar to:
class BackendOutput {
  // Estimation of pose, velocity, and IMU bias at a certain timestamp
  VioNavStateTimestamped W_State_Blkf;
  // Probabilistic state to be reused in next "spins" of the pipeline.
  // This portion of Kimera uses a lot of gtsam algorithms and structures.
  gtsam::Values state;
  gtsam::Matrix state_covariance_lkf;  // lkf = last key frame
  gtsam::NonlinearFactorGraph factor_graph;
  // Landmarks data
  int landmark_count;
  PointsWithIdMap landmarks_with_id_map;     // Map of id -> 3D position (vector3)
  LmkIdToLmkTypeMap lmk_id_to_lmk_type_map;  // Map of id -> landmark type
};
4. Mesher: It builds both a per-frame and a multi-frame (slower) mesh of the environment. This one will not be necessary at first for what I want, but it is nice to have.
class MesherOutput {
  Mesh2D mesh_2d_;  // Mesh of screen-space 2D landmarks
  Mesh3D mesh_3d_;  // Mesh of world-space 3D landmarks
};
5. Visualizer: Finally, it produces a visualization of the EuRoC footage with the detected landmarks, plus a 3D view of the reconstructed trajectory and meshes.
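Since the backend output is essentially gtsam data, here is a toy example of the factor-graph primitives it exposes (gtsam::NonlinearFactorGraph, gtsam::Values). This is just me getting familiar with the library, not Kimera’s backend code:

#include <gtsam/geometry/Pose3.h>
#include <gtsam/inference/Symbol.h>
#include <gtsam/nonlinear/LevenbergMarquardtOptimizer.h>
#include <gtsam/nonlinear/NonlinearFactorGraph.h>
#include <gtsam/nonlinear/Values.h>
#include <gtsam/slam/BetweenFactor.h>
#include <gtsam/slam/PriorFactor.h>

using gtsam::symbol_shorthand::X;  // X(i) names the i-th pose variable

int main() {
  gtsam::NonlinearFactorGraph graph;
  auto noise = gtsam::noiseModel::Isotropic::Sigma(6, 0.1);

  // Anchor the first pose at the origin, then "add up" a relative pose
  // between consecutive frames, similar to what the backend does with the
  // frontend's relative_pose_body_stereo.
  graph.add(gtsam::PriorFactor<gtsam::Pose3>(X(0), gtsam::Pose3(), noise));
  gtsam::Pose3 step(gtsam::Rot3(), gtsam::Point3(1.0, 0.0, 0.0));
  graph.add(gtsam::BetweenFactor<gtsam::Pose3>(X(0), X(1), step, noise));

  // Initial guesses for the absolute poses; the optimizer refines them.
  gtsam::Values initial;
  initial.insert(X(0), gtsam::Pose3());
  initial.insert(X(1), gtsam::Pose3(gtsam::Rot3(), gtsam::Point3(0.9, 0.0, 0.0)));

  gtsam::Values result =
      gtsam::LevenbergMarquardtOptimizer(graph, initial).optimize();
  result.print("optimized poses:\n");
  return 0;
}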
Tools and Performance
There are other tools in the Kimera ecosystem, like the Kimera-VIO-Evaluation repository, which uses Jupyter notebooks with a modified version of evo, a Python package for evaluating SLAM algorithms. There is also Kimera-Semantics, which builds on top of Voxblox (paper) to offer another kind of mesh reconstruction of the environment; it is separate from the Mesher but supposedly slower. Finally, the Kimera project also developed the uHumans and uHumans2 datasets, which are built on top of a Unity simulation. I think the idea is quite nice considering how expensive mocap systems can be, with the additional benefit of being able to try out any camera parameters you could want. However, I do wonder if the not-so-photorealistic lighting or the too-perfect cameras could be a problem when translating a system that works well on these kinds of datasets to a real-world application.
By default, I’m not getting very good performance from Kimera-VIO (about five frames per second), and while that may very well be me compiling the project in a suboptimal manner, I do think there might be a performance problem in Kimera itself. “Kimera is heavily parallelized and uses four threads” is a phrase from their paper, but I don’t think “heavily parallelized” and “four threads” are a good match. In the paper they describe these threads as being used for: the frontend; the backend and mesher; RPGO (loop closure); and Kimera-Semantics. This means each of these algorithms is sequential and runs on the CPU. While I’m no expert in computer vision, not using a GPU for image-related problems looks odd to me. With ORB-SLAM3 I only found mentions of OpenMP but nothing related to GPU parallelism, so it may have the same problem. I like the idea of eventually porting these algorithms to run on GPUs, but on the other hand, that effort would surely not be trivial.
What’s Next
As I’ve dedicated some time to understanding the Kimera codebase and my time for this is limited, I will probably continue with Kimera and set aside ORB-SLAM3 for the time being. With this better understanding of the project architecture, and in order to run a SLAM solution on some custom-made footage, I will need to dive deeper into both the frontend and backend modules to understand their interactions and some of their theoretical bases.