Fourth Update
Mateo de Mayo
29 March 2021
Summary
More steps were taken towards getting a better idea of which open-source visual-inertial SLAM implementations are available and what it would take to use one for tracking in Monado. ORB-SLAM3 and Kimera-VIO were the main projects considered. Filtering information from different surveys and continuing to learn the SLAM jargon were a big part of this update as well.
Introduction
The motivation for this post, vaguely stated, was to record a video from my phone and give it to some glue code I would have set up, which in turn would run some black-box open-source implementation of SLAM and return the estimated pose for each video frame; all of this with the idea of eventually integrating the SLAM tracking into Monado. Even though that sounded like a perfect three-part plan in my head, not even the phone video was recorded. Most of the time was spent getting a better understanding of the problems of SLAM implementations, which turn out to be many more than the purely theoretical problems discussed in these introductory papers.
Datasets
In reality, the video input needs some preprocessing, and the camera used needs its parameters calibrated. This is where standardized datasets come in handy. There are many datasets available. EuRoC MAV seems to be one of the most standard, though I would argue that the TUM VI dataset is a better fit as it is newer, has more and more diverse sequences, higher resolution, and photometric calibration. Another thing that makes TUM VI an especially good fit for VR is that it was captured with a handheld camera, rather than with the Micro Aerial Vehicle (MAV) used for the EuRoC dataset. A comparison between many datasets, taken from the TUM VI paper, is presented in the following table:
Reading the TUM VI paper gives a good overview of what it takes to build a usable standardized image sequence to feed a SLAM algorithm. The calibration and tuning of the camera and IMU sensors are an entire problem of their own, involving exposure times, timestamp synchronization, IMU temperature models, and vignette calibration, among others. Having already calibrated datasets takes that weight off.
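To give a feel for why timestamp synchronization matters, here is a minimal sketch of a step that a visual-inertial pipeline typically needs: collecting the IMU samples that fall between two consecutive camera frames. It assumes both timestamp lists are sorted and already expressed in the same clock, which is exactly what a calibrated dataset provides. The function name `group_imu_between_frames` and the sample rates in the example are hypothetical and not taken from any of the projects mentioned here.

```python
from bisect import bisect_right


def group_imu_between_frames(frame_ts, imu_ts):
    """Return, for each pair of consecutive frames, the indices of the IMU
    samples with a timestamp t such that frame_ts[i] < t <= frame_ts[i + 1].

    Assumption: both lists are sorted in ascending order and use the same
    clock and unit (for example, nanoseconds).
    """
    groups = []
    for t0, t1 in zip(frame_ts, frame_ts[1:]):
        lo = bisect_right(imu_ts, t0)  # first sample strictly after t0
        hi = bisect_right(imu_ts, t1)  # one past the last sample <= t1
        groups.append(list(range(lo, hi)))
    return groups


if __name__ == "__main__":
    # Hypothetical rates: one frame every 50 ms, one IMU sample every 5 ms.
    frames = [0, 50_000_000, 100_000_000]
    imu = list(range(0, 100_000_001, 5_000_000))
    for i, group in enumerate(group_imu_between_frames(frames, imu)):
        print(f"frames {i}->{i + 1}: {len(group)} IMU samples")
```

If the two clocks had an unknown offset or drift, this grouping would silently assign samples to the wrong frame interval, which is one reason the dataset authors put so much effort into synchronization.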
In the future, when testing different algorithms, TUM VI will probably be a good baseline for comparing their performance, as it also provides evaluation metrics and ground truth data captured with a motion capture (MoCap) system. It will also be interesting to see how well those results hold up when using real cameras and sensors.
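As an illustration of what such an evaluation can look like, below is a minimal sketch of a root-mean-square position error between an estimated trajectory and the ground truth, in the spirit of the absolute trajectory error commonly reported for SLAM benchmarks. It makes two loud assumptions: both trajectories are lists of `(t, x, y, z)` tuples sorted by time, and they are already expressed in the same reference frame (a full evaluation would first align them, and that step is omitted here). The names `associate` and `ate_rmse` are hypothetical helpers, not part of any dataset tooling.

```python
import math


def associate(est, gt, max_dt):
    """Pair each estimated pose with the ground-truth pose closest in time.

    est, gt: lists of (t, x, y, z) tuples sorted by t (assumed format).
    Pairs whose timestamps differ by more than max_dt are dropped.
    """
    pairs = []
    if not gt:
        return pairs
    j = 0
    for t, *p in est:
        # Advance j while the next ground-truth sample is at least as close.
        while j + 1 < len(gt) and abs(gt[j + 1][0] - t) <= abs(gt[j][0] - t):
            j += 1
        if abs(gt[j][0] - t) <= max_dt:
            pairs.append((tuple(p), tuple(gt[j][1:])))
    return pairs


def ate_rmse(pairs):
    """Root-mean-square of the Euclidean position error over all pairs.

    Assumption: both trajectories are already in the same frame.
    """
    if not pairs:
        raise ValueError("no associated poses to evaluate")
    squared = [sum((a - b) ** 2 for a, b in zip(pe, pg)) for pe, pg in pairs]
    return math.sqrt(sum(squared) / len(squared))


if __name__ == "__main__":
    estimate = [(0.0, 0.0, 0.0, 0.0), (1.0, 1.0, 0.0, 0.0)]
    truth = [(0.01, 0.0, 0.0, 0.0), (0.5, 9.0, 9.0, 9.0), (1.01, 1.0, 0.0, 2.0)]
    print(ate_rmse(associate(estimate, truth, max_dt=0.02)))
```

The time association is needed because the estimated poses and the ground-truth samples are generally not produced at the same instants or rates.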
Landscape
Before talking about the SLAM landscape, I would like to reiterate (in case it was not clear yet) that most of these conclusions are drawn from a newbie’s perspective and much of it could be wrong.
There is this talk from last year titled Science as Amateur Software Development which, among other things, discusses how academic research sometimes does not apply the same methodical processes that are taken for granted when developing software. It suggests that, more often than not, research projects are thought of as a one-shot burst of work that will not necessarily be used or peer-reviewed by other people in the far future. The talk goes on to explain how this undermines the trust we can put in the science being produced, and it is a problem worth keeping in mind when considering SLAM implementations that come from an academic background.
There are many options to consider when looking for an open-source SLAM implementation. OpenSLAM.org is a page that lists many solutions, although its last commit is from 2018. The TUM VI paper (2018) referenced before compares OKVIS, ROVIO, VINS, and BASALT. This survey (2018) compares around 50 different implementations from 2007 to 2018. The reality is that I ended up searching the Monado Discord server for alternatives that had been mentioned there before. As such, I ended up considering Maplab, OpenVSLAM, Kimera, and ORB-SLAM3. I discarded Maplab as its last significant update is from 2018, and OpenVSLAM seems to have been terminated this last month due to conflicts with ORB-SLAM2. ORB-SLAM3 has the problems of having a GPL license and not having been updated since 2020, but I will still consider it as its paper shows better results than Kimera and I have not found mentions of Kimera surpassing it with its newest updates. Kimera seems quite nice as it is an umbrella of projects that includes not only VI SLAM but also semantic scene understanding and other areas that could be nice to have in an OpenXR runtime; it doesn’t seem to support TUM VI though.
The Oculus Quest (and probably the Quest 2) seems to be using a custom-built system called Insight. As Facebook hired many of the people behind open-source implementations, such as the authors of LSD-SLAM and ORB-SLAM, they have probably helped build Insight, and its performance might be difficult to match with freely available solutions. There is a blog post that came out today that has some interesting quotes in its “Oculus Insight” section. Furthermore, there are some hints in their posts about using machine learning models that have been trained over many play hours to smooth out their SLAM implementation. They also use AI in other features like hand tracking and DeepFocus, which are very cool and show how machine learning can aid the VR ecosystem.
What’s Next
I was able to compile and run both ORB-SLAM3 and Kimera-VIO on TUM VI and EuRoC samples respectively (by the way, for ORB-SLAM3 to compile you will need to copy this one-line pull request that has not been merged since, you guessed it, 2018). The Kimera and ORB-SLAM3 papers will be good references to use when reading their source code. I read some of the entry points of ORB-SLAM3, and it is a relatively big project with around 25 kLOC without dependencies; I imagine Kimera to be no different. I’m also interested in the work already done by ILLIXR, as they already seem to have integrated Monado in some way with both OpenVINS and Kimera.
While there are still many options I didn’t consider, I think that right now it is worth just starting to do something and throwing it away later if needed. I will most likely choose one of ORB-SLAM3 or Kimera and go on with my plan, and hopefully along the way I will gain better insights into the different aspects I glossed over in this post. It may take more than one blog post to have something working, so stay tuned.