Visual SLAM

SURF-LSTM

We propose SURF-LSTM, a low-complexity deep architecture that learns the absolute camera pose (position and orientation) in indoor environments from SURF descriptors using recurrent neural networks. Given the strongest SURF feature descriptors of an input image, we use two layers of bidirectional long short-term memory (LSTM) to model the sequential relations among them and regress the six-degrees-of-freedom absolute pose in an arbitrary reference frame. In addition to achieving competitive performance compared with existing image localization methods, our system can be trained in less than 10 minutes rather than the hours required by the state of the art. It needs as little as 0.0128 MB to store an image frame, versus the 0.08 MB needed by methods that use cropped images, and its weights file occupies 1.5 MB compared to the roughly 100 MB of other methods, yielding significant savings in both time and space.
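As an illustration, below is a minimal PyTorch sketch of this kind of descriptor-to-pose regressor. The module name SurfLSTM, the hidden size, and the split into a 3-D position and 4-D quaternion output are assumptions for illustration, not details taken from the paper (standard SURF descriptors are 64-dimensional).

```python
import torch
import torch.nn as nn

class SurfLSTM(nn.Module):
    """Sketch: regress camera pose from a sequence of SURF descriptors."""
    def __init__(self, desc_dim=64, hidden=128, num_layers=2):
        super().__init__()
        # Treat the N strongest SURF descriptors of an image as a sequence.
        self.lstm = nn.LSTM(desc_dim, hidden, num_layers=num_layers,
                            bidirectional=True, batch_first=True)
        # Regress 3-D position and orientation (here, a 4-D quaternion).
        self.fc_pos = nn.Linear(2 * hidden, 3)
        self.fc_ori = nn.Linear(2 * hidden, 4)

    def forward(self, descriptors):        # descriptors: (batch, N, desc_dim)
        out, _ = self.lstm(descriptors)    # (batch, N, 2 * hidden)
        h = out[:, -1, :]                  # last step, both directions
        return self.fc_pos(h), self.fc_ori(h)
```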

Linear-PoseNet

Neural-network-based camera pose estimation systems rely on fine-tuning very large networks to regress the camera position and orientation, with complex training procedures. In this paper, we explore the following question: do we really need to fine-tune and train such complex networks to reach the desired accuracy? We show that, for single-image indoor localization, comparable or better accuracy can be reached using only one layer of ridge regression on top of pretrained ResNet-50 features, with a training time of less than a second on a CPU instead of the hours of GPU training required by the state of the art. For outdoor scenes, we show that only three fully connected layers on top of pretrained ResNet-50 features, without fine-tuning, perform well compared to the state of the art with only minutes of training. To reduce complexity further, we show that downsampling the pretrained ResNet-50 features by more than 10 times using principal component analysis (PCA) has little effect on performance while saving both training time and storage space.
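The pipeline is simple enough to sketch in a few lines with torchvision and scikit-learn. In the sketch below, the function names, the PCA size of 128 components (a 16x reduction of the 2048-D features), and the 7-D pose target (position plus quaternion) are illustrative assumptions, not the paper's exact settings.

```python
import torch
from torchvision import models
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge

# Frozen ResNet-50 backbone used as a fixed feature extractor (no fine-tuning).
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
backbone.fc = torch.nn.Identity()      # expose the 2048-D pooled features
backbone.eval()

@torch.no_grad()
def extract_features(images):          # images: (N, 3, 224, 224) float tensor
    return backbone(images).numpy()    # (N, 2048) feature matrix

# X_train: (N, 2048) features; y_train: (N, 7) poses [x, y, z, quaternion].
def fit_pose_regressor(X_train, y_train, n_components=128):
    pca = PCA(n_components=n_components).fit(X_train)   # >10x reduction
    reg = Ridge(alpha=1.0).fit(pca.transform(X_train), y_train)
    return pca, reg
```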

SurfCNN

Convolutional neural networks (CNNs) are a powerful tool for many data applications. However, their high-dimensional inputs, large network size, computational complexity, and need for large amounts of training data make them challenging to use in edge computing applications, which are becoming increasingly popular, relevant, and important. In this paper, we propose a descriptor-based approach that accelerates CNN training and reduces input dimension and network size, greatly facilitating the use of CNNs for edge and even cloud computing. By using image descriptors to extract features from the original images, we obtain a simpler convolutional neural network with fast training, low memory usage, and outstanding accuracy, without the need for a pretrained network, as opposed to the state of the art. In indoor localization, our SURF-descriptor-accelerated CNN (SurfCNN) reaches an average position error of 0.28 m and orientation error of 9.2°. Compared to a conventional CNN that uses the original images as input, our algorithm reduces the dimension of the input features by a factor of 48 without impairing accuracy. Further, at an extreme feature reduction of 14,440 times, our model still retains an average position error of 0.41 m and orientation error of 14°.
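A minimal sketch of the descriptor-extraction step is shown below, assuming an OpenCV contrib build with the non-free SURF module enabled; the keypoint count and Hessian threshold are illustrative choices, not the paper's settings.

```python
import cv2
import numpy as np

def strongest_surf_descriptors(image_path, n_keypoints=100):
    """Extract the n strongest SURF descriptors as a compact CNN input."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)
    keypoints, descriptors = surf.detectAndCompute(img, None)
    if descriptors is None:                       # no keypoints detected
        descriptors = np.zeros((0, 64), dtype=np.float32)
        keypoints = []
    # Keep the keypoints with the strongest detector response.
    order = np.argsort([-kp.response for kp in keypoints])[:n_keypoints]
    desc = descriptors[order]                     # (<= n_keypoints, 64)
    # Zero-pad if the image yields fewer keypoints than requested.
    if desc.shape[0] < n_keypoints:
        pad = np.zeros((n_keypoints - desc.shape[0], 64), dtype=desc.dtype)
        desc = np.vstack([desc, pad])
    return desc                                   # (n_keypoints, 64)
```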

SURF-Enhanced 3D CNN

Image-based localization is a key building block of a visual simultaneous localization and mapping (SLAM) system, where image data is used to localize the camera relative to an arbitrary reference frame. Although finding the pose from one image, or between two images, is well studied in the literature, few works address the problem of estimating the poses of multiple images in videos of varying frame lengths. Here, we propose two different architectures to address this problem: one combining a 2D convolutional neural network (CNN) with recurrent neural networks (RNNs), and the other using a 3D CNN. We demonstrate that the 3D CNN suits the pose estimation problem better than the CNN-RNN, both by visualizing the learned features in each layer of the two architectures and by comparing their accuracy. Further, instead of feeding RGB images to the networks, we use SURF descriptors to reduce the 480×640×3 image dimension by more than 48-fold, which makes training much faster and the learned model less complex. Both architectures show competitive performance against the state of the art on an indoor localization dataset, with the ability to generalize to test scenes that are completely different from the training scenes.
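As a sketch of the 3D-CNN variant, the PyTorch module below applies 3D convolutions over a stacked volume of per-frame SURF descriptors and regresses one pose per frame. The layer sizes, the descriptor-volume layout, and the 7-D pose output are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class Surf3DCNN(nn.Module):
    """Sketch: 3D convolutions over a video's stacked SURF descriptor maps."""
    def __init__(self):
        super().__init__()
        # Input: (batch, 1, frames, n_keypoints, desc_dim) descriptor volume.
        self.features = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),                 # pool descriptor dims only
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),      # keep the time axis
        )
        self.head = nn.Linear(32, 7)                 # [x, y, z, quaternion]

    def forward(self, x):                            # x: (B, 1, T, K, D)
        f = self.features(x)                         # (B, 32, T, 1, 1)
        f = f.squeeze(-1).squeeze(-1).transpose(1, 2)  # (B, T, 32)
        return self.head(f)                          # (B, T, 7): pose per frame
```

Because the pooling collapses only the descriptor dimensions and leaves the time axis intact, the same network handles videos of different frame lengths, matching the variable-length setting described above.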

Published Papers

  1. A. Elmoogy, “Efficient image based localization using machine learning techniques,” UVicSpace, Electronic Theses and Dissertations (ETD)
  2. A. Elmoogy, X. Dong, T. Lu, R. Westendorp and K. Reddy, “SURF-LSTM: A Descriptor Enhanced Recurrent Neural Network for Indoor Localization,” IEEE Vehicular Technology Conference (VTC), 2020
  3. A. Elmoogy, X. Dong, T. Lu, R. Westendorp and K. Reddy, “Linear-PoseNet: A Real-Time Camera Pose Estimation System Using Linear Regression and Principal Component Analysis,” IEEE Vehicular Technology Conference (VTC), 2020
  4. A. Elmoogy, X. Dong, T. Lu, R. Westendorp and K. Reddy, “Generalizable Sequential Camera Pose Learning Using SURF-Enhanced 3D CNN,” IEEE Vehicular Technology Conference (VTC), 2020