Recently the FAIR team at Facebook extended its series of image classification and segmentation algorithms (R-CNN, Fast R-CNN, Faster R-CNN) with a new and exciting addition – Mask R-CNN. The results are extraordinary: from a plain image you can extract the position of each object in the image and also its contour – see below:
Now that the official implementation (Detectron) has finally been released, we decided to spend some time evaluating it.
As its name implies, Mask R-CNN extends the previous algorithm, Faster R-CNN, by adding a new branch that predicts the object mask in parallel with the existing branch that predicts the bounding box. “Mask R-CNN outperforms all existing algorithms on segmentation, bounding-box object detection and person keypoint detection on the well-established COCO image dataset”.
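The “parallel branch” idea can be sketched as two small heads sharing the same RoI-aligned feature map. The snippet below is a shapes-only illustration in NumPy (the layer internals are stubbed out – it is not the real architecture, just the wiring the paper describes):

```python
import numpy as np

# One RoI-aligned feature map per detected region (toy shape: 7x7x256).
roi_features = np.random.rand(7, 7, 256)

def box_head(feat, num_classes=81):
    # Existing Faster R-CNN branch: class scores + box deltas per class.
    # (Stand-ins for the fully connected layers.)
    scores = np.zeros(num_classes)
    deltas = np.zeros((num_classes, 4))
    return scores, deltas

def mask_head(feat, num_classes=81, mask_size=28):
    # New Mask R-CNN branch: one binary mask per class, predicted
    # in parallel with (not after) the box branch.
    return np.zeros((mask_size, mask_size, num_classes))

# Both heads consume the same shared features independently.
scores, deltas = box_head(roi_features)
masks = mask_head(roi_features)
print(masks.shape)  # (28, 28, 81)
```

The key design point is that the mask branch adds only a small overhead on top of Faster R-CNN, since it reuses the shared backbone features.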
There are quite a few implementations available on GitHub because, as soon as the paper describing the model appeared on arXiv, the research community (me included 🙂 ) tried to reproduce its results.
In this post, I will briefly present the existing implementations and compare their strengths and weaknesses.
1. Facebook’s official implementation, Detectron, finally appeared in February 2018. It’s freely available at https://github.com/facebookresearch/Detectron and it contains Mask R-CNN as well as implementations of RetinaNet, Faster R-CNN and RPN.
The code is written in Python 2 on the Caffe2 deep learning framework. The backbone architectures include ResNet-50/101 and VGG16.
The implementation is extremely clean and scholarly written (I was able to read the code as easily as the paper), and you can easily see that the engineers at FAIR took care to make this project easy to extend and maintain.
2. The Matterport implementation appeared around the end of 2017 (Oct–Dec). It’s freely available at https://github.com/matterport/Mask_RCNN and it’s surprisingly complete and easy to use. It is also written in Python (version 3 this time), on Keras and TensorFlow. Its installation instructions are clear and the usage is straightforward. This implementation seems to be maintained mainly by employees of Matterport (a 3D imaging company) and by Phil Ferriere, an ex-Microsoft AI engineer.
3. MX Mask R-CNN, available at https://github.com/TuSimple/mx-maskrcnn, is a Python 2 implementation of Mask R-CNN on the Apache MXNet framework. Also released at the end of October 2017, it was one of the first implementations publicly available.
Since they implement the same model, the accuracies of the three implementations are rather similar (to an outsider, at least – I imagine that for those involved in developing them, every 0.5% is a game changer). So I’ll judge them here based only on ease of installation and on execution time.
I tested them on full-HD images (1920×1080 px), which is the standard TV format right now, and also on a set of random images from the COCO database (roughly 640×480).
Below, you can see the results of the different implementations (FB Detectron on the left, Matterport on the right). While they are not identical, the main features are reproduced correctly by all implementations, and from my point of view (as a user) they are equivalent.
Here are the performance benchmark results on Linux (and Windows where available):
| Implementation | Python Version | DL Framework | Operating System | Backbone | Time per image (1920×1080) | Time per image (640×480) |
|---|---|---|---|---|---|---|
| Facebook Detectron | 2.7 | Caffe2 | Linux (Ubuntu 16.04) | ResNet-101/COCO | 340 ms | 116 ms |
| Matterport | 3.5 | TensorFlow 1.4 | Linux (Ubuntu 16.04) | ResNet-101/COCO | 290 ms | 190 ms |
| Matterport | 3.5 | TensorFlow 1.4 | Windows 10 | ResNet-101/COCO | 620 ms | 490 ms |
| MXNet Mask R-CNN | 2.7 | MXNet | Linux | ResNet-50/COCO | Not yet tested | N/A |
All tests were run on a system with two GTX 1080 Ti GPUs (since most of the computation happens on the GPU, the CPU was largely irrelevant: a 6-core i7 and a 16-core Threadripper gave the same results).
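For readers who want to reproduce this kind of per-image timing, here is a minimal sketch of the methodology: average over many passes and discard a few warm-up runs, since the first GPU pass typically includes framework and CUDA initialization. The `detect` argument is a placeholder for any implementation’s inference call (not an API from these repositories):

```python
import time

def benchmark(detect, images, warmup=2, runs=10):
    """Average per-image inference time in seconds, excluding warm-up.

    `detect` is a stand-in for one implementation's inference function.
    """
    for _ in range(warmup):
        detect(images[0])  # first passes include one-time GPU/framework init
    start = time.perf_counter()
    for _ in range(runs):
        for img in images:
            detect(img)
    elapsed = time.perf_counter() - start
    return elapsed / (runs * len(images))

# Example with a dummy "detector" that just sleeps ~5 ms per image:
dummy = lambda img: time.sleep(0.005)
t = benchmark(dummy, images=[None] * 4, warmup=1, runs=3)
print(f"{t * 1000:.0f} ms/image")
```

Averaging over several runs also smooths out jitter from the OS scheduler and GPU clock scaling.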
It’s somewhat interesting to note that the Matterport implementation is faster out of the box than the Facebook implementation on higher-resolution images (kudos to the implementing team). This may be due to a better-optimized framework (apparently TensorFlow has tended to outperform Caffe2 lately, although in May 2017 they were competing neck and neck). It might also be because the Matterport implementation scales all images to 1024×1024 (I’m not sure about that). Interestingly, the Windows build of TensorFlow is slower (I have seen this a few times before with my own models).
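If the 1024×1024 hypothesis is right, the scaling would work roughly as follows: shrink (or grow) the image preserving its aspect ratio so that it fits inside a 1024×1024 square, then pad it out to the full square. This is a sketch of that idea, not Matterport’s actual code; the `min_dim`/`max_dim` defaults are assumptions on my part:

```python
def square_resize_dims(h, w, min_dim=800, max_dim=1024):
    """Compute aspect-preserving resize plus square padding dimensions.

    A sketch of a 'resize to square' policy; min_dim/max_dim values
    are assumed defaults, not taken from any repository's config.
    """
    scale = max(1.0, min_dim / min(h, w))   # upscale small images...
    if round(max(h, w) * scale) > max_dim:  # ...but never exceed max_dim
        scale = max_dim / max(h, w)
    new_h, new_w = round(h * scale), round(w * scale)
    return (new_h, new_w), (max_dim, max_dim)  # resized dims, padded dims

# A full-HD frame would be downscaled to 576x1024, then padded to 1024x1024:
print(square_resize_dims(1080, 1920))  # → ((576, 1024), (1024, 1024))
```

If something like this is happening, a 1920×1080 input is effectively processed at about half its original resolution, which would go a long way toward explaining the speed advantage on full-HD images.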
From the point of view of an engineer who just wants to use these implementations in a real-life product, the question is mostly about how easy it will be to get these solutions running.
If your development is on Linux (let’s say a web app with a Linux back-end), both the FB and MP implementations are great. I would give extra points to MP because I found TensorFlow really easy to install (pip3 install tensorflow-gpu), while Caffe2 had to be compiled from source. FB gets extra points for speed at lower resolutions, and MP for high-res images.
Finally, if you are developing a Windows application, you don’t really have much choice. At this point the Caffe2 Windows installation for GPU fails admirably, and Python 2 is less and less supported, so your best bet is to use the MP TensorFlow implementation (binaries are available for Windows) and one of the TF front-ends in your favorite programming language (Python, C/C++, C#, and Java ports are available).
In the end, I would like to congratulate the authors of this excellent segmentation algorithm (Kaiming He, Georgia Gkioxari, Piotr Dollár, Ross Girshick), and also those who made these excellent implementations available. I am really thrilled about how far we have managed to get in the last few years in Computer Vision, and I can’t wait to see what new surprises the future brings.