Mask R-CNN – review and benchmark of available implementations

Recently the FAIR team at Facebook extended their series of object detection and segmentation algorithms (R-CNN, Faster R-CNN) with a new and exciting addition: Mask R-CNN. The results are extraordinary: from a plain image you can extract the position of each object and also its contour (see below):

Instance Segmentation Sample

Now that the official implementation (Detectron) has finally been released, we decided to spend some time evaluating it.

As its name implies, Mask R-CNN extends the earlier Faster R-CNN algorithm by adding a new branch that predicts the object mask, in parallel with the existing branch that predicts the bounding box. “MaskRCNN outperforms all existing algorithms on segmentation, bounding-box object detection and person keypoint detection on the well-established COCO image dataset”.
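In the paper, this parallel design shows up directly in the training objective: the loss on each sampled region of interest is the sum of a classification term, a box-regression term and a mask term. A short restatement in the paper's notation:

```latex
L = L_{cls} + L_{box} + L_{mask}
```

Here L_cls and L_box are the classification and bounding-box losses inherited from the earlier R-CNN detectors. The mask branch outputs one m×m binary mask per class, and L_mask is the average per-pixel binary cross-entropy, evaluated only on the mask of the ground-truth class.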

There are quite a few implementations available on GitHub because, as soon as the paper describing the model appeared on arXiv, the research community (me included 🙂) tried to reproduce its results.

In this post, I will briefly present the existing implementations and compare their strengths and weaknesses.

1. Facebook's official implementation, Detectron, finally appeared in early 2018. It is freely available and contains Mask R-CNN along with implementations of RetinaNet, Faster R-CNN and RPN.

The code is written in Python 2 using the Caffe2 deep learning framework. The backbone architectures are ResNet50/101 and VGG16.

The implementation is extremely clean and written in a scholarly style (I was able to read the code as easily as the paper), and it is clear that the engineers at FAIR took care to make this project easy to extend and maintain.

2. The Matterport implementation appeared around the end of 2017 (October to December). It is freely available, surprisingly complete and easy to use. It is also written in Python (version 3 this time), but on top of Keras and TensorFlow. Its installation instructions are clear and its usage is straightforward. This implementation seems to be maintained mainly by employees of Matterport (a 3D imaging company) and by Phil Ferriere, who previously worked in AI at Microsoft.

3. MX Mask R-CNN is a freely available Python 2 implementation of Mask R-CNN on the Apache MXNet framework. Released at the end of October 2017, it was one of the first publicly available implementations.

Since they implement the same model, the accuracies of the three implementations are rather similar (at least to an outsider; I imagine that for those involved in developing them every 0.5% is a game changer). So I’ll judge them here based only on ease of installation and execution time.

I tested them on full HD images (1920×1080 px), the standard TV format right now, and on a set of random images from the COCO dataset (roughly 640×480 px).

Below are the results of the different implementations (Detectron on the left, Matterport on the right). While they are not identical, the main features are reproduced correctly by all implementations, and from my point of view (as a user) they are equivalent.

Here are the benchmark results on Linux (and on Windows where available):

| Implementation | Python version | Framework | Operating system | Backbone / dataset | Time per image (1920×1080) | Time per image (640×480) |
| --- | --- | --- | --- | --- | --- | --- |
| Facebook Detectron | 2.7 | Caffe2 | Linux (Ubuntu 16.04) | ResNet101 / COCO | 340 ms | 116 ms |
| Matterport | 3.5 | TensorFlow 1.4 | Linux (Ubuntu 16.04) | ResNet101 / COCO | 290 ms | 190 ms |
| Matterport | 3.5 | TensorFlow 1.4 | Windows 10 | ResNet101 / COCO | 620 ms | 490 ms |
| MXNet Mask R-CNN | 2.7 | MXNet | Linux | ResNet50 / COCO | Not yet there | N/A |

All tests were run on a system with two GTX 1080 Ti GPUs. As most of the computation happens on the GPU, the CPU was mostly irrelevant: the results were the same on a 6-core i7 and on a 16-core Threadripper.
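For readers who want to repeat this kind of measurement, here is a minimal sketch of a per-image timing harness. It is not the script used for the numbers above. The `predict` callable is a hypothetical wrapper around whichever implementation is being measured, and it is assumed to block until the detections are back on the host (otherwise asynchronous GPU execution would make the timings too optimistic).

```python
import time
from statistics import mean


def time_per_image(predict, images, warmup=2, repeats=5):
    """Return the mean wall-clock time per image, in milliseconds.

    `predict` is a hypothetical callable: it takes one image and returns
    its detections, blocking until the result is available.
    """
    # Warm-up calls are not timed: the first runs usually pay for model
    # loading, GPU memory allocation and similar one-off costs.
    for image in images[:warmup]:
        predict(image)

    timings_ms = []
    for _ in range(repeats):
        for image in images:
            start = time.perf_counter()
            predict(image)
            timings_ms.append((time.perf_counter() - start) * 1000.0)
    return mean(timings_ms)


if __name__ == "__main__":
    # Stand-in model so the sketch runs without a GPU or any framework.
    def fake_predict(image):
        time.sleep(0.01)
        return []

    result = time_per_image(fake_predict, ["a.jpg", "b.jpg"])
    print(f"{result:.0f} ms per image")
```

Averaging over several passes and discarding the warm-up calls keeps one-off start-up costs out of the reported figure.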

It’s somewhat interesting that the Matterport implementation is faster out of the box than the Facebook implementation on higher-resolution images (kudos to the implementing team). This could be due to a better-optimized framework (apparently TensorFlow tends to outperform Caffe2 lately, although in May 2017 they were neck and neck). It might also be that the Matterport implementation scales all images to 1024×1024 (I am not sure about that). Interestingly, the Windows build of TensorFlow is slower (I have seen this a few times before with my own models).
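To see why such rescaling would narrow the gap between the two test resolutions, here is a small sketch. It assumes an aspect-preserving resize to a 1024-pixel longer side, followed by padding to a square. The function name and the exact rule are illustrative assumptions, not code taken from any of the implementations.

```python
def fit_to_square(width, height, target=1024):
    """Scale (width, height) so the longer side equals `target`, then pad.

    Returns (new_width, new_height, pad_x, pad_y), where pad_x and pad_y
    are the total number of padding pixels added along each axis to reach
    a target x target square. Illustrative assumption, not library code.
    """
    scale = target / max(width, height)
    new_width = round(width * scale)
    new_height = round(height * scale)
    return new_width, new_height, target - new_width, target - new_height


for w, h in [(1920, 1080), (640, 480)]:
    print((w, h), "->", fit_to_square(w, h))
# (1920, 1080) -> (1024, 576, 0, 448)
# (640, 480) -> (1024, 768, 0, 256)
```

Under this assumption, a full HD frame is downscaled and a COCO-sized image is upscaled, so the network processes a 1024×1024 input in both cases. That would be consistent with the smaller spread between the two Matterport timings.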

From the point of view of an engineer who just wants to use these implementations in a real-life product, the question is mostly how easy it will be to get these solutions running.

If you develop on Linux (say, a web app with a Linux back end), both the Facebook and the Matterport implementations are great. I would give extra points to Matterport because I found TensorFlow really easy to install (pip3 install tensorflow-gpu), while Caffe2 needed to be compiled from source. Facebook gets extra points for speed at lower resolutions, and Matterport for speed on high-resolution images.

Finally, if you are developing a Windows application, you don’t have much choice. At this point the Caffe2 GPU installation on Windows fails admirably and Python 2 is less and less supported, so your best bet is the Matterport TensorFlow implementation (binaries are available on Windows), used through one of the TensorFlow front ends in your favorite programming language (Python, C/C++, C# and Java ports are available).

In the end, I would like to congratulate the authors of this excellent segmentation algorithm (Kaiming He, Georgia Gkioxari, Piotr Dollár and Ross Girshick), as well as those who made these excellent implementations available. I am really thrilled about how far we have come in Computer Vision in the last few years, and I can’t wait to see what new surprises the future brings.






