Outline


We have built an app that people can use for their instance segmentation tasks. Users upload their own images and receive an annotated image back. For each object detected in the image, the annotation includes a bounding box that surrounds the object, a class label (the class that the object belongs to), and a binary mask that segments the object from the background. We offer multiple modes so that users can specify the type of objects they are interested in. Furthermore, we have investigated a graph-based method known as GrabCut that allows users to iteratively refine the segmentation results, and we are actively working on adding this extra functionality to the web app.

Motivation


The task of instance segmentation is extremely important for critical computer vision applications such as autonomous driving. For instance, if an autonomous vehicle relies solely on object detection, the bounding boxes of multiple cars may overlap and confuse the vehicle. Instance segmentation helps overcome this flaw. The ability to detect the spatial boundaries of objects down to pixel-level detail, instead of broadly sensing their location, could mean the difference between the vehicle safely navigating its way through traffic and the vehicle striking other cars or objects. In a world of high-velocity traffic and unpredictability, the smallest details can have some of the most serious consequences. Instance segmentation will continue to play a crucial role in computer vision long into the future. Autonomous vehicles, medical imaging, facial recognition, robotic procedures: all of these fields rely on accurately differentiating object instances, and we are fascinated by the long-term potential.

Background


Instance Segmentation

Instance segmentation is a combination of object detection (classify individual objects and localize them with bounding boxes) and semantic segmentation (classify each pixel into one of the given classes). It involves detecting and segmenting the individual instances of objects in an image. In this project, we used both a deep-learning-based algorithm (Mask R-CNN) and a graph-based algorithm that incorporates user input (GrabCut) to perform the instance segmentation task.


Figure 1. A Brief Explanation of Instance Segmentation. (Image Source: Halbe, 2020)

Methods


Mask R-CNN

Mask R-CNN (He et al., 2017) is a state-of-the-art instance segmentation deep learning model. It extends Faster R-CNN, a popular object detection network, to perform pixel-level image segmentation. An extra branch is added on top of the Faster R-CNN architecture to predict a binary mask for each detected instance. This mask branch runs in parallel with the existing branches for classification (object class prediction) and localization (bounding box prediction). The mask branch is a small fully convolutional network applied to each region of interest (RoI) to predict a segmentation mask on a per-pixel basis. Decoupling the classification task from the pixel-level mask prediction task in this way results in better overall performance.

The "backbone" of the underlying CNN model can be adjusted. To optimize the app's interactivity, we chose to use the CNN backbone model with the shortest inference time (i.e., ResNet with 50 layers followed by a Feature Pyramid Network, or, R50-FPN).


Figure 2. The architecture of Mask R-CNN. Mask R-CNN extends the Faster R-CNN model to perform instance segmentation. (Image source: He et al., 2017)

Performance Metrics

Average Precision (AP)

  • AP: AP averaged across 10 IoU thresholds (0.50, 0.55, ..., 0.95)
  • AP@IoU=0.50: AP at IoU threshold 0.50
  • AP@IoU=0.75: AP at IoU threshold 0.75

AP for Objects with Different Sizes

  • AP_small: AP for small objects (object area < 32² pixels)
  • AP_medium: AP for medium objects (32² pixels < object area < 96² pixels)
  • AP_large: AP for large objects (object area > 96² pixels)

Note

  • AP is averaged across all categories within a dataset.
  • The definition of intersection over union (IoU): For the segmentation mask, IoU is the overlap between the predicted mask and the ground-truth mask divided by the union of the two masks (both measured in number of pixels). For the bounding box, IoU is the overlap between the predicted box and the ground-truth box divided by the union of the two boxes (both measured in area).
  • The evaluation metrics for the prediction of the bounding box and the segmentation mask are identical in all respects except the IoU computation.

For more details about the performance metrics, please see the COCO detection evaluation description.
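
To make the IoU definitions above concrete, here is a small illustrative sketch of the two computations (mask IoU in pixels, box IoU in area); this is not the evaluation code we used, just the formulas written out in Python.

```python
import numpy as np

def mask_iou(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    """IoU of two binary masks, measured in number of pixels."""
    intersection = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return float(intersection) / float(union) if union > 0 else 0.0

def box_iou(pred_box, gt_box) -> float:
    """IoU of two boxes given as (x1, y1, x2, y2), measured in area."""
    x1, y1 = max(pred_box[0], gt_box[0]), max(pred_box[1], gt_box[1])
    x2, y2 = min(pred_box[2], gt_box[2]), min(pred_box[3], gt_box[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    pred_area = (pred_box[2] - pred_box[0]) * (pred_box[3] - pred_box[1])
    gt_area = (gt_box[2] - gt_box[0]) * (gt_box[3] - gt_box[1])
    union = pred_area + gt_area - inter
    return inter / union if union > 0 else 0.0
```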

GrabCut

To allow users to iteratively refine the binary mask of a segmented instance, we used the GrabCut algorithm. GrabCut is a graph-based algorithm that incorporates hints given by the user to iteratively refine the instance segmentation result. GrabCut works on one instance at a time and treats the instance of interest as foreground and everything else as background.

The figure below shows an example of how GrabCut works. First, the user provides a bounding box surrounding the instance of interest. The region outside the box is treated as "sure background" and never switches to foreground during the estimation process. The region within the box is treated as "probable foreground", which can be estimated as either background or foreground during the process. In addition, the user gives "hints" by using a brush to highlight regions that are "sure background" and regions that are "sure foreground". After the user submits the hints, GrabCut re-estimates the foreground and background pixels. This give-hints-and-update-result cycle completes one iteration of GrabCut. Usually the result after a single iteration is not ideal; in that case, the user can give more hints and start another iteration. The mask is refined iteratively until the user is satisfied with the result.
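
The workflow above maps onto OpenCV's grabCut function roughly as in the sketch below; the image path, bounding box, and hint coordinates are placeholders standing in for the user's box and brush strokes.

```python
# A minimal GrabCut sketch with OpenCV. The file name, rectangle, and the
# hint coordinates are placeholders standing in for the user's bounding box
# and brush strokes.
import cv2
import numpy as np

image = cv2.imread("input.jpg")
mask = np.zeros(image.shape[:2], dtype=np.uint8)
bgd_model = np.zeros((1, 65), dtype=np.float64)  # GrabCut's internal background GMM state
fgd_model = np.zeros((1, 65), dtype=np.float64)  # GrabCut's internal foreground GMM state

# Iteration 1: initialize with a bounding box around the instance of interest.
rect = (50, 50, 300, 400)  # (x, y, width, height), hypothetical user box
cv2.grabCut(image, mask, rect, bgd_model, fgd_model, 5, cv2.GC_INIT_WITH_RECT)

# Iteration 2: incorporate brush hints marking sure foreground / sure background.
mask[200:220, 120:260] = cv2.GC_FGD  # pixels the user marked as surely the instance
mask[60:80, 60:340] = cv2.GC_BGD     # pixels the user marked as surely background
cv2.grabCut(image, mask, None, bgd_model, fgd_model, 5, cv2.GC_INIT_WITH_MASK)

# Pixels labeled (probable) foreground form the refined binary mask.
refined = np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 1, 0).astype("uint8")
segmented = image * refined[:, :, None]
cv2.imwrite("segmented.png", segmented)
```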


Figure 3. An example of how the GrabCut algorithm works (Image Source: Rother et al., 2004)

Under the hood (see Figure 4), GrabCut represents an image as a graph. Each node represents a pixel, and each edge encodes the similarity between two pixels (e.g., similarity in color). A Gaussian Mixture Model (GMM) is then used to separately model the foreground pixels (the instance of interest) and the background pixels, and the foreground and background are identified as two distinct connected subgraphs. After the user gives hints for the foreground and the background, the GMM parameters are updated accordingly, which in turn updates the segmentation result. This can take several rounds (give hints, update the result) until the user is satisfied with the result.


Figure 4. How the GrabCut algorithm works under the hood

Implementation Details


Packages and the Environment

We implemented the Mask R-CNN model using the Python package Detectron2 (Wu et al., 2019), developed by Facebook AI Research (FAIR). As for GrabCut, we based our app on the Python package opencv-python (Bradski, 2000).

Although we wanted to host this app publicly on Amazon Web Services, we could not afford the hosting cost. Instead, we have included instructions on how to run the app locally in the GitHub repo (https://github.com/VarunSreenivasan16/CS766-Project), and we will keep looking for more cost-effective ways to host the app publicly.

Mask R-CNN: Domain-General and Domain-Specific Modes

To allow users to annotate images in different domains, we include six instance segmentation modes: one domain-general mode and five domain-specific modes. The details of each mode are described below. For each of the six modes, we either used a pre-trained model or fine-tuned a model ourselves. For the larger datasets (i.e., the generic, marine, and city modes) we used pre-trained models, and we only fine-tuned models when the dataset was small (i.e., the nature, balloon, and micro-controller modes). Fine-tuning was done on Google Colaboratory (please see the links to the notebooks); a minimal fine-tuning sketch is also included after the mode details below.

Overview of the Six Modes
  • Generic - This mode provides the largest detection coverage for common objects (1203 common objects).

  • Marine - Specifically for detecting marine animals such as fish, crabs, and starfish, and marine waste such as bottles, nets, and wreckage.

  • City - Specifically for detecting objects common in urban street scenes.

  • Nature - Specifically for detecting squirrels and butterflies.

  • Balloon - Specifically for detecting balloons.

  • Micro-Controller - Specifically for detecting common micro-controllers (e.g., Arduinos).

Details of the Six Modes
  1. Generic Mode (pre-trained)

    Dataset: Large Vocabulary Instance Segmentation (LVIS; Gupta et al., 2019)

    A large dataset that contains more than 2 million high quality instance segmentation masks for over 1000 entry-level object categories in 164k images.

    The pre-trained LVIS model provided by Detectron2 is used for this mode.

    Metrics for Bounding Box

    AP       AP@IoU=0.50   AP@IoU=0.75   AP_small   AP_medium   AP_large
    23.632   39.284        24.502        18.723     28.728      37.372

    Metrics for Segmentation Mask

    AP       AP@IoU=0.50   AP@IoU=0.75   AP_small   AP_medium   AP_large
    24.383   37.699        26.004        16.690     31.206      41.163

    Figure 5a. Inference with pre-trained LVIS model on custom image

  2. Marine Mode (pre-trained)

    Dataset: TrashCan (Hong et al., 2020)

    A dataset comprising 7212 images with observations of marine animals, trash, ROVs, and multiple other forms of underwater flora and fauna. We used the instance version of the dataset, which contains 22 different classes of objects.

    The pre-trained model available alongside the dataset is used for this mode.

    Metrics for Bounding Box

    AP       AP@IoU=0.50   AP@IoU=0.75   AP_small   AP_medium   AP_large
    34.5     55.4          38.1          27.6       36.2        51.4

    Metrics for Segmentation Mask

    AP       AP@IoU=0.50   AP@IoU=0.75   AP_small   AP_medium   AP_large
    30.0     55.3          29.4          23.2       31.7        48.6

    Figure 5b. Inference with the pre-trained model on the TrashCan test set

  3. City Mode (pre-trained)

    Dataset: Cityscapes (Cordts et al., 2016)

    A large-scale dataset comprising a diverse set of street-scene image frames from 50 different cities. There are 5000 images with fine annotations and 20000 with coarse annotations.

    The pre-trained Cityscapes model provided by Detectron2 is used. (Note: the bounding box metrics for this pre-trained model are not provided; only the segmentation mask metrics are available.)

    Metrics for Segmentation Mask

    AP       AP@IoU=0.50
    36.481   62.219

    Figure 5c. Inference with the pre-trained Cityscapes model on a custom city image

  4. Nature Mode (fine-tuned)

    Dataset: Nature Dataset (Olafenwa, 2020)

    This is a dataset containing around 800 images of squirrels and butterflies.

    The Mask R-CNN model was obtained by fine-tuning the pre-trained COCO model on the training set (see our fine-tuning code). We chose the training parameters based on the demo provided alongside the dataset. The model was then evaluated on the test set.

    Training Parameters

    • images per batch = 2
    • initial learning rate = 0.00025
    • number of iterations = 1000

    Metrics for Bounding Box

    AP       AP@IoU=0.50   AP@IoU=0.75   AP_small   AP_medium   AP_large
    63.750   96.952        75.514        nan        nan         64.126

    Metrics for Segmentation Mask

    AP       AP@IoU=0.50   AP@IoU=0.75   AP_small   AP_medium   AP_large
    73.770   96.442        84.619        nan        nan         76.358

    Figure 5d. Inference with fine-tuned model on Nature test set

  5. Balloon Mode (fine-tuned)

    Dataset: Matterport's Balloon Dataset (Waleed, 2018)

    This is a small dataset provided by Matterport containing balloon instances.

    The Mask R-CNN model was obtained by fine-tuning the pre-trained COCO model on the training set (see our fine-tuning code). We chose the training parameters based on this demo. The model was then evaluated on the test set.

    Training Parameters

    • images per batch = 2
    • initial learning rate = 0.00025
    • number of iterations = 300

    Metrics for Bounding Box

    AP       AP@IoU=0.50   AP@IoU=0.75   AP_small   AP_medium   AP_large
    65.014   84.353        79.896        9.398      54.380      77.943

    Metrics for Segmentation Mask

    AP       AP@IoU=0.50   AP@IoU=0.75   AP_small   AP_medium   AP_large
    75.856   83.622        83.365        7.249      57.591      92.700

    Figure 5e. Inference with fine-tuned model on Balloon test set

  6. Micro-Controller Mode (fine-tuned)

    Dataset: Micro-Controller Segmentation Data (Tanner)

    A small dataset containing annotations for instances of Arduino, Raspberry Pi, Lora, and ESP8266.

    The Mask R-CNN model was obtained by fine-tuning the pre-trained COCO model on the training set (see our fine-tuning code). We chose the training parameters based on the demo provided alongside the dataset. The model was then evaluated on the test set.

    Training Parameters

    • images per batch = 2
    • initial learning rate = 0.00025
    • number of iterations = 1000

    Metrics for Bounding Box

    AP       AP@IoU=0.50   AP@IoU=0.75   AP_small   AP_medium   AP_large
    76.320   91.749        91.749        nan        90.000      80.033

    Metrics for Segmentation Mask

    AP       AP@IoU=0.50   AP@IoU=0.75   AP_small   AP_medium   AP_large
    88.020   91.749        91.749        nan        90.000      96.894

    Figure 5f. Inference with fine-tuned model on Micro-Controller test set
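
As referenced earlier, the following is a minimal sketch of the kind of Detectron2 fine-tuning setup behind the fine-tuned modes, using the training parameters listed with each mode; the dataset name, annotation paths, and class count are hypothetical placeholders rather than the exact contents of our notebooks.

```python
# A minimal Detectron2 fine-tuning sketch using the training parameters listed
# above. The dataset name, annotation paths, and class count are hypothetical
# placeholders, not the exact setup of our notebooks.
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.data.datasets import register_coco_instances
from detectron2.engine import DefaultTrainer

# Register a small custom dataset in COCO format (paths are assumptions).
register_coco_instances("custom_train", {}, "annotations/train.json", "images/train")

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")  # start from the COCO model
cfg.DATASETS.TRAIN = ("custom_train",)
cfg.DATASETS.TEST = ()
cfg.SOLVER.IMS_PER_BATCH = 2         # images per batch
cfg.SOLVER.BASE_LR = 0.00025         # initial learning rate
cfg.SOLVER.MAX_ITER = 1000           # number of iterations (300 for the balloon mode)
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 2  # e.g., squirrel and butterfly in the nature mode

trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()
```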

Web-App & Demo


Please access the code and instructions for use in our GitHub repo.

One of the main deliverables for our project is an interactive web application that can be used to perform instance segmentation on a user's personal images. The GitHub repository containing this code is linked above. The app runs inside a Docker container to make dependency management easier, and is built with a Python Flask backend and a dynamic HTML frontend (HTML, CSS, JavaScript, plus libraries such as jQuery and Bootstrap). We are searching for cost-effective ways to host this app publicly.

At the moment, the user can use the app to perform instance segmentation on their personal images using one of our six available modes. The user receives visual feedback and can save the image after processing. For users more interested in processing a large quantity of images than one at a time, we also offer batch processing functionality: they can upload a directory of images, have them all processed, and download a zip file with the processed images once complete.

A few features are currently in progress and not yet complete. GrabCut functionality in the app is one such feature: if the user is not satisfied with how an object of interest was segmented, they can use GrabCut to give hints about that object, and the model will refine that particular mask. The controls for this user interaction are already implemented in the app; however, the backend processing is still incomplete, so this feature is currently disabled. Another piece of functionality we are working on is allowing the user to upload their own custom dataset and train a model on specific object types for more nuanced inference. This would allow extensive customizability for single-image instance segmentation.
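
For a sense of how the pieces fit together, here is a minimal sketch of the kind of Flask endpoint the app is built around; the route name, form fields, and the run_instance_segmentation helper are illustrative assumptions, not the app's actual code.

```python
# A minimal sketch of the kind of Flask endpoint behind the app. The route,
# form fields, and the run_instance_segmentation helper are assumptions made
# for illustration, not the app's actual code.
import io

import cv2
import numpy as np
from flask import Flask, request, send_file

app = Flask(__name__)

def run_instance_segmentation(image, mode):
    # Placeholder: the real app would call the Detectron2 predictor for the
    # selected mode and overlay boxes, labels, and masks on the image.
    return image

@app.route("/segment", methods=["POST"])
def segment():
    # Decode the uploaded image from the multipart form field "image".
    data = np.frombuffer(request.files["image"].read(), dtype=np.uint8)
    image = cv2.imdecode(data, cv2.IMREAD_COLOR)

    annotated = run_instance_segmentation(image, mode=request.form.get("mode", "generic"))

    ok, encoded = cv2.imencode(".png", annotated)
    return send_file(io.BytesIO(encoded.tobytes()), mimetype="image/png")

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```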

Instance Segmentation Demo:

GrabCut: Refine the segmentation mask

As shown in Figure 6, Mask R-CNN first infers the segmentation mask for the bus. This predicted mask is not perfect: it misses the mirror and falsely includes parts of the car in front of the bus. The user then provides some hints: the white strokes mark regions that are surely part of the instance, and the black strokes mark regions that are surely background. After running GrabCut, the segmentation mask is improved.


Figure 6. Demo of the app's GrabCut function.

Challenges We Faced and Solved


  • Redirecting our project

    We had to rapidly change the direction of our project after receiving feedback on our initial proposal. Our original plan was to implement Mask R-CNN as in the original paper, replicate the results on the COCO dataset, and then apply the model to autonomous-vehicle datasets such as the Cityscapes Dataset and the Indian Driving Dataset. We also intended to spend time enhancing model performance through modifications such as tweaking the model architecture (e.g., the R-CNN backbones) and exploring different training techniques (e.g., multi-scale train/test, horizontal flip test). We realized that this plan was impractical given the time and computational resources available, so we changed the direction and emphasis of our project to deliver an end result that is both tangible and achievable.

  • Platform Independence of App

    We initially planned to develop an app that users could install and run locally. However, the core instance segmentation package, Detectron2, is natively supported only on Linux and macOS. We spent a lot of time exploring workarounds to make it work on Windows, but in vain. We then explored hosting the app publicly with a service such as Amazon Web Services; however, we cannot afford the cost of the resources needed to continually host the app. The solution we landed on is to have the user host the app locally for the time being; everything is bundled into a Docker container so that platform independence is addressed and dependency management stays simple. All the user needs is Docker installed on their machine; they can then build the image from the provided Dockerfile by following the installation instructions in the README of the Web-App directory.

  • Computational Resources

    Initially, we planned to train all the models ourselves on large datasets. However, that turned out to be impractical given the computational resources we had (detailed below). Therefore, for the larger datasets (i.e., the generic, marine, and city modes) we used pre-trained models, and we only fine-tuned models on the smaller datasets (i.e., the nature, balloon, and micro-controller modes).

    1. To use Google Colaboratory, the dataset must be uploaded to Google Drive so that Colab can read it. This was a significant bottleneck because uploading large datasets to Google Drive is slow: with a connection offering 100 Mbps download and 40 Mbps upload, uploading 1 GB of files to Google Drive took us 5 hours. Although we were able to find an existing COCO dataset on a publicly accessible Google Drive, other datasets we were considering (e.g., Open Images Dataset, Indian Driving Dataset) were not publicly available on Google Drive.
    2. The second problem was related to the limitations of the Colab environment. Even when we were able to mount the public Google Drive containing the COCO dataset, Colab crashed while loading the large-scale dataset (the 2017 COCO training set is 19 GB). After days of research and trial and error, we eventually managed a workaround using symbolic links, which keeps Colab from being inundated by the images (a minimal sketch of this workaround is shown after this list). This, however, requires the dataset to be present on Google Drive in a specific format; we were only able to configure the 2017 COCO dataset into this format, but not the other datasets.
    3. Colab was under-powered for training the Mask R-CNN model from scratch on such a large dataset. The free version of Colab provides only 12.8 GB of RAM and a single GPU (with 11.4 GB of memory), and each session allows only 12 hours of training. More annoyingly, if the browser goes idle, the Colab session is terminated automatically, which means the client computer cannot be turned off or used for other purposes while training. The authors of [1] report that it took 32 hours to train the Mask R-CNN model from scratch on the COCO dataset using an 8-GPU machine. With a single-GPU machine, the 12-hour session limit, and the idle-termination mechanism, we deemed Colab inappropriate for this kind of heavy training task.
  • Availability of instance segmentation datasets

    There were not many instance segmentation datasets to choose from for fine-tuning, because we had to constrain the dataset size to account for our limited computational resources.
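
The following is a minimal sketch of the symbolic-link workaround mentioned in the computational-resources item above; both paths are illustrative and assume the COCO 2017 images sit in a mounted Google Drive folder.

```python
# A minimal sketch of the symbolic-link workaround. The Drive path assumes the
# COCO 2017 images live in a mounted Drive folder; both paths are illustrative.
import os
from google.colab import drive

drive.mount("/content/drive")  # mount the (public or personal) Google Drive

drive_coco = "/content/drive/MyDrive/coco/train2017"  # hypothetical Drive location
local_coco = "/content/datasets/coco/train2017"       # path the training code expects

# Instead of copying ~19 GB of images into the Colab filesystem (which crashed
# our sessions), point the local path at the Drive folder via a symbolic link,
# so images are read lazily through the link during training.
os.makedirs(os.path.dirname(local_coco), exist_ok=True)
if not os.path.islink(local_coco):
    os.symlink(drive_coco, local_coco)
```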

Future Directions


Adding Video Stream Support

Our app currently only works for static images. We aim to extend it to perform instance segmentation over video frames. Ideally, we would like to perform live instance segmentation, which is essential for many real-world applications (e.g., autonomous vehicles). To achieve this, the runtime needs to be optimized.

Expanding the Coverage of the Supported Object Classes

Our app currently offers six instance segmentation modes. We aim to further expand our detection range to cover more objects. Ideally, the user should not need to manually specify a mode; rather, a single model should handle all types of objects. This would broaden the reach of our app by covering the wide-ranging interests of our users, but it would require training the model on a very large dataset containing thousands of classes or even more.


References

[1] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961–2969, 2017.

[2] Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. Detectron2. https://github.com/facebookresearch/detectron2, 2019.

[3] Jungseok Hong, Michael S. Fulton, and Junaed Sattar. TrashCan 1.0: An Instance-Segmentation Labeled Dataset of Trash Observations. Retrieved from the Data Repository for the University of Minnesota, https://doi.org/10.13020/g1gx-y834, 2020.

[4] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes Dataset for Semantic Urban Scene Understanding, 2016.

[5] Agrim Gupta, Piotr Dollár, and Ross Girshick. LVIS: A Dataset for Large Vocabulary Instance Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.

[6] Waleed Abdulla. Mask R-CNN for object detection and instance segmentation on Keras and TensorFlow. https://github.com/matterport/Mask_RCNN, 2017.

[7] G. Varma, A. Subramanian, A. Namboodiri, M. Chandraker, and C. V. Jawahar. IDD: A Dataset for Exploring Problems of Autonomous Navigation in Unconstrained Environments. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1743–1751, 2019.

[8] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common Objects in Context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.