Region Based Convolutional Neural Networks Explained

Region-based Convolutional Neural Networks (R-CNN) are a family of machine learning models for computer vision, specifically object detection.

History

The original goal of R-CNN was to take an input image and produce a set of bounding boxes as output, where each bounding box encloses an object and is labeled with the object's category (e.g. car or pedestrian). More recently, R-CNN has been extended to perform other computer vision tasks. The following covers some of the versions of R-CNN that have been developed.

November 2013: R-CNN. Given an input image, R-CNN begins by applying a mechanism called Selective Search to extract regions of interest (ROIs), where each ROI is a rectangle that may represent the boundary of an object in the image. Depending on the scenario, there may be as many as two thousand ROIs. After that, each ROI is fed through a neural network to produce output features. For each ROI's output features, a collection of support-vector machine classifiers is used to determine what type of object (if any) is contained within the ROI.
April 2015: Fast R-CNN. While the original R-CNN independently computed the neural network features on each of as many as two thousand regions of interest, Fast R-CNN runs the neural network once on the whole image. At the end of the network is a novel method called ROIPooling, which slices out each ROI from the network's output tensor, reshapes it, and classifies it. As in the original R-CNN, Fast R-CNN uses Selective Search to generate its region proposals.^[1]
June 2015: Faster R-CNN. While Fast R-CNN used Selective Search to generate ROIs, Faster R-CNN integrates the ROI generation into the neural network itself.
March 2017: Mask R-CNN. While previous versions of R-CNN focused on object detection, Mask R-CNN adds instance segmentation. Mask R-CNN also replaced ROIPooling with a new method called ROIAlign, which can represent fractions of a pixel.^[2] ^[3]
June 2019: Mesh R-CNN. Mesh R-CNN adds the ability to generate a 3D mesh from a 2D image.^[4]
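
The difference between ROIPooling and ROIAlign can be illustrated with a small sketch. The code below is a simplified illustration under stated assumptions, not the implementation used in any of the published models: it works on a single-channel feature map stored as a list of lists, whereas real networks operate on multi-channel tensors, and the function names roi_max_pool and bilinear_sample are hypothetical. The first function slices an ROI out of the feature map and max-pools it into a fixed-size grid, snapping bin edges to whole cells. The second reads a value at a fractional coordinate by bilinear interpolation, which is the kind of sub-pixel sampling that ROIAlign relies on.

```python
def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool one ROI of a 2D feature map into an out_h x out_w grid.

    roi is (x0, y0, x1, y1) in whole feature-map cells, with x1 and y1
    exclusive. Bin edges are snapped to whole cells (an assumption of this
    sketch), so neighbouring bins may overlap by one cell.
    """
    x0, y0, x1, y1 = roi
    roi_h, roi_w = y1 - y0, x1 - x0
    pooled = []
    for i in range(out_h):
        # Floor the start edge and ceil the end edge of each bin.
        r_start = y0 + (i * roi_h) // out_h
        r_end = y0 + ((i + 1) * roi_h + out_h - 1) // out_h
        row = []
        for j in range(out_w):
            c_start = x0 + (j * roi_w) // out_w
            c_end = x0 + ((j + 1) * roi_w + out_w - 1) // out_w
            row.append(max(
                feature_map[r][c]
                for r in range(r_start, r_end)
                for c in range(c_start, c_end)
            ))
        pooled.append(row)
    return pooled


def bilinear_sample(feature_map, x, y):
    """Read the feature map at fractional (x, y) by bilinear interpolation."""
    height, width = len(feature_map), len(feature_map[0])
    x0, y0 = int(x), int(y)
    # Clamp the neighbouring cell so samples on the border stay in range.
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    dx, dy = x - x0, y - y0
    top = feature_map[y0][x0] * (1 - dx) + feature_map[y0][x1] * dx
    bottom = feature_map[y1][x0] * (1 - dx) + feature_map[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


# A toy 4 x 6 feature map; the value at row r, column c is 1 + c + 6 * r.
feature_map = [
    [1, 2, 3, 4, 5, 6],
    [7, 8, 9, 10, 11, 12],
    [13, 14, 15, 16, 17, 18],
    [19, 20, 21, 22, 23, 24],
]

# A 5-wide, 3-tall ROI is reduced to a fixed 2 x 2 output.
print(roi_max_pool(feature_map, (1, 0, 6, 3), 2, 2))  # [[10, 12], [16, 18]]

# Sampling halfway between cells, with no snapping to the grid.
print(bilinear_sample(feature_map, 1.5, 0.5))  # 5.5
```

Whatever the size of the ROI, the pooled output has the same shape, which is what lets a fixed-size classifier run on every region. The snapping in roi_max_pool discards sub-cell position information, while sampling at fractional coordinates keeps it.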

Applications

Region-based convolutional neural networks have been used for tracking objects from a drone-mounted camera,^[5] locating text in an image,^[6] and enabling object detection in Google Lens.^[7] Mask R-CNN serves as one of seven tasks in the MLPerf Training Benchmark, which is a competition to speed up the training of neural networks.^[8]

References

[1] Bhatia, Richa (September 10, 2018). "What is region of interest pooling?". Analytics India. Retrieved March 12, 2020.
[2] Farooq, Umer (February 15, 2018). "From R-CNN to Mask R-CNN". Medium. Retrieved March 12, 2020.
[3] Weng, Lilian (December 31, 2017). "Object Detection for Dummies Part 3: R-CNN Family". Lil'Log. Retrieved March 12, 2020.
[4] Wiggers, Kyle (October 29, 2019). "Facebook highlights AI that converts 2D objects into 3D shapes". VentureBeat. Retrieved March 12, 2020.
[5] Nene, Vidi (August 2, 2019). "Deep Learning-Based Real-Time Multiple-Object Detection and Tracking via Drone". Drone Below. Retrieved March 28, 2020.
[6] Ray, Tiernan (September 11, 2018). "Facebook pumps up character recognition to mine memes". Retrieved March 28, 2020.
[7] Sagar, Ram (September 9, 2019). "These machine learning methods make Google Lens a success". Analytics India. Retrieved March 28, 2020.
[8] Mattson, Peter; et al. (2019). "MLPerf Training Benchmark". arXiv:1910.01500v3 [math.LG].