Mask R-CNN is an extension of Faster R-CNN. Faster R-CNN predicts a class and a bounding box for each object; Mask R-CNN adds one more branch that predicts an object mask in parallel with those two outputs.
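For reference, the original Mask R-CNN paper trains the three branches jointly with a multi-task loss on each sampled ROI. The mask term is the average per-pixel binary cross-entropy over the m×m mask predicted for the ground-truth class k (m = 28 for the 28×28 masks mentioned in the last step). Here y_{ij} is the ground-truth mask and \hat{y}^{k}_{ij} is the per-pixel sigmoid output:

```latex
L = L_{cls} + L_{box} + L_{mask},
\qquad
L_{mask} = -\frac{1}{m^2} \sum_{i,j} \left[ y_{ij} \log \hat{y}^{k}_{ij} + \left(1 - y_{ij}\right) \log\left(1 - \hat{y}^{k}_{ij}\right) \right]
```

Because only the mask for class k contributes to the loss, the mask branch does not have to compete across classes. Choosing the class is left to the classification branch.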
How Mask R-CNN works:
Backbone network: a standard convolutional neural network that serves as a feature extractor. For example, it can convert a 1024x1024x3 image into a 32x32x2048 feature map, which serves as the input to the next stage.
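The shape arithmetic above can be checked with a minimal sketch. It assumes a ResNet-style backbone with five stride-2 downsampling stages (a total stride of 32) whose last stage outputs 2048 channels, as in ResNet-50/101. The function name `backbone_output_shape` is hypothetical, and the sketch only computes shapes; it does not run a network.

```python
import math
from typing import Tuple


def backbone_output_shape(height: int, width: int,
                          num_downsampling_stages: int = 5,
                          out_channels: int = 2048) -> Tuple[int, int, int]:
    """Shape (H, W, C) of the last feature map of a hypothetical ResNet-style backbone."""
    # Each stride-2 stage halves the spatial size, so five stages give a total stride of 32.
    total_stride = 2 ** num_downsampling_stages
    # ceil() mirrors "same" padding, where odd sizes are rounded up at each stage.
    return (math.ceil(height / total_stride),
            math.ceil(width / total_stride),
            out_channels)


print(backbone_output_shape(1024, 1024))  # (32, 32, 2048)
```

Only the spatial size depends on the input. The channel count is fixed by the backbone's last stage.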
Region proposal network (RPN): the RPN scans a set of regions called anchor boxes (up to about 200K of them, in various sizes and aspect ratios) and predicts, for each one, whether it contains an object. One advantage of the RPN is that it does not scan the raw image. Instead, it scans the backbone's feature map, which makes it faster.
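The anchors come from the feature map rather than the image. The following minimal sketch shows this, under several illustrative assumptions: a single feature map with no pyramid levels, one anchor per (cell, scale, aspect ratio) centred on the cell, boxes stored as (y1, x1, y2, x2) in image pixels, and the scales and ratios chosen arbitrarily. The name `generate_anchors` is hypothetical. The objectness score that the RPN predicts for each anchor would come from a small convolutional head, which is not shown.

```python
import math
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (y1, x1, y2, x2) in image pixels


def generate_anchors(feat_h: int, feat_w: int, stride: int,
                     scales: List[float], ratios: List[float]) -> List[Box]:
    """One anchor per (feature-map cell, scale, aspect ratio), centred on that cell."""
    anchors: List[Box] = []
    for row in range(feat_h):
        for col in range(feat_w):
            # Centre of this feature-map cell, mapped back to image coordinates.
            cy, cx = (row + 0.5) * stride, (col + 0.5) * stride
            for scale in scales:
                for ratio in ratios:  # ratio = width / height; area stays scale**2
                    h, w = scale / math.sqrt(ratio), scale * math.sqrt(ratio)
                    anchors.append((cy - h / 2, cx - w / 2, cy + h / 2, cx + w / 2))
    return anchors


anchors = generate_anchors(32, 32, stride=32, scales=[128, 256, 512], ratios=[0.5, 1, 2])
print(len(anchors))  # 9216 = 32 * 32 * 3 * 3
```

A single 32×32 map yields far fewer anchors than the roughly 200K mentioned above. Counts of that size come from denser feature maps or from several feature-map levels.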
ROI classification and bounding-box regression: in this step, the algorithm takes the regions of interest (ROIs) proposed by the RPN as input and outputs, for each ROI, a class (via a softmax classifier) and a refined bounding box (via a regressor).
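The two outputs of this step can be sketched as follows. The sketch assumes per-class box deltas in the usual Faster R-CNN parameterization (dy, dx, log dh, log dw), measured relative to the ROI's size. The function names and the example logits and deltas are hypothetical. The sketch takes the outputs of the classification and regression heads as given rather than computing them from features.

```python
import math
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (y1, x1, y2, x2)


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def apply_box_deltas(roi: Box, deltas: Tuple[float, float, float, float]) -> Box:
    """Refine an ROI with (dy, dx, log dh, log dw) deltas, Faster R-CNN style."""
    y1, x1, y2, x2 = roi
    h, w = y2 - y1, x2 - x1
    cy, cx = y1 + 0.5 * h, x1 + 0.5 * w
    dy, dx, dh, dw = deltas
    # Shift the centre by a fraction of the box size, rescale height/width exponentially.
    cy, cx = cy + dy * h, cx + dx * w
    h, w = h * math.exp(dh), w * math.exp(dw)
    return (cy - 0.5 * h, cx - 0.5 * w, cy + 0.5 * h, cx + 0.5 * w)


def classify_roi(roi: Box, class_logits: List[float],
                 class_deltas: List[Tuple[float, float, float, float]]) -> Tuple[int, float, Box]:
    """Pick the most likely class, then refine the box with that class's deltas."""
    probs = softmax(class_logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best], apply_box_deltas(roi, class_deltas[best])


cls, score, box = classify_roi((10, 10, 50, 50), [0.1, 2.0, 0.5],
                               [(0, 0, 0, 0), (0.1, 0.0, 0.0, 0.0), (0, 0, 0, 0)])
print(cls, box)  # 1 (14.0, 10.0, 54.0, 50.0)
```

Using the deltas of the predicted class lets each class learn its own box correction. The example shifts the box down by 10% of its height.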
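Mask segmentation: in the last step, the positive ROIs (those the classifier marks as containing an object) are taken as input, and a 28×28-pixel soft mask of floating-point values is generated as output for each object. During inference, this small mask is scaled up to the size of the ROI's bounding box to produce the final mask for the object.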
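Scaling the mask up can be sketched as follows. The sketch assumes integer box coordinates (y1, x1, y2, x2) that lie inside the image. It uses nearest-neighbour resizing as a simplification, since bilinear interpolation is more typical. It then binarizes at 0.5, the threshold used in the Mask R-CNN paper. The function names and the example mask are hypothetical.

```python
from typing import List, Tuple

Mask = List[List[float]]


def resize_nearest(mask: Mask, out_h: int, out_w: int) -> Mask:
    """Nearest-neighbour resize (a simplification; bilinear is more common)."""
    in_h, in_w = len(mask), len(mask[0])
    return [[mask[min(in_h - 1, r * in_h // out_h)][min(in_w - 1, c * in_w // out_w)]
             for c in range(out_w)]
            for r in range(out_h)]


def paste_mask(soft_mask: Mask, box: Tuple[int, int, int, int],
               image_h: int, image_w: int, threshold: float = 0.5) -> List[List[bool]]:
    """Scale a small soft mask up to its ROI box and binarize it on a full-size canvas."""
    y1, x1, y2, x2 = box  # integer pixel coordinates, assumed to lie inside the image
    resized = resize_nearest(soft_mask, y2 - y1, x2 - x1)
    full = [[False] * image_w for _ in range(image_h)]
    for r in range(y2 - y1):
        for c in range(x2 - x1):
            full[y1 + r][x1 + c] = resized[r][c] >= threshold
    return full


# Hypothetical 28x28 soft mask: left half confident, right half not.
soft = [[0.9] * 14 + [0.1] * 14 for _ in range(28)]
binary = paste_mask(soft, (2, 4, 10, 12), image_h=16, image_w=16)
print(sum(row.count(True) for row in binary))  # 32
```

Keeping the predicted mask at 28×28 keeps the mask branch small. The detail lost in that small grid is partly offset because the values are soft probabilities rather than hard 0/1 labels.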