Mask Wearer Recognition Through Neural Networks - What an MVP Can Look Like

Many thanks to NeuroForge for the exciting internship!

Through the project, I was able to further develop my skills in TensorFlow and software engineering, among others, and gained great insight into your daily operations.

Jakob Fleischmann

From the Idea to the MVP

Jakob Fleischmann is a research intern in the field of artificial intelligence at NeuroForge and gained insight into our day-to-day work during his internship. In the course of it, he implemented the concept of a Minimum Viable Product (MVP) as an example, taking up a topical issue of the Corona pandemic: a system that checks compliance with the mask requirement. We simply grabbed Jakob and asked him about his project.

Jakob, what motivated you to use neural networks during your internship to check compliance with the mask requirement in software?

Due to the current Corona pandemic, wearing a face mask has become common or compulsory in many public places such as shops or public transport. Because of the government's decrees, businesses face the challenge of ensuring that employees and customers adhere to the safety regulations. With this project, I would like to show that AI is well suited to support this task, helping to avoid penalties and protect everyone's health.

What approach did you take for the project and why did you choose it?

I focused on quick results and thus also a quick evaluation. After all, speed counts in times of crisis. It was also important for me to work in a time- and resource-efficient way. Of course, the innovative approach was exciting anyway. During the first steps, we noticed that there are already very large training data sets for recognising people. So the idea was to just quickly add a few more mask images.

Of course, my chosen network architecture, Faster R-CNN, already provides quite a bit. If we use a Faster R-CNN trained in this way to detect people, all that remains is to decide for each individual person whether a mask is worn or not. A classic CNN is ideally suited for this. The combination of the two then produces the good result.
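
The split described here, detection first and classification second, can be sketched as plain control flow. The following is only an illustration under stated assumptions, not the project's code: `crop`, `check_masks`, `detect_persons` and `wears_mask` are hypothetical names, the image is a toy list of pixel rows, and boxes are assumed to be normalised [y1, x1, y2, x2] values.

```python
from typing import Callable, List, Sequence, Tuple

Image = List[List[float]]                # toy grayscale image: rows of pixel values
Box = Tuple[float, float, float, float]  # assumed format: normalised [y1, x1, y2, x2]


def crop(image: Image, box: Box) -> Image:
    """Cut the region described by a normalised box out of the image."""
    height, width = len(image), len(image[0])
    y1, x1, y2, x2 = box
    top, bottom = int(y1 * height), int(y2 * height)
    left, right = int(x1 * width), int(x2 * width)
    return [row[left:right] for row in image[top:bottom]]


def check_masks(
    image: Image,
    detect_persons: Callable[[Image], Sequence[Box]],  # stands in for the detector
    wears_mask: Callable[[Image], bool],               # stands in for the classifier
) -> List[Tuple[Box, bool]]:
    """Stage 1 finds persons, stage 2 classifies each cropped person."""
    return [(box, wears_mask(crop(image, box))) for box in detect_persons(image)]
```

Because the two stages only meet at the cropped image, each network can be trained on its own data set, which is the point of the split approach.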

Convolutional neural networks belong to the field of artificial intelligence, more precisely deep learning. You also use region detection here. How did you do that?

It was important to me not only to recognise a mask in the picture, but also to get a real advantage on pictures that show several people. Therefore, of course, an RPN (Region Proposal Network) was the first choice. I used the concept from "Faster R-CNN" as a template for this and followed FurkanOM's implementation.
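
For context, a region proposal network is trained by comparing a fixed set of anchor boxes with the labelled ground-truth boxes via their intersection over union (IoU). The sketch below is a simplified illustration of that step, not the implementation used in the project: `iou` and `label_anchor` are hypothetical names, and the thresholds 0.7 and 0.3 are the values from the original Faster R-CNN paper, assumed here because the project's settings are not given. The paper's additional rule, which also marks the best-matching anchor for each ground-truth box as positive, is left out for brevity.

```python
from typing import Sequence


def iou(box_a: Sequence[float], box_b: Sequence[float]) -> float:
    """Intersection over union of two boxes given as [y1, x1, y2, x2]."""
    ay1, ax1, ay2, ax2 = box_a
    by1, bx1, by2, bx2 = box_b
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    intersection = inter_h * inter_w
    union = (ay2 - ay1) * (ax2 - ax1) + (by2 - by1) * (bx2 - bx1) - intersection
    return intersection / union if union > 0 else 0.0


def label_anchor(anchor, gt_boxes, positive=0.7, negative=0.3) -> int:
    """Return 1 (object), 0 (background) or -1 (ignored during training)."""
    best = max((iou(anchor, gt) for gt in gt_boxes), default=0.0)
    if best >= positive:
        return 1
    if best < negative:
        return 0
    return -1  # overlap is ambiguous, so the anchor does not contribute
```

An image without any ground-truth box yields only background anchors, since the best overlap then defaults to zero.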

So you used two different approaches for one solution. Why did you choose them?

For the MVP, it was especially important for me to demonstrate the feasibility of the project and thus the potential of artificial intelligence. I wanted to find out how I could cleverly prototype the implementation before the time-consuming data acquisition and the so-called "labelling". As a team, we quickly came up with the idea of using existing data sets for this as much as possible. So the project was "split up". To do this, we used the PascalVOC 2007 and 2012 datasets and first trained our network only for general person recognition. From the data sets, we therefore extracted only those images, together with their meta information, that contain a person.

import tensorflow as tf


# Helper from the project's data_utils module.
def filter_by_label_id(data_set, label_id):
    """Filter predicate for the dataset. Passes only entries that contain label_id.

    inputs:
        data_set = single entry of a tensorflow dataset
        label_id = id of the label to keep (here: person)
    outputs:
        scalar tensorflow bool tensor
    """
    labels = data_set['labels']
    # True as soon as at least one label of the entry equals label_id.
    return tf.reduce_any(labels == label_id)


def main(label_name="person", with_voc_2012=True):
    # Excerpt from the training script. data_utils is the project's helper module.
    # Assumption: label_name and with_voc_2012 are shown as parameters here because
    # the original excerpt does not show where they are defined.
    train_data, train_info = data_utils.get_dataset("voc/2007", "train+validation")
    val_data, _ = data_utils.get_dataset("voc/2007", "test")

    label_id = data_utils.get_label_id_for_label(label_name, train_info)

    if with_voc_2012:
        voc_2012_data, _ = data_utils.get_dataset("voc/2012", "train+validation")
        train_data = train_data.concatenate(voc_2012_data)

    # Keep only the entries that contain at least one person.
    train_data = train_data.filter(lambda data_set: data_utils.filter_by_label_id(data_set, label_id))
    train_total_items = data_utils.get_total_item_size(train_data)

    val_data = val_data.filter(lambda data_set: data_utils.filter_by_label_id(data_set, label_id))
    val_total_items = data_utils.get_total_item_size(val_data)

All unnecessary information was discarded and thus not included in the training. During the subsequent preprocessing, we removed all information concerning the 19 classes that were irrelevant to us.

def preprocessing(image_data, final_height, final_width, label_id, apply_augmentation=False, evaluate=False):
    """Resize the image before batch operations
       and discard information on all labels except the label with label_id.

    inputs:
        image_data = tensorflow dataset image_data
        final_height = final image height after resizing
        final_width = final image width after resizing
        label_id = id of the only label to keep
    outputs:
        img = (final_height, final_width, channels)
        gt_boxes = (gt_box_size, [y1, x1, y2, x2])
        gt_labels = (gt_box_size)
    """
    img = image_data["image"]
    gt_boxes = image_data["objects"]["bbox"]
    # Add 1 to every label id because id 0 is reserved for the background.
    gt_labels = tf.cast(image_data["objects"]["label"] + 1, tf.int32)

    # Delete gt_boxes and gt_labels entries that do not belong to label_id
    # (+ 1 since the background label was added above).
    person_or_not = gt_labels == (label_id + 1)
    gt_boxes = gt_boxes[person_or_not]
    gt_labels = gt_labels[person_or_not]
    # Since just one label is used, it is identified with 1.
    gt_labels = gt_labels - label_id

    if evaluate:
        # Apply the same person mask to is_difficult so that it lines up
        # with the already filtered boxes and labels.
        not_diff = tf.logical_not(image_data["objects"]["is_difficult"][person_or_not])
        gt_boxes = gt_boxes[not_diff]
        gt_labels = gt_labels[not_diff]
    img = tf.image.convert_image_dtype(img, tf.float32)
    img = tf.image.resize(img, (final_height, final_width))
    if apply_augmentation:
        img, gt_boxes = randomly_apply_operation(flip_horizontally, img, gt_boxes)
    return img, gt_boxes, gt_labels
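
The label arithmetic in this preprocessing step is easy to misread, so here is a small worked example on plain Python lists. It is a sketch for illustration only: `keep_only_label` is a hypothetical name, and the person id 14 used in the usage note is an assumed value, since the project looks the real id up at runtime.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # [y1, x1, y2, x2]


def keep_only_label(labels: List[int], boxes: List[Box], label_id: int):
    """Mirror the label arithmetic of the preprocessing step on plain lists."""
    # Shift every id by one so that 0 is free for the background class.
    shifted = [label + 1 for label in labels]
    # Mark the entries that belong to the wanted label.
    keep = [label == label_id + 1 for label in shifted]
    kept_boxes = [box for box, flag in zip(boxes, keep) if flag]
    # Subtracting label_id maps the single remaining class to 1.
    kept_labels = [label - label_id for label, flag in zip(shifted, keep) if flag]
    return kept_boxes, kept_labels
```

With the assumed person id 14 and input labels `[11, 14, 6, 14]`, only the second and fourth boxes remain, and both carry the label 1, which leaves 0 for the background.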

For the backbone of my Faster R-CNN, I used MobileNetV2 as a ready-made architecture. Compared with, for example, VGG-16, training and evaluation showed that this gave us faster and more reliable training results.

As a last task, I had to classify the cropped persons with the help of my own convolutional neural network. Using stacked convolutional, pooling and dense layers as well as batch normalisation, we arrive at a statement about whether the person in the input image is wearing a mask or not. I have illustrated the architecture here.

Why did you decide to use an additional CNN for this MVP rather than rely completely on Faster R-CNN?

In the course of this MVP, it quickly became clear that we not only wanted to determine whether or not a mask was being worn in the picture, but also to be able to perform our mask check for several people in the same picture. Teaching Faster R-CNN this additional information would have required time-consuming manual labelling of people wearing masks. For assessing the feasibility of the scenario, the split approach described above proved to be faster and easier.

For further development and a more stable application in production, I would of course recommend the approach of annotating and labelling a specially prepared data set, especially since the MVP has already achieved very good results.

What potential do you see in this project? How would you proceed now?

From my point of view, the project has demonstrated the potential of the iterative development process. In addition to the transfer of knowledge, the development also sharpened my awareness of the problem and of the questions that might arise. Of course, we reproduced the result of the artificial intelligence at the end. A visualisation of the process is attached here.

Thank you very much for your work, Jakob!

It's always nice to see how innovative approaches deliver initial results quickly and easily.

Jonas Szalanczi

