Machine vision: how neural networks understand images better and better

In our webinar 'Machine vision: learning increasingly complex real-world scenarios with limited annotated data', computer vision expert John Beuving explained how, based on deep learning, neural networks are increasingly able to understand images and videos. In this summary of the webinar, we set out the key facts and lessons.

John Beuving: visionary in computer vision development

John knows what he is talking about: he has been in the computer vision world for just under two decades. In 2003, he started at Siemens Mobile. Via Dacolian and DySI Analytics, he ended up at SmarterVision, where he develops socially relevant computer vision solutions as a computer vision expert and CTO. He obtained his PhD in Delft on model-free tracking; recently, as a volunteer at the NGO Sensing Clues, he has been building machine vision solutions for wildlife conservation.

Understanding images and videos with SmarterVision

In the webinar, Beuving first explains how SmarterVision deploys computer vision to, for example, monitor bridges, detect contraband being smuggled into prisons based on camera images, and detect epileptic seizures in young children in a non-invasive way.

The software stack used by SmarterVision for this purpose includes:

  • Image classification: classifying an entire image.
  • Object detection: identifying objects and their position.
  • Semantic segmentation: which pixels belong to an object?
  • Instance segmentation: identifying different instances of objects.
  • Pose estimation: analysing the different poses of objects.
  • Tracking: tracing objects within a series of images.

Video challenges

Video poses additional challenges compared to single images, Beuving explains. "This is because you have to determine not only what happens in a video, but also when exactly it starts and ends." The trickiest part is building detection software for events that rarely happen. Contraband thrown over a prison wall is a good example: it happens so seldom that there is hardly any footage of it.
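To make the "when exactly it starts and ends" part concrete, here is a minimal sketch. It assumes that some per-frame classifier already outputs a score between 0 and 1 for each frame; the function name, the threshold and the minimum length are illustrative assumptions, not something shown in the webinar.

```python
from typing import List, Tuple


def find_events(scores: List[float], threshold: float = 0.5,
                min_length: int = 2) -> List[Tuple[int, int]]:
    """Return inclusive (start_frame, end_frame) pairs where the score
    stays at or above `threshold` for at least `min_length` frames."""
    events = []
    start = None  # index of the first frame of the event currently open
    for i, score in enumerate(scores):
        if score >= threshold and start is None:
            start = i  # an event begins
        elif score < threshold and start is not None:
            if i - start >= min_length:  # drop very short blips
                events.append((start, i - 1))
            start = None
    # Close an event that runs until the last frame.
    if start is not None and len(scores) - start >= min_length:
        events.append((start, len(scores) - 1))
    return events


# Hypothetical per-frame scores for a ten-frame clip.
frame_scores = [0.1, 0.2, 0.8, 0.9, 0.7, 0.1, 0.6, 0.1, 0.9, 0.9]
print(find_events(frame_scores))  # [(2, 4), (8, 9)]
```

The single high score at frame 6 is ignored because it is shorter than the minimum length, which is one simple way to suppress isolated false alarms.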

Deep learning revolution

Deep learning, a form of machine learning based on artificial neural networks, offers a solution. Since 2014, deep learning has improved enormously, thanks in particular to better graphics processors, larger available datasets (ImageNet, for example, contains millions of manually annotated images for visual object recognition) and more advanced training techniques such as dropout and batch normalisation.

Beuving: "Deep learning models are data-inefficient. Want to improve performance? Then you need a lot of annotated data, but there is only a limited amount available. There may be huge amounts of data coming in, more than 80 years of video a day on YouTube alone, but that is mostly unannotated. We want to learn from data, but we cannot label everything. Fortunately, the boundaries are being stretched further and further thanks to research. New techniques and datasets are allowing us to have a better understanding of video with the same amount of data."

The data paradox

However, a lot of data is still needed. Ideally, we would like to understand even more with even less data, or even with none at all. That's where the data paradox comes in: the more we understand about an image or video, the harder it is to get more data. A complicating factor is that many images are unique: for rare or difficult situations, it is hard to collect more examples.

Solving the data problem: more data

The first step is always the same: make sure you have more annotated data. The go-to way for this is currently supervised learning, where a human labels all the data points. A classifier is then trained on these data points.

Generating data via GANs and game engines

The amount of manual labelling that supervised learning requires can be reduced by generating data. This can be done, for example, using game engines and generative adversarial networks (GANs). Game engines, the software development environments used to build video games, can generate synthetic data for training foreground detection and tracking, among other things. The Unity game engine in particular gives great results.
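The appeal of synthetic data is that the label comes for free, because the program that draws the scene already knows where every object is. The toy sketch below illustrates only that idea: it is a stand-in for a real game engine, and the tiny binary "images", the function name and all sizes are assumptions made for the example.

```python
import random
from typing import List, Tuple

Image = List[List[int]]
Box = Tuple[int, int, int, int]  # (left, top, right, bottom), inclusive


def render_scene(rng: random.Random, size: int = 16) -> Tuple[Image, Box]:
    """Draw one rectangular 'object' on an empty background and return
    the image together with its exact bounding box."""
    width = rng.randint(2, 5)
    height = rng.randint(2, 5)
    left = rng.randint(0, size - width)
    top = rng.randint(0, size - height)
    image = [[0] * size for _ in range(size)]
    for y in range(top, top + height):
        for x in range(left, left + width):
            image[y][x] = 1  # foreground pixel
    return image, (left, top, left + width - 1, top + height - 1)


rng = random.Random(0)  # fixed seed so the dataset is reproducible
dataset = [render_scene(rng) for _ in range(100)]
print(len(dataset), "annotated images generated without manual labelling")
```

A real pipeline would render textured 3D scenes instead of rectangles, but the principle is the same: every generated frame arrives with a perfect annotation.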

A generative adversarial network learns to generate images that have the same characteristics as the images in the training set, allowing you to create high-quality images. This is also the underlying technique for creating deepfakes.

Semiautomatic supervised learning

Semiautomatic methods for supervised learning, such as pseudo-labelling, active learning and point annotations, do not require the user to analyse all the data points themselves. They reduce or simplify the manual labelling work, for example by finding the most informative data. In pseudo-labelling, you first train the model on a batch of labelled data. The trained model then predicts labels for the unlabelled data, and finally the model is retrained on the labelled and pseudo-labelled datasets together.
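The three pseudo-labelling steps described above can be sketched in a few lines. This is a minimal illustration, not the method used in the webinar: the "model" is a simple nearest-centroid classifier on 2D points, and keeping only confident predictions (the min_margin parameter) is an added assumption that is common in practice.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def distance(a: Point, b: Point) -> float:
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5


def fit_centroids(points: List[Point], labels: List[int]) -> Dict[int, Point]:
    """'Train' the model: one centroid (mean point) per class."""
    sums: Dict[int, List[float]] = {}
    for (x, y), label in zip(points, labels):
        acc = sums.setdefault(label, [0.0, 0.0, 0.0])
        acc[0] += x
        acc[1] += y
        acc[2] += 1
    return {label: (a[0] / a[2], a[1] / a[2]) for label, a in sums.items()}


def predict(centroids: Dict[int, Point], point: Point) -> Tuple[int, float]:
    """Return (label, margin). Needs at least two classes; the margin is the
    gap between the runner-up distance and the nearest distance."""
    ranked = sorted((distance(point, c), label) for label, c in centroids.items())
    return ranked[0][1], ranked[1][0] - ranked[0][0]


def pseudo_label(labelled_x: List[Point], labelled_y: List[int],
                 unlabelled: List[Point], min_margin: float = 1.0) -> Dict[int, Point]:
    centroids = fit_centroids(labelled_x, labelled_y)      # 1. train on labelled data
    all_x, all_y = list(labelled_x), list(labelled_y)
    for point in unlabelled:                               # 2. predict pseudo-labels
        label, margin = predict(centroids, point)
        if margin >= min_margin:                           # assumption: keep confident ones only
            all_x.append(point)
            all_y.append(label)
    return fit_centroids(all_x, all_y)                     # 3. retrain on both sets


model = pseudo_label(
    labelled_x=[(0.0, 0.0), (4.0, 4.0)],
    labelled_y=[0, 1],
    unlabelled=[(1.0, 0.0), (0.0, 1.0), (5.0, 4.0), (4.0, 5.0), (2.0, 2.0)],
)
print(model)
```

The point (2.0, 2.0) lies exactly between the two classes, so it gets no pseudo-label and does not influence the retrained model.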

In active learning, human experts label only the difficult data points, and the classifier is then retrained with these new data points. The point annotations method for video is based on object detection and lets you locate actions with limited supervision. In a video, all kinds of objects are tracked over time, so with one click you can annotate an object across the whole series of frames. The big disadvantage of all these semi-automatic methods is that they do not work well for unbalanced datasets.
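One common way to decide which data points are "difficult" is uncertainty sampling: send the samples the model is least sure about to the human expert. The sketch below assumes a binary classifier that outputs a probability per sample; the function name, the probabilities and the labelling budget are illustrative assumptions.

```python
from typing import List


def select_for_labelling(probabilities: List[float], budget: int) -> List[int]:
    """Return the indices of the `budget` samples whose predicted
    probability is closest to 0.5, i.e. where the model is least certain."""
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: abs(probabilities[i] - 0.5))
    return ranked[:budget]


# Hypothetical model outputs for five unlabelled samples.
probs = [0.97, 0.52, 0.10, 0.45, 0.80]
print(select_for_labelling(probs, budget=2))  # [1, 3]
```

After the expert has labelled the selected samples, they are added to the training set and the classifier is retrained, after which the selection can be repeated.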

Meta learning

Meta learning is a relatively new concept in which a model learns how to learn, so that new tasks can be picked up quickly with less data. The idea behind it is that people and animals learn so quickly because they observe context, including input from other senses and the physical properties of objects. Two forms of this are few-shot learning and zero-shot learning.

With few-shot learning, you need only a few samples per class in your data for the machine to learn about the class. Among other things, the model looks for similarities between classes. This is a promising technique, not least because you can reuse the same neural network with other support sets (the handful of labelled examples per class). Zero-shot learning means having no samples of a class at all in your training set. The model classifies categories that have not yet been seen, so the data is classified without any labelled samples of those categories.
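A simple way to picture few-shot classification is to average the few examples of each class into a prototype and assign a new sample to the nearest prototype. The sketch below assumes that a pre-trained network has already turned each image into a small feature vector; the vectors, class names and function names are made up for illustration.

```python
from typing import Dict, List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_prototypes(support: Dict[str, List[Sequence[float]]]) -> Dict[str, List[float]]:
    """Average the few support examples of each class into one prototype."""
    prototypes = {}
    for name, examples in support.items():
        dims = len(examples[0])
        prototypes[name] = [sum(e[d] for e in examples) / len(examples)
                            for d in range(dims)]
    return prototypes


def classify(query: Sequence[float], prototypes: Dict[str, List[float]]) -> str:
    """Assign the query to the class with the nearest prototype."""
    return min(prototypes, key=lambda name: squared_distance(query, prototypes[name]))


# Hypothetical support set: two feature vectors ("shots") per class.
support = {
    "zebra": [[0.9, 0.1], [0.8, 0.2]],
    "rhino": [[0.1, 0.9], [0.2, 0.7]],
}
prototypes = build_prototypes(support)
print(classify([0.7, 0.3], prototypes))  # zebra
```

Swapping in a different support set changes the classes the system can recognise without retraining the feature extractor, which is what makes the approach reusable.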

Self-supervised learning

According to Beuving, the current hype within computer vision is self-supervised learning. This involves automatically annotating unannotated data, so that a model can be trained in a supervised manner on a dataset that has no human labels. The core is self-labelling: the data annotates itself and the model learns from it.

This can be done in various ways, such as:

  • Inpainting: part of the image is removed and the neural network has to predict what the missing part looks like.
  • Exemplar: multiple examples are generated from one image by making adjustments such as scaling, rotating and changing contrast or colours.
  • Jigsaw: the image is divided into puzzle pieces; by placing the pieces correctly, the network learns visual concepts.
  • For videos, you can predict the future by learning from the past.
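The jigsaw variant shows clearly how data can label itself: the image is cut into tiles, the tiles are shuffled, and the shuffle order becomes the training label. The sketch below only builds such a training sample and does not train a network. It assumes a square image whose side is divisible by the grid size, and the function names are illustrative.

```python
import random
from typing import List, Tuple

Image = List[List[int]]


def split_tiles(image: Image, grid: int) -> List[Image]:
    """Cut a square image into grid x grid tiles, in row-major order."""
    tile = len(image) // grid
    return [
        [row[c * tile:(c + 1) * tile] for row in image[r * tile:(r + 1) * tile]]
        for r in range(grid)
        for c in range(grid)
    ]


def make_jigsaw_sample(image: Image, grid: int,
                       rng: random.Random) -> Tuple[List[Image], List[int]]:
    """Return (shuffled_tiles, order); order[k] is the original position of
    the tile now in slot k. No human annotation is involved."""
    tiles = split_tiles(image, grid)
    order = list(range(len(tiles)))
    rng.shuffle(order)
    return [tiles[i] for i in order], order


# A 4x4 toy image whose pixel values encode their own position.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
shuffled, order = make_jigsaw_sample(image, grid=2, rng=random.Random(0))
print(order)  # the label the network would have to predict
```

A network trained to recover the order from the shuffled tiles has to learn how object parts fit together, and those learned features can then be reused for the actual task.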

Anomaly detection

Detecting anything that deviates from certain values is at the heart of anomaly detection, which is an alternative way of solving the data problem. In a GAN-based approach, the generator learns concepts from the real world, while the discriminator detects anomalies based on the input.
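The basic principle, learning what is normal and flagging whatever deviates from it, can be shown with a much simpler statistical stand-in than the generator and discriminator set-up mentioned above. In the sketch below, each number is assumed to be some per-frame measurement (for example, the amount of motion); the values, the function names and the three-standard-deviation rule are assumptions for illustration.

```python
from statistics import mean, stdev
from typing import List, Tuple


def fit_normal_range(values: List[float], k: float = 3.0) -> Tuple[float, float]:
    """Learn what 'normal' looks like from anomaly-free data: the mean
    plus or minus k standard deviations."""
    mu, sigma = mean(values), stdev(values)
    return mu - k * sigma, mu + k * sigma


def find_anomalies(values: List[float], low: float, high: float) -> List[int]:
    """Return the indices of all values outside the normal range."""
    return [i for i, v in enumerate(values) if not (low <= v <= high)]


# Hypothetical measurements from footage in which nothing unusual happens.
normal = [10.0, 11.0, 9.0, 10.0, 10.0, 11.0, 9.0, 10.0]
low, high = fit_normal_range(normal)

# New footage: the third value is far outside the learned range.
new = [10.0, 9.5, 25.0, 10.5]
print(find_anomalies(new, low, high))  # [2]
```

The attraction for rare events is that only normal data is needed to fit the model; no examples of the anomaly itself have to be collected.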

In practice

"What we currently use a lot is online active learning with a human in the loop," Beuving concludes the webinar. "The foundation is laid by self-supervised pre-training; a human expert then does the active learning piece."

The computer vision veteran's standard advice for companies is: always get more data first. "If you have enough data available, try self-supervised learning because it is so promising. Then fine-tune this with meta learning methods or active learning. Anomaly detection you can use as a fallback option."
