In our webinar 'Machine vision: learning increasingly complex real-world scenarios with limited annotated data', computer vision expert John Beuving explained how neural networks based on deep learning are increasingly able to understand images and videos. In this summary of the webinar, we set out the key facts and lessons.
John knows what he is talking about: he has been in the computer vision world for just under two decades. In 2003, he started at Siemens Mobile. Via Dacolian and DySI Analytics, he ended up at SmarterVision, where he develops socially relevant computer vision solutions as a computer vision expert and CTO. He obtained his PhD in Delft on model-free tracking; recently, as a volunteer at the NGO Sensing Clues, he has been building machine vision solutions for wildlife conservation.
In the webinar, Beuving first explains how SmarterVision deploys computer vision to, for example, monitor bridges, detect contraband being smuggled into prisons from camera images and detect epileptic seizures in young children in a non-invasive way.
The webinar also walks through the software stack SmarterVision uses for this purpose.
Video poses additional challenges compared to single images, Beuving explains. "This is because you have to determine not only what happens in a video, but also when exactly it starts and ends." The tricky part in particular is how to build detection software for events that rarely occur. Contraband thrown over a prison wall is a good example: it happens so rarely that there is hardly any footage of it.
Deep learning, a form of machine learning based on artificial neural networks, offers a solution. Since 2014, deep learning has improved enormously, thanks in particular to better graphics processors, larger available datasets (ImageNet, for example, contains millions of manually annotated images for visual object recognition) and advanced training techniques such as dropout and batch normalisation.
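As a minimal illustration of those last two techniques (a generic PyTorch sketch, not code from the webinar), dropout and batch normalisation are typically dropped into a network like this:

```python
import torch
import torch.nn as nn

# A small image classifier illustrating dropout and batch normalisation.
# Layer sizes are arbitrary toy values, not SmarterVision's architecture.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.BatchNorm2d(16),        # normalises activations per batch, stabilising training
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Dropout(p=0.5),         # randomly zeroes activations to reduce overfitting
    nn.Linear(16 * 16 * 16, 10),
)

x = torch.randn(8, 3, 32, 32)  # dummy batch of 8 RGB images of 32x32 pixels
logits = model(x)              # shape: (8, 10)
```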
Beuving: "Deep learning models are data-inefficient. Want to improve performance? Then you need a lot of annotated data, but there is only a limited amount available. There may be huge amounts of data coming in, more than 80 years of video a day on YouTube alone, but that is mostly unannotated. We want to learn from data, but we cannot label everything. Fortunately, the boundaries are being stretched further and further thanks to research. New techniques and datasets are allowing us to have a better understanding of video with the same amount of data."
Still, a lot of data is needed. Ideally, we would like to understand even more with even less, or even no, data. That's where the data paradox comes in: the more we want to understand about an image or video, the harder it is to get the data for it. Complicating matters further, many images are unique: for rare or difficult situations, it is hard to obtain more data.
The first step is actually always: make sure you have more annotated data. The go-to way for this is currently supervised learning, where a human labels all the data points. A classifier is then trained based on these data points.
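In its simplest form, supervised learning looks like the sketch below (a generic scikit-learn example, with synthetic feature vectors standing in for human-annotated images):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Toy stand-in for annotated data: X are feature vectors, y the human labels.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(random_state=0)
clf.fit(X_train, y_train)                  # train a classifier on the labelled points
print("accuracy:", clf.score(X_test, y_test))
```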
The amount of manual labelling that supervised learning requires can be reduced by generating data. This can be done, for example, using game engines and generative adversarial networks (GANs). Game engines, the software development environments used to build video games, can generate synthetic data for training foreground detection and tracking, among other things. The game engine Unity in particular gives great results.
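A full game-engine pipeline is beyond a short snippet, but the principle of synthetic data is easy to show: because you render the scene yourself, the ground-truth annotation comes for free. A heavily simplified, hypothetical NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

def synth_sample(size=64):
    """Generate one synthetic training sample: an image plus a perfect
    foreground mask, annotated for free because we rendered it ourselves."""
    img = rng.integers(0, 256, (size, size, 3), dtype=np.uint8)  # noise background
    mask = np.zeros((size, size), dtype=np.uint8)
    # Paste a bright square "object" at a random position as the foreground.
    x, y, s = rng.integers(0, size - 16, 2).tolist() + [16]
    img[y:y + s, x:x + s] = [255, 0, 0]
    mask[y:y + s, x:x + s] = 1
    return img, mask

image, fg_mask = synth_sample()  # ready to train a foreground detector on
```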
A generative adversarial network learns to generate images that have the same characteristics as the images in the training set, allowing you to create high-quality images. This is also the underlying technique for creating deepfakes.
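The adversarial setup itself is compact enough to sketch. Below is a deliberately tiny, generic PyTorch example of one GAN training step (untuned, with random tensors standing in for real images):

```python
import torch
import torch.nn as nn

# Generator maps random noise to fake "images"; discriminator scores real vs fake.
G = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real = torch.randn(32, 784)          # stand-in for a batch of real training images

# Discriminator step: learn to separate real images from generated ones.
fake = G(torch.randn(32, 64)).detach()
loss_d = bce(D(real), torch.ones(32, 1)) + bce(D(fake), torch.zeros(32, 1))
opt_d.zero_grad(); loss_d.backward(); opt_d.step()

# Generator step: learn to fool the discriminator.
fake = G(torch.randn(32, 64))
loss_g = bce(D(fake), torch.ones(32, 1))
opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```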
Semi-automatic methods for supervised learning, such as pseudo-labelling, active learning and point annotations, do not require a human to label every data point. Pseudo-labelling limits and simplifies supervised labelling: you train the model on a batch of labelled data, let the trained model predict labels for the unlabelled data, and then train the model on the pseudo-labelled and labelled datasets together.
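A minimal pseudo-labelling round might look like this (a generic scikit-learn sketch; the 0.95 confidence threshold is an arbitrary illustrative choice):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, random_state=0)
labelled = np.zeros(len(X), dtype=bool)
labelled[:50] = True                        # only 50 points have human labels

model = LogisticRegression().fit(X[labelled], y[labelled])

# Predict labels for the unlabelled pool; keep only confident predictions.
pool = X[~labelled]
probs = model.predict_proba(pool)
confident = probs.max(axis=1) > 0.95
pseudo_y = model.predict(pool)[confident]

# Retrain on the human labels plus the pseudo-labels together.
X_all = np.vstack([X[labelled], pool[confident]])
y_all = np.concatenate([y[labelled], pseudo_y])
model = LogisticRegression().fit(X_all, y_all)
```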
In active learning, human experts label the difficult data points; the classifier is then retrained with these new data points. The point-annotations method for video builds on object detection and lets you locate actions with limited supervision: all kinds of objects are tracked over time, and with a single click you can annotate an object across that whole series of frames. The big disadvantage of all these semi-automatic methods: unfortunately, they do not work well on unbalanced datasets.
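The active-learning loop can be sketched in a few lines (a generic example; here the "human expert" is simulated by simply revealing the true labels of the queried points):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, random_state=0)
labelled = np.zeros(len(X), dtype=bool)
labelled[:20] = True                        # tiny initial labelled set

for _ in range(5):                          # five rounds of expert labelling
    model = LogisticRegression().fit(X[labelled], y[labelled])
    probs = model.predict_proba(X)
    uncertainty = 1 - probs.max(axis=1)     # least-confident sampling
    uncertainty[labelled] = -1              # never re-query labelled points
    hardest = np.argsort(uncertainty)[-10:] # the 10 most difficult data points
    labelled[hardest] = True                # the expert labels these; the true
                                            # y stands in for the expert's answer
```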
Meta learning is a relatively new concept in which models learn quickly from less data. The idea behind it is that people and animals learn so quickly because they observe context, drawing on other senses and the physical properties of objects. Meta learning includes few-shot learning and zero-shot learning.
With few-shot learning, you need only a few samples per class in your data for the machine to learn that class. Among other things, the model looks for similarities between classes. This is a promising technique, not least because you can reuse the trained neural networks on other support sets. Zero-shot learning means having no samples of a class in your training set at all: the model classifies categories it has never seen, without labelled samples of those categories.
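As a rough illustration of the few-shot idea, here is a sketch in the style of prototypical networks (a generic example, not necessarily the method discussed in the webinar); the support set provides the few labelled samples per class:

```python
import torch

def prototypical_classify(support_x, support_y, query_x, encoder):
    """Few-shot classification: embed the support set, average each class's
    embeddings into a prototype, and assign queries to the nearest prototype."""
    z_support = encoder(support_x)
    z_query = encoder(query_x)
    classes = support_y.unique()
    prototypes = torch.stack([z_support[support_y == c].mean(0) for c in classes])
    dists = torch.cdist(z_query, prototypes)   # distance to each class prototype
    return classes[dists.argmin(dim=1)]        # predicted class per query

# Usage with a dummy encoder and a 3-way, 5-shot episode:
encoder = torch.nn.Linear(128, 32)             # stand-in for a trained network
support_x = torch.randn(15, 128)
support_y = torch.arange(3).repeat_interleave(5)
query_x = torch.randn(10, 128)
print(prototypical_classify(support_x, support_y, query_x, encoder))
```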
The current hype within computer vision, according to Beuving, is self-supervised learning. This involves automatically annotating unannotated data, so that a model can be trained on an unsupervised dataset in a supervised manner. The core is self-labelling: the data annotates itself, and the model learns from it.
This can be done in various ways; one classic example, a rotation-prediction pretext task, is sketched below.
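The summary does not list the webinar's specific methods, but rotation prediction is a well-known self-supervised pretext task: the rotation applied to an image serves as a free label. A minimal PyTorch sketch with toy sizes:

```python
import torch
import torch.nn as nn

# Rotation prediction: rotate each unlabelled image by 0/90/180/270 degrees
# and train a network to predict which rotation was applied. The labels come
# for free, so the unannotated data effectively annotates itself.
def rotation_batch(images):
    k = torch.randint(0, 4, (images.size(0),))          # free "labels": 0..3
    rotated = torch.stack([torch.rot90(img, int(r), dims=(1, 2))
                           for img, r in zip(images, k)])
    return rotated, k

net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 4))  # toy 4-way head
images = torch.randn(16, 3, 32, 32)                     # "unlabelled" data
x, y = rotation_batch(images)
loss = nn.functional.cross_entropy(net(x), y)
loss.backward()                                         # one self-supervised step
```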
An alternative way of tackling the data problem is anomaly detection, which revolves around detecting anything that deviates from expected values. In a GAN-based setup, the generator learns concepts from the real world, while the discriminator detects anomalies in the input.
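The GAN-based variant is too involved for a short snippet, so here is a deliberately simpler stand-in for the same core idea, flagging whatever deviates from learned "normal" data, using scikit-learn's isolation forest:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(0, 1, (500, 8))        # feature vectors of "normal" events
detector = IsolationForest(random_state=0).fit(normal)

new_events = np.vstack([rng.normal(0, 1, (5, 8)),    # more normal events
                        rng.normal(6, 1, (2, 8))])   # two clear outliers
print(detector.predict(new_events))        # +1 = normal, -1 = anomaly
```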
"What we currently use a lot is online active learning with a human in the loop," Beuving concludes the webinar. "The foundation is laid by self-supervised pre-training, a human expert then does the active learning piece."
The computer vision veteran's standard advice for companies is: always get more data first. "If you have enough data available, try self-supervised learning, because it is so promising. Then fine-tune this with meta learning methods or active learning. You can use anomaly detection as a fallback option."