In this article, we take a deeper look at how Computer Vision works and why it serves as a stepping stone for further technological advances. Thanks to advances in artificial intelligence, deep learning and neural networks, the field has made great leaps in recent years and now surpasses humans in some tasks related to detecting and labelling objects.
One of the strongest driving factors behind the rapid growth of Computer Vision is the amount of data people generate every day. This data is what enables us to train Computer Vision and other systems to excel at specific tasks. It is estimated that more than three billion images are shared online every day. Analysing all of these images requires enormous computing power, but through artificial intelligence we can automate a large part of the analysis. As the field of Computer Vision has grown with advancing hardware and new algorithms, accuracy rates for image classification have improved sharply.
The idea of Computer Vision dates back to the 1950s, and since then the field has grown rapidly into commercial applications such as autonomous driving, facial recognition, improved surveillance, and so on.
This article is the second part of a series; in the previous part we gave a brief introduction to what Computer Vision is. If you haven’t checked it out, we highly recommend you do so in order to understand this article better. Or, if you’re already familiar with the topic, continue reading!
How does Computer Vision work?
Neural networks are built on the idea that they mimic the way the human brain works. However, nobody is actually certain that this is true, because our knowledge of neuroscience is limited. How does our brain work, and how are we able to process so much information in real time? Because of these uncertainties, it’s difficult to determine how well our computer vision algorithms actually perform against our own biological vision system.
Figure 1: The human vision system vs computer vision system
Computer vision is all about pattern recognition. One way to teach a computer to understand visual data is to feed it labelled images. The more labelled images a computer is fed, the more accurately it is able to recognise patterns, people, objects, etc.
For example, if you feed a computer hundreds of thousands of labelled images of cars, it will apply algorithms that analyse colours, patterns, shapes, measurements, etc. After this, the computer will have learned what a ‘car’ looks like and should, in theory, be able to identify cars in unlabelled images on its own, without supervision.
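To make the idea of learning from labelled examples concrete, here is a minimal sketch of a nearest-centroid classifier. It is far simpler than the deep learning systems this article is about, and everything in it is an assumption made for illustration: the two-number feature vectors (imagined as mean brightness and fraction of edge pixels), the labels, and the function names `train_centroids` and `predict`.

```python
def train_centroids(examples):
    """Average the feature vectors of each label.

    `examples` is a list of (features, label) pairs.
    """
    sums, counts = {}, {}
    for features, label in examples:
        if label not in sums:
            sums[label] = [0.0] * len(features)
            counts[label] = 0
        sums[label] = [s + f for s, f in zip(sums[label], features)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}


def predict(centroids, features):
    """Return the label whose centroid is closest to `features`."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(centroids, key=lambda label: squared_distance(centroids[label], features))


# Hypothetical, hand-made features: (mean brightness, fraction of edge pixels).
labelled = [
    ([0.9, 0.1], "sky"),
    ([0.8, 0.2], "sky"),
    ([0.3, 0.7], "car"),
    ([0.4, 0.8], "car"),
]

centroids = train_centroids(labelled)
print(predict(centroids, [0.35, 0.75]))  # car
print(predict(centroids, [0.9, 0.2]))    # sky
```

The more labelled examples go into `train_centroids`, the better the averages describe each class, which is the same intuition behind feeding a real system hundreds of thousands of images.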
The primary objective of Computer Vision is to replicate human vision using digital images through three main processing components, carried out in consecutive order:
Image acquisition
Image processing
Image analysis and understanding
As our human visual understanding of the world is reflected in our ability to make decisions based on what we see, providing such a visual understanding to computers would give them the same ability.
Figure 2: The process of computer vision from image acquisition of the real world to how it makes a decision.
Image acquisition
Image acquisition can be described as the process of translating the world around us into binary data composed of zeros and ones, which are then interpreted as digital images.
The following devices are commonly used to digitise images and build datasets:
Webcams and embedded cameras
Digital compact cameras and DSLR cameras
Consumer 3D cameras and laser range finders
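As a rough illustration of what "binary data composed of zeros and ones" means in practice, the sketch below represents a tiny greyscale image as a grid of 8-bit intensities. The 3x3 image and the helper names `to_bits` and `to_bytes` are made up for this example; real cameras produce far larger images, usually with several colour channels.

```python
# A hypothetical 3x3 greyscale image: one 8-bit intensity per pixel
# (0 = black, 255 = white).
image = [
    [0, 128, 255],
    [64, 192, 32],
    [255, 0, 16],
]


def to_bits(image):
    """Return each pixel as an 8-character string of zeros and ones."""
    return [[format(pixel, "08b") for pixel in row] for row in image]


def to_bytes(image):
    """Flatten the image row by row into raw bytes, as it might sit in memory."""
    return bytes(pixel for row in image for pixel in row)


bits = to_bits(image)
raw = to_bytes(image)
print(bits[0])   # ['00000000', '10000000', '11111111']
print(len(raw))  # 9
```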
Image Processing
The second component of Computer Vision is the low-level processing of images. Algorithms are applied to the binary data acquired in the first step to infer low-level information about parts of the image, such as edges, point features or segments. These are the basic geometric elements that make up objects in images. This second step usually involves algorithms and techniques from advanced applied mathematics.
Low-level image processing algorithms include:
1. Edge Detection
Edge detection uses a variety of algorithms to identify the points in a digital image at which the brightness changes sharply. These points are typically organised into a set of curved line segments that are commonly known as edges.
The same problem of finding discontinuities in one-dimensional signals is known as step detection and the problem of finding signal discontinuities over time is known as change detection. Edge detection is a fundamental tool in image processing, machine vision and computer vision, particularly in the areas of feature detection and feature extraction.
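One common way to find sharp brightness changes is the Sobel operator, which estimates the horizontal and vertical brightness gradient around each pixel. The sketch below is a plain-Python version written for readability rather than speed; the toy image, the threshold of 100 and the function names are assumptions chosen for illustration, and Sobel is only one of the many edge detection algorithms mentioned above.

```python
import math

# Sobel kernels: approximate the horizontal and vertical brightness gradients.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_magnitude(image):
    """Return the gradient magnitude per pixel (border pixels are left at 0)."""
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = gy = 0
            for dy in range(3):
                for dx in range(3):
                    pixel = image[y + dy - 1][x + dx - 1]
                    gx += SOBEL_X[dy][dx] * pixel
                    gy += SOBEL_Y[dy][dx] * pixel
            out[y][x] = math.hypot(gx, gy)
    return out


def detect_edges(image, threshold=100):
    """Mark a pixel as an edge (1) when its gradient magnitude exceeds the threshold."""
    return [[1 if m > threshold else 0 for m in row] for row in sobel_magnitude(image)]


# Toy image: dark left half, bright right half, so the edge is vertical.
image = [[0, 0, 0, 255, 255, 255] for _ in range(5)]
edges = detect_edges(image)
for row in edges:
    print(row)
# The three interior rows print [0, 0, 1, 1, 0, 0]; the border rows are all zeros.
```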
2. Segmentation
In computer vision, image segmentation is the process of separating a digital image into multiple segments or parts. Segmentation is primarily used to simplify or change the representation of an image into something that is more meaningful or easier to analyse. Image segmentation is typically used to locate objects and boundaries (lines, curves, etc.) in images.
In other words, segmentation is the process of assigning a label to every pixel in an image so that pixels with the same label share pre-defined characteristics. A common technique of image segmentation is to look for abrupt discontinuities in pixel values, which typically indicate edges that define a region.
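The simplest way to assign a label to every pixel is thresholding: pixels brighter than some cut-off get one label and the rest get another. The sketch below shows this on a made-up greyscale image; the cut-off of 128 and the name `threshold_segment` are assumptions for illustration, and practical segmentation methods are considerably more sophisticated.

```python
def threshold_segment(image, threshold=128):
    """Label every pixel: 1 (foreground) if it is at least `threshold`, else 0."""
    return [[1 if pixel >= threshold else 0 for pixel in row] for row in image]


# Hypothetical greyscale image with a bright shape on a dark background.
image = [
    [10, 12, 200, 210],
    [11, 180, 220, 15],
    [9, 14, 13, 12],
]

labels = threshold_segment(image)
for row in labels:
    print(row)
# [0, 0, 1, 1]
# [0, 1, 1, 0]
# [0, 0, 0, 0]
```

Every pixel now carries a label, and pixels with the same label share a characteristic, in this case their brightness range.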
We divide the image into segments so that later processing can focus on the segments that matter. Have a look at the example below:
Figure 3: Object Detection and Instance Segmentation (Source: stanford.edu)
In object detection, the algorithm draws a bounding box around each object it detects and assigns that object a class. The limitation of object detection is that the boxes tell us nothing about the shape of an object. If we want more detailed information, bounding boxes are too coarse.
In the right image, instance segmentation creates a pixel-wise mask that covers each object in the image. This method provides us with significantly more information on the object than object detection.
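The difference between a box and a pixel-wise mask can be seen with a few lines of code. The sketch below takes a hypothetical mask for a single plus-shaped object, computes the smallest box that encloses it, and compares how many pixels each representation covers. The mask and the function names are invented for this example.

```python
def bounding_box(mask):
    """Return (top, left, bottom, right), inclusive, of the smallest box
    enclosing all 1-pixels, or None if the mask is empty."""
    rows = [y for y, row in enumerate(mask) if any(row)]
    cols = [x for row in mask for x, value in enumerate(row) if value]
    if not rows:
        return None
    return min(rows), min(cols), max(rows), max(cols)


def count_box_pixels(box):
    """Number of pixels covered by an inclusive (top, left, bottom, right) box."""
    top, left, bottom, right = box
    return (bottom - top + 1) * (right - left + 1)


def count_mask_pixels(mask):
    """Number of pixels that actually belong to the object."""
    return sum(sum(row) for row in mask)


# Hypothetical pixel-wise mask of one plus-shaped object.
mask = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
]

box = bounding_box(mask)
print(box)                      # (1, 1, 3, 3)
print(count_box_pixels(box))    # 9
print(count_mask_pixels(mask))  # 5
```

The box covers nine pixels but only five of them belong to the object, which is exactly the shape information a mask preserves and a box discards.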
Figure 4: Semantic Segmentation and Instance Segmentation (source)
We can broadly divide image segmentation techniques into two types. Have a look at the image above. Both images use image segmentation to separate groups of pixels, but the two methods are used in different scenarios.
In semantic segmentation, which is used in the first image, each pixel belongs to a certain class (either a person or the background). Semantic segmentation assigns the same colour to all pixels of the same class. In this case, the people are represented by pink pixels and the background by black pixels.
Instance segmentation differs from semantic segmentation in that every object of the same class is given a different colour: person 1 is red, person 2 is green, the background is black, and so on.
In summary, if there are multiple objects of the same class in an image, semantic segmentation treats them all as a single entity, whereas instance segmentation identifies each object individually.
Figure 5: When designing computer vision for self-driving cars, semantic segmentation is popularly used to help the system identify and locate vehicles and other important objects on the road.
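To see the two kinds of output side by side, the sketch below starts from a hypothetical semantic mask (1 = person, 0 = background) and gives each separate blob of person pixels its own id using a flood fill. This is only a toy stand-in: real instance segmentation models learn to separate objects, and a flood fill like this one would merge two people who touch. The mask and the name `label_instances` are assumptions for illustration.

```python
from collections import deque


def label_instances(semantic_mask):
    """Give each 4-connected blob of 1-pixels its own id (1, 2, ...).

    Background pixels stay 0. Returns (instance_map, number_of_instances).
    """
    height, width = len(semantic_mask), len(semantic_mask[0])
    instances = [[0] * width for _ in range(height)]
    next_id = 0
    for y in range(height):
        for x in range(width):
            if semantic_mask[y][x] == 1 and instances[y][x] == 0:
                next_id += 1
                instances[y][x] = next_id
                queue = deque([(y, x)])
                while queue:
                    cy, cx = queue.popleft()
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                        inside = 0 <= ny < height and 0 <= nx < width
                        if inside and semantic_mask[ny][nx] == 1 and instances[ny][nx] == 0:
                            instances[ny][nx] = next_id
                            queue.append((ny, nx))
    return instances, next_id


# Semantic view: every person pixel carries the same label.
semantic = [
    [1, 1, 0, 0, 1],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 0, 1],
]

instances, count = label_instances(semantic)
for row in instances:
    print(row)
# [1, 1, 0, 0, 2]
# [1, 1, 0, 0, 2]
# [0, 0, 0, 0, 2]
print(count)  # 2
```

In the semantic mask both people share the label 1, while the instance map tells person 1 and person 2 apart.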
3. Classification
Image classification is the task of assigning a label to an image based on its visual content. After being fed a large number of labelled images, an artificial neural network should be able to tell whether an image contains a car or not, for example.
The most widely used architecture for image classification is the Convolutional Neural Network (CNN). In a typical use case, you feed images to the network and it outputs a class for each one.
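The sketch below walks through the kind of computation a single convolutional layer performs: slide small kernels over the image, apply a ReLU, pool the result, and turn the scores into probabilities with a softmax. It is a heavily simplified illustration and not a trained model. The two kernels are set by hand (a real CNN learns its kernels from labelled images and stacks many layers), and the class names, toy image and function names are assumptions made for this example.

```python
import math


def conv2d(image, kernel):
    """'Valid' 2D cross-correlation, the operation CNN layers call convolution."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(kernel[i][j] * image[y + i][x + j] for i in range(kh) for j in range(kw))
            for x in range(out_w)
        ]
        for y in range(out_h)
    ]


def relu(feature_map):
    """Keep positive responses and zero out the rest."""
    return [[max(0.0, value) for value in row] for row in feature_map]


def global_max_pool(feature_map):
    """Summarise a feature map by its strongest response."""
    return max(max(row) for row in feature_map)


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(scores)
    exps = [math.exp(score - largest) for score in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hand-set kernels standing in for learned ones (an assumption of this sketch).
KERNELS = {
    "vertical edge": [[-1, 1], [-1, 1]],
    "horizontal edge": [[-1, -1], [1, 1]],
}


def classify(image):
    """Return (predicted_label, probabilities), one probability per kernel/class."""
    labels = list(KERNELS)
    scores = [global_max_pool(relu(conv2d(image, KERNELS[label]))) for label in labels]
    probabilities = softmax(scores)
    best = max(range(len(labels)), key=lambda i: probabilities[i])
    return labels[best], probabilities


# Toy image: dark on the left, bright on the right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

label, probabilities = classify(image)
print(label)  # vertical edge
```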