Computer Vision: How machines learn to see
As part of our blog series on artificial intelligence (AI), we now turn to computer vision (also called machine vision). This subfield of AI deals with analyzing and manipulating visual data, especially images and videos, and extracting information from it.
As with NLP, the range of applications seems almost endless, because processing visual data is one of the primary sensory tasks for us humans as well. Here, too, a simplified look at how humans process visual data helps to build an understanding of computer vision and its parallels to human perception.
How does machine image processing work?
To test the mental development of young children, they are often given the task of recognizing and abstracting shapes such as squares, triangles and circles, for example by working out which block of wood fits into which opening. What has to happen in order to master this task? The toddler has two objects: the block of wood and the opening. The objects have specific color patterns that presumably differ from the background (e.g. the floor). The child receives visual information indicating that there are objects that differ from the floor in color and structure, as well as in distance and shape. This allows the child to clearly see the boundaries of each object. As the child interacts more and more often with these or similar objects, it gradually understands that properties such as the number of edges and corners and the aspect ratios are always similar: it learns to recognize patterns. The better the child masters these patterns, the better it can abstract that the block of wood has the same pattern as the square opening and therefore fits into it.
Of course, the data situation in computer vision looks different at first. While complex sensory and neurological connections lead to perception in the child, visual data in a machine is primarily mathematical in nature. Each pixel is stored as one or more numerical values that encode its color from a wide spectrum (for example, one value each for red, green and blue). A high-resolution photo therefore consists of millions of numerical values. Nevertheless, the basic features of the machine's approach to analysis and pattern recognition are not so different from those of humans.
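To make the idea of "an image as numbers" concrete, here is a minimal sketch in Python. It assumes the NumPy library and an RGB representation with values from 0 to 255; the tiny image and the helper `value_count` are made up for illustration and are not part of any particular product.

```python
import numpy as np

# A tiny 2x3 RGB "image": height x width x 3 color channels, values 0-255.
# Red pixels stand in for the wooden block, white pixels for the floor.
image = np.array(
    [
        [[255, 0, 0], [255, 0, 0], [255, 255, 255]],
        [[255, 0, 0], [255, 255, 255], [255, 255, 255]],
    ],
    dtype=np.uint8,
)

print(image.shape)  # (2, 3, 3)
print(image[0, 0])  # [255   0   0], a pure red pixel
print(image.size)   # 18 numerical values in total


def value_count(width, height, channels=3):
    """Number of values needed to store an image (hypothetical helper)."""
    return width * height * channels


# A 4000 x 3000 pixel photo with three color channels:
print(value_count(4000, 3000))  # 36000000
```

Even this toy example shows why a high-resolution photo quickly adds up to millions of values.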
Using various manipulation techniques, the machine can, for example, perceive the outlines of objects in an image. In the wooden block example, the machine would clearly see the dividing line between the red wooden block and the white floor, because the color values of the block's pixels differ significantly from those of the floor. With the help of machine learning (ML) and large amounts of training data, the machine can then be taught to recognize patterns.
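The outline idea can be sketched in a few lines of Python. This is a deliberately simple illustration, assuming NumPy, a synthetic scene and an arbitrary threshold; real systems typically use more robust edge detectors. The function names `make_scene` and `edge_mask` are hypothetical.

```python
import numpy as np


def make_scene(size=8):
    """White floor with a red square block in the middle (toy scene)."""
    scene = np.full((size, size, 3), 255, dtype=np.uint8)
    scene[2:6, 2:6] = (255, 0, 0)
    return scene


def edge_mask(image, threshold=100.0):
    """Mark pixels whose color differs strongly from the right or lower neighbor."""
    pixels = image.astype(np.float64)
    # Color distance between each pixel and its right / lower neighbor.
    dx = np.linalg.norm(pixels[:, 1:] - pixels[:, :-1], axis=2)
    dy = np.linalg.norm(pixels[1:, :] - pixels[:-1, :], axis=2)
    mask = np.zeros(image.shape[:2], dtype=bool)
    mask[:, :-1] |= dx > threshold
    mask[:-1, :] |= dy > threshold
    return mask


mask = edge_mask(make_scene())
# Print the outline: '#' marks a strong color change, '.' a uniform area.
for row in mask:
    print("".join("#" if cell else "." for cell in row))
```

The printed pattern traces the border between block and floor, while the uniform interior of the block and the floor itself stay empty.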
This is of course a simplified form of visual data processing, but it provides an understanding of how the machine can recognize patterns from visual data and where the parallels to us humans lie. With the help of huge amounts of data and deep learning (DL) algorithms, a wide variety of patterns and objects in image and video data can be recognized and classified.
Application areas
One of the main tasks in computer vision is object recognition and classification. This can be used to solve a wide variety of tasks. A few of them are shown here as examples.
Optical Character Recognition (OCR)
OCR is a widespread technology that is mainly used for evaluating documents. It converts writing and characters that are not yet available as machine-readable text into that format. For example, scanned invoices can be read automatically, and individual data points such as the invoice number, line items, open amount or biller can be extracted for further processing.
OCR is often used in combination with other processes because text data is often easier to process than visual data. If, for example, you convert existing documents such as scanned invoices, correspondence or contracts into text format using OCR, a wide variety of natural language processing (NLP) algorithms can then be applied to them (see previous blog post).
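Once OCR has produced plain text, extracting individual data points can be as simple as pattern matching. The following sketch assumes the OCR step has already run; the sample text, the field labels and the function `extract_fields` are invented for illustration, and real invoices vary far more in layout.

```python
import re

# Hypothetical OCR output for a scanned invoice (made-up sample data).
ocr_text = """\
ACME Supplies Ltd.
Invoice number: INV-2022-0042
Item: Wooden blocks, 20 pcs
Amount due: 149.90 EUR
"""

# One regular expression per data point we want to extract.
FIELD_PATTERNS = {
    "invoice_number": re.compile(r"Invoice number:\s*(\S+)"),
    "amount_due": re.compile(r"Amount due:\s*([\d.,]+)"),
}


def extract_fields(text):
    """Return the first match for each known field, or None if it is missing."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(1) if match else None
    return fields


print(extract_fields(ocr_text))
# {'invoice_number': 'INV-2022-0042', 'amount_due': '149.90'}
```

Fixed patterns like these only work for known layouts, which is one reason OCR output is often handed to more flexible NLP methods.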
Object and Face Recognition
Anyone who has ever used services or apps such as Google Photos or Google Lens is probably already familiar with such technologies. Google Photos, for example, can sort images in the cloud by people or animals. A single click and I only see photos that show my dog. With Google Lens, objects can be analyzed and classified live via the camera. What is the name of the tree species in your garden? Which insect species does this beetle belong to? Google Lens will tell you within seconds and bring up related Wikipedia articles. The service now covers more than a billion objects.
Of course, these technologies are not limited to private photos but can be extended to a wide variety of applications, such as forensic analysis. What used to be considered futuristic and was more likely to appear in TV series like “Criminal Minds” is now quite possible and real: algorithms that compare the faces of people in camera recordings with databases of criminals.
Healthcare
Computer vision is also finding its way into healthcare, because the algorithms are suitable for evaluating MRI scans or other imaging methods and can, for example, identify tumors or other abnormalities. The algorithms achieve a high level of accuracy of up to 98% and can thus support doctors in anamnesis and diagnosis.
Such technologies can now also be used easily by private individuals. For example, with the “SkinScreener” app, you can easily scan your skin from home and have moles examined for irregularities and skin cancer. Nevertheless, one should not rely solely on the results of algorithms, but always consult a doctor.
Augmented reality
The future will play out in virtual reality and augmented reality. Augmented reality in particular benefits from computer vision, as this is what makes it possible to project objects into the real world. Wearables like Google Glass can, for example, recognize machines and project operating instructions and documents into the glasses.
Only a few use cases have been outlined in this article, but the spectrum is huge. From self-driving cars to reading addresses on envelopes to evaluating emotions in customer conversations, computer vision is used across many areas of business.
As diverse as the advantages and added value of the technology are, there are also caveats and downsides, especially with regard to ethical and data protection aspects.
Ethical and legal aspects
Image material of people in particular requires special handling, because it can contain so-called “special categories of personal data”. This means characteristics such as ethnic origin, religious beliefs or health-related data that can be clearly assigned to a person. The General Data Protection Regulation (GDPR) generally prohibits the processing of such data unless an exception applies, such as the explicit consent of the data subject.
The reasons for this are varied and often have an ethical background, such as potential discrimination against people because of their ethnicity or world view. Especially when processing image material that depicts people, it is important to precisely define the legal framework and the purpose of the processing.
Computer vision also harbors risks from a social perspective, for example in government surveillance. The precise identification of people, combined with hardware such as cameras in public places, makes it possible to monitor almost every step a person takes and consequently to invade their privacy.
Conclusion
Computer vision is becoming more and more important in our world, and its advantages are obvious. It is important to be aware of the legal framework when using it and to use the technology in an ethical way. As always with AI, the devil is in the details, and you should get a clear picture of the data situation and the use case before applying computer vision. If, for example, the training data for the algorithm lacks the required quality and breadth, the machine will not deliver precise results either.
- May 12, 2022