Image recognition

Image recognition is the process of detecting and identifying an object or a feature present in images or videos, captured by a digital camera. Though seemingly easy task for humans, it has been astronomically difficult for computers to understand images. This concept is building block of computer vision and artificial intelligence. And used in wide varieties of real world systems, which include factory automation, autonomous cars, security surveillance, medical diagnosis, visual search, toll booth monitoring, face recognition, gender-identification, biometrics etc.

What was used before for image-recognition

At fundamental level, most of the algorithms for image recognition use explicit/implicit combination of feature extraction and classification algorithms. Starting from primitive features from images like colors, edges, gradients, several intelligent methods have been proposed for the process of feature extraction by researchers. Several researchers have adopted algorithms from various fields such as linear-algebra, statistics, natural language processing and more importantly machine learning to solve the problem of image-recognition. To track the progress of the field, several datasets have been introduced and currently IMAGENET is the largest and most challenging dataset that is available. IMAGENET has more than 1 million images grouped to approximately 1000 classes. In the recent few years, algorithms known as convolutional neural networks(CNN’s) helped researchers achieve near human accuracies in understanding various kinds of images. In the next few sections, we briefly introduce CNN’s, their advantages compared to previous methods, a brief overview of the ground-breaking results on IMAGENET dataset and other possible applications and we conclude with future directions.

CNN’s, its history, recent breakthroughs and results

Convolutional neural networks are variants of generic algorithms called artificial neural networks(ANN’s). We will talk briefly about ANN’s, recent developments and come back to CNN’s again. The field of ANN’s has a long history starting from “Perceptron algorithm” in 1950’s. Researchers discovered that Perceptron algorithm cannot approximate many nonlinear decision functions ( eg: XOR function ) and by 1980’s, they found a solution to that problem by stacking multiple layers of linear classifiers. Essentially they achieved solutions by using functions of functions of functions… on input data; ANN’s took off for a while due to constraints on computational power and availability of labelled datasets.

Since late 2000’s, ANN’s have recovered and become more successful, mainly due to availability of inexpensive, parallel hardware( GPUs, computer clusters ) and a massive amounts of labelled data. Researches found that by making the ANN’s deeper ie., by stacking more layers ( Hence the term “Deep learning”), they were able to achieve groundbreaking results in several hard problems of AI, mainly in the fields of computer vision, speech recognition, and language modeling. Perhaps most important reason for huge success of ANN’s is that they have millions of parameters and can approximate very nonlinear functions, when modeled right complicated problems like image-recognition can be solved to greater extent. From 2006-2012, few other breakthroughs include the use of rectified linear units, dropout mechanisms, improvements in training and prediction quality.

Image depicting, several processing layers of CNN. Image courtesy by Stanford course on CNN’s (

CNN’s are specifically designed for digital images, with many processing layers, an example CNN is depicted in the above image. Images have very high dimensionality, for eg., a color image of size 250×250 is represented by 187500 positive numbers. It becomes extremely difficult to train ANN’s on such high-dimensional data with many layers. Hence CNN’s are introduced, CNN’s are connected locally and they share weights among nearby nodes/pixels, which are very similar to neurons in the brain. Basic design principle of CNN’s comes from signal processing concept known as convolution. In convolution we apply a same set of weights to process the input signal. In practice, CNN’s also comes with another layer known as “max-pooling layer”, which selects the maximum value locally, and is invariant to shifts in the input.

CNN’s when  compared to other image recognition methods, use relatively little pre-processing. This means that they are not hand-engineered and are responsible to learn filters themselves, which is major advantage of CNN’s. An analysis of CNN’s after learning reveals that CNN’s similar to humans, they learn level-by-level with multiple levels of abstraction. That is in the first few layers, they learn basic shapes( lines, dots, etc ), in the few layers they tend to learn little complicated shapes ( Eg: patterns, rectangles, circles,  etc., ) and slowly they tend to learn the building blocks of natural and man-made objects. Hence, they are flexible and easily adoptable to solve multiple problems similar to image recognition.

Since 2012, CNN’s have undergone rapid developments, many novel architectures have been proposed which resulted in improvements in image recognition. Notably, CNN’s on MNIST database, have achieved an error rate of 0.23 in 2012, and dropped down to an surprisingly low error rate of 0.06656 by 2015 ( GoogleNet, the foundation of Deep dream ). This is by far the best and comparable to what humans could recognize. The main difference between CNN’s and humans is that CNN’s find it difficult to recognize very small objects ( such as ants on a stem ) and images that are heavily distorted by filters. These kind of images rarely trouble humans. However, humans find it difficult to recognize fine-grained categories of objects such as particular species of bird or dog, where as CNN’s handle this very easily.

What’s next?

CNN’s have also achieved fascinating results in other image related problems such as face recognition, video-action recognition etc., Surprisingly due to their flexibility, they have also achieved competitive and sometimes best results in the sub-problems of Natural language processing such as semantic parsing, search query retrieval, sentence modeling, classification, prediction etc.,  CNN’s have been successfully adopted by drug-discovery-community too. CNN’s were leveraged in understanding the interaction between molecules and biological proteins to identify potential treatments that are more likely to be effective and safe. This gave birth to ATOMNET, that was used to predict novel candidate biomolecules for several disease targets, most notably treatments for the Ebola virus and multiple sclerosis. The future of CNN’s is very exciting, we may be seeing breakthroughs in time-series data, mainly in using the combination of recurrent-neural-network (RNN’s) and long-short-term memory networks (LSTM’s) to better solve the problems of video understanding, hand-writing recognition/simulation, speaker recognition in clutter, predicting diseases and predicting stock markets and the list goes on.