One of the most common problems in video processing is image recognition.

Simply put, given an image, we want to be able to say whether it contains a cat, a person, a car, or just some random background.

This problem has been widely studied, and it is at the base of, and progresses in parallel with, the ImageNet Large Scale Visual Recognition Challenge (ILSVRC).

In this short tutorial we will learn how to build a binary classifier that can tell whether a given image contains a person or not.

**Logistic regression is probably the simplest binary classifier available.**

The logical steps of this application are:

- Prepare the input features: images usually come as arrays of shape (nx, ny, 3), where nx and ny are the horizontal and vertical sizes of the image, and 3 is the number of color channels (R, G, B). We flatten each image into a column vector, e.g. v = image.reshape(nx*ny*3, 1), and stack these vectors as the columns of the input matrix X
- Prepare the Y truth labels as a (1, no_examples) vector, containing 1 for “person” and 0 for “not a person”
- Our model is then A = sigmoid(np.dot(W, X) + b), where the sigmoid squashes the linear output into a probability between 0 and 1
- The cost function for logistic regression is cost = -(1/m)*np.sum(Y*np.log(A) + (1-Y)*np.log(1-A)), where m is the number of training examples
- We apply gradient descent to adjust W and b so as to minimize the cost function
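The steps above can be sketched in numpy as follows (the function names, learning rate, and toy data are illustrative assumptions, not the tutorial's actual code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, Y, lr=0.1, iters=500):
    """Train a logistic-regression classifier with batch gradient descent.

    X: (n_features, m) input matrix, Y: (1, m) labels in {0, 1}.
    """
    n, m = X.shape
    W = np.zeros((1, n))
    b = 0.0
    for _ in range(iters):
        A = sigmoid(W @ X + b)                  # forward pass: predicted probabilities
        A_c = np.clip(A, 1e-12, 1 - 1e-12)      # avoid log(0)
        cost = -(1/m) * np.sum(Y*np.log(A_c) + (1-Y)*np.log(1-A_c))
        dZ = A - Y                              # gradient of the cost w.r.t. the linear output
        dW = (1/m) * dZ @ X.T
        db = (1/m) * np.sum(dZ)
        W -= lr * dW                            # gradient descent step
        b -= lr * db
    return W, b, cost

# Toy data standing in for flattened images: two separable clusters.
rng = np.random.default_rng(0)
X = np.hstack([rng.normal(-1, 0.5, (4, 50)), rng.normal(1, 0.5, (4, 50))])
Y = np.hstack([np.zeros((1, 50)), np.ones((1, 50))])

W, b, cost = train_logistic(X, Y)
preds = (sigmoid(W @ X + b) > 0.5).astype(int)
accuracy = np.mean(preds == Y)
```

On real image data X would hold the flattened (nx*ny*3, 1) image vectors as columns, but the training loop is the same.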

Finally, the results are:

For 200 training images and 50 test images of size (60, 60), after 2400 gradient descent iterations we obtain a training accuracy of 100% and a test accuracy of 70% – not bad for such a simple classifier.

For 20k training images and 1k test images of size (100, 60), after 30000 gradient descent iterations we obtain a training accuracy and a test accuracy of 80% – similar to the result obtained by a 5-layer deep neural network.

Why are the results so good? First, because we train a binary classifier – and logistic regression excels at this; second, because we have a very large number of features.

Here you can find the Python code for this tutorial.

**A second, better option would be a Neural Network**

We model and train, using TensorFlow, a simple NN with 3 layers of 25, 12, and 2 neurons.

We have 24000 input features (60 px × 100 px × 4 channels) and 2 output classes.

The weight matrices will thus have the shapes:

W1 : [25, 24000] , b1 : [25, 1]

W2 : [12, 25] , b2 : [12, 1]

W3 : [2, 12] , b3 : [2, 1]
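To see how these shapes fit together, here is a minimal numpy sketch of the forward pass (the activations and random initialization are assumptions for illustration; the tutorial itself trains the network with TensorFlow):

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def softmax(z):
    # Column-wise softmax, numerically stabilized.
    e = np.exp(z - z.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(0, 0.01, (25, 24000)), np.zeros((25, 1))  # W1: [25, 24000]
W2, b2 = rng.normal(0, 0.01, (12, 25)),    np.zeros((12, 1))  # W2: [12, 25]
W3, b3 = rng.normal(0, 0.01, (2, 12)),     np.zeros((2, 1))   # W3: [2, 12]

X = rng.normal(size=(24000, 5))   # a batch of 5 flattened images
A1 = relu(W1 @ X + b1)            # shape (25, 5)
A2 = relu(W2 @ A1 + b2)           # shape (12, 5)
A3 = softmax(W3 @ A2 + b3)        # shape (2, 5): class probabilities per image
```

Each matrix maps the previous layer's activations to the next layer's, so the column dimension of each W must match the row dimension of the layer below it.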

Running the training for this network for 150 epochs with the Adam optimizer and a learning rate of 1e-4 yields promising results:

Train Accuracy: 0.951913

Test Accuracy: 0.945743
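A network with this configuration could be defined in modern TensorFlow roughly as follows (a sketch only – the layer activations and loss are assumptions, since the tutorial does not specify them):

```python
import tensorflow as tf

# Three dense layers of 25, 12, and 2 neurons, taking 24000 input features.
# Note: Keras stores each kernel transposed relative to the W shapes above,
# e.g. (24000, 25) instead of [25, 24000], but the parameter counts match.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(25, activation="relu"),
    tf.keras.layers.Dense(12, activation="relu"),
    tf.keras.layers.Dense(2, activation="softmax"),
])
model.build(input_shape=(None, 24000))

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),  # lr from the text
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
```

Training for 150 epochs would then be a single `model.fit(X_train, y_train, epochs=150)` call on the flattened image data.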