
Saturday, October 10, 2015

Convolutional Neural Net (CNN) vs. Feed-forward Neural Net vs. Human Visual-Processing System.

Recently, I was working on a digit-classification task. I trained a variant of the LeNet CNN on a modified MNIST dataset. The modification is the addition of non-digit samples, which are generated as follows:
1. For each image in the dataset, create four clones, shifted by 50% upward, downward, to the left, and to the right, respectively.
2. Each of the four images from step (1) is then translated a second time, along the other axis: an image that was shifted upward/downward in (1) is now shifted randomly (0-25%) to the left/right, and one that was shifted to the left/right in (1) is now shifted upward/downward. (A rough sketch of this augmentation follows the list.)
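To make the augmentation concrete, here's a rough sketch of the idea in NumPy. The helper names, the zero-fill behavior, and the exact jitter amounts are illustrative assumptions, not the exact code I used:

```python
import numpy as np

def shift(img, dy, dx):
    """Shift a 2D image by (dy, dx) pixels, filling the exposed area with zeros."""
    out = np.zeros_like(img)
    h, w = img.shape
    ys, yd = (slice(0, h - dy), slice(dy, h)) if dy >= 0 else (slice(-dy, h), slice(0, h + dy))
    xs, xd = (slice(0, w - dx), slice(dx, w)) if dx >= 0 else (slice(-dx, w), slice(0, w + dx))
    out[yd, xd] = img[ys, xs]
    return out

def make_non_digits(img, rng=np.random):
    """Generate four non-digit samples from one 28x28 MNIST digit."""
    h, w = img.shape
    samples = []
    for dy, dx in [(-h // 2, 0), (h // 2, 0), (0, -w // 2), (0, w // 2)]:
        shifted = shift(img, dy, dx)  # step 1: 50% shift up/down/left/right
        # step 2: small random translation (0-25%) along the other axis
        if dx == 0:  # was shifted vertically -> now jitter horizontally
            jitter = rng.randint(0, w // 4 + 1) * rng.choice([-1, 1])
            shifted = shift(shifted, 0, jitter)
        else:        # was shifted horizontally -> now jitter vertically
            jitter = rng.randint(0, h // 4 + 1) * rng.choice([-1, 1])
            shifted = shift(shifted, jitter, 0)
        samples.append(shifted)
    return samples
```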

The network itself is modified to output one extra class, which is used to classify a non-digit input. All the generated non-digit samples described above are labeled with this class.
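In other words, the classifier ends up with 11 output classes: the ten digits plus one "non-digit" class. Below is a minimal sketch of such a LeNet-style network. I'm writing it in PyTorch purely for illustration; the framework and the layer sizes are assumptions, not the exact setup I trained:

```python
import torch.nn as nn

NUM_CLASSES = 11  # digits 0-9 plus one extra "non-digit" class

class LeNetVariant(nn.Module):
    """LeNet-style CNN for 28x28 grayscale inputs, with an extra non-digit class."""
    def __init__(self, num_classes=NUM_CLASSES):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5, padding=2),  # 28x28 -> 28x28
            nn.ReLU(),
            nn.MaxPool2d(2),                            # 28x28 -> 14x14
            nn.Conv2d(6, 16, kernel_size=5),            # 14x14 -> 10x10
            nn.ReLU(),
            nn.MaxPool2d(2),                            # 10x10 -> 5x5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120),
            nn.ReLU(),
            nn.Linear(120, 84),
            nn.ReLU(),
            nn.Linear(84, num_classes),                 # 11 logits instead of 10
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```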

[Post the images that are falsely classified as digit here]

When used on an image of a properly aligned digit, the trained classifier worked pretty well. When it was used in a sliding-window fashion, though, the results were disappointing: there were still so many false positives. It incorrectly classified regions that obviously looked like non-digits as digits. I suspected the CNN architecture was the culprit here. My thought is that since a convolution layer "combines" neighboring pixels, multiple convolution layers followed by a fully-connected layer end up "combining" much more than just neighboring pixels, resulting in potentially unexpected connectivity between pixels, and hence unexpected activations.
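For reference, the sliding-window setup I'm describing looks roughly like the sketch below. The window size, stride, and confidence threshold are arbitrary choices for illustration:

```python
import numpy as np
import torch

def sliding_window_digits(image, model, win=28, stride=4, threshold=0.9):
    """Run the 11-class classifier over every window of a larger grayscale image.

    Returns (row, col, predicted_digit, confidence) for windows the model
    labels as anything other than the non-digit class, with high confidence.
    """
    model.eval()
    detections = []
    h, w = image.shape
    with torch.no_grad():
        for y in range(0, h - win + 1, stride):
            for x in range(0, w - win + 1, stride):
                patch = image[y:y + win, x:x + win].astype(np.float32)
                inp = torch.from_numpy(patch).view(1, 1, win, win)
                probs = torch.softmax(model(inp), dim=1).squeeze(0)
                cls = int(probs.argmax())
                if cls != 10 and probs[cls] >= threshold:  # class 10 = non-digit
                    detections.append((y, x, cls, float(probs[cls])))
    return detections
```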

After that, I was wondering whether a feed-forward neural net would produce better results when used in a sliding-window fashion. My thinking is that since its connectivity is more straightforward, its activation units would require more "exact" matches to trigger, just as it is less robust to translation than a CNN, and hence it should be more robust to negative samples. When I get a chance, I'd really like to experiment with this. I'm thinking of using a deep-visualization technique to analyze the robustness of a CNN vs. a feed-forward neural net against negative samples.
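The baseline I have in mind for that comparison is just a plain fully-connected network on the raw pixels, something like the sketch below (layer widths picked arbitrarily, again just an illustration):

```python
import torch.nn as nn

class PlainFeedForward(nn.Module):
    """Fully-connected baseline on raw 28x28 pixels, same 11 output classes."""
    def __init__(self, num_classes=11):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(28 * 28, 300),
            nn.ReLU(),
            nn.Linear(300, 100),
            nn.ReLU(),
            nn.Linear(100, num_classes),
        )

    def forward(self, x):
        return self.net(x)
```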

Also, it's worth mentioning that I came across End-to-End Text Recognition with Convolutional Neural Networks [Wang, et al.], which talks about text recognition using CNNs. What's interesting is that they actually train a dedicated classifier, used in a sliding-window fashion, to detect whether a region contains centered text or not. Only after such a region is found do they run another character classifier that actually predicts the text there.
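Their two-stage idea translates into roughly the following sketch: a binary detector first decides whether a window contains centered text at all, and only then is a character classifier asked what the text is. Both models below are placeholders, not the ones from the paper:

```python
def two_stage_recognition(image, detector, recognizer, win=32, stride=4, det_threshold=0.5):
    """Wang et al.-style pipeline, very roughly: detect first, recognize second.

    `detector` scores whether a window contains centered text; `recognizer`
    returns class probabilities for the character. Both take a 2D window and
    are placeholders here, not the models from the paper.
    """
    results = []
    h, w = image.shape
    for y in range(0, h - win + 1, stride):
        for x in range(0, w - win + 1, stride):
            patch = image[y:y + win, x:x + win]
            if detector(patch) < det_threshold:   # stage 1: is there centered text here?
                continue
            char_probs = recognizer(patch)        # stage 2: which character is it?
            results.append((y, x, int(char_probs.argmax())))
    return results
```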

Another interesting finding is about how a deep neural network can be fooled. In their paper, Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images, Nguyen, et al. found that one can fool state-of-the-art pattern-recognition DNNs with generated images (http://www.evolvingai.org/fooling). Their paper shows how an image can be constructed such that a DNN-based classifier labels it as a particular pattern with very high confidence. The image can be assembled so that a human can still make out the classifier-claimed pattern, or it can be one that makes no sense at all (a false positive).

The need to explicitly train neural networks against negative samples got me thinking about how our (human) visual-processing system works. It seems that we can make out a particular shape against a very complex background very easily. I'm interested in understanding how our brain actually achieves this. With that understanding, we would be able to come up with better architectures for recognition problems, as well as simplify the tedious process of gathering training data. While gathering enough positive samples - in order to train a classifier that works reasonably well - is already tedious work, gathering negative samples is even more challenging. If the size of the positive set within the universe of all samples is X, the size of the negative counterpart is the size of the entire universe minus X. Imagine how large that is!

