Hi, this is the second post in the Siamese networks mini-series. You are more than welcome to read the first one; it explains the basics needed for this post.
Last time we talked a bit about Siamese networks, watched them fail miserably, and understood that the failure was not their fault. Why? Because we asked them to do things they are not built to do.
This post makes a more realistic demand of Siamese networks: learning a spatial representation. We will use the MNIST dataset and apply a Siamese network to a semi-supervised learning task, creating a new N-dimensional space in which the dataset is well represented. This is similar to an auto-encoder and unsupervised learning, but it uses the data labels to create a more efficient spatial projection, hence the "semi-supervised".
As explained here, we can map the Siamese loss onto a binary cross-entropy loss, where y is 1 if the two examples are of the same class and 0 otherwise. This gives the network an incentive to place images of the same class close together in the new space, and images of different classes far apart.
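Concretely, the pairwise loss can be sketched as follows. This is a minimal NumPy illustration: mapping the dimension-wise absolute difference of the two embeddings through a learned linear layer and a sigmoid is one common choice, and the embeddings and weights below are toy placeholders, not values from the actual network.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Standard BCE: y_true is 1 for same-class pairs, 0 otherwise."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# Toy embeddings for two image pairs (in practice these come from the shared branch f).
f_x1 = np.array([[0.90, 0.10], [0.20, 0.80]])
f_x2 = np.array([[0.85, 0.15], [0.90, 0.05]])  # pair 0: similar, pair 1: dissimilar

# Hypothetical learned weights mapping |f(x1) - f(x2)| to a "same class" probability.
w, b = np.array([-5.0, -5.0]), 2.0

diff = np.abs(f_x1 - f_x2)       # dimension-wise absolute difference
p_same = sigmoid(diff @ w + b)   # high when the embeddings are close

y = np.array([1.0, 0.0])         # 1 = same class, 0 = different class
loss = binary_cross_entropy(y, p_same)
```

Minimizing this loss pushes same-class pairs toward small differences (probability near 1) and different-class pairs toward large ones (probability near 0).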
To measure the success of the network, we will compare two setups:
- Calculate the absolute values of the dimension-wise subtraction between two raw images and use the result as input for an SVM classifier.
- Pass both images through a forward propagation of the Siamese network to extract their new spatial representations, calculate the absolute values of the dimension-wise subtraction between those representations, and use the resulting vector as input to train an SVM classifier.
Both classifiers will try to use their input (two images, represented as vectors) to determine whether the two images are of the same class or not.
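The two setups can be sketched roughly like this with scikit-learn. The `embed` function is a hypothetical stand-in for the trained Siamese branch, and the random arrays stand in for flattened MNIST pairs; names and parameters are illustrative, not the exact code from the repo.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Stand-ins for flattened 28x28 MNIST image pairs and their pair labels.
x1, x2 = rng.random((200, 784)), rng.random((200, 784))
y = rng.integers(0, 2, 200)  # 1 = same class, 0 = different

def embed(x):
    """Hypothetical trained Siamese branch: maps 784-d images to an N-d space."""
    return x[:, :32]  # placeholder projection, not a real learned embedding

# Setup 1: SVM on raw pixel-wise absolute differences.
svm_raw = SVC().fit(np.abs(x1 - x2), y)

# Setup 2: SVM on absolute differences of the learned embeddings.
svm_emb = SVC().fit(np.abs(embed(x1) - embed(x2)), y)
```

In the real experiment, the second setup is trained on the network's embeddings of held-out pairs, which is where the 80% vs. 74% comparison below comes from.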
With the final setup we get better performance on the semi-supervised task using the Siamese network (80% vs. 74%). This is of course not the optimal setup, and hyper-parameter tuning is still needed, but it is good enough to show we are going somewhere. Please see the code in this GitHub repo. I tried to provide it as a Colab link, but sadly the GPU and TPU kernels did not survive to the end of the (short) run. You can download it and run it using Jupyter Notebook.
Note that when I ran this network, I ran into multiple obstacles:
- Forgot to take the absolute value in the last layer and got inferior results
- Put the dropout on the first layer and got inferior results
- Used too high a learning rate and got bad results
- A larger batch size made training run faster but hurt the classification results (interestingly, without hurting the loss as much)
A quick recap of what we gain here:
- A better representation of our data in our domain always helps in terms of learning resources (time and, hopefully, data)
- If we can learn a spatial representation that, just by placing two data samples in the space, lets us determine whether they are of the same class or not, we are one step closer to one-shot learning. That would allow us, for instance, to determine whether two images show the same person, without having learned to classify them as that specific person in advance.
And here is some code for all the devoted readers who reached the end:
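Since the full code lives in the repo, here is only a condensed PyTorch sketch of the setup described above; the layer sizes, dropout rate, and learning rate are illustrative choices that follow the obstacles listed earlier, not necessarily the repo's exact values.

```python
import torch
import torch.nn as nn

class SiameseNet(nn.Module):
    """Minimal Siamese sketch: shared branch + abs-diff + sigmoid head."""
    def __init__(self, in_dim=784, emb_dim=32):
        super().__init__()
        self.branch = nn.Sequential(          # shared weights embed both images
            nn.Linear(in_dim, 128), nn.ReLU(),
            nn.Dropout(0.3),                  # after the first layer, not on the input
            nn.Linear(128, emb_dim), nn.ReLU(),
        )
        self.head = nn.Linear(emb_dim, 1)

    def forward(self, x1, x2):
        diff = torch.abs(self.branch(x1) - self.branch(x2))  # the crucial abs
        return torch.sigmoid(self.head(diff))

model = SiameseNet()
loss_fn = nn.BCELoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)  # keep the learning rate modest

# One toy training step on random stand-ins for MNIST pairs.
x1, x2 = torch.rand(8, 784), torch.rand(8, 784)
y = torch.randint(0, 2, (8, 1)).float()      # 1 = same class, 0 = different
loss = loss_fn(model(x1, x2), y)
loss.backward()
opt.step()
```

After training, `self.branch` alone gives the new spatial representation used to feed the SVM in the second setup.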
Many thanks to Ortal Dayan for the help editing my posts!