We all love deep learning. It gives us stronger classification capabilities than ever before, and it lets us run a strong classification algorithm without extracting highly specialized features that require deep domain knowledge. That sounds like an absolute dream.
However, when working in the data science industry, we see that a big part of the market is still dominated by “classic” machine learning algorithms, which are considered weaker than deep learning. Everyone first learning about deep learning might ask themselves the good old “BUT WHY?!”
Why should anyone willingly use a weaker algorithm with inferior generalization capabilities, that also requires having domain knowledge and extracting features?
The unspoken downside of deep learning is its need for a large amount of training data. How big? BIG.
When working with a deep learning model, the more complex it gets, the more data it needs to converge well without overfitting. With supervised learning models, the data must also be labeled, which makes a good dataset even harder to obtain. And even when we do have a great dataset, say for classification, and can build a great model, what happens when the real world presents us with a new class that we know nothing about?
A good use case where this happens is facial recognition: we can train a great model in the lab with a huge dataset, but that model is then expected to perform well for a client who is absent from the original dataset. Classic models based on end-to-end deep learning may struggle with that use case, because they do not have enough data to learn how to deal with this new “class”.
One approach is transfer learning: train a model to classify face images, then use its internal layers to extract good features from the new images. This lowers the dimensionality of the input, which lets us train a model with fewer data samples (assuming the original model learned how to extract meaningful features from the data). We will get into transfer learning in a future post, since we are here to talk about the other option.
Siamese networks (see 1 – 2) are a cool, relatively new deep learning architecture designed to tackle this exact issue: classifying new classes with only a small dataset. How is it done, you ask?
Siamese networks are built from two or more networks (for simplicity, we will talk about two) that share weights. Their goal is to map the data samples into a latent space where separating the classes is easy. In its simplest implementation, a Siamese network consists of multiple layers with a binary classification node at the end. Its input is a pair of data samples, and its expected output is 1 if the two samples belong to the same class and 0 if not. We expect the layers to learn how to extract features that are relevant to that question, which allows an easy separation between classes.
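The pair construction described above can be sketched roughly as follows. This is only an illustrative sketch, not the code from the post's repo: `make_pairs` is a hypothetical helper name, it uses plain Python lists instead of tensors, and it assumes the dataset contains at least two classes.

```python
import random
from collections import defaultdict


def make_pairs(samples, labels, seed=0):
    """Build (sample_a, sample_b, target) triples for a Siamese network.

    target is 1 when both samples share a label and 0 otherwise.
    Assumes at least two distinct labels are present.
    """
    rng = random.Random(seed)

    # Group the samples by class so partners can be drawn per class.
    by_label = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_label[label].append(sample)
    classes = list(by_label)

    pairs = []
    for sample, label in zip(samples, labels):
        # Positive pair: a partner from the same class
        # (in this simple sketch it may be the sample itself).
        pairs.append((sample, rng.choice(by_label[label]), 1))
        # Negative pair: a partner from a randomly chosen other class.
        other = rng.choice([c for c in classes if c != label])
        pairs.append((sample, rng.choice(by_label[other]), 0))
    return pairs
```

Each original sample yields one positive and one negative pair, so the resulting training set is balanced between the two targets. That is a design choice of this sketch, not something the post prescribes.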
A well-trained Siamese network takes a pair of data samples and predicts whether they belong to the same class, even for classes we did not have a good chance to train on. This is called “one-shot learning” (see example).
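To make the one-shot idea concrete, here is a small sketch of how a trained pair scorer could be used to label a sample from a new class. Everything in it is an assumption made for illustration: `one_shot_classify` is a hypothetical helper, and `same_class_score` stands in for the trained Siamese network, taking two samples and returning a same-class score between 0 and 1.

```python
def one_shot_classify(query, references, same_class_score):
    """Pick the label whose single reference example best matches the query.

    references: dict mapping each label to ONE example of that class.
    same_class_score: callable (a, b) -> float, higher means "more likely
        the same class". In practice this would be the trained network.
    """
    return max(
        references,
        key=lambda label: same_class_score(query, references[label]),
    )


# Toy stand-in for a trained network: numbers that are close together
# are treated as "the same class".
def toy_score(a, b):
    return 1.0 / (1.0 + abs(a - b))


predicted = one_shot_classify(8.5, {"low": 0.0, "high": 10.0}, toy_score)
```

The point of the sketch is that adding a new class only requires adding one reference example to `references`. The scorer itself is not retrained.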
Here is an example of a Keras (tf.keras) implementation at its simplest. Hint: it does not work well yet. A bit about what is going on here: this is a simple feed-forward network, taking two MNIST data samples as input,
from tensorflow import keras
# One input per flattened 28x28 MNIST image.
input1, input2 = keras.layers.Input((28 * 28,)), keras.layers.Input((28 * 28,))
using the same set of weights to feed both of them forward, then trying to give us a binary classification: are the two of the same class?
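# Shared layers: the same two Dense layers (and so the same weights) process both inputs.
emb1 = keras.layers.Dense(units=1000, activation="relu")
emb2 = keras.layers.Dense(units=100, activation="sigmoid")
output1, output2 = emb2(emb1(input1)), emb2(emb1(input2))
# Concatenate the two embeddings and predict whether the pair shares a class.
output = keras.layers.Dense(units=1, activation="sigmoid")(keras.layers.Concatenate(axis=1)([output1, output2]))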
This does not work very well, and why should it? Neural networks very rarely behave the way we expect them to on the first attempt. A lot of work has been put into this issue.
Hint: a promising direction is to improve the network with a Euclidean distance function after the last hidden layer, instead of just “leaving” the data there. That makes sense, since the Siamese network is meant to give us a good spatial representation. Another option is to use something called “contrastive loss” over the last layer instead of connecting it to a binary classification layer.
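As a rough illustration of these two ideas, the sketch below computes the Euclidean distance between two batches of embeddings and then one common form of contrastive loss on top of it. It uses NumPy rather than Keras to keep the arithmetic visible. The function names and the default `margin` of 1.0 are assumptions made here, and the label convention follows this post (1 means same class), which some papers define the other way around.

```python
import numpy as np


def euclidean_distance(emb_a, emb_b):
    """Row-wise Euclidean distance between two batches of embeddings."""
    emb_a = np.asarray(emb_a, dtype=float)
    emb_b = np.asarray(emb_b, dtype=float)
    return np.sqrt(np.sum((emb_a - emb_b) ** 2, axis=1))


def contrastive_loss(same_class, distance, margin=1.0):
    """One common form of contrastive loss, averaged over the batch.

    same_class: 1 for pairs from the same class, 0 otherwise.
    distance:   embedding distance for each pair.
    margin:     different-class pairs farther apart than this cost nothing
                (assumed default, tune for real data).
    """
    same_class = np.asarray(same_class, dtype=float)
    distance = np.asarray(distance, dtype=float)
    # Same-class pairs are penalized for being far apart.
    pull = same_class * distance ** 2
    # Different-class pairs are penalized only when closer than the margin.
    push = (1.0 - same_class) * np.maximum(margin - distance, 0.0) ** 2
    return float(np.mean(pull + push))


emb_a = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
emb_b = [[0.0, 0.0], [3.0, 4.0], [0.5, 0.0]]
distances = euclidean_distance(emb_a, emb_b)
loss = contrastive_loss([1, 0, 0], distances)
```

In this toy batch only the third pair contributes to the loss: it is a different-class pair that sits inside the margin, so the loss pushes it apart. In a Keras model the same arithmetic would be written with tensor operations so that gradients can flow through it.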
Stay tuned: in the next post I will show the performance gap between the two methods in terms of loss improvement.
Check out this repo if you would like to see the code for implementing that basic network.
I love cats. Deal with it. And also this is a siamese cat, which is a bonus.
For the Hebrew version, please visit the AI blog