Hi @user2020 thanks for trying out stellargraph!
> My understanding is that I do not need steps 11, 12, 13 and 14 in order to generate node embeddings. Am I correct?
I’m guessing you might be referring to the Jupyter notebook execution numbers to the left of each cell containing code? Below is the code I’m seeing in those cells in the demo notebook:
```python
prediction = link_classification(
    output_dim=1, output_act="sigmoid", edge_embedding_method="ip"
)(x_out)

model = keras.Model(inputs=x_inp, outputs=prediction)

history = model.fit(...)
```
This step is necessary to train the unsupervised model: it frames unsupervised learning as a form of link prediction, where the model learns to predict whether a pair of nodes is close together or far apart in the graph.
When you don’t have node labels to train on, unsupervised learning with GraphSAGE (and many other graph methods) takes the approach of defining a proxy task that lets the model learn node embeddings reflecting how close nodes are to one another. Generally speaking, nodes that are close together in the graph should have similar embeddings, and nodes that are far apart should not. This fits naturally into the framework of link prediction, since a link prediction model also learns to distinguish pairs of nodes that are close (i.e. should be connected) from pairs that are far apart (i.e. should not be connected) - that’s why the `link_classification` function is being used here.
You can have a look at Section 3.2, Equation 1 of the GraphSAGE paper - the loss function for unsupervised learning is defined via an inner product of two node embeddings passed through a sigmoid activation. The negative sampling component comes from running the `UnsupervisedSampler` in our code.
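If it helps to see the shape of that loss, here's a rough NumPy sketch of Equation 1 from the paper (the function name, argument names and `Q` default are mine, not StellarGraph API):

```python
import numpy as np

def unsupervised_graphsage_loss(z_u, z_v, z_neg, Q=1.0):
    """Sketch of GraphSAGE Eq. 1: z_u/z_v are embeddings of a positive
    (co-occurring) node pair; z_neg is a matrix of negative-sample embeddings."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    # positive term: push embeddings of nearby nodes together
    pos = -np.log(sigmoid(np.dot(z_u, z_v)))
    # negative term: push embeddings of sampled "far" nodes apart
    neg = -Q * np.mean(np.log(sigmoid(-(z_neg @ z_u))))
    return pos + neg
```

A pair of similar embeddings produces a lower loss than a dissimilar pair, which is exactly the signal the "link prediction" framing is exploiting.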
> I am a bit confused with the sentence “Note that this model’s weights are the same as those of the corresponding node encoder in the previously trained node pair classifier.” Which previously trained node-pair classifier are you referring to?
The trained classifier is the `model` variable used in the cells I’ve quoted above. Once we’ve trained the model with the above approach - since we don’t actually want to do link prediction, just generate node embeddings - we need to:
- use the weights that have been learned in the model
- but build a model with a slightly different structure (otherwise the model will be expecting pairs of nodes as input, and returning scores between 0 and 1 as output!)
which is what the code below is doing, using the inputs/outputs of the model from before to build a model with slightly different inputs/outputs:
```python
x_inp_src = x_inp[0::2]
x_out_src = x_out
embedding_model = keras.Model(inputs=x_inp_src, outputs=x_out_src)
```
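To illustrate what the `x_inp[0::2]` slice does, here's a toy list standing in for the tensor list (the interleaving shown is my reading of how the link generator pairs source/destination inputs per hop, so treat the names as illustrative):

```python
# stand-in for x_inp: source and destination tensors interleaved per hop
x_inp = ["src_hop0", "dst_hop0", "src_hop1", "dst_hop1"]

# take every second element starting at index 0 -> source-side inputs only
x_inp_src = x_inp[0::2]
print(x_inp_src)  # ['src_hop0', 'src_hop1']
```

So the embedding model ends up with inputs for a single node rather than a node pair.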
When you build multiple Keras models that refer to a common set of input/output tensors (the `x_inp` and `x_out` variables being used here), they end up sharing the layers and weights that connect those tensors in between as well.
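Here's a small standalone sketch of that Keras behaviour (not StellarGraph code; all names are mine): two models built from the same tensors share the layer in between, so the second model reuses the first model's trained weights for free.

```python
import numpy as np
from tensorflow import keras

# a pair of inputs passed through one shared Dense layer,
# loosely mimicking the paired (source, destination) structure
inp_a = keras.Input(shape=(4,))
inp_b = keras.Input(shape=(4,))
shared = keras.layers.Dense(2)
out_a = shared(inp_a)
out_b = shared(inp_b)

# "pair" model over both inputs, and a single-input model over a subset
pair_model = keras.Model(inputs=[inp_a, inp_b], outputs=[out_a, out_b])
single_model = keras.Model(inputs=inp_a, outputs=out_a)

# both models route through the same Dense layer, so its weights are shared
x = np.ones((3, 4), dtype="float32")
a = single_model.predict(x, verbose=0)
pa, pb = pair_model.predict([x, x], verbose=0)
# a and pa are identical: same layer, same weights
```

Training `pair_model` therefore updates the weights that `single_model` uses, which is exactly why `embedding_model` inherits the trained encoder.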
I’d suggest trying to replicate all the steps in the notebook, at least up to the cell that generates node embeddings:

```python
node_embeddings = embedding_model.predict(node_gen, workers=4, verbose=1)
```
The section after that, which visualises the node embeddings, will also be useful for checking that the workflow is working correctly.
Hope that helps, happy to explain further if anything is still unclear!