StellarGraph

Unsupervised GraphSAGE example

I am trying to use GraphSAGE to generate node embeddings that will then be used, together with other non-graph features, in a model I am building. I am following this example:
https://github.com/stellargraph/stellargraph/blob/master/demos/embeddings/embeddings-unsupervised-graphsage-cora.ipynb

My understanding is that I do not need steps 11, 12, 13 and 14 in order to generate node embeddings. Am I correct?

I am a bit confused with the sentence “Note that this model’s weights are the same as those of the corresponding node encoder in the previously trained node pair classifier.” Which previously trained node-pair classifier are you referring to?

Hi @user2020, thanks for trying out StellarGraph!

My understanding is that I do not need steps 11, 12, 13 and 14 in order to generate node embeddings. Am I correct?

I’m guessing you might be referring to the Jupyter notebook execution numbers to the left of each cell containing code? Below is the code I’m seeing in those cells in the demo notebook:

# Score each sampled node pair by the inner product ("ip") of its two node
# embeddings, passed through a sigmoid to give a value between 0 and 1:
prediction = link_classification(
    output_dim=1, output_act="sigmoid", edge_embedding_method="ip"
)(x_out)

model = keras.Model(inputs=x_inp, outputs=prediction)

model.compile(
    optimizer=keras.optimizers.Adam(lr=1e-3),
    loss=keras.losses.binary_crossentropy,
    metrics=[keras.metrics.binary_accuracy],
)

# train_gen yields batches of node pairs with binary labels
# (1 = the pair co-occurs on a random walk, 0 = a negative sample):
history = model.fit(
    train_gen,
    epochs=epochs,
    verbose=1,
    use_multiprocessing=False,
    workers=4,
    shuffle=True,
)

This step is necessary to train the unsupervised model - it essentially frames the unsupervised learning problem as a “link prediction” task, where the model learns to predict whether a pair of nodes is close together or far apart in the graph.

When you don’t have any labels on nodes to train on, unsupervised learning with GraphSAGE (and many other graph methods) takes the approach of defining a surrogate task that lets the model learn node embeddings reflecting how close nodes are to one another in the graph. Generally speaking, nodes that are close together in the graph should have similar embeddings, and those that are far apart should not. This fits naturally into the framework of a link prediction task, since a link prediction model also learns to distinguish pairs of nodes that are close (i.e. should be connected) from pairs that are far apart (i.e. should not be connected), which is why the link_classification function is used here.
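
For reference, train_gen in the cells above comes from the earlier part of the same notebook, which sets up this node-pair task. Roughly, it looks like the sketch below (using the demo's default parameter values - treat it as a sketch rather than a verbatim copy of those cells):

from stellargraph.data import UnsupervisedSampler
from stellargraph.mapper import GraphSAGELinkGenerator
from stellargraph.layer import GraphSAGE, link_classification

# Sample (target, context) node pairs from short random walks; the sampler
# also produces randomly chosen negative pairs labelled 0.
nodes = list(G.nodes())
unsupervised_samples = UnsupervisedSampler(G, nodes=nodes, length=5, number_of_walks=1)

# Turn those node pairs into batches of sampled neighbourhood features.
batch_size = 50
num_samples = [10, 5]
generator = GraphSAGELinkGenerator(G, batch_size, num_samples)
train_gen = generator.flow(unsupervised_samples)

# Two-layer GraphSAGE encoder whose outputs (x_out) feed link_classification above.
graphsage = GraphSAGE(
    layer_sizes=[50, 50], generator=generator, bias=True, dropout=0.0, normalize="l2"
)
x_inp, x_out = graphsage.in_out_tensors()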

You can have a look at Section 3.2, Equation (1) of the GraphSAGE paper - the loss function for unsupervised learning is defined via an inner product of two node embeddings passed through a sigmoid activation. The negative-sampling component comes from running the UnsupervisedSampler in our code.
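
From memory, that equation is roughly (writing z_u for the embedding of node u):

J_G(z_u) = -log σ(z_u · z_v) - Q · E_{v_n ~ P_n(v)} [ log σ(-z_u · z_{v_n}) ]

where v is a node that co-occurs with u on a fixed-length random walk, σ is the sigmoid function, P_n is the negative-sampling distribution and Q is the number of negative samples. The link_classification layer with edge_embedding_method="ip" and output_act="sigmoid" computes the σ(z_u · z_v) part, while the positive and negative pairs come from the random walks run by the UnsupervisedSampler.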


I am a bit confused with the sentence “Note that this model’s weights are the same as those of the corresponding node encoder in the previously trained node pair classifier.” Which previously trained node-pair classifier are you referring to?

The trained classifier is the model variable used in the cells I’ve quoted above. Once we’ve trained the model with the above approach, we don’t actually want to do link prediction; we just want to generate node embeddings. To do that, we need to:

  • use the weights that have been learned in the model
  • but build a model with a slightly different structure (otherwise the model will be expecting pairs of nodes as input, and returning scores between 0 and 1 as output!)

which is what the code below is doing, using the inputs/outputs of the model from before to build a model with slightly different inputs/outputs:

# Keep only the source-node half of the input tensors (the even-indexed ones)...
x_inp_src = x_inp[0::2]
# ...and the corresponding source-node embedding as the output.
x_out_src = x_out[0]
embedding_model = keras.Model(inputs=x_inp_src, outputs=x_out_src)

When you build multiple Keras models that refer to a common set of input/output tensors (the x_inp / x_out variables used here), they end up sharing the layers (and therefore the trained weights) that connect those tensors.
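
As a minimal, self-contained illustration of that sharing (plain Keras, nothing StellarGraph-specific):

from tensorflow import keras

inp = keras.Input(shape=(4,))
hidden = keras.layers.Dense(8, activation="relu")(inp)   # this layer's weights...
out = keras.layers.Dense(1, activation="sigmoid")(hidden)

full_model = keras.Model(inputs=inp, outputs=out)
encoder = keras.Model(inputs=inp, outputs=hidden)         # ...are reused by this model

# Training full_model updates the Dense(8) weights, and because encoder is built
# from the same tensors, encoder.predict(...) immediately reflects that training.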


I’d suggest trying to replicate all the steps in the notebook, at least up to the cell that generates the node embeddings:

node_embeddings = embedding_model.predict(node_gen, workers=4, verbose=1)

The section after that, which visualises the node embeddings, will also be useful for checking whether the workflow is behaving correctly.
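
If it helps, that visualisation boils down to something like the sketch below (using scikit-learn's TSNE; the notebook's plotting is a little fancier, e.g. colouring points by subject label):

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Project the learned embeddings down to 2D for plotting.
emb_2d = TSNE(n_components=2).fit_transform(node_embeddings)

plt.scatter(emb_2d[:, 0], emb_2d[:, 1], s=5, alpha=0.6)
plt.title("t-SNE of GraphSAGE node embeddings")
plt.show()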

Hope that helps, happy to explain further if anything is still unclear!

Thanks for the clarification! I was wondering: what is the recommended size of data for training? I find it impossible to run model.fit on input data of close to 1 million nodes.

WARNING:tensorflow:sample_weight modes were coerced from
  ...
    to  
  ['...']
Train for 911662 steps
Epoch 1/20

This is where the code gets stuck, with no further progress. CPU usage is very low, so I doubt anything is happening in the background.

Also, is there an explanation for why I am still getting meaningful embeddings (clearly not random ones, since I get a significant performance lift from them) when running everything up to:

# Build the model and expose input and output sockets of graphsage, for node pair inputs:
x_inp, x_out = graphsage.in_out_tensors()

(including the step above), followed by:

x_inp_src = x_inp[0::2]
x_out_src = x_out[0]
embedding_model = keras.Model(inputs=x_inp_src, outputs=x_out_src)
node_gen = GraphSAGENodeGenerator(G, batch_size, num_samples).flow(connected_nodes)
node_embeddings = embedding_model.predict(node_gen, workers=4, verbose=1)

Basically, as step 1 I am trying to generate embeddings for the whole graph and compare automatic feature generation with handcrafted feature generation. In step 2, I am trying to train the model and generate embeddings for unseen nodes.