I am using StellarGraph version 0.11.1.
The graph has about 300K nodes and 2.1M edges. Each node has a feature vector of length 20, and the graph has 3 edge types.
The rest of the code is the same as in the Unsupervised Cora example notebook. Unfortunately, I cannot share the data. I added the parameters I used in the text below.
I realized that the number_of_walks and length parameters greatly affect training time. For example, on a graph with 301K nodes and 2.1M edges, training takes 32 min per epoch with number_of_walks=3 and length=2 (181K training steps), versus about 4 h with number_of_walks=10 and length=3. (I am running the unsupervised Cora notebook on a much larger data set.) Time per step is 11 ms. Is there a recommended set of parameters to start with?
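To make the scaling concrete, here is a rough back-of-the-envelope sketch. It assumes that each walk of `length` nodes yields `length - 1` positive (target, context) pairs and that each positive pair is matched by one negative pair; that counting rule is an assumption about how the sampler works, and `steps_per_epoch` is a hypothetical helper, not a library function.

```python
import math


def steps_per_epoch(num_nodes, number_of_walks, length, batch_size):
    """Estimate training steps per epoch under the assumed pair-counting rule."""
    # Assumption: a walk with `length` nodes gives (length - 1) positive pairs.
    positive_pairs = num_nodes * number_of_walks * (length - 1)
    # Assumption: one negative pair is drawn for every positive pair.
    total_pairs = 2 * positive_pairs
    return math.ceil(total_pairs / batch_size)


SECONDS_PER_STEP = 0.011  # the 11 ms per step quoted above

for walks, length in [(3, 2), (10, 3)]:
    steps = steps_per_epoch(301_000, walks, length, batch_size=10)
    minutes = steps * SECONDS_PER_STEP / 60
    print(f"number_of_walks={walks}, length={length}: {steps:,} steps, ~{minutes:.0f} min")
```

Under that assumption the two settings come out at roughly 181K and 1.2M steps, or about 33 min and 3.7 h at 11 ms per step, which is close to the timings reported above. If the rule holds, training time grows linearly with number_of_walks and with length - 1.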
I am using num_samples=[10,5]. My understanding is that num_samples gives the number of neighbors sampled at each layer: we first sample the graph using random walks to get node pairs, and then use num_samples to sample the neighborhood of each node in a pair, layer by layer. Is that correct?
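As an illustration of what num_samples=[10,5] implies, the sketch below counts how many nodes are gathered around a single head node, assuming 10 neighbors are sampled at the first hop and 5 for each of those at the second hop. `nodes_per_hop` is a hypothetical helper written for this post.

```python
def nodes_per_hop(num_samples):
    """Return the number of sampled nodes at each hop, starting with the head node."""
    sizes = [1]  # hop 0: the head node itself
    for n in num_samples:
        # Each node at the previous hop contributes n sampled neighbors.
        sizes.append(sizes[-1] * n)
    return sizes


hops = nodes_per_hop([10, 5])
per_head_node = sum(hops)      # nodes whose features are needed for one head node
per_link = 2 * per_head_node   # a link sample has two head nodes
print(hops, per_head_node, per_link)
```

So the cost of one training sample depends on the product of the num_samples entries, independently of how many pairs the random walks produce.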
Also, I am using layer_sizes=[30,30]. Is there any rule I should follow when choosing layer_sizes? Are those the numbers of hidden units in the Keras layers? Let’s say each node has 15 features and I am using num_samples=[10,5]. Is there any relationship between how I choose layer_sizes and how I choose num_samples?
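A minimal sketch of how the two lists seem to relate, assuming one GraphSAGE layer per sampled hop (so both lists need the same length) and assuming each entry of layer_sizes is the output dimension of that layer. `check_config` is a hypothetical helper, and both assumptions should be checked against the library documentation.

```python
def check_config(layer_sizes, num_samples, feature_dim):
    """Return the feature dimension after each layer, or raise if the lists mismatch."""
    # Assumption: there is exactly one layer per sampled hop.
    if len(layer_sizes) != len(num_samples):
        raise ValueError(
            f"{len(layer_sizes)} layer sizes but {len(num_samples)} sample sizes"
        )
    # Assumption: the input dimension is the node feature length,
    # and each layer maps to its own entry in layer_sizes.
    return [feature_dim] + list(layer_sizes)


dims = check_config([30, 30], [10, 5], feature_dim=15)
embedding_dim = dims[-1]  # size of the final node embedding
print(dims, embedding_dim)
```

Under these assumptions the lengths are coupled but the values are not: num_samples controls how many neighbors feed each layer, while layer_sizes controls how wide each layer's output is.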
My ultimate goal is to predict node embeddings for unseen nodes and use the embeddings with other manually generated features in model training. I am using 1/4 of the data set for training.
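from stellargraph.data import UnsupervisedSampler
from stellargraph.mapper import GraphSAGELinkGenerator
from stellargraph.layer import GraphSAGE, link_classification
from tensorflow import keras
# G is the StellarGraph object built from my data (not shown)
nodes = list(G.nodes())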
number_of_walks = 10
length = 3
unsupervised_samples = UnsupervisedSampler(
    G, nodes=nodes, length=length, number_of_walks=number_of_walks, seed=1
)
batch_size = 10
# Here I create link generator using StellarGraph object and specify sampling at each layer
generator = GraphSAGELinkGenerator(G, batch_size, num_samples=[10,5], seed=1)
train_gen = generator.flow(unsupervised_samples, seed=1)
layer_sizes = [30, 30]  # the values mentioned above
graphsage = GraphSAGE(
    layer_sizes=layer_sizes, generator=generator, bias=True, dropout=0.0, normalize="l2"
)
x_inp, x_out = graphsage.in_out_tensors()
prediction = link_classification(
    output_dim=1, output_act="sigmoid", edge_embedding_method="ip"
)(x_out)
model = keras.Model(inputs=x_inp, outputs=prediction)
history = model.fit(
Why do I need both GraphSAGELinkGenerator and GraphSAGENodeGenerator in the unsupervised example?