StellarGraph

Model.fit not finishing

I am trying to train a model using GraphSAGE, but training never completes. I reduced the size of the graph to ~330K nodes and 4.7M edges. Running

history = model.fit(
    train_gen,
    epochs=epochs,
    verbose=2,
    use_multiprocessing=False,
    workers=50,
    shuffle=True,
)

is stuck at

Train for 106508 steps
Epoch 1/5

for hours (during the last attempt it was stuck for 10h). Is there a limit on the size of the graph? I don’t understand why training is so slow.

Edit: I significantly reduced the input data set size and the number of edge types. The graph now has ~70K nodes and 250K edges. Using num_samples=[5,2] and [10,10] for the layers results in a training time of 3min per epoch. I also see no improvement when changing the workers parameter: using workers=2 or workers=100 gives the same performance. Are there any performance benchmarks for GraphSAGE, or am I doing something wrong that results in bad performance?

Hi, thanks for investigating and trying some things.

It’d help to have a bit more information about the various inputs to the training:

  1. Which version of StellarGraph? (print(stellargraph.__version__) or pip freeze can confirm this)
  2. A summary of the graph node features (graph.info() shows the dimension and type, which should be enough for this)
  3. How the GraphSAGE...Generator generator is constructed?
  4. How the train_gen input is constructed (specifically, the size of the training data)?
  5. How the GraphSAGE model is constructed?

If you’re able and comfortable to share, we’d be happy to look at the whole code and even the data.

Also, we’ve recently made the feature sampling required for GraphSAGE much faster on the development branch (#1225), which makes some configurations of GraphSAGE noticeably faster. You can try this by cloning the repository and installing from there, or by installing from the repository with pip directly:

pip install git+https://github.com/stellargraph/stellargraph.git

Let us know what you find and we can work from there.

In terms of comparative benchmarks, the “hateful twitter” demo runs on a graph of a similar size (100K nodes, 2.2M edges):

>>> print(G.info())
StellarGraph: Undirected multigraph
 Nodes: 100386, Edges: 2194979

 Node types:
  default: [100386]
    Features: float32 vector, length 204
    Edge types: default-default->default

 Edge types:
    default-default->default: [2194979]

It takes about 10s/epoch for training and validating on my machine (on CPU, not GPU), using StellarGraph 0.11.1:

Train for 15 steps, validate for 85 steps
Epoch 1/30
15/15 [==============================] - 11s 702ms/step - loss: 0.3407 - acc: 0.8819 - val_loss: 0.4325 - val_acc: 0.8798
Epoch 2/30
15/15 [==============================] - 9s 633ms/step - loss: 0.3360 - acc: 0.8913 - val_loss: 0.4111 - val_acc: 0.8516
Epoch 3/30
15/15 [==============================] - 10s 687ms/step - loss: 0.3377 - acc: 0.8738 - val_loss: 0.4106 - val_acc: 0.8734
...

However, the training input is quite a bit smaller than the full graph in this case: 745 (train) + 4226 (validation) = 4971 nodes per epoch.

I have version 0.11.1

Graph has 300K nodes and 2.1M edges. Feature length is 20 and the graph has 3 types of edges.
The rest of the code is the same as in the Unsupervised Cora example notebook. Unfortunately, I cannot share the data. I added the parameters I used in the text below.

I realized that the number_of_walks and length parameters greatly impact the performance. For example, on a graph with 301K nodes and 2.1M edges, I get 32min per epoch training time when using number_of_walks=3 and length=2 (181K training steps) vs ~4h of training time when using number_of_walks=10 and length=3 (I am using unsupervised Cora notebook on a much larger data set). Time per step is 11ms. Is there any recommended set of parameters to start with?

I am using num_samples=[10,5]. My understanding is that num_samples represents the size of neighborhood that is sampled at each layer. We first sample the graph using random walks and then we use num_samples parameters to take neighborhood samples at each layer. Am I correct?

Also, I am using layer_sizes=[30,30]. Is there any rule I should follow when choosing layer_sizes values? Are those the numbers of hidden units in the Keras layers? Let’s say I have nodes where each node has 15 features and I am using num_samples=[10,5]. Is there any relationship between how I choose layer_sizes and how I choose num_samples?

My ultimate goal is to predict node embeddings for unseen nodes and use the embeddings with other manually generated features in model training. I am using 1/4 of the data set for training.

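from tensorflow import keras
from stellargraph.data import UnsupervisedSampler
from stellargraph.mapper import GraphSAGELinkGenerator
from stellargraph.layers import GraphSAGE, link_classification

# G is the StellarGraph object built from the data (not shown)
nodes = list(G.nodes())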
number_of_walks = 10
length = 3

unsupervised_samples = UnsupervisedSampler(
    G, nodes=nodes, length=length, number_of_walks=number_of_walks, seed=1
)

batch_size = 10
epochs = 5
num_samples = [10, 5]
layer_sizes = [30, 30]

# Here I create link generator using StellarGraph object and specify sampling at each layer
generator = GraphSAGELinkGenerator(G, batch_size, num_samples=num_samples, seed=1)
train_gen = generator.flow(unsupervised_samples, seed=1)

graphsage = GraphSAGE(
    layer_sizes=layer_sizes, generator=generator, bias=True, dropout=0.0, normalize="l2"
)
x_inp, x_out = graphsage.in_out_tensors()
prediction = link_classification(
    output_dim=1, output_act="sigmoid", edge_embedding_method="ip"
)(x_out)

model = keras.Model(inputs=x_inp, outputs=prediction)

model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-3),
    loss=keras.losses.binary_crossentropy,
    metrics=[keras.metrics.binary_accuracy],
)

history = model.fit(
    train_gen,
    epochs=epochs,
    verbose=1,
    use_multiprocessing=False,
    workers=2,
    shuffle=False,
)

Why do I need both Link and NodeGenerator in the Unsupervised example?

No worries; it’s enough to have an idea of the shape of the graph 👍

Yeah, you’re right. You may understand this, but, just to be clear, doing unsupervised training with UnsupervisedSampler has three steps (following section 3.2 of the GraphSAGE paper):

  1. sample some edges: positive instances come from random walks, where a walk like start node -> a -> b -> c is turned into edges like start node -- a, start node -- b, start node -- c, and negative ones are sampled randomly
  2. train the GraphSAGE or other model for link prediction on all these positive and negative instances, which computes an embedding vector for the source and target nodes of each edge, and combines them for the final link prediction task
  3. use the trained GraphSAGE model to compute the embedding vectors for nodes

The parameters like number_of_walks used in step 1 thus translate directly into the number of positive and negative examples: with number_of_walks=10 and length=3, each starting node ends up with 2 edges per walk, so, with 10 walks and an equal number of negative samples, approximately 40 edges are created per node in the graph (about 12 million in total for 300K nodes). This is a lot of data to feed through the training process.
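As a rough back-of-the-envelope sketch of that arithmetic: the helper name estimate_training_steps below is made up for illustration and is not part of StellarGraph. It assumes every walk reaches its full length, that a walk of `length` nodes yields `length - 1` positive pairs, and that one negative pair is drawn per positive pair.

```python
import math


def estimate_training_steps(num_nodes, number_of_walks, length, batch_size):
    """Approximate links and steps per epoch for unsupervised sampling.

    Hypothetical helper (not a StellarGraph function). Assumes every walk
    reaches its full length and one negative pair per positive pair.
    """
    positives_per_node = number_of_walks * (length - 1)
    total_links = num_nodes * positives_per_node * 2  # positives + negatives
    steps = math.ceil(total_links / batch_size)
    return total_links, steps


# Figures taken from this thread: ~301K nodes, batch_size=10, ~11 ms per step
for walks, length in [(3, 2), (10, 3)]:
    links, steps = estimate_training_steps(301_000, walks, length, batch_size=10)
    hours = steps * 0.011 / 3600
    print(f"walks={walks} length={length}: {links} links, {steps} steps, ~{hours:.1f} h")
```

With the numbers reported in this thread, the estimate comes out at roughly 181K steps for number_of_walks=3, length=2, and roughly 1.2M steps (several hours at 11 ms per step) for number_of_walks=10, length=3, which is in line with the timings described in the question.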

I think these can be fairly independent; each layer size is the size of the vectors produced by the corresponding aggregation layer. For instance, with layer_sizes = [30, 20] (different here just for clarity of explanation) and num_samples = [10, 5], 10 first-neighbours are sampled for each input node, and then 5 second-neighbours are sampled for each of those 10. The first layer aggregates each group of 5 second-neighbours into its first-neighbour, giving 10 vectors of size 30 (and likewise aggregates the 10 first-neighbours into the input node), and the second layer then aggregates these 10 vectors into 1 vector of size 20, which is the final embedding.
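To get a feel for how num_samples affects the amount of data gathered per training example, here is a small counting sketch. The function name sampled_nodes_per_input is hypothetical, and the count ignores any de-duplication or caching that a real implementation might do.

```python
def sampled_nodes_per_input(num_samples):
    """Count the nodes whose features are gathered for one input node.

    Hypothetical helper: counts the input node plus every sampled
    neighbour at each hop, with no de-duplication.
    """
    total = 1  # the input (head) node itself
    nodes_at_hop = 1
    for n in num_samples:
        nodes_at_hop *= n  # each node at the previous hop samples n neighbours
        total += nodes_at_hop
    return total


per_node = sampled_nodes_per_input([10, 5])  # 1 + 10 + 10 * 5
per_link = 2 * per_node  # a link needs both its source and its target node
per_batch = 10 * per_link  # batch_size = 10, as in the question
print(per_node, per_link, per_batch)
```

Because the count grows with the product of the entries in num_samples, larger settings such as the paper's [25, 10] gather several times more feature vectors per example than [10, 5].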

Additionally, I think these generally should be chosen via a conventional hyperparameter search. It seems the paper used num_samples = [25, 10], and layer sizes of 128, 256, 512 or 1024 (it’s slightly unclear between Appendix C and their code).

Is the second model you’re training doing supervised node classification (predicting a property about the node) using the embedding vectors and these manual features as input? If so, have you considered adding the manual features into the data in the StellarGraph object and training a node classification model end-to-end, without generating intermediate embeddings?

It’s suboptimal, but one way to make this faster would be to reduce to a smaller amount of data for training.
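One possible way to do that, sketched under assumptions: start the random walks from only a random subset of nodes. The helper sample_start_nodes is hypothetical, and the integer range is just a stand-in for list(G.nodes()); the commented-out call reuses the UnsupervisedSampler arguments from the question.

```python
import random


def sample_start_nodes(nodes, fraction, seed=1):
    """Pick a reproducible random subset of node IDs to start walks from."""
    rng = random.Random(seed)
    nodes = list(nodes)
    k = max(1, int(len(nodes) * fraction))
    return rng.sample(nodes, k)


all_nodes = list(range(300_000))  # stand-in for list(G.nodes())
start_nodes = sample_start_nodes(all_nodes, fraction=0.25)
print(len(start_nodes))

# Hypothetical use with the sampler from the question (not run here):
# unsupervised_samples = UnsupervisedSampler(
#     G, nodes=start_nodes, length=length, number_of_walks=number_of_walks, seed=1
# )
```

The number of sampled links, and therefore the number of training steps, scales with the number of start nodes, so a quarter of the nodes should give roughly a quarter of the steps per epoch.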

The link generator does GraphSAGE sampling for the nodes at both ends of the edge, and it is used in step 2 for training the model via link prediction. The node generator does GraphSAGE sampling for just a single node, and is used in step 3.
