HinSage nodes without features

Background: I’m using HinSAGE because my objective is to perform node classification on a heterogeneous graph based on graph structure and node features. I’m trying to reduce the need for manual feature engineering from graph patterns.

In my initial testing with 3 different node types, I set up a single dummy feature on each node (all values set to 1) to get me going; the intent was to add true features once I’d proven HinSAGE was making good predictions based on the structure of a simple graph. For my ‘simple test’, I produced a test graph where the classification of a node of type A was directly correlated with its attachment to a node of type C (via a node of type B). HinSAGE didn’t ‘find’ that pattern. Rather than delve into the specifics of the HinSAGE implementation, I hypothesised that maybe HinSAGE didn’t have an awareness of the node types in its training - the loss function was pretty flat over the epochs. As a quick hack to test my hypothesis, I generated an artificial random feature on each node, where each node type’s random value was drawn from a Gaussian distribution around a different mean (because I am performing feature normalisation) - the idea being this would help the algorithm learn the type of each node via the differing means. This seemed to work: the loss dropped and the model gave good results.
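To make the hack concrete, here's a minimal sketch of the kind of per-type random feature I generated (the node counts, means, and column name are made up for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical node counts and per-type means; after normalisation,
# the differing means are the only signal distinguishing the types.
type_means = {"A": 0.0, "B": 5.0, "C": 10.0}
node_counts = {"A": 100, "B": 200, "C": 50}

features = {
    node_type: pd.DataFrame(
        rng.normal(loc=type_means[node_type], scale=1.0,
                   size=(node_counts[node_type], 1)),
        columns=["random_feature"],
    )
    for node_type in type_means
}

print(features["A"]["random_feature"].mean())  # roughly 0
print(features["C"]["random_feature"].mean())  # roughly 10
```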

My hack doesn’t feel sustainable as features are added, so I’m curious to hear thoughts on what I’m observing and what I should be expecting from HinSAGE when there are limited features on nodes. I know there are other algorithms that look at graph structure without node features; I was hoping HinSAGE could do this and roll in features to boot where available.


HinSAGE does have awareness of the node types, but it does indeed require node features to be maximally useful. The core of the algorithm when computing the hidden output for a node is to sample neighbours following edge and node types, with each of those samples aggregated separately. In both HinSAGE and the original GraphSAGE, the graph structure (and the node/edge types within it) is thus conveyed by how the node features fit into this sampling. In particular, this means that the model cannot distinguish between nodes with identical features, like your original all-ones experiment.
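A toy illustration of why that happens (this is not the real HinSAGE code, just a single made-up mean-aggregation layer): if every node's feature vector is identical, the aggregated output is the same no matter how the neighbourhoods differ.

```python
import numpy as np

# Toy single layer: h_v = x_v @ W_self + mean(x_u for u in N(v)) @ W_neigh
rng = np.random.default_rng(0)
W_self = rng.normal(size=(1, 4))
W_neigh = rng.normal(size=(1, 4))

def aggregate(x_self, x_neighbours):
    return x_self @ W_self + np.mean(x_neighbours, axis=0) @ W_neigh

# Two nodes with very different neighbourhoods but identical all-ones
# features produce indistinguishable outputs.
ones = np.ones((1,))
out_a = aggregate(ones, np.ones((2, 1)))  # node with 2 neighbours
out_b = aggregate(ones, np.ones((7, 1)))  # node with 7 neighbours
print(np.allclose(out_a, out_b))  # True
```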

If you’re interested in more details, Theorem 1 in the GraphSAGE paper formalises this for GraphSAGE, and HinSAGE is likely similar (the $\|\mathbf{x}_v - \mathbf{x}_{v'}\|_2 > C$ supposition is saying that node features have to be different).

For your particular dataset, are the “real” features from the original dataset likely to be different for each node?

Some other options:

  • Instead of having the node type be handled at the sampling stage in HinSAGE, encode it as a one-hot node feature vector (A = 1, B = 0, C = 0, for a node of type A), and then any algorithm for homogeneous graphs can be used (e.g. GraphSAGE or GAT). These feature vectors can be concatenated alongside any other features too. (This may or may not work so well if there’s significant class imbalance: HinSAGE effectively does stratified sampling of neighbours by type, which GraphSAGE doesn’t.)
  • One-hot encode the node identity, for node types without features. For example, if node type B has no features, compute features like [1, 0, ..., 0] for the first node of type B, [0, 1, 0, ..., 0] for the second (etc.). This is similar to the trick of randomly generating features, but “skipping” a step, where the model first learns node identity based on the features. (This may require too much memory to be feasible, though.)
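Both options above can be sketched in a few lines (the node IDs and type assignments here are hypothetical, just to show the shapes involved):

```python
import numpy as np
import pandas as pd

# Hypothetical node-type assignments.
node_types = pd.Series(["A", "B", "B", "C", "A"],
                       index=["n0", "n1", "n2", "n3", "n4"])

# Option 1: one-hot encode the node *type*, so a homogeneous-graph
# model (GraphSAGE, GAT, ...) can still see the type information.
type_features = pd.get_dummies(node_types, dtype=int)
print(type_features.loc["n0"].tolist())  # [1, 0, 0] for type A

# Option 2: one-hot encode the node *identity* for a type with no
# features, e.g. type B here: an identity matrix, one row per node.
b_nodes = node_types[node_types == "B"].index
identity_features = pd.DataFrame(np.eye(len(b_nodes)), index=b_nodes)
print(identity_features.shape)  # (2, 2)
```

The option-2 matrix is square in the number of nodes of that type, which is where the memory concern comes from.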

Does that vaguely help? Please ask if that doesn’t make sense or if I’m missing something :smile:

Thank you for your reply. I think I’m starting to better understand the pros/cons of HinSAGE as a result of reviewing your comments. My classes will be imbalanced (fraud prediction application), so that’s keeping my attention on HinSAGE for now. If I understand your second option correctly, if, say, m is the number of nodes of a type, then that is going to be at least an m-dimensional feature vector for that node type, which I agree is probably not going to be practical from a memory standpoint. My ‘hack’ feels a lot less precise; compared to your second idea, I wonder what the tradeoff of precision vs. efficiency would be?

I’ve started to look at the source for the HinSAGENodeGenerator class to get a better feel for what’s happening with the features as they are prepared to be used by GraphSAGE. It hasn’t quite clicked yet, but I’ll persevere with a fresh mind later. I’m trying to understand how the features for multiple node types end up being encoded for what I expect is a single node type for GraphSAGE. If you can provide any cues that’d be helpful.

Re your question

For your particular dataset, are the “real” features from the original dataset likely to be different for each node?

By node type, yes, there are likely to be different features on all of them. Looking at the theorem quoted, if I understand correctly, the issue will arise if lots of the nodes have feature values that aren’t discerning enough and are too similar?

Thanks for your careful explanation, I hope my questions as articulated make sense!

Yeah, you’re right. For one-hot encoding the node IDs, you end up with an $m \times m$ square matrix of features, with m ones and all the other elements zero.
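A quick back-of-envelope calculation shows where that becomes impractical (the node count here is made up):

```python
import numpy as np

# One-hot node-identity features: an m x m float32 matrix is
# m * m * 4 bytes if stored densely.
m = 100_000  # hypothetical number of nodes of the featureless type
dense_bytes = m * m * 4
print(f"{dense_bytes / 1e9:.0f} GB")  # 40 GB -- clearly impractical

# For a small m it's harmless, though:
small = np.eye(10, dtype=np.float32)
print(small.nbytes)  # 400 bytes
```

A sparse representation would shrink the storage to O(m), but the model's first-layer weights still scale with the m-dimensional input.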

I think it’s a reasonable thing to try.

One thing to consider is that both one-hot encoded node IDs and random features will probably stop the model from generalising to new nodes, meaning they’re suitable for transductive tasks but not inductive ones.

Have you seen , which provides a textual description of the algorithm?

Yeah, the theorem does suggest that having distinct features is important for GraphSAGE: it’s not critical, but it will help the model.

Another option to consider beyond random features (though moving more towards manual feature engineering) would be to use feature vectors that capture the graph structure. For example, they could be computed using any of the algorithms in (the ones that don’t use node features at all might be the most relevant). These feature vectors could be appended to the real features for types that have them, or just used to fill in features for types that don’t have any already.
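As one possible sketch of that idea (using networkx purely for illustration, on a built-in toy graph rather than your real data), purely structural per-node features like degree, PageRank, and clustering coefficient need no node attributes at all:

```python
import networkx as nx
import pandas as pd

# Toy stand-in graph; in practice this would be the real dataset,
# possibly computed per node type on the heterogeneous graph.
g = nx.karate_club_graph()

# Structural features derived only from connectivity.
structural = pd.DataFrame({
    "degree": dict(g.degree()),
    "pagerank": nx.pagerank(g),
    "clustering": nx.clustering(g),
})
print(structural.shape)  # (34, 3)
```

These columns could then be concatenated with any real features, or stand alone for featureless node types.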