StellarGraph

Loading String Data into StellarGraph

Hello everyone!

I have been experimenting with the StellarGraph framework for my university research, and I am working on the drug-target interaction network from BioSNAP called ChG-Miner.

Snippet of data:
| #Drug | Gene |
|---|---|
| DB00357 | P05108 |
| DB02721 | P00325 |
| DB00773 | P23219 |

The first step of my project is to load this network into StellarGraph, but because all nodes in the dataset are strings, I have not been able to do it.
Converting each node into a one-hot vector does not seem to be an efficient way, so is there anything else I can do to load such a dataset?

Appreciate the help in this regard.

Thanks!

Hi, thanks for asking a question. I’m not sure I entirely understand, so I’ll give some help, but be sure to let me know if it’s not on the right track!

I think that dataset comes in the form of an edge list, where each row represents a connection between a Drug and a Gene. This means that it’s purely edge data. The edge data can be loaded by creating a Pandas dataframe with pd.read_csv and passing it into the StellarGraph constructor. For example, following the “heterogeneous” section of the Loading Pandas demo:

import pandas as pd
import stellargraph as sg

edges = pd.read_csv("ChG-Miner_miner-chem-gene.tsv", sep="\t")

drugs = pd.DataFrame(index=pd.unique(edges["#Drug"]))
genes = pd.DataFrame(index=pd.unique(edges["Gene"]))

graph = sg.StellarGraph(
    {"drug": drugs, "gene": genes}, 
    edges,
    source_column="#Drug", 
    target_column="Gene",
)

print(graph.info())
StellarGraph: Undirected multigraph
 Nodes: 7343, Edges: 15139

 Node types:
  drug: [5018]
    Features: none
    Edge types: drug-default->gene
  gene: [2325]
    Features: none
    Edge types: gene-default->drug

 Edge types:
    drug-default->gene: [15139]
        Weights: all 1 (default)

This graph will contain only the edge info, as well as the distinction in the types of the nodes. This is suitable for random walk methods like Node2Vec or Metapath2Vec.

As you hint, you can also add one-hot encoded IDs as node features, for use with (transductive) HinSAGE and so on. Creating two square frames of about 25M and 5.4M elements (5018² and 2325²), or a single one of about 54M (7343²), is obviously inefficient, but it’s probably not too bad.
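A minimal sketch of building those one-hot ID frames, using a tiny made-up edge list with the same column names as the ChG-Miner file. The helper name one_hot_ids is hypothetical, not part of StellarGraph; it just wraps an identity matrix in a dataframe indexed by node ID.

```python
import numpy as np
import pandas as pd

# Toy edge list with the same column names as the ChG-Miner file.
edges = pd.DataFrame(
    {
        "#Drug": ["DB00357", "DB02721", "DB00773", "DB00357"],
        "Gene": ["P05108", "P00325", "P23219", "P00325"],
    }
)


def one_hot_ids(ids):
    """Hypothetical helper: one row per unique ID, each row a one-hot vector."""
    unique_ids = pd.unique(ids)
    # An identity matrix has exactly one 1 per row, so row i encodes ID i.
    return pd.DataFrame(np.eye(len(unique_ids), dtype="float32"), index=unique_ids)


drugs = one_hot_ids(edges["#Drug"])
genes = one_hot_ids(edges["Gene"])

print(drugs.shape, genes.shape)
```

Frames built this way should be usable in place of the featureless drugs and genes frames in the constructor call, since they keep the same index of node IDs.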

If you want “real” node data, it looks like that is available in the “Entities and feature tables” section of https://snap.stanford.edu/biodata/index.html. For instance, each of the three Genes listed in your snippet is included in the G-SynMiner dataset, under the uniprot_ids column. This data can be loaded separately and used in place of the genes dataframe (you’ll need to do appropriate preprocessing and also potentially filter it down to just the genes that appear in the dataset). I don’t know if there’s data for drugs.
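As a rough sketch of that filtering step, the snippet below uses a small made-up table in place of the real download. Only the uniprot_ids column name comes from the dataset description; the feature_a and feature_b columns and the IDs Q99999 and P11111 are invented for illustration. It also assumes one UniProt ID per row with no duplicates and already-numeric features, which the real file may not satisfy without preprocessing.

```python
import pandas as pd

# Toy edge list; "P11111" is a made-up gene with no row in the feature table.
edges = pd.DataFrame(
    {
        "#Drug": ["DB00357", "DB02721", "DB00773", "DB00357"],
        "Gene": ["P05108", "P00325", "P23219", "P11111"],
    }
)

# Hypothetical stand-in for a gene feature table (feature columns are invented).
gene_table = pd.DataFrame(
    {
        "uniprot_ids": ["P05108", "P00325", "P23219", "Q99999"],
        "feature_a": [0.1, 0.4, 0.7, 0.9],
        "feature_b": [1.0, 0.0, 1.0, 0.0],
    }
)

gene_ids = pd.unique(edges["Gene"])
genes = (
    gene_table.set_index("uniprot_ids")
    .reindex(gene_ids)  # keep only genes in the edge list; unknown genes become NaN
    .fillna(0.0)        # default features for genes missing from the table
    .astype("float32")
)

print(genes)
```

Reindexing by the edge-list IDs does both jobs at once: it drops genes that never appear in an edge and guarantees every gene node has a row.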

Also, an alternative approach to having two types like I do above would be to have a single type with node features that encode the type. For instance:

drugs = pd.DataFrame(index=pd.unique(edges["#Drug"])).assign(drug=1, gene=0)
genes = pd.DataFrame(index=pd.unique(edges["Gene"])).assign(drug=0, gene=1)

nodes = pd.concat([drugs, genes])
graph = sg.StellarGraph(
    nodes, 
    edges,
    source_column="#Drug", 
    target_column="Gene",
)
print(nodes.sample(3))
print(graph.info())
         drug  gene
Q15418      0     1
Q969I3      0     1
DB02084     1     0

StellarGraph: Undirected multigraph
 Nodes: 7343, Edges: 15139

 Node types:
  default: [7343]
    Features: float32 vector, length 2
    Edge types: default-default->default

 Edge types:
    default-default->default: [15139]
        Weights: all 1 (default)

This allows using “homogeneous” algorithms like GCN and GAT. This can be combined with the real node data too, by filling in default values for the gene features for drug nodes (like all zeros).
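Here is a small sketch of that combination, assuming the gene features are already in a numeric dataframe indexed by gene ID. The feature_a and feature_b columns are invented placeholders, not real BioSNAP columns.

```python
import pandas as pd

drug_ids = ["DB00357", "DB02721"]

# Hypothetical gene features, indexed by gene ID (column names are invented).
gene_features = pd.DataFrame(
    {"feature_a": [0.1, 0.4], "feature_b": [1.0, 0.0]},
    index=["P05108", "P00325"],
)

# Type indicator columns, as in the single-type example.
drugs = pd.DataFrame(index=drug_ids).assign(drug=1, gene=0)
genes = gene_features.assign(drug=0, gene=1)

# Drug rows have no gene features, so concat leaves NaN there; fill with zeros.
nodes = pd.concat([drugs, genes]).fillna(0.0).astype("float32")

print(nodes)
```

Every node then has a feature vector of the same length, which is what a single-type graph needs.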

Does this help get you started?


Dear huon,
Appreciate your quick reply and guidance. :slight_smile:

Hello huon,

I was wondering about the one-hot vector encoding code you wrote in your solution. I cannot see the StellarGraph constructor being called there, so how does graph.info() work?
Did you miss anything there?

I tried creating the graph object with the line below, but I get an error.

graph = sg.StellarGraph({"nodes": nodes, "edges": edges})

Thanks already!

Oops, sorry, I missed that. I’ve edited it in now. Here’s a copy for ease of reference:

graph = sg.StellarGraph(
    nodes, 
    edges,
    source_column="#Drug", 
    target_column="Gene",
)

For future reference, there are a few other places you can look to answer a question like that. For example:
