DGL Chapter 1 (official tutorial) personal notes

2022-07-28 17:11:00 【Name filling】

The DGL library is very user-friendly and has an official Chinese tutorial. Most of what follows is pasted from that tutorial, so treat this post as personal notes.

Chapter 1: Graphs

DGLGraph, the core data structure of DGL, provides a graph-centric programming abstraction. It offers interfaces for handling the graph structure, the node and edge features, and the computations that can be performed with these components.

1.1 Basic concepts of graphs

Understand the basic concepts: graphs, graph representations, weighted and unweighted graphs, homogeneous and heterogeneous graphs, and multigraphs.

1.2 Graphs, nodes, and edges

DGL represents each node by a unique integer, called its node ID, and each edge by the pair of IDs of its two endpoints. Each edge also has an edge ID. Edges in DGL are directed: an edge $(u, v)$ points from node $u$ to node $v$.

To specify multiple nodes, DGL uses a one-dimensional integer tensor (for example, a PyTorch Tensor) of node IDs, which DGL calls a "node tensor". To specify multiple edges, DGL uses a tuple of two node tensors $(U, V)$, where $(U[i], V[i])$ denotes the edge from $U[i]$ to $V[i]$.

One way to create a DGLGraph object is the dgl.graph() function, which takes a set of edges as input. DGL also supports creating graph objects from other data sources.

[Figure: a directed graph with 4 nodes and the edges 0->1, 0->2, 0->3, 1->3]

The following code snippet uses dgl.graph() to build the DGLGraph for the 4-node graph in the figure. It also demonstrates some of the APIs for querying the graph structure.

import dgl
import torch as th
# Edges 0->1, 0->2, 0->3, 1->3
u, v = th.tensor([0, 0, 0, 1]), th.tensor([1, 2, 3, 3])
g = dgl.graph((u, v))
print(g)

Using backend: pytorch

Graph(num_nodes=4, num_edges=4,
      ndata_schemes={}
      edata_schemes={})

# Get the node IDs
print(g.nodes())

tensor([0, 1, 2, 3])

# Get the source and destination nodes of each edge
print(g.edges())

(tensor([0, 0, 0, 1]), tensor([1, 2, 3, 3]))

# Get the endpoints and the edge ID of each edge
print(g.edges(form='all'))

(tensor([0, 0, 0, 1]), tensor([1, 2, 3, 3]), tensor([0, 1, 2, 3]))

# If the node with the largest ID has no edges, the number of nodes must be given explicitly when creating the graph.
g = dgl.graph((u, v), num_nodes=8)
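
When num_nodes is omitted, the node count is taken to be the largest node ID in the edge list plus one, which is why isolated trailing nodes have to be declared explicitly. The plain-Python sketch below illustrates that rule without DGL; infer_num_nodes is a hypothetical helper written for these notes, not a DGL function.

```python
def infer_num_nodes(src, dst, num_nodes=None):
    """Return the node count for an edge list given as two lists of node IDs."""
    # Smallest count that covers every ID seen in the edges (IDs start at 0).
    inferred = max(max(src), max(dst)) + 1
    if num_nodes is None:
        return inferred
    if num_nodes < inferred:
        raise ValueError("num_nodes is smaller than the largest node ID + 1")
    return num_nodes


src, dst = [0, 0, 0, 1], [1, 2, 3, 3]
print(infer_num_nodes(src, dst))               # 4
print(infer_num_nodes(src, dst, num_nodes=8))  # 8: nodes 4..7 have no edges
```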

For an undirected graph, you need to create edges in both directions for each edge. The dgl.to_bidirected() function does this: as the following snippet shows, it converts the original graph into one that also contains the reverse edges.

bg = dgl.to_bidirected(g)
bg.edges()

(tensor([0, 0, 0, 1, 1, 2, 3, 3]), tensor([1, 2, 3, 0, 3, 0, 0, 1]))
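
The effect of adding reverse edges can be sketched in plain Python without DGL. add_reverse_edges below is a hypothetical helper, not DGL's implementation; it sorts the result only so that the output is deterministic and can be compared with the tensors printed above.

```python
def add_reverse_edges(src, dst):
    """Return (src, dst) lists that contain every edge in both directions."""
    edges = set(zip(src, dst))
    edges |= {(v, u) for u, v in edges}  # add the reverse of each edge
    ordered = sorted(edges)              # the set has already removed duplicates
    return [u for u, _ in ordered], [v for _, v in ordered]


new_src, new_dst = add_reverse_edges([0, 0, 0, 1], [1, 2, 3, 3])
print(new_src)  # [0, 0, 0, 1, 1, 2, 3, 3]
print(new_dst)  # [1, 2, 3, 0, 3, 0, 0, 1]
```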

DGL can use either 32-bit or 64-bit integers as node and edge IDs, but the type must be consistent across the graph. The two conversion methods are shown below.

edges = th.tensor([2, 5, 3]), th.tensor([3, 5, 0])  # Edges: 2->3, 5->5, 3->0
g64 = dgl.graph(edges)  # DGL uses int64 by default
print(g64.idtype)

torch.int64

g32 = dgl.graph(edges, idtype=th.int32)  # Build the graph with int32 IDs
g32.idtype

torch.int32

g64_2 = g32.long()  # Convert to int64
g64_2.idtype

torch.int64

g32_2 = g64.int()  # Convert to int32
g32_2.idtype

torch.int32

1.3 Node and edge features

The nodes and edges of a DGLGraph object can carry multiple user-defined, named features that store the properties of the graph's nodes and edges.
These features are accessed through the ndata and edata interfaces.

For example, the following code creates two node features (named 'x' and 'y') and one edge feature (named 'x').

import dgl
import torch as th
g = dgl.graph((th.tensor([0, 0, 1, 5]), th.tensor([1, 2, 2, 0])))  # 6 nodes, 4 edges
# g = dgl.graph(([0, 0, 1, 5], [1, 2, 2, 0]))
g

Graph(num_nodes=6, num_edges=4,
      ndata_schemes={}
      edata_schemes={})

g.ndata['x'] = th.ones(g.num_nodes(), 3)    # Node feature of length 3
g.edata['x'] = th.ones(g.num_edges(), dtype=th.int32)    # Scalar integer edge feature
g

Graph(num_nodes=6, num_edges=4,
      ndata_schemes={'x': Scheme(shape=(3,), dtype=torch.float32)}
      edata_schemes={'x': Scheme(shape=(), dtype=torch.int32)})

# Features with different names can have different shapes
g.ndata['y'] = th.randn(g.num_nodes(), 5)    # The nodes now have two features, 'x' and 'y'
g.ndata['x'][1]     # Get the 'x' feature of node 1

tensor([1., 1., 1.])

g.edata['x'][th.tensor([0, 3])]    # Get the 'x' feature of edges 0 and 3

tensor([1, 1], dtype=torch.int32)

Important notes about the ndata and edata interfaces:

Only features of numeric types (such as single-precision float, double-precision float, and integer) are allowed. A feature can be a scalar, a vector, or a multi-dimensional tensor.
Each node feature has a unique name, and so does each edge feature. A node feature and an edge feature can share a name (such as 'x' in the example above).
A feature is created by tensor assignment, which assigns a feature to every node or every edge in the graph. The first dimension of the tensor must equal the number of nodes or edges in the graph; you cannot assign a feature to a subset of the nodes or edges.
Features with the same name must have the same dimensionality and data type.
Feature tensors are row-major: each slice along the first dimension stores the feature of one node or one edge (see the g.ndata['x'][1] and g.edata['x'][th.tensor([0, 3])] lookups above).

For a weighted graph, the weights can be stored as an edge feature, as follows.

# Edges 0->1, 0->2, 0->3, 1->3
edges = th.tensor([0, 0, 0, 1]), th.tensor([1, 2, 3, 3])
weights = th.tensor([0.1, 0.6, 0.9, 0.7])  # Weight of each edge
g = dgl.graph(edges)
g.edata['w'] = weights  # Store the weights in an edge feature named 'w'
g

Graph(num_nodes=4, num_edges=4,
      ndata_schemes={}
      edata_schemes={'w': Scheme(shape=(), dtype=torch.float32)})

edges

(tensor([0, 0, 0, 1]), tensor([1, 2, 3, 3]))
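
Because row i of an edge feature belongs to edge i, the weight tensor lines up position by position with the (U, V) edge tuple. The plain-Python sketch below makes that alignment explicit using lists in place of tensors. The dictionary weight_of is an illustration only, and it assumes the graph has no parallel edges, since two edges with the same endpoints would collide on one key.

```python
src = [0, 0, 0, 1]
dst = [1, 2, 3, 3]
weights = [0.1, 0.6, 0.9, 0.7]

# Row i of the feature belongs to edge i, i.e. the edge src[i] -> dst[i].
weight_of = {(u, v): w for u, v, w in zip(src, dst, weights)}

print(weight_of[(0, 3)])  # 0.9
print(weight_of[(1, 3)])  # 0.7
```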

1.4 Creating graphs from external sources

A DGLGraph object can be constructed from external sources, including:

Conversion from external Python libraries for graphs and sparse matrices (NetworkX and SciPy).
Loading graph data from disk.

This section does not cover functions that generate a graph by transforming other graphs; see the API reference manual for an overview of those.

Creating graphs from external libraries

import dgl
import torch as th
import scipy.sparse as sp
spmat = sp.rand(100, 100, density=0.05)    # 100 x 100 sparse matrix with 5% nonzero entries
dgl.from_scipy(spmat)    # From SciPy

Graph(num_nodes=100, num_edges=500,
      ndata_schemes={}
      edata_schemes={})

import networkx as nx
nx_g = nx.path_graph(5)    # A path graph 0-1-2-3-4
dgl.from_networkx(nx_g)    # From NetworkX

Graph(num_nodes=5, num_edges=8,
      ndata_schemes={}
      edata_schemes={})

Note that the DGLGraph built from nx.path_graph(5) has 8 edges rather than 4. This is because nx.path_graph(5) constructs an undirected NetworkX graph (networkx.Graph), whereas the edges of a DGLGraph are always directed. When converting an undirected NetworkX graph to a DGLGraph, DGL internally turns each undirected edge into two directed edges. Using a directed NetworkX graph (networkx.DiGraph) avoids this behavior.

nxg = nx.DiGraph([(2, 1), (1, 2), (2, 3), (0, 0)])
dgl.from_networkx(nxg)

Graph(num_nodes=4, num_edges=4,
      ndata_schemes={}
      edata_schemes={})

Loading graphs from disk

Comma-separated values (CSV)

CSV is a common format that stores nodes, edges, and their features in tabular form:

nodes.csv

age, title
43, 1
23, 3
…

edges.csv

src, dst, weight
0, 1, 0.4
0, 3, 0.9
…

Many well-known Python libraries (such as pandas) can load this kind of data into Python objects (such as numpy.ndarray), which can then be used to build a DGLGraph object. If the backend framework also provides tools for saving tensors to disk and loading them back (such as torch.save() and torch.load()), the same approach can be used to build graphs.
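
As a minimal sketch of that first step, the snippet below parses the edges.csv rows shown above with only the standard library and produces parallel src, dst, and weight lists. The in-memory string stands in for the file on disk, the elided rows are left out, and read_edges is a hypothetical helper name.

```python
import csv
import io

# Stand-in for the contents of edges.csv shown above (elided rows omitted).
EDGES_CSV = """src, dst, weight
0, 1, 0.4
0, 3, 0.9
"""


def read_edges(text):
    """Parse edge rows into parallel src, dst and weight lists."""
    src, dst, weight = [], [], []
    # skipinitialspace drops the blank that follows each comma.
    for row in csv.DictReader(io.StringIO(text), skipinitialspace=True):
        src.append(int(row["src"]))
        dst.append(int(row["dst"]))
        weight.append(float(row["weight"]))
    return src, dst, weight


src, dst, weight = read_edges(EDGES_CSV)
print(src, dst, weight)  # [0, 0] [1, 3] [0.4, 0.9]
```

The resulting lists could then be converted to tensors and used as in sections 1.2 and 1.3: the src and dst lists as the edge tuple for dgl.graph(), and the weights as an edge feature.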

JSON/GML format

If speed is not a major concern, you can use the tools NetworkX provides to parse a variety of data formats, and DGL can then create graphs from these sources indirectly.

DGL binary format

DGL provides APIs for saving graphs to disk in a binary format and loading them back. Besides the graph structure, these APIs also handle feature data and graph-level label data. DGL also supports loading graphs from S3/HDFS and saving graphs to S3/HDFS directly. The reference manual gives more details on this usage.

1.5 Heterogeneous graphs

Compared with homogeneous graphs, heterogeneous graphs can have nodes and edges of different types. Each node type and each edge type has its own independent ID space and its own features. For example, in the figure below, the IDs of the "user" nodes and of the "game" nodes both start from 0, and the two node types have different features.

An example heterogeneous graph with two types of nodes ("user" and "game") and two types of edges ("follows" and "plays").

In DGL, a heterogeneous graph is specified as a series of subgraphs, one per relation. Each relation is defined by a string triplet (source node type, edge type, destination node type). Because these relation definitions disambiguate the edge types, DGL calls them canonical edge types.

The following code is an example of creating a heterogeneous graph in DGL.

import dgl
import torch as th

# Create a heterogeneous graph with three node types and three edge types
graph_data = {
    ('drug', 'interacts', 'drug'): (th.tensor([0, 1]), th.tensor([1, 2])),
    ('drug', 'interacts', 'gene'): (th.tensor([0, 1]), th.tensor([2, 3])),
    ('drug', 'treats', 'disease'): (th.tensor([1]), th.tensor([2]))
}
g = dgl.heterograph(graph_data)
print("g.ntypes:", g.ntypes)
print("g.etypes:", g.etypes)
print("g.canonical_etypes:", g.canonical_etypes)

g.ntypes: ['disease', 'drug', 'gene']
g.etypes: ['interacts', 'interacts', 'treats']
g.canonical_etypes: [('drug', 'interacts', 'drug'), ('drug', 'interacts', 'gene'), ('drug', 'treats', 'disease')]

Note that homogeneous graphs and bipartite graphs are just special heterogeneous graphs that contain a single relation.

# A homogeneous graph
dgl.heterograph({('node_type', 'edge_type', 'node_type'): (u, v)})
# A bipartite graph
dgl.heterograph({('source_type', 'edge_type', 'destination_type'): (u, v)})

Graph(num_nodes={'destination_type': 4, 'source_type': 2},
      num_edges={('source_type', 'edge_type', 'destination_type'): 4},
      metagraph=[('source_type', 'destination_type', 'edge_type')])

The metagraph associated with a heterogeneous graph is the schema of the graph. It specifies type constraints on the sets of nodes and on the edges between them. A node $u$ in the metagraph corresponds to a node type in the heterogeneous graph, and an edge $(u, v)$ in the metagraph indicates that the heterogeneous graph has edges from nodes of type $u$ to nodes of type $v$.

g

Graph(num_nodes={'disease': 3, 'drug': 3, 'gene': 4},
      num_edges={('drug', 'interacts', 'drug'): 2, ('drug', 'interacts', 'gene'): 2, ('drug', 'treats', 'disease'): 1},
      metagraph=[('drug', 'drug', 'interacts'), ('drug', 'gene', 'interacts'), ('drug', 'disease', 'treats')])

g.metagraph().edges()

OutMultiEdgeDataView([('drug', 'drug'), ('drug', 'gene'), ('drug', 'disease')])

g.metagraph().nodes()

NodeView(('drug', 'gene', 'disease'))
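
The relationship between canonical edge types and the metagraph can be sketched in plain Python: every node type becomes a metagraph node, and every relation becomes a metagraph edge from its source type to its destination type. build_metagraph below is a hypothetical helper for illustration and does not reproduce how DGL builds its NetworkX metagraph internally.

```python
canonical_etypes = [
    ("drug", "interacts", "drug"),
    ("drug", "interacts", "gene"),
    ("drug", "treats", "disease"),
]


def build_metagraph(relations):
    """Return (node types, typed edges) describing the schema of the relations."""
    ntypes = []
    meta_edges = []
    for src_type, etype, dst_type in relations:
        for ntype in (src_type, dst_type):
            if ntype not in ntypes:
                ntypes.append(ntype)  # keep first-seen order
        # Same (source type, destination type, edge type) layout as the
        # metagraph=[...] field printed by DGL above.
        meta_edges.append((src_type, dst_type, etype))
    return ntypes, meta_edges


ntypes, meta_edges = build_metagraph(canonical_etypes)
print(ntypes)      # ['drug', 'gene', 'disease']
print(meta_edges)
```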

Working with multiple types

Once multiple node and edge types are introduced, you must specify the particular node or edge type when calling DGLGraph APIs for type-specific information. In addition, nodes and edges of different types have separate IDs.

# Get the total number of nodes in the graph
print("g.num_nodes():", g.num_nodes())
# Get the number of drug nodes
print("g.num_nodes('drug'):", g.num_nodes('drug'))
# Nodes of different types have separate IDs, so the result is ambiguous unless a node type is given.
# g.nodes()
# DGLError: Node type name must be specified if there are more than one node types.
print("g.nodes('drug'):", g.nodes('drug'))

g.num_nodes(): 10
g.num_nodes('drug'): 3
g.nodes('drug'): tensor([0, 1, 2])
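
The totals above follow from each node type having its own ID space: the count for a type is its largest ID plus one, taken over every relation in which the type appears. The sketch below recomputes the numbers in plain Python, with lists in place of tensors. count_nodes_per_type is a hypothetical helper, not part of DGL.

```python
graph_data = {
    ("drug", "interacts", "drug"): ([0, 1], [1, 2]),
    ("drug", "interacts", "gene"): ([0, 1], [2, 3]),
    ("drug", "treats", "disease"): ([1], [2]),
}


def count_nodes_per_type(relations):
    """Return {node type: largest ID + 1} over all relations using that type."""
    counts = {}
    for (src_type, _etype, dst_type), (src, dst) in relations.items():
        counts[src_type] = max(counts.get(src_type, 0), max(src) + 1)
        counts[dst_type] = max(counts.get(dst_type, 0), max(dst) + 1)
    return counts


counts = count_nodes_per_type(graph_data)
print(counts)                # {'drug': 3, 'gene': 4, 'disease': 3}
print(sum(counts.values()))  # 10
```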

To set or get features for a specific node type or edge type, DGL provides two new syntaxes: g.nodes['node_type'].data['feat_name'] and g.edges['edge_type'].data['feat_name'].

# Set/get the 'hv' feature of nodes of type 'drug'
g.nodes['drug'].data['hv'] = th.ones(3, 1)
g.nodes['drug'].data['hv']

tensor([[1.],
        [1.],
        [1.]])

# Set/get the 'he' feature of edges of type 'treats'
g.edges['treats'].data['he'] = th.zeros(1, 1)
g.edges['treats'].data['he']

tensor([[0.]])

If the graph has only one node type or only one edge type, there is no need to specify it.

g = dgl.heterograph({
    ('drug', 'interacts', 'drug'): (th.tensor([0, 1]), th.tensor([1, 2])),
    ('drug', 'is similar', 'drug'): (th.tensor([0, 1]), th.tensor([2, 3]))
})
g.nodes()

tensor([0, 1, 2, 3])

# With a single node or edge type, features can be set/read without the new syntax
g.ndata['hv'] = th.ones(4, 1)
g.ndata['hv']

tensor([[1.],
        [1.],
        [1.],
        [1.]])

Loading heterogeneous graphs from disk

Comma-separated values (CSV)

A common way to store a heterogeneous graph is to keep the different types of nodes and edges in separate CSV files. Here is an example.

Data folder

data/
|-- drug.csv # drug nodes
|-- gene.csv # gene nodes
|-- disease.csv # disease nodes
|-- drug-interact-drug.csv # drug-drug interaction edges
|-- drug-interact-gene.csv # drug-gene interaction edges
|-- drug-treat-disease.csv # drug-disease treatment edges

As with homogeneous graphs, you can use a package such as pandas to first parse the CSV files into NumPy arrays or framework tensors, then build a relation dictionary and use it to construct the heterogeneous graph. The same approach applies to other popular file formats such as GML or JSON.
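
A minimal sketch of building such a relation dictionary is shown below, using only the standard library. Everything in it is an assumption made for illustration: the src and dst column names, the row values, the in-memory strings standing in for files on disk, and the convention of reading the relation triplet from the file name.

```python
import csv
import io

# Hypothetical contents for the edge files listed above; rows are made up.
EDGE_FILES = {
    "drug-interact-drug.csv": "src,dst\n0,1\n1,2\n",
    "drug-interact-gene.csv": "src,dst\n0,2\n1,3\n",
    "drug-treat-disease.csv": "src,dst\n1,2\n",
}


def relation_from_filename(name):
    """Turn 'drug-treat-disease.csv' into ('drug', 'treat', 'disease')."""
    stem = name[: -len(".csv")]
    src_type, etype, dst_type = stem.split("-")
    return (src_type, etype, dst_type)


def build_relation_dict(files):
    """Map each relation triplet to its (src, dst) lists of node IDs."""
    relations = {}
    for name, text in files.items():
        rows = list(csv.DictReader(io.StringIO(text)))
        src = [int(row["src"]) for row in rows]
        dst = [int(row["dst"]) for row in rows]
        relations[relation_from_filename(name)] = (src, dst)
    return relations


relations = build_relation_dict(EDGE_FILES)
print(relations[("drug", "treat", "disease")])  # ([1], [2])
```

A dictionary of this shape, with the ID lists converted to tensors, has the same layout as the graph_data argument passed to dgl.heterograph() earlier in this section.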

DGL binary format

DGL provides the dgl.save_graphs() and dgl.load_graphs() functions for saving heterogeneous graphs in binary format and loading them back.

Edge type subgraphs

You can create a subgraph of a heterogeneous graph by specifying the relations to retain; the associated features are copied as well.

g = dgl.heterograph({
    ('drug', 'interacts', 'drug'): (th.tensor([0, 1]), th.tensor([1, 2])),
    ('drug', 'interacts', 'gene'): (th.tensor([0, 1]), th.tensor([2, 3])),
    ('drug', 'treats', 'disease'): (th.tensor([1]), th.tensor([2]))
})
g.nodes['drug'].data['hv'] = th.ones(3, 1)

# Retain the relations ('drug', 'interacts', 'drug') and ('drug', 'treats', 'disease').
# All nodes of types 'drug' and 'disease' are retained as well.
eg = dgl.edge_type_subgraph(g, [('drug', 'interacts', 'drug'),
                                ('drug', 'treats', 'disease')])
eg

Graph(num_nodes={'disease': 3, 'drug': 3},
      num_edges={('drug', 'interacts', 'drug'): 2, ('drug', 'treats', 'disease'): 1},
      metagraph=[('drug', 'drug', 'interacts'), ('drug', 'disease', 'treats')])

eg.nodes['drug'].data['hv']

tensor([[1.],
        [1.],
        [1.]])

dgl.edge_type_subgraph(g, [('drug', 'interacts', 'gene')])

Graph(num_nodes={'drug': 3, 'gene': 4},
      num_edges={('drug', 'interacts', 'gene'): 2},
      metagraph=[('drug', 'gene', 'interacts')])
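
Conceptually, taking an edge type subgraph amounts to filtering the relation dictionary and keeping only the node types that the retained relations mention. The plain-Python sketch below shows that selection step with lists in place of tensors. keep_relations is a hypothetical helper; it does not copy features or reproduce anything else that dgl.edge_type_subgraph() does.

```python
graph_data = {
    ("drug", "interacts", "drug"): ([0, 1], [1, 2]),
    ("drug", "interacts", "gene"): ([0, 1], [2, 3]),
    ("drug", "treats", "disease"): ([1], [2]),
}


def keep_relations(relations, keep):
    """Return the retained relations and the node types they refer to."""
    kept = {rel: relations[rel] for rel in keep}
    ntypes = sorted({t for src_t, _etype, dst_t in kept for t in (src_t, dst_t)})
    return kept, ntypes


kept, ntypes = keep_relations(
    graph_data,
    [("drug", "interacts", "drug"), ("drug", "treats", "disease")],
)
print(ntypes)      # ['disease', 'drug']
print(list(kept))  # the two retained canonical edge types
```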

Converting heterogeneous graphs to homogeneous graphs

Heterogeneous graphs provide a clean interface for managing nodes and edges of different types and their associated features. This is particularly helpful when:

The features of different node and edge types have different data types or sizes.
You want to apply different operations to different node and edge types.

If neither of the above applies and you do not want to distinguish node and edge types in modeling, DGL lets you convert a heterogeneous graph to a homogeneous graph with the dgl.to_homogeneous() API. The conversion relabels the nodes and edges of all types with consecutive integers starting from 0 and merges the features you specify across types, as the following code shows:

g = dgl.heterograph({
    ('drug', 'interacts', 'drug'): (th.tensor([0, 1]), th.tensor([1, 2])),
    ('drug', 'treats', 'disease'): (th.tensor([1]), th.tensor([2]))})
g.nodes['drug'].data['hv'] = th.zeros(3, 1)
g.nodes['disease'].data['hv'] = th.ones(3, 1)
g.edges['interacts'].data['he'] = th.zeros(2, 1)
g.edges['treats'].data['he'] = th.zeros(1, 2)

# Features are not merged by default
hg = dgl.to_homogeneous(g)
'hv' in hg.ndata

False

hg

Graph(num_nodes=6, num_edges=3,
      ndata_schemes={'_ID': Scheme(shape=(), dtype=torch.int64), '_TYPE': Scheme(shape=(), dtype=torch.int64)}
      edata_schemes={'_ID': Scheme(shape=(), dtype=torch.int64), '_TYPE': Scheme(shape=(), dtype=torch.int64)})

# Copy edge features
# For the features to copy, DGL assumes that the features to merge have the same size and data type across node or edge types
hg = dgl.to_homogeneous(g, edata=['he'])    # Fails: 'he' has different shapes for the two edge types
# DGLError: Cannot concatenate column 'he' with shape Scheme(shape=(2,), dtype=torch.float32) and shape Scheme(shape=(1,), dtype=torch.float32)

# Copy node features
hg = dgl.to_homogeneous(g, ndata=['hv'])
hg.ndata['hv']

tensor([[1.],
        [1.],
        [1.],
        [0.],
        [0.],
        [0.]])

The original node and edge types and their type-specific IDs are stored in ndata and edata.

# Order of the node types in the heterogeneous graph
g.ntypes

['disease', 'drug']

# Original node types
hg.ndata[dgl.NTYPE]

tensor([0, 0, 0, 1, 1, 1])

# Original type-specific node IDs
hg.ndata[dgl.NID]

tensor([0, 1, 2, 0, 1, 2])

# Order of the edge types in the heterogeneous graph
g.etypes

['interacts', 'treats']

# Original edge types
hg.edata[dgl.ETYPE]

tensor([0, 0, 1])

# Original type-specific edge IDs
hg.edata[dgl.EID]

tensor([0, 1, 0])
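
The node relabeling can be reproduced in plain Python to show how the _TYPE and _ID values map a homogeneous node back to its origin. This is a sketch under the assumption that node types are ordered alphabetically, as the g.ntypes output above suggests. to_homogeneous_ids is a hypothetical helper, not DGL's implementation.

```python
def to_homogeneous_ids(num_nodes_per_type):
    """Relabel typed nodes consecutively; return (type order, _TYPE, _ID)."""
    ntypes = sorted(num_nodes_per_type)  # assumed type order
    type_ids, orig_ids = [], []
    for type_index, ntype in enumerate(ntypes):
        for node_id in range(num_nodes_per_type[ntype]):
            type_ids.append(type_index)  # which type the node came from
            orig_ids.append(node_id)     # its ID within that type
    return ntypes, type_ids, orig_ids


ntypes, type_ids, orig_ids = to_homogeneous_ids({"drug": 3, "disease": 3})
print(type_ids)  # [0, 0, 0, 1, 1, 1]
print(orig_ids)  # [0, 1, 2, 0, 1, 2]
# Homogeneous node 4 was originally node 1 of type 'drug'.
print(ntypes[type_ids[4]], orig_ids[4])  # drug 1
```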

For modeling purposes, you may want to group some relations together and apply the same operation to them. To do this, first extract an edge type subgraph of the heterogeneous graph, then convert that subgraph to a homogeneous graph.

g = dgl.heterograph({
    ('drug', 'interacts', 'drug'): (th.tensor([0, 1]), th.tensor([1, 2])),
    ('drug', 'interacts', 'gene'): (th.tensor([0, 1]), th.tensor([2, 3])),
    ('drug', 'treats', 'disease'): (th.tensor([1]), th.tensor([2]))
})
sub_g = dgl.edge_type_subgraph(g, [('drug', 'interacts', 'drug'),
                                   ('drug', 'interacts', 'gene')])

h_sub_g = dgl.to_homogeneous(sub_g)
h_sub_g

Graph(num_nodes=7, num_edges=4,
      ndata_schemes={'_ID': Scheme(shape=(), dtype=torch.int64), '_TYPE': Scheme(shape=(), dtype=torch.int64)}
      edata_schemes={'_ID': Scheme(shape=(), dtype=torch.int64), '_TYPE': Scheme(shape=(), dtype=torch.int64)})

1.6 Using DGLGraph on a GPU

You can create a DGLGraph on a GPU by passing two GPU tensors during construction. Another approach is to use the to() API to copy a DGLGraph to a GPU, which copies both the graph structure and the feature data to the given device.

import dgl
import torch as th
u, v = th.tensor([0, 1, 2]), th.tensor([2, 3, 4])
g = dgl.graph((u, v))
g.ndata['x'] = th.rand(5, 3)
g.device

device(type='cpu')

cuda_g = g.to('cuda:0')    # Accepts any device object from the backend framework
cuda_g.device

device(type='cuda', index=0)

cuda_g.ndata['x'].device    # Feature data is copied to the GPU as well

device(type='cuda', index=0)

# A graph constructed from GPU tensors is also on the GPU
u, v = u.to('cuda:0'), v.to('cuda:0')
g = dgl.graph((u, v))
g.device

device(type='cuda', index=0)

Any operation involving a GPU graph runs on the GPU. This requires all tensor arguments to be on the GPU already, and the results (graphs or tensors) will be on the GPU too. Furthermore, a GPU graph only accepts feature data that is on the GPU.

cuda_g.in_degrees()    # Get the in-degree of every node

tensor([0, 0, 1, 1, 1], device='cuda:0')

cuda_g.in_edges([2, 3, 4])    # Non-tensor arguments are accepted

(tensor([0, 1, 2], device='cuda:0'), tensor([2, 3, 4], device='cuda:0'))

cuda_g.in_edges(th.tensor([2, 3, 4]).to('cuda:0'))    # Tensor arguments must be on the GPU

(tensor([0, 1, 2], device='cuda:0'), tensor([2, 3, 4], device='cuda:0'))

cuda_g.ndata['h'] = th.randn(5, 4)    # Fails: the feature tensor is on the CPU
# DGLError: Cannot assign node feature "h" on device cpu to a graph on device cuda:0. Call DGLGraph.to() to copy the graph to the same device.