Introduction to Cora, a core dataset for graph neural networks
2022-07-26 14:32:00 【yihanyifan】
Dataset summary
The Cora dataset consists of machine learning papers. Each paper belongs to one of the following seven categories:
- Case based
- Genetic algorithms
- Neural networks
- Probabilistic methods
- Reinforcement learning
- Rule learning
- Theory
The papers were selected so that, in the final corpus, every paper cites or is cited by at least one other paper. The whole corpus contains 2708 papers.
After stemming and removing stop words, only 1433 unique words remain; all words with a document frequency of less than 10 were removed.
Dataset file description
The dataset is made up of two files: cora.cites and cora.content.
cora.content
The .content file contains descriptions of the papers, one per line, in the following format: <paper_id> <word_attributes>+ <class_label>
Each row (in fact, a node of the graph) starts with the paper's unique string identifier, followed by 1433 binary fields indicating whether each word of the 1433-word vocabulary is present in the paper (1) or absent (0). The final field of the row is the paper's class label (one of the 7 categories). The features of this dataset therefore have 1433 dimensions; counting the leading id field and the trailing label field, each row has 1433 + 2 fields in total.
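For readers who want to work with the raw files directly, here is a minimal parsing sketch (the data/cora/ path and the variable names are illustrative assumptions, not part of the loader shown later):

```python
import numpy as np

# A minimal sketch for parsing cora.content; the path is an assumption.
ids, rows, labels = [], [], []
with open("data/cora/cora.content") as f:
    for line in f:
        parts = line.strip().split('\t')            # <paper_id> <1433 word attributes> <class_label>
        ids.append(parts[0])                        # unique string identifier of the paper
        rows.append([int(v) for v in parts[1:-1]])  # 1433 binary word indicators
        labels.append(parts[-1])                    # class label, one of the 7 categories

features = np.array(rows, dtype=np.float32)
print(features.shape, len(ids), len(set(labels)))   # expected: (2708, 1433) 2708 7
```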
cora.cites
The .cites file contains the citation 'graph' of the corpus.
Each row (in fact, an edge of the graph) describes one citation in the following format: <ID of cited paper> <ID of citing paper>
Each line therefore contains two paper ids: the first field is the id of the cited paper and the second field is the id of the citing paper. The direction of the citation is from right to left: if a line reads "paper1 paper2", then paper2 cites paper1, i.e. the link is "paper2 -> paper1". The citation links between papers can be used to build the adjacency matrix adj, as sketched below.
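Continuing the sketch above (it reuses np and the ids list, both assumptions), the edges in cora.cites can be collected into a sparse, symmetrized adjacency matrix:

```python
import scipy.sparse as sp

# Map each paper id to the row index it occupies in the feature matrix.
id_to_idx = {paper_id: i for i, paper_id in enumerate(ids)}

row_idx, col_idx = [], []
with open("data/cora/cora.cites") as f:
    for line in f:
        cited, citing = line.strip().split('\t')  # a line "paper1 paper2" means paper2 -> paper1
        row_idx.append(id_to_idx[citing])
        col_idx.append(id_to_idx[cited])

adj = sp.coo_matrix((np.ones(len(row_idx)), (row_idx, col_idx)),
                    shape=(len(ids), len(ids)))
adj = adj + adj.T       # treat citations as undirected edges
adj.data[:] = 1.0       # clamp any duplicate entries back to 1
print(adj.shape)        # expected: (2708, 2708)
```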
In addition, the GCN code below reads the following preprocessed files:
- ind.cora.x: feature vectors of the training nodes, stored as a scipy.sparse.csr.csr_matrix; the dense shape is (140, 1433)
- ind.cora.tx: feature vectors of the test nodes, stored as a scipy.sparse.csr.csr_matrix; the dense shape is (1000, 1433)
- ind.cora.allx: feature vectors of both the labeled and the unlabeled training nodes, stored as a scipy.sparse.csr.csr_matrix; the dense shape is (1708, 1433). This can be understood as the features of all nodes except the test set; the training set is a subset of it
- ind.cora.y: one-hot labels of the training nodes, stored as a numpy.ndarray
- ind.cora.ty: one-hot labels of the test nodes, stored as a numpy.ndarray
- ind.cora.ally: one-hot labels corresponding to ind.cora.allx, stored as a numpy.ndarray
- ind.cora.graph: the edges between nodes, stored as a dict of the form { index: [ index_of_neighbor_nodes ] }
- ind.cora.test.index: indices of the test-set nodes, stored as a list; used for the inductive learning setting below
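Before walking through the full loader, a quick way to sanity-check these pickle files is to unpickle one of them directly; the sketch below assumes the same data/ layout used by the code that follows:

```python
import pickle as pkl

# Peek at the training features and labels (paths assume the data/ layout used below).
with open("data/ind.cora.x", 'rb') as f:
    x = pkl.load(f, encoding='latin1')   # scipy.sparse csr_matrix
with open("data/ind.cora.y", 'rb') as f:
    y = pkl.load(f, encoding='latin1')   # numpy.ndarray

print(x.shape)   # (140, 1433)
print(y.shape)   # (140, 7)
```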
import numpy as np
import pickle as pkl
import networkx as nx
import scipy.sparse as sp
# The old path scipy.sparse.linalg.eigen.arpack no longer exists in recent SciPy,
# so import eigsh from scipy.sparse.linalg instead.
from scipy.sparse.linalg import eigsh
import sys
def parse_index_file(filename):
    """Parse index file."""
    index = []
    for line in open(filename):
        index.append(int(line.strip()))
    return index
def sample_mask(idx, l):
    """Create mask."""
    mask = np.zeros(l)
    mask[idx] = 1
    return np.array(mask, dtype=bool)  # np.bool was removed in newer NumPy; use the built-in bool
def load_data(dataset_str):
    """
    Loads input data from gcn/data directory

    ind.dataset_str.x => the feature vectors of the training instances as scipy.sparse.csr.csr_matrix object;
    ind.dataset_str.tx => the feature vectors of the test instances as scipy.sparse.csr.csr_matrix object;
    ind.dataset_str.allx => the feature vectors of both labeled and unlabeled training instances
        (a superset of ind.dataset_str.x) as scipy.sparse.csr.csr_matrix object;
    ind.dataset_str.y => the one-hot labels of the labeled training instances as numpy.ndarray object;
    ind.dataset_str.ty => the one-hot labels of the test instances as numpy.ndarray object;
    ind.dataset_str.ally => the labels for instances in ind.dataset_str.allx as numpy.ndarray object;
    ind.dataset_str.graph => a dict in the format {index: [index_of_neighbor_nodes]} as collections.defaultdict
        object;
    ind.dataset_str.test.index => the indices of test instances in graph, for the inductive setting as list object.

    All objects above must be saved using python pickle module.

    :param dataset_str: Dataset name
    :return: All data input files loaded (as well the training/test data).
    """
    names = ['x', 'y', 'tx', 'ty', 'allx', 'ally', 'graph']
    objects = []
    for i in range(len(names)):  # read each of the pickle files in turn
        with open("data/ind.{}.{}".format(dataset_str, names[i]), 'rb') as f:
            if sys.version_info > (3, 0):  # Python version greater than 3.0
                data = pkl.load(f, encoding='latin1')
                if names[i].find('graph') == -1:  # everything except the .graph file
                    print(f)
                    """
                    x:    (140, 1433)  140 training nodes, each with a 1433-dimensional feature vector
                    y:    (140, 7)     training targets of the 140 training nodes, 7-dimensional one-hot encoding
                    tx:   (1000, 1433) 1000 test nodes
                    ty:   (1000, 7)
                    allx: (1708, 1433)
                    ally: (1708, 7)
                    """
                    print(data.shape)
                    print(type(data))
                    # >>> <class 'scipy.sparse._csr.csr_matrix'>
                    print(type(data[0]))
                    # >>> <class 'scipy.sparse._csr.csr_matrix'>
                    for j in range(data.shape[0]):  # number of rows in the matrix
                        """
                        x: data[j] is the feature vector of node j
                        y: data[j] is the label of node j, shape (7,)
                        """
                        print('********', names[i], j, data[j].shape, '**********')
                        print(data[j])
                        print('\n')
                else:
                    print(f)
                    print(type(data))
                    # >>> <class 'collections.defaultdict'>
                    print(data)
                objects.append(data)
            else:
                objects.append(pkl.load(f))
    x, y, tx, ty, allx, ally, graph = tuple(objects)

    # Training data
    print(x[0][0], x.shape, type(x))  # x is a sparse matrix (only the positions of 1s are stored); 140 instances with 1433-dimensional feature vectors, shape (140, 1433)
    print(y[0], y.shape)              # y is the label matrix; 7 classes, 140 instances, shape (140, 7)
    # Test data
    print(tx[0][0], tx.shape, type(tx))  # tx is a sparse matrix; 1000 instances with 1433-dimensional feature vectors, shape (1000, 1433)
    print(ty[0], ty.shape)               # ty is the label matrix; 7 classes, 1000 instances, shape (1000, 7)
    # allx and ally have the same form as above
    print(allx[0][0], allx.shape, type(allx))  # allx is a sparse matrix; 1708 instances with 1433-dimensional feature vectors, shape (1708, 1433)
    print(ally[0], ally.shape)                 # ally is the label matrix; 7 classes, 1708 instances, shape (1708, 7)
    # graph is a dict; the full graph has 2708 nodes
    for i in graph:
        print(i, graph[i])

    # dataset_str is 'cora' here; format() substitutes its arguments into the {} placeholders
    # read all indices stored in test.index
    test_idx_reorder = parse_index_file("data/ind.{}.test.index".format(dataset_str))
    test_idx_range = np.sort(test_idx_reorder)
    print(test_idx_range.size)
    print(type(test_idx_range))
    print(test_idx_range)
    if dataset_str == 'citeseer':
        # Fix citeseer dataset (there are some isolated nodes in the graph)
        # Find isolated nodes, add them as zero-vecs into the right position
        test_idx_range_full = range(min(test_idx_reorder), max(test_idx_reorder) + 1)
        tx_extended = sp.lil_matrix((len(test_idx_range_full), x.shape[1]))
        tx_extended[test_idx_range - min(test_idx_range), :] = tx
        tx = tx_extended
        ty_extended = np.zeros((len(test_idx_range_full), y.shape[1]))
        ty_extended[test_idx_range - min(test_idx_range), :] = ty
        ty = ty_extended
    # Stack the feature matrices and reorder the test rows so that they match the graph's node ids
    features = sp.vstack((allx, tx)).tolil()
    features[test_idx_reorder, :] = features[test_idx_range, :]
    # Build the adjacency matrix from the dict-of-lists graph
    adj = nx.adjacency_matrix(nx.from_dict_of_lists(graph))
    # print(adj, adj.shape)
    labels = np.vstack((ally, ty))
    labels[test_idx_reorder, :] = labels[test_idx_range, :]

    # Train / validation / test split: first 140 nodes train, next 500 validation, sorted test indices test
    idx_test = test_idx_range.tolist()
    idx_train = range(len(y))
    idx_val = range(len(y), len(y) + 500)
    train_mask = sample_mask(idx_train, labels.shape[0])
    val_mask = sample_mask(idx_val, labels.shape[0])
    test_mask = sample_mask(idx_test, labels.shape[0])

    y_train = np.zeros(labels.shape)
    y_val = np.zeros(labels.shape)
    y_test = np.zeros(labels.shape)
    y_train[train_mask, :] = labels[train_mask, :]
    y_val[val_mask, :] = labels[val_mask, :]
    y_test[test_mask, :] = labels[test_mask, :]

    return adj, features, y_train, y_val, y_test, train_mask, val_mask, test_mask
def sparse_to_tuple(sparse_mx):
    """Convert sparse matrix to tuple representation."""
    def to_tuple(mx):
        if not sp.isspmatrix_coo(mx):
            mx = mx.tocoo()
        coords = np.vstack((mx.row, mx.col)).transpose()
        values = mx.data
        shape = mx.shape
        return coords, values, shape

    if isinstance(sparse_mx, list):
        for i in range(len(sparse_mx)):
            sparse_mx[i] = to_tuple(sparse_mx[i])
    else:
        sparse_mx = to_tuple(sparse_mx)

    return sparse_mx
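As a quick illustration on a hand-made 2x3 matrix (not taken from the dataset), sparse_to_tuple returns the COO coordinates, the nonzero values, and the dense shape:

```python
# Illustrative example of sparse_to_tuple on a small hand-made matrix.
m = sp.csr_matrix(np.array([[0., 2., 0.],
                            [1., 0., 3.]]))
coords, values, shape = sparse_to_tuple(m)
print(coords)   # [[0 1] [1 0] [1 2]]
print(values)   # [2. 1. 3.]
print(shape)    # (2, 3)
```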
def preprocess_features(features):
    """Row-normalize feature matrix and convert to tuple representation"""
    rowsum = np.array(features.sum(1))
    r_inv = np.power(rowsum, -1).flatten()
    r_inv[np.isinf(r_inv)] = 0.
    r_mat_inv = sp.diags(r_inv)
    features = r_mat_inv.dot(features)
    return sparse_to_tuple(features)
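Row normalization divides every row of the feature matrix by its row sum, so each paper's bag-of-words vector sums to 1 (all-zero rows are left untouched thanks to the isinf guard). A tiny hand-made example:

```python
# Illustrative example: row-normalize a small 2x3 feature matrix.
tiny = sp.csr_matrix(np.array([[1., 1., 0.],
                               [0., 0., 4.]]))
coords, values, shape = preprocess_features(tiny)
print(values)   # [0.5 0.5 1.] -- each row now sums to 1
```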
def normalize_adj(adj):
    """Symmetrically normalize adjacency matrix."""
    adj = sp.coo_matrix(adj)
    rowsum = np.array(adj.sum(1))
    d_inv_sqrt = np.power(rowsum, -0.5).flatten()
    d_inv_sqrt[np.isinf(d_inv_sqrt)] = 0.
    d_mat_inv_sqrt = sp.diags(d_inv_sqrt)
    return adj.dot(d_mat_inv_sqrt).transpose().dot(d_mat_inv_sqrt).tocoo()
def preprocess_adj(adj):
    """Preprocessing of adjacency matrix for simple GCN model and conversion to tuple representation."""
    adj_normalized = normalize_adj(adj + sp.eye(adj.shape[0]))
    return sparse_to_tuple(adj_normalized)
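preprocess_adj implements the GCN renormalization trick: self-loops are added first (A + I) and the result is symmetrically normalized as D^(-1/2)(A + I)D^(-1/2). On a toy two-node graph with a single edge, every entry of the normalized matrix becomes 0.5:

```python
# Illustrative example: the renormalization trick on a 2-node graph with one edge.
toy_adj = sp.csr_matrix(np.array([[0., 1.],
                                  [1., 0.]]))
coords, values, shape = preprocess_adj(toy_adj)
print(values)   # [0.5 0.5 0.5 0.5]
print(shape)    # (2, 2)
```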
def construct_feed_dict(features, support, labels, labels_mask, placeholders):
    """Construct feed dictionary."""
    feed_dict = dict()
    feed_dict.update({placeholders['labels']: labels})
    feed_dict.update({placeholders['labels_mask']: labels_mask})
    feed_dict.update({placeholders['features']: features})
    feed_dict.update({placeholders['support'][i]: support[i] for i in range(len(support))})
    feed_dict.update({placeholders['num_features_nonzero']: features[1].shape})
    return feed_dict
def chebyshev_polynomials(adj, k):
    """Calculate Chebyshev polynomials up to order k. Return a list of sparse matrices (tuple representation)."""
    print("Calculating Chebyshev polynomials up to order {}...".format(k))

    adj_normalized = normalize_adj(adj)
    laplacian = sp.eye(adj.shape[0]) - adj_normalized
    largest_eigval, _ = eigsh(laplacian, 1, which='LM')
    scaled_laplacian = (2. / largest_eigval[0]) * laplacian - sp.eye(adj.shape[0])

    t_k = list()
    t_k.append(sp.eye(adj.shape[0]))
    t_k.append(scaled_laplacian)

    def chebyshev_recurrence(t_k_minus_one, t_k_minus_two, scaled_lap):
        s_lap = sp.csr_matrix(scaled_lap, copy=True)
        return 2 * s_lap.dot(t_k_minus_one) - t_k_minus_two

    for i in range(2, k + 1):
        t_k.append(chebyshev_recurrence(t_k[-1], t_k[-2], scaled_laplacian))

    return sparse_to_tuple(t_k)
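chebyshev_polynomials applies the recurrence T_k(x) = 2x·T_{k-1}(x) − T_{k-2}(x) to the rescaled Laplacian 2L/λ_max − I and returns k + 1 sparse matrices in tuple representation. A small usage sketch on a toy triangle graph (not Cora):

```python
# Illustrative usage: order-3 Chebyshev basis for a 3-node triangle graph.
toy_adj = sp.csr_matrix(np.array([[0., 1., 1.],
                                  [1., 0., 1.],
                                  [1., 1., 0.]]))
t_k = chebyshev_polynomials(toy_adj, 3)
print(len(t_k))    # 4 matrices: T_0 ... T_3
print(t_k[0][2])   # each element is a (coords, values, shape) tuple; the shape here is (3, 3)
```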
load_data('cora')
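Assuming the data/ pickle files are in place, the returned objects can also be captured and their shapes checked; the expected sizes are 2708 nodes, 1433 features, and a 140 / 500 / 1000 train / validation / test split:

```python
# Capture the loader's outputs and verify the expected shapes.
adj, features, y_train, y_val, y_test, train_mask, val_mask, test_mask = load_data('cora')
print(adj.shape)        # (2708, 2708)
print(features.shape)   # (2708, 1433)
print(train_mask.sum(), val_mask.sum(), test_mask.sum())  # 140 500 1000
```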