Introduction to Cora, a core dataset for graph neural networks
2022-07-26 14:32:00 【yihanyifan】
Dataset summary
The Cora dataset consists of machine learning papers. Each paper belongs to one of the following seven categories:
- Case based
- Genetic algorithms
- Neural networks
- Probabilistic methods
- Reinforcement learning
- Rule learning
- Theory
The papers were selected so that, in the final corpus, every paper cites or is cited by at least one other paper. The whole corpus contains 2708 papers.
After stemming and removing stop words, only 1433 unique words remain; all words with a document frequency of less than 10 were removed.
Dataset file description
The dataset is made up of two files: cora.cites and cora.content.
cora.content
The .content file contains descriptions of the papers, one per line, in the following format: <paper_id> <word_attributes>+ <class_label>
Each row (in fact, a node of the graph) starts with the paper's unique string identifier, followed by 1433 binary fields indicating whether each word of the 1433-word vocabulary is present in the paper (1) or absent (0). The final field of the row is the paper's class label (one of the 7 categories). The features of this dataset therefore have 1433 dimensions; counting the leading id field and the trailing label field, each row has 1433 + 2 fields in total.
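For readers who want to work with the raw files directly, here is a minimal parsing sketch (the data/cora/ path and the variable names are illustrative assumptions, not part of the loader shown later):

```python
import numpy as np

# A minimal sketch for parsing cora.content; the path is an assumption.
ids, rows, labels = [], [], []
with open("data/cora/cora.content") as f:
    for line in f:
        parts = line.strip().split('\t')            # <paper_id> <1433 word attributes> <class_label>
        ids.append(parts[0])                        # unique string identifier of the paper
        rows.append([int(v) for v in parts[1:-1]])  # 1433 binary word indicators
        labels.append(parts[-1])                    # class label, one of the 7 categories

features = np.array(rows, dtype=np.float32)
print(features.shape, len(ids), len(set(labels)))   # expected: (2708, 1433) 2708 7
```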
cora.cites
The .cites file contains the citation 'graph' of the corpus.
Each row (in fact, an edge of the graph) describes one citation in the following format: <ID of cited paper> <ID of citing paper>
Each line therefore contains two paper ids: the first field is the id of the cited paper and the second field is the id of the citing paper. The direction of the citation is from right to left: if a line reads "paper1 paper2", then paper2 cites paper1, i.e. the link is "paper2 -> paper1". The citation links between papers can be used to build the adjacency matrix adj, as sketched below.
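Continuing the sketch above (it reuses np and the ids list, both assumptions), the edges in cora.cites can be collected into a sparse, symmetrized adjacency matrix:

```python
import scipy.sparse as sp

# Map each paper id to the row index it occupies in the feature matrix.
id_to_idx = {paper_id: i for i, paper_id in enumerate(ids)}

row_idx, col_idx = [], []
with open("data/cora/cora.cites") as f:
    for line in f:
        cited, citing = line.strip().split('\t')  # a line "paper1 paper2" means paper2 -> paper1
        row_idx.append(id_to_idx[citing])
        col_idx.append(id_to_idx[cited])

adj = sp.coo_matrix((np.ones(len(row_idx)), (row_idx, col_idx)),
                    shape=(len(ids), len(ids)))
adj = adj + adj.T       # treat citations as undirected edges
adj.data[:] = 1.0       # clamp any duplicate entries back to 1
print(adj.shape)        # expected: (2708, 2708)
```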
In addition, the GCN code below reads the following preprocessed files:
- ind.cora.x: feature vectors of the training nodes, stored as a scipy.sparse.csr.csr_matrix; the dense shape is (140, 1433)
- ind.cora.tx: feature vectors of the test nodes, stored as a scipy.sparse.csr.csr_matrix; the dense shape is (1000, 1433)
- ind.cora.allx: feature vectors of both the labeled and the unlabeled training nodes, stored as a scipy.sparse.csr.csr_matrix; the dense shape is (1708, 1433). This can be understood as the features of all nodes except the test set; the training set is a subset of it
- ind.cora.y: one-hot labels of the training nodes, stored as a numpy.ndarray
- ind.cora.ty: one-hot labels of the test nodes, stored as a numpy.ndarray
- ind.cora.ally: one-hot labels corresponding to ind.cora.allx, stored as a numpy.ndarray
- ind.cora.graph: the edges between nodes, stored as a dict of the form { index: [ index_of_neighbor_nodes ] }
- ind.cora.test.index: indices of the test-set nodes, stored as a list; used for the inductive learning setting below
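Before walking through the full loader, a quick way to sanity-check these pickle files is to unpickle one of them directly; the sketch below assumes the same data/ layout used by the code that follows:

```python
import pickle as pkl

# Peek at the training features and labels (paths assume the data/ layout used below).
with open("data/ind.cora.x", 'rb') as f:
    x = pkl.load(f, encoding='latin1')   # scipy.sparse csr_matrix
with open("data/ind.cora.y", 'rb') as f:
    y = pkl.load(f, encoding='latin1')   # numpy.ndarray

print(x.shape)   # (140, 1433)
print(y.shape)   # (140, 7)
```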
import numpy as np
import pickle as pkl
import networkx as nx
import scipy.sparse as sp
# The old path scipy.sparse.linalg.eigen.arpack no longer exists in recent SciPy,
# so import eigsh from scipy.sparse.linalg instead.
from scipy.sparse.linalg import eigsh
import sys
def parse_index_file(filename):
    """Parse index file."""
    index = []
    for line in open(filename):
        index.append(int(line.strip()))
    return index
def sample_mask(idx, l):
    """Create mask."""
    mask = np.zeros(l)
    mask[idx] = 1
    return np.array(mask, dtype=bool)  # np.bool was removed in newer NumPy; use the built-in bool
def load_data(dataset_str):
    """
    Loads input data from gcn/data directory

    ind.dataset_str.x => the feature vectors of the training instances as scipy.sparse.csr.csr_matrix object;
    ind.dataset_str.tx => the feature vectors of the test instances as scipy.sparse.csr.csr_matrix object;
    ind.dataset_str.allx => the feature vectors of both labeled and unlabeled training instances
        (a superset of ind.dataset_str.x) as scipy.sparse.csr.csr_matrix object;
    ind.dataset_str.y => the one-hot labels of the labeled training instances as numpy.ndarray object;
    ind.dataset_str.ty => the one-hot labels of the test instances as numpy.ndarray object;
    ind.dataset_str.ally => the labels for instances in ind.dataset_str.allx as numpy.ndarray object;
    ind.dataset_str.graph => a dict in the format {index: [index_of_neighbor_nodes]} as collections.defaultdict
        object;
    ind.dataset_str.test.index => the indices of test instances in graph, for the inductive setting as list object.

    All objects above must be saved using python pickle module.

    :param dataset_str: Dataset name
    :return: All data input files loaded (as well the training/test data).
    """
    names = ['x', 'y', 'tx', 'ty', 'allx', 'ally', 'graph']
    objects = []
    for i in range(len(names)):  # read each of the pickle files in turn
        with open("data/ind.{}.{}".format(dataset_str, names[i]), 'rb') as f:
            if sys.version_info > (3, 0):  # Python version greater than 3.0
                data = pkl.load(f, encoding='latin1')
                if names[i].find('graph') == -1:  # everything except the .graph file
                    print(f)
                    """
                    x:    (140, 1433)  140 training nodes, each with a 1433-dimensional feature vector
                    y:    (140, 7)     training targets of the 140 training nodes, 7-dimensional one-hot encoding
                    tx:   (1000, 1433) 1000 test nodes
                    ty:   (1000, 7)
                    allx: (1708, 1433)
                    ally: (1708, 7)
                    """
                    print(data.shape)
                    print(type(data))
                    # >>> <class 'scipy.sparse._csr.csr_matrix'>
                    print(type(data[0]))
                    # >>> <class 'scipy.sparse._csr.csr_matrix'>
                    for j in range(data.shape[0]):  # number of rows in the matrix
                        """
                        x: data[j] is the feature vector of node j
                        y: data[j] is the label of node j, shape (7,)
                        """
                        print('********', names[i], j, data[j].shape, '**********')
                        print(data[j])
                        print('\n')
                else:
                    print(f)
                    print(type(data))
                    # >>> <class 'collections.defaultdict'>
                    print(data)
                objects.append(data)
            else:
                objects.append(pkl.load(f))
    x, y, tx, ty, allx, ally, graph = tuple(objects)

    # Training data
    print(x[0][0], x.shape, type(x))  # x is a sparse matrix (only the positions of 1s are stored); 140 instances with 1433-dimensional feature vectors, shape (140, 1433)
    print(y[0], y.shape)              # y is the label matrix; 7 classes, 140 instances, shape (140, 7)
    # Test data
    print(tx[0][0], tx.shape, type(tx))  # tx is a sparse matrix; 1000 instances with 1433-dimensional feature vectors, shape (1000, 1433)
    print(ty[0], ty.shape)               # ty is the label matrix; 7 classes, 1000 instances, shape (1000, 7)
    # allx and ally have the same form as above
    print(allx[0][0], allx.shape, type(allx))  # allx is a sparse matrix; 1708 instances with 1433-dimensional feature vectors, shape (1708, 1433)
    print(ally[0], ally.shape)                 # ally is the label matrix; 7 classes, 1708 instances, shape (1708, 7)
    # graph is a dict; the full graph has 2708 nodes
    for i in graph:
        print(i, graph[i])

    # dataset_str is 'cora' here; format() substitutes its arguments into the {} placeholders
    # read all indices stored in test.index
    test_idx_reorder = parse_index_file("data/ind.{}.test.index".format(dataset_str))
    test_idx_range = np.sort(test_idx_reorder)
    print(test_idx_range.size)
    print(type(test_idx_range))
    print(test_idx_range)
    if dataset_str == 'citeseer':
        # Fix citeseer dataset (there are some isolated nodes in the graph)
        # Find isolated nodes, add them as zero-vecs into the right position
        test_idx_range_full = range(min(test_idx_reorder), max(test_idx_reorder) + 1)
        tx_extended = sp.lil_matrix((len(test_idx_range_full), x.shape[1]))
        tx_extended[test_idx_range - min(test_idx_range), :] = tx
        tx = tx_extended
        ty_extended = np.zeros((len(test_idx_range_full), y.shape[1]))
        ty_extended[test_idx_range - min(test_idx_range), :] = ty
        ty = ty_extended
    # Stack the feature matrices and reorder the test rows so that they match the graph's node ids
    features = sp.vstack((allx, tx)).tolil()
    features[test_idx_reorder, :] = features[test_idx_range, :]
    # Build the adjacency matrix from the dict-of-lists graph
    adj = nx.adjacency_matrix(nx.from_dict_of_lists(graph))
    # print(adj, adj.shape)
    labels = np.vstack((ally, ty))
    labels[test_idx_reorder, :] = labels[test_idx_range, :]

    # Train / validation / test split: first 140 nodes train, next 500 validation, sorted test indices test
    idx_test = test_idx_range.tolist()
    idx_train = range(len(y))
    idx_val = range(len(y), len(y) + 500)
    train_mask = sample_mask(idx_train, labels.shape[0])
    val_mask = sample_mask(idx_val, labels.shape[0])
    test_mask = sample_mask(idx_test, labels.shape[0])

    y_train = np.zeros(labels.shape)
    y_val = np.zeros(labels.shape)
    y_test = np.zeros(labels.shape)
    y_train[train_mask, :] = labels[train_mask, :]
    y_val[val_mask, :] = labels[val_mask, :]
    y_test[test_mask, :] = labels[test_mask, :]

    return adj, features, y_train, y_val, y_test, train_mask, val_mask, test_mask
def sparse_to_tuple(sparse_mx):
    """Convert sparse matrix to tuple representation."""
    def to_tuple(mx):
        if not sp.isspmatrix_coo(mx):
            mx = mx.tocoo()
        coords = np.vstack((mx.row, mx.col)).transpose()
        values = mx.data
        shape = mx.shape
        return coords, values, shape

    if isinstance(sparse_mx, list):
        for i in range(len(sparse_mx)):
            sparse_mx[i] = to_tuple(sparse_mx[i])
    else:
        sparse_mx = to_tuple(sparse_mx)

    return sparse_mx
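As a quick illustration on a hand-made 2x3 matrix (not taken from the dataset), sparse_to_tuple returns the COO coordinates, the nonzero values, and the dense shape:

```python
# Illustrative example of sparse_to_tuple on a small hand-made matrix.
m = sp.csr_matrix(np.array([[0., 2., 0.],
                            [1., 0., 3.]]))
coords, values, shape = sparse_to_tuple(m)
print(coords)   # [[0 1] [1 0] [1 2]]
print(values)   # [2. 1. 3.]
print(shape)    # (2, 3)
```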
def preprocess_features(features):
    """Row-normalize feature matrix and convert to tuple representation"""
    rowsum = np.array(features.sum(1))
    r_inv = np.power(rowsum, -1).flatten()
    r_inv[np.isinf(r_inv)] = 0.
    r_mat_inv = sp.diags(r_inv)
    features = r_mat_inv.dot(features)
    return sparse_to_tuple(features)
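Row normalization divides every row of the feature matrix by its row sum, so each paper's bag-of-words vector sums to 1 (all-zero rows are left untouched thanks to the isinf guard). A tiny hand-made example:

```python
# Illustrative example: row-normalize a small 2x3 feature matrix.
tiny = sp.csr_matrix(np.array([[1., 1., 0.],
                               [0., 0., 4.]]))
coords, values, shape = preprocess_features(tiny)
print(values)   # [0.5 0.5 1.] -- each row now sums to 1
```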
def normalize_adj(adj):
    """Symmetrically normalize adjacency matrix."""
    adj = sp.coo_matrix(adj)
    rowsum = np.array(adj.sum(1))
    d_inv_sqrt = np.power(rowsum, -0.5).flatten()
    d_inv_sqrt[np.isinf(d_inv_sqrt)] = 0.
    d_mat_inv_sqrt = sp.diags(d_inv_sqrt)
    return adj.dot(d_mat_inv_sqrt).transpose().dot(d_mat_inv_sqrt).tocoo()
def preprocess_adj(adj):
    """Preprocessing of adjacency matrix for simple GCN model and conversion to tuple representation."""
    adj_normalized = normalize_adj(adj + sp.eye(adj.shape[0]))
    return sparse_to_tuple(adj_normalized)
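preprocess_adj implements the GCN renormalization trick: self-loops are added first (A + I) and the result is symmetrically normalized as D^(-1/2)(A + I)D^(-1/2). On a toy two-node graph with a single edge, every entry of the normalized matrix becomes 0.5:

```python
# Illustrative example: the renormalization trick on a 2-node graph with one edge.
toy_adj = sp.csr_matrix(np.array([[0., 1.],
                                  [1., 0.]]))
coords, values, shape = preprocess_adj(toy_adj)
print(values)   # [0.5 0.5 0.5 0.5]
print(shape)    # (2, 2)
```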
def construct_feed_dict(features, support, labels, labels_mask, placeholders):
    """Construct feed dictionary."""
    feed_dict = dict()
    feed_dict.update({placeholders['labels']: labels})
    feed_dict.update({placeholders['labels_mask']: labels_mask})
    feed_dict.update({placeholders['features']: features})
    feed_dict.update({placeholders['support'][i]: support[i] for i in range(len(support))})
    feed_dict.update({placeholders['num_features_nonzero']: features[1].shape})
    return feed_dict
def chebyshev_polynomials(adj, k):
    """Calculate Chebyshev polynomials up to order k. Return a list of sparse matrices (tuple representation)."""
    print("Calculating Chebyshev polynomials up to order {}...".format(k))

    adj_normalized = normalize_adj(adj)
    laplacian = sp.eye(adj.shape[0]) - adj_normalized
    largest_eigval, _ = eigsh(laplacian, 1, which='LM')
    scaled_laplacian = (2. / largest_eigval[0]) * laplacian - sp.eye(adj.shape[0])

    t_k = list()
    t_k.append(sp.eye(adj.shape[0]))
    t_k.append(scaled_laplacian)

    def chebyshev_recurrence(t_k_minus_one, t_k_minus_two, scaled_lap):
        s_lap = sp.csr_matrix(scaled_lap, copy=True)
        return 2 * s_lap.dot(t_k_minus_one) - t_k_minus_two

    for i in range(2, k + 1):
        t_k.append(chebyshev_recurrence(t_k[-1], t_k[-2], scaled_laplacian))

    return sparse_to_tuple(t_k)
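chebyshev_polynomials applies the recurrence T_k(x) = 2x·T_{k-1}(x) − T_{k-2}(x) to the rescaled Laplacian 2L/λ_max − I and returns k + 1 sparse matrices in tuple representation. A small usage sketch on a toy triangle graph (not Cora):

```python
# Illustrative usage: order-3 Chebyshev basis for a 3-node triangle graph.
toy_adj = sp.csr_matrix(np.array([[0., 1., 1.],
                                  [1., 0., 1.],
                                  [1., 1., 0.]]))
t_k = chebyshev_polynomials(toy_adj, 3)
print(len(t_k))    # 4 matrices: T_0 ... T_3
print(t_k[0][2])   # each element is a (coords, values, shape) tuple; the shape here is (3, 3)
```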
load_data('cora')
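Assuming the data/ pickle files are in place, the returned objects can also be captured and their shapes checked; the expected sizes are 2708 nodes, 1433 features, and a 140 / 500 / 1000 train / validation / test split:

```python
# Capture the loader's outputs and verify the expected shapes.
adj, features, y_train, y_val, y_test, train_mask, val_mask, test_mask = load_data('cora')
print(adj.shape)        # (2708, 2708)
print(features.shape)   # (2708, 1433)
print(train_mask.sum(), val_mask.sum(), test_mask.sum())  # 140 500 1000
```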