Attention mechanism flow chart
2022-07-28 10:40:00 【Upper back left love】
The attention mechanism is used to improve the information-processing ability of neural networks.
The attention mechanism is mainly divided into two parts: the first part is self-attention (an improved way of computing attention), and the second part is the Transformer (which defines the encode-decode process).
(Flow chart of the attention mechanism omitted.)
In essence, the attention mechanism works as follows: given a query vector Q, its relevance to each key is computed, the resulting weights are attached to the corresponding values, and the attention value is obtained.
Advantage: there is no need to feed all N pieces of input information into the neural network; only the task-related information selected from X is passed in. The computation has three steps (a short sketch follows the list):
- Information input
- Compute the attention distribution A = Softmax(s(key, q))
- Use the attention distribution A to compute a weighted average of the information
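A minimal NumPy sketch of these three steps, using a simple dot product as the score function s(key, q); the array sizes and names are illustrative assumptions, not taken from the article:

import numpy as np

def soft_attention(q, keys, values):
    scores = keys @ q                                # s(key, q) for each of the N inputs, shape (N,)
    scores = scores - scores.max()                   # subtract the max for numerical stability
    A = np.exp(scores) / np.exp(scores).sum()        # attention distribution A = Softmax(s(key, q))
    return A @ values                                # weighted average of the information

X = np.random.randn(5, 4)                            # information input: N = 5 items of dimension 4
q = np.random.randn(4)                               # task-related query vector
print(soft_attention(q, X, X).shape)                 # keys == values == X here; output shape (4,)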
The attention mechanism also solves the long-distance dependency problem of input sequences in RNNs and CNNs.
In traditional neural networks, there are two ways to establish long-distance dependencies within an input sequence: the first is to increase the number of layers, using a deep network for long-distance information interaction; the second is to use a fully connected network (which cannot handle variable-length sequences, because the connection weights are the same regardless of the input length).
Self-attention instead generates the connection weights dynamically, so it can handle variable-length information sequences. The attention function is computed as follows:

Q denotes the query, K the keys, and V the values. When key == value, we have the common attention model; when key != value, we have the key-value-pair model. The calculation (a minimal sketch follows):
1. Take the dot product of Q and K; to prevent the result from becoming too large, divide by a scaling factor $\sqrt{d_k}$, where $d_k$ is the dimension of the query and key vectors.
2. Normalize the result into a probability distribution with Softmax.
3. Multiply by the matrix V to obtain the weighted sum.
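A minimal NumPy sketch of the three calculation steps above; when Q, K and V are all derived from the same input X, this is exactly self-attention. Shapes are illustrative assumptions:

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                          # step 1: dot product of Q and K, scaled by sqrt(d_k)
    scores = scores - scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # step 2: Softmax -> probability distribution
    return weights @ V                                       # step 3: multiply by V to get the weighted sum

T, d_k = 7, 64                                               # a length-7 sequence of 64-dimensional vectors
X = np.random.randn(T, d_k)
print(scaled_dot_product_attention(X, X, X).shape)           # self-attention: (7, 64)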
Transformer (Attention Is All You Need) in detail
The core mechanism of the Transformer is attention.
The Transformer is essentially a Seq2Seq model: the encoder on the left reads the input, and the decoder on the right produces the output.
**Transformer Encoder**:
sub-layer-1: multi-head self-attention mechanism, used to compute self-attention
sub-layer-2: position-wise feed-forward network, a simple fully connected network that performs the same operation on the vector at each position (linear transformation + ReLU activation; input/output dimension 512, middle layer 2048)
Every sub-layer uses a residual connection followed by layer normalization: LayerNorm(X + sublayer(X)); a sketch of sub-layer-2 and this wrapper follows below.
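A minimal NumPy sketch of sub-layer-2 together with the LayerNorm(X + sublayer(X)) wrapper; the dimensions 512 and 2048 follow the text, while the random weights stand in for learned parameters:

import numpy as np

d_model, d_ff = 512, 2048                              # input/output size 512, middle layer 2048

W1, b1 = 0.02 * np.random.randn(d_model, d_ff), np.zeros(d_ff)
W2, b2 = 0.02 * np.random.randn(d_ff, d_model), np.zeros(d_model)

def feed_forward(x):
    # The same linear transformation + ReLU + linear transformation at every position
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

X = np.random.randn(10, d_model)                       # 10 positions in the sequence
out = layer_norm(X + feed_forward(X))                  # LayerNorm(X + sublayer(X))
print(out.shape)                                       # (10, 512)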
**Transformer Decoder**:
sub-layer-1: self-attention, which requires a mask (so that a position cannot attend to later positions)
sub-layer-2: position-wise feed-forward network, the same as in the encoder
sub-layer-3: encoder-decoder attention computation
- Encoder-Decoder attention vs. self-attention
Both use multi-head attention, but encoder-decoder attention follows the traditional attention pattern: the Query is the encoded value at the current decoding position i, already computed by the self-attention sub-layer, while the Key and Value are both the Encoder output. This is what distinguishes it from self-attention, and it is contrasted in the sketch below.
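A minimal NumPy sketch contrasting the decoder's masked self-attention with encoder-decoder attention; sequence lengths and dimensions are illustrative assumptions:

import numpy as np

def sdp_attention(Q, K, V, mask=None):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)          # masked positions receive ~zero weight after Softmax
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

T_dec, T_enc, d = 5, 8, 64
dec_x = np.random.randn(T_dec, d)                      # decoder-side representations
enc_out = np.random.randn(T_enc, d)                    # Encoder output

# Masked self-attention: position i may only attend to positions <= i
causal_mask = np.tril(np.ones((T_dec, T_dec), dtype=bool))
dec_self = sdp_attention(dec_x, dec_x, dec_x, mask=causal_mask)

# Encoder-Decoder attention: Query from the decoder side, Key and Value from the Encoder output
dec_cross = sdp_attention(dec_self, enc_out, enc_out)
print(dec_self.shape, dec_cross.shape)                 # (5, 64) (5, 64)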
multi-head attention
Basic attention uses scaled dot-product attention; multi-head attention runs several such attention operations in parallel over different linear projections of the input and concatenates the results, as in the sketch below.
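A minimal NumPy sketch of multi-head attention built on top of scaled dot-product attention; the head count, dimensions, and random projection matrices are illustrative stand-ins for learned weights:

import numpy as np

def scaled_dot_product(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

def multi_head_attention(X, num_heads=8, d_model=512):
    d_head = d_model // num_heads
    heads = []
    for _ in range(num_heads):
        # Each head projects the same input into its own query, key and value spaces
        Wq, Wk, Wv = (0.02 * np.random.randn(d_model, d_head) for _ in range(3))
        heads.append(scaled_dot_product(X @ Wq, X @ Wk, X @ Wv))
    Wo = 0.02 * np.random.randn(d_model, d_model)
    return np.concatenate(heads, axis=-1) @ Wo         # concatenate the heads, then project back

X = np.random.randn(10, 512)                           # 10 positions, model dimension 512
print(multi_head_attention(X).shape)                   # (10, 512)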
Example: an RNN (GRU) with an attention layer, trained on the IMDB dataset (TensorFlow 1.x):
""" Toy example of attention layer use Train RNN (GRU) on IMDB dataset (binary classification) Learning and hyper-parameters were not tuned; script serves as an example """
from __future__ import print_function,division
import numpy as np
import tensorflow as tf
from tensorflow.contrib.rnn import GRUCell
from tensorflow.python.ops.rnn import bidirectional_dynamic_rnn as bi_rnn
from tqdm import tqdm
from Attention.attention import attention
import Attention.imdb as imdb
from Attention.utils import get_vocabulary_size, fit_in_vocabulary, zero_pad, batch_generator
NUM_WORDS = 10000
INDEX_FROM = 3
SEQUENCE_LENGTH = 250
EMBEDDING_DIM = 100
HIDDEN_SIZE = 150
ATTENTION_SIZE = 50
KEEP_PROB = 0.8
BATCH_SIZE = 256
NUM_EPOCHS = 3  # Model easily overfits without pre-trained word embeddings, which is why we train for only a few epochs
DELTA = 0.5
MODEL_PATH = './model'
# Load the data set
(X_train, y_train), (X_test, y_test) = imdb.load_data(path = './imdb/imdb.npz',num_words=NUM_WORDS, index_from=INDEX_FROM)
# Sequences pre-processing
vocabulary_size = get_vocabulary_size(X_train)
X_test = fit_in_vocabulary(X_test, vocabulary_size)
X_train = zero_pad(X_train, SEQUENCE_LENGTH)
X_test = zero_pad(X_test, SEQUENCE_LENGTH)
# Different placeholders
with tf.name_scope('Inputs'):
    batch_ph = tf.placeholder(tf.int32, [None, SEQUENCE_LENGTH], name='batch_ph')
    target_ph = tf.placeholder(tf.float32, [None], name='target_ph')
    seq_len_ph = tf.placeholder(tf.int32, [None], name='seq_len_ph')
    keep_prob_ph = tf.placeholder(tf.float32, name='keep_prob_ph')
# Embedding layer
with tf.name_scope('Embedding_layer'):
    embeddings_var = tf.Variable(tf.random_uniform([vocabulary_size, EMBEDDING_DIM], -1.0, 1.0), trainable=True)
    tf.summary.histogram('embeddings_var', embeddings_var)
    batch_embedded = tf.nn.embedding_lookup(embeddings_var, batch_ph)
# (Bi-)RNN layer(-s)
rnn_outputs, _ = bi_rnn(GRUCell(HIDDEN_SIZE), GRUCell(HIDDEN_SIZE),
                        inputs=batch_embedded, sequence_length=seq_len_ph, dtype=tf.float32)
tf.summary.histogram('RNN_outputs', rnn_outputs)
# Attention layer
with tf.name_scope('Attention_layer'):
    attention_output, alphas = attention(rnn_outputs, ATTENTION_SIZE, return_alphas=True)
    tf.summary.histogram('alphas', alphas)
# Dropout
drop = tf.nn.dropout(attention_output, keep_prob_ph)
# Fully connected layer
with tf.name_scope('Fully_connected_layer'):
    W = tf.Variable(tf.truncated_normal([HIDDEN_SIZE * 2, 1], stddev=0.1))  # Hidden size is multiplied by 2 for Bi-RNN
    b = tf.Variable(tf.constant(0., shape=[1]))
    y_hat = tf.nn.xw_plus_b(drop, W, b)
    y_hat = tf.squeeze(y_hat)
    tf.summary.histogram('W', W)
with tf.name_scope('Metrics'):
    # Cross-entropy loss and optimizer initialization
    loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=y_hat, labels=target_ph))
    tf.summary.scalar('loss', loss)
    optimizer = tf.train.AdamOptimizer(learning_rate=1e-3).minimize(loss)
    # Accuracy metric
    accuracy = tf.reduce_mean(tf.cast(tf.equal(tf.round(tf.sigmoid(y_hat)), target_ph), tf.float32))
    tf.summary.scalar('accuracy', accuracy)
merged = tf.summary.merge_all()
# Batch generators
train_batch_generator = batch_generator(X_train, y_train, BATCH_SIZE)
test_batch_generator = batch_generator(X_test, y_test, BATCH_SIZE)
train_writer = tf.summary.FileWriter('./logdir/train', accuracy.graph)
test_writer = tf.summary.FileWriter('./logdir/test', accuracy.graph)
session_conf = tf.ConfigProto(gpu_options=tf.GPUOptions(allow_growth=True))
saver = tf.train.Saver()
if __name__ == "__main__":
    with tf.Session(config=session_conf) as sess:
        sess.run(tf.global_variables_initializer())
        print("Start learning...")
        for epoch in range(NUM_EPOCHS):
            loss_train = 0
            loss_test = 0
            accuracy_train = 0
            accuracy_test = 0
            print("epoch: {}\t".format(epoch), end="")
            # Training
            num_batches = X_train.shape[0] // BATCH_SIZE
            for b in tqdm(range(num_batches)):
                x_batch, y_batch = next(train_batch_generator)
                seq_len = np.array([list(x).index(0) + 1 for x in x_batch])  # actual lengths of sequences
                loss_tr, acc, _, summary = sess.run([loss, accuracy, optimizer, merged],
                                                    feed_dict={batch_ph: x_batch,
                                                               target_ph: y_batch,
                                                               seq_len_ph: seq_len,
                                                               keep_prob_ph: KEEP_PROB})
                accuracy_train += acc
                loss_train = loss_tr * DELTA + loss_train * (1 - DELTA)
                train_writer.add_summary(summary, b + num_batches * epoch)
            accuracy_train /= num_batches
            # Testing
            num_batches = X_test.shape[0] // BATCH_SIZE
            for b in tqdm(range(num_batches)):
                x_batch, y_batch = next(test_batch_generator)
                seq_len = np.array([list(x).index(0) + 1 for x in x_batch])  # actual lengths of sequences
                loss_test_batch, acc, summary = sess.run([loss, accuracy, merged],
                                                         feed_dict={batch_ph: x_batch,
                                                                    target_ph: y_batch,
                                                                    seq_len_ph: seq_len,
                                                                    keep_prob_ph: 1.0})
                accuracy_test += acc
                loss_test += loss_test_batch
                test_writer.add_summary(summary, b + num_batches * epoch)
            accuracy_test /= num_batches
            loss_test /= num_batches
            print("loss: {:.3f}, val_loss: {:.3f}, acc: {:.3f}, val_acc: {:.3f}".format(
                loss_train, loss_test, accuracy_train, accuracy_test
            ))
        train_writer.close()
        test_writer.close()
        saver.save(sess, MODEL_PATH)
        print("Run 'tensorboard --logdir=./logdir' to checkout tensorboard logs.")
import tensorflow as tf
def attention(inputs, attention_size, time_major=False, return_alphas=False):
""" Attention mechanism layer which reduces RNN/Bi-RNN outputs with Attention vector. The idea was proposed in the article by Z. Yang et al., "Hierarchical Attention Networks for Document Classification", 2016: http://www.aclweb.org/anthology/N16-1174. Variables notation is also inherited from the article Args: inputs: The Attention inputs. Matches outputs of RNN/Bi-RNN layer (not final state): In case of RNN, this must be RNN outputs `Tensor`: If time_major == False (default), this must be a tensor of shape: `[batch_size, max_time, cell.output_size]`. If time_major == True, this must be a tensor of shape: `[max_time, batch_size, cell.output_size]`. In case of Bidirectional RNN, this must be a tuple (outputs_fw, outputs_bw) containing the forward and the backward RNN outputs `Tensor`. If time_major == False (default), outputs_fw is a `Tensor` shaped: `[batch_size, max_time, cell_fw.output_size]` and outputs_bw is a `Tensor` shaped: `[batch_size, max_time, cell_bw.output_size]`. If time_major == True, outputs_fw is a `Tensor` shaped: `[max_time, batch_size, cell_fw.output_size]` and outputs_bw is a `Tensor` shaped: `[max_time, batch_size, cell_bw.output_size]`. attention_size: Linear size of the Attention weights. time_major: The shape format of the `inputs` Tensors. If true, these `Tensors` must be shaped `[max_time, batch_size, depth]`. If false, these `Tensors` must be shaped `[batch_size, max_time, depth]`. Using `time_major = True` is a bit more efficient because it avoids transposes at the beginning and end of the RNN calculation. However, most TensorFlow data is batch-major, so by default this function accepts input and emits output in batch-major form. return_alphas: Whether to return attention coefficients variable along with layer's output. Used for visualization purpose. Returns: The Attention output `Tensor`. In case of RNN, this will be a `Tensor` shaped: `[batch_size, cell.output_size]`. In case of Bidirectional RNN, this will be a `Tensor` shaped: `[batch_size, cell_fw.output_size + cell_bw.output_size]`. """
    if isinstance(inputs, tuple):
        # In case of Bi-RNN, concatenate the forward and the backward RNN outputs.
        inputs = tf.concat(inputs, 2)
    if time_major:
        # (T,B,D) => (B,T,D)
        inputs = tf.transpose(inputs, [1, 0, 2])
    hidden_size = inputs.shape[2].value  # D value - hidden size of the RNN layer
    initializer = tf.random_normal_initializer(stddev=0.1)
    # Trainable parameters
    w_omega = tf.get_variable(name="w_omega", shape=[hidden_size, attention_size], initializer=initializer)
    b_omega = tf.get_variable(name="b_omega", shape=[attention_size], initializer=initializer)
    u_omega = tf.get_variable(name="u_omega", shape=[attention_size], initializer=initializer)
    with tf.name_scope('v'):
        # Applying fully connected layer with non-linear activation to each of the B*T timestamps;
        # the shape of `v` is (B,T,D)*(D,A)=(B,T,A), where A=attention_size
        v = tf.tanh(tf.tensordot(inputs, w_omega, axes=1) + b_omega)
    # For each of the timestamps its vector of size A from `v` is reduced with `u` vector
    vu = tf.tensordot(v, u_omega, axes=1, name='vu')  # (B,T) shape
    alphas = tf.nn.softmax(vu, name='alphas')         # (B,T) shape
    # Output of (Bi-)RNN is reduced with attention vector; the result has (B,D) shape
    output = tf.reduce_sum(inputs * tf.expand_dims(alphas, -1), 1)
    if not return_alphas:
        return output
    else:
        return output, alphas