Attention mechanism flow chart
2022-07-28 10:40:00 【Upper back left love】
The attention mechanism is used to improve the information-processing ability of neural networks.
The attention mechanism is mainly divided into two parts: the first part is self-attention (an improved way of computing attention), and the second part is the Transformer (which defines the encode-decode process).
(Flow chart of the attention mechanism omitted.)
In essence, the attention mechanism works as follows: given a query vector Q, its relevance to each key is computed, the resulting weights are attached to the corresponding values, and the attention value is obtained.
Advantage: there is no need to feed all N pieces of input information into the neural network; only the task-related information selected from X is passed in. The computation has three steps (a short sketch follows the list):
- Information input
- Compute the attention distribution A = Softmax(s(key, q))
- Use the attention distribution A to compute a weighted average of the information
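A minimal NumPy sketch of these three steps, using a simple dot product as the score function s(key, q); the array sizes and names are illustrative assumptions, not taken from the article:

import numpy as np

def soft_attention(q, keys, values):
    scores = keys @ q                                # s(key, q) for each of the N inputs, shape (N,)
    scores = scores - scores.max()                   # subtract the max for numerical stability
    A = np.exp(scores) / np.exp(scores).sum()        # attention distribution A = Softmax(s(key, q))
    return A @ values                                # weighted average of the information

X = np.random.randn(5, 4)                            # information input: N = 5 items of dimension 4
q = np.random.randn(4)                               # task-related query vector
print(soft_attention(q, X, X).shape)                 # keys == values == X here; output shape (4,)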
The attention mechanism also solves the long-distance dependency problem of input sequences in RNNs and CNNs.
In traditional neural networks, there are two ways to establish long-distance dependencies within an input sequence: the first is to increase the number of layers, using a deep network for long-distance information interaction; the second is to use a fully connected network (which cannot handle variable-length sequences, because the connection weights are the same regardless of the input length).
Self-attention instead generates the connection weights dynamically, so it can handle variable-length information sequences. The attention function is computed as follows:

Q denotes the query, K the keys, and V the values. When key == value, we have the common attention model; when key != value, we have the key-value-pair model. The calculation (a minimal sketch follows):
1. Take the dot product of Q and K; to prevent the result from becoming too large, divide by a scaling factor $\sqrt{d_k}$, where $d_k$ is the dimension of the query and key vectors.
2. Normalize the result into a probability distribution with Softmax.
3. Multiply by the matrix V to obtain the weighted sum.
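A minimal NumPy sketch of the three calculation steps above; when Q, K and V are all derived from the same input X, this is exactly self-attention. Shapes are illustrative assumptions:

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                          # step 1: dot product of Q and K, scaled by sqrt(d_k)
    scores = scores - scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # step 2: Softmax -> probability distribution
    return weights @ V                                       # step 3: multiply by V to get the weighted sum

T, d_k = 7, 64                                               # a length-7 sequence of 64-dimensional vectors
X = np.random.randn(T, d_k)
print(scaled_dot_product_attention(X, X, X).shape)           # self-attention: (7, 64)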
Transformer (Attention Is All You Need) in detail
The core mechanism of the Transformer is attention.
The Transformer is essentially a Seq2Seq model: the encoder on the left reads the input, and the decoder on the right produces the output.
**Transformer Encoder**:
sub-layer-1: multi-head self-attention mechanism, used to compute self-attention
sub-layer-2: position-wise feed-forward network, a simple fully connected network that performs the same operation on the vector at each position (linear transformation + ReLU activation; input/output dimension 512, middle layer 2048)
Every sub-layer uses a residual connection followed by layer normalization: LayerNorm(X + sublayer(X)); a sketch of sub-layer-2 and this wrapper follows below.
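A minimal NumPy sketch of sub-layer-2 together with the LayerNorm(X + sublayer(X)) wrapper; the dimensions 512 and 2048 follow the text, while the random weights stand in for learned parameters:

import numpy as np

d_model, d_ff = 512, 2048                              # input/output size 512, middle layer 2048

W1, b1 = 0.02 * np.random.randn(d_model, d_ff), np.zeros(d_ff)
W2, b2 = 0.02 * np.random.randn(d_ff, d_model), np.zeros(d_model)

def feed_forward(x):
    # The same linear transformation + ReLU + linear transformation at every position
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

X = np.random.randn(10, d_model)                       # 10 positions in the sequence
out = layer_norm(X + feed_forward(X))                  # LayerNorm(X + sublayer(X))
print(out.shape)                                       # (10, 512)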
**Transformer Decoder**:
sub-layer-1: self-attention, which requires a mask (so that a position cannot attend to later positions)
sub-layer-2: position-wise feed-forward network, the same as in the encoder
sub-layer-3: encoder-decoder attention computation
- Encoder-Decoder attention vs. self-attention
Both use multi-head attention, but encoder-decoder attention follows the traditional attention pattern: the Query is the encoded value at the current decoding position i, already computed by the self-attention sub-layer, while the Key and Value are both the Encoder output. This is what distinguishes it from self-attention, and it is contrasted in the sketch below.
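A minimal NumPy sketch contrasting the decoder's masked self-attention with encoder-decoder attention; sequence lengths and dimensions are illustrative assumptions:

import numpy as np

def sdp_attention(Q, K, V, mask=None):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)          # masked positions receive ~zero weight after Softmax
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

T_dec, T_enc, d = 5, 8, 64
dec_x = np.random.randn(T_dec, d)                      # decoder-side representations
enc_out = np.random.randn(T_enc, d)                    # Encoder output

# Masked self-attention: position i may only attend to positions <= i
causal_mask = np.tril(np.ones((T_dec, T_dec), dtype=bool))
dec_self = sdp_attention(dec_x, dec_x, dec_x, mask=causal_mask)

# Encoder-Decoder attention: Query from the decoder side, Key and Value from the Encoder output
dec_cross = sdp_attention(dec_self, enc_out, enc_out)
print(dec_self.shape, dec_cross.shape)                 # (5, 64) (5, 64)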
multi-head attention
Basic attention uses scaled dot-product attention; multi-head attention runs several such attention operations in parallel over different linear projections of the input and concatenates the results, as in the sketch below.
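A minimal NumPy sketch of multi-head attention built on top of scaled dot-product attention; the head count, dimensions, and random projection matrices are illustrative stand-ins for learned weights:

import numpy as np

def scaled_dot_product(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

def multi_head_attention(X, num_heads=8, d_model=512):
    d_head = d_model // num_heads
    heads = []
    for _ in range(num_heads):
        # Each head projects the same input into its own query, key and value spaces
        Wq, Wk, Wv = (0.02 * np.random.randn(d_model, d_head) for _ in range(3))
        heads.append(scaled_dot_product(X @ Wq, X @ Wk, X @ Wv))
    Wo = 0.02 * np.random.randn(d_model, d_model)
    return np.concatenate(heads, axis=-1) @ Wo         # concatenate the heads, then project back

X = np.random.randn(10, 512)                           # 10 positions, model dimension 512
print(multi_head_attention(X).shape)                   # (10, 512)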
Example: an RNN (GRU) with an attention layer, trained on the IMDB dataset (TensorFlow 1.x):
""" Toy example of attention layer use Train RNN (GRU) on IMDB dataset (binary classification) Learning and hyper-parameters were not tuned; script serves as an example """
from __future__ import print_function,division
import numpy as np
import tensorflow as tf
from tensorflow.contrib.rnn import GRUCell
from tensorflow.python.ops.rnn import bidirectional_dynamic_rnn as bi_rnn
from tqdm import tqdm
from Attention.attention import attention
import Attention.imdb as imdb
from Attention.utils import get_vocabulary_size, fit_in_vocabulary, zero_pad, batch_generator
NUM_WORDS = 10000
INDEX_FROM = 3
SEQUENCE_LENGTH = 250
EMBEDDING_DIM = 100
HIDDEN_SIZE = 150
ATTENTION_SIZE = 50
KEEP_PROB = 0.8
BATCH_SIZE = 256
NUM_EPOCHS = 3  # Model easily overfits without pre-trained word embeddings, which is why we train for only a few epochs
DELTA = 0.5
MODEL_PATH = './model'
# Load the data set
(X_train, y_train), (X_test, y_test) = imdb.load_data(path = './imdb/imdb.npz',num_words=NUM_WORDS, index_from=INDEX_FROM)
# Sequences pre-processing
vocabulary_size = get_vocabulary_size(X_train)
X_test = fit_in_vocabulary(X_test, vocabulary_size)
X_train = zero_pad(X_train, SEQUENCE_LENGTH)
X_test = zero_pad(X_test, SEQUENCE_LENGTH)
# Different placeholders
with tf.name_scope('Inputs'):
    batch_ph = tf.placeholder(tf.int32, [None, SEQUENCE_LENGTH], name='batch_ph')
    target_ph = tf.placeholder(tf.float32, [None], name='target_ph')
    seq_len_ph = tf.placeholder(tf.int32, [None], name='seq_len_ph')
    keep_prob_ph = tf.placeholder(tf.float32, name='keep_prob_ph')
# Embedding layer
with tf.name_scope('Embedding_layer'):
    embeddings_var = tf.Variable(tf.random_uniform([vocabulary_size, EMBEDDING_DIM], -1.0, 1.0), trainable=True)
    tf.summary.histogram('embeddings_var', embeddings_var)
    batch_embedded = tf.nn.embedding_lookup(embeddings_var, batch_ph)
# (Bi-)RNN layer(-s)
rnn_outputs, _ = bi_rnn(GRUCell(HIDDEN_SIZE), GRUCell(HIDDEN_SIZE),
                        inputs=batch_embedded, sequence_length=seq_len_ph, dtype=tf.float32)
tf.summary.histogram('RNN_outputs', rnn_outputs)
# Attention layer
with tf.name_scope('Attention_layer'):
    attention_output, alphas = attention(rnn_outputs, ATTENTION_SIZE, return_alphas=True)
    tf.summary.histogram('alphas', alphas)
# Dropout
drop = tf.nn.dropout(attention_output, keep_prob_ph)
# Fully connected layer
with tf.name_scope('Fully_connected_layer'):
    W = tf.Variable(tf.truncated_normal([HIDDEN_SIZE * 2, 1], stddev=0.1))  # Hidden size is multiplied by 2 for Bi-RNN
    b = tf.Variable(tf.constant(0., shape=[1]))
    y_hat = tf.nn.xw_plus_b(drop, W, b)
    y_hat = tf.squeeze(y_hat)
    tf.summary.histogram('W', W)
with tf.name_scope('Metrics'):
    # Cross-entropy loss and optimizer initialization
    loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=y_hat, labels=target_ph))
    tf.summary.scalar('loss', loss)
    optimizer = tf.train.AdamOptimizer(learning_rate=1e-3).minimize(loss)
    # Accuracy metric
    accuracy = tf.reduce_mean(tf.cast(tf.equal(tf.round(tf.sigmoid(y_hat)), target_ph), tf.float32))
    tf.summary.scalar('accuracy', accuracy)
merged = tf.summary.merge_all()
# Batch generators
train_batch_generator = batch_generator(X_train, y_train, BATCH_SIZE)
test_batch_generator = batch_generator(X_test, y_test, BATCH_SIZE)
train_writer = tf.summary.FileWriter('./logdir/train', accuracy.graph)
test_writer = tf.summary.FileWriter('./logdir/test', accuracy.graph)
session_conf = tf.ConfigProto(gpu_options=tf.GPUOptions(allow_growth=True))
saver = tf.train.Saver()
if __name__ == "__main__":
    with tf.Session(config=session_conf) as sess:
        sess.run(tf.global_variables_initializer())
        print("Start learning...")
        for epoch in range(NUM_EPOCHS):
            loss_train = 0
            loss_test = 0
            accuracy_train = 0
            accuracy_test = 0
            print("epoch: {}\t".format(epoch), end="")
            # Training
            num_batches = X_train.shape[0] // BATCH_SIZE
            for b in tqdm(range(num_batches)):
                x_batch, y_batch = next(train_batch_generator)
                seq_len = np.array([list(x).index(0) + 1 for x in x_batch])  # actual lengths of sequences
                loss_tr, acc, _, summary = sess.run([loss, accuracy, optimizer, merged],
                                                    feed_dict={batch_ph: x_batch,
                                                               target_ph: y_batch,
                                                               seq_len_ph: seq_len,
                                                               keep_prob_ph: KEEP_PROB})
                accuracy_train += acc
                loss_train = loss_tr * DELTA + loss_train * (1 - DELTA)
                train_writer.add_summary(summary, b + num_batches * epoch)
            accuracy_train /= num_batches
            # Testing
            num_batches = X_test.shape[0] // BATCH_SIZE
            for b in tqdm(range(num_batches)):
                x_batch, y_batch = next(test_batch_generator)
                seq_len = np.array([list(x).index(0) + 1 for x in x_batch])  # actual lengths of sequences
                loss_test_batch, acc, summary = sess.run([loss, accuracy, merged],
                                                         feed_dict={batch_ph: x_batch,
                                                                    target_ph: y_batch,
                                                                    seq_len_ph: seq_len,
                                                                    keep_prob_ph: 1.0})
                accuracy_test += acc
                loss_test += loss_test_batch
                test_writer.add_summary(summary, b + num_batches * epoch)
            accuracy_test /= num_batches
            loss_test /= num_batches
            print("loss: {:.3f}, val_loss: {:.3f}, acc: {:.3f}, val_acc: {:.3f}".format(
                loss_train, loss_test, accuracy_train, accuracy_test
            ))
        train_writer.close()
        test_writer.close()
        saver.save(sess, MODEL_PATH)
        print("Run 'tensorboard --logdir=./logdir' to checkout tensorboard logs.")
import tensorflow as tf
def attention(inputs, attention_size, time_major=False, return_alphas=False):
""" Attention mechanism layer which reduces RNN/Bi-RNN outputs with Attention vector. The idea was proposed in the article by Z. Yang et al., "Hierarchical Attention Networks for Document Classification", 2016: http://www.aclweb.org/anthology/N16-1174. Variables notation is also inherited from the article Args: inputs: The Attention inputs. Matches outputs of RNN/Bi-RNN layer (not final state): In case of RNN, this must be RNN outputs `Tensor`: If time_major == False (default), this must be a tensor of shape: `[batch_size, max_time, cell.output_size]`. If time_major == True, this must be a tensor of shape: `[max_time, batch_size, cell.output_size]`. In case of Bidirectional RNN, this must be a tuple (outputs_fw, outputs_bw) containing the forward and the backward RNN outputs `Tensor`. If time_major == False (default), outputs_fw is a `Tensor` shaped: `[batch_size, max_time, cell_fw.output_size]` and outputs_bw is a `Tensor` shaped: `[batch_size, max_time, cell_bw.output_size]`. If time_major == True, outputs_fw is a `Tensor` shaped: `[max_time, batch_size, cell_fw.output_size]` and outputs_bw is a `Tensor` shaped: `[max_time, batch_size, cell_bw.output_size]`. attention_size: Linear size of the Attention weights. time_major: The shape format of the `inputs` Tensors. If true, these `Tensors` must be shaped `[max_time, batch_size, depth]`. If false, these `Tensors` must be shaped `[batch_size, max_time, depth]`. Using `time_major = True` is a bit more efficient because it avoids transposes at the beginning and end of the RNN calculation. However, most TensorFlow data is batch-major, so by default this function accepts input and emits output in batch-major form. return_alphas: Whether to return attention coefficients variable along with layer's output. Used for visualization purpose. Returns: The Attention output `Tensor`. In case of RNN, this will be a `Tensor` shaped: `[batch_size, cell.output_size]`. In case of Bidirectional RNN, this will be a `Tensor` shaped: `[batch_size, cell_fw.output_size + cell_bw.output_size]`. """
    if isinstance(inputs, tuple):
        # In case of Bi-RNN, concatenate the forward and the backward RNN outputs.
        inputs = tf.concat(inputs, 2)
    if time_major:
        # (T,B,D) => (B,T,D)
        inputs = tf.transpose(inputs, [1, 0, 2])
    hidden_size = inputs.shape[2].value  # D value - hidden size of the RNN layer
    initializer = tf.random_normal_initializer(stddev=0.1)
    # Trainable parameters
    w_omega = tf.get_variable(name="w_omega", shape=[hidden_size, attention_size], initializer=initializer)
    b_omega = tf.get_variable(name="b_omega", shape=[attention_size], initializer=initializer)
    u_omega = tf.get_variable(name="u_omega", shape=[attention_size], initializer=initializer)
    with tf.name_scope('v'):
        # Applying fully connected layer with non-linear activation to each of the B*T timestamps;
        # the shape of `v` is (B,T,D)*(D,A)=(B,T,A), where A=attention_size
        v = tf.tanh(tf.tensordot(inputs, w_omega, axes=1) + b_omega)
    # For each of the timestamps its vector of size A from `v` is reduced with `u` vector
    vu = tf.tensordot(v, u_omega, axes=1, name='vu')  # (B,T) shape
    alphas = tf.nn.softmax(vu, name='alphas')         # (B,T) shape
    # Output of (Bi-)RNN is reduced with attention vector; the result has (B,D) shape
    output = tf.reduce_sum(inputs * tf.expand_dims(alphas, -1), 1)
    if not return_alphas:
        return output
    else:
        return output, alphas