tensorflow的legacy_seq2seq

tensorflow要重新给出一套seq2seq的接口,把之前的seq2seq搬到了legacy_seq2seq下,今天读的就是来自这里的代码。目前很多代码还是使用了老的seq2seq接口,因此仍有熟悉的必要。

_extract_argmax_and_embed

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
def _extract_argmax_and_embed(embedding,
output_projection=None,
update_embedding=True):
"""Get a loop_function that extracts the previous symbol and embeds it.
Args:
embedding: embedding tensor for symbols.
output_projection: None or a pair (W, B). If provided, each fed previous
output will first be multiplied by W and added B.
update_embedding: Boolean; if False, the gradients will not propagate
through the embeddings.
Returns:
A loop function.
"""
def loop_function(prev, _):
if output_projection is not None:
prev = nn_ops.xw_plus_b(prev, output_projection[0], output_projection[1])
prev_symbol = math_ops.argmax(prev, 1)
# Note that gradients will not propagate through the second parameter of
# embedding_lookup.
emb_prev = embedding_ops.embedding_lookup(embedding, prev_symbol)
if not update_embedding:
emb_prev = array_ops.stop_gradient(emb_prev)
return emb_prev
return loop_function

rnn_decoder

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
def rnn_decoder(decoder_inputs,
initial_state,
cell,
loop_function=None,
scope=None):
"""RNN decoder for the sequence-to-sequence model.
Args:
decoder_inputs: A list of 2D Tensors [batch_size x input_size].
initial_state: 2D Tensor with shape [batch_size x cell.state_size].
cell: core_rnn_cell.RNNCell defining the cell function and size.
loop_function: If not None, this function will be applied to the i-th output
in order to generate the i+1-st input, and decoder_inputs will be ignored,
except for the first element ("GO" symbol). This can be used for decoding,
but also for training to emulate http://arxiv.org/abs/1506.03099.
Signature -- loop_function(prev, i) = next
* prev is a 2D Tensor of shape [batch_size x output_size],
* i is an integer, the step number (when advanced control is needed),
* next is a 2D Tensor of shape [batch_size x input_size].
scope: VariableScope for the created subgraph; defaults to "rnn_decoder".
Returns:
A tuple of the form (outputs, state), where:
outputs: A list of the same length as decoder_inputs of 2D Tensors with
shape [batch_size x output_size] containing generated outputs.
state: The state of each cell at the final time-step.
It is a 2D Tensor of shape [batch_size x cell.state_size].
(Note that in some cases, like basic RNN cell or GRU cell, outputs and
states can be the same. They are different for LSTM cells though.)
"""
with variable_scope.variable_scope(scope or "rnn_decoder"):
state = initial_state
outputs = []
prev = None
for i, inp in enumerate(decoder_inputs):
if loop_function is not None and prev is not None:
with variable_scope.variable_scope("loop_function", reuse=True):
inp = loop_function(prev, i)
if i > 0:
variable_scope.get_variable_scope().reuse_variables()
output, state = cell(inp, state)
outputs.append(output)
if loop_function is not None:
prev = output
return outputs, state

decoder_inputs:是a list,其中的每一个元素表示的是t_i时刻的输入,每一时刻的输入又会有batch_size个,每一个输入(通差是表示一个word或token)又是input_size维度的。
loop_function: 如果loop_function有设置的话,decoder input中第一个”GO”会输入,但之后时刻的input就会被忽略,取代的是input_ti+1 = loop_function(output_ti)
这里定义的loop_function,有2个参数,(prev,i),输出为next

输出:
outputs:既然是每一时刻的input都会对应得到一个output,自然outputs的shape和decoder_inputs是一样,是a list,每个元素的shape=[batch_size, input_size](但是这里为了区别,认为是output_size)
state:最后一个时刻t的cell state,shape=[batch_size, cell.state_size]

basic_rnn_seq2seq

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
def basic_rnn_seq2seq(encoder_inputs,
decoder_inputs,
cell,
dtype=dtypes.float32,
scope=None):
"""Basic RNN sequence-to-sequence model.
This model first runs an RNN to encode encoder_inputs into a state vector,
then runs decoder, initialized with the last encoder state, on decoder_inputs.
Encoder and decoder use the same RNN cell type, but don't share parameters.
Args:
encoder_inputs: A list of 2D Tensors [batch_size x input_size].
decoder_inputs: A list of 2D Tensors [batch_size x input_size].
cell: core_rnn_cell.RNNCell defining the cell function and size.
dtype: The dtype of the initial state of the RNN cell (default: tf.float32).
scope: VariableScope for the created subgraph; default: "basic_rnn_seq2seq".
Returns:
A tuple of the form (outputs, state), where:
outputs: A list of the same length as decoder_inputs of 2D Tensors with
shape [batch_size x output_size] containing the generated outputs.
state: The state of each decoder cell in the final time-step.
It is a 2D Tensor of shape [batch_size x cell.state_size].
"""
with variable_scope.variable_scope(scope or "basic_rnn_seq2seq"):
enc_cell = copy.deepcopy(cell)
_, enc_state = core_rnn.static_rnn(enc_cell, encoder_inputs, dtype=dtype)
return rnn_decoder(decoder_inputs, enc_state, cell)

encoder_inputs:a list,每个元素是时刻t的输入,每一时刻又存在batch_size个输入(word or token),并且每个token用input_size来表示(embedding)。因此,是a list of [batch_size, input_size]
decoder_inputs:同上,但是这两个list的长度可能不同,前者根据encoder_max_length指定,decoder根据decoder_max_length指定。
输出:
outputs:shape和decoder_inputs相同,差别在于这里用output_size和input_size区别【why
state:还是最后一个时刻的cell state,[batch_size, cell.state_size]

注意到这里用到深拷贝:

深拷贝是在另一块地址中创建一个新的变量或容器,同时容器内的元素的地址也是新开辟的,仅仅是值相同而已,是完全的副本。也就是说( 新瓶装新酒 )。

encode阶段使用的是core_rnn.static_rnn()不知道这个函数和别的rnn有什么不同?

decode阶段,很基本,直接使用了上面提到的rnn_decoder来生成最后的outputs和state,返回。

static_rnn

代码在这,比较繁琐,就不详细解读了。

embedding_rnn_decoder

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
def embedding_rnn_decoder(decoder_inputs,
initial_state,
cell,
num_symbols,
embedding_size,
output_projection=None,
feed_previous=False,
update_embedding_for_previous=True,
scope=None):
"""RNN decoder with embedding and a pure-decoding option.
Args:
decoder_inputs: A list of 1D batch-sized int32 Tensors (decoder inputs).
initial_state: 2D Tensor [batch_size x cell.state_size].
cell: core_rnn_cell.RNNCell defining the cell function.
num_symbols: Integer, how many symbols come into the embedding.
embedding_size: Integer, the length of the embedding vector for each symbol.
output_projection: None or a pair (W, B) of output projection weights and
biases; W has shape [output_size x num_symbols] and B has
shape [num_symbols]; if provided and feed_previous=True, each fed
previous output will first be multiplied by W and added B.
feed_previous: Boolean; if True, only the first of decoder_inputs will be
used (the "GO" symbol), and all other decoder inputs will be generated by:
next = embedding_lookup(embedding, argmax(previous_output)),
In effect, this implements a greedy decoder. It can also be used
during training to emulate http://arxiv.org/abs/1506.03099.
If False, decoder_inputs are used as given (the standard decoder case).
update_embedding_for_previous: Boolean; if False and feed_previous=True,
only the embedding for the first symbol of decoder_inputs (the "GO"
symbol) will be updated by back propagation. Embeddings for the symbols
generated from the decoder itself remain unchanged. This parameter has
no effect if feed_previous=False.
scope: VariableScope for the created subgraph; defaults to
"embedding_rnn_decoder".
Returns:
A tuple of the form (outputs, state), where:
outputs: A list of the same length as decoder_inputs of 2D Tensors. The
output is of shape [batch_size x cell.output_size] when
output_projection is not None (and represents the dense representation
of predicted tokens). It is of shape [batch_size x num_decoder_symbols]
when output_projection is None.
state: The state of each decoder cell in each time-step. This is a list
with length len(decoder_inputs) -- one item for each time-step.
It is a 2D Tensor of shape [batch_size x cell.state_size].
Raises:
ValueError: When output_projection has the wrong shape.
"""
with variable_scope.variable_scope(scope or "embedding_rnn_decoder") as scope:
if output_projection is not None:
dtype = scope.dtype
proj_weights = ops.convert_to_tensor(output_projection[0], dtype=dtype)
proj_weights.get_shape().assert_is_compatible_with([None, num_symbols])
proj_biases = ops.convert_to_tensor(output_projection[1], dtype=dtype)
proj_biases.get_shape().assert_is_compatible_with([num_symbols])
embedding = variable_scope.get_variable("embedding",
[num_symbols, embedding_size])
loop_function = _extract_argmax_and_embed(
embedding, output_projection,
update_embedding_for_previous) if feed_previous else None
emb_inp = (embedding_ops.embedding_lookup(embedding, i)
for i in decoder_inputs)
return rnn_decoder(
emb_inp, initial_state, cell, loop_function=loop_function)

刚才讲了一个basic的decoder叫rnn_decoder:rnn_decoder(decoder_inputs,initial_state,cell,loop_function=None,scope=None),现在来一个稍微高级一点的。
对比一下发现这个decoder没有loop_function,多出来了num_symbolsembedding_sizeoutput_projection=Nonefeed_previous=Falseupdate_embedding_for_previous=True。这些都是什么呢?

参数:
decoder_inputs:既然这个标榜了embedding,那么input肯定和rnn_decoder有些不同。这里input变为1维,[batch_size, ]也就是说,输入不需要自己做embedding了,直接输入tokens在vocab中对应的idx(即ids)即可,内部会自动帮我们进行id到embedding的转化。
num_symbols:就是vocab_size
embedding_size:每个token需要embedding成的维数,比如100
output_projection:(W, b)就是将输出做一个映射。为什么要映射,因为此时input相当于a list of [batch_size, 1],内部帮我们做一个embedding,得到embedded_input=[batch_size, embedding_size ],经过cell之后,得到[batch_size, output_size](这个过程就是之前的rnn_decoder做的事情)。这样之后,如果我们设置了feed_previous=True,也就是需要将前一时刻的output作为下一时刻的input,那么前一时刻的output中要从vocab_size中选出一个分数最高的token来,即argmax(previous_output)。过程如下图描述的那样:

但是,现在的output维度是output_size,并不能知道每个vocab的得分情况。因此要从output_size映射到vocab_size(这里的num_symbols)。
我们知道,x(某一时刻的output)的shape=[batch_size, output_size],映射的公式是xw+b,那么w的shape=[output_zize, num_symbols]

update_embedding_for_previous:如果前一时刻的output不作为当前的input的话(feed_previous=False),这个参数没影响();否则,该参数默认是True,但如果设置成false,则表示不对前一个embedding进行更新,那么bp的时候只会更新”GO”的embedding,其他token(decoder生成的)embedding不变。

输出:
outputs:如果output_projection=None的话,也就是不进行映射(直接输出的是num_symbols的个数),那么a list of [batch_size, num_symbols];如果不为None,说明outputs要进行映射,则outputs是a list of [batch_size, num_symbols]
state同上

embedding_rnn_seq2seq

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
def embedding_rnn_seq2seq(encoder_inputs,
decoder_inputs,
cell,
num_encoder_symbols,
num_decoder_symbols,
embedding_size,
output_projection=None,
feed_previous=False,
dtype=None,
scope=None):
"""Embedding RNN sequence-to-sequence model.
This model first embeds encoder_inputs by a newly created embedding (of shape
[num_encoder_symbols x input_size]). Then it runs an RNN to encode
embedded encoder_inputs into a state vector. Next, it embeds decoder_inputs
by another newly created embedding (of shape [num_decoder_symbols x
input_size]). Then it runs RNN decoder, initialized with the last
encoder state, on embedded decoder_inputs.
Args:
encoder_inputs: A list of 1D int32 Tensors of shape [batch_size].
decoder_inputs: A list of 1D int32 Tensors of shape [batch_size].
cell: core_rnn_cell.RNNCell defining the cell function and size.
num_encoder_symbols: Integer; number of symbols on the encoder side.
num_decoder_symbols: Integer; number of symbols on the decoder side.
embedding_size: Integer, the length of the embedding vector for each symbol.
output_projection: None or a pair (W, B) of output projection weights and
biases; W has shape [output_size x num_decoder_symbols] and B has
shape [num_decoder_symbols]; if provided and feed_previous=True, each
fed previous output will first be multiplied by W and added B.
feed_previous: Boolean or scalar Boolean Tensor; if True, only the first
of decoder_inputs will be used (the "GO" symbol), and all other decoder
inputs will be taken from previous outputs (as in embedding_rnn_decoder).
If False, decoder_inputs are used as given (the standard decoder case).
dtype: The dtype of the initial state for both the encoder and encoder
rnn cells (default: tf.float32).
scope: VariableScope for the created subgraph; defaults to
"embedding_rnn_seq2seq"
Returns:
A tuple of the form (outputs, state), where:
outputs: A list of the same length as decoder_inputs of 2D Tensors. The
output is of shape [batch_size x cell.output_size] when
output_projection is not None (and represents the dense representation
of predicted tokens). It is of shape [batch_size x num_decoder_symbols]
when output_projection is None.
state: The state of each decoder cell in each time-step. This is a list
with length len(decoder_inputs) -- one item for each time-step.
It is a 2D Tensor of shape [batch_size x cell.state_size].
"""

既然有了embedding_rnn_decoder,那么对应的就有embedding_rnn_seq2seq。之前讲过basic_rnn_seq2seq(encoder_inputs, decoder_inputs, cell, dtype=dtypes.float32, scope=None)
inputs:还是像之前说的,既然embedding是内部帮我们完成,则inputs shape= a list of [batch_size],每个位置都只是一个token id。内部使用一个embedding wrapper,做lookup,生成a list of [batch_size, embedding_size]
对比之下,多了几个参数:
num_encoder_symbols:通俗的说其实就是encoder端的vocab_size。enc和dec两端词汇量不同主要在于不同语言的translate task中,如果单纯是中文到中文的生成,不存在两端词汇量的不同。
num_decoder_symbols:同上
embedding_size:每个vocab需要用多少维的vector表示
output_projection=None
feed_previous=False:如果feed_previous只是简单的一个True or False,则直接返回embedding_rnn_decoder的结果。重点是feed_previous还能传入一个boolean tensor(暂时无此需求)

attention_decoder

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
def attention_decoder(decoder_inputs,
initial_state,
attention_states,
cell,
output_size=None,
num_heads=1,
loop_function=None,
dtype=None,
scope=None,
initial_state_attention=False):
"""RNN decoder with attention for the sequence-to-sequence model.
In this context "attention" means that, during decoding, the RNN can look up
information in the additional tensor attention_states, and it does this by
focusing on a few entries from the tensor. This model has proven to yield
especially good results in a number of sequence-to-sequence tasks. This
implementation is based on http://arxiv.org/abs/1412.7449 (see below for
details). It is recommended for complex sequence-to-sequence tasks.
Args:
decoder_inputs: A list of 2D Tensors [batch_size x input_size].
initial_state: 2D Tensor [batch_size x cell.state_size].
attention_states: 3D Tensor [batch_size x attn_length x attn_size].
cell: core_rnn_cell.RNNCell defining the cell function and size.
output_size: Size of the output vectors; if None, we use cell.output_size.
num_heads: Number of attention heads that read from attention_states.
loop_function: If not None, this function will be applied to i-th output
in order to generate i+1-th input, and decoder_inputs will be ignored,
except for the first element ("GO" symbol). This can be used for decoding,
but also for training to emulate http://arxiv.org/abs/1506.03099.
Signature -- loop_function(prev, i) = next
* prev is a 2D Tensor of shape [batch_size x output_size],
* i is an integer, the step number (when advanced control is needed),
* next is a 2D Tensor of shape [batch_size x input_size].
dtype: The dtype to use for the RNN initial state (default: tf.float32).
scope: VariableScope for the created subgraph; default: "attention_decoder".
initial_state_attention: If False (default), initial attentions are zero.
If True, initialize the attentions from the initial state and attention
states -- useful when we wish to resume decoding from a previously
stored decoder state and attention states.
Returns:
A tuple of the form (outputs, state), where:
outputs: A list of the same length as decoder_inputs of 2D Tensors of
shape [batch_size x output_size]. These represent the generated outputs.
Output i is computed from input i (which is either the i-th element
of decoder_inputs or loop_function(output {i-1}, i)) as follows.
First, we run the cell on a combination of the input and previous
attention masks:
cell_output, new_state = cell(linear(input, prev_attn), prev_state).
Then, we calculate new attention masks:
new_attn = softmax(V^T * tanh(W * attention_states + U * new_state))
and then we calculate the output:
output = linear(cell_output, new_attn).
state: The state of each decoder cell the final time-step.
It is a 2D Tensor of shape [batch_size x cell.state_size].
Raises:
ValueError: when num_heads is not positive, there are no inputs, shapes
of attention_states are not set, or input size cannot be inferred
from the input.
"""

刚才讲完了embedding_rnn_decoder,则再来看看attention_decoder。
和基本的rnn_decoder相比(rnn_decoder(decoder_inputs, initial_state, cell, loop_function=None, scope=None))
多了几个参数:
attention_states:attention_states作为addition info出现,
output_size=None:如果是None的话默认为cell.output_size
num_heads=1 :应该pay attention的点的个数,比如要focus到attention_states的几个点,默认为只关注1个点
initial_state_attention=False:如果是True的话,attention由state和attention_states进行初始化,如果False,则attention初始化为0

embedding_attention_decoder

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
def embedding_attention_decoder(decoder_inputs,
initial_state,
attention_states,
cell,
num_symbols,
embedding_size,
num_heads=1,
output_size=None,
output_projection=None,
feed_previous=False,
update_embedding_for_previous=True,
dtype=None,
scope=None,
initial_state_attention=False):
"""RNN decoder with embedding and attention and a pure-decoding option.
Args:
decoder_inputs: A list of 1D batch-sized int32 Tensors (decoder inputs).
initial_state: 2D Tensor [batch_size x cell.state_size].
attention_states: 3D Tensor [batch_size x attn_length x attn_size].
cell: core_rnn_cell.RNNCell defining the cell function.
num_symbols: Integer, how many symbols come into the embedding.
embedding_size: Integer, the length of the embedding vector for each symbol.
num_heads: Number of attention heads that read from attention_states.
output_size: Size of the output vectors; if None, use output_size.
output_projection: None or a pair (W, B) of output projection weights and
biases; W has shape [output_size x num_symbols] and B has shape
[num_symbols]; if provided and feed_previous=True, each fed previous
output will first be multiplied by W and added B.
feed_previous: Boolean; if True, only the first of decoder_inputs will be
used (the "GO" symbol), and all other decoder inputs will be generated by:
next = embedding_lookup(embedding, argmax(previous_output)),
In effect, this implements a greedy decoder. It can also be used
during training to emulate http://arxiv.org/abs/1506.03099.
If False, decoder_inputs are used as given (the standard decoder case).
update_embedding_for_previous: Boolean; if False and feed_previous=True,
only the embedding for the first symbol of decoder_inputs (the "GO"
symbol) will be updated by back propagation. Embeddings for the symbols
generated from the decoder itself remain unchanged. This parameter has
no effect if feed_previous=False.
dtype: The dtype to use for the RNN initial states (default: tf.float32).
scope: VariableScope for the created subgraph; defaults to
"embedding_attention_decoder".
initial_state_attention: If False (default), initial attentions are zero.
If True, initialize the attentions from the initial state and attention
states -- useful when we wish to resume decoding from a previously
stored decoder state and attention states.
Returns:
A tuple of the form (outputs, state), where:
outputs: A list of the same length as decoder_inputs of 2D Tensors with
shape [batch_size x output_size] containing the generated outputs.
state: The state of each decoder cell at the final time-step.
It is a 2D Tensor of shape [batch_size x cell.state_size].
Raises:
ValueError: When output_projection has the wrong shape.
"""

其实是前面讲的embedding_decoder和attention_decoder的结合版。

embedding_attention_seq2seq

1
2
3
4
5
6
7
8
9
10
11
12
def embedding_attention_seq2seq(encoder_inputs,
decoder_inputs,
cell,
num_encoder_symbols,
num_decoder_symbols,
embedding_size,
num_heads=1,
output_projection=None,
feed_previous=False,
dtype=None,
scope=None,
initial_state_attention=False)

与embedding_attention_decoder相对应的seq2seq模型

sequence_loss_by_example

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
def sequence_loss_by_example(logits,
targets,
weights,
average_across_timesteps=True,
softmax_loss_function=None,
name=None):
"""Weighted cross-entropy loss for a sequence of logits (per example).
Args:
logits: List of 2D Tensors of shape [batch_size x num_decoder_symbols].
targets: List of 1D batch-sized int32 Tensors of the same length as logits.
weights: List of 1D batch-sized float-Tensors of the same length as logits.
average_across_timesteps: If set, divide the returned cost by the total
label weight.
softmax_loss_function: Function (labels-batch, inputs-batch) -> loss-batch
to be used instead of the standard softmax (the default if this is None).
name: Optional name for this operation, default: "sequence_loss_by_example".
Returns:
1D batch-sized float Tensor: The log-perplexity for each sequence.
Raises:
ValueError: If len(logits) is different from len(targets) or len(weights).
"""
if len(targets) != len(logits) or len(weights) != len(logits):
raise ValueError("Lengths of logits, weights, and targets must be the same "
"%d, %d, %d." % (len(logits), len(weights), len(targets)))
with ops.name_scope(name, "sequence_loss_by_example",
logits + targets + weights):
log_perp_list = []
for logit, target, weight in zip(logits, targets, weights):
if softmax_loss_function is None:
# TODO(irving,ebrevdo): This reshape is needed because
# sequence_loss_by_example is called with scalars sometimes, which
# violates our general scalar strictness policy.
target = array_ops.reshape(target, [-1])
crossent = nn_ops.sparse_softmax_cross_entropy_with_logits(
labels=target, logits=logit)
else:
crossent = softmax_loss_function(target, logit)
log_perp_list.append(crossent * weight)
log_perps = math_ops.add_n(log_perp_list)
if average_across_timesteps:
total_size = math_ops.add_n(weights)
total_size += 1e-12 # Just to avoid division by 0 for all-0 weights.
log_perps /= total_size
return log_perps

返回值:
1D batch-sized float Tensor:为每一个序列(一个batch中有batch_size个sequence)计算其log perplexity,也是名称中by_example的含义

输入:
(注意:一个batch上的所有数据都被pad成相同长度?因此它们的time_length是一样的?)
logits:a list依次存储一系列时刻上的输出,每一时刻的输出都是batch_size为单位的,其中的每一个输入对应的输出是整个vocab上的得分,因此是num_decoder_symbols。因此,logits应该是a list of [batch_size, num_decoder_symbols]
targets:a list表示依次的所有时刻的target,每一时刻又有batch_size个输入,因此对应batch_size个target,因此shape=a list of [batch_size, ]
weights:每个example,在每一时刻都有对自身当前token的权重。因此shape=a list of [batch_size,]
疑问:weights是做什么用的?为什么要对每个token设置权重?

解读代码:
首先会生成一个crossent,shape=[batch_size, ],再和weights相乘,还是得到[batch_size, ],表示每个example在当前时刻t位置的得分(batch_size个),append到log_perp_list中(最终shape是a list of [batch_size, ])
所有的time length循环完毕之后,累加这些time length,得到一个shape=[batch_size,]的变量,叫做log_perps。

sequence_loss

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
def sequence_loss(logits,
targets,
weights,
average_across_timesteps=True,
average_across_batch=True,
softmax_loss_function=None,
name=None):
"""Weighted cross-entropy loss for a sequence of logits, batch-collapsed.
Args:
logits: List of 2D Tensors of shape [batch_size x num_decoder_symbols].
targets: List of 1D batch-sized int32 Tensors of the same length as logits.
weights: List of 1D batch-sized float-Tensors of the same length as logits.
average_across_timesteps: If set, divide the returned cost by the total
label weight.
average_across_batch: If set, divide the returned cost by the batch size.
softmax_loss_function: Function (labels-batch, inputs-batch) -> loss-batch
to be used instead of the standard softmax (the default if this is None).
name: Optional name for this operation, defaults to "sequence_loss".
Returns:
A scalar float Tensor: The average log-perplexity per symbol (weighted).
Raises:
ValueError: If len(logits) is different from len(targets) or len(weights).
"""
with ops.name_scope(name, "sequence_loss", logits + targets + weights):
cost = math_ops.reduce_sum(
sequence_loss_by_example(
logits,
targets,
weights,
average_across_timesteps=average_across_timesteps,
softmax_loss_function=softmax_loss_function))
if average_across_batch:
batch_size = array_ops.shape(targets[0])[0]
return cost / math_ops.cast(batch_size, cost.dtype)
else:
return cost

其实主体还是上面讲的sequence_loss_by_example,只不过对上面的[batch_size,]的结果进行sum,如果默认average_across_batch的话,就sum/batch_size,平均每一个sequence的log perplexity;要是设置了不平均,则返回的是整个batch上的sum of log perplexity

model_with_buckets

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
def model_with_buckets(encoder_inputs,
decoder_inputs,
targets,
weights,
buckets,
seq2seq,
softmax_loss_function=None,
per_example_loss=False,
name=None):
"""Create a sequence-to-sequence model with support for bucketing.
The seq2seq argument is a function that defines a sequence-to-sequence model,
e.g., seq2seq = lambda x, y: basic_rnn_seq2seq(
x, y, core_rnn_cell.GRUCell(24))
Args:
encoder_inputs: A list of Tensors to feed the encoder; first seq2seq input.
decoder_inputs: A list of Tensors to feed the decoder; second seq2seq input.
targets: A list of 1D batch-sized int32 Tensors (desired output sequence).
weights: List of 1D batch-sized float-Tensors to weight the targets.
buckets: A list of pairs of (input size, output size) for each bucket.
seq2seq: A sequence-to-sequence model function; it takes 2 input that
agree with encoder_inputs and decoder_inputs, and returns a pair
consisting of outputs and states (as, e.g., basic_rnn_seq2seq).
softmax_loss_function: Function (labels-batch, inputs-batch) -> loss-batch
to be used instead of the standard softmax (the default if this is None).
per_example_loss: Boolean. If set, the returned loss will be a batch-sized
tensor of losses for each sequence in the batch. If unset, it will be
a scalar with the averaged loss from all examples.
name: Optional name for this operation, defaults to "model_with_buckets".
Returns:
A tuple of the form (outputs, losses), where:
outputs: The outputs for each bucket. Its j'th element consists of a list
of 2D Tensors. The shape of output tensors can be either
[batch_size x output_size] or [batch_size x num_decoder_symbols]
depending on the seq2seq model used.
losses: List of scalar Tensors, representing losses for each bucket, or,
if per_example_loss is set, a list of 1D batch-sized float Tensors.
Raises:
ValueError: If length of encoder_inputs, targets, or weights is smaller
than the largest (last) bucket.
"""

参数:
encoder_inputs:一开始我有个疑问,这里的inputs是ids的形式还是传入input_size的形式,仔细想想实际是这样的。这个inputs具体的shape形式要根据后面seq2seq定义的那个函数决定,一般就只传入两个参数x, y分别对应encoder_inputs和decoder_inputs(另外特定seq2seq需要的参数需要在自定义的这个seq2seq函数内部传入)。这个时候,如果我们使用的是embedding_seq2seq,那么实际的inputs就应该是ids的样子;否则,就是input_size的样子。
targets:a list因为每一时刻都会有target,并且每一时刻输入的是batch_size个,因此每一时刻的target是[batch_size,]的形式,最终导致targets是a list of [batch_size, ]
buckets:a list of (input_size, output_size)
per_example_loss:默认是False,表示losses是[batch_size, ]。比如刚才讲到的sequence_loss_by_example的结果是[batch_size,],再者sequence_loss的结果是一个scalar。

实现:

1
2
3
4
5
6
for j, bucket in enumerate(buckets):
with variable_scope.variable_scope(
variable_scope.get_variable_scope(), reuse=True if j > 0 else None):
bucket_outputs, _ = seq2seq(encoder_inputs[:bucket[0]],
decoder_inputs[:bucket[1]])
outputs.append(bucket_outputs)

根据实现可以看到,比如设置了3个buckets=[(2, 4), (5, 7), (8, 10)],第1个bucket是(2,4),那么先截取encoder_inputs中每个(batch_size个)sequences的前2个tokens,和同理截取decoder_inputs中前4个tokens(encoder_inputs的第一维度就是time)。
然后把截取部分进行seq2seq,得到输出是a list of [batch_size, output_size](这个list的长度为4,output是按decoder的长度算),然后将这个输出加入到outputs中。
最终得到的outputs就是一个bucket_size长度(这里为3)的列表,列表中每个元素是长度不等的list(之所以长度不等是因为每个bucket所定义的max_decoder_length不等,依次增大)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
if per_example_loss:
losses.append(
sequence_loss_by_example(
outputs[-1],
targets[:bucket[1]],
weights[:bucket[1]],
softmax_loss_function=softmax_loss_function))
else:
losses.append(
sequence_loss(
outputs[-1],
targets[:bucket[1]],
weights[:bucket[1]],
softmax_loss_function=softmax_loss_function))

计算完当前bucket的outputs后,就应该计算当前bucket的loss。由于当前bucket的output刚刚append,因此outputs[-1]就是当前bucket的output。又因为我们截取了decoder_inputs,因此targets和weights都要截取成相同的长度。这样的话就得到当前bucket的loss,append到losses中。

因此,最后的outputs和losses,我们只要索引bucket的idx,就可以得到该bucket上的output和loss。