2017-03-10

TensorFlow's legacy_seq2seq

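TensorFlow is rolling out a new set of seq2seq interfaces and has moved the previous seq2seq code under legacy_seq2seq; the code I read today comes from there. A lot of existing code still uses the old seq2seq interface, so it is still worth getting familiar with.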

_extract_argmax_and_embed

def _extract_argmax_and_embed(embedding,
                              output_projection=None,
                              update_embedding=True):
  """Get a loop_function that extracts the previous symbol and embeds it.
  Args:
    embedding: embedding tensor for symbols.
    output_projection: None or a pair (W, B). If provided, each fed previous
      output will first be multiplied by W and added B.
    update_embedding: Boolean; if False, the gradients will not propagate
      through the embeddings.
  Returns:
    A loop function.
  """
  def loop_function(prev, _):
    if output_projection is not None:
      prev = nn_ops.xw_plus_b(prev, output_projection[0], output_projection[1])
    prev_symbol = math_ops.argmax(prev, 1)
    # Note that gradients will not propagate through the second parameter of
    # embedding_lookup.
    emb_prev = embedding_ops.embedding_lookup(embedding, prev_symbol)
    if not update_embedding:
      emb_prev = array_ops.stop_gradient(emb_prev)
    return emb_prev
  return loop_function
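
To make the two steps inside loop_function concrete (argmax over the symbol scores, then an embedding lookup), here is a minimal pure-Python sketch. It is not the TensorFlow code: make_loop_function and the toy embedding table are hypothetical names and data chosen for illustration, and the optional output projection and stop_gradient are left out.

```python
def make_loop_function(embedding):
    """Build a function that maps previous outputs to next-step inputs.

    embedding: list of rows, one embedding vector per symbol id (toy data).
    """
    def loop_function(prev, _):
        # prev holds one score vector per batch entry, one score per symbol.
        next_inputs = []
        for scores in prev:
            # argmax over the symbol axis picks the highest-scoring symbol id.
            symbol = max(range(len(scores)), key=lambda k: scores[k])
            # The embedding lookup is a plain row selection.
            next_inputs.append(embedding[symbol])
        return next_inputs
    return loop_function


embedding = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1]]  # 3 symbols, embedding_size 2
loop_fn = make_loop_function(embedding)
prev = [[0.2, 0.7, 0.1], [0.1, 0.3, 0.9]]         # batch_size 2, 3 symbols
next_inp = loop_fn(prev, 0)
print(next_inp)
```

The second argument (the step number) is ignored here, just as in the original loop_function.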

rnn_decoder

def rnn_decoder(decoder_inputs,
                initial_state,
                cell,
                loop_function=None,
                scope=None):
  """RNN decoder for the sequence-to-sequence model.
  Args:
    decoder_inputs: A list of 2D Tensors [batch_size x input_size].
    initial_state: 2D Tensor with shape [batch_size x cell.state_size].
    cell: core_rnn_cell.RNNCell defining the cell function and size.
    loop_function: If not None, this function will be applied to the i-th output
      in order to generate the i+1-st input, and decoder_inputs will be ignored,
      except for the first element ("GO" symbol). This can be used for decoding,
      but also for training to emulate http://arxiv.org/abs/1506.03099.
      Signature -- loop_function(prev, i) = next
        * prev is a 2D Tensor of shape [batch_size x output_size],
        * i is an integer, the step number (when advanced control is needed),
        * next is a 2D Tensor of shape [batch_size x input_size].
    scope: VariableScope for the created subgraph; defaults to "rnn_decoder".
  Returns:
    A tuple of the form (outputs, state), where:
      outputs: A list of the same length as decoder_inputs of 2D Tensors with
        shape [batch_size x output_size] containing generated outputs.
      state: The state of each cell at the final time-step.
        It is a 2D Tensor of shape [batch_size x cell.state_size].
        (Note that in some cases, like basic RNN cell or GRU cell, outputs and
         states can be the same. They are different for LSTM cells though.)
  """
  with variable_scope.variable_scope(scope or "rnn_decoder"):
    state = initial_state
    outputs = []
    prev = None
    for i, inp in enumerate(decoder_inputs):
      if loop_function is not None and prev is not None:
        with variable_scope.variable_scope("loop_function", reuse=True):
          inp = loop_function(prev, i)
      if i > 0:
        variable_scope.get_variable_scope().reuse_variables()
      output, state = cell(inp, state)
      outputs.append(output)
      if loop_function is not None:
        prev = output
  return outputs, state

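decoder_inputs: a list in which each element is the input at time step t_i. Each time step holds batch_size inputs, and each input (usually representing a word or token) is an input_size-dimensional vector.
loop_function: if a loop_function is set, the first element of decoder_inputs (the "GO" symbol) is still fed in, but the inputs at all later time steps are ignored and replaced by input_ti+1 = loop_function(output_ti).
The loop_function defined here takes 2 arguments, (prev, i), and returns next.

Outputs:
outputs: every time step's input yields one output, so outputs has the same structure as decoder_inputs: a list with one element per time step, each of shape [batch_size, output_size]. The last dimension is the cell's output_size rather than input_size.
state: the cell state at the final time step t, with shape [batch_size, cell.state_size].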
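
The control flow of rnn_decoder can be sketched without TensorFlow. The version below is only an illustration under simplifying assumptions: toy_cell is a hypothetical stand-in for an RNN cell, the batch dimension is dropped, and variable scopes are omitted. It shows how the first input is always used and how later inputs are replaced when a loop_function is given.

```python
def toy_cell(inp, state):
    # Hypothetical cell: the new state is state + input, and the output
    # equals the new state (as with a basic RNN or GRU cell).
    new_state = [s + x for s, x in zip(state, inp)]
    return new_state, new_state


def decode(decoder_inputs, initial_state, cell, loop_function=None):
    state, outputs, prev = initial_state, [], None
    for i, inp in enumerate(decoder_inputs):
        # From the second step on, optionally replace the given input
        # with one generated from the previous output.
        if loop_function is not None and prev is not None:
            inp = loop_function(prev, i)
        output, state = cell(inp, state)
        outputs.append(output)
        if loop_function is not None:
            prev = output
    return outputs, state


inputs = [[1.0], [10.0], [100.0]]

# Without loop_function: decoder_inputs are used as given.
forced_outputs, forced_state = decode(inputs, [0.0], toy_cell)

# With loop_function: only inputs[0] is used, the rest are generated.
double = lambda prev, i: [2.0 * p for p in prev]
fed_outputs, fed_state = decode(inputs, [0.0], toy_cell, loop_function=double)

print(forced_outputs, fed_outputs)
```

In both cases the outputs list has the same length as decoder_inputs, which matches the docstring.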

basic_rnn_seq2seq

def basic_rnn_seq2seq(encoder_inputs,
                      decoder_inputs,
                      cell,
                      dtype=dtypes.float32,
                      scope=None):
  """Basic RNN sequence-to-sequence model.
  This model first runs an RNN to encode encoder_inputs into a state vector,
  then runs decoder, initialized with the last encoder state, on decoder_inputs.
  Encoder and decoder use the same RNN cell type, but don't share parameters.
  Args:
    encoder_inputs: A list of 2D Tensors [batch_size x input_size].
    decoder_inputs: A list of 2D Tensors [batch_size x input_size].
    cell: core_rnn_cell.RNNCell defining the cell function and size.
    dtype: The dtype of the initial state of the RNN cell (default: tf.float32).
    scope: VariableScope for the created subgraph; default: "basic_rnn_seq2seq".
  Returns:
    A tuple of the form (outputs, state), where:
      outputs: A list of the same length as decoder_inputs of 2D Tensors with
        shape [batch_size x output_size] containing the generated outputs.
      state: The state of each decoder cell in the final time-step.
        It is a 2D Tensor of shape [batch_size x cell.state_size].
  """
  with variable_scope.variable_scope(scope or "basic_rnn_seq2seq"):
    enc_cell = copy.deepcopy(cell)
    _, enc_state = core_rnn.static_rnn(enc_cell, encoder_inputs, dtype=dtype)
    return rnn_decoder(decoder_inputs, enc_state, cell)

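encoder_inputs: a list in which each element is the input at time step t. Each time step holds batch_size inputs (words or tokens), and each token is represented by an input_size-dimensional vector (its embedding). So it is a list of [batch_size, input_size].
decoder_inputs: same as above, but the two lists may differ in length: the former is set by encoder_max_length and the latter by decoder_max_length.
Outputs:
outputs: same structure as decoder_inputs, except that the last dimension is output_size rather than input_size. The two names are kept apart because the cell's output width (cell.output_size) does not have to equal the width of its input vectors.
state: again the cell state at the final time step, [batch_size, cell.state_size].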

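Note the deep copy used here:

A deep copy creates a new variable or container at a different address, and the elements inside the container are also stored at newly allocated addresses; only the values are the same, so it is a fully independent copy. In other words, new wine in a new bottle.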
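
A small standard-library demonstration of the difference between an alias, a shallow copy and a deep copy. ToyCell is a hypothetical class used only for illustration; it is not a TensorFlow cell.

```python
import copy


class ToyCell:
    """Hypothetical stand-in for a cell object that owns a mutable list."""

    def __init__(self, weights):
        self.weights = weights


cell = ToyCell([0.1, 0.2])
alias = cell                     # same object
shallow = copy.copy(cell)        # new object, but shares the inner list
enc_cell = copy.deepcopy(cell)   # new object with its own inner list

cell.weights[0] = 9.9            # mutate the original

print(alias.weights, shallow.weights, enc_cell.weights)
```

Only the deep copy is unaffected by the mutation, which is the sense in which enc_cell and cell are independent objects.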

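The encode stage uses core_rnn.static_rnn(). I was not sure how this function differs from the other rnn functions; as far as I can tell, it unrolls the network statically, calling the cell once for each element of the Python list of inputs.

The decode stage is very basic: it directly uses the rnn_decoder described above to produce the final outputs and state, and returns them.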

static_rnn

The code is in the TensorFlow source; it is fairly involved, so I will not walk through it in detail here.

embedding_rnn_decoder

def embedding_rnn_decoder(decoder_inputs,
                          initial_state,
                          cell,
                          num_symbols,
                          embedding_size,
                          output_projection=None,
                          feed_previous=False,
                          update_embedding_for_previous=True,
                          scope=None):
  """RNN decoder with embedding and a pure-decoding option.
  Args:
    decoder_inputs: A list of 1D batch-sized int32 Tensors (decoder inputs).
    initial_state: 2D Tensor [batch_size x cell.state_size].
    cell: core_rnn_cell.RNNCell defining the cell function.
    num_symbols: Integer, how many symbols come into the embedding.
    embedding_size: Integer, the length of the embedding vector for each symbol.
    output_projection: None or a pair (W, B) of output projection weights and
      biases; W has shape [output_size x num_symbols] and B has
      shape [num_symbols]; if provided and feed_previous=True, each fed
      previous output will first be multiplied by W and added B.
    feed_previous: Boolean; if True, only the first of decoder_inputs will be
      used (the "GO" symbol), and all other decoder inputs will be generated by:
        next = embedding_lookup(embedding, argmax(previous_output)),
      In effect, this implements a greedy decoder. It can also be used
      during training to emulate http://arxiv.org/abs/1506.03099.
      If False, decoder_inputs are used as given (the standard decoder case).
    update_embedding_for_previous: Boolean; if False and feed_previous=True,
      only the embedding for the first symbol of decoder_inputs (the "GO"
      symbol) will be updated by back propagation. Embeddings for the symbols
      generated from the decoder itself remain unchanged. This parameter has
      no effect if feed_previous=False.
    scope: VariableScope for the created subgraph; defaults to
      "embedding_rnn_decoder".
  Returns:
    A tuple of the form (outputs, state), where:
      outputs: A list of the same length as decoder_inputs of 2D Tensors. The
        output is of shape [batch_size x cell.output_size] when
        output_projection is not None (and represents the dense representation
        of predicted tokens). It is of shape [batch_size x num_decoder_symbols]
        when output_projection is None.
      state: The state of each decoder cell in each time-step. This is a list
        with length len(decoder_inputs) -- one item for each time-step.
        It is a 2D Tensor of shape [batch_size x cell.state_size].
  Raises:
    ValueError: When output_projection has the wrong shape.
  """
  with variable_scope.variable_scope(scope or "embedding_rnn_decoder") as scope:
    if output_projection is not None:
      dtype = scope.dtype
      proj_weights = ops.convert_to_tensor(output_projection[0], dtype=dtype)
      proj_weights.get_shape().assert_is_compatible_with([None, num_symbols])
      proj_biases = ops.convert_to_tensor(output_projection[1], dtype=dtype)
      proj_biases.get_shape().assert_is_compatible_with([num_symbols])
    embedding = variable_scope.get_variable("embedding",
                                            [num_symbols, embedding_size])
    loop_function = _extract_argmax_and_embed(
        embedding, output_projection,
        update_embedding_for_previous) if feed_previous else None
    emb_inp = (embedding_ops.embedding_lookup(embedding, i)
               for i in decoder_inputs)
    return rnn_decoder(
        emb_inp, initial_state, cell, loop_function=loop_function)

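We just covered a basic decoder called rnn_decoder: rnn_decoder(decoder_inputs, initial_state, cell, loop_function=None, scope=None). Now for a slightly more advanced one.
Comparing the two, this decoder does not take a loop_function argument (it builds one internally when feed_previous=True), and it gains num_symbols, embedding_size, output_projection=None, feed_previous=False and update_embedding_for_previous=True. What are all of these?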

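Parameters:
decoder_inputs: since this decoder advertises embedding, its input must differ from rnn_decoder's. Here each input becomes 1D, [batch_size], which means we no longer need to embed the input ourselves: we pass the index of each token in the vocab (its id) and the id-to-embedding conversion is done internally.
num_symbols: this is simply vocab_size.
embedding_size: the number of dimensions each token is embedded into, e.g. 100.
output_projection: (W, b), a projection applied to the output. Why project? At this point the input is effectively a list of [batch_size] id tensors; the embedding done internally gives embedded_input of shape [batch_size, embedding_size], and after the cell we get [batch_size, output_size] (this is what rnn_decoder did earlier). If we then set feed_previous=True, so that the previous step's output becomes the next step's input, we have to pick the highest-scoring token out of the vocab_size candidates from the previous output, i.e. argmax(previous_output).

But the output currently has dimension output_size, which says nothing about the score of each vocab entry. So it has to be projected from output_size to vocab_size (num_symbols here).
x (the output at some time step) has shape [batch_size, output_size] and the projection formula is xW + b, so W has shape [output_size, num_symbols] and b has shape [num_symbols].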
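
The shape bookkeeping of the projection can be checked with a tiny pure-Python sketch. The helper below only mirrors the arithmetic of xW + b on nested lists; its name is borrowed from the nn_ops.xw_plus_b call in the source, but the implementation and the numbers are illustrative assumptions.

```python
def xw_plus_b(x, w, b):
    # x: [batch_size, output_size]
    # w: [output_size, num_symbols]
    # b: [num_symbols]
    return [[sum(row[k] * w[k][j] for k in range(len(w))) + b[j]
             for j in range(len(b))]
            for row in x]


x = [[1.0, 2.0]]                          # batch_size 1, output_size 2
w = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]    # output_size 2 -> num_symbols 3
b = [0.5, 0.0, -1.0]

logits = xw_plus_b(x, w, b)               # [batch_size, num_symbols]
predicted = [max(range(len(row)), key=lambda j: row[j]) for row in logits]
print(logits, predicted)
```

After the projection each row has one score per symbol, so argmax is meaningful as a token id.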

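update_embedding_for_previous: if the previous output is not used as the current input (feed_previous=False), this parameter has no effect. Otherwise it defaults to True; if set to False, the embeddings of the fed-back symbols are not updated, so backpropagation only updates the embedding of "GO" while the embeddings of the other tokens (those generated by the decoder) stay unchanged.

Outputs:
outputs: according to the docstring, if output_projection=None, no separate projection is needed (the outputs already have num_symbols entries), so it is a list of [batch_size, num_symbols]; if it is not None, the outputs still have to be projected, so it is a list of [batch_size, cell.output_size].
state: same as above.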

embedding_rnn_seq2seq

def embedding_rnn_seq2seq(encoder_inputs,
                          decoder_inputs,
                          cell,
                          num_encoder_symbols,
                          num_decoder_symbols,
                          embedding_size,
                          output_projection=None,
                          feed_previous=False,
                          dtype=None,
                          scope=None):
  """Embedding RNN sequence-to-sequence model.
  This model first embeds encoder_inputs by a newly created embedding (of shape
  [num_encoder_symbols x input_size]). Then it runs an RNN to encode
  embedded encoder_inputs into a state vector. Next, it embeds decoder_inputs
  by another newly created embedding (of shape [num_decoder_symbols x
  input_size]). Then it runs RNN decoder, initialized with the last
  encoder state, on embedded decoder_inputs.
  Args:
    encoder_inputs: A list of 1D int32 Tensors of shape [batch_size].
    decoder_inputs: A list of 1D int32 Tensors of shape [batch_size].
    cell: core_rnn_cell.RNNCell defining the cell function and size.
    num_encoder_symbols: Integer; number of symbols on the encoder side.
    num_decoder_symbols: Integer; number of symbols on the decoder side.
    embedding_size: Integer, the length of the embedding vector for each symbol.
    output_projection: None or a pair (W, B) of output projection weights and
      biases; W has shape [output_size x num_decoder_symbols] and B has
      shape [num_decoder_symbols]; if provided and feed_previous=True, each
      fed previous output will first be multiplied by W and added B.
    feed_previous: Boolean or scalar Boolean Tensor; if True, only the first
      of decoder_inputs will be used (the "GO" symbol), and all other decoder
      inputs will be taken from previous outputs (as in embedding_rnn_decoder).
      If False, decoder_inputs are used as given (the standard decoder case).
    dtype: The dtype of the initial state for both the encoder and encoder
      rnn cells (default: tf.float32).
    scope: VariableScope for the created subgraph; defaults to
      "embedding_rnn_seq2seq"
  Returns:
    A tuple of the form (outputs, state), where:
      outputs: A list of the same length as decoder_inputs of 2D Tensors. The
        output is of shape [batch_size x cell.output_size] when
        output_projection is not None (and represents the dense representation
        of predicted tokens). It is of shape [batch_size x num_decoder_symbols]
        when output_projection is None.
      state: The state of each decoder cell in each time-step. This is a list
        with length len(decoder_inputs) -- one item for each time-step.
        It is a 2D Tensor of shape [batch_size x cell.state_size].
  """

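With embedding_rnn_decoder in place, there is naturally a matching embedding_rnn_seq2seq. Earlier we covered basic_rnn_seq2seq(encoder_inputs, decoder_inputs, cell, dtype=dtypes.float32, scope=None).
inputs: as before, since the embedding is done internally, the inputs are a list of [batch_size] tensors where each position is just a token id. Internally an embedding wrapper does the lookup and produces a list of [batch_size, embedding_size].
Compared with basic_rnn_seq2seq, there are a few more parameters:
num_encoder_symbols: put simply, the vocab_size on the encoder side. The encoder and decoder vocab sizes differ mainly in translation tasks between different languages; for plain Chinese-to-Chinese generation the two sides have the same vocabulary.
num_decoder_symbols: same as above, for the decoder side.
embedding_size: how many dimensions the vector for each vocab entry has.
output_projection=None: same meaning as in embedding_rnn_decoder.
feed_previous=False: if feed_previous is a plain True or False, the function directly returns the result of embedding_rnn_decoder. The notable point is that feed_previous can also be a Boolean tensor (I have no need for that yet).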
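
The id-to-embedding step described for the inputs can be pictured as follows. This is a rough pure-Python sketch, not the embedding wrapper itself: embed_inputs and the toy table are hypothetical, and the table is fixed here whereas in the real model it is a trainable variable.

```python
def embed_inputs(inputs, embedding):
    # inputs: list over time steps, each a list of batch_size token ids.
    # Returns a list over time steps of [batch_size, embedding_size] rows.
    return [[embedding[token_id] for token_id in step] for step in inputs]


# Toy table: num_encoder_symbols = 4, embedding_size = 2.
embedding = [[0.0, 0.0], [0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

encoder_inputs = [[1, 3], [2, 0], [3, 3]]   # 3 time steps, batch_size 2
embedded = embed_inputs(encoder_inputs, embedding)
print(embedded[0])
```

The list keeps its length (one entry per time step); only the per-step shape changes from [batch_size] to [batch_size, embedding_size].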

attention_decoder

def attention_decoder(decoder_inputs,
                      initial_state,
                      attention_states,
                      cell,
                      output_size=None,
                      num_heads=1,
                      loop_function=None,
                      dtype=None,
                      scope=None,
                      initial_state_attention=False):
  """RNN decoder with attention for the sequence-to-sequence model.
  In this context "attention" means that, during decoding, the RNN can look up
  information in the additional tensor attention_states, and it does this by
  focusing on a few entries from the tensor. This model has proven to yield
  especially good results in a number of sequence-to-sequence tasks. This
  implementation is based on http://arxiv.org/abs/1412.7449 (see below for
  details). It is recommended for complex sequence-to-sequence tasks.
  Args:
    decoder_inputs: A list of 2D Tensors [batch_size x input_size].
    initial_state: 2D Tensor [batch_size x cell.state_size].
    attention_states: 3D Tensor [batch_size x attn_length x attn_size].
    cell: core_rnn_cell.RNNCell defining the cell function and size.
    output_size: Size of the output vectors; if None, we use cell.output_size.
    num_heads: Number of attention heads that read from attention_states.
    loop_function: If not None, this function will be applied to i-th output
      in order to generate i+1-th input, and decoder_inputs will be ignored,
      except for the first element ("GO" symbol). This can be used for decoding,
      but also for training to emulate http://arxiv.org/abs/1506.03099.
      Signature -- loop_function(prev, i) = next
        * prev is a 2D Tensor of shape [batch_size x output_size],
        * i is an integer, the step number (when advanced control is needed),
        * next is a 2D Tensor of shape [batch_size x input_size].
    dtype: The dtype to use for the RNN initial state (default: tf.float32).
    scope: VariableScope for the created subgraph; default: "attention_decoder".
    initial_state_attention: If False (default), initial attentions are zero.
      If True, initialize the attentions from the initial state and attention
      states -- useful when we wish to resume decoding from a previously
      stored decoder state and attention states.
  Returns:
    A tuple of the form (outputs, state), where:
      outputs: A list of the same length as decoder_inputs of 2D Tensors of
        shape [batch_size x output_size]. These represent the generated outputs.
        Output i is computed from input i (which is either the i-th element
        of decoder_inputs or loop_function(output {i-1}, i)) as follows.
        First, we run the cell on a combination of the input and previous
        attention masks:
          cell_output, new_state = cell(linear(input, prev_attn), prev_state).
        Then, we calculate new attention masks:
          new_attn = softmax(V^T * tanh(W * attention_states + U * new_state))
        and then we calculate the output:
          output = linear(cell_output, new_attn).
      state: The state of each decoder cell the final time-step.
        It is a 2D Tensor of shape [batch_size x cell.state_size].
  Raises:
    ValueError: when num_heads is not positive, there are no inputs, shapes
      of attention_states are not set, or input size cannot be inferred
      from the input.
  """

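Having finished embedding_rnn_decoder, let's look at attention_decoder.
Compared with the basic rnn_decoder (rnn_decoder(decoder_inputs, initial_state, cell, loop_function=None, scope=None)),
there are a few more parameters:
attention_states: the additional information the decoder can look up, a 3D tensor of shape [batch_size, attn_length, attn_size]; typically these are the encoder outputs.
output_size=None: if None, defaults to cell.output_size.
num_heads=1: the number of attention heads reading from attention_states; each head computes its own set of attention weights over the attn_length positions. The default is a single head.
initial_state_attention=False: if True, the attention is initialized from the initial state and attention_states; if False, the attention is initialized to zero.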
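
The attention mask formula from the docstring, new_attn = softmax(V^T * tanh(W * attention_states + U * new_state)), can be traced by hand on a tiny example. The sketch below is a deliberate simplification and not the library code: it handles one example and one head, and it assumes W and U are identity matrices so that only the vector v remains as a parameter. The function name attention_step is hypothetical.

```python
import math


def attention_step(attention_states, state, v):
    # attention_states: [attn_length][attn_size] for a single example.
    # state: [attn_size], the decoder state (U assumed to be identity).
    # One score per position: v^T tanh(h + state).
    scores = []
    for h in attention_states:
        scores.append(sum(v_k * math.tanh(h_k + s_k)
                          for v_k, h_k, s_k in zip(v, h, state)))
    # Numerically stable softmax over the attn_length positions.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: attention-weighted sum of the attention states.
    attn_size = len(attention_states[0])
    context = [sum(w * h[k] for w, h in zip(weights, attention_states))
               for k in range(attn_size)]
    return weights, context


attention_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # attn_length 3
state = [0.5, 0.5]
v = [1.0, 1.0]
weights, context = attention_step(attention_states, state, v)
print(weights, context)
```

With num_heads greater than 1, each head would have its own parameters and produce its own weights and context vector.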

embedding_attention_decoder

def embedding_attention_decoder(decoder_inputs,
                                initial_state,
                                attention_states,
                                cell,
                                num_symbols,
                                embedding_size,
                                num_heads=1,
                                output_size=None,
                                output_projection=None,
                                feed_previous=False,
                                update_embedding_for_previous=True,
                                dtype=None,
                                scope=None,
                                initial_state_attention=False):
  """RNN decoder with embedding and attention and a pure-decoding option.
  Args:
    decoder_inputs: A list of 1D batch-sized int32 Tensors (decoder inputs).
    initial_state: 2D Tensor [batch_size x cell.state_size].
    attention_states: 3D Tensor [batch_size x attn_length x attn_size].
    cell: core_rnn_cell.RNNCell defining the cell function.
    num_symbols: Integer, how many symbols come into the embedding.
    embedding_size: Integer, the length of the embedding vector for each symbol.
    num_heads: Number of attention heads that read from attention_states.
    output_size: Size of the output vectors; if None, use output_size.
    output_projection: None or a pair (W, B) of output projection weights and
      biases; W has shape [output_size x num_symbols] and B has shape
      [num_symbols]; if provided and feed_previous=True, each fed previous
      output will first be multiplied by W and added B.
    feed_previous: Boolean; if True, only the first of decoder_inputs will be
      used (the "GO" symbol), and all other decoder inputs will be generated by:
        next = embedding_lookup(embedding, argmax(previous_output)),
      In effect, this implements a greedy decoder. It can also be used
      during training to emulate http://arxiv.org/abs/1506.03099.
      If False, decoder_inputs are used as given (the standard decoder case).
    update_embedding_for_previous: Boolean; if False and feed_previous=True,
      only the embedding for the first symbol of decoder_inputs (the "GO"
      symbol) will be updated by back propagation. Embeddings for the symbols
      generated from the decoder itself remain unchanged. This parameter has
      no effect if feed_previous=False.
    dtype: The dtype to use for the RNN initial states (default: tf.float32).
    scope: VariableScope for the created subgraph; defaults to
      "embedding_attention_decoder".
    initial_state_attention: If False (default), initial attentions are zero.
      If True, initialize the attentions from the initial state and attention
      states -- useful when we wish to resume decoding from a previously
      stored decoder state and attention states.
  Returns:
    A tuple of the form (outputs, state), where:
      outputs: A list of the same length as decoder_inputs of 2D Tensors with
        shape [batch_size x output_size] containing the generated outputs.
      state: The state of each decoder cell at the final time-step.
        It is a 2D Tensor of shape [batch_size x cell.state_size].
  Raises:
    ValueError: When output_projection has the wrong shape.
  """

This is really a combination of the embedding_rnn_decoder and attention_decoder described above.

embedding_attention_seq2seq

def embedding_attention_seq2seq(encoder_inputs,
                                decoder_inputs,
                                cell,
                                num_encoder_symbols,
                                num_decoder_symbols,
                                embedding_size,
                                num_heads=1,
                                output_projection=None,
                                feed_previous=False,
                                dtype=None,
                                scope=None,
                                initial_state_attention=False)

The seq2seq model that corresponds to embedding_attention_decoder.

sequence_loss_by_example

def sequence_loss_by_example(logits,
                             targets,
                             weights,
                             average_across_timesteps=True,
                             softmax_loss_function=None,
                             name=None):
  """Weighted cross-entropy loss for a sequence of logits (per example).
  Args:
    logits: List of 2D Tensors of shape [batch_size x num_decoder_symbols].
    targets: List of 1D batch-sized int32 Tensors of the same length as logits.
    weights: List of 1D batch-sized float-Tensors of the same length as logits.
    average_across_timesteps: If set, divide the returned cost by the total
      label weight.
    softmax_loss_function: Function (labels-batch, inputs-batch) -> loss-batch
      to be used instead of the standard softmax (the default if this is None).
    name: Optional name for this operation, default: "sequence_loss_by_example".
  Returns:
    1D batch-sized float Tensor: The log-perplexity for each sequence.
  Raises:
    ValueError: If len(logits) is different from len(targets) or len(weights).
  """
  if len(targets) != len(logits) or len(weights) != len(logits):
    raise ValueError("Lengths of logits, weights, and targets must be the same "
                     "%d, %d, %d." % (len(logits), len(weights), len(targets)))
  with ops.name_scope(name, "sequence_loss_by_example",
                      logits + targets + weights):
    log_perp_list = []
    for logit, target, weight in zip(logits, targets, weights):
      if softmax_loss_function is None:
        # TODO(irving,ebrevdo): This reshape is needed because
        # sequence_loss_by_example is called with scalars sometimes, which
        # violates our general scalar strictness policy.
        target = array_ops.reshape(target, [-1])
        crossent = nn_ops.sparse_softmax_cross_entropy_with_logits(
            labels=target, logits=logit)
      else:
        crossent = softmax_loss_function(target, logit)
      log_perp_list.append(crossent * weight)
    log_perps = math_ops.add_n(log_perp_list)
    if average_across_timesteps:
      total_size = math_ops.add_n(weights)
      total_size += 1e-12  # Just to avoid division by 0 for all-0 weights.
      log_perps /= total_size
  return log_perps

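Return value:
1D batch-sized float Tensor: the log perplexity computed for each sequence (a batch holds batch_size sequences), which is what by_example in the name means.

Inputs:
(Note: with this list-of-tensors interface, all sequences in a batch are padded to the same length, so they share the same number of time steps.)
logits: a list storing the outputs at successive time steps. Each time step's output covers the whole batch, and the output for each input is a score over the entire vocab, hence num_decoder_symbols. So logits is a list of [batch_size, num_decoder_symbols].
targets: a list of the targets at successive time steps. Each time step has batch_size inputs and therefore batch_size targets, so it is a list of [batch_size].
weights: each example has a weight for its current token at every time step, so this is a list of [batch_size].
Question: what are the weights for, and why give every token a weight? The usual use is masking: padding positions get weight 0 so they do not contribute to the loss, while real tokens get weight 1.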
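
One common way to build such weights is to mask out padding. The sketch below assumes a padding id of 0 (PAD_ID is an assumption of this example, not something fixed by legacy_seq2seq) and uses the same time-major layout as targets.

```python
PAD_ID = 0  # assumed padding token id, for illustration only


def make_weights(targets, pad_id=PAD_ID):
    # targets: list over time steps, each a list of batch_size target ids.
    # Padding positions get weight 0.0, real tokens get weight 1.0.
    return [[0.0 if t == pad_id else 1.0 for t in step] for step in targets]


# 3 time steps, batch_size 2; the second sequence is padded after step 0.
targets = [[5, 7], [6, 0], [2, 0]]
weights = make_weights(targets)
print(weights)
```

The result has exactly the same structure as targets, which is what the length check in sequence_loss_by_example expects.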

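Reading the code:
For each time step it first computes crossent with shape [batch_size], then multiplies by the weights, still giving [batch_size]: the weighted loss of each example at the current time step t (batch_size values). This is appended to log_perp_list (which ends up as a list of [batch_size]).
After looping over all time steps, these per-step losses are summed into a tensor of shape [batch_size] called log_perps. If average_across_timesteps is set, it is then divided by the per-example sum of the weights.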
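
The same computation can be written in plain Python to check the shapes and the effect of the weights. This is a sketch that reuses the function name for readability; it works on nested lists rather than tensors and always uses the standard softmax cross-entropy.

```python
import math


def softmax_cross_entropy(logit, target):
    # -log softmax(logit)[target], computed in a numerically stable way.
    top = max(logit)
    log_z = top + math.log(sum(math.exp(x - top) for x in logit))
    return log_z - logit[target]


def sequence_loss_by_example(logits, targets, weights,
                             average_across_timesteps=True):
    batch_size = len(targets[0])
    log_perps = [0.0] * batch_size
    total_size = [0.0] * batch_size
    for logit_t, target_t, weight_t in zip(logits, targets, weights):
        for b in range(batch_size):
            crossent = softmax_cross_entropy(logit_t[b], target_t[b])
            log_perps[b] += crossent * weight_t[b]
            total_size[b] += weight_t[b]
    if average_across_timesteps:
        # 1e-12 avoids division by zero for all-zero weights.
        log_perps = [lp / (ts + 1e-12)
                     for lp, ts in zip(log_perps, total_size)]
    return log_perps


# 2 time steps, batch_size 2, 2 symbols, uniform logits.
logits = [[[0.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]]]
targets = [[0, 1], [1, 0]]
weights = [[1.0, 1.0], [1.0, 0.0]]   # second example is masked at step 1

averaged = sequence_loss_by_example(logits, targets, weights)
summed = sequence_loss_by_example(logits, targets, weights,
                                  average_across_timesteps=False)
print(averaged, summed)
```

With uniform logits over two symbols every unmasked step costs log 2, so the averaged values agree for both examples while the summed values differ by the masked step.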

sequence_loss

def sequence_loss(logits,
                  targets,
                  weights,
                  average_across_timesteps=True,
                  average_across_batch=True,
                  softmax_loss_function=None,
                  name=None):
  """Weighted cross-entropy loss for a sequence of logits, batch-collapsed.
  Args:
    logits: List of 2D Tensors of shape [batch_size x num_decoder_symbols].
    targets: List of 1D batch-sized int32 Tensors of the same length as logits.
    weights: List of 1D batch-sized float-Tensors of the same length as logits.
    average_across_timesteps: If set, divide the returned cost by the total
      label weight.
    average_across_batch: If set, divide the returned cost by the batch size.
    softmax_loss_function: Function (labels-batch, inputs-batch) -> loss-batch
      to be used instead of the standard softmax (the default if this is None).
    name: Optional name for this operation, defaults to "sequence_loss".
  Returns:
    A scalar float Tensor: The average log-perplexity per symbol (weighted).
  Raises:
    ValueError: If len(logits) is different from len(targets) or len(weights).
  """
  with ops.name_scope(name, "sequence_loss", logits + targets + weights):
    cost = math_ops.reduce_sum(
        sequence_loss_by_example(
            logits,
            targets,
            weights,
            average_across_timesteps=average_across_timesteps,
            softmax_loss_function=softmax_loss_function))
    if average_across_batch:
      batch_size = array_ops.shape(targets[0])[0]
      return cost / math_ops.cast(batch_size, cost.dtype)
    else:
      return cost

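The core is still the sequence_loss_by_example described above, except that the [batch_size] result is summed. With the default average_across_batch, it returns sum/batch_size, the average log perplexity per sequence; if averaging is turned off, it returns the sum of log perplexity over the whole batch.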
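
The batch reduction is small enough to show in full. In this sketch the per-example losses are passed in directly (the numbers are made up), so it only illustrates the sum and the optional division by batch_size.

```python
def sequence_loss(per_example_losses, average_across_batch=True):
    # per_example_losses: one value per sequence in the batch,
    # e.g. the result of sequence_loss_by_example.
    cost = sum(per_example_losses)
    if average_across_batch:
        return cost / len(per_example_losses)
    return cost


per_example = [0.5, 1.5, 1.0]          # illustrative values, batch_size 3
mean_loss = sequence_loss(per_example)
total_loss = sequence_loss(per_example, average_across_batch=False)
print(mean_loss, total_loss)
```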

model_with_buckets


def model_with_buckets(encoder_inputs,
                       decoder_inputs,
                       targets,
                       weights,
                       buckets,
                       seq2seq,
                       softmax_loss_function=None,
                       per_example_loss=False,
                       name=None):
  """Create a sequence-to-sequence model with support for bucketing.
  The seq2seq argument is a function that defines a sequence-to-sequence model,
  e.g., seq2seq = lambda x, y: basic_rnn_seq2seq(
      x, y, core_rnn_cell.GRUCell(24))
  Args:
    encoder_inputs: A list of Tensors to feed the encoder; first seq2seq input.
    decoder_inputs: A list of Tensors to feed the decoder; second seq2seq input.
    targets: A list of 1D batch-sized int32 Tensors (desired output sequence).
    weights: List of 1D batch-sized float-Tensors to weight the targets.
    buckets: A list of pairs of (input size, output size) for each bucket.
    seq2seq: A sequence-to-sequence model function; it takes 2 input that
      agree with encoder_inputs and decoder_inputs, and returns a pair
      consisting of outputs and states (as, e.g., basic_rnn_seq2seq).
    softmax_loss_function: Function (labels-batch, inputs-batch) -> loss-batch
      to be used instead of the standard softmax (the default if this is None).
    per_example_loss: Boolean. If set, the returned loss will be a batch-sized
      tensor of losses for each sequence in the batch. If unset, it will be
      a scalar with the averaged loss from all examples.
    name: Optional name for this operation, defaults to "model_with_buckets".
  Returns:
    A tuple of the form (outputs, losses), where:
      outputs: The outputs for each bucket. Its j'th element consists of a list
        of 2D Tensors. The shape of output tensors can be either
        [batch_size x output_size] or [batch_size x num_decoder_symbols]
        depending on the seq2seq model used.
      losses: List of scalar Tensors, representing losses for each bucket, or,
        if per_example_loss is set, a list of 1D batch-sized float Tensors.
  Raises:
    ValueError: If length of encoder_inputs, targets, or weights is smaller
      than the largest (last) bucket.
  """

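Parameters:
encoder_inputs: at first I wondered whether the inputs here are ids or input_size-dimensional vectors. On reflection, the actual shape of the inputs depends on the function passed as seq2seq, which normally takes just two arguments x, y corresponding to encoder_inputs and decoder_inputs (any other parameters a particular seq2seq needs must be supplied inside that user-defined seq2seq function). So if we use an embedding seq2seq (e.g. embedding_rnn_seq2seq), the inputs should be ids; otherwise they are input_size-dimensional vectors.
targets: a list, because every time step has a target; each time step is fed batch_size inputs, so the target at each time step has shape [batch_size], making targets a list of [batch_size].
buckets: a list of (input size, output size) pairs, i.e. the encoder length and decoder length of each bucket.
per_example_loss: defaults to False, in which case each bucket's loss is a scalar (computed by sequence_loss). If True, each bucket's loss has shape [batch_size] (computed by sequence_loss_by_example).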

Implementation:

for j, bucket in enumerate(buckets):
  with variable_scope.variable_scope(
      variable_scope.get_variable_scope(), reuse=True if j > 0 else None):
    bucket_outputs, _ = seq2seq(encoder_inputs[:bucket[0]],
                                decoder_inputs[:bucket[1]])
    outputs.append(bucket_outputs)

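From the implementation we can see that, with say 3 buckets, buckets=[(2, 4), (5, 7), (8, 10)], the first bucket is (2, 4): it takes the first 2 tokens of every sequence (batch_size of them) in encoder_inputs and, likewise, the first 4 tokens in decoder_inputs (the first dimension of encoder_inputs is time).
The sliced parts are then run through seq2seq, giving a list of [batch_size, output_size] (this list has length 4, since the output length follows the decoder length), and this output is appended to outputs.
The final outputs is a list whose length is the number of buckets (3 here), and each element is itself a list of a different length (the lengths differ because each bucket defines a different max_decoder_length, increasing from bucket to bucket).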
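
The slicing can be tried out with placeholder values instead of tensors. In this sketch fake_seq2seq is a hypothetical stand-in that returns one output per decoder step, and the strings merely label the time steps.

```python
buckets = [(2, 4), (5, 7), (8, 10)]

# Placeholders for the per-time-step tensors; the lists are as long as
# the largest (last) bucket requires.
encoder_inputs = ["enc_t%d" % t for t in range(8)]
decoder_inputs = ["dec_t%d" % t for t in range(10)]


def fake_seq2seq(enc, dec):
    # Hypothetical model: one output per decoder step, no real state.
    return ["out_for_" + d for d in dec], None


outputs = []
for enc_len, dec_len in buckets:
    # Each bucket only sees a prefix of the full input lists.
    bucket_outputs, _ = fake_seq2seq(encoder_inputs[:enc_len],
                                     decoder_inputs[:dec_len])
    outputs.append(bucket_outputs)

lengths = [len(bucket_outputs) for bucket_outputs in outputs]
print(lengths)
```

The printed lengths follow the decoder side of each bucket, which is why the elements of outputs have different lengths.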

if per_example_loss:
  losses.append(
      sequence_loss_by_example(
          outputs[-1],
          targets[:bucket[1]],
          weights[:bucket[1]],
          softmax_loss_function=softmax_loss_function))
else:
  losses.append(
      sequence_loss(
          outputs[-1],
          targets[:bucket[1]],
          weights[:bucket[1]],
          softmax_loss_function=softmax_loss_function))

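After computing the current bucket's outputs, we compute the current bucket's loss. The current bucket's output was just appended, so outputs[-1] is the current bucket's output. And because decoder_inputs was sliced, targets and weights must be sliced to the same length. This gives the current bucket's loss, which is appended to losses.

So in the final outputs and losses, indexing by the bucket's idx gives the output and loss for that bucket.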
