
Beam Search Notes

Beam Search is a heuristic search algorithm and an improvement on breadth-first search (BFS).

So before introducing Beam Search, let us first review breadth-first search.

Breadth-first search (BFS) is a search strategy in which nodes are expanded in the order they were generated. Concretely, the search starts from the initial node S and expands level by level: it does not move on to the nodes at level n+1 until every node at level n has been examined.

An example of this search strategy is shown in the figure below:

The overall framework of BFS is shown in the figure below:
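To make that framework concrete, here is a minimal BFS sketch in Python; the successors function and the goal test are generic placeholders supplied by the caller, not something from the original notes.

from collections import deque

def bfs(start, is_goal, successors):
    """Expand nodes level by level; return a path from start to a goal node."""
    open_list = deque([start])          # FIFO queue: nodes waiting to be expanded
    parents = {start: None}             # doubles as the set of visited nodes
    while open_list:
        node = open_list.popleft()      # earliest-generated node is expanded first
        if is_goal(node):
            path = []
            while node is not None:     # reconstruct the path by walking parents
                path.append(node)
                node = parents[node]
            return path[::-1]
        for succ in successors(node):
            if succ not in parents:     # never re-expand a visited node
                parents[succ] = node
                open_list.append(succ)
    return None                         # no solution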

So what is wrong with breadth-first search?

From the description above, BFS is a complete strategy: whenever the problem has a solution, BFS is guaranteed to find it, and the solution it finds is optimal in terms of search depth (number of steps). However, it has a rather fatal drawback: it must keep every expanded state in memory, so its space requirements are enormous and it is not suitable for very large problems.


OK, so we have covered the strengths and weaknesses of BFS. Beam Search (BS) is mainly an attempt to fix BFS's weakness. Of course, BS does not eliminate the problem completely; it merely uses a beam to reduce the algorithm's memory requirements.

Concretely, a heuristic cost function is used to estimate the cost from the current node to the goal node, and a beam width is used to limit how many nodes are stored at each level of the traversal (i.e., the candidates are ranked by the heuristic and only the best B (beam width) of them are kept).
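As a minimal sketch of that level-wise pruning (assuming h is the heuristic cost function, lower being better):

import heapq

def prune_to_beam(candidates, h, beam_width):
    # Keep only the beam_width nodes with the smallest estimated cost h(node).
    return heapq.nsmallest(beam_width, candidates, key=h)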

Where BFS relies on an open list, BS uses a BEAM to store the nodes that will be expanded next. In addition, a hash table stores the nodes that have already been visited, playing a role similar to the closed list in BFS.
At initialization, the start node is added to the BEAM and the hash table. Then, in each iteration of the main loop, all successors of the nodes in the BEAM are added to SET, and the best B nodes in SET are added to the BEAM and the hash table.
Note that a node already in the hash table is never added to the BEAM again, because a shorter path to that node has already been found.
The algorithm terminates when the goal node is added to SET, when the hash table is full (memory is exhausted), or when the BEAM is empty (the search has no solution).

One point deserves emphasis here, because I got it wrong myself at first: the beam width applies to an entire level, not to an individual node. In the example figure, at depth 2 we keep only B and E, i.e., the two best nodes across the whole level, rather than the best children of each parent separately.

Below is a pseudocode-level sketch of Beam Search.
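This is a minimal sketch of the procedure described above, written as plain Python for concreteness; successors(), the heuristic h(), the goal test is_goal(), and the max_visited cap are illustrative assumptions, and nodes are assumed hashable.

import heapq

def beam_search(start, is_goal, successors, h, beam_width, max_visited=100000):
    # BEAM: nodes to expand at the current level; visited: hash table of seen nodes.
    beam = [start]
    visited = {start}
    while beam:
        # SET: all successors of every node currently in the BEAM.
        candidates = set()
        for node in beam:
            for succ in successors(node):
                if is_goal(succ):
                    return succ                 # goal node reached: terminate
                if succ not in visited:         # skip nodes already reached
                    candidates.add(succ)
        # Keep only the best `beam_width` candidates for the next level.
        beam = heapq.nsmallest(beam_width, candidates, key=h)
        for node in beam:
            visited.add(node)
        if len(visited) > max_visited:          # hash table "full": memory exhausted
            return None
    return None                                 # BEAM empty: no solution found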

The following is beam search code based on TensorFlow (the 1.x graph API), mainly used for decoding.

from __future__ import division
import numpy as np
import tensorflow as tf

with tf.Graph().as_default():
    beam_size = 3     # Number of hypotheses in beam.
    num_symbols = 5   # Output vocabulary size.
    embedding_size = 10
    num_steps = 3
    embedding = tf.zeros([num_symbols, embedding_size])
    output_projection = None

    # log_beam_probs: list of [beam_size, 1] Tensors
    #   Ordered log probabilities of the `beam_size` best hypotheses
    #   found in each beam step (highest probability first).
    # beam_symbols: list of [beam_size] Tensors
    #   The ordered `beam_size` words / symbols extracted by the beam
    #   step, which will be appended to their corresponding hypotheses
    #   (corresponding hypotheses found in `beam_path`).
    # beam_path: list of [beam_size] Tensors
    #   The ordered `beam_size` parent indices. Their values range
    #   from [0, `beam_size`), and they denote which previous
    #   hypothesis each word should be appended to.
    log_beam_probs, beam_symbols, beam_path = [], [], []

    def beam_search(prev, i):
        if output_projection is not None:
            prev = tf.nn.xw_plus_b(
                prev, output_projection[0], output_projection[1])

        # Compute
        #   log P(next_word, hypothesis) =
        #   log P(next_word | hypothesis)*P(hypothesis) =
        #   log P(next_word | hypothesis) + log P(hypothesis)
        # for each hypothesis separately, then join them together
        # on the same tensor dimension to form the example's
        # beam probability distribution:
        #   [P(word1, hypothesis1), P(word2, hypothesis1), ...,
        #    P(word1, hypothesis2), P(word2, hypothesis2), ...]
        # If TF had a log_sum_exp operator, then it would be
        # more numerically stable to use:
        #   probs = prev - tf.log_sum_exp(prev, reduction_dims=[1])
        probs = tf.log(tf.nn.softmax(prev))

        # i == 1 corresponds to the input being "<GO>", with
        # uniform prior probability and only the empty hypothesis
        # (each row is a separate example).
        if i > 1:
            probs = tf.reshape(probs + log_beam_probs[-1],
                               [-1, beam_size * num_symbols])

        # Get the top `beam_size` candidates and reshape them such
        # that the number of rows = batch_size * beam_size, which
        # allows us to process each hypothesis independently.
        best_probs, indices = tf.nn.top_k(probs, beam_size)
        indices = tf.stop_gradient(tf.squeeze(tf.reshape(indices, [-1, 1])))
        best_probs = tf.stop_gradient(tf.reshape(best_probs, [-1, 1]))

        symbols = indices % num_symbols       # Which word in vocabulary.
        beam_parent = indices // num_symbols  # Which hypothesis it came from.

        beam_symbols.append(symbols)
        beam_path.append(beam_parent)
        log_beam_probs.append(best_probs)
        return tf.nn.embedding_lookup(embedding, symbols)

    # Setting up graph.
    inputs = [tf.placeholder(tf.float32, shape=[None, num_symbols])
              for i in range(num_steps)]
    for i in range(num_steps):
        beam_search(inputs[i], i + 1)

    # Running the graph.
    input_vals = [0, 0, 0]
    l = np.log
    eps = -10  # exp(-10) ~= 0

    # These values mimic the distribution of vocabulary words
    # from each hypothesis independently (in log scale since
    # they will be put through exp() in softmax).
    input_vals[0] = np.array([[0, eps, l(2), eps, l(3)]])
    # Step 1 beam hypotheses =
    #   (1) Path: [4], prob = log(1 / 2)
    #   (2) Path: [2], prob = log(1 / 3)
    #   (3) Path: [0], prob = log(1 / 6)

    input_vals[1] = np.array([[l(1.2), 0, 0, l(1.1), 0],  # Path [4]
                              [0, eps, eps, eps, eps],    # Path [2]
                              [0, 0, 0, 0, 0]])           # Path [0]
    # Step 2 beam hypotheses =
    #   (1) Path: [2, 0], prob = log(1 / 3) + log(1)
    #   (2) Path: [4, 0], prob = log(1 / 2) + log(1.2 / 5.3)
    #   (3) Path: [4, 3], prob = log(1 / 2) + log(1.1 / 5.3)

    input_vals[2] = np.array([[0, l(1.1), 0, 0, 0],       # Path [2, 0]
                              [eps, 0, eps, eps, eps],    # Path [4, 0]
                              [eps, eps, eps, eps, 0]])   # Path [4, 3]
    # Step 3 beam hypotheses =
    #   (1) Path: [4, 0, 1], prob = log(1 / 2) + log(1.2 / 5.3) + log(1)
    #   (2) Path: [4, 3, 4], prob = log(1 / 2) + log(1.1 / 5.3) + log(1)
    #   (3) Path: [2, 0, 1], prob = log(1 / 3) + log(1) + log(1.1 / 5.1)

    input_feed = {inputs[i]: input_vals[i][:beam_size, :]
                  for i in range(num_steps)}
    output_feed = beam_symbols + beam_path + log_beam_probs
    session = tf.InteractiveSession()
    outputs = session.run(output_feed, feed_dict=input_feed)

    expected_beam_symbols = [[4, 2, 0],
                             [0, 0, 3],
                             [1, 4, 1]]
    expected_beam_path = [[0, 0, 0],
                          [1, 0, 0],
                          [1, 2, 0]]

    print("predicted beam_symbols vs. expected beam_symbols")
    for ind, predicted in enumerate(outputs[:num_steps]):
        print(list(predicted), expected_beam_symbols[ind])
    print("\npredicted beam_path vs. expected beam_path")
    for ind, predicted in enumerate(outputs[num_steps:num_steps * 2]):
        print(list(predicted), expected_beam_path[ind])
    print("\nlog beam probs")
    for log_probs in outputs[2 * num_steps:]:
        print(log_probs)
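A note on the indexing trick in the code above: after the probabilities are reshaped to [-1, beam_size * num_symbols], top_k selects jointly over (hypothesis, word) pairs, so each flat index encodes both pieces of information; index % num_symbols recovers the word and index // num_symbols recovers which hypothesis it extends. For example, with num_symbols = 5, a flat index of 12 decodes to word 12 % 5 = 2 appended to hypothesis 12 // 5 = 2.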
