Simply put, if the behavior policy and the target policy are the same, the method is on-policy; if the two differ, it is called off-policy. In this sense, on-policy learning can actually be viewed as a special case of off-policy learning.
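To make the distinction concrete, here is a minimal sketch (my own illustration, not from the original post) contrasting the TD targets of Sarsa and Q-learning: Sarsa's target uses the next action actually chosen by the behavior policy, so the policy being improved is the one generating the data, while Q-learning's target takes a max over actions, i.e. it evaluates the greedy target policy regardless of which policy produced the sample. The function names and the `Q`/`alpha`/`gamma` arguments are assumptions for illustration.

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy: the TD target uses a_next, the action the behavior
    policy actually took in s_next, so behavior policy == target policy."""
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Off-policy: the TD target uses max_a Q(s_next, a), i.e. the greedy
    target policy, no matter which behavior policy generated (s, a, r, s_next)."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
```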
Because the exploration probability (the ε of an ε-greedy policy) is so small, visiting every state requires a very large number of samples, which is inefficient. A representative exploratory policy is the uniform policy, which assigns equal probability to every action.
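As a rough sketch of why this matters (my own illustration; the `epsilon` value and number of actions are assumed), compare the probability that each policy assigns to a non-greedy action:

```python
import numpy as np

def uniform_policy(n_actions):
    """Exploratory behavior policy: every action has probability 1/|A|."""
    return np.full(n_actions, 1.0 / n_actions)

def epsilon_greedy_policy(q_values, epsilon=0.01):
    """Mostly greedy: each non-greedy action gets only epsilon/|A| probability,
    so with a small epsilon it takes many samples to try every action."""
    n_actions = len(q_values)
    probs = np.full(n_actions, epsilon / n_actions)
    probs[np.argmax(q_values)] += 1.0 - epsilon
    return probs

# With 5 actions and epsilon = 0.01, a non-greedy action is sampled with
# probability 0.002 under epsilon-greedy versus 0.2 under the uniform policy.
print(epsilon_greedy_policy(np.array([0.0, 1.0, 0.0, 0.0, 0.0])))
print(uniform_policy(5))
```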
Next, let's distinguish online learning from offline learning.
Another concept that may be confused with on-policy/off-policy is online/offline learning.
An on-policy learning algorithm such as Sarsa must work online because the updated policy must be used to generate new experience samples.
An off-policy learning algorithm such as Q-learning can work either online or offline. It can either update the value and policy upon receiving an experience sample or update after collecting all experience samples. (Note: although the online version of Q-learning can update its policy in real time, the new policy does not have to be used to generate new samples; Q-learning only wants to use the experience samples to update the action values.)
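Here is a minimal sketch of what this looks like in practice (my own illustration; the environment interface, `behavior_policy`, and hyperparameters are assumptions). The same Q-learning update can be applied either immediately as each sample arrives (online) or afterwards over a batch of stored samples collected by some behavior policy, such as the uniform exploratory policy above (offline). In neither case does the updated greedy policy have to generate the data.

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One Q-learning step: move Q(s, a) toward r + gamma * max_a Q(s_next, a)."""
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])

def q_learning_online(env, Q, behavior_policy, n_steps):
    """Online: update right after each experience sample is received.
    The samples still come from the behavior policy, not the updated greedy policy."""
    s = env.reset()
    for _ in range(n_steps):
        a = behavior_policy(s)
        s_next, r = env.step(a)
        q_update(Q, s, a, r, s_next)
        s = s_next

def q_learning_offline(dataset, Q):
    """Offline: the dataset of (s, a, r, s_next) samples was collected beforehand
    by any behavior policy; Q-learning just replays it to update action values."""
    for (s, a, r, s_next) in dataset:
        q_update(Q, s, a, r, s_next)
```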