I will try my best to discuss the DQN algorithm in this article.

Basic Introduction

We have witnessed the power of deep learning in solving high-dimensional, computation-heavy problems, and the strength of reinforcement learning in sequential decision making. Combining those two methods is an obvious step, which leads to the appearance of deep reinforcement learning, first put forward by DeepMind.

Meanwhile, the basic assumption behind reinforcement learning is that the agent can gain a deeper understanding of the environment by interacting with it.

MDPs

S: the finite set of states
A: the finite set of actions
P: the state transition matrix
R: the reward function
gamma: the discount factor

We introduce the value function to measure the long-term reward and to judge the policy, in order to make better choices.
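In standard notation (the symbols are my own choice, consistent with the definitions above), the state-value function of a policy pi can be written as:

$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t+1} \mid s_{0} = s\right]$$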

The value function can be rewritten in the form of the Bellman equation, and the standard Bellman equation can be solved by iteration.
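Expanded one step, the Bellman equation reads (again using the notation above):

$$V^{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\left[R(s,a) + \gamma V^{\pi}(s')\right]$$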

Overall, there are three different ways to solve the value function:

Dynamic programming, the Monte Carlo method, and the temporal difference method.

Action-Value Function

We aim at analysing the value of different actions at the current state, which brings about the action-value function.

We can write the action-value function in this form:
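A standard way to write it (my reconstruction, using the same notation as above) is the Bellman equation for the action value:

$$Q^{\pi}(s,a) = \sum_{s'} P(s' \mid s, a)\left[R(s,a) + \gamma \sum_{a'} \pi(a' \mid s')\, Q^{\pi}(s',a')\right]$$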

Optimal Value Function

We can write it in this form:
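The form referred to here is the Bellman optimality equation for the state value:

$$V^{*}(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\left[R(s,a) + \gamma V^{*}(s')\right]$$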

Then, after putting it into the action-value function, we get:
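That is, the Bellman optimality equation for the action value:

$$Q^{*}(s,a) = \sum_{s'} P(s' \mid s, a)\left[R(s,a) + \gamma \max_{a'} Q^{*}(s',a')\right]$$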

Iteration Based on the Bellman Equation

Widely used iteration methods designed to solve the Bellman equation can be categorized into policy iteration and value iteration.

Policy Iteration

Policy evaluation: estimate v_pi. Policy improvement: generate pi' >= pi. It has been proved theoretically that pi converges to the optimal policy.
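A minimal policy-iteration sketch on a tabular MDP is given below. This is my own illustration rather than code from the article; it assumes the transitions are stored as P[s][a] = list of (probability, next_state, reward) tuples.

```python
import numpy as np

def policy_iteration(P, n_states, n_actions, gamma=0.9, tol=1e-6):
    # P[s][a] is assumed to be a list of (prob, next_state, reward) tuples.
    policy = np.zeros(n_states, dtype=int)
    V = np.zeros(n_states)
    while True:
        # Policy evaluation: iterate the Bellman equation for the current policy.
        while True:
            delta = 0.0
            for s in range(n_states):
                v = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][policy[s]])
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < tol:
                break
        # Policy improvement: act greedily with respect to the evaluated V.
        stable = True
        for s in range(n_states):
            q = [sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                 for a in range(n_actions)]
            best = int(np.argmax(q))
            if best != policy[s]:
                policy[s] = best
                stable = False
        if stable:
            return policy, V
```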

Value Iteration

Value iteration is based on the Bellman optimality equation and converts it into an iteration form.
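Concretely, the iteration form is the usual update:

$$V_{k+1}(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\left[R(s,a) + \gamma V_{k}(s')\right]$$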

Difference Between Policy Iteration and Value Iteration

Policy iteration updates the value by using the plain Bellman equation. The value it converges to, v_pi, is the value under the current policy, also called the evaluation of a particular policy.

Value iteration updates the value by using the Bellman optimality equation. The value it converges to, v*, is the optimal value at each state.

Value iteration is more direct when the goal is simply to obtain the optimal value.

Q-Learning

The basic idea of Q-learning comes from value iteration. In value iteration we update the Q value over every state and action in each sweep. But since in practice we only observe a limited number of samples, Q-learning puts forward a new way to update the Q value:
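The update referred to is the standard Q-learning rule, where alpha is the learning rate:

$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[r + \gamma \max_{a'} Q(s',a') - Q(s,a)\right]$$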

The learning rate diminishes the influence of the error term, much like the step size in gradient descent, and the iteration converges to the optimal Q value.

Exploration and Exploitation

Off-policy: a behaviour policy is needed to generate actions, which means the policy used to act is not necessarily the policy being optimized.

Model-free: the algorithm does not consider the model (the detailed dynamics of the environment); it only cares about the observed states and rewards, and generates actions from its policy.

Exploration: generate an action randomly. Beneficial for updating the Q value in order to obtain a better policy.

Exploitation: choose the best action according to the current Q value. Good for testing whether an algorithm works, but it makes it hard to keep updating the Q value.

Epsilon-Greedy Policy

We can combine exploration and exploitation by setting a fixed threshold epsilon: with probability epsilon we explore, and otherwise we exploit. This method is called the epsilon-greedy policy.
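A minimal sketch of epsilon-greedy action selection is shown below; the value of epsilon and the layout of the Q table (a dict keyed by (state, action)) are illustrative assumptions of mine, not from the article.

```python
import random

def epsilon_greedy(Q, state, n_actions, epsilon=0.1):
    # With probability epsilon, explore: pick a random action.
    if random.random() < epsilon:
        return random.randrange(n_actions)
    # Otherwise exploit: pick the action with the highest current Q value.
    # Q is assumed to be a dict mapping (state, action) -> value.
    return max(range(n_actions), key=lambda a: Q[(state, a)])
```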

The Curse of Dimensionality

We store Q(s,a) in a table that covers all the states and actions. But when we deal with image inputs, the number of states grows exponentially and the table becomes impossible to store or compute. So we need to think about how to represent the value function in another way.

First, we introduce value function approximation.

Value Function Approximation

In order to reduce the dimension, we approximate the value function with another function. For example, we may use a linear function to approximate the value function, just like this:

Q(s,a) = f(s,a) = w1*s + w2*a + b

Thus we get Q(s,a) approximated by f(s,a; w), where w denotes the parameters.

Dimensionality Reduction

As is often the case, the dimension of the action space is distinctly smaller than the dimension of the state space. In order to update the Q values more efficiently, we can take only the state as input and output the Q value of every action at once, just like this form:

Q(s) is approximated by f(s, w), where the output is a vector: [Q(s,a1), …, Q(s,an)].

Training the Q-Network in DQN

Training a deep neural network is essentially an optimization problem. The optimization target of a neural network is the loss function, which measures the deviation between the label and the output. As the name suggests, optimizing the loss function means minimizing this deviation. We need a lot of samples to train the parameters of the neural network by gradient descent via backpropagation.
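As a concrete illustration, a minimal Q-network along these lines could look like the sketch below (PyTorch; the layer sizes are placeholder choices of mine, not from the article). It takes the state as input and outputs one Q value per action, matching the form Q(s) ≈ f(s, w) above.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),  # one output per action: [Q(s,a1), ..., Q(s,an)]
        )

    def forward(self, state):
        # state: tensor of shape (batch, state_dim) -> Q values of shape (batch, n_actions)
        return self.net(state)
```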

Following the basic idea of neural network training, we regard the target Q value as the label, and then try to make the predicted Q value approach the target Q value.

Thus, the training loss function is:
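In the naive setting, this is the squared error between the target value and the current estimate (standard notation, with w the network parameters):

$$L(w) = \mathbb{E}\left[\left(r + \gamma \max_{a'} Q(s',a'; w) - Q(s,a; w)\right)^{2}\right]$$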

Naive DQN

The basic idea of naive DQN is to run Q-learning and SGD training together. We store all the samples and then sample from them randomly, which is what we call experience replay: learning by reflecting on past experience.

We run trials for several episodes and store all the data. Then, once a considerable amount of data has been collected, we perform SGD on randomly sampled mini-batches.
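A minimal experience-replay buffer sketch is given below; the capacity and batch size are illustrative values of my own, not from the article.

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        # Store one transition observed while interacting with the environment.
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Randomly sample a mini-batch to break the correlation between
        # consecutive transitions before doing an SGD step.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```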

Improved DQN

There are several different methods used to improve the efficiency of DQN. Double DQN, Prioritized Replay and Dueling Network are three representative methods.

Nature DQN

Nature DQN refers to the method described in "Human Level Control …" by DeepMind, which is also based on experience replay. The difference between it and naive DQN is that Nature DQN introduces a target Q network, like this:

In order to decrease the correlation between the target Q value and the current Q value, they designed a target Q network with a delayed update, which means its parameters are only refreshed after the online network has been trained for a while.
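With a target network, the loss above becomes (using w⁻ for the delayed target-network parameters, a standard notation rather than the article's own):

$$L(w) = \mathbb{E}\left[\left(r + \gamma \max_{a'} Q(s',a'; w^{-}) - Q(s,a; w)\right)^{2}\right]$$

where w⁻ is periodically copied from w and held fixed in between.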

The content of Double DQN, prioritized replay and dueling network will be discussed later, after I have read those papers.

I will also give a brief summary of the policy gradient method and the A3C series of deep reinforcement learning algorithms.