Difference between two revisions of "DDPG"

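Latest revision as of 13:35, 1 September 2020 (Tue)

DDPG (Deep Deterministic Policy Gradient) is a reinforcement learning algorithm that uses artificial neural networks. Because DDPG learns in a model-free, off-policy manner, it has the advantage of keeping mistaken actions from accumulating and distorting learning.

Overview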

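DDPG is a model-free, off-policy actor-critic algorithm that combines DPG (Deterministic Policy Gradient) with DQN. It borrows experience replay and slowly updated target networks from DQN, and it builds on DPG, which can operate over continuous action spaces.^[1] The original DQN works in discrete action spaces, whereas DDPG extends the approach to continuous spaces by learning a deterministic policy within the actor-critic framework. For better exploration, an exploration policy $\mu'$ is constructed by adding noise $\mathcal{N}$ to the deterministic policy.^[2]

$\mu'(s)=\mu_{\theta}(s)+\mathcal{N}$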
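The exploration policy above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `toy_policy` is a hypothetical stand-in for $\mu_{\theta}$, the noise is taken to be zero-mean Gaussian, and the clipping to the action bounds follows the algorithm listing later in this article rather than the formula itself.

```python
import random

def explore(mu, state, noise_std, low, high, rng):
    """Return mu(state) plus Gaussian noise, clipped to [low, high]."""
    noisy_action = mu(state) + rng.gauss(0.0, noise_std)
    return max(low, min(high, noisy_action))

# Hypothetical deterministic policy, used only for illustration.
def toy_policy(state):
    return 0.5 * state

rng = random.Random(0)
# Five noisy actions for the same state; each stays inside [-2, 2].
actions = [explore(toy_policy, 1.0, 0.3, -2.0, 2.0, rng) for _ in range(5)]
```

With `noise_std` set to 0 the function returns the deterministic action itself, which is how the policy would normally be evaluated after training.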

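DDPG uses two techniques to stabilize training; both also appear in DQN, as described above. First, it uses two target networks (one for the actor and one for the critic), because they add stability to training. Put simply, we learn from estimated targets, and because the target networks are updated slowly, those estimated targets stay stable. Conceptually, this is like saying "I have an idea of how to play this well, and I will try it out for a while until I find something better", as opposed to saying "I will relearn how to play this entire game after every move". Second, it uses experience replay. It stores a list of tuples (state, action, reward, next_state), and instead of learning only from recent experience, it learns by sampling from all the experience accumulated so far.^[1]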

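Theory

Model

The policy is deterministic because it outputs the action directly. To encourage exploration, Gaussian noise is added to the action chosen by the policy. To compute the Q-value of a state, the actor's output is fed into the Q-network. This is done only during the TD-error computation described later. To stabilize learning, target networks are created for both the critic and the actor. These target networks receive soft updates based on the main networks.^[3]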

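Loss functions

Having described the model architecture, we turn to how the models are trained, or more precisely, to the loss functions of the two models. The losses for the critic (Q) and the actor (mu) are as follows.

$J_{Q}=\frac{1}{N}\sum_{i=1}^{N}\left(r_{i}+\gamma(1-d_{i})\,Q_{targ}(s_{i}',\mu_{targ}(s_{i}'))-Q(s_{i},a_{i})\right)^{2}$

$J_{\mu}=\frac{1}{N}\sum_{i=1}^{N}Q(s_{i},\mu(s_{i}))$

We first look at the actor (policy network) loss. It is simply the mean of the Q-values over the sampled states. To compute the Q-values we use the critic network and pass in the actions computed by the actor network. Since we want the largest possible return (Q-value), we maximize this quantity. The critic loss is a plain TD error in which the target networks are used to compute the Q-value of the next state, and $a_{i}$ is the action stored with the transition. This loss is to be minimized. To backpropagate the error we need the derivative of the Q-function. For the critic loss the derivative is straightforward because the target term is treated as a constant, but for the actor loss the function mu sits inside the Q-value. For this we use the chain rule:^[3]

$J_{\mu}=E[Q(s,\mu(s))]$

$\nabla_{\theta^{\mu}}J_{\mu}=E[\nabla_{\mu}Q(s,\mu(s))\,\nabla_{\theta^{\mu}}\mu(s)]$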
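To make the chain rule concrete, here is a small numerical check in plain Python. Everything in it is a toy assumption chosen for illustration: a scalar linear actor $\mu_{\theta}(s)=\theta s$ and a hand-written critic $Q(s,a)=-(a-2s)^{2}$, whose best action is $a=2s$. The sketch compares the chain-rule gradient with a finite-difference estimate and then runs a few steps of gradient ascent.

```python
def q_value(s, a):
    # Toy critic: highest when the action equals 2 * s.
    return -(a - 2.0 * s) ** 2

def dq_da(s, a):
    # Derivative of the toy critic with respect to the action.
    return -2.0 * (a - 2.0 * s)

def actor(theta, s):
    # Toy deterministic policy mu_theta(s) = theta * s.
    return theta * s

def dmu_dtheta(theta, s):
    # Derivative of the toy policy with respect to theta.
    return s

def objective(theta, states):
    # J(theta): mean of Q(s, mu_theta(s)) over the states.
    return sum(q_value(s, actor(theta, s)) for s in states) / len(states)

def chain_rule_grad(theta, states):
    # Mean of dQ/da * dmu/dtheta, as in the formula above.
    total = sum(dq_da(s, actor(theta, s)) * dmu_dtheta(theta, s) for s in states)
    return total / len(states)

states = [-1.0, 0.5, 2.0]
theta = 0.3
eps = 1e-6
numeric = (objective(theta + eps, states) - objective(theta - eps, states)) / (2 * eps)
analytic = chain_rule_grad(theta, states)

# Gradient ascent on J moves theta toward the optimum theta = 2.
theta_opt = theta
for _ in range(100):
    theta_opt += 0.1 * chain_rule_grad(theta_opt, states)
```

In DDPG the two factors are not written by hand; automatic differentiation through the critic and the actor produces the same product.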

Q-learning

We first revisit the Bellman equation describing the optimal action-value function $Q^{*}(s,a)$. It is given by:

$Q^{*}(s,a)=\underset{s'\sim P}{E}\left[r(s,a)+\gamma\max_{a'}Q^{*}(s',a')\right]$

Here $s'\sim P$ is shorthand for saying that the next state $s'$ is sampled by the environment from the distribution $P(\cdot|s,a)$. This Bellman equation is the starting point for learning an approximator to $Q^{*}(s,a)$. Suppose the approximator is a neural network $Q_{\phi}(s,a)$ with parameters $\phi$, and that we have collected a set ${\mathcal{D}}$ of transitions $(s,a,r,s',d)$, where $d$ indicates whether state $s'$ is terminal. We can set up a mean-squared Bellman error (MSBE) function, which tells us roughly how closely $Q_{\phi}$ comes to satisfying the Bellman equation:

$L(\phi,{\mathcal{D}})=\underset{(s,a,r,s',d)\sim{\mathcal{D}}}{\mathrm{E}}\left[\Bigg(Q_{\phi}(s,a)-\left(r+\gamma(1-d)\max_{a'}Q_{\phi}(s',a')\right)\Bigg)^{2}\right]$

Here, in evaluating $(1-d)$, we have used the Python convention of evaluating $True$ as 1 and $False$ as 0. Thus when d==True (that is, when $s'$ is a terminal state), the Q-function should show that the agent receives no additional reward after the current state. This choice of notation corresponds to what is implemented later in code. Q-learning algorithms for function approximators, such as DQN and DDPG, are largely based on minimizing this MSBE loss function.
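The MSBE and the role of the $(1-d)$ factor can be checked by hand on a tiny example. The sketch below is plain Python with a lookup table standing in for $Q_{\phi}$; the table values, the two transitions, and the discrete action set are all made up for illustration.

```python
def msbe(q, batch, actions, gamma):
    """Mean-squared Bellman error of q over a batch of (s, a, r, s2, d)."""
    total = 0.0
    for s, a, r, s2, d in batch:
        best_next = max(q(s2, a2) for a2 in actions)
        # (1 - d) is 0 when d is True, so terminal transitions keep only r.
        target = r + gamma * (1 - d) * best_next
        total += (q(s, a) - target) ** 2
    return total / len(batch)

# Hypothetical tabular Q-function over two states and two actions.
table = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 0.5, (1, 1): 3.0}

def q_table(s, a):
    return table[(s, a)]

batch = [
    (0, 1, 1.0, 1, False),  # non-terminal: target = 1.0 + 0.9 * 3.0 = 3.7
    (1, 0, 2.0, 0, True),   # terminal:     target = 2.0
]
loss = msbe(q_table, batch, actions=[0, 1], gamma=0.9)
```

The two squared errors are $(2.0-3.7)^{2}=2.89$ and $(0.5-2.0)^{2}=2.25$, so the loss is their mean, 2.57.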

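Replay Buffer: All standard algorithms for training a deep neural network to approximate $Q^{*}(s,a)$ use an experience replay buffer. This is the set ${\mathcal{D}}$ of previous experiences. For the algorithm to behave stably, the replay buffer should be large enough to hold a wide range of experiences, but keeping everything is not always a good idea. If only the most recent data is used, the network overfits to it and training breaks down; if too much experience is used, learning may slow down. Getting this right may take some tuning.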

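Target Network: Q-learning algorithms make use of target networks. The term

$r+\gamma(1-d)\max_{a'}Q_{\phi}(s',a')$

is called the target, because when we minimize the MSBE loss we are trying to make the Q-function more like this target. The problem is that the target depends on the same parameters we are trying to train, $\phi$, and this makes MSBE minimization unstable. The solution is to use a second network, called a target network, whose parameters stay close to $\phi$ but with a time delay, that is, a network that lags behind the first. The parameters of the target network are denoted $\phi_{\text{targ}}$. In DQN-based algorithms, the target network is simply copied from the main network every fixed number of steps. In DDPG-style algorithms, the target network is updated once per main-network update by Polyak averaging:

$\phi_{\text{targ}}\leftarrow\rho\phi_{\text{targ}}+(1-\rho)\phi$

where $\rho$ is a hyperparameter between 0 and 1, usually close to 1.^[4]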
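The Polyak update is short enough to sketch directly. The following plain-Python example treats the parameters as flat lists of floats; the values and $\rho=0.995$ are illustrative assumptions, not taken from the cited sources.

```python
def polyak_update(target_params, main_params, rho):
    """Return rho * target + (1 - rho) * main, element by element."""
    return [rho * t + (1.0 - rho) * m for t, m in zip(target_params, main_params)]

main = [1.0, -2.0]    # main-network parameters (held fixed in this toy example)
target = [0.0, 0.0]   # target-network parameters
rho = 0.995

one_step = polyak_update(target, main, rho)  # moves 0.5% of the way toward main

# With main held fixed, n updates give target = (1 - rho**n) * main.
for _ in range(1000):
    target = polyak_update(target, main, rho)
```

With $\rho$ close to 1 the target moves only a small fraction of the way toward the main parameters on each update, which is what keeps the regression target slowly varying.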

Implementation

  import gym
  import tensorflow as tf
  from tensorflow.keras import layers
  import numpy as np
  import matplotlib.pyplot as plt

We use OpenAI Gym to create the environment. The upper_bound value will be used later to scale the actions.

  problem = "Pendulum-v0"
  env = gym.make(problem)

  # Dimensions of the observation and action vectors
  num_states = env.observation_space.shape[0]
  print("Size of State Space ->  {}".format(num_states))
  num_actions = env.action_space.shape[0]
  print("Size of Action Space ->  {}".format(num_actions))

  # Bounds of the (one-dimensional) action
  upper_bound = env.action_space.high[0]
  lower_bound = env.action_space.low[0]

  print("Max Value of Action ->  {}".format(upper_bound))
  print("Min Value of Action ->  {}".format(lower_bound))

 Size of State Space ->  3
 Size of Action Space ->  1
 Max Value of Action ->  2.0
 Min Value of Action ->  -2.0

To get better exploration from the actor network, we use an Ornstein-Uhlenbeck process to generate noise. It samples noise from a correlated normal distribution.

  class OUActionNoise:
      def __init__(self, mean, std_deviation, theta=0.15, dt=1e-2, x_initial=None):
          self.theta = theta
          self.mean = mean
          self.std_dev = std_deviation
          self.dt = dt
          self.x_initial = x_initial
          self.reset()

      def __call__(self):
          # One Euler step of the Ornstein-Uhlenbeck process
          x = (
              self.x_prev
              + self.theta * (self.mean - self.x_prev) * self.dt
              + self.std_dev * np.sqrt(self.dt) * np.random.normal(size=self.mean.shape)
          )
          # Store x so that the next sample depends on the current one
          self.x_prev = x
          return x

      def reset(self):
          if self.x_initial is not None:
              self.x_prev = self.x_initial
          else:
              self.x_prev = np.zeros_like(self.mean)
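To see what the update rule does, it can help to look at a scalar version without NumPy. The class below is a hypothetical simplification written for this illustration (the name `ScalarOUNoise` and the `seed` argument are not from the cited sources); it applies the same Euler step to a single float.

```python
import math
import random

class ScalarOUNoise:
    """Scalar Ornstein-Uhlenbeck noise using the same Euler step as above."""

    def __init__(self, mean, std_dev, theta=0.15, dt=1e-2, x_initial=0.0, seed=None):
        self.mean = mean
        self.std_dev = std_dev
        self.theta = theta
        self.dt = dt
        self.x_prev = x_initial
        self.rng = random.Random(seed)

    def __call__(self):
        x = (
            self.x_prev
            + self.theta * (self.mean - self.x_prev) * self.dt           # pull toward the mean
            + self.std_dev * math.sqrt(self.dt) * self.rng.gauss(0.0, 1.0)  # random kick
        )
        self.x_prev = x  # the next sample starts from this one
        return x

# With the random term switched off, only the pull toward the mean remains.
drift_only = ScalarOUNoise(mean=1.0, std_dev=0.0)
path = [drift_only() for _ in range(3)]

# With noise switched on, consecutive samples stay close to each other.
noisy = ScalarOUNoise(mean=0.0, std_dev=0.2, seed=0)
samples = [noisy() for _ in range(5)]
```

Because each sample is built from the previous one, the noise is correlated over time, unlike independent Gaussian draws.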

The Buffer class implements experience replay.
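The buffer code itself is not listed in this article. As a rough idea of what such a class might look like, here is a minimal sketch in plain Python; the class name `ReplayBuffer`, its methods, and the fixed-capacity policy are assumptions, chosen so that `sample` returns five sequences in the order used by the training code below (X, A, R, X2, D).

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity, seed=None):
        # A deque with maxlen drops the oldest transition once it is full.
        self.storage = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def store(self, state, action, reward, next_state, done):
        self.storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement from everything stored so far.
        batch = self.rng.sample(list(self.storage), batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.storage)

buf = ReplayBuffer(capacity=3, seed=0)
for i in range(5):
    # Toy transitions: state i leads to state i + 1; the last one is terminal.
    buf.store(i, 0.1 * i, -1.0, i + 1, i == 4)
states, actions, rewards, next_states, dones = buf.sample(2)
```

Sampling uniformly from the whole buffer, rather than replaying transitions in the order they were collected, breaks the correlation between consecutive training examples.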

Model initialization

We initialize four networks: the main actor and critic, and the target actor and critic.^[3]

    # ANN2 is not defined in this article; it is assumed to be a helper
    # that builds a fully connected network from the given layer sizes.

    # Network parameters
    X_shape = (num_states)
    QX_shape = (num_states + num_actions)
    hidden_sizes_1 = (1000, 500, 200)
    hidden_sizes_2 = (400, 200)

    # Main networks: actor (mu) and critic (q_mu)
    mu = ANN2(X_shape, list(hidden_sizes_1) + [num_actions], hidden_activation='relu', output_activation='tanh')
    q_mu = ANN2(QX_shape, list(hidden_sizes_2) + [1], hidden_activation='relu')

    # Target networks
    mu_target = ANN2(X_shape, list(hidden_sizes_1) + [num_actions], hidden_activation='relu', output_activation='tanh')
    q_mu_target = ANN2(QX_shape, list(hidden_sizes_2) + [1], hidden_activation='relu')

Training

We now use the loss functions defined above directly to train the networks. To compute losses and gradients in TF2, the computation must be carried out inside a tf.GradientTape() block. It is recommended to use a separate gradient tape for each network.

  X, A, R, X2, D = replay_buffer.sample(batch_size)
  X = np.asarray(X, dtype=np.float32)
  A = np.asarray(A, dtype=np.float32)
  R = np.asarray(R, dtype=np.float32)
  X2 = np.asarray(X2, dtype=np.float32)
  D = np.asarray(D, dtype=np.float32)
  Xten = tf.convert_to_tensor(X)
  # Assumptions: R and D have shape (batch, 1) so that they broadcast against
  # the critic output, and action_max is the action bound (upper_bound above).

  # Actor optimization
  with tf.GradientTape() as tape2:
    # Call the models directly (not predict_on_batch) so that the tape
    # can trace the computation and produce gradients.
    Aprime = action_max * mu(Xten)
    temp = tf.keras.layers.concatenate([Xten, Aprime], axis=1)
    Q = q_mu(temp)
    mu_loss = -tf.reduce_mean(Q)
    grads_mu = tape2.gradient(mu_loss, mu.trainable_variables)
  mu_losses.append(mu_loss)
  mu_optimizer.apply_gradients(zip(grads_mu, mu.trainable_variables))

  # Critic optimization
  with tf.GradientTape() as tape:
    # The target is a constant, so computing it outside the graph is fine.
    next_a = action_max * mu_target.predict_on_batch(X2)
    temp = np.concatenate((X2, next_a), axis=1)
    q_target = R + gamma * (1 - D) * q_mu_target.predict_on_batch(temp)
    temp2 = np.concatenate((X, A), axis=1)
    qvals = q_mu(temp2)
    q_loss = tf.reduce_mean((qvals - q_target)**2)
    grads_q = tape.gradient(q_loss, q_mu.trainable_variables)
  q_optimizer.apply_gradients(zip(grads_q, q_mu.trainable_variables))
  q_losses.append(q_loss)

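Let us walk through this code block. First we sample from the replay buffer. For the actor, we first compute the actions for the states (X), and then use both the computed actions and the states (X) to compute Q-values with the critic. During backpropagation we differentiate only with respect to the actor variables, so the critic is held constant. The negative sign on the loss is there because the optimizer minimizes, whereas we want to maximize this quantity. For the critic error, we use the target networks to compute the Q-target for the TD error. The Q-values of the current states (X) are computed with the main critic network. The actor is held constant during this step.^[3]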

Algorithm

  Input: initial policy parameters $\theta$, Q-function parameters $\phi$, empty replay buffer $\mathcal{D}$
  Set target parameters equal to main parameters: $\theta_{\text{targ}}\leftarrow\theta$, $\phi_{\text{targ}}\leftarrow\phi$
  repeat
    Observe state $s$ and select action $a=\text{clip}(\mu_{\theta}(s)+\epsilon,a_{Low},a_{High})$, where $\epsilon\sim\mathcal{N}$
    Execute $a$ in the environment
    Observe next state $s'$, reward $r$, and done signal $d$ to indicate whether $s'$ is terminal
    Store $(s,a,r,s',d)$ in replay buffer $\mathcal{D}$
    If $s'$ is terminal, reset environment state
    if it's time to update then
      for however many updates do
         Randomly sample a batch of transitions, $B=\{(s,a,r,s',d)\}$ from $\mathcal{D}$
         Compute targets

             $y(r,s',d)=r+\gamma(1-d)Q_{\phi_{\text{targ}}}(s',\mu_{\theta_{\text{targ}}}(s'))$

         Update Q-function by one step of gradient descent using

             $\nabla_{\phi}\frac{1}{|B|}\sum_{(s,a,r,s',d)\in B}\left(Q_{\phi}(s,a)-y(r,s',d)\right)^{2}$

         Update policy by one step of gradient ascent using

             $\nabla_{\theta}\frac{1}{|B|}\sum_{s\in B}Q_{\phi}(s,\mu_{\theta}(s))$

         Update target networks with

             $\phi_{\text{targ}}\leftarrow\rho\phi_{\text{targ}}+(1-\rho)\phi$
             $\theta_{\text{targ}}\leftarrow\rho\theta_{\text{targ}}+(1-\rho)\theta$

      end for
    end if
  until convergence
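The listing leaves "if it's time to update" and "for however many updates" open. One possible reading, sketched below in plain Python, is to wait for an initial number of environment steps and then, every fixed number of steps, run as many gradient updates as steps have passed since the last round. The parameter names `update_after` and `update_every` are assumptions introduced for this illustration, and the function only counts updates instead of performing them.

```python
def count_updates(total_steps, update_after, update_every):
    """Count gradient updates under a simple 'update every k steps' schedule."""
    updates = 0
    for step in range(1, total_steps + 1):
        # (one environment interaction and one buffer insert would happen here)
        if step >= update_after and step % update_every == 0:
            # Run one update per environment step taken since the last round.
            for _ in range(update_every):
                updates += 1  # placeholder for: sample batch, update Q, mu, targets
    return updates

n_updates = count_updates(total_steps=1000, update_after=100, update_every=50)
```

With these numbers, updates are triggered at steps 100, 150, ..., 1000, that is 19 rounds of 50 updates each, so the ratio of gradient updates to environment steps stays close to one after the warm-up.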

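Footnotes

[1] amifunny, 〈DDPG (Deep Deterministic Policy Gradient)〉, 《Keras》, 2020-06-04, https://keras.io/examples/rl/ddpg_pendulum/
[2] 생각많은 소심남, 〈(RL)Policy Gradient Algorithms〉, 《Tistory》, 2019-06-17, https://talkingaboutme.tistory.com/entry/RL-Policy-Gradient-Algorithms
[3] Sunny Guha, 〈Deep Deterministic Policy Gradient (DDPG): Theory and Implementation〉, 《Medium》, 06-01, https://towardsdatascience.com/deep-deterministic-policy-gradient-ddpg-theory-and-implementation-747a3010e82f
[4] Deep Deterministic Policy Gradient, OpenAI Spinning Up, https://spinningup.openai.com/en/latest/algorithms/ddpg.html#id5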



