Reinforcement Learning Framework

Reinforcement Learning: General Framework

A general reinforcement learning problem, with full observability, can be defined as follows:

Definition 1. (Reinforcement Learning Problem) An RL problem is defined by a tuple $(\mathbb{G},\mathbb{S},\mathbb{A},\mathcal{P},\Pi,\mathcal{R},\gamma,\mathbb{T},\mu)$, where $\mathbb{G} = \{g_0, g_1\}$ is the set containing the environment $g_0$ and the agent $g_1$, $\mathbb{S}$ is the set of states, $\mathbb{A}$ is the set of actions, $\mathcal{P}\colon \mathbb{S} \times \mathbb{A} \to \Delta(\mathbb{S})$ is the environment state transition function, where $\Delta(\mathbb{S})$ denotes the space of probability distributions over $\mathbb{S}$, $\Pi$ is the agent's policy space, $\mathcal{R}\colon \mathbb{S} \times \mathbb{A} \times \mathbb{S} \to \Delta(\mathbb{R})$ is the reward function, $\gamma \in [0,1]$ is the discount factor, $\mathbb{T}$ is the time set, and $\mu \in \Delta(\mathbb{S})$ is the distribution of the initial state $s_0 \in \mathbb{S}$.
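The tuple in Definition 1 can be made concrete as a small data structure. The following is an illustrative sketch, not part of the formal development: all names are ours, the reward is specialized to a deterministic function, and the two-state example environment is invented for demonstration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Illustrative types for Definition 1 (finite, discrete case).
State, Action = int, int
Distribution = Dict[int, float]  # support -> probability

@dataclass
class RLProblem:
    states: List[State]                                   # S
    actions: List[Action]                                 # A
    transition: Callable[[State, Action], Distribution]   # P: S x A -> Delta(S)
    reward: Callable[[State, Action, State], float]       # R (deterministic special case)
    gamma: float                                          # discount factor in [0, 1]
    mu: Distribution                                      # initial-state distribution

# Toy two-state instance: action 0 keeps the state, action 1 flips it,
# and landing in state 1 pays a reward of 1.
problem = RLProblem(
    states=[0, 1],
    actions=[0, 1],
    transition=lambda s, a: {s: 1.0} if a == 0 else {1 - s: 1.0},
    reward=lambda s, a, s2: 1.0 if s2 == 1 else 0.0,
    gamma=0.9,
    mu={0: 1.0},
)
```

The set $\mathbb{G}$ and the policy space $\Pi$ are left implicit here; the agent enters through the policy it selects, as discussed below.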

While the literature typically describes RL in terms of the Markov Decision Process (MDP), Definition 1 takes a different approach by incorporating MDPs into a broader RL problem definition. An MDP models decision-making problems in which state transitions satisfy the Markov property and are partially controlled by an agent. Formally, an MDP is defined as a tuple $(\mathbb{S},\mathbb{A},\mathcal{P},\mathcal{R},\gamma,\mathbb{T},\mu)$, where $\mathbb{S}$ is the set of states, $\mathbb{A}$ is the set of actions, $\mathcal{P}\colon \mathbb{S} \times \mathbb{A} \to \Delta(\mathbb{S})$ is the state transition function, $\mathcal{R}\colon \mathbb{S} \times \mathbb{A} \times \mathbb{S} \to \Delta(\mathbb{R})$ is the reward function, $\gamma \in [0,1]$ is the discount factor, $\mathbb{T}$ is the time set, and $\mu \in \Delta(\mathbb{S})$ is the distribution of the initial state $s_0 \in \mathbb{S}$.

In RL there are two primary entities: the agent and the environment. The environment represents the external system with which the agent interacts. These interactions occur within a temporal context that can be either continuous or discrete and may extend over a finite or infinite time horizon. For the purposes of this discussion, we will focus on scenarios within a discrete-time framework.

The environment is characterized by a state space $\mathbb{S}$, whose dynamics are governed by a transition probability function $\mathcal{P}$. In a discrete-time setting, at each time step $t \in \mathbb{T}$, the environment is in a state $s_t \in \mathbb{S}$, with the initial state being $s_0 \sim \mu$. Given the current state $s_t$, the agent performs an action $a_t$, prompting the environment to transition to a new state $s_{t+1} \sim \mathcal{P}(s_t, a_t)$. Concurrently, the agent receives a reward $r_{t+1} \sim \mathcal{R}(s_t, a_t, s_{t+1})$. This iterative process continues indefinitely or until a termination condition is met, thus defining, at each time step $t \in \mathbb{T}$, a trajectory $\tau_t = (s_0, a_0, r_1, s_1, a_1, r_2, \dots, s_t, a_t, r_{t+1}, s_{t+1})$.
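The interaction loop above can be sketched in a few lines. This is a toy illustration under invented dynamics (the same two-state "flip" environment used throughout our examples), with deterministic transitions standing in for samples from $\mathcal{P}$ and $\mathcal{R}$.

```python
# Hypothetical two-state environment for the loop in the text:
# s_{t+1} ~ P(s_t, a_t), r_{t+1} ~ R(s_t, a_t, s_{t+1}),
# here both collapsed to deterministic functions for clarity.
def step(s, a):
    s_next = s if a == 0 else 1 - s        # action 0 stays, action 1 flips
    r = 1.0 if s_next == 1 else 0.0        # reward depends on the successor state
    return s_next, r

def rollout(policy, s0=0, horizon=5):
    """Run the agent-environment loop, recording the trajectory tau_t."""
    traj = [s0]                            # trajectory starts at s_0
    s = s0
    for _ in range(horizon):
        a = policy(s)                      # agent acts on the current state
        s_next, r = step(s, a)             # environment transitions, emits reward
        traj += [a, r, s_next]             # tau grows as (..., a_t, r_{t+1}, s_{t+1})
        s = s_next
    return traj

traj = rollout(lambda s: 1, horizon=2)     # "always flip" policy for two steps
```

Running this yields the trajectory `[0, 1, 1.0, 1, 1, 0.0, 0]`: the agent flips from state 0 to 1 (reward 1), then from 1 back to 0 (reward 0).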

Let $\mathcal{T}_t$ be the set of all trajectories of length $t$: $\mathcal{T}_t = \left\{\tau_t : \tau_t = (s_0, a_0, r_1, s_1, a_1, r_2, s_2, \dots, s_t, a_t, r_{t+1}, s_{t+1})\right\}$. The trajectory space $\mathcal{T}$ is defined as the union of all $\mathcal{T}_t$, for $t \in \mathbb{T}$: $\mathcal{T} = \bigcup_{t \in \mathbb{T}} \mathcal{T}_t$.

To operate within the environment, the agent selects a policy $\pi \in \Pi$, a function that maps the current state to a probability distribution over the action space $\mathbb{A}$, $\pi\colon \mathbb{S} \to \Delta(\mathbb{A})$. Since the environment is an MDP, the agent's decision depends only on the current state $s_t$, and thus its policy takes only the current state as input. A reinforcement learning algorithm, such as Q-learning, can be conceptualized as a function $L\colon \mathcal{T} \to \Pi$ that maps a realized trajectory to a policy. At each discrete time step $t \in \mathbb{T}$, given a trajectory $\tau_t$, the agent updates its policy to $\pi_t = L(\tau_t)$. Upon observing the current state $s_t$, the agent then samples an action $a_t$ from the probability distribution defined by $\pi_t(s_t)$.
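As a concrete instance of such a map $L$, tabular Q-learning incrementally turns the realized trajectory into a policy: each transition $(s_t, a_t, r_{t+1}, s_{t+1})$ updates a value table from which an $\varepsilon$-greedy policy is derived. The sketch below runs on the same invented two-state flip environment as before; all constants are illustrative.

```python
import random
from collections import defaultdict

random.seed(0)

# Toy environment (ours, not from the text): action 1 flips the state,
# and arriving in state 1 pays a reward of 1.
def step(s, a):
    s_next = s if a == 0 else 1 - s
    return s_next, (1.0 if s_next == 1 else 0.0)

Q = defaultdict(float)                     # Q[(s, a)], implicitly zero-initialized
alpha, gamma, eps = 0.5, 0.9, 0.1          # learning rate, discount, exploration

s = 0
for _ in range(500):
    # epsilon-greedy policy derived from the current Q estimates
    if random.random() < eps:
        a = random.choice([0, 1])
    else:
        a = max([0, 1], key=lambda x: Q[(s, x)])
    s_next, r = step(s, a)
    # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s',a')
    target = r + gamma * max(Q[(s_next, 0)], Q[(s_next, 1)])
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    s = s_next

# The learned (greedy) policy: flip when in state 0, stay when in state 1.
greedy = {st: max([0, 1], key=lambda x: Q[(st, x)]) for st in [0, 1]}
```

Here the trajectory is consumed one transition at a time rather than stored, but the correspondence with $L\colon \mathcal{T} \to \Pi$ is direct: after $t$ steps, the table $Q$ (and hence $\pi_t$) is a function of the trajectory $\tau_t$ realized so far.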

Partially Observable Reinforcement Learning

In an environment with partial observability, the agent does not have direct access to the complete state of the environment. Instead, it receives observations that may provide incomplete or noisy information about the true state. A first-price auction is a good example of a partially observable environment: bidders do not know other bidders' private valuations or, in some cases, the total number of participants. Such scenarios are formally modeled as partially observable reinforcement learning problems.

Definition 2. (Partially Observable Reinforcement Learning Problem) A partially observable reinforcement learning problem is defined by a tuple $(\mathbb{G},\mathbb{S},\mathbb{A},\mathbb{O},\mathcal{P},\mathcal{O},\Pi,\mathcal{R},\gamma,\mathbb{T},\mu)$, where $\mathbb{G} = \{g_0, g_1\}$ is the set containing the environment $g_0$ and the agent $g_1$, $\mathbb{S}$ is the set of states, $\mathbb{A}$ is the set of actions, $\mathbb{O}$ is the set of observations, $\mathcal{P}\colon \mathbb{S} \times \mathbb{A} \to \Delta(\mathbb{S})$ is the environment state transition function, where $\Delta(\mathbb{S})$ denotes the space of probability distributions over $\mathbb{S}$, $\mathcal{O}\colon \mathbb{S} \times \mathbb{S} \to \Delta(\mathbb{O})$ is the observation function, where $\Delta(\mathbb{O})$ denotes the space of probability distributions over $\mathbb{O}$, $\Pi$ is the agent's policy space, whose policies map histories of observations and actions to distributions over actions, $\mathcal{R}\colon \mathbb{S} \times \mathbb{A} \times \mathbb{S} \to \Delta(\mathbb{R})$ is the reward function, $\gamma \in [0,1]$ is the discount factor, $\mathbb{T}$ is the time set, and $\mu \in \Delta(\mathbb{S})$ is the distribution of the initial state $s_0 \in \mathbb{S}$.

A partially observable reinforcement learning problem is structured around two fundamental entities: the environment and the agent, collectively denoted as the set $\mathbb{G}$. Within this framework, the environment exists in various states, represented by the set $\mathbb{S}$, while the agent can perform actions from the set $\mathbb{A}$. The crucial characteristic that distinguishes this from standard reinforcement learning is that the agent cannot directly observe the true state of the environment. Instead, it receives observations from the set $\mathbb{O}$, which may provide incomplete or noisy information about the actual state.
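To make the contrast with Definition 1 concrete, the sketch below shows the two ingredients that change: a noisy observation channel standing in for $\mathcal{O}$, and a policy that consumes the history of observations rather than a state. Everything here is an invented toy (binary states, a flip-noise channel, a majority-vote stand-in for belief tracking), not a construction from the text.

```python
import random

random.seed(1)

# Toy observation function: the agent sees the true successor state with
# probability 1 - noise, and its flip otherwise (a stand-in for O).
def observe(s_next, noise=0.2):
    return s_next if random.random() > noise else 1 - s_next

def history_policy(history):
    """Policies in Pi map histories of observations and actions to actions.
    Here a majority vote over past observations crudely approximates a
    belief about the hidden state: flip (action 1) if we believe s = 0."""
    obs = history[::2]                     # even slots hold observations
    return 1 if sum(obs) <= len(obs) / 2 else 0

history = [observe(1)]                     # initial observation of hidden s_0 = 1
a = history_policy(history)                # action chosen from the history alone
```

The point of the example is structural: unlike the MDP case, where $\pi_t(s_t)$ suffices, here the action can only be a function of the observation-action history, since the true state is never revealed.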