Notes on Distributional Reinforcement Learning
Harvey Huang
Introduction
Traditionally, value-based RL algorithms rely on a critical assumption: the action-value $Q(s, a)$ is directly estimated as a scalar quantity, regardless of whether the algorithm is a tabular method or a function-approximation method[^1]. Bellemare et al. (2017) proposed instead to estimate, for each state and action, a conditional probability distribution of the value rather than its mean[^2]. The rationale behind this conjecture is that the entire distribution carries much richer information (higher-order moments) than its mean, and hence the distribution estimate is expected to yield better performance. The picture below depicts an example of distributional RL in tabular form: $Z(s, a)$ is a collection of learned action-values $Q(s, a)$, $\{Q(s_0, a_0), \ldots, Q(s_t, a_t)\}$, accumulated over time as the agent takes actions in different states and shown in histogram form. Note that $Z(s, a)$ is inherently stochastic and non-stationary.

Distributional RL algorithms outperform many RL algorithms in various complex and challenging games (Bellemare et al., 2017, 2019; Dabney, Ostrovski, et al., 2018; Dabney, Rowland, et al., 2018; Rowland et al., 2019). Moreover, the idea has spread to behavioural and neuroscience studies, where researchers have found evidence that constructing action-values as a distribution rather than a single scalar quantity is biologically plausible (Dabney et al., 2020; Lowet et al., 2020).
In the following sections, I will first introduce each algorithm, and then highlight some key assumptions and issues in the empirical implementation.
High-Level Implementation
The main structure of a (deep) distributional RL algorithm remains the same as in the Deep Q-Network implementation, i.e. two networks with identical architecture, TD updates, etc. The main difference is that the network outputs a conditional distribution $Z(s, a)$ from which the state-action value $Q(s, a)$ is estimated, rather than a point estimate.
The decision-making process itself does not differ from classical reinforcement learning, because it is still governed by action values. The difference lies in how the action values are computed: each action-value is obtained by integrating over its estimated distribution (i.e. taking the expected value of the action-value distribution). Thus, the main challenge is embedded within the distribution approximation. More specifically, the following two broad questions arise:
- How do we approximate an action-value distribution $Z(s, a)$?
- How do we embed the concept of an action-value distribution $Z(s, a)$ into the canonical RL framework?
The first question goes beyond the field of RL. Loosely speaking, there are two branches: parametric and non-parametric. In the parametric approach, we may assume that each distribution can be finitely represented by a known parametric family (e.g. a Gaussian with mean and standard deviation, or a Student-t with degrees of freedom). Finding a distribution is then equivalent to finding the parameters of the engineer-specified distribution function. Some distributional RL studies belong to this branch and stick to a tight family of parametric distribution functions, e.g. the Gaussian distribution (Engel et al., 2005; Ghavamzadeh & Engel, 2007; Morimura et al., 2012; Sato et al., 2001) or a mixture of Gaussians (Barth-Maron et al., 2018; Choi et al., 2019). However, the success of parametric estimation depends heavily on the validity of the distributional assumption.
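As a toy illustration of the parametric branch (my own example, not taken from any of the cited papers), fitting a Gaussian to a hypothetical sample of returns for a single state-action pair reduces to estimating just two parameters:

```python
import numpy as np

# Hypothetical sample of observed returns for a single (s, a) pair.
rng = np.random.default_rng(0)
returns = rng.normal(loc=2.0, scale=1.5, size=1_000)

# Under the Gaussian assumption, "finding the distribution" reduces to
# finding its two parameters (here via their maximum-likelihood estimates).
mu_hat, sigma_hat = returns.mean(), returns.std()
print(mu_hat, sigma_hat)  # close to the true values (2.0, 1.5)
```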
Alternatively, in a non-parametric approach, we can approximate a distribution by generating statistical characteristics of it from a given sample. Each distribution possesses certain statistics that characterize it; that is, if we know some statistical properties of a distribution, we know roughly what the distribution looks like. For instance, a probability density function can be characterized by equal-sized bins with different heights (either in probabilities or frequencies), namely, a histogram[^3]. The non-parametric method is more flexible, yet computationally more difficult, than the parametric method.
Note that the notion of parametric vs. non-parametric here refers to the distribution only. Whether one uses a parametric model (e.g. a neural network) to compute the distribution's parameters or its summary statistics is an implementation choice, and is potentially relevant to the second question.
The second question is more involved in the context of reinforcement learning. Replacing the scalar action value with a distribution raises more issues in the implementation. For instance, how do we calculate the target action-value distribution like we do for a scalar action-value? Can we obtain convergence towards the true action-value distribution? What metrics should we use to calculate the distance between distributions? The answer to these questions will depend on algorithm specifications.
While the implementation varies between algorithms, it is important to keep these two top-layer questions in mind. Below, I will first lay out the details of three distributional RL algorithms from these two perspectives, and then briefly discuss the limitations of distributional RL algorithms and their connections to classical statistics.
Categorical distributional RL (Histogram)
In so-called categorical RL (Bellemare et al., 2017), each action-value distribution is approximated by a histogram. Before I dive into the empirical details of this approach, readers are reminded that, from a purely statistical view, there is no compelling reason to use a histogram as an estimator of the true probability density function (PDF): given a finite sample, a histogram is a crude estimate of the true PDF, and with a fixed set of bins it is not even a consistent estimator, even if the true PDF is stationary. For a more detailed discussion, I refer readers to Chapter 14 (Estimating Distributions and Densities) of Shalizi (2022).
To characterize a histogram, we need two components: (1) the bins and their locations on the x-axis, and (2) the corresponding height (frequency) of each bin on the y-axis. In this case, we assume that the true action-value PDF can be approximated by a histogram truncated to a pre-specified range $[V_{\min}, V_{\max}]$, even though the true PDF may have support from $-\infty$ to $+\infty$ on the x-axis. We also assume that the bins within this boundary are equal-sized. In Bellemare et al. (2017), the authors assumed 51 bin edges (called atoms) in total on the x-axis (hence the name C51), thus 50 bins in total. The vector output from the neural network contains the frequencies of the bins (i.e. the collection of heights) for each action $a$, conditional on the state input $s_t$. Each action value $Q(s_t, a_t)$ is calculated as the probability-weighted average over the bins (i.e. the expected value).
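A minimal sketch of this expected-value computation (shapes, names, and the placeholder support bounds are my own, not the exact C51 code):

```python
import numpy as np

def action_values(probs, v_min=-10.0, v_max=10.0):
    # probs: array of shape (n_actions, n_atoms), the bin probabilities
    # output by the network for a single state.
    n_atoms = probs.shape[-1]
    support = np.linspace(v_min, v_max, n_atoms)  # atom locations z_i on the x-axis
    return probs @ support                        # Q(s, a) = sum_i p_i(s, a) * z_i

# The greedy action is then simply np.argmax(action_values(probs)).
```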

To perform a temporal-difference update, the algorithm first shrinks the histogram by the discount factor $\gamma$ and shifts the atoms (on the x-axis) by the reward $r_t$, then re-distributes the probability mass back onto the original atom locations. The resulting distribution is the target action-value histogram. This shrink, shift and re-distribute process seems convoluted, yet it is required to address one downside of using a histogram: without the re-distribution step, the histogram would immediately move out of the truncated range $[V_{\min}, V_{\max}]$. The process is illustrated in the original paper (see below).
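In code, the projection can be sketched roughly as follows (a minimal NumPy sketch with my own variable names; terminal-state handling and the target network are omitted):

```python
import numpy as np

def categorical_projection(rewards, next_probs, gamma, v_min, v_max):
    """Shrink by gamma, shift by the reward, and re-distribute the probability
    mass of the next-state distribution back onto the fixed support."""
    batch_size, n_atoms = next_probs.shape
    support = np.linspace(v_min, v_max, n_atoms)           # atom locations z_i
    delta_z = (v_max - v_min) / (n_atoms - 1)
    # Shrink and shift, then clip so the shifted atoms stay inside [v_min, v_max].
    tz = np.clip(rewards[:, None] + gamma * support[None, :], v_min, v_max)
    b = (tz - v_min) / delta_z                              # fractional atom index
    lower, upper = np.floor(b).astype(int), np.ceil(b).astype(int)
    target = np.zeros_like(next_probs)
    rows = np.arange(batch_size)[:, None]
    # Split each mass between the two neighbouring atoms, proportional to distance.
    np.add.at(target, (rows, lower), next_probs * (upper - b))
    np.add.at(target, (rows, upper), next_probs * (b - lower))
    # If b lands exactly on an atom, both weights above are zero, so the mass
    # has to be re-assigned explicitly (see footnote 4).
    np.add.at(target, (rows, lower), next_probs * (lower == upper))
    return target
```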

The loss value, which is then used to update the weights of the neural network, is calculated as the Kullback-Leibler divergence between the predicted distribution and the projected target distribution. This choice was due to a technical issue in the implementation: in Bellemare et al. (2017), the authors used the Wasserstein distance for the convergence proof, but the Wasserstein metric cannot, in general, be minimized via stochastic gradient descent (its sample gradients are biased), hence the use of the Kullback-Leibler divergence in the empirical implementation (refer to Bellemare et al. (2017) for the pseudocode[^4]). To visualize categorical distributional RL in action, readers may visit this video illustration.
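Concretely, minimizing this KL divergence with respect to the network parameters amounts to minimizing a cross-entropy; a sketch (my own naming):

```python
import numpy as np

def c51_loss(pred_probs, target_probs, eps=1e-8):
    # Cross-entropy of the projected target distribution under the predicted one;
    # it equals KL(target || pred) up to a term that does not depend on the network,
    # so it is the quantity actually minimised by stochastic gradient descent.
    return -np.mean(np.sum(target_probs * np.log(pred_probs + eps), axis=-1))
```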
The empirical limitation of the histogram approach is obvious: how do we ensure that the true action-values lie inside the pre-specified boundary $[V_{\min}, V_{\max}]$? And if they do not, how do we determine an appropriate boundary in the first place? In practice, one needs to specify a wide enough range to partially mitigate this concern, but at the same time one also needs a small enough bin size to be able to differentiate the distributions of different actions. In a nutshell, the balance between the support range and the bin size is critical in the empirical implementation.
Quantile distributional RL
Inspired by statistics and econometrics, the notion of value-distribution approximation was then extended from the histogram approach to the quantile approach (Dabney, Rowland, et al., 2018; Taylor, 1999). In quantile distributional RL (QR-DQN), each action-value distribution is approximated through its quantile function, i.e. the inverse cumulative distribution function (CDF). To visualize how the quantile method works in the Atari games, readers may visit this video illustration. The true inverse CDF can be characterized by the true quantile function or, equivalently, by the entire (infinite) set of quantile values. In general, the true quantile function is unknown, so it has to be approximated. In the case of distributional RL, we are interested in the conditional distribution $Z(s, a)$ given state $s$ and action $a$. A set of discrete quantile values is one way to summarize and proxy this conditional distribution. Based on the agent's experience, quantile regression allows us to find these conditional quantile values (Koenker & Hallock, 2001). To allow for non-linear relations, a neural network trained with the quantile regression loss is used to generate the quantile values. Graphically, to characterize a CDF via a set of discrete quantile values, we need two components: (1) the pre-specified quantile levels on the y-axis, and (2) the corresponding value of each quantile on the x-axis.


The quantile value $\theta_\tau$ at level $\tau$ is the value that minimizes a weighted average of the distances between the observations $z$ and $\theta_\tau$, where the weights depend on whether $z$ lies above $\theta_\tau$ or not:

$$\theta_\tau = \arg\min_{\theta} \; \mathbb{E}\big[\rho_\tau(z - \theta)\big], \qquad \rho_\tau(u) = u\,\big(\tau - \mathbb{1}\{u < 0\}\big).$$

For example, if $\tau = 0.5$, one gets the median: quantile regression finds the median by minimizing the average absolute deviation from $\theta$. More generally, for a given $\tau$ the minimizer is the $\tau$-th quantile: the loss weighs observations above $\theta$ by $\tau$, while observations below it get weight $1 - \tau$.
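A quick numerical check of this property on toy Gaussian data (my own example):

```python
import numpy as np

def pinball_loss(theta, z, tau):
    # Observations above theta get weight tau; those below get weight (1 - tau).
    u = z - theta
    return np.mean(np.maximum(tau * u, (tau - 1.0) * u))

z = np.random.default_rng(0).normal(size=10_000)
grid = np.linspace(-3.0, 3.0, 601)
for tau in (0.1, 0.5, 0.9):
    theta_star = grid[np.argmin([pinball_loss(t, z, tau) for t in grid])]
    print(tau, theta_star, np.quantile(z, tau))  # minimiser ~ empirical tau-quantile
```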
In QR-DQN, the vector output from the neural network constitutes the quantile values for each action $a$, conditional on the state input $s_t$. Each action-value $Q(s_t, a_t)$ is the expectation computed from the estimated quantile values conditional on the state and action. Viewed from the PDF's perspective, the quantile approach effectively constructs equal-probability bins (each holding the same probability mass) instead of the equal-width bins whose locations are pre-specified by the engineers; the density of the PDF is then represented by how "clustered" the quantile values are on the x-axis.
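A minimal sketch of this expectation, assuming equally spaced quantile levels as in QR-DQN (names are my own):

```python
import numpy as np

def action_values_from_quantiles(quantile_values):
    # quantile_values: array of shape (n_actions, N) holding the estimated quantile
    # values for N equally spaced quantile levels. With equal spacing, the
    # expectation is just the uniform average of the quantile values.
    return quantile_values.mean(axis=-1)
```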
The temporal-difference update in the quantile approach is significantly less complicated than in the categorical approach. Since distribution boundaries are no longer a concern with quantiles, we may shrink the distribution by $\gamma$ and add the reward directly to the quantile values. However, a strong assumption is required to perform this direct TD update on the quantile values.
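A sketch of the resulting target (my own naming; it presupposes the assumption just mentioned):

```python
def quantile_td_target(reward, gamma, next_quantiles):
    # No projection step is needed here: scale the next-state quantile values
    # by gamma and shift them by the reward, element-wise.
    return reward + gamma * next_quantiles
```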
The QR-DQN algorithm requires pre-determined meta-parameters: the number of statistics (quantile levels) and their positions. In principle, the more statistics we have, the better the approximation we obtain. However, this is not feasible in practice due to modelling and computational capacity limitations. One extension of the existing quantile approach is to leverage a neural network's function-fitting ability to learn the statistics themselves. This can be done either explicitly, using another neural network (Yang et al., 2019), or implicitly, by incorporating the function approximation as part of the value-learning network (Dabney, Ostrovski, et al., 2018).

Expectile distributional RL
The expectile approach (ER-DQN) is somewhat similar to the quantile approach. Expectiles are a family of summary statistics that generalize the mean, much as quantiles generalize the median (Newey & Powell, 1987). While the 0.5-quantile recovers the median, the 0.5-expectile recovers the mean. Appendix A4 and Figure 9 in Rowland et al. (2019) document the differences between quantiles and expectiles.
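The analogy is easiest to see in the loss: expectiles minimize an asymmetric squared error, whereas quantiles minimize an asymmetric absolute error. A sketch (my own naming):

```python
import numpy as np

def expectile_loss(theta, z, tau):
    # Asymmetric *squared* loss: the tau = 0.5 minimiser is the mean, just as the
    # tau = 0.5 minimiser of the (absolute) pinball loss is the median.
    u = z - theta
    weight = np.where(u >= 0, tau, 1.0 - tau)
    return np.mean(weight * u ** 2)
```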
Readers are reminded that there is no theoretical justification for expecting the expectile approach to perform better than the quantile approach in general. From statistics, we know that the relative efficiency depends on the underlying distribution (De Rossi & Harvey, 2009). In fact, if the mean of a distribution does not exist (e.g. the Cauchy distribution), the sample expectiles may not even have a limiting distribution.
Prior to ER-DQN, discussions of expectiles were mainly found in the financial risk-management literature (e.g. Bellini & Di Bernardino, 2017; Daouia et al., 2018; Taylor, 2008) and the statistics literature (e.g. Kneib, 2013; Waltrup et al., 2015).
Although expectiles are similar to quantiles in characterizing a distribution, the temporal-difference update in the expectile approach cannot simply inherit the one used in the quantile approach. In general, despite serving the same purpose of approximating a distribution, each type of distributional RL requires its own temporal-difference update. Rowland et al. (2019) provide a unified framework for different statistics. The authors emphasize that the temporal-difference update should be performed on samples rather than on the statistics themselves, and they call the process of retrieving samples from a set of statistics the imputation strategy. Under certain assumptions the temporal-difference update may appear to apply directly to the statistics, yet empirically this often results in a catastrophic collapse of the value distribution except in some very special cases (Rowland et al., 2019).
Discussion on distributional RL
In summary, there are two layers of tasks in distributional RL: (i) decipher the true relation between the summary statistics and the state-action pair $(s, a)$, and (ii) identify the true conditional action-value distribution from those statistics. There are interesting topics in both tasks.
Task (i) is a function-approximation task; it can be approached with a neural network. Distributional RL has its own unique problems here. For instance, there is no guarantee that the output values from the neural network are monotonically increasing, which violates the definition of such statistics on the probability space. For quantiles, this is known as the crossing-quantile problem in the statistics literature (Kneib, 2013; Waltrup et al., 2015). Although the crossing-quantile issue seemed under-appreciated by the distributional RL community at first, a recent paper by Zhou et al. (2020) proposed a solution that enforces the monotonicity of the output quantile values.
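One generic way to enforce monotone outputs (a simple construction for illustration, not necessarily the exact parameterisation of Zhou et al. (2020)) is to predict non-negative increments and cumulatively sum them:

```python
import numpy as np

def monotone_quantiles(raw_outputs):
    # raw_outputs: unconstrained network outputs of shape (..., N).
    # The first entry is left free; the remaining entries are mapped through a
    # softplus to be non-negative and cumulatively summed, so the resulting
    # quantile values are non-decreasing by construction.
    base = raw_outputs[..., :1]
    increments = np.log1p(np.exp(raw_outputs[..., 1:]))  # softplus
    return np.concatenate([base, base + np.cumsum(increments, axis=-1)], axis=-1)
```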
Issues in task (ii) have been addressed more thoroughly in classical statistics. Broadly, two branches of classical statistics are related to distributional RL: (1) statistical inference and (2) statistical learning theory.
Statistical inference concerns distribution identification: as more sample values are collected, can we identify the true underlying distribution (Casella & Berger, 2021)? The convergence from the sample to the true underlying distribution relies on the Glivenko-Cantelli theorem: as the sample size $n \to \infty$, the empirical CDF converges uniformly to the true CDF almost surely (and so do its statistics, such as quantile values). However, this result does not generalize easily to histograms; more assumptions are required to obtain convergence (e.g. kernel methods). Hence, intuitively, it is more reasonable to use the empirical CDF than a histogram to approximate an action-value distribution.
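A quick toy illustration of this uniform convergence on Gaussian data (not part of any RL algorithm):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
for n in (100, 1_000, 10_000):
    sample = np.sort(rng.normal(size=n))
    ecdf = np.arange(1, n + 1) / n                      # empirical CDF at the sorted points
    sup_gap = np.max(np.abs(ecdf - norm.cdf(sample)))   # approximate sup-norm distance
    print(n, sup_gap)                                   # the gap shrinks as n grows
```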
Statistical learning theory concerns estimating functional dependency from a sample of fixed size (Vapnik, 2013). From a machine-learning perspective, statistical learning theory deals with supervised learning problems: given a sample of size $n$ and a pre-specified family of distributions (models) as candidates, which distribution in the family is most likely to be the true underlying one? The model-selection error, i.e. the probability of choosing a wrong model, is crucial. To control the model-selection error, the Vapnik-Chervonenkis theorem is required, and the notion of sample complexity is also relevant.
To some extent, the problems in distributional RL are closely related to statistical learning, with a slightly different focus. On the one hand, the imputation strategy attempts to identify an action-value distribution from a set of $N$ statistics generated by a parametric neural network, rather than from a given sample. On the other hand, $N$ is fixed and pre-determined by the engineers. Notice that in this case the uniqueness of the imputed sample becomes a crucial problem: one cannot obtain a unique sample without imposing heavy assumptions on the size of the imputed sample. Viewed from the perspective of statistical learning theory, these assumptions restrict the family of candidate action-value distributions (effectively down to a single possible candidate when the imputed sample size equals $N$). Yet there exists a dilemma in the choice of $N$ and the imputed sample size: if $N$ is too small, the quantile approximation becomes miserable as soon as a richer sample is required; if we impute a large sample while keeping $N$ small, the Vapnik-Chervonenkis dimension of the space of possible true action-value distributions is large and, with high probability, we will not be able to identify the correct (conditional) action-value distribution; and if we start with an arbitrarily large $N$, it is computationally so costly that it raises the question of whether tracking the entire action-value distribution for each state-action pair is efficient and necessary.
In general, one cannot identify the true action-value distribution with the above approaches when only summary statistics are available. The observations (e.g. quantiles) do not allow one to identify the true distribution when their number is limited and small. This issue is at the heart of sample complexity: a much larger sample (many more quantiles) is needed in order to identify the distribution.
One could argue that, unlike supervised learning, the ultimate goal of an RL system is to find the optimal policy through iterative Bellman updates, so is it even necessary to identify the true action-value distribution? Ideally, we would like to identify the true action-value distribution in order to obtain more robust out-of-sample inference. Although one should not expect this ideal case to be generically achievable, its infeasibility should not prevent distributional RL algorithms from utilizing existing tools in classical statistics to establish a more robust iterative policy update. This is one of the motivations of this dissertation.
Much work is needed for a more comprehensive theoretical analysis of the efficacy of distributional RL algorithms. It is not obvious how to prove that distributional RL is universally better than mean-estimation RL, owing to the difficulty of establishing a closed-form relationship between the expected returns of an MDP and the full state-action value distribution (Rowland et al., 2019). In fact, some variants of distributional RL perform exactly the same as classical RL in both the tabular setting and the linear function-approximation setting (Lyle et al., 2019). This is not surprising: in an environment where there exists a sufficient statistic for the mean, and the statistic is efficient and can be updated recursively, it is not necessary to keep track of the entire distribution. In a Gaussian world where $Z(s, a)$ is normally distributed, canonical RL is computationally the easiest.
References
Footnotes
[^1]: This section benefits greatly from discussions with Matthew Farrugia Roberts.

[^2]: To clarify, Morimura et al. (2010) and Morimura et al. (2012) were the first to propose a distributional version of the value function, but Bellemare et al. (2017) were the first to apply the distributional concept in a control setting (i.e. with actions); see Bellemare et al. (2017) for more details.

[^3]: Although, in my opinion, histograms are good for visualization, they are a suboptimal way of approximating a distribution; readers are encouraged to use quantiles whenever possible, see the discussion below.

[^4]: In line 16 of their pseudocode, the projected probability mass is re-assigned to the target distribution according to the lower and upper atom indices $l$ and $u$. Note that although this check is omitted in the original pseudocode, in practice it is important to test whether $l$ equals $u$ and adjust the probability re-assignment accordingly. Otherwise, the resulting distribution does not sum to one whenever a shifted value lands exactly on an atom or is clipped to the boundary $[V_{\min}, V_{\max}]$, leading to a distribution collapse/explosion.