Statistical Learning Theory: Slow and fast learning bounds

Extract from my upcoming thesis - 20/03/20

Statistical learning theory

Summary: This chapter is a short introduction to statistical learning theory, which focuses on the derivation of bounds on the generalization error for the binary classification problem. For that matter, we first present the probabilistic setting associated to binary classification and the associated risks. Then, to bound the excess risk of the empirical minimizer, we introduce a few basic concentration inequalities, followed by strategies to relate the excess risk to the complexity of a proposal class, which enable us to derive generalization bounds that are independent of the distribution of the data. Next, we address the looseness of those bounds with more involved concentration inequalities, which, combined with noise assumptions on the distribution of the data, imply fast, distribution-dependent bounds on the generalization error. Finally, we dwelve into the necessary notions to extend those results for the problems tackled in the thesis, and refer to the specific parts of those that require the tools and results presented here. We refer the reader to (Bousquet, Boucheron, and Lugosi 2004), (Boucheron, Bousquet, and Lugosi 2005) and (Györfi 2002) for more detailed introductions on statistical learning theory.

1 - Introduction

Statistical learning theory is described in (Bousquet, Boucheron, and Lugosi 2004, chap. 1) as a mathematical framework for studying the problem of inference, that is of gaining knowledge about some modeled phenomenon from data. It can be considered as a mathematical tool that justifies the soundness of performing machine learning, i.e. of using data collected in some specific context, e.g. in the past, to make decisions in another context, e.g. in the future. For that to be sensible, one has to make assumptions as to how the data collected - called training data - relates to the testing data, and those assumptions are referred to as inductive bias.

Generally, one collects data that belong to a input space \(\mathcal{X}\), also called feature space, that is often a \(d\)-dimensional Euclidean space, i.e. \(\mathcal{X}= \mathbb{R}^d\). Using that information, one wishes to gain knowledge about some value that belongs to an output space \(\mathcal{Y}\). To study the relationship between the collected features and the output value, those are modeled as a pair of dependent random variables r.v. \((X, Y) \in \mathcal{X}\times\mathcal{Y}\) that follow some unknown joint distribution \(P\) defined on a probability space \((\Omega, \mathcal{A}, \mathbb{P})\). We refer to the two first chapters of (Shorack 2000) for a thorough definition of these mathematical objects. The random variable \(Y\) is referred to as the output r.v., while \(X\) is referred to as the input r.v.. We introduce the distribution of \(X\) as \(F\), called the marginal distribution of \(X\), see (Papoulis 1965 Chapter 6, page 171). Mathematically, statistical learning theory is generally interested in recovering some information about distribution of the output r.v. for any possible value of the input r.v. \(X\), i.e. about the conditional random variable \(Y | X=x\) for any \(x \in A\), with \(A \subset \mathcal{X}\) such that (s.t.) \(\mathbb{P}\{X \in A \}=1\). The distribution \(P\) can be decomposed as \(F \times P_{Y|X}\), with \(P_{Y|X}\) the conditional distribution of \(Y\) given \(X\). It is referred to as the posterior probability of \(Y\), and its existence is justified in (Shorack 2000 Chapter 8, Theorem 5.2).

Many learning problems assume that the output random variable \(Y\) is binary, i.e. \(\mathcal{Y}= \{ -1, +1 \}\), such as binary classification or bipartite ranking. Under that assumption, the posterior probability \(P_{Y|X}\) is summarized by the mapping \(\eta\) with: \[\begin{align} \eta : \; \mathcal{X}&\to [0,1],\\ x & \mapsto \mathbb{P}\{ Y=+1 \mid X=x\},\end{align}\] which is referred to as the regression function or as the Bayes scorer in the bipartite ranking literature. The distribution \(P\) is completely characterized by the pair \((F, \eta)\). Another way to specify \(P\) is to define the proportion of positive instances \(p = \mathbb{P}\{ Y = +1 \}\) and the distribution of the conditional random variables \(X|Y=y\) with for both \(y = +1\) and \(y = -1\) as \(G\) and \(H\), respectively. Then, the two characterizations are related by the relations \(G = \eta F/p\) and \(H = (1-\eta)F/(1-p)\) and \(p = \int \eta(x)dF(x)\). They imply \(F = pG + (1-p)H\) and \(\eta/(1-\eta) = pG/( (1-p)H)\).

Example 1: To illustrate this section, we provide an example with \(\mathcal{X}= [0,1]\). We introduce a posterior probability \(\eta\), that depends on the parameter \(a \in (0,1)\): \[\begin{align} \eta(x) = \frac{1}{2} + \frac{1}{2} \text{sgn}(2x-1){ \lvert 1-2x\rvert }^{\frac{1-a}{a}}. \end{align}\] See Figure 1 for a representation of the characterizations of \(P\) for \(a=1/2\).

Representation of \(P\)

Dist. \(n=200\)

Dist. \(n=1000\)

Figure 1: Representation of Example 1 with \(a=1/2\). the left-hand side figure represents the different possible parametrizations of \(P\), namely \((\mu, \eta)\) and \((p, \mu_+, \mu_-)\) (omitting \(p = 1/2\) here), while the two right-hand side figures represent the approximations of \(\mu_+\) and \(\mu_-\) that we obtain using histograms of \(n\) independent realizations of \((X,Y)\), with \(n = 200\) and \(n = 1000\).

We now focus on the binary classification problem, and refer to the thesis for a discussion on the extension of those results to other settings. The goal of binary classification is to find a function \(g: \mathcal{X}\to \mathcal{Y}= \{-1, +1\}\) in a class of candidate functions \(\mathcal{G}\) that minimizes the misclassification error \(R(g)\), \[\begin{align} R(g) := \mathbb{P}\{ g(X) \ne Y \}.\end{align}\] The following proposition characterizes the best classifier in the set of all measurable functions.

Proposition 6: The Bayes classifier is defined as the function \(g^*: \mathcal{X}\to \mathcal{Y}\) s.t.: \[\begin{align} \forall x \in \mathcal{X}, \quad g^*(x) := 2\mathbb{I}\{ \eta(x) > 1/2\} - 1. \end{align}\] For any measurable \(g : \mathcal{X}\to \mathcal{Y}\), we have \(R(g^*) \le R(g)\).

Furthermore, we have the following decomposition of the excess risk: \[\begin{align} \label{eq-decomposition-excess-risk}\tag{1} R(g) - R(g^*) = \mathbb{E}[{ \lvert 2\eta(X) - 1 \rvert } \cdot \mathbb{I}\{ g(X) \ne g^*(X) \}]. \end{align}\]

Proof. Note that: \[\begin{align} R(g) = \mathbb{E}[ \mathbb{E}[ \mathbb{I}\{ g(X) \ne Y \} | X ] ] = \mathbb{E}[ \eta(X) + (1-2\eta(X)) \mathbb{I}\{ g(X) = +1 \} ]. \end{align}\] Since: for any \(x \in \mathcal{X}\) and \(g : \mathcal{X}\to \mathcal{Y}\), \[\begin{align} (1-2\eta(x))\left( \mathbb{I}\{g(x) = +1 \} - \mathbb{I}\{ g^*(x) = + 1\} \right) = { \lvert 2\eta(x)-1\rvert } \cdot \mathbb{I}\{ g(x) \ne g^*(x) \}, \end{align}\] we have that \(R(g) \ge R(g^*)\) for any measurable \(g\).

Note that \(R(g^*) = \mathbb{E}[\min(\eta(X), 1-\eta(X))]\), which means that \(g^*\) has a positive risk unless \(\eta(X) \in \{0, 1\}\) a.s., which means that the Bayes classifier makes classification mistakes unless the classes are separable. A sensible goal when selecting a candidate function in \(\mathcal{G}\) is to find \(g \in \mathcal{G}\) with \(R(g)\) as close as possible to \(R(g^*)\), which is measured by the notion of excess risk, defined as: \[\begin{align} E(g) := R(g) - R(g^*).\end{align}\]

Remark 4: The Bayes classifier of Example 1 is the simple function \(x \mapsto \mathbb{I}\{ x \ge 1/2\}\). In that case, it belongs to the simple proposal class of functions \(\mathcal{G}_{\text{stump}}\) that consists of all of the decision stumps on \([0,1]\), i.e. \(\mathcal{G}_{\text{stump}} = \{ x \mapsto a \cdot \text{sgn}( x - h ) \mid h \in [0,1], a \in \{-1, +1 \} \}\).

Unlike for Remark 4, in most cases, the class \(\mathcal{G}\) is not large enough to contain the Bayes classifier \(g^*\). Then, the gap in performance between any minimizer \(g^\dagger\) of \(R\) in \(\mathcal{G}\), i.e. \(g^\dagger \in {\arg \min}_{g\in\mathcal{G}} \; R(g)\), and \(g^*\) is quantified by the approximation error defined as: \[\begin{align} R(g^\dagger) - R(g^*) > 0.\end{align}\] Controlling the approximation error generally requires an assumption about the smoothness of the target function. One way to introduce this type of assumption is to assume that the regression function \(\eta\) belongs to a Sobolev space, see (Jost 2011 Appendix A.1). While results exist in the case of regression, see e.g. (Smale and Zhou 2011), bounding the approximation error in classification remains an open problem.

Since \(P\) is unknown, we cannot directly measure the risk \(R\). The general idea of statistical learning is to use a notion of empirical risk to approximate \(R\). Introduce a sequence of \(n\) i.i.d. pairs with the same distribution as \((X,Y)\), i.e. \(\mathcal{D}_n = \{(X_i, Y_i)\}_{i=1}^n \overset{\text{i.i.d.}}{\sim} P\), we define the empirical risk \(R_n\), on the data \(\mathcal{D}\) as: \[\begin{align} \label{eq-estimate-empirical-risk}\tag{2} R_n(g) := \frac{1}{n} \sum_{i=1}^n \mathbb{I}\{ g(X_i) \ne Y_i \},\end{align}\] and denote by \(g_n\) the minimizer of \(R_n\) over \(\mathcal{G}\). The precision of the approximation of \(R\) by \(R_n\), and as a result, the choise of \(g_n\), depends on the difference between the empirical distribution and the real distribution, illustrated in Section 3. A quantity of interest is the risk \(R(g_n)\) of the minimizer of the empirical risk, as it quantifies the performance of the classifier learned from data. Observe, that, since \(R_n(g_n) \le R_n(g^\dagger)\) by definition of \(g_n\), we have: \[\begin{align} E(g_n) &\le R(g^\dagger) - R(g^*) + R(g_n) - R_n(g_n) + R_n(g^\dagger) - R(g^\dagger),\\ &\le R(g^\dagger) - R(g^*) + 2 \sup_{g\in\mathcal{G}} { \lvert R_n(g) - R(g)\rvert }. \label{eq-sup_emp_bound}\tag{3}\end{align}\] Eq. (3) shows that one can bound the excess of risk by the approximation error plus the uniform deviation \({ \lVert R_n - R\rVert }_{\mathcal{G}} := \sup_{g\in\mathcal{G}} { \lvert R_n(g) - R(g)\rvert }\) of \(R_n(g) - R(g)\) over \(\mathcal{G}\), which quantifies the generalization error. The bound can be seen as a formalization of the bias-variance tradeoff presented in (Domingos 2012), where the bias term corresponds to the approximation error, while the variance corresponds to the generalization error. When one increase the size of the proposal family \(\mathcal{G}\), the approximation error decreases, while the generalization error increases.

The learning bounds presented throughout the thesis consist in PAC (probably approximately correct)-style bounds, i.e. bounds satisfied for any outcome \(\omega \in A_\delta\) for some \(A_\delta \subset \Omega\) with \(\mathbb{P}\{ A_\delta \} \ge 1 - \delta\), for similar types of uniform deviations over diverse families of functions. All of the bounds concern the generalization error.

Firstly, Section 2 shows how to derive an uniform learning bound, i.e. bounds that do not depend on the distribution of the data, i.e. that hold for the most complicated distributions \(P\) possible, but tend to be quite loose for sensible distributions. Secondly, Section 3 responds to this weakness by deriving tighter bounds under mild assumptions on the data. Finally, Section 4 discusses the extension of such results to the more complicated statistics involved in the thesis and indexes the uses of the tools and results of this chapter in the thesis.

2 - Uniform learning bounds

Most learning bounds are based on the Chernoff bound, which bounds the tail \(\mathbb{P}\{ Z > t\}\) of a r.v. \(Z\) using its moment generating function \(\lambda \mapsto \mathbb{E}[e^{\lambda Z}]\), as shown below.

Proposition 3: (Boucheron, Lugosi, and Massart 2013, sec. 2.1) For any real-valued r.v. \(Z\) and any \(t \in \mathbb{R}\), \[\begin{align} \label{eq-chernoff-bound}\tag{4} \mathbb{P}\left\{ Z > t \right\} \le \inf_{\lambda > 0} \mathbb{E}\left[ e^{\lambda (Z-t)} \right] = \inf_{\lambda > 0} \left( \mathbb{E}\left[ e^{\lambda (Z-\mathbb{E}[Z]) } \right] \cdot e^{\lambda ( \mathbb{E}[Z] - t )} \right), \end{align}\]

Proof. The proof is a simple combination of an increasing transformation with Markov’s inequality. Let \(\lambda > 0\), then \(t \mapsto e^{\lambda t}\) is increasing and the simple bound \(\mathbb{I}\{ x > 1 \} \le x\) holds for \(x \in \mathbb{R}\), hence: \[\begin{align} \label{eq-inter-proof-chernoff}\tag{5} \mathbb{P}\left\{ Z > t \right\} = \mathbb{P}\left\{ e^{\lambda Z} > e^{\lambda t} \right\} = \mathbb{E}\left[ \mathbb{I}\{ e^{\lambda(Z-t)} > 1 \} \right] \le \mathbb{E}\left[ e^{\lambda (Z-t)} \right], \end{align}\] and minimizing the right-hand side of Eq. (5) gives the result.

By substituting \(Z\) with \({ \lVert R_n - R\rVert }_\mathcal{G}\) in Eq. (4), one sees that controlling the uniform deviation can be achieved by combining two separate results. Specifically, one needs to control \(\mathbb{E}[Z]\), a term that measures the expressivity of the proposal class \(\mathcal{G}\), and a term \(Z-\mathbb{E}[Z]\) a term that measures the robustness of the process of selecting \(g\) from data. For sensible families \(\mathcal{G}\), both of these quantities decrease to zero when \(n\) increases, but with different convergence speeds. The parameter \(\lambda\) is typically selected to balance the speeds of convergence of the two terms.

Section 2.1 presents basic concentration inequalities, that are required to study the terms \(\mathbb{E}[Z]\) and \(Z-\mathbb{E}[Z]\) where \(Z = { \lVert R_n - R\rVert }_\mathcal{G}\). Controlling the convergence of \(\mathbb{E}[Z]\) is based on controlling the complexity of the proposal class of functions \(\mathcal{G}\), which is dealt with in Section 2.2. Finally, Section 2.3 exploits the results of Section 2.2 and the McDiarmid’s inequality presented in Section 2.1 to conclude.

2.1 - Basic concentration inequalities

Three distinct but related formulations exist for any result of Section 2.1. One can write as a bound on the tail \(\mathbb{P}\{Z>t\}\) of a r.v. \(Z\), or a bound on the moment generating function \(t \mapsto \mathbb{E}[e^{tZ}]\) of a r.v. \(Z\), which relates simply to the tail of \(Z\) with Chernoff’s bound, see Proposition 3. One form implies the other, and all theorems can be stated in any of those two forms. Finally, consider a bound on the tail of a r.v. \(Z\) that writes \(\mathbb{P}\{ Z > t \} \le f(t)\), and define \(t_\delta\) as the solution in \(t\) of \(\delta = f(t)\) with some \(\delta > 0\). The bound \(\mathbb{P}\{ Z > t \} \le f(t)\) is then equivalent to a simple inequality of the form \(Z \le t_\delta\), that holds with probability (w.p.) larger than (\(\ge\)) \(1-\delta\). For example, Corollary 3 below is provided in that last formulation.

The most famous concentration inequality is Hoeffding’s inequality (Györfi 2002 Theorem 1.2), which gives a PAC bound on the deviation of the mean of bounded i.i.d. r.v.’s from their expectation of the order \(O(n^{-1/2})\). It stems from a simple bound on the tail \(\mathbb{P}\{ Z > t\}\) of a bounded random variable \(Z\), a property introduced in (Györfi 2002 Lemma 1.2 ) and presented in the lemma below.

Lemma 1: Let \(Z\) be a random variable with \(\mathbb{E}[Z] = 0\) and \(a\le Z \le b\) a.s.. Then for \(t > 0\): \[\begin{align} \mathbb{E}\left[ e^{tZ} \right] \le e^{t^2 (b-a)^2 / 8}. \end{align}\]

Proof. The proof of this lemma is the beginning of the proof of the Hoeffding inequality detailed in (Györfi 2002 Lemma 1.2) and is recalled here for completion. By convexity of the exponential function: \[\begin{align} e^{tz} \le \frac{z-a}{b-a} e^{tb} + \frac{b-z}{b-a} e^{ta} \quad \text{for } a \le z \le b. \end{align}\] Exploiting \(\mathbb{E}[Z] = 0\), with \(p = - a/(b-a)\): \[\begin{align} \mathbb{E}[e^{tZ}] & \le -\frac{a}{b-a} e^{tb} + \frac{b}{b-a} e^{ta},\\ & \le \left( 1 - p + pe^{s(b-a)} \right) e^{-ps(b-a)} = e^{\phi(u)}. \end{align}\] with \(u = s(b-a)\), and \(\phi(u) := -pu + \log(1-p + pe^u)\). The derivative of \(\phi(u)\) is: \[\begin{align} \phi'(u) = -p + \frac{p}{p + (1-p) e^{-u}}, \end{align}\] therefore \(\phi(0) = \phi'(0) = 0\). Moreover, \[\begin{align} \phi''(u) = \frac{p(1-p) e^{-u}}{\left( p + (1-p) e^{-u} \right)^2} \le \frac{1}{4}. \end{align}\] By a Taylor series expansion with remainder, for some \(\theta \in [0,u]\), \[\begin{align} \phi(u) = \phi(0) + u \phi'(0) + \frac{u^2}{2} \phi''(\theta) \le \frac{u^2}{8} = \frac{s^2(b-a)^2}{8}, \end{align}\] which concludes the proof.

Corollary 3: Let \(Z_1, \dots Z_n\) be \(n\) i.i.d. r.v. s.t. \(a \le Z_1 \le b\). Denote their mean by \(\bar{Z}_n = (1/n)\sum_{i=1}^n Z_i\). Then for any \(\epsilon > 0\), we have w.p. \(\ge\) \(1-\epsilon\), \[\begin{align} \bar{Z}_n - \mathbb{E}[\bar{Z}_n] \le (b-a) \sqrt{\frac{\log(1/\delta)}{2n}}. \end{align}\]

Proof. By Chernoff’s bound, see Proposition 3, we have: \[\begin{align} \label{eq-proof-hoeffding-chernoff}\tag{6} \mathbb{P}\{ \bar{Z}_n - \mathbb{E}[\bar{Z}_n] > t \} \le \inf_{\lambda > 0} e^{-t\lambda} \mathbb{E}\left[ e^{\lambda(\bar{Z}_n- \mathbb{E}[\bar{Z}_n])}\right]. \end{align}\] The independence between the \(Z_i\)’s, see (Shorack 2000 Chapter 8, Theorem 1.1), followed by Lemma 1, implies that: \[\begin{align} \mathbb{E}\left[ e^{\lambda(\bar{Z}_n- \mathbb{E}[\bar{Z}_n])}\right] = \prod_{i=1}^n \mathbb{E}\left[ e^{(\lambda/n) (Z_1 - \mathbb{E}[Z_1])}\right] = e^{ \lambda^2 (b-a)^2 / (8n)}. \end{align}\] Minimizing the r.h.s. of Eq. (6) in \(\lambda\) gives: \[\begin{align} \label{eq-proof-to_invert_hoeffding}\tag{7} \mathbb{P}\{ \bar{Z}_n - \mathbb{E}[\bar{Z}_n] > t \} \le e^{-2t^2 n /(b-a)^2}. \end{align}\] Inverting that bound, i.e. solving in \(t\) such that \(\delta\) is equal to the r.h.s. of Eq. (7) implies the result.

One can bound the tail of \({ \lvert \bar{Z}_n - \mathbb{E}[\bar{Z}_n]\rvert }\) using the union bound, also referred to as the subadditivity of measures or Boole’s inequality, (Shorack 2000 Chapter 1, Proposition 1.2), between Corollary 3 and its application to the family \(-Z_1, \dots, - Z_n\), which writes: \[\begin{align} \mathbb{P}\{ { \lvert \bar{Z}_n - \mathbb{E}[\bar{Z}_n] \rvert } > t \} \le \mathbb{P}\{ \bar{Z}_n - \mathbb{E}[\bar{Z}_n] > t \} + \mathbb{P}\{ \mathbb{E}[\bar{Z}_n] - \bar{Z}_n > t \} \le 2 e^{-2t^2 n /(b-a)^2}.\end{align}\]

The Hoeffding inequality on its own implies guarantees for finite families of functions \(\mathcal{G}\), as presented in (Bousquet, Boucheron, and Lugosi 2004, sec. 3.4) and shown in the corollary below:

Corollary 1: Assume that \(\mathcal{G}\) is of cardinality \(N\). Then, we have that, for any \(\delta > 0\) and \(n\in\mathbb{N}\), w.p. \(\ge 1-\delta\): \[\begin{align} R(g_n) - R(g^\dagger) &\le \sqrt{ \frac{2}{n} \log \left( \frac{2N}{\delta} \right) }. \end{align}\]

Proof. The union bound implies: \[\begin{align} \mathbb{P}\{ \max_{g\in\mathcal{G}} { \lvert R_n(g) - R(g)\rvert } > t \} & = 1 - \mathbb{P}\{ \cup {g\in\mathcal{G}} { \lvert R_n(g) - R(g)\rvert } \le t \}, \\ & \ge 1 - \sum_{g\in\mathcal{G}} \mathbb{P}\{ { \lvert R_n(g) - R(g)\rvert } \le t \}. \end{align}\] Applying Corollary 3 gives: \[\begin{align} \mathbb{P}\{ \max_{g\in\mathcal{G}} { \lvert R_n(g) - R(g)\rvert } \le t \} & \le \sum_{g\in\mathcal{G}} \mathbb{P}\{ { \lvert R_n(g) - R(g)\rvert } \le t \}, \\ & \le 2N e^{-2t^2 n }, \end{align}\] and inverting the bound gives the result.

The result presented in Lemma 1 can be used to derive bounds on the expectation of the maximum of a family of random variables, as shown in the following lemma. It is useful when considering minimizers of the empirical risk, to upper-bound the value of \(\mathbb{E}[{ \lVert R_n - R\rVert }_\mathcal{G}]\) presented in Section 1.

Lemma 3: (Györfi 2002 Lemma 1.3) Let \(\sigma > 0\), and assume that \(Z_1, \dots, Z_m\) are real-valued r.v.’s such that for any \(t > 0\) and \(1\le j \le m\), \(\mathbb{E}[e^{t Z_j}] \le e^{t^2 \sigma^2 /2}\), then: \[\begin{align} \mathbb{E}\left[ \max_{j\le m} Z_j \right] \le \sigma \sqrt{\log (2m)}. \end{align}\]

Proof. Jensen’s inequality, which states that that for any integrable r.v. \(Z \in \mathbb{R}\) and any convex function \(\phi: \mathbb{R}\to \mathbb{R}\), we have \(\phi(\mathbb{E}[X]) \le \mathbb{E}[\phi(X)]\), see (Shorack 2000 inequality 4.10 in chapter 3 ), implies that, for any \(t > 0\), \[\begin{align} e^{t\mathbb{E}\left[ \max_{j\le m} Z_j \right]} \le \mathbb{E}\left[ e^{t \max_{j\le m} Z_j } \right] = \mathbb{E}\left[ \max_{j \le m} e^{t Z_j} \right] \le \sum_{j=1}^m \mathbb{E}\left[ e^{t Z_j} \right] \le m e^{t^2 \sigma^2 /2}. \end{align}\] Thus, with \(t = \sqrt{2\log(m) /\sigma^2}\), \[\begin{align} \mathbb{E}\left[ \max_{j\le m} Z_j \right] \le \frac{\log(m)}{t} + \frac{t\sigma^2}{2} = \sigma \sqrt{2\log(m)}, \end{align}\] which concludes the proof.

One can again bound the tail of \(\max_{j\le m} { \lvert Z_j\rvert }\) by applying the preceding result to a family of \(2m\) elements by considering \(\max(Z_1, -Z_1, \dots, Z_m, -Z_m)\). Now that we have presented inequalities to deal with the term \(\mathbb{E}[{ \lVert R_n - R\rVert }_\mathcal{G}]\), we introduce the one that controls the deviation of \(\mathbb{E}[{ \lVert R_n - R\rVert }_\mathcal{G}]\) from its mean.

Theorem 2: (Györfi 2002 Lemma 1.4 ) Let \(Z_1, \dots, Z_n\) be a collection of \(n\) i.i.d. random variables. Assume that the some function \(g: \text{Im}(Z_1)^n \to \mathbb{R}\) satisfies the bounded difference assumption, i.e.: \[\begin{align} \sup_{ \substack{z_1,\dots,z_n \in \text{Im}(Z_1)\\ z_i' \in \text{Im}(Z_1)} } { \lvert g(z_1, \dots z_n) - g(z_1, \dots, z_i', \dots, z_n) \rvert } \le c_i, \qquad 1\le i \le n, \end{align}\] then, for all \(t > 0\): \[\begin{align} \mathbb{P}\{ g(Z_1, \dots, Z_n) - \mathbb{E}[ g(Z_1, \dots, Z_n) ] > t \} \le \exp \left\{ - \frac{2 t^2}{\sum_{i=1}^n c_i^2} \right\}. \end{align}\]

Proof. For any \(z_i \in \text{Im}(Z_1)\), introduce: \[\begin{align} V_i &= \mathbb{E}\left[ g(Z_1, \dots, Z_n) \mid Z_1, \dots, Z_i \right] - \mathbb{E}\left[ g(Z_1, \dots, Z_n) \mid Z_1, \dots, Z_{i-1} \right],\\ W_i(z_i) &= \mathbb{E}\left[ g(Z_1, \dots, z_i, \dots, Z_n) - g(Z_1, \dots, Z_n) \mid Z_1, \dots, Z_{i-1} \right]. \end{align}\] Note that \(W_i\) is a random function \(\text{Im}(Z_1) \to \mathbb{R}\) that only depends on the \(Z_1,\dots, Z_{i-1}\), hence \(W_i(Z_i) = V_i\). The definition of \(V_i\) gives the following telescopic sum: \[\begin{align} \label{eq-proof-mcdiarmid-decomp}\tag{8} g(Z_1, \dots, Z_n) - \mathbb{E}[ g(Z_1, \dots, Z_n) ] = \sum_{i=1}^n V_i. \end{align}\] Then \(\inf_{z} W_i(z) \le V_i \le \sup_{z} W_i(z)\) and: \[\begin{align} \sup_{z} W_i(z) - \inf_{z} W_i(z) = \sup_{z} \sup_{z'} (W_i(z) - W_i(z')) \le c_i. \end{align}\] For any \(i \in \{1, \dots, n\}\), from Lemma 1 above conditioned upon the \(Z_1, \dots, Z_{i-1}\): we have that, for any \(t > 0\): \[\begin{align} \mathbb{E}\left[ e^{tV_i} \mid Z_1, \dots, Z_{i-1} \right] \le e^{t^2 c_i^2 / 8}. \end{align}\] Finally, \[\begin{align} \mathbb{E}\left[ e^{ t \left( \sum_{i=1}^n V_i \right) } \right] \le \mathbb{E}\left[ e^{t \sum_{i=1}^{n-1} V_i} \mathbb{E}\left[ e^{t V_n} \mid Z_1, \dots, Z_{n-1} \right] \right] \le e^{t^2 \sum_{i=1}^n c_i^2 / 8}, \end{align}\] by repeating the same argument \(n\) times. Minimizing the Chernoff bound implies the result.

2.2 - The complexity of classes of functions

This section regroups several ways used to control the complexity of a class of proposal functions \(\mathcal{G}\), referred to as capacity measures in (Boucheron, Bousquet, and Lugosi 2005). Those include Rademacher averages, the concept of shatter coefficient and VC dimension, as well as covering numbers. First, we introduce the notion of Rademacher average and relate it to the notion of VC dimension. For an introduction to covering numbers in this context, we refer to (Boucheron, Bousquet, and Lugosi 2005 Section 5.1).

Definition 2: The Rademacher average of \(\mathcal{G}\) is defined as: \[\begin{align} \mathfrak{R}_n(\mathcal{G}) := \mathbb{E}\left[ \sup_{g \in \mathcal{G}} { \lvert \frac{1}{n} \sum_{i=1}^n \sigma_i \mathbb{I}\{ g(X_i) \ne Y_i \} \rvert } \mid (X_i, Y_i)_{i=1}^n \right], \end{align}\] where \(\sigma = (\sigma_i)_{i=1}^n\) is a set of \(n\) i.i.d. Rademacher variables, i.e. \(\mathbb{P}\{ \sigma_i = +1 \} = \mathbb{P}\{ \sigma_i = -1 \} = 1/2\) for any \(i \in \{1, \dots, n\}\).

The Rademacher average measures the capacity of a family of functions to fit random noise, and can be used to bound the expectation of \({ \lVert R_n - R\rVert }_\mathcal{G}\), as shown in the following proposition.

Proposition 2: \[\begin{align} \mathbb{E}\left[ { \lVert R_n - R\rVert }_\mathcal{G} \right] \le 2\mathbb{E}\left[ \mathfrak{R}_n(\mathcal{G}) \right]. \end{align}\]

Proof. The proof is based on a trick called symmetrization, and presented for example in (Bousquet, Boucheron, and Lugosi 2004 Section 4.4). It requires the introduction of an unobserved sample \(\mathcal{D}_n' = (X_i's, Y_i's)_{i=1}^n \overset{\text{i.i.d.}}{\sim} P\), referred to as ghost sample or virtual sample, and we denote by \(R_n'\) the estimate of the same form as Eq. (2), but estimated on \(\mathcal{D}_n\).

Using the simple property \(\sup \mathbb{E}[\cdot] \le \mathbb{E}[\sup (\cdot) ]\), followed by the triangle inequality, gives: \[\begin{align} \mathbb{E}\left[ { \lVert R_n - R\rVert }_\mathcal{G} \right] & = \mathbb{E}\left[ { \lVert R_n - \mathbb{E}[R_n']\rVert }_\mathcal{G} \right] \le \mathbb{E}\left[ { \lVert R_n - R_n'\rVert }_\mathcal{G} \right], \\ & \le \mathbb{E}\left[ \mathbb{E}\left[ \sup_{g\in \mathcal{G}} { \lvert \frac{1}{n} \sum_{i=1}^n \sigma_i \left( \mathbb{I}\{ g(X_i) \ne Y_i \} - \mathbb{I}\{ g(X_i') \ne Y_i'\} \right) \rvert } \mid (X_i, Y_i)_{i=1}^n \right] \right], \\ & \le 2\mathbb{E}\left[ \mathfrak{R}_n(\mathcal{G}) \right]. \end{align}\]

Remark 5: When \(X\) is a continuous random variable, if \(\mathcal{G}\) is the family of all measurable functions, we have: \(\mathfrak{R}_n(\mathcal{G}) \ge 1/2\) a.s., since, a.s.: \[\begin{align} \mathfrak{R}_n(\mathcal{G}) = \frac{1}{n} \cdot \mathbb{E}\left[ \max\left( \sum_{i=1}^n \sigma_i, n - \sum_{i=1}^n \sigma_i \right) \right] \ge \frac{1}{2}, \end{align}\] which makes the bound in Proposition 2 uninformative.

On the other hand, an estimation of the Rademacher average of \(\mathcal{G}_{\text{stump}}\), defined in Remark 4, gave a value of \(0.13\) for \(n = 100\) and \(0.04\) for \(n = 1000\) by averaging over 200 random draws of the \(\sigma_i\)’s, which shows that the Rademacher average decreases rapidly as \(n\) grows.

While the notion of Rademacher average is a standard analysis tool, is used for many important results, e.g. in the derivation of sharp bounds for the supremum of empirical processes, see (Györfi 2002 Section 1.4.6), and allows for the derivation of data-dependent bounds, see e.g. (Boucheron, Bousquet, and Lugosi 2005 Theorem 3.2), it does not relate directly to simple properties of the proposal family.

Introduce the shatter coefficient \(\mathbb{S}_\mathcal{G}(n)\) of a family of functions \(\mathcal{G}\) as the maximal number of ways to split any collection of \(n\) elements in \(\mathcal{X}\times\mathcal{Y}\) with \(\mathcal{G}\), formally: \[\begin{align} \label{eq-def-shatter}\tag{9} \mathbb{S}_\mathcal{G}(n) = \sup_{\{ (x_i, y_i)\}_{i=1}^n \in (\mathcal{X}\times \mathcal{Y})^n} { \lvert \left\{ (\mathbb{I}\{g(x_1) \ne y_1\}, \dots, \mathbb{I}\{g(x_n) \ne y_n\} ) \mid g \in \mathcal{G} \right\}\rvert }.\end{align}\] Note that the shatter coefficient does not depend on the sample \(\mathcal{D}_n\), and is bounded by \(2^n\). A set of \(n\) points is said to be shattered by a family \(\mathcal{G}\) if its shatter coefficient is equal to \(2^n\), with the convention that the empty set is always shattered.

Lemma 3 can be used to link the Rademacher average with the shatter coefficient, which in turn enable us to replace the supremum over \(\mathcal{G}\) with a maximum over a set of \(\mathbb{S}_\mathcal{G}(n)\) vectors, as detailed in the following corollary.

Corollary 2: (Györfi 2002 see the proof of Theorem 1.9) \[\begin{align} \mathfrak{R}_n(\mathcal{G}) \le \sqrt{ \frac{2 \log (2 \mathbb{S}_\mathcal{G}(2n))}{n}}. \end{align}\]

Proof. Introduce the vectors \(a^{(j)} = (a_1^{(j)}, \dots, a_n^{(j)}) \in \{0, 1\}^n\) for \(j \in \{ 1, \dots, \mathbb{S}_{\mathcal{G}}(n)\}\) as an enumeration of the set: \[\begin{align} \left\{ (\mathbb{I}\{g(x_1) \ne y_1\}, \dots, \mathbb{I}\{g(x_n) \ne y_n\} ) \mid g \in \mathcal{G} \right\}, \end{align}\] for which \(\{(x_i, y_i)\}_{i=1}^n\) is a fixed collection that correspond to the supremum in Eq. (9).

Introduce \(Z_j = (1/n) \sum_{i=1}^n \sigma_i a_i^{(j)}\), then: \[\begin{align} \mathfrak{R}_n(\mathcal{G}) \le \mathbb{E}\left[ \max_{j \le \mathbb{S}_\mathcal{G} (n) } { \lvert \frac{1}{n} \sum_{i=1}^n \sigma_i a_i^{(j)} \rvert } \right] = \mathbb{E}\left[ \max_{j \le \mathbb{S}_\mathcal{G} (n) } { \lvert Z_j \rvert } \right] \end{align}\] For any \(j \in \{ 1, \dots, \mathbb{S}_\mathcal{G}(n) \}\), Lemma 1 implies that, since \(-1 \le \sigma_i a_i^{(j)} \le 1\): \[\begin{align} \mathbb{E}\left[ e^{tZ_j} \right] = \prod_{i=1}^n \mathbb{E}\left[ e^{t \frac{\sigma_i a_i^{(j)}}{n}} \right] \le \prod_{i=1}^n e^{\frac{t^2}{2n^2}} = e^{\frac{t^2}{2n}}. \end{align}\] Similar bounds can be obtained for \(\mathbb{E}\left[ e^{-tZ_j} \right]\). Applying Lemma 3 with \(\sigma = 1/\sqrt{n}\) to the \(Z_j\)’s implies the corrolary.

The combination of Proposition 2 and Corollary 2 is referred to as the Vapnik-Chervonenkis inequality, see (Györfi 2002 Theorem 1.9, page 13). When the class \(\mathcal{G}\) is finite of size \(N\), then we have that \(\mathbb{S}_{\mathcal{G}}(n) \le N\), which implies a bound on \(\mathbb{E}\left[ { \lVert R_n-R\rVert }_\mathcal{G} \right]\) that decreases in \(O(n^{-1/2})\). While the shatter coefficient gives a simpler bound on the Rademacher average by removing the expectation, the bound depends of \(n\), and determining the shatter coefficients of simple families of functions is not intuitive. On the other hand, the VC-dimension is intuitive and summarizes the complexity of a class of functions by a single coefficient.

Definition 3: (Bousquet, Boucheron, and Lugosi 2004 Definition 2, page 189) The VC-dimension of a proposal class \(\mathcal{G}\) is defined as the maximum number of sets that a class can shatter, i.e. as the largest \(n\) such that: \[\begin{align} \mathbb{S}_\mathcal{G}(n) = 2^n. \end{align}\]

Remark 6: The VC-dimension of the set of decision stumps introduced in Remark 4 is \(2\), as one can always separate two distinct points, but not the set \(\{ (0.2, +1), (0.3, -1), (0.4, +1) \} \subset \mathcal{X}\times \mathcal{Y}\). We refer to (Mohri, Rostamizadeh, and Talwalkar 2012 Section 3.3, pages 41 to 45) for many examples of proposal families and a discussion on their VC-dimension.

The following lemma relates the VC-dimension to the shatter coefficient.

Lemma 4: (Bousquet, Boucheron, and Lugosi 2004 Lemma 1, page 190) Assume that \(\mathcal{G}\) is a VC-class of functions with VC-dimension \(V\). Then for all \(n \in \mathbb{N}\): \[\begin{align} \mathbb{S}_{\mathcal{G}}(n) \le \sum_{k=0}^V \binom{n}{k} \le (n + 1)^V. \end{align}\]

Proof. The last inequality is simply a consequence of the binomial theorem: \[\begin{align} (n+1)^V = \sum_{k=1}^V \frac{n^k}{k!} \frac{V!}{(V-k)!} \ge \sum_{k=1}^V \frac{n^k}{k!} \ge \sum_{k=1}^V \frac{n!}{(n-k)!} \frac{1}{k!} = \sum_{k=1}^n \binom{n}{k}, \end{align}\] as detailed in (Györfi 2002 Theorem 1.13, page 19) which also contains a proof of the first inequality. However, we recall here the proof of the first inequality of (Shalev-Shwartz and Ben-David 2014, 74, lemma 6.10) as we consider it clearer. Introduce: \[\begin{align} \mathcal{E}(C, \mathcal{G}) = \left\{ (\mathbb{I}\{g(x_1) \ne y_1\}, \dots, \mathbb{I}\{g(x_n) \ne y_n\} ) \mid g \in \mathcal{G} \right\}, \end{align}\] where \(C = \{(x_i, y_i)\}_{i=1}^n\) in Eq. (9). To prove the lemma, it suffices to prove the following stronger claim: For any \(C \in (\mathcal{X}\times\mathcal{Y})^n\) we have: \[\begin{align} \label{eq-stronger-sauer}\tag{10} \text{for any family } \mathcal{G}, \quad { \lvert \mathcal{E}(C, \mathcal{G}) \rvert } \le { \lvert \{ B \subseteq C : \mathcal{G} \text{ shatters } B \}\rvert }. \end{align}\] Eq. (10) is sufficient to prove the lemma since from the definition of VC-dimension, no set whose size is larger than \(V\) is shattered by \(\mathcal{G}\), therefore: \[\begin{align} { \lvert \{ B \subseteq C : \mathcal{G} \text{ shatters } B \}\rvert } \le \sum_{i=0}^d \binom{n}{i}. \end{align}\] Proving Eq. (10) is done with an inductive argument.

For \(n=1\), if \(\mathcal{G}\) shatters \(C\), both sides of Eq. (10) are equal to \(2\) (the empty set is always shattered). If \(\mathcal{G}\) does not shatter \(C\), both sides of Eq. (10) are equal to \(1\).

Assume Eq. (10) holds for sets with less than \(n\) elements. Fix \(\mathcal{G}\) and \(C = \{ (x_1, y_1), \dots, (x_n, y_n) \}\). Denote \(C' = \{ (x_2, y_2), \dots, (x_n, y_n) \}\) and define the following two sets: \[\begin{align} Y_0 &= \left\{ (e_2, \dots, e_n) : (0, e_2, \dots, e_m) \in \mathcal{E}(C, \mathcal{G}) \text{ or}\quad (1, e_2, \dots, e_m) \in \mathcal{E}(C, \mathcal{G}) \right\}, \\ Y_1 &= \left\{ (e_2, \dots, e_n) : (0, e_2, \dots, e_m) \in \mathcal{E}(C, \mathcal{G}) \text{ and } (1, e_2, \dots, e_m) \in \mathcal{E}(C, \mathcal{G}) \right\}. \end{align}\] It is easy to verify that \({ \lvert \mathcal{E}(C, \mathcal{G})\rvert } = { \lvert Y_0\rvert } + { \lvert Y_1\rvert }\).

Since \(Y_0 = \mathcal{E}(C', \mathcal{G})\), using the induction assumption (on \(\mathcal{G}\) and \(C'\)), we have: \[\begin{align} { \lvert Y_0\rvert } = { \lvert \mathcal{E}(C', \mathcal{G})\rvert } \le { \lvert \left\{ B \subseteq C' : \mathcal{G} \text{ shatters } B \right\} \rvert } = { \lvert \left\{ B \subseteq C: (x_1, y_1) \notin B \text{ and } \mathcal{G} \text{ shatters } B \right\} \rvert }. \end{align}\] Define \(\mathcal{G}' \subseteq \mathcal{G}\) to be: \[\begin{align} \mathcal{G}' = \Big\{ g \in \mathcal{G} : \exists g' \in \mathcal{G} \text{ s.t. } \big( 1 - \mathbb{I}\{ g'(x_1) \ne y_1 \}, & \, \mathbb{I}\{ g'(x_2) \ne y_2 \}, \dots, \mathbb{I}\{ g'(x_n) \ne y_n \} \big)\\ = \big( \mathbb{I}\{ g(x_1) \ne y_1 \}, & \, \mathbb{I}\{ g(x_2) \ne y_2 \}, \dots, \mathbb{I}\{ g(x_n) \ne y_n \} \big) \Big\}. \end{align}\] Namely, \(\mathcal{G}'\) contains the pairs of hypotheses that agrees on \(C'\) and differs on \((x_1, y_1)\). Hence, it is clear that if \(\mathcal{G}'\) shatters a set \(B \subset C'\) then it also shatters \(B\cup \{ (x_1, y_1) \}\) and the converse is true.

Combining this with the fact that \(Y_1 = \mathcal{G}'_{C'}\) and using the inductive assumption on \(\mathcal{G}'\) and \(C'\), we obtain that: \[\begin{align} { \lvert Y_1\rvert } = { \lvert \mathcal{E}(C', \mathcal{G}') \rvert } & \le { \lvert \left\{ B \subseteq C' : \mathcal{G}' \text{ shatters } B \right\} \rvert } = { \lvert \left\{ B \subseteq C' : \mathcal{G}' \text{ shatters } B \cup \{ (x_1, y_1 )\} \right\} \rvert }, \\ \end{align}\] which in conjunction with: \[\begin{align} { \lvert \left\{ B \subseteq C' : \mathcal{G}' \text{ shatters } B \cup \{ (x_1, y_1 \} \right\} \rvert } &= { \lvert \left\{ B \subseteq C : (x_1, y_1) \in B \text{ and } \mathcal{G}' \text{ shatters } B \right\} \rvert },\\ &\le { \lvert \left\{ B \subseteq C: (x_1, y_1) \in B \text{ and } \mathcal{G} \text{ shatters } B \right\} \rvert }, \end{align}\] gives an upper bound on \({ \lvert Y_1\rvert }\).

Finally, we have shown that: \[\begin{align} { \lvert \mathcal{E} (C, \mathcal{G})\rvert } & = { \lvert Y_0\rvert } + { \lvert Y_1\rvert }, \\ & \le { \lvert \left\{ B \subseteq C: (x_1, y_1) \notin B \text{ and } \mathcal{G} \text{ shatters } B \right\} \rvert }\\ & \qquad + { \lvert \left\{ B \subseteq C: (x_1, y_1) \in B \text{ and } \mathcal{G} \text{ shatters } B \right\} \rvert },\\ & \le { \lvert \left\{ B \subseteq C: \mathcal{G} \text{ shatters } B \right\} \rvert }, \end{align}\] which concludes the proof, by induction.

All of the results of Section 2.2 come into play to control the quantity \(\mathbb{E}\left[ { \lVert R_n - R\rVert }_\mathcal{G} \right]\) introduced in Proposition 3. They are necessary to derive a first, basic learning bound, presented in the section below.

2.3 - Uniform generalization bounds

By joining the results of Section 2.2 to control \(\mathbb{E}\left[ { \lVert R_n - R\rVert }_\mathcal{G} \right]\) in Proposition 3, with Theorem 2 for the deviation of \({ \lVert R_n - R\rVert }_\mathcal{G}\) to its mean, one can derive a simple uniform learning bound.

By combining Proposition 2, Corollary 2 and Lemma 4, we obtain the following result:

Proposition 7: Assume that \(\mathcal{G}\) is a VC-class of functions with VC-dimension \(V\), then: \[\begin{align} \mathbb{E}\left[ { \lVert R_n - R\rVert }_\mathcal{G} \right] \le \sqrt{ \frac{ 8 \log (2) + 8 V \log (1+2n)}{n}}. \end{align}\]

Proposition 7 gives a bound on \(\mathbb{E}\left[ { \lVert R_n - R\rVert }_\mathcal{G} \right]\), and we now deal with the other term in Proposition 3. The following corollary is a simple application of Theorem 2 to the random variable \({ \lVert R_n - R\rVert }_{\mathcal{G}}\).

Corollary 4: With \(Z = { \lVert R_n - R\rVert }_\mathcal{G}\), we have that: \[\begin{align} \mathbb{E}\left[ e^{\lambda(Z - \mathbb{E}[Z])} \right] \le e^{\lambda^2 n / 8 }. \end{align}\]

Proof. Introduce \(R_n'\) as the estimator where for a fixed \(i \in\{ 1, \dots, n \}\), \((X_i, Y_i)\) was substituted by some value \((X_i', Y_i') \in \mathcal{X}\times\mathcal{Y}\) in the estimator of the risk \(R_n\). The triangle inequality implies: \[\begin{align} { \lvert { \lVert R_n(g) - R(g) \rVert }_{\mathcal{G}} - { \lVert R_n'(g) - R(g) \rVert }_{\mathcal{G}} \rvert } & \le \sup_{g \in \mathcal{G}} { \lvert R_n(g) - R_n'(g) \rvert }, \\ & \le \frac{1}{n} \sup_{g \in \mathcal{G}} { \lvert \mathbb{I}\{ g(X_i) \ne Y_i \} - \mathbb{I}\{ g(X_i') \ne Y_i' \} \rvert } \le \frac{1}{n}, \end{align}\] which implies the result, with Theorem 2.

Joining the results of Proposition 3, with those of Proposition 7 and Corollary 4 gives the following learning bound, with an usual order in \(O(n^{-\frac{1}{2}})\) in statistical learning theory, as the term in \(\log(n)\) is negligible in front of the other terms. A more advanced technique for controlling the complexity of \(\mathcal{G}\), called chaining, allows to derive sharper bounds in \(O(n^{-\frac{1}{2}})\) without the logarithmic term \(\log(n)\). Chaining is mentioned in (Boucheron, Bousquet, and Lugosi 2005 Theorem 3.4), while its proof is featured in (Györfi 2002 Theorem 1.16 and 1.17).

Proposition 1: Assume that \(\mathcal{G}\) has finite VC dimension \(V\), then: for any \(\delta > 0, n \in \mathbb{N}^*\), w.p. \(\ge 1-\delta\), \[\begin{align} R(g_n) - R(g^\dagger) \le \sqrt{ \frac{2 \log \left( 1/\delta\right) }{n} } + 2 \sqrt{ \frac{8 \log (2) + 8V \log(1+2n)}{n} }. \end{align}\]

Proof. It follows from Eq. (3), that: \[\begin{align} \mathbb{P}\{ R(g_n)-R(g^\dagger) > t \} \le \mathbb{P}\left \{ 2 { \lVert R_n - R \rVert }_{\mathcal{G}} > t \right \}. \end{align}\] Combining Proposition 7 and Corollary 4, then minimizing the bound in Proposition 3 in \(\lambda\) gives, with \(C_{V,n}= 8 \log(2) + 8V \log(1+2n)\): \[\begin{align} \label{eq-upper-bound-chernoff}\tag{11} \mathbb{P}\left \{ { \lVert R_n - R \rVert }_{\mathcal{G}} > t \right \} \le \exp \left\{ - 2 \left( \sqrt{n} t - \sqrt{C_{V, n}} \right)^2 \right\}, \end{align}\] and inverting the bound gives the result.

While the bounds hold for any possible distribution \(P\), they are loose for most sensible distributions. Indeed, in many cases, the uniform deviation of the empirical risk \(\sup_{g\in\mathcal{G}} { \lvert R_n(g) - R(g)\rvert }\) over \(\mathcal{G}\) is a very poor bound for \(R(g_n) - R(g^\dagger)\), as mentioned in (Boucheron, Bousquet, and Lugosi 2005 Section 5.2). The looseness of the bound of this section is illustrated in Remark 3..

Remark 3: For the distributions defined in Example 1, we illustrate in Section 9 the value of the bound in Proposition 1 for the proposal family of decision stumps in \(\mathbb{R}\).

Firstly, observe that the bound is very loose, which can be explained by the facts that our distribution is not very complicated and our proposal family \(\mathcal{G}_\text{stump}\) is very favorable for the problem at hand.

Secondly, while the order of the slope is the same for any distribution \(P\), the boxplot seems to decrease faster when \(a\) is close to \(1\) than when \(a\) is close to \(0\), i.e. the learning rate for easily separated distributions seems to be faster than \(O(n^{-1/2})\), which is proven in Section 3.






Figure 2: Boxplot of the regret over 100 generations of a dataset \(\mathcal{D}_n\) for different values of \(n\), for three selected values of the parameter \(a\), compared to the bound derived in Proposition 9. See Remark 4 for more details.

3 - Faster learning bounds

This section introduces the derivations of learning rates faster than \(O(n^{-1/2})\) under a noise assumption on the distribution of the data. Firstly, we introduce concentration inequalities that involve the variance of random variables. Secondly, we introduce noise conditions and explain how they induce an upper-bound on the variance of the excess loss. Finally, we show that combining these two first results gives a fixed-point inequality, which implies fast learning rates.

The methodology presented here is limited to finite classes of functions \(\mathcal{G}\) and to problems for which the Bayes classifier \(g^*\) belongs to \(\mathcal{G}\). In this section, we also discuss the generalization of those results to general classes of functions of controlled complexity that do not contain the Bayes classifier.

We introduce a family of functions \(\mathcal{F}\) such that any \(f\in\mathcal{F}\) corresponds to a \(g \in \mathcal{G}\) with: \[\begin{align} f(X, Y) := \mathbb{I}\{ g(X) \ne Y \} - \mathbb{I}\{ g^*(X) \ne Y \},\end{align}\] i.e. the family \(\mathcal{F}\) is the family of the excess losses of all elements in \(\mathcal{G}\).

The bound of Proposition 1 used the fact that the random variable \(f(X, Y)\) is bounded for any \(g \in \mathcal{G}\). However, in many situations, as we choose a candidate \(g\) that approaches \(g^*\), the variance of \(f(X,Y)\) may be very small in front of the range of \(f(X,Y)\). Since advanced concentrations inequalities exploit the variance of random variables to derive tighter bounds, the next section present concentrations inequalities that use the variance of random variables, which will server to derive tighter bounds than based on hypotheses on the distribution of the data.

3.1 - Advanced concentration inequalities

The derivation of fast learning bounds relies on concentration inequalities that involve the variance of the random variables involved. The simplest one is Bernstein’s inequality, seen in (Boucheron, Lugosi, and Massart 2013 Corollary 2.11). Its proof is featured in (Boucheron, Lugosi, and Massart 2013 Theorem 2.10), and we recall the simplest formulation of the inequality here.

Theorem 1: Let \(Z_1, \dots, Z_n\) be \(n\) i.i.d. real-valued r.v.’s with zero mean, and assume that \({ \lvert Z_i\rvert } \le c\) a.s. Let \(\sigma^2 = \text{Var}(Z_1)\). Then, with \(\bar{Z}_n = (1/n) \sum_{i=1}^n Z_i\), for any \(\delta > 0\), w.p. \(\ge 1-\delta\): \[\begin{align} \bar{Z}_n \le \frac{2c \log(1/\delta)}{3n} + \sqrt{\frac{2\sigma^2 \log(1/\delta)}{n}}. \end{align}\]

Popoviciu’s inequality on variances, see (Egozcue and García 2018), gives an upper bound on the variance of a bounded random variable. Precisely, it states that the variance \(\sigma^2\) of a r.v. \(Z\) such that \(m \le Z \le M\) is smaller than \((M-m)^2/4\). Plugging this result into Theorem 1 gives a similar result as Corollary 3.

Bernstein’s inequality does not give a guarantee on the supremum of a family of functions, but on an empirical mean, as does Hoeffding’s inequality. A direct consequence of Hoeffding’s inequality were guarantees on a finite family of functions, as shown in Corollary 1. As such, only the derivation of guarantees on a finite family of functions are a direct application of the combination of Bernstein’s inequality and noise conditions on \(P\).

To extend those results to more general classes of functions, one requires Talagrand’s inequality below, see (Boucheron, Bousquet, and Lugosi 2005 Section 5.3, Theorem 5.4) for a simple version, and (Boucheron, Lugosi, and Massart 2013 Theorem 7.9, page 225) for the proof. For any \(f \in \mathcal{F}\), we denote by \(Pf := \mathbb{E}[f(X, Y)]\) the distribution of \((X, Y)\) and by \(P_nf := (1/n) \sum_{i=1}^n f(X_i, Y_i)\) the empirical distribution associated to \(\mathcal{D}_n\).

Theorem 3: Let \(b > 0\) and set \(\mathcal{F}\) a family of functions \(\mathcal{X}\to\mathbb{R}\). Assume that, for any \(f \in \mathcal{F}\), \(Pf -f \le b\). Then, w.p \(\ge 1 - \delta\), \[\begin{align} \sup_{f\in\mathcal{F}} (Pf - P_n f) \le 2 \mathbb{E}\left[ \sup_{f\in\mathcal{F}} (Pf - P_nf) \right] + \sqrt{\frac{2 (\sup_{f\in\mathcal{F}} \text{Var}(f) ) \log (1/\delta)}{n} } + \frac{4b \log(1/\delta)}{3n}. \end{align}\]

The formulations of Bernstein’s inequality and Talagrand’s inequality are very similar, and their main difference lies in the fact that, while the first deals with simple means, the second handles the supremum of empirical processes. Note that the first term in the bound of Talagrand’s inequality is controlled in Section 2.2.

These inequalities will prove useful when combined with a clever upper bound of the variance of the excess loss, i.e. the variance of \(f(X,Y)\). Conveniently, that quantity is upper-bounded by a function of the excess risk \(R(g) -R(g^*)\), using the noise conditions introduced in the following section.

3.2 - Noise conditions

The simplest noise assumption to derive faster bounds is called Massart’s noise condition, see (Boucheron, Bousquet, and Lugosi 2005, page 340). It is a local condition on the value of \(\eta\), that assumes that \(\eta(X)\) is always far from \(1/2\), i.e. that there is a clear separation between the two classes for the classification decision.

Assumption 2: There exists \(h > 0\), s.t. \[\begin{align} { \lvert 2\eta(X) - 1\rvert } \ge h \quad \text{a.s.} \end{align}\]

A less restrictive assumption is called the Mammen-Tsybakov noise condition, see (Boucheron, Bousquet, and Lugosi 2005, page 340), and gives a bound on the proportion of \(\mathcal{X}\) for which the classification decision is hard to take. This bound is parameterized by \(a\in[0,1]\). When \(a\) is close to \(1\), it is very easy to separate the two classes, while it is harder when \(a\) is close to \(0\).

Assumption 1: There exists \(B > 0\) and \(a \in [0,1]\), s.t. \[\begin{align} \mathbb{P}\left\{ { \lvert 2\eta(X) - 1\rvert } \le t \right\} \le Bt^{\frac{a}{1-a}}. \end{align}\]

Remark 1: The distribution \((F, \eta)\) introduced in Example 1 satisfies the Mammen-Tsybakov noise condition with parameter \(a\) and \(B=1\).

Observe that Mammen’s noise condition is void when \(a = 0\), and that it is implied with \(a=1\) by Massart’s noise condition. Both noise assumptions can be used to derive a convenient bound on the variance of a function \(f \in \mathcal{F}\), as a consequence of both a second moment bound, i.e. a bound of the form \(\text{Var}(Z) \le \mathbb{E}[Z^2]\), namely: \[\begin{align} \label{eq-var-upper-bound}\tag{12} \text{Var}(f(X,Y)) \le \mathbb{E}[\mathbb{I}\{ g(X) \ne g^*(X) \} ],\end{align}\] and the decomposition of the excess risk presented in Eq. (1): \[\begin{align} R(g) - R(g^*) = \mathbb{E}\left[ { \lvert 2\eta(X)-1\rvert }\mathbb{I}\{g(X) \ne g^*(X) \right].\end{align}\] Combined with Eq. (12) and Eq. (1), Massart’s inequality implies: \[\begin{align} \text{Var}(f(X,Y)) \le \mathbb{E}[\mathbb{I}\{ g(X) \ne g^*(X) \} ] \le \frac{1}{h} \mathbb{E}[ { \lvert 2 \eta(X) - 1\rvert } \mathbb{I}\{ g(X) \ne g^*(X) \} ] = \frac{1}{h}\left( R(g) - R(g^*) \right).\end{align}\] Similarly, the Mammen-Tsybakov noise assumption implies the following proposition.

Proposition 5: Assume that Assumption 1 is true, then, for any \(f \in \mathcal{F}\), there exists \(c\) that depends on only of \(B\) and \(a\), s.t. \[\begin{align} \text{Var}(f(X,Y)) \le c (R(g) - R(g^*))^a \end{align}\]

Proof. The proof relies on Eq. (12) and an equivalent formulation of the Mammen-Tsybakov noise assumption, which states that, there exists \(c > 0\) and \(a \in [0,1]\) such that: \[\begin{align} \mathbb{P}\{ g(X) \ne g^*(X) \} \le c \left( R(g) - R(g^*) \right)^a. \end{align}\] It is implied by Assumption 1. Indeed, Markov’s inequality, i.e. \(\mathbb{I}\{ z > t \} \le z/t\) for any \(z, t \in \mathbb{R}_+\), followed by the Mammen-Tsybakov noise assumption and the observation that \(\mathbb{I}\{A\} \le 1\) for any \(A \in \mathcal{A}\), implies: \[\begin{align} R(g) - R(g^*) &= \mathbb{E}[ { \lvert 2 \eta(X) - 1\rvert } \mathbb{I}\{ g(X) \ne g^*(X) \} ], \\ &\ge t \cdot \mathbb{E}[ \mathbb{I}\{{ \lvert 2 \eta(X) - 1\rvert } > t\} \cdot \mathbb{I}\{ g(X)\ne g^*(X) \} ],\\ &\ge t \cdot \mathbb{P}\{{ \lvert 2 \eta(X) - 1\rvert } > t\} - t \mathbb{E}[ \mathbb{I}\{{ \lvert 2 \eta(X) - 1\rvert } > t\} \cdot \mathbb{I}\{ g(X) = g^*(X) \} ],\\ &\ge (t-Bt^{\frac{1}{1-a}}) - t \mathbb{P}\{ g(X) = g^*(X) \},\\ &\ge t \mathbb{P}\{ g(X) \ne g^*(X) \} - Bt^{\frac{1}{1-a}}. \end{align}\] Taking \(t = \left( \frac{(1-a) \mathbb{P}\{g(X) \ne g^*(X)\}}{B} \right)^{\frac{1-a}{a}}\) gives the result, and \(c = \frac{B^{1-a}}{(1-a)^{1-a} a^a}\).

Both of the preceding results are contained, or consequences of (Bousquet, Boucheron, and Lugosi 2004 Section 5.2). The proof of Proposition 5 uses an equivalent formulation of Proposition 5, which are proven in (Bousquet, Boucheron, and Lugosi 2004 Definition 7). Those results enable us to derive fast, distribution-dependent generalization bounds, by using those results in conjonction with the concentration inequalities of Section 3.1, and solving a fixed-point equation.

3.3 - Distribution-dependent generalization bounds

The reasoning presented here can be found in (Bousquet, Boucheron, and Lugosi 2004 Section 5.2), but is more detailed here. A union bound of Bernstein’s inequality applied on each element of finite family of \(N\) elements gives that, for any \(\delta > 0\), w.p \(\ge 1-\delta\), we have simultaneously for any \(g \in \mathcal{G}\), with \(f\) the corresponding element in \(\mathcal{F}\): \[\begin{align} \label{eq-bernstein-4-bound}\tag{13} R(g) - R(g^*) \le R_n(g) - R_n(g^*) + \frac{2c \log(N/\delta)}{3n} + \sqrt{\frac{2\text{Var}(f(X,Y)) \log(N/\delta)}{n}}.\end{align}\] Combining Eq. (13) with the implications of the Mammen-Tsybakov noise assumption, i.e. with Proposition 5, for the empirical risk minimizer and under the assumption that \(g^*\) belongs to the proposal class, i.e. for \(g = g_n\) and that \(g^* \in \mathcal{G}\), we have: for any \(\delta > 0\), w.p \(\ge 1-\delta\), \[\begin{align} \label{eq-fixed-point-4-bound}\tag{14} R(g_n) - R(g^*) \le \frac{2c \log(N/\delta)}{3n} + (R(g_n) - R(g^*))^{a/2} \sqrt{\frac{2c \log(N/\delta)}{n}},\end{align}\] which is a fixed point inequality in \(R(g_n) - R(g^*)\). To derive an upper-bound on \(R(g_n) - R(g^*)\) from Eq. (14), we use the following Lemma, which is found in (Cucker and Smale 2002 Lemma 7).

Lemma 2: Let \(c_1, c_2 > 0\) and \(s > q > 0\). Then the equation \(x^s - c_1 x^q - c_2 = 0\) has a unique positive zero \(x^*\). In addition, \(x^* \le \max \{ (2c_1)^{1/(s-q)}, (2c_2)^{1/s} \}\).

Proof. Define \(\varphi(x) = x^s - c_1 x^q - c_2\). Then \(\varphi'(x) = x^{q-1}(s x^{s-q} - q c_1 )\). Thus \(\varphi'(x) = 0\) with \(x > 0\) if and only if \(x^{s-q} = qc_1/s\) which proves that \(\varphi'(x)\) has a unique positive zero. Since \(\varphi(x) < 0\), \(\lim_{x \overset{>}{\to} 0} \varphi'(x) \le 0\) and \(\lim_{x\to\infty} \varphi(x) = +\infty\), then \(\varphi(x)\) has a unique positive zero.

A generalization of Theorem 4.2 of (Mignotte 1992), which gives an upper bound on the maximum of the absolute values of the roots of a polynomial, implies the result.

Using Lemma 2, we can prove a fast learning bound for finite families of functions.

Proposition 4: Assume that \(\mathcal{G}\) is a finite family of \(N\) functions and that the Mammen-Tsybakov noise assumption is verified with noise parameter \(a\). For any \(\delta > 0\), we have that, w.p. \(\ge 1-\delta\), \[\begin{align} R(g_n) - R(g^*) &\le \left ( \frac{8 c\log (N/\delta) }{ n} \right )^{1/(2-a)} + \frac{8 c \log (N / \delta) }{ 3n }. \end{align}\]

Proof. The straightforward application of Lemma 2 to Eq. (14), followed by the bound \(\max(x, y) \le x + y\) for \(x,y > 0\) implies: \[\begin{align} R(g_n) - R(g^*) &\le \max \left ( \left ( \frac{8 c\log (N/\delta) }{ n} \right )^{1/(2-a)}, \frac{8 c \log (N / \delta) }{ 3n } \right ),\\ &\le \left ( \frac{8 c\log (N/\delta) }{ n} \right )^{1/(2-a)} + \frac{8 c \log (N / \delta) }{ 3n }. \end{align}\]

Proposition 4 gives a learning bound of the order \(O(n^{-\frac{1}{2-a}})\), under a noise condition parameterized by \(a \in [0,1]\). When \(a\) is close to \(0\), one recovers the order \(O(n^{-\frac{1}{2}})\) derived in Section 2.3. On the other hand, when \(a\) is close to zero, one approaches the fast rate \(O(n^{-1})\). The fast bounds are illustrated in Remark 3. The generalization of Proposition 4 to non-finite families of functions is more involved and is a consequence of Talagrand’s inequality (Theorem 3), and we refer to (Boucheron, Bousquet, and Lugosi 2005 Section 5.3) for that extension. The analysis presented there extends to the case where \(g^*\) does not belongs to \(\mathcal{G}\), which is detailed in (Boucheron, Bousquet, and Lugosi 2005 Section 5.3.5).

Remark 2: With the distributions defined in Example 1, we illustrate the value of the bound in Proposition 4 for the finite proposal family of functions \(\mathcal{G} = \{ i/100 \mid i\in \{ 0, \dots, 1\}\) in Section 9.

Observe in Section 12 that the learning rate of the fast bound seems much more aligned with the empirical learning rate than the one of Section 2.



Figure 3: Boxplot of the regret over 100 generations of a dataset \(\mathcal{D}_n\) for different values of \(n\), for three selected values of the parameter \(a\) compared to the fast bound derived in Proposition 15. See Remark 9 for more details.

In real problems, as the distribution of the data is unknown, one can not verify whether a noise assumption is satisfied by the distribution of the data. Still, fast learning speeds partially justify the looseness of the bounds derived in Section 2 and constitute strong evidence of the soundness of learning from empirical data.


Boucheron, Stéphane, Olivier Bousquet, and Gàbor Lugosi. 2005. “Theory of Classification : A Survey of Some Recent Advances.” ESAIM: Probability and Statistics 9: 323–75.

Boucheron, Stéphane, Gábor Lugosi, and Pascal Massart. 2013. Concentration Inequalities: A Nonasymptotic Theory of Independence. OUP Oxford.

Bousquet, Olivier, Stéphane Boucheron, and Gàbor Lugosi. 2004. “Introduction to Statistical Learning Theory.” In Advanced Lectures on Machine Learning, 169–207.

Cucker, Felipe, and Steve Smale. 2002. “Best Choices for Regularization Parameters in Learning Theory: On the Bias-Variance Problem.” Foundations of Computational Mathematics 2 (January): 413–28.

Domingos, Pedro. 2012. “A Few Useful Things to Know About Machine Learning.” Commun. ACM 55 (10): 78–87.

Egozcue, Martín, and Luis Fuentes García. 2018. “The Variance Upper Bound for a Mixed Random Variable.” Communications in Statistics - Theory and Methods 47 (22): 5391–9395.

Györfi, László. 2002. Principles of Nonparametric Learning. Edited by László Györfi. Springer.

Jost, Jürgen. 2011. Riemannian Geometry and Geometric Analysis.

Mignotte, Maurice. 1992. Mathematics for Computer Algebra. Berlin, Heidelberg: Springer-Verlag.

Mohri, Mehryar, Afshin Rostamizadeh, and Ameet Talwalkar. 2012. Foundations of Machine Learning. MIT Press.

Papoulis, A. 1965. Probability, Random Variables, and Stochastic Processes. McGraw-Hill.

Shalev-Shwartz, Shai, and Shai Ben-David. 2014. Understanding Machine Learning: From Theory to Algorithms. USA: Cambridge University Press.

Shorack, Galen. 2000. “Probability for Statisticians.” Springer.

Smale, Steve, and Ding-Xuan Zhou. 2011. “Estimating the Approximation Error in Learning Theory.” Analysis and Applications 01 (November).

Back to article list - Written on the 20/03/20.