On consistent and rate optimal estimation of the missing mass

Ayed, F.; Battiston, M.; Camerlenghi, F.; Favaro, S.

doi:10.1214/20-AIHP1126

Given n samples from a population of individuals belonging to different types with unknown proportions, how do we estimate the probability of discovering a new type at the (n + 1)th draw? This is a classical problem in statistics, commonly referred to as the missing mass estimation problem. Recent results have shown: (i) the impossibility of estimating the missing mass without imposing further assumptions on the type proportions; (ii) the consistency of the Good–Turing estimator of the missing mass under the assumption that the tail of the type proportions decays to zero as a regularly varying function with parameter α ∈ (0, 1); (iii) the rate of convergence n^{-α/2} for the Good–Turing estimator under the class of distributions P that are regularly varying with parameter α ∈ (0, 1). In this paper we introduce an alternative, and remarkably shorter, proof of the impossibility of distribution-free estimation of the missing mass. Besides being of independent interest, our alternative proof suggests a natural approach to strengthen and extend the recent results on the rate of convergence of the Good–Turing estimator under regularly varying type proportions with parameter α ∈ (0, 1). In particular, we show that the convergence rate n^{-α/2} is the best rate that any estimator can achieve, up to a slowly varying function. Furthermore, we prove that a lower bound for the minimax estimation risk must scale at least as n^{-α/2}, which leads us to conjecture that the Good–Turing estimator is a rate optimal minimax estimator under regularly varying type proportions.
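For readers unfamiliar with the Good–Turing estimator referred to above, the following is a minimal illustrative sketch, not code from the paper. It assumes the standard form of the estimator: the number of types observed exactly once in the sample, divided by the sample size n. The function name `good_turing_missing_mass` and the toy sample are hypothetical, chosen only for illustration.

```python
from collections import Counter


def good_turing_missing_mass(sample):
    """Good-Turing estimate of the missing mass.

    Assumes the standard form: (number of types seen exactly once) / n,
    where n is the sample size. The name is hypothetical, for illustration.
    """
    n = len(sample)
    if n == 0:
        raise ValueError("sample must be non-empty")
    # Count how many times each type occurs in the sample.
    counts = Counter(sample)
    # Types observed exactly once (singletons).
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / n


# Toy sample: types "c" and "d" each appear once among 7 draws.
sample = ["a", "b", "a", "c", "d", "a", "b"]
estimate = good_turing_missing_mass(sample)
print(estimate)  # 2 singletons out of 7 draws
```

The estimate uses only the count of singletons, so it can be computed in a single pass over the sample; it says nothing by itself about the convergence rate, which is the subject of the results summarized above.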