On consistent and rate optimal estimation of the missing mass

Favaro S.
2021-01-01

Abstract

Given n samples from a population of individuals belonging to different types with unknown proportions, how do we estimate the probability of discovering a new type at the (n + 1)th draw? This is a classical problem in statistics, commonly referred to as the missing mass estimation problem. Recent results have shown: (i) the impossibility of estimating the missing mass without imposing further assumptions on the types' proportions; (ii) the consistency of the Good–Turing estimator of the missing mass under the assumption that the tail of the types' proportions decays to zero as a regularly varying function with parameter α ∈ (0, 1); (iii) the rate of convergence n^(-α/2) for the Good–Turing estimator over the class of distributions P that are regularly varying with parameter α ∈ (0, 1). In this paper we introduce an alternative, and remarkably shorter, proof of the impossibility of distribution-free estimation of the missing mass. Besides being of independent interest, our alternative proof suggests a natural approach to strengthening and expanding the recent results on the rate of convergence of the Good–Turing estimator under regularly varying type proportions with parameter α ∈ (0, 1). In particular, we show that the convergence rate n^(-α/2) is the best rate that any estimator can achieve, up to a slowly varying function. Furthermore, we prove that the minimax estimation risk must scale at least as n^(-α/2), which leads us to conjecture that the Good–Turing estimator is a rate-optimal minimax estimator under regularly varying type proportions.
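For readers unfamiliar with the estimator discussed in the abstract, the following is a minimal sketch of the standard Good–Turing estimate of the missing mass, namely the number of types observed exactly once divided by the sample size n. The function names `good_turing_missing_mass` and `true_missing_mass`, and the toy data, are illustrative assumptions and do not come from the paper.

```python
from collections import Counter


def good_turing_missing_mass(sample):
    """Good-Turing estimate: (number of types seen exactly once) / n."""
    n = len(sample)
    if n == 0:
        raise ValueError("sample must be non-empty")
    counts = Counter(sample)
    # Count the types that appear exactly once in the sample.
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / n


def true_missing_mass(sample, proportions):
    """Total probability of the types NOT observed in the sample.

    `proportions` is a hypothetical dict mapping each type to its true
    probability; in practice it is unknown, which is the point of the
    estimation problem.
    """
    seen = set(sample)
    return sum(p for t, p in proportions.items() if t not in seen)


if __name__ == "__main__":
    toy_sample = ["a", "b", "a", "c", "d", "a"]
    print(good_turing_missing_mass(toy_sample))
```

In the toy sample, three of the four observed types ("b", "c", "d") appear exactly once among six draws, so the estimate is 3/6. The second helper is only usable in simulations, where the true proportions are known and the estimate can be compared against the quantity it targets.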
Year: 2021
Volume: 57
Issue: 3
Pages: 1476–1494
Keywords: Good–Turing estimator; Minimax rate; Missing mass; Optimal rate of convergence; Regular variation; Two-parameter Poisson–Dirichlet
Authors: Ayed F.; Battiston M.; Camerlenghi F.; Favaro S.
Files in this record:
ABCF_missing.pdf (open access; postprint, author's final version; Adobe PDF, 515.53 kB)

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2318/1810647