Quality evaluation of experimental statistics produced by making use of Big Data

Barcaroli, Giulio; Golini, Natalia; Righi, Paolo

In 2017 the Italian Institute of Statistics (Istat) started producing a set of experimental statistics based on Internet data, one of the most relevant Big Data sources. These statistics refer to the activities that enterprises carry out on their websites (web ordering, job vacancies, links to social media, etc.) and are a strict subset of those currently produced by the “Survey on ICT in enterprises”. The idea is to calculate these estimates from website content, which is collected with web scraping tools and processed with text mining techniques. Models are then fitted on the subset of enterprises for which both sources are available: the values reported in the survey, and the relevant terms obtained by the web scraping and text mining procedures. The experimental statistics have been obtained with two different estimators: the first is a fully model-based estimator; the second combines model-based estimates and survey estimates. Across the various domains for which they have been calculated, the three sets of estimates (survey, model and combined) are in most cases not far apart, i.e. the model and combined estimates lie within the confidence intervals of the survey estimates. The question is how to evaluate the accuracy of the three sets of estimates in order to understand whether the experimental statistics can replace the survey ones. The different factors that can produce bias in the survey estimates (total non-response and response errors) and in the alternative estimates (population under-coverage and prediction errors) are analysed in detail with respect to the actual conditions of the 2017 experience. Finally, a simulation study is carried out to investigate the conditions under which a given estimator performs better than the others.
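The comparison of the three estimators can be illustrated with a toy simulation. The sketch below is not the authors' procedure: the synthetic population, the single binary website term, the response mechanism (completely at random), the per-term proportion used as a "model", and the equal-weight combination are all assumptions made only for illustration, and the names simulate and mse_over_runs are hypothetical.

```python
import random


def simulate(n_pop=5000, n_sample=500, response_rate=0.7, seed=0):
    """One replication of a toy comparison of survey, model and combined estimates."""
    rng = random.Random(seed)

    # Synthetic population (assumption): y = 1 if the enterprise offers web
    # ordering, x = 1 if a relevant term is found on its scraped website.
    population = []
    for _ in range(n_pop):
        y = 1 if rng.random() < 0.3 else 0
        x = 1 if rng.random() < (0.8 if y else 0.1) else 0
        population.append((x, y))
    true_mean = sum(y for _, y in population) / n_pop

    # Simple random sample with non-response completely at random (assumption).
    sample = rng.sample(population, n_sample)
    respondents = [unit for unit in sample if rng.random() < response_rate]
    if not respondents:
        raise ValueError("no respondents in this replication")

    # Survey estimate: mean of the reported values among respondents.
    survey_est = sum(y for _, y in respondents) / len(respondents)

    # "Model": estimated P(y = 1 | x), fitted on respondents, for whom both
    # the survey value and the website term are available.
    p_hat = {}
    for value in (0, 1):
        ys = [y for x, y in respondents if x == value]
        p_hat[value] = sum(ys) / len(ys) if ys else survey_est

    # Model-based estimate: mean prediction over all scraped websites.
    model_est = sum(p_hat[x] for x, _ in population) / n_pop

    # Combined estimate: equal-weight average (an illustrative choice only).
    combined_est = 0.5 * survey_est + 0.5 * model_est

    return {
        "true": true_mean,
        "survey": survey_est,
        "model": model_est,
        "combined": combined_est,
    }


def mse_over_runs(n_runs=200):
    """Mean squared error of each estimator over independent replications."""
    totals = {"survey": 0.0, "model": 0.0, "combined": 0.0}
    for run in range(n_runs):
        result = simulate(seed=run)
        for name in totals:
            totals[name] += (result[name] - result["true"]) ** 2
    return {name: total / n_runs for name, total in totals.items()}


if __name__ == "__main__":
    print(mse_over_runs())
```

Because non-response is random and every website is covered in this toy setting, none of the bias sources discussed above is present; introducing selective non-response or under-coverage of websites would be the natural way to explore when one estimator outperforms the others.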