Quality evaluation of experimental statistics produced by making use of Big Data

Natalia Golini
2018-01-01

Abstract

In 2017 the Italian Institute of Statistics (Istat) started producing a set of experimental statistics based on Internet data, one of the most relevant Big Data sources. These statistics refer to the activities that enterprises carry out on their websites (web ordering, job vacancies, links to social media, etc.) and are a strict subset of those currently produced by the “Survey on ICT in enterprises”. The idea is to calculate these estimates from website content, which is collected with web scraping tools and processed with text mining techniques. Models are then fitted on the subset of enterprises for which both sources are available: survey-reported values and relevant terms obtained by the web scraping/text mining procedures. Experimental statistics have been obtained with two different estimators: the first is a fully model-based estimator; the second combines model-based estimates and survey estimates. Across the various domains for which they have been calculated, the three sets of estimates (survey, model and combined) are in most cases close to one another (i.e. the model and combined estimates lie within the confidence intervals of the survey estimates). The question is how to evaluate the accuracy of the three sets of estimates in order to understand whether experimental statistics can replace survey ones. The factors that can produce bias in survey estimates (total non-response and response errors) and in the alternative estimates (population under-coverage and prediction errors) are analysed in detail with respect to the real conditions of the 2017 experience. Finally, a simulation study is carried out to investigate the conditions under which a given estimator performs better than the others.
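The comparison described in the abstract can be illustrated with a toy simulation. The sketch below is not the estimators or the simulation design used in the paper; it assumes a binary characteristic (for example, "offers web ordering"), a hypothetical classifier whose prediction matches the truth with probability `accuracy`, response probabilities that may depend on the true value, and a "combined" estimator that uses reported values for respondents and predictions for every other unit. All names and parameter values are illustrative.

```python
import random


def simulate(n_pop=20000, n_sample=2000, p_true=0.3,
             resp_if_1=0.7, resp_if_0=0.5, accuracy=0.9, seed=0):
    """Toy comparison of survey, model-based and combined estimates of a proportion."""
    rng = random.Random(seed)

    # True binary characteristic for every enterprise in the population.
    y = [1 if rng.random() < p_true else 0 for _ in range(n_pop)]

    # Hypothetical web-based prediction: correct with probability `accuracy`.
    pred = [v if rng.random() < accuracy else 1 - v for v in y]

    # Simple random sample; response probability may depend on the true value,
    # which is one way to induce non-response bias in the survey estimate.
    sample = rng.sample(range(n_pop), n_sample)
    respondents = {
        i for i in sample
        if rng.random() < (resp_if_1 if y[i] == 1 else resp_if_0)
    }

    truth = sum(y) / n_pop
    # Survey estimate: unweighted mean of reported values among respondents.
    survey = sum(y[i] for i in respondents) / len(respondents)
    # Model-based estimate: mean of predictions over the whole population.
    model = sum(pred) / n_pop
    # Combined estimate (assumed form): reported value where available,
    # prediction elsewhere.
    combined = sum(y[i] if i in respondents else pred[i]
                   for i in range(n_pop)) / n_pop

    return {"truth": truth, "survey": survey, "model": model, "combined": combined}


if __name__ == "__main__":
    for name, value in simulate().items():
        print(f"{name:>8}: {value:.4f}")
```

Varying `accuracy` and the two response probabilities shows, under these assumptions, how prediction errors and selective non-response push the estimates away from the true proportion in different ways.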
2018
European Conference on Quality of Official Statistics
Krakow
26-29 June 2018
Proceedings Q2018
pp. 1-8
https://www.q2018.pl/papers-presentations/?drawer=Sessions*Session 32*Giulio Barcaroli
Big Data, Internet data, official statistics, model based estimation, quality evaluation
Giulio Barcaroli, Natalia Golini, Paolo Righi
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2318/1740040