CINECA IRIS Institutional Research Information System

Accessible data curation and analytics for international-scale citizen science datasets

Murray B.; Kerfoot E.; Chen L.; Deng J.; Graham M. S.; Sudre C. H.; Molteni E.; Canas L. S.; Antonelli M.; Klaser K.; Visconti A.; Hammers A.; Chan A. T.; Franks P. W.; Davies R.; Wolf J.; Spector T. D.; Steves C. J.; Modat M.; Ourselin S.

2021-01-01

Abstract

The Covid Symptom Study, a smartphone-based surveillance study on COVID-19 symptoms in the population, is an exemplar of big data citizen science. As of May 23rd, 2021, over 5 million participants have collectively logged over 360 million self-assessment reports since its introduction in March 2020. The success of the Covid Symptom Study creates significant technical challenges around effective data curation. The primary issue is scale. The size of the dataset means that it can no longer be readily processed using standard Python-based data analytics software such as Pandas on commodity hardware. Alternative technologies exist but carry a higher technical complexity and are less accessible to many researchers. We present ExeTera, a Python-based open source software package designed to provide Pandas-like data analytics on datasets that approach terabyte scales. We present its design and capabilities, and show how it is a critical component of a data curation pipeline that enables reproducible research across an international research group for the Covid Symptom Study.

Field	Value
Year	2021
Publication language	English
ISI WoS code	WOS:000721583100001
Scopus code	2-s2.0-85119660232
Refereed	Yes, but type not specified
Journal title	SCIENTIFIC DATA
Volume	8
Issue	1
Pages (from)	1
Pages (to)	17
Total number of pages	17
DOI	https://dx.doi.org/10.1038/s41597-021-01071-x
Product URL (open access archives, full text on publisher's site, etc.)	https://www.nature.com/articles/s41597-021-01071-x
Co-authors affiliated with foreign institutions	Yes
Product compliant with the University Regulation on open access?	1 – product with file in Open Access version (I will attach the file at step 6 - Upload)
Type (Sito Docente)	262
Number of authors	20
All authors	Murray B.; Kerfoot E.; Chen L.; Deng J.; Graham M.S.; Sudre C.H.; Molteni E.; Canas L.S.; Antonelli M.; Klaser K.; Visconti A.; Hammers A.; Chan A.T.; ...
Type	info:eu-repo/semantics/article
Fulltext	open
Type	03-Journal Contribution::03A-Journal Article
Appears in types	03A-Journal Article

Files in this product:

File	Size	Format
Murray2021COVID.pdf (Open access; file type: publisher's PDF)	3.07 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2318/1953033
