The impact of news exposure on collective attention in the United States during the 2016 Zika epidemic

In recent years, many studies have drawn attention to the important role of collective awareness and human behaviour during epidemic outbreaks. A number of modelling efforts have investigated the interaction between the disease transmission dynamics and human behaviour change mediated by news coverage and by information spreading in the population. Yet, given the scarcity of data on public awareness during an epidemic, few studies have relied on empirical data. Here, we use fine-grained, geo-referenced data from three online sources—Wikipedia, the GDELT Project and the Internet Archive—to quantify population-scale information seeking about the 2016 Zika virus epidemic in the U.S., explicitly linking such behavioural signal to epidemiological data. Geo-localized Wikipedia pageview data reveal that visiting patterns of Zika-related pages in Wikipedia were highly synchronized across the United States and largely explained by exposure to national television broadcast. Contrary to the assumption of some theoretical epidemic models, news volume and Wikipedia visiting patterns were not significantly correlated with the magnitude or the extent of the epidemic. Attention to Zika, in terms of Zika-related Wikipedia pageviews, was high at the beginning of the outbreak, when public health agencies raised an international alert and triggered media coverage, but subsequently exhibited an activity profile that suggests nonlinear dependencies and memory effects in the relation between information seeking, media pressure, and disease dynamics. This calls for a new and more general modelling framework to describe the interaction between media exposure, public awareness and disease dynamics during epidemic outbreaks.


Introduction
The advent of the digital era has radically changed the way individuals search for information, and this is particularly relevant for health-related information 1 . A 2013 study 2 found that 59% of U.S. adults had looked for health information on the Web in the previous year and that about one in three U.S. adults use the Internet to figure out what medical condition they have. The consumption of news sources, either traditional such as television, radio and newspapers, or digital such as Web news or online social networks, has become crucial in how health information is delivered, and it can play a fundamental role in shaping opinions, awareness and behaviours. In the past ten years, several studies have addressed the impact of awareness and information spread during epidemic outbreaks, and it has been reported that the degree of public attention and concern induced by an epidemic threat might affect the disease transmission dynamics [3][4][5][6][7] . However, modeling efforts have been mostly theoretical, and a large-scale empirical characterization of information seeking behaviour and its interplay with the disease dynamics during an epidemic outbreak has so far been elusive due to the lack of available data 8 .
Here we study a large-scale dataset on spatio-temporally resolved accesses to Wikipedia pages related to the Zika virus (ZIKV). ZIKV is an RNA virus from the Flaviviridae family which is mainly transmitted by infected Aedes mosquitoes, although there have been cases of sexual and perinatal transmission. Infection is mostly asymptomatic or associated with mild symptoms 9 but it can lead to serious and sometimes fatal neurological defects in neonates born to ZIKV-infected women. In particular, following the association between ZIKV and a cluster of microcephaly cases in Brazil 10 , the World Health Organization (WHO) declared the ZIKV epidemic a Public Health Emergency of International Concern (PHEIC) on February 1st, 2016 11 . The emergency lasted until November 18th, 2016, when the WHO declared the PHEIC to be over 12 . As of March 2017, ZIKV has spread worldwide to 79 countries where there has been evidence of ongoing vector-borne virus transmission. The most affected region has been the American continent, with 47 countries or territories reporting local ZIKV transmission, due to the extensive presence of Aedes mosquitoes in almost all the region's countries 13 . In this epidemiological context, the ZIKV epidemic has posed peculiar communication challenges to the public due to its association with microcephaly in newborns, its transmission modalities, and its prevalence in areas where the virus had never been detected before and that were suddenly characterized by intense international travel due to the 2016 Summer Olympics [13][14][15] . Public polls conducted in the United States revealed a lack of knowledge about ZIKV in the general population and more specifically in groups at risk, such as pregnant women 16 . The novelty of the disease and the lack of previous knowledge of it in the affected areas make the 2016 ZIKV epidemic an ideal case study to characterize collective attention patterns, identify their drivers and test traditional modeling assumptions.
Intuitively, mass media coverage represents the main driver of public attention during an epidemic. Indeed, several peculiarities of media narratives around public health hazards 17 and infectious diseases 18 have been elucidated, but a general and quantitative comprehension of how public opinion responds to media exposure during an emerging epidemic threat is still lacking. The majority of modeling studies assume that media exposure drives behavioural changes, hence media exposure effects are incorporated into some kind of media function that modulates individual behaviors and may affect disease dynamics [19][20][21] . The general assumption is that as the number of cases increases and is reported by mass media, the susceptibility of individuals will decrease due to increasing awareness and the associated behavioral changes 20 . However, for most disease outbreaks, such an assumption has never been supported by direct empirical evidence. More generally, the complex interplay between media coverage, public attention and disease dynamics during an epidemic remains an open research challenge.
To address these questions, our study analyses time-resolved and geo-localized Wikipedia pageview counts to investigate the dynamics of public attention during the 2016 ZIKV epidemic in the United States. We considered the daily page view counts on 104 different Zika-related Wikipedia articles in the U.S. to be an unambiguous indicator of attention to the epidemic and we investigated the temporal and spatial patterns of page views in relation to the timeline of ZIKV incidence reported by the U.S. Centers for Disease Control and Prevention (CDC), and in relation to the coverage of the ZIKV epidemic by local and national media sources. In particular, we focused on news coverage of the ZIKV epidemic by online media and television in 2016, available in digital format through the GDELT project and the Internet Archive (see Methods for a full description of the data under study).

Results
Public attention and media coverage of the 2016 ZIKV epidemic showed a distinct and synchronous temporal pattern, as seen in Figure 1. The available spatial granularity of Wikipedia page view data allowed us to further inspect how the above picture changes when moving from a national perspective to States and U.S. cities. Notably, the temporal dynamics of attention to the Zika-related Wikipedia pages in 2016 was highly synchronized across all the 50 States. Although the relative risk of case importation and local transmission varied significantly from state to state, with the Southern States being more at risk due to the vector's presence and abundance 24 , the Wikipedia pageview timelines were all highly correlated, as shown in Figure 2. The Pearson correlation coefficients in the cross-correlation matrix of Wikipedia pageview time series by State range from r = 0.77 for Delaware and Montana, to r = 0.99 for New York and New Jersey. Overall, the correlation of the Wikipedia pageviews in each state with the national timeline was always higher than r = 0.88, indicating a high degree of spatial uniformity across the country. Given the above-mentioned correlation of Wikipedia pageviews with the TV coverage of the epidemic and the mentions of Zika on the Web, the attention patterns at State level were also highly correlated with the national media coverage, suggesting a fundamental role of news exposure as a driver of public attention at all geographic scales.
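As an illustration of the state-by-state comparison described above, the following minimal sketch computes a Pearson cross-correlation matrix between state pageview time series and the correlation of each state with the national aggregate. The state labels and numbers are toy values chosen for illustration, not data from the study, and the national timeline is assumed here to be the plain sum of the state series.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlation_matrix(series_by_state):
    """Return {(state_a, state_b): r} for every ordered pair of states."""
    states = sorted(series_by_state)
    return {
        (s, t): pearson(series_by_state[s], series_by_state[t])
        for s in states
        for t in states
    }


# Toy weekly pageview counts (hypothetical values, not study data).
toy = {
    "NY": [10, 50, 30, 20, 15],
    "NJ": [12, 55, 33, 21, 16],
    "MT": [1, 3, 4, 2, 1],
}

# Assumption: the national timeline is the sum of the state timelines.
national = [sum(week) for week in zip(*toy.values())]

matrix = correlation_matrix(toy)
state_vs_national = {s: pearson(v, national) for s, v in toy.items()}
```

With real data, the minimum and maximum off-diagonal entries of `matrix` would correspond to the least and most synchronized pairs of states.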
One could argue that local patterns of attention may be influenced by local news and local epidemic events, such as case importations or a local increase of disease prevalence. We tested these hypotheses by comparing Wikipedia page view counts in each state to Web news mentioning the word "Zika" and the name of the state, and to the local ZIKV incidence profiles. Attention profiles in each state were generally positively correlated to Web news mentioning the name of the state, however the degree of correlation ranged from r = 0.004 in Wisconsin to r = 0.74 in Texas, showing significant spatial differences across the country. Interestingly, ZIKV incidence in each state could explain such geographic variations as a negative driver of attention. On the one hand, local patterns of attention in each state were generally not correlated with disease incidence, with the exception of Montana (r = 0.32). On the other hand, Web news covering Zika in each state were positively correlated (r > 0.20) with the local incidence profiles only in 20 states out of 50 and, at the same time, these states showed the smallest degree of correlation between news and attention. A direct comparison of the 50 states ranked by degree of correlation between news and ZIKV incidence, and between news and attention, showed a negative rank correlation: weighted Kendall's τ = −0.25. Overall, in those states where local news were following more closely the local epidemic patterns, the dynamics of public attention was not driven much by news. Instead, local attention patterns followed more closely the state news where the latter was more similar to the national one and less correlated with the local ZIKV epidemiology.
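The rank comparison described above can be sketched as follows. The analysis in the text uses a weighted Kendall's τ; for brevity this example implements the plain, unweighted Kendall τ (tau-a, assuming no ties), which is a simplification and not the statistic reported above. The per-state correlation values are hypothetical.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Unweighted Kendall tau-a between two paired sequences (assumes no ties)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical per-state correlations (one entry per state, same order).
news_vs_incidence = [0.05, 0.30, 0.10, 0.45]
news_vs_attention = [0.70, 0.20, 0.60, 0.35]

tau = kendall_tau(news_vs_incidence, news_vs_attention)
```

A negative `tau` indicates that states ranking high on one measure tend to rank low on the other, which is the qualitative pattern described in the text.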
It is natural to ask whether correlations between patterns of attention and disease risk may change by looking at different spatial resolutions. To answer this question, we examined the attention to ZIKV in 788 cities of the United States with a population larger than 40,000 and compared it to their total Wikipedia viewership. By ranking the U.S. cities based on their total volume of Wikipedia pageviews in 2016, and comparing such ranking with the one based on pageviews of Zika-related articles only, we identified locations where the attention to ZIKV was higher than expected. As shown in Figure 3A, cities on the East Coast of Florida showed the highest relative attention to ZIKV, when compared to their overall Wikipedia activity. Other relevant outliers with high attention were cities in Texas and in the Northeast. On the contrary, the lowest attention to ZIKV was observed in cities in California and in the Midwest (Figure 3B). These results suggest that increases in public attention at city level may be explained by risk perception due to the presence of the vector (as in Florida and Texas). However, the high level of attention in other places, such as Union City, NJ, cannot be easily explained by epidemiological risk factors and it may be due to specific events, such as one or more case importations, that do not appear in our dataset.
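One simple way to compare the two rankings is to take, for each city, the difference between its rank by total pageviews and its rank by Zika-related pageviews. The sketch below assumes this rank-difference measure, which may differ from the exact procedure used in the study; city labels and counts are hypothetical.

```python
def rank_by_value(values):
    """Map each key to its rank, where rank 1 is the largest value."""
    ordered = sorted(values, key=values.get, reverse=True)
    return {key: position + 1 for position, key in enumerate(ordered)}


def rank_shift(total_views, zika_views):
    """Rank by total views minus rank by Zika views, per city.

    A positive shift means the city ranks higher on Zika-related pages
    than on overall Wikipedia activity (higher attention than expected).
    """
    total_rank = rank_by_value(total_views)
    zika_rank = rank_by_value(zika_views)
    return {city: total_rank[city] - zika_rank[city] for city in total_views}


# Hypothetical yearly pageview counts for four placeholder cities.
total = {"city_a": 1000, "city_b": 800, "city_c": 500, "city_d": 200}
zika = {"city_a": 10, "city_b": 4, "city_c": 12, "city_d": 3}

shift = rank_shift(total, zika)
```

In this toy example `city_c` moves from third place overall to first place on Zika-related pages, so it would be flagged as a location with higher attention than expected.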
To gain insight into the relation between media coverage and collective attention, we begin by building an equal-time regression model that predicts the weekly number of Zika-related Wikipedia pageviews for each state, rescaled by state population, based exclusively on the frequencies of Zika-related mentions in Web news and TV closed captions. That is, we assume that information seeking behavior in Wikipedia is driven, at any given point in time, by same-week exposure to media sources. Since our goal is uncovering drivers of collective attention, rather than achieving optimal prediction of the empirical time series, we choose an equal-time modeling approach over standard time series modeling techniques (e.g., autoregressive models). More specifically, we start with a linear regression model that predicts population-rescaled pageview counts for a given week and a given state using only national Web news and TV data for the same week. We focus on 43 states with population in excess of 1 million, comprising more than 98% of the U.S. population according to 2016 United States Census Bureau estimates 25 . We train the model via state-wise cross-validation and evaluate its performance using the coefficient of determination $R^2$ and Pearson's correlation coefficient r. Despite its simplicity, this equal-time linear regression demonstrates that both media signals, taken independently, are already quite informative of the Zika population-rescaled pageview time series: using exclusively TV closed captions we obtain $R^2$ = 0.61 and r = 0.80, while using only Web news we obtain $R^2$ = 0.52 and r = 0.78. Combining both features, the linear model achieves $R^2$ = 0.63 and r = 0.82. As model performance is evaluated via state-wise cross-validation, these results highlight that national-level media signals are highly informative of state-level pageview time series, once they are rescaled to take into account population size.
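The equal-time regression with state-wise cross-validation can be illustrated with a minimal sketch. For brevity it uses a single national media feature with an intercept, whereas the models in the text combine several features; the fitting details (ordinary least squares, pooled held-out predictions) are assumptions, and all numbers are toy values.

```python
import math


def fit_ols(x, y):
    """Least-squares slope and intercept for y ~ a * x + b."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    a = sxy / sxx
    return a, my - a * mx


def r_squared(y_true, y_pred):
    """Coefficient of determination of predictions against observations."""
    my = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - my) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def leave_one_state_out(media, views_by_state):
    """Fit on all states but one, predict the held-out state, pool the results."""
    y_true, y_pred = [], []
    for held_out in views_by_state:
        train_x, train_y = [], []
        for state, series in views_by_state.items():
            if state != held_out:
                train_x.extend(media)  # same national signal for every state
                train_y.extend(series)
        a, b = fit_ols(train_x, train_y)
        y_true.extend(views_by_state[held_out])
        y_pred.extend(a * m + b for m in media)
    return r_squared(y_true, y_pred), pearson(y_true, y_pred)


# Toy data: one national media series and three rescaled state series.
media = [1, 4, 2, 1, 0]
views = {
    "s1": [2.1, 8.0, 4.2, 1.9, 0.1],
    "s2": [1.8, 8.3, 3.9, 2.2, 0.0],
    "s3": [2.0, 7.7, 4.1, 2.0, 0.2],
}
r2, r = leave_one_state_out(media, views)
```

Because each state is predicted by a model that never saw its own pageviews, a high pooled $R^2$ in this setup reflects how much of the state-level signal is carried by the national media feature alone.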
To take into account the possibility of memory effects in the response to media exposure, we enrich the feature space of the regression model with additional features (time series) obtained by filtering the Web news and TV time series with an exponential memory kernel (see Methods for a complete description). The characteristic time τ of the memory kernel, describing news persistence in the attention response, is a new hyper-parameter of the model to be set via cross-validation. Table 1 (Table 1, bottom row), although it yields the best performance according to the Akaike Information Criterion (AIC). Overall, by computing the AIC for each model and averaging over all states, three linear models based on TV, Web news, and state news can be considered equally likely, assuming evidence for $\Delta_i = \mathrm{AIC}_i - \mathrm{AIC}_{\min} < 4$.
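The AIC-based comparison can be sketched as follows. The example assumes Gaussian least-squares models, for which the AIC can be written, up to an additive constant, as n ln(RSS/n) + 2k, with n observations, residual sum of squares RSS and k fitted parameters. The model labels and AIC values are placeholders, not results from Table 1.

```python
import math


def aic_least_squares(rss, n, k):
    """AIC of a Gaussian least-squares fit, up to an additive constant."""
    return n * math.log(rss / n) + 2 * k


def aic_deltas(aic_by_model):
    """Difference between each model's AIC and the smallest AIC."""
    best = min(aic_by_model.values())
    return {model: value - best for model, value in aic_by_model.items()}


# Placeholder AIC values (hypothetical, for illustration only).
aics = {"model_a": 100.0, "model_b": 102.5, "model_c": 103.0, "model_d": 110.0}

deltas = aic_deltas(aics)
# Models within the threshold used in the text (delta < 4).
supported = sorted(m for m, d in deltas.items() if d < 4)
```

Models in `supported` would be treated as having comparable support, while models with larger differences are considered clearly less likely.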

Discussion
Our study demonstrates that the temporal dynamics of Wikipedia pageviews in the United States during the 2016 ZIKV epidemic was highly predictable, even at state level, based on the volume of national and international news sources mentioning Zika and the United States. Collective attention to the ZIKV outbreak thus seems to have been mainly driven by news exposure and much less by the disease transmission dynamics, although the epidemic profile of ZIKV infections varied significantly from state to state and the risk of local transmission was not uniform across the country. This picture describes a scenario where awareness of the epidemic is globally present in the country, while local effects, such as those due to the local spreading of awareness, play a less important role 26 .
Media outlets in the U.S. have a prominent role in defining the online public discourse 27 . The impact of media exposure on collective awareness and risk perception during epidemic outbreaks has been investigated in previous works 17,18,28 ; however, only a few studies have attempted to quantitatively measure the effect of media engagement on epidemic awareness using empirical data from Web sources on a large scale 21,29,30 . While previous studies have focused on newspaper coverage of epidemics 31 , we investigated the relationship between the exposure to TV coverage and online news, and the attention to Wikipedia pages. Our study confirms the high sensitivity of Wikipedia searches to breaking news and official announcements, in particular in the case of disaster events, as found by previous studies 32,33 . On the other hand, the temporal dynamics of Wikipedia page views during the 2016 ZIKV epidemic showed a nonlinear dependence on media coverage: Wikipedia page activity was high in the initial phases of the outbreak, but it declined more quickly than media coverage. This can be explained by the fact that information on Wikipedia is rather static: users view Wikipedia pages immediately after the news breaks, but they do not return in the following days unless more recent events renew their attention 32 .
From an epidemiological standpoint, our results are consistent with the recent findings of Bragazzi et al. 34 , who analyzed various data streams to measure the global reaction to the 2015-2016 ZIKV outbreaks in different countries. Similarly, we did not find any statistically significant correlation between the attention to Wikipedia pages and the ZIKV incidence data in the U.S. The correlation between ZIKV incidence and media coverage was also mild, and varied from state to state, suggesting that media coverage was only relatively influenced by the actual progression of the epidemic over time. One might argue that our results may not generalize to all epidemic outbreaks. Indeed, the peculiar characteristics of the ZIKV infection, such as its association to mild symptoms and the relatively small size of the population at risk, due to the spatial distribution of the vector, may have influenced the attention dynamics during the outbreak. Epidemic outbreaks caused by different pathogens, possibly characterized by a higher transmissibility, and different symptomatology, such as the Ebola virus or pandemic influenza, may lead to different attention patterns. However, it is reasonable to believe that media coverage would be, in any case, the main driver of collective attention, as it also has been during the 2014 West African Ebola virus epidemic 29,35 .
The increasing availability of novel data streams, such as social media, Web search queries and participatory surveillance data, provides an invaluable resource to measure and quantify the complex interplay between the spread of information, collective attention and the epidemiology of infectious diseases 36,37 . Recently, Wikipedia pageview data have been increasingly used by researchers in epidemiology and infectious disease modeling 38,39 . The overall value of Wikipedia data to measure and forecast the dynamics of infectious diseases has been debated 40 and, in general, Wikipedia-based forecasting models have proved successful in the case of endemic or seasonal diseases, such as influenza, dengue or tuberculosis 39 . On the other hand, our study demonstrates that Wikipedia page viewership can provide a temporally resolved measure of collective attention during epidemic outbreaks caused by novel emerging diseases, at a high spatial granularity. Previous works have investigated the effects of external events on the activity of Wikipedia editors and on the number of pageviews 41,42 . More generally, the characterization of the usage of Wikipedia as a source of information and as a proxy for measuring the global attention to real-world events has been studied 22,33,43,44 . The results of our study add further evidence of the value of Wikipedia data in the field of digital epidemiology, especially for capturing information seeking behavior and attention patterns during disease outbreaks 45 .
We showed that Wikipedia data can capture collective attention during outbreaks; however, we did not link such signal with a measure of behavioral response in the population. Detecting behavioral changes from Web sources remains a challenging task. Previous studies have used TV viewing data to infer the behavioral response during the 2009 A/H1N1 pandemic in Mexico 46 . More recently, Poletto et al. 30 showed that an increased collective attention was correlated with changes in the hospital management of MERS-CoV patients, reducing the time from admission to isolation. Further research is needed to infer causal patterns between collective attention and behavioral responses, and to identify the most suitable approach to integrate them into disease-behavior models.

Data sources
Wikipedia page view counts. We collected hourly pageview data of the English Wikipedia pages "Zika virus" (https://en.wikipedia.org/wiki/Zika_virus) and "Zika fever" (https:

Model
We model the weekly number of pageview counts to Zika-related Wikipedia pages in each state with a linear regression of the form

$$\widetilde{PV}_s(w) = \sum_i \alpha_i X_i(w),$$

where $\widetilde{PV}_s(w)$ is the Wikipedia page view count in state $s$ on week $w$, rescaled by the state population. The rescaling of pageview data takes the form

$$\widetilde{PV}_s(w) = \frac{PV_s(w)}{N_s^{\beta}},$$

where $N_s$ is the state population and $\beta = 1.1397$ is a scaling exponent independently estimated on the total volume of pageviews in each state by adopting the probabilistic framework of Leitão et al. 47 . By K-fold (k = 10) and leave-one-out cross-validation, we test the performance of the model considering different linear combinations of features $X_i$. Specifically, we considered as model features the weekly media timelines $Y(w)$, where $Y$ = TV, Web or Web$_{\mathrm{state}}$, and Web$_{\mathrm{state}}$ represents the selection of Web news mentioning only a specific state name together with the word "Zika". To take into account the saturation effect due to media exposure, we also considered an exponentially decaying function $m(Y)$ of the media timelines $Y$ ($Y$ = TV, Web), in which the values of $Y$ at lag $\Delta t$ before week $w$ are weighted by $e^{-\Delta t/\tau}$ and summed over lags up to $\Delta t_{\max}$. Here $\tau$ is a free parameter, setting the memory time scale, and $\Delta t_{\max}$ is defined by the total length of the time series up to week $w$ ($\Delta t_{\max} = w$). Thus, the full model with all the 5 features under consideration takes the following form:

$$\widetilde{PV}_s(w) = a \cdot \mathrm{TV}(w) + b \cdot \mathrm{Web}(w) + c \cdot \mathrm{Web}_{\mathrm{state}}(w) + d \cdot m(\mathrm{Web}) + e \cdot m(\mathrm{TV}).$$
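The feature construction described in this section can be sketched in a few lines. The sketch makes several assumptions that are not confirmed by the text: the rescaling divides raw counts by the population raised to the exponent β, weeks are indexed from 0, the memory sum runs over lags 0 to w, and the kernel is not normalized. Function names and toy values are hypothetical.

```python
import math

BETA = 1.1397  # scaling exponent reported in the text


def rescale(pageviews, population, beta=BETA):
    """Assumed rescaling: divide weekly counts by population ** beta."""
    return [pv / population ** beta for pv in pageviews]


def memory_feature(series, tau):
    """Exponentially weighted sum of current and past values of a media series.

    Assumption: for week w (0-indexed) the sum runs over lags dt = 0..w,
    each value Y(w - dt) being weighted by exp(-dt / tau).
    """
    feature = []
    for w in range(len(series)):
        total = sum(math.exp(-dt / tau) * series[w - dt] for dt in range(w + 1))
        feature.append(total)
    return feature


def predict(coefficients, features, w):
    """Linear combination of feature values at week w."""
    return sum(coefficients[name] * features[name][w] for name in coefficients)


# Toy media timelines (hypothetical values).
tv = [1.0, 0.0, 0.0]
web = [2.0, 1.0, 0.0]
features = {
    "TV": tv,
    "Web": web,
    "m(TV)": memory_feature(tv, tau=1.0),
    "m(Web)": memory_feature(web, tau=1.0),
}
coefficients = {"TV": 1.0, "Web": 0.5, "m(TV)": 0.2, "m(Web)": 0.1}
estimate_week_0 = predict(coefficients, features, w=0)
```

Under these assumptions a single burst of coverage, as in the toy `tv` series, leaves a trace in `m(TV)` that decays by a factor exp(-1/τ) each week, which is how the kernel encodes the persistence of news in the attention response.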