3.1 Infodemiology: Covering ‘Supply’ and ‘Demand’
Looking at the field of infodemiology today, one can differentiate between the
following approaches to epidemiological surveillance, based on:
• ‘professional’, public information online (e.g. HealthMap, Global Public Health Intelligence Network)
• explicit, conscious information provided by users and affected individuals (e.g.
Flu Near You, Grippeweb)
• implicit information provided (mainly) unconsciously by users (e.g. Google Flu
Trends; Yahoo research/Polgreen et al. 2008)
Already in 1997, the WHO and Health Canada’s Centre for Emergency Preparedness and Response (CEPR) were working on a prototype for the (subscription-based) Global Public Health Intelligence Network (GPHIN; Health Canada 2003).
In 2005, Mawudeku and Blench described GPHIN as a “unique multilingual system”
which “gathers and disseminates relevant information on disease outbreaks and
other public health events by monitoring global media sources such as news wires
and web sites” (p.9).
Such ‘supply’-oriented approaches can still be found in more recent studies. For
example, Breton et al. (2012) showed that semantic analyses of sources such as
the Agence France-Presse (AFP) may be used for predicting epidemic intensities.
Not only professional media communication may be useful; the analysis of social media communication also shows potential. Chunara, Andrews and Brownstein (2012) employed data from the microblogging platform Twitter in
order to detect the outbreak of cholera in Haiti and to monitor the intensity of the
epidemic. With regards to this data source, it remains unclear to what extent one
is dealing with information which has been provided ‘consciously’ by users, in
the sense that they were aware of a further use of their tweets. In addition, this
approach made use of data derived from the HealthMap project14 which is based on
an automatic analysis of semantic content from blogs, news-websites, RSS feeds as
well as official surveillance data.
Services such as Flu Near You15 or the German Grippeweb16 (transl. ‘Flu Web’;
Robert Koch Institute) pursue crowdsourcing strategies. They rely on the conscious
participation of volunteers providing information on their own health status or
14 See http://healthmap.org/en. The service has been developed by researchers affiliated with the Boston Children’s Hospital and the Harvard Medical School (Brownstein et al. 2008). It combines different data sources, such as Twitter feeds and platforms such as Google News, with official public health reports. Although it was launched as early as 2006, it has received most media attention
since the Ebola outbreak in 2014. On March 14, the site first picked up on news reports about
a hemorrhagic fever, while the WHO only officially reported on the Ebola outbreak more than a
week later (March 23, 2014). In this instance, it was of course not classified as Ebola yet, but the
information could have acted as early indicator.
15 See https://flunearyou.org/. The service is closely related to HealthMap and has been developed by epidemiologists from Harvard University and the Boston Children’s Hospital as well as the Skoll Global Threats Fund.
Using Transactional Big Data for Epidemiological Surveillance: Google Flu. . .
(potential) influenza symptoms. These approaches depend on a deliberate effort by volunteers and on the accuracy and honesty of the information they provide. Due to the limited scope of this chapter, I will not be able to discuss the aforementioned approaches in more detail. Particularly the involvement of volunteers appears to be an interesting development, as it emphasises users’ deliberate, conscious involvement rather than the automatic ‘mining’ of their data for health-relevant information. However, these approaches raise issues regarding users’ capability for self-diagnosis and their sincerity.
3.2 Analysing Health Information Demand
This chapter focuses on approaches in epidemiological surveillance drawing on transactional big data: the documentation and analysis of user behaviour (i.e. search queries) with regards to health-relevant information. Big data, produced by the search terms entered by vast numbers of users worldwide, form the basis for these attempts.
basis for these attempts. In particular, studies by Eysenbach (2006), Polgreen et
al. (2008) and Ginsberg et al. (2008) have explored this aspect of infodemiology.
They all start from the assumption that certain search queries may be motivated by
influenza or influenza-like-illness (ILI), either experienced by the individual her/himself or in her/his social environment. Assuming that a certain search query
correlates (steadily) with actual influenza intensities, it may be used as an indicator of disease dynamics.
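The core assumption shared by these studies can be illustrated with a simple correlation check between a query's weekly volume and officially reported ILI rates. All series and the query term below are invented for illustration; this is a minimal sketch, not the method of any of the cited studies.

```python
# Illustrative sketch: does weekly volume of a flu-related search query
# track officially reported ILI rates? All data below are made up.

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical weekly series for one flu season.
query_counts = [120, 150, 310, 560, 740, 690, 430, 250]   # searches for "flu symptoms"
ili_rates    = [0.8, 1.1, 2.3, 4.0, 5.2, 4.9, 3.1, 1.7]   # % of doctor visits for ILI

r = pearson(query_counts, ili_rates)
print(f"correlation r = {r:.3f}")  # a high, stable r would support using the query as an indicator
```

A single season with a high r is of course not sufficient in practice; the studies discussed here tested the stability of such correlations across multiple seasons and regions.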
Initially, Eysenbach explored this research field with his Google ad sentinel
method. He was able to demonstrate “an excellent correlation between the number
of clicks on a keyword-triggered link in Google with epidemiological data from the
flu season 2004/2005 in Canada” (Eysenbach 2006, p.244). Eysenbach described
his approach as a “trick” (ibid, p.245), since the actual Google search queries
were not available to him. Hence, he had to create a Google AdWords advertising campaign in order to obtain the necessary data. His method was not able to obtain
actual search query quantifications, but only allowed him to factor in those users
who subsequently clicked on a presented link. When (Canadian) Google users
entered “flu” or “flu symptoms”, they were presented with an ad (“Do you have the flu?”) created by Eysenbach. The link led to a health information website regarding influenza. As an (alleged) advertising customer, Eysenbach was provided by Google Inc. with quantitative information and geographic data. When relating these to data from
the governmental FluWatch Reports (Public Health Agency of Canada), he detected a positive correlation between the increase of certain search queries and influenza activity.
In 2008, Polgreen et al. presented a similar study design. The researchers, one of
them from Yahoo Research, were provided with data from Yahoo Inc. web search
logs. Based on queries related to influenza (March 2004-May 2008) and internet
protocol addresses which allowed for their geographic localisation, the researchers
created a database which was then related to the data of traditional surveillance
systems.17 Just like Eysenbach, they asserted a correlation between certain search
terms and actual influenza-intensities. Apart from emphasising the cost-efficient
advantages of their approach, the researchers highlighted that predictions could be
calculated in a very timely manner: “With use of the frequency of searches, our
models predicted an increase in cultures positive for influenza 1–3 weeks in advance
when they occurred” (Polgreen et al. 2008, p.1443).
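The idea behind such a lead time can be sketched by shifting the query series against a series of positive cultures and picking the best-correlating lag. This is a hypothetical illustration with invented weekly data, not Polgreen et al.'s actual statistical model.

```python
# Sketch of the "1-3 weeks in advance" idea: find the lag (in weeks) at which
# search volume best correlates with later positive cultures. Invented data.

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical weekly data: searches rise two weeks before positive cultures do.
searches = [100, 140, 300, 520, 700, 650, 400, 240, 150, 110]
cultures = [  5,   6,   8,  15,  33,  58,  75,  70,  45,  25]

# Compare searches in week t with cultures in week t + lag, for lags 0..3.
best_lag, best_r = max(
    ((lag, pearson(searches[:len(searches) - lag or None], cultures[lag:]))
     for lag in range(0, 4)),
    key=lambda t: t[1],
)
print(f"searches lead cultures by {best_lag} week(s), r = {best_r:.2f}")
```

In this toy example the two-week shift aligns the peaks, so the best lag recovers the lead time built into the data.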
When Ginsberg et al. published their results in November 2008,18 they hence did not present a completely new approach; former studies had already emphasised the methodological potential of web search queries. However, their publication was accompanied by the launch of a public Google Inc. service19 in 2008. The authors summarise their investigation:
summarise their investigation: “Because the relative frequency of certain queries is
highly correlated with the percentage of physician visits in which a patient presents with influenza-like symptoms, we can accurately estimate the current level of weekly
influenza activity in each region of the United States, with a reporting lag of about
one day” (Ginsberg et al. 2008, p.1012).
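The model behind this claim relates the log-odds of the ILI visit percentage linearly to the log-odds of the flu-related query fraction. The sketch below reproduces that idea with invented weekly data; the variable names, numbers and the fitted coefficients are illustrative only, not those of the published model.

```python
import math

# Illustrative sketch of a logit-logit regression of ILI visit share on
# query share, in the spirit of Ginsberg et al.'s approach. Data invented.

def logit(p):
    return math.log(p / (1 - p))

# Hypothetical weekly fractions (proportions between 0 and 1).
query_frac = [0.010, 0.014, 0.026, 0.041, 0.050, 0.042, 0.028, 0.015]
ili_frac   = [0.008, 0.011, 0.023, 0.040, 0.052, 0.049, 0.031, 0.017]

xs = [logit(q) for q in query_frac]
ys = [logit(i) for i in ili_frac]

# Ordinary least squares for y = beta0 + beta1 * x.
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
beta1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
beta0 = my - beta1 * mx

# Estimate current ILI activity from this week's (hypothetical) query fraction.
q_now = 0.035
ili_est = 1 / (1 + math.exp(-(beta0 + beta1 * logit(q_now))))
print(f"estimated ILI visit share: {ili_est:.1%}")
```

Because the model only needs the current week's query data once fitted, estimates can be produced with the roughly one-day reporting lag the authors describe.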
4 Case Study: Google Flu Trends
The aforementioned studies are all enabled by the fact that digital user activities
such as the use of search engines are not ephemeral, but are turned into transactional
big data. Ginsberg et al. used data provided by Google Inc.’s market leading search
engine: they were hence derived from databases documenting users’ search queries.
The resulting GFT interface (as of July 2015) illustrated historical and estimated influenza intensities in geographic maps as well as line graphs (see Fig. 1). The service is not
merely a result of analysing web search queries: during its development phase (and
for its adjustment), the researchers had to draw on biomedical data provided by
traditional epidemiological surveillance networks. In addition to the search queries
data, they used two main data sources. They employed data publicly provided by
the CDC for nine U.S. surveillance regions as well as state-reported ILI percentages
for Utah. The CDC publish information regarding the number of patients who
17 Mainly two data sources were relevant for this project: “Each week during the influenza season,
clinical laboratories throughout the United States that are members of the World Health Organization Collaborating Laboratories or the National Respiratory and Enteric Virus Surveillance
System report the total number of respiratory specimens tested and the number that were positive
for influenza. The second type of data summarize weekly mortality attributable to pneumonia and
influenza. These data are collected from the 122 Cities Mortality Reporting System” (Polgreen et
al. 2008, p.1444).
18 The paper was originally published online on November 19, 2008, but was corrected on February
19, 2009 (see http://www.nature.com/nature/journal/v457/n7232/full/nature07634.html#cor1).
19 Strictly speaking, GFT is part of Google.org, a Google Inc. initiative (see also Strom and Helft
have been diagnosed with influenza or ‘influenza-like-illness’ (ILI) online (see http://www.cdc.gov/flu/weekly). During the flu/influenza season, these data are updated weekly.
Ginsberg et al. retrieved these data and tested them for correlations with selected
search queries. They developed a database of potentially relevant queries which
were subsequently related to the data provided by the CDC: “For the purpose of our
database, a search query is a complete, exact sequence of terms issued by a Google
search user [...]. Our database of queries contains 50 million of the most common search queries [...]” (Ginsberg et al. 2008, p.1014). Originally, this database
consisted of “hundreds of billions of individual searches from 5 years [2003–2008]
of Google web search logs” (ibid, p.1012). The top 45 queries showing a correlation with increasing influenza/ILI intensities were then chosen as the initial basis for the
construction of GFT. These search queries were allegedly related to topics such as
influenza complications and symptoms or certain antibiotic medication. However,
as Lazer et al. pointed out, the exact terms have never been disclosed and moreover
“the examples that have been released appear misleading” (2014, p.1204). This is
already indicative of GFT’s tendency to ‘black-box’ information which, on the one hand, is crucial in order to understand its functioning, but, on the other hand, may facilitate miscalculations. In the following two sub-sections, I will analyse GFT with regards to such issues by drawing on the pragmatist perspective which I have outlined above.
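The selection step described above, correlating candidate queries with CDC ILI percentages and keeping the best performers, can be sketched as follows. Since the actual 45 queries were never disclosed, every query name and number here is invented; the unrelated seasonal candidate merely illustrates how a query can correlate for the wrong reasons.

```python
# Hypothetical sketch: score candidate queries by correlation of their weekly
# search share with CDC ILI percentages, keep the top N. All data invented.

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

cdc_ili = [1.0, 1.4, 2.6, 4.1, 5.0, 4.2, 2.8, 1.5]  # % of visits, one season

candidate_queries = {
    "flu symptoms":        [0.9, 1.5, 2.5, 4.3, 5.1, 4.0, 2.9, 1.4],
    "fever medicine":      [1.2, 1.3, 2.2, 3.8, 4.6, 4.4, 3.0, 1.9],
    "basketball playoffs": [3.0, 2.5, 2.0, 1.5, 1.2, 1.8, 2.6, 3.1],  # seasonal but unrelated
}

N = 2
ranked = sorted(candidate_queries,
                key=lambda q: pearson(candidate_queries[q], cdc_ili),
                reverse=True)
top_n = ranked[:N]
print(top_n)  # expect the two flu-related queries to rank on top
```

With only one season of data, a purely seasonal but unrelated query could also score highly, which is one reason the undisclosed selection criteria matter for judging GFT's robustness.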
4.1 Normative Assumptions, Justifications and Values
First, I will highlight the normative assumptions – the arguments, justifications and values – which are articulated or neglected in the (corporate) presentation of Google Flu Trends as well as in scientific and public debates about it. When GFT was initially presented in
a paper titled “Detecting influenza epidemics using search engine query data”
(Ginsberg et al. 2008), the authors emphasised its advantages and promises, but likewise pointed to conditions and risks of GFT. The paper starts with an affirmation of the obvious threats posed by influenza epidemics: the illnesses and deaths caused by
seasonal influenza epidemics as well as the incalculable health threat of new strains
of influenza virus. With regards to these risks, the authors claim to have developed
a model implemented in the service GFT which estimates influenza activity with a
reporting lag of one day (and is hence considerably quicker than traditional influenza
surveillance networks which provide data with a reporting lag of 1–2 weeks). The
authors summarise this in a later section of the paper:
Harnessing the collective intelligence of millions of users, Google web search logs can
provide one of the most timely, broad-reaching influenza monitoring systems available
today. Whereas traditional systems require 1-2 weeks to gather and process surveillance
data, our estimates are current each day (Ginsberg et al. 2008, p.1014).
As I explained before, this model relies on monitoring the health-seeking
behaviour of users represented by their queries in the online search engine Google.
One main condition for its functionality is hence a sufficiently large population of
search engine users. Likewise, it is based on cooperation with the CDC, using the
influenza data which are publicly accessible online. One needs to keep in mind that
these data are merely provided for the influenza seasons. Therefore, the model used
for GFT could only involve data describing ILI activities for these time frames.
As I will explain below, it has been pointed out that this approach was most
likely also responsible for early miscalculations in GFT. The cooperation during the
development is described as a continuous process of ‘sharing’ with the Epidemiology
and Prevention Branch of the Influenza Division at the CDC to assess its timeliness
and accuracy (see ibid, p.1013). The CDC hence served as a source of validation in order to ensure the accuracy of the data. Despite the abovementioned claims regarding improved efficiency and timeliness, the authors describe GFT not as a solitary service, but as an initial indicator for further responses to potential epidemics.
The system is not suggested as “replacement for traditional surveillance” (ibid.);
instead, these influenza estimations are meant to “enable public health officials
and health professionals to respond better to seasonal epidemics” (ibid, p.1013).
GFT is hence not supposed to estimate and predict influenza in an isolated way:
it is offered as a knowledge source and early warning system to be used by health professionals. On www.cdc.gov/flu/weekly, GFT is mentioned (above the WHO and Public Health Canada/England); however, it remains unclear to what extent and
how these data were/are in fact used by the CDC (or other health professionals).
While the authors describe GFT mainly as an information tool instructing the decision-making and responses of health professionals and institutions, the public version of the service seems to neglect this aspect: it presents itself as a public information source for ILI intensities.
Despite presenting GFT as a tool instructing further strategies and investigations,
the authors also anticipated a main source of miscalculations: users’ search engine
queries may not only be triggered by individual health conditions, but may also be
influenced by e.g. news about geographically distant influenza outbreaks. Hence,
the dynamics of users’ search engine behaviour act as potential confounders of
data used to instruct GFT. This connection highlights two issues: the service is
susceptible to “Epidemics of Fear” (Eysenbach 2006, p.244); moreover, it relies
on users who are ideally not influenced by any knowledge other than their own health condition or experiences in their immediate social environment.
Epidemics of Fear
Already in 2006, Eysenbach advised caution with regards to the significance of web
search queries, since they may “be confounded by ‘Epidemics of Fear’” (Eysenbach
2006, p.244). The developers of GFT also pointed out this possibility: