1 Infodemiology: Covering `Supply' and `Demand'


A. Richterich

Looking at the field of infodemiology today, one can differentiate between the following approaches to epidemiological surveillance, based on:

• ‘professional’, public information online (e.g. Health Map, Global Public Health

Intelligence Network)

• explicit, conscious information provided by users and affected individuals (e.g.

Flu Near You, Grippeweb)

• implicit information provided (mainly) unconsciously by users (e.g. Google Flu

Trends; Yahoo research/Polgreen et al. 2008)

Already in 1997, the WHO and Health Canada's Centre for Emergency Preparedness and Response (CEPR) were working on a prototype for the (subscription-based) Global Public Health Intelligence Network (GPHIN; Health Canada 2003).

In 2005, Mawudeku and Blench described GPHIN as a “unique multilingual system”

which “gathers and disseminates relevant information on disease outbreaks and

other public health events by monitoring global media sources such as news wires

and web sites” (p.9).

Such ‘supply’-oriented approaches can still be found in more recent studies. For

example, Breton et al. (2012) showed that semantic analyses of sources such as

the Agence France-Presse (AFP) may be used for predicting epidemic intensities.

Likewise, not only professional media communication may be useful, but also

the analysis of social media communication shows potential. Chunara, Andrews

and Brownstein (2012) employed data from the microblogging platform Twitter in

order to detect the outbreak of cholera in Haiti and to monitor the intensity of the

epidemic. With regards to this data source, it remains unclear to what extent one

is dealing with information which has been provided ‘consciously’ by users, in

the sense that they were aware of a further use of their tweets. In addition, this

approach made use of data derived from the HealthMap project14 which is based on

an automatic analysis of semantic content from blogs, news websites, RSS feeds as

well as official surveillance data.

Services such as Flu Near You15 or the German Grippeweb16 (transl. ‘Flu Web’; Robert Koch Institute) pursue crowdsourcing strategies. They rely on the conscious participation of volunteers providing information on their own health status or (potential) influenza symptoms.

14 See http://healthmap.org/en. The service has been developed by researchers affiliated with Boston Children’s Hospital and Harvard Medical School (Brownstein et al. 2008). It combines different data sources, such as Twitter feeds and platforms such as Google News, with official public health reports. While it was launched as early as 2006, it has received the most media attention since the Ebola outbreak in 2014. On March 14, the site first picked up on news reports about a hemorrhagic fever, while the WHO only officially reported on the Ebola outbreak more than a week later (March 23, 2014). At this point, the disease had of course not yet been classified as Ebola, but the information could have acted as an early indicator.

15 See https://flunearyou.org/. The service is closely related to HealthMap and has been developed by epidemiologists from Harvard University and Boston Children’s Hospital as well as the Skoll Global Threats Fund.

16 See https://grippeweb.rki.de/

Using Transactional Big Data for Epidemiological Surveillance: Google Flu. . .

These approaches depend on a deliberate effort

of volunteers and their truthful, accurate reporting. Due to the limited scope of this chapter, I will not be able to discuss the aforementioned approaches in more detail. Particularly the involvement of volunteers appears to be an interesting development, as it emphasises users’ deliberate, conscious participation rather than the automatic ‘mining’ of their data for health-relevant information. However, these approaches raise issues regarding users’ capability for self-diagnosis and their sincerity in participation.

3.2 Analysing Health Information Demand

This chapter focuses on approaches in epidemiological surveillance drawing on

transactional big data: the documentation and analysis of user behaviour (i.e. their

search queries) with regards to health relevant information. Big data, produced

by the search terms entered by vast numbers of users worldwide, form the

basis for these attempts. In particular, studies by Eysenbach (2006), Polgreen et

al. (2008) and Ginsberg et al. (2008) have explored this aspect of infodemiology.

They all start from the assumption that certain search queries may be motivated by

influenza or influenza-like-illness (ILI), either experienced by the individual her/himself or in her/his social environment. Assuming that a certain search query

correlates (steadily) with actual influenza intensities, it may be used as indicator

of disease dynamics.

Initially, Eysenbach explored this research field with his Google ad sentinel

method. He was able to demonstrate “an excellent correlation between the number

of clicks on a keyword-triggered link in Google with epidemiological data from the

flu season 2004/2005 in Canada” (Eysenbach 2006, p.244). Eysenbach described

his approach as a “trick” (ibid, p.245), since the actual Google search queries

were not available to him. Hence, he had to create a Google AdSense commercial

campaign in order to obtain the necessary data. His method was not able to obtain

actual search query quantifications, but only allowed him to factor in those users

who subsequently clicked on a presented link. When (Canadian) Google users

entered “flu” or “flu symptoms”, they were presented with an ad (“Do you have the flu?”) created by Eysenbach. The link led to a health information website regarding

influenza. As an (alleged) advertising customer, Google Inc. provided him with

quantitative information and geographic data. When relating these to data from

the governmental FluWatch Reports (Public Health Agency Canada), he detected

a positive correlation between the increase of certain search queries and influenza intensities.


In 2008, Polgreen et al. presented a similar study design. The researchers, one of

them from Yahoo Research, were provided with data from Yahoo Inc. web search

logs. Based on queries related to influenza (March 2004-May 2008) and internet

protocol addresses which allowed for their geographic localisation, the researchers

created a database which was then related to the data of traditional surveillance



systems.17 Just like Eysenbach, they established a correlation between certain search terms and actual influenza intensities. Apart from emphasising the cost-efficiency

advantages of their approach, the researchers highlighted that predictions could be

calculated in a very timely manner: “With use of the frequency of searches, our

models predicted an increase in cultures positive for influenza 1–3 weeks in advance

when they occurred” (Polgreen et al. 2008, p.1443).

When Ginsberg et al. published their results in November 2008,18 they hence did

not present a completely new approach. However, their publication was accompanied by the launch of a public Google Inc. service19 in 2008. Earlier studies had

merely emphasised the methodological potential of web search queries. The authors

summarise their investigation: “Because the relative frequency of certain queries is

highly correlated with the percentage of physicians visits in which a patient presents

influenza-like symptoms, we can accurately estimate the current level of weekly

influenza activity in each region of the United States, with a reporting lag of about

one day” (Ginsberg et al. 2008, p.1012).

4 Case Study: Google Flu Trends

The aforementioned studies are all enabled by the fact that digital user activities

such as the use of search engines are not ephemeral, but are turned into transactional

big data. Ginsberg et al. used data provided by Google Inc.’s market leading search

engine: they were hence derived from databases documenting users’ search queries.

As a result, the GFT interface (July 2015) illustrated historical and estimated influenza

intensities in geographic maps as well as line graphs (see Fig. 1). The service is not

merely a result of analysing web search queries: during its development phase (and

for its adjustment), the researchers had to draw on biomedical data provided by

traditional epidemiological surveillance networks. In addition to the search queries

data, they used two main data sources. They employed data publicly provided by

the CDC for nine U.S. surveillance regions as well as state-reported ILI percentages

for Utah. The CDC publishes online information regarding the number of patients who have been diagnosed with influenza or ‘influenza-like-illness’ (ILI) (see http://www.cdc.gov/flu/weekly). During the influenza season, these data are updated weekly.

17 Mainly two data sources were relevant for this project: “Each week during the influenza season, clinical laboratories throughout the United States that are members of the World Health Organization Collaborating Laboratories or the National Respiratory and Enteric Virus Surveillance System report the total number of respiratory specimens tested and the number that were positive for influenza. The second type of data summarize weekly mortality attributable to pneumonia and influenza. These data are collected from the 122 Cities Mortality Reporting System” (Polgreen et al. 2008, p.1444).

18 The paper was originally published online on November 19, 2008, but was corrected on February 19, 2009 (see http://www.nature.com/nature/journal/v457/n7232/full/nature07634.html#cor1).

19 Strictly speaking, GFT is part of Google.org, a Google Inc. initiative (see also Strom and Helft

Ginsberg et al. retrieved these data and tested them for correlations with selected

search queries. They developed a database of potentially relevant queries which

were subsequently related to the data provided by the CDC: “For the purpose of our

database, a search query is a complete, exact sequence of terms issued by a Google

search user [...]. Our database of queries contains 50 million of the most common search queries [...]” (Ginsberg et al. 2008, p.1014). Originally, this database

consisted of “hundreds of billions of individual searches from 5 years [2003–2008]

of Google web search logs” (ibid, p.1012). The top 45 queries showing a correlation

with increasing influenza/ILI intensities were then chosen as initial basis for the

construction of GFT. These search queries were allegedly related to topics such as

influenza complications and symptoms or certain antibiotic medication. However,

as Lazer et al. pointed out, the exact terms have never been disclosed and moreover

“the examples that have been released appear misleading” (2014, p.1204). This is

already indicative of GFT’s tendency to ‘black-box’ certain information which is, on the one hand, crucial for understanding its functioning but which may, on the other hand, facilitate miscalculations. In the following two sub-sections, I will analyse

GFT with regards to such issues by drawing on the pragmatist perspective which I

outlined before.

4.1 Normative Assumptions, Justifications and Values

First, I will highlight the normative assumptions – the arguments, justifications and

values – which are articulated and neglected in the (corporate) presentation of, and the scientific and public debates about, Google Flu Trends. When GFT was initially presented in

a paper titled “Detecting influenza epidemics using search engine query data”

(Ginsberg et al. 2008), the authors emphasised advantages and promises, but likewise

pointed to conditions and risks of GFT. The paper starts with an affirmation of

obvious threats posed by influenza epidemics: the illnesses and deaths caused by

seasonal influenza epidemics as well as the incalculable health threat of new strains

of influenza virus. With regards to these risks, the authors claim to have developed

a model implemented in the service GFT which estimates influenza activity with a

reporting lag of one day (and is hence considerably quicker than traditional influenza

surveillance networks which provide data with a reporting lag of 1–2 weeks). The

authors summarise this in a later section of the paper:

Harnessing the collective intelligence of millions of users, Google web search logs can

provide one of the most timely, broad-reaching influenza monitoring systems available

today. Whereas traditional systems require 1-2 weeks to gather and process surveillance

data, our estimates are current each day (Ginsberg et al. 2008, p.1014).



As I explained before, this model relies on monitoring the health-seeking

behaviour of users represented by their queries in the online search engine Google.

One main condition for its functionality is hence a sufficiently large population of

search engine users. Likewise, it is based on cooperation with the CDC, using the

influenza data which are publicly accessible online. One needs to keep in mind that

these data are merely provided for the influenza seasons. Therefore, the model used

for GFT could only involve data describing ILI activities for these time frames.

As I will explain below, it has been pointed out that this approach was most

likely also responsible for early miscalculations in GFT. The cooperation during the

development is described as a continuous process of ‘sharing’ with the Epidemiology and Prevention Branch of the Influenza Division at the CDC to assess the system’s timeliness and accuracy (see ibid, p.1013). Hence, the CDC served as a source of validation

in order to ensure the accuracy of the data. Despite the abovementioned claims

regarding improved efficiency and timeliness, the authors describe GFT not as

a solitary service, but as an initial indication for further responses to potential epidemics.

The system is not suggested as “replacement for traditional surveillance” (ibid.);

instead, these influenza estimations are meant to “enable public health officials

and health professionals to respond better to seasonal epidemics” (ibid, p.1013).

GFT is hence not supposed to estimate and predict influenza in an isolated way:

it is offered as a knowledge source and early warning system to be used by health

professionals. On www.cdc.gov/flu/weekly, GFT is mentioned (above the WHO and

Public Health Canada/England); however, it remains unclear to what extent and

how these data were/are in fact used by the CDC (or other health professionals).

While the authors describe GFT mainly as an information tool instructing the decision-making and responses of health professionals and institutions, the public version of the service seems to neglect this aspect: it presents itself as a public information source for ILI intensities.

Despite presenting GFT as a tool instructing further strategies and investigations,

the authors also anticipated a main source of miscalculations: users’ search engine

queries may not only be triggered by individual health conditions, but may also be

influenced by e.g. news about geographically distant influenza outbreaks. Hence,

the dynamics of users’ search engine behaviour act as potential confounders of

data used to instruct GFT. This connection highlights two issues: the service is

susceptible to “Epidemics of Fear” (Eysenbach 2006, p.244); moreover, it relies

on users who are ideally not influenced by any knowledge apart from their own health condition or experiences in their immediate social environment.


Epidemics of Fear

Already in 2006, Eysenbach advised caution with regards to the significance of web

search queries, since they may “be confounded by ‘Epidemics of Fear’” (Eysenbach

2006, p.244). The developers of GFT also pointed out this possibility:
