Data Mining Guest Post: Steve Gittelman

Today’s guest post is from Steve Gittelman, PhD, President of Sample Solutions LLC. He describes how Facebook Likes can be used to study health issues.

A New Source of Local Health Data: Facebook Likes

There is a cry for more local data on health, particularly about the behavioral drivers of disease. The few large federal surveys available fail to adequately cover more than 10% of the 3,143 U.S. counties, and these surveys are expensive and burdened with limitations. The resulting data scarcity prevents most American counties from fully understanding how they differ from each other regarding the behavioral determinants of disease and, most important, how to mitigate those differences.

A team of researchers from New York-based Sample Solutions LLC, a division of Mktg, Inc., in collaboration with the Centers for Disease Control and Prevention (CDC), compared government morbidity data to Mktg, Inc.’s (patent pending) analysis of billions of Facebook Likes. The result: patterns by which Facebook users express their Likes correlate heavily with the actual presence of chronic heart disease, diabetes, obesity and other health conditions that are at epidemic proportions. Models using Facebook data fill in the blanks that government survey data fail to provide. The ultimate goal of this research is to establish the potential contribution of Big Data to research that directly impacts government spending and public policy.

Online digital data offer opportunities to create models that predict otherwise unavailable information. For example, models have been used to help predict product usage based on demographic characteristics (Murray and Durrell K., 1999). The frequency with which individuals use Web-based news and research is a predictor of their gender, ethnicity and education, providing useful targeting information for ethnicity and income (Goel, Hofman, Sirer, 2012). While there are broad similarities in what various demographic groups do online, such as e-mail and social media, there are some differences that are particularly illuminating, such as the predilection to pursue news and to research health. The power of Likes is that they represent behavior, which has already proven to be a powerful tool for marketers who seek to target purchasing populations.

While Facebook Likes data are not explicitly health-related, statistical analysis shows that when taken together, the “network” of an individual’s Likes is predictive of many types of health behaviors and outcomes, regardless of the source. In short, we view Facebook Likes as a new class of data that can help us understand health conditions at a community level. To do this, the data we derive from Facebook Likes must be relevant to the health metrics we seek to address. First, Likes must predict life expectancy, the ultimate outcome of one’s quality of health. Predicting intermediary causes of a shortened lifespan, such as obesity and diabetes, is also a worthwhile stepping stone to that goal. But in order to specifically target the causes of these conditions, Likes must also be able to predict the behavioral determinants of those outcomes. If the Facebook characteristics of a region can predict exercise, smoking, and health maintenance, then a strong argument can be made in favor of the use of these data to target and correct behaviors of concern.

We hypothesize that the determinants of modern disease are behavior, lifestyle, and personality, and that Facebook Likes are potentially a way to quantify regional patterns of these characteristics. Our research to date has shown that:

  1. Likes provide a means of categorizing communities (counties).
  2. Likes can be used as an indicator of mortality.
  3. Likes can be used as an indicator of disease outcomes (obesity, diabetes, chronic heart disease).
  4. Likes can be used as an indicator of the behaviors that impact disease.

Facebook data were collected in February 2013 using the Facebook Advertising application programming interface (API), which aggregates the number of users who express interest in certain categories of items by ZIP code. These ZIP code data were then aggregated to the county level to allow for direct comparisons to the health data. The data reflect the cumulative total of users’ Likes at the time they were drawn. Out of 127 categories, 40 were selected for the model from the “super-categories” of activities, interests, and retail & shopping[1]. Because rounding performed automatically by the API routinely led to overestimates, counties with fewer than 1,000 profiles overall were excluded from the analysis. Facebook Likes were scored as a percentage of completed profiles in an area. Finally, in order to reduce multicollinearity caused by variation in levels of Facebook usage by county, values were divided by the average percentage of Likes across all categories. The resulting variables can be characterized as a measure of popularity relative to other categories.[2]
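To make the preparation steps above concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the code used in the study: the function name, the input layout, the ZIP-to-county lookup, and the sample numbers are all hypothetical. It also assumes that "the average percentage of Likes across all categories" means the mean taken within each county, which is one plausible reading of the text.

```python
from collections import defaultdict

MIN_PROFILES = 1000  # exclusion threshold stated in the post


def county_like_scores(zip_rows, zip_to_county):
    """Aggregate ZIP-level Like counts to counties and normalize them.

    zip_rows: iterable of (zip_code, profiles, {category: like_count}).
    zip_to_county: dict mapping ZIP code -> county (hypothetical lookup).
    Returns {county: {category: relative_popularity}}.
    """
    profiles = defaultdict(int)
    likes = defaultdict(lambda: defaultdict(int))

    # Step 1: sum ZIP-level counts up to the county level.
    for zip_code, n_profiles, counts in zip_rows:
        county = zip_to_county.get(zip_code)
        if county is None:
            continue  # assumption: ZIPs without a county match are skipped
        profiles[county] += n_profiles
        for category, n in counts.items():
            likes[county][category] += n

    scores = {}
    for county, total in profiles.items():
        # Step 2: exclude counties with fewer than 1,000 profiles.
        if total < MIN_PROFILES:
            continue
        # Step 3: score Likes as a percentage of profiles in the county.
        pct = {c: 100.0 * n / total for c, n in likes[county].items()}
        if not pct:
            continue
        # Step 4: divide by the county's mean percentage across categories
        # (assumed interpretation of the normalization in the text).
        mean_pct = sum(pct.values()) / len(pct)
        if mean_pct == 0:
            continue
        scores[county] = {c: p / mean_pct for c, p in pct.items()}
    return scores


# Made-up sample data, for illustration only.
zip_rows = [
    ("32301", 1200, {"cooking": 300, "running": 150, "tv": 450}),
    ("32303", 800, {"cooking": 100, "running": 250, "tv": 250}),
    ("33101", 600, {"cooking": 60, "running": 30, "tv": 90}),
]
zip_to_county = {"32301": "Leon", "32303": "Leon", "33101": "Miami-Dade"}

scores = county_like_scores(zip_rows, zip_to_county)
```

In this toy input the second county falls below the 1,000-profile threshold and is dropped. For the remaining county, a score above 1 marks a category that is more popular than that county's average category, and a score below 1 marks one that is less popular, which is the "popularity relative to other categories" reading given above.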

Population data, such as average income, median age, and sex ratio, were collected from the 2010 census, broken into county aggregates. Supporting county-level statistics unrelated to health were collected using “USA Counties Information” provided by the Census Bureau. Overall, of the 3,143 counties in the United States, only 2,069 counties contained sufficient data for all variables in the analysis.

Figure 1 displays the geographic patterns associated with the prevalence of each factor in Florida. Just as one would expect given the very different populations on the panhandle and the peninsula, scores on Factor2 (our most strongly health-related factor) vary a great deal between the state’s two regions. A map of life expectancy is shown beside the map of the factor. There are substantial differences, but the general contrast between the Northern and Southern portions of the state is evident in each. This is a confirmation of our first hypothesis.

Figure 1: Distribution of a dimension of Facebook Likes is shown next to Life Expectancy in Florida (darker counties are higher in their respective measure).

Given that Likes, by their nature, revolve around commercial activities, whether they take the form of entertainment, food, or packaged goods, it is unavoidable that a great deal of what we see in the variation of Likes is driven by socioeconomic status.  In parallel, given the nature of our healthcare system, wealth and health outcomes are inextricably linked.  Nevertheless, while this socioeconomic component of health will be difficult to alter, there is a further element that is linked to voluntary behavior that can improve one’s health in the areas of nutrition, exercise, and health maintenance.  Unfortunately, at this point, we feel that our measures of all but exercise are too weak to form a complete picture of how well Facebook could potentially predict these other behaviors.

[1] The exact method for determining these categories has not been reported by Facebook.

[2] Though the individual variables resulting from this transformation were sometimes entirely uncorrelated with the originals, estimates using the raw and transformed variables correlated at R=0.9. Thus, we conclude that the results of the analyses are not an artifact of this transformation.

More information about Sample Solutions LLC.

