In this digitally connected era, all of us produce enormous numbers of data points every day. What we search. How we search it. What we buy, and what we read. What we like and dislike, whom we choose to associate with, and so much more: a steady stream of data that can be quantified, sifted and analyzed en masse with the data from everyone else to reveal patterns previously hidden, sometimes things we’re not even aware of about ourselves.
That data may offer us as a society a better way to truly understand who people really are, a theory that author Seth Stephens-Davidowitz submits for our consideration in his new book Everybody Lies: Big Data, New Data, and What the Internet Can Tell Us About Who We Really Are. A former Google data scientist who is also a visiting lecturer at Wharton, Stephens-Davidowitz joined the Knowledge@Wharton Show on Sirius XM channel 111 to talk about what properly analyzed big data can reveal about our political views, our health, our biases and more.
An edited transcript of the conversation follows.
Knowledge@Wharton: There’s not much doubt that our digital footprints say a lot about who we are, but I get the sense that people, to a degree, still scoff at the idea that so much can be gleaned from all of this information.
Seth Stephens-Davidowitz: Yes. Some people have this traditional notion of what data is. They think of it like a representative survey: You have clear questions with check boxes that people can answer very clearly. I think they get a little uncomfortable with the wild world of the internet, where data tends to be more unstructured and a little bit different than they’re used to.
Knowledge@Wharton: Does it feel like people still believe they have a higher level of data security than they really do?
Stephens-Davidowitz: I think there are definitely concerns about the power of big data. Because data is so predictive, companies can potentially use it to really take advantage of people. I talk about it in the book. One example is if you apply for a loan, companies can predict whether you’ll pay back the loan just based on the words you use in your loan application. For example, if you use the word “God” in a loan request, you’re 2.2 times more likely to default, 2.2 times more likely not to pay it back. So a company could save money by not giving loans to people who end their requests with “God bless you,” which is pretty scary.
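To make a figure like “2.2 times more likely to default” concrete, here is a minimal Python sketch of how such a ratio could be computed: the default rate among applications containing a word, divided by the default rate among those without it. The function name, the data layout and the toy records are all assumptions made for illustration; the interview does not describe how the underlying study was actually carried out.

```python
import re

def default_rate_ratio(applications, word):
    """Return (default rate with `word`) / (default rate without it).

    `applications` is a list of (text, defaulted) pairs. Both this layout
    and the function name are illustrative assumptions.
    """
    with_word, without_word = [], []
    for text, defaulted in applications:
        tokens = re.findall(r"[a-z']+", text.lower())
        group = with_word if word.lower() in tokens else without_word
        group.append(defaulted)
    if not with_word or not without_word:
        raise ValueError("need applications both with and without the word")
    rate_with = sum(with_word) / len(with_word)
    rate_without = sum(without_word) / len(without_word)
    if rate_without == 0:
        raise ValueError("no defaults in the comparison group")
    return rate_with / rate_without

# Made-up toy records, not real loan data.
toy_applications = [
    ("Please help, God bless you", True),
    ("God willing I will repay early", False),
    ("Need funds for a roof repair", False),
    ("Consolidating two credit cards", True),
    ("Buying a used car for work", False),
    ("Paying for a training course", False),
]
ratio = default_rate_ratio(toy_applications, "god")
print(ratio)  # 2.0 for this toy data: 1 of 2 with the word vs. 1 of 4 without
```

A ratio above 1 means the word is associated with more defaults in the sample. On its own it says nothing about why, and a real analysis would need far more records than this.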
Knowledge@Wharton: Throughout the book, you tackle some of the bigger issues that we have in society, like racism and child abuse. And there are all kinds of data points which will lean one way or another in these areas.
Stephens-Davidowitz: Right. There’s just so much information now from the web. And there are certain sources, such as Google, which I focus a lot on. People are just really honest and tell Google things they may not tell anyone else. So when it comes to really important areas like the ones you mentioned, we can get really new insights into who we are.
Knowledge@Wharton: One of the areas you look at is sex.
Stephens-Davidowitz: I like to say that big data is so powerful that it turned me into a sex expert, because it wasn’t a natural area of expertise for me. There’s obviously a lot of lying around sex because it’s an uncomfortable, taboo area. I think we can learn a lot more from Google searches about what people like.
Knowledge@Wharton: You also looked at racism, and you talk about how racism actually surfaced more not during the 2008 presidential race but in the immediate aftermath of President Obama’s election.
Stephens-Davidowitz: There is a disturbing element to this data. If, in general, people lie to make themselves look good, then we’re going to have an overly optimistic perception of who people are. But if we know the truth, in many areas, unfortunately, we’re going to learn darker things about people, and racism is one of the areas. It’s shocking. One of the most surprising things I found right away in this data was the shocking number of racist searches people make, basically looking for jokes mocking African-Americans. And yes, this was a big theme — really nasty searches about Obama as soon as he was elected.
Knowledge@Wharton: One of the long-held beliefs about this was that racism is more of a Southern phenomenon, but your data showed that is not necessarily the case.
Stephens-Davidowitz: Yes. If you ask in surveys or go by conventional wisdom, racism is considered a Southern issue. But I think that may be because in the South, there’s just less need to hide that racism. If you look at the Google search data, which is more honest, you see many of the areas with the highest racism are Northern places: western Pennsylvania, eastern Ohio, upstate New York, industrial Michigan. The real divide in racism these days is not South versus North, it’s East versus West.
Knowledge@Wharton: If people or companies were able to use this data in a more coherent, more effective manner, what do you think the impact in general would be for the country, or for society?
Stephens-Davidowitz: Well, there’s an optimistic scenario and a pessimistic scenario. I don’t know which one will come true. The pessimistic scenario is that companies would use this to take advantage of people, to get them to spend more money that they don’t have, or spend more time on their websites even though they don’t need to be on those websites. The optimistic scenario is that we would have insights into really, really important areas — health, racism, sexuality — and really learn how to improve society.
Knowledge@Wharton: The health angle of it is very interesting. The idea that we would be able to glean information that might lead to cures for diseases, or be able to take a more effective preventive approach, being able to catch diseases before they become worse — those things would have an incredible impact both on the people in this country, and also the economics surrounding health care.
Stephens-Davidowitz: Yes. In one of my favorite studies, they used search data and found people who made searches such as “just diagnosed with pancreatic cancer.” And you know when someone makes a search like that, they probably just got diagnosed with pancreatic cancer. Then you compare those people to similar people who never were diagnosed with pancreatic cancer, and you look at what symptoms they were searching for in the prior months. And they found really, really subtle patterns that are predictors of eventually getting a pancreatic cancer diagnosis.
For example, if you searched “indigestion” followed by “abdominal pain,” that’s a risk factor in pancreatic cancer. Whereas searching “indigestion” by itself is not a risk factor. That’s a really, really subtle pattern that is hard to pick up without massive data sets, and it almost suggests a new kind of medicine.
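The distinction between searching “indigestion” alone and searching “indigestion” followed by “abdominal pain” is a question of order within one person’s search history. A minimal Python sketch of that ordered-pattern check follows. The function name and the sample histories are hypothetical, and the study mentioned above almost certainly used a more sophisticated method than this.

```python
def searched_in_order(history, first_term, second_term):
    """Return True if a query containing `first_term` is followed,
    later in the history, by a query containing `second_term`.

    `history` is a chronologically ordered list of query strings.
    """
    seen_first = False
    for query in history:
        q = query.lower()
        # Check the second term before updating the flag, so that a single
        # query containing both terms does not count as "followed by".
        if seen_first and second_term in q:
            return True
        if first_term in q:
            seen_first = True
    return False

# Hypothetical histories, for illustration only.
history_a = ["indigestion remedies", "weather tomorrow", "abdominal pain left side"]
history_b = ["abdominal pain", "indigestion after eating"]
history_c = ["indigestion"]

print(searched_in_order(history_a, "indigestion", "abdominal pain"))  # True
print(searched_in_order(history_b, "indigestion", "abdominal pain"))  # False
print(searched_in_order(history_c, "indigestion", "abdominal pain"))  # False
```

A flag like this would only be one input to a statistical comparison between diagnosed and undiagnosed groups. By itself it is not a diagnostic test.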
Knowledge@Wharton: One of the things that may have helped revolutionize big data and our understanding of this, which you write about, is Google Trends.
Stephens-Davidowitz: Yes, it’s interesting. Google Trends can show you where different terms are searched most frequently, and you can also see how searches change over time. When it first came out, it was considered a little bit of a joke. It was not considered a scholarly source; it was just more a fun kind of PR source for Google, potentially. You could play around, learn what fashions were popular, what celebrities were popular. But I think we’re learning more and more that this is no joke. This is, as I say, probably the most important data set ever collected on the human psyche, and definitely a really important tool for researchers to focus on.
Knowledge@Wharton: And in contrast to that, a lot of these surveys that come out proclaiming data may not necessarily be as accurate as they would lead people to believe.
Stephens-Davidowitz: Yes, I think surveys have big holes in them. The more I look at surveys the more skeptical I become. Even just in little things. Recently, I looked at survey data on potential car purchase behavior versus actual car purchases, and they don’t match up at all. People say they’re going to purchase cars that they don’t, or they don’t say they’re going to purchase the car that they do. So I think surveys have been dramatically overvalued, and really are going to play a much smaller role in the future as some of these new internet data sources become more accessible.
Knowledge@Wharton: That’s part of the reason why a lot more companies are really looking at analytics, and looking at data to get a truer understanding of what consumers are thinking, correct?
Stephens-Davidowitz: Yes. I think it’s also just that you have to be careful. For every data source, you have to think: What is this data source? What are the incentives that people have when they’re giving me this data? I think a lot of people, any time they see numbers or data, they say, “Oh, that’s reliable.” But a lot of data sources are crap — pardon my language. A lot of data is really unreliable, and a lot of data is reliable. But what people click on, what people purchase, what people search — that’s more valuable than many of the other sources that you might consider.
Knowledge@Wharton: Going back to the political realm, you discuss in the book that the data and what was out there on the internet did suggest that President Trump was going to be the person to win, not only the Republican primary, but also the general election, correct?
Stephens-Davidowitz: I think there were definitely clues. It’s a little tough. It’s one of the most common questions I get: “Can you use Google searches to predict elections?” And it’s a little difficult: we’ve only had four elections in which Google search data has been around, so it’s challenging to build predictive models.
But I think within four to eight years, we’re going to be able to use this data to predict elections very, very well. I’ve talked about some of the clues I had right before the election that suggested to me Trump was going to win. A couple of things tipped me off. One, you can see based on whether people search for “how to vote” or “where to vote” before an election whether they’ll actually turn out to vote. You can’t really trust when people tell you in surveys that they’re going to vote. Everyone says they’re going to vote, and then many of them don’t. But what the search data revealed is that African-American turnout was going to be much lower than in previous elections. This really hurt Hillary Clinton in the election.
Then there is a really subtle clue that I think is fascinating: The order in which people search for candidates can give a tip-off as to which way they’re going to vote. If people search “Trump/Clinton poll,” they’re much more likely to go Trump’s way. And if people search “Clinton/Trump poll,” they’re much more likely to go Clinton’s way. And there were many more searches for Trump/Clinton polls in certain key states in the Midwest.
Knowledge@Wharton: So was there an implication that could be gleaned from just having Clinton in a search, whether that included Trump with it or whether it did not?
Stephens-Davidowitz: No, I think that search by itself is not revealing, because you may search Clinton because you love her, or you may search Clinton because you hate her. You may search Trump because you love him, you may search Trump because you hate him. It doesn’t really tell you anything. It has to be a little more subtle than that. But the order in which candidates are searched does have predictive power. It may even be that people give away who they’re going to support before they realize it themselves, because people may think they’re undecided, but if they’ve been searching “Trump/Clinton debate,” “Trump/Clinton polls,” “Trump/Clinton election,” they’re very likely to be going for Trump.
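The name-order signal described here can be sketched in a few lines of Python: for each query that mentions both candidates, record which name comes first, then tally the results. Queries naming only one candidate are skipped, in line with the point that a single name by itself is not revealing. The function names and sample queries are hypothetical, and the sketch says nothing about how such tallies would be turned into an actual forecast.

```python
from collections import Counter

def first_named(query, names):
    """Return whichever of `names` appears first in `query`,
    or None unless at least two of the names are present."""
    q = query.lower()
    positions = {}
    for name in names:
        index = q.find(name.lower())
        if index >= 0:
            positions[name] = index
    if len(positions) < 2:
        return None  # a single name alone is not treated as a signal
    return min(positions, key=positions.get)

def tally_name_order(queries, names=("Trump", "Clinton")):
    """Count how often each name is the first one mentioned."""
    counts = Counter()
    for query in queries:
        first = first_named(query, names)
        if first is not None:
            counts[first] += 1
    return counts

# Hypothetical queries, for illustration only.
sample_queries = [
    "trump clinton poll",
    "Clinton Trump poll",
    "Trump Clinton debate",
    "clinton emails",
]
print(tally_name_order(sample_queries))  # Counter({'Trump': 2, 'Clinton': 1})
```

In practice the tallies would need to be broken out by region and compared against past results before they could say anything about an election.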
Knowledge@Wharton: Do you believe, though, that we are getting to a point where people have a better understanding in general about all the data that is out there? Because it’s seemingly a fairly common story about how we really, truly don’t understand all of this data. Maybe it’s a bit of a gradual process to really get a handle on a lot of this.
Stephens-Davidowitz: I think we’re getting there pretty fast. It does need more people. I think because it initially was considered so strange that you could just understand people from their internet behavior, it hasn’t really been the subject of as much academic research as it should have been. But it’s definitely being studied more and more, and you’re seeing more and more methodologists in this area. We’re really getting there. We’re beyond the point where it’s just, “This is cool.” We’re now actually getting real, real insights into who we are from this data.
Knowledge@Wharton: So is this going to be a growth area for the U.S. economy: People who can do the analytics, who can understand how to use this data to really make the impact on companies and people alike?
Stephens-Davidowitz: Absolutely. But I think it’s more subtle than people realize. This came up a lot in my Wharton class. When you think “big data,” you think it’s this very technical thing, and it’s all about statistics and a left-brain, nerdy pursuit. And it definitely is a technical area — I’m not going to lie. But it’s surprising how much it is a creative process. It’s really about knowing what questions to ask, and knowing how to find the nuggets of information in that data. You can’t necessarily teach that. It’s a bit of an art that you learn and master over time. So I don’t think it’s as simple as throw a data scientist at this question and you’re done. It’s more complicated than that.
Knowledge@Wharton: That would lead me to believe that we’re going to see more partnerships between data scientists and a variety of different business sectors over the next couple of decades to really try to get a handle on it, using it to address some of the world’s greatest problems, whether it’s access to water or fighting disease.
Stephens-Davidowitz: It’s thrilling — the possibilities are really mind-blowing, and in big areas. Because this new data exists — and it’s honest — it makes sense to go after the big questions, to be really ambitious. Going after little questions doesn’t make sense with big data.
Knowledge@Wharton: But what about people being able to understand themselves a little bit more? We talk a lot about how this data can impact other people and impact businesses: Will people be able to understand themselves better in the future?
Stephens-Davidowitz: I think so. Data frequently understands us better than we do. For example: In the early days of the company, Netflix asked people, “What videos are you going to watch in the coming days? We know what you’re watching now, but this weekend, what do you want to watch so we’ll queue that up when the weekend comes around?” When you ask them, people say “I’m going to watch a documentary,” or “I’m going to watch avant-garde French films.” Then Friday comes around, and you have that in the queue, and they ignore it and watch the same lowbrow comedies or romance flicks that they’ve always watched. And Netflix just realized they should ignore what people tell them, and instead focus on what they actually do, and let the algorithm tell the story.
We tend to make horrible predictions about what we’re going to do in the future. Almost all of us are way too over-optimistic. I think data can ground us much better.
Knowledge@Wharton: This also could help us understand more clearly how this country may be different compared to, say, China or France or Germany. That has an impact when you’re thinking from a global perspective, whether it’s in business, politics, or on a variety of different fronts.
Stephens-Davidowitz: Definitely. It’s just really interesting to compare the differences between countries that can be revealed in this data. Then also, from a business perspective, of course, the data is just horrible from some of these countries. Nigeria, I think the biggest economy in Africa — one time, they realized there was a flaw in their GDP estimate and overnight, they changed the estimate by 90%. So traditional data in some of these countries is really, really bad. Some of these new data sources that are coming can dramatically improve our understanding of these countries.
I talk about night lights data, which can measure the economy just based on how much light is being produced. I talk about Premise, which is a company that basically just goes around taking pictures of economic activity in developing countries, and from those pictures is able to give estimates of inflation rates, interest rates and lots of other things.
Knowledge@Wharton: The potential for change from all of these different elements is massive. And they would seemingly give you much better predictive tools to use in fostering growth or avoiding pitfalls in various economies around the world.
Stephens-Davidowitz: Yes. I think I tend to be a very cynical, skeptical person, so when I hear a term like “big data,” or when I hear a buzzword, I’m just kind of like, “Ugh, these things are so silly. It’s just the latest, hottest fad.” But I’ve been studying this for five years. I’ve talked to people in the field. And I’m constantly blown away by what you can find. This one is no fad. It’s really a revolution in our understanding of people and the world.
Knowledge@Wharton: You are a self-professed cynical person, yet your life is in data. And truly, the data is the truth, correct?
Stephens-Davidowitz: Yes. I think in some sense, it confirms my skepticism, my cynicism in that you can’t trust what people tell you. With a lot of the traditional data sources, there are incentives for people giving you that data. But I’m not cynical at all about what you can learn if you know the right data to look at.
Knowledge@Wharton: You immerse yourself in this data on a daily basis at this point. I mean, this is an open-ended sector right now, in that there is data on everything and anything you could potentially want to have an effect on. This could be something where you could literally go from business to business and be able to collect data on a daily basis, correct?
Stephens-Davidowitz: Well, yes. For the end of my Wharton class, we had a group presentation and I gave them very, very broad topics. I just said, “Think of a new business in education or think about a new business in health, think about a new business in politics, and how using the tools of new data and big data would help you with that business.” And by the end of every single presentation, all of the students said, “Why doesn’t this exist? It doesn’t make sense. This should exist.” It’s usually hard to come up with something new, because smart people have been spending their whole lives trying to find things that should exist, things that people want. But I think with big data, it’s really surprisingly easy to come up with a new, important idea in a really big area.
Knowledge@Wharton: So you are positive about the future here? You’re working with that next generation, the students who are going to be out there in society. They understand the importance of these types of data points, and they will continue to build and grow them as we go forward.
Stephens-Davidowitz: It’s really exciting. The one concern is the ethical issue, definitely. Businesses may almost become too powerful, and really be able to squeeze consumers for everything they’re worth, because they know more about consumers than consumers know about themselves. That’s definitely a big concern I have.
Knowledge@Wharton: How do you guard against that?
Stephens-Davidowitz: It’s going to take a lot of work. I think most people in the areas of law and ethics aren’t quite prepared for just how revolutionary big data is in some of these arenas. Basically, the way I like to think about it is that everything correlates with everything. There’s very little that’s 0.000 correlation. So just about anything you do will predict something else you do. Traditionally, companies have had only three or four or five variables to make these predictions on. But now they have pretty much everything anybody’s ever done to make these predictions. So it’s very powerful stuff.