There is no such thing as bias in data; there is only bias in society.
Sarah has expertise built over years of using data science to innovate, from countering terrorism to building products, and she now guides others.
Ben: Welcome to the Masters of Data podcast, the podcast that brings the human to data. And I'm your host, Ben Newton. Our guest this episode is at the epicenter of innovation around artificial intelligence and machine learning. Sarah Catanzaro is a principal at Amplify Partners, a venture capital firm that specializes in early stage companies that are innovating with machine learning and AI.
Sarah helps guide founders and innovators because of her incredible expertise, built over years of using data science to innovate both for the private sector and in protecting US national security. If you find this exciting, and believe me, I did, you are going to enjoy this episode. So without any further ado, let's dig in.
Welcome everybody to the Masters of Data podcast. I am very excited to be here today with Sarah Catanzaro, and we're actually meeting in her office at Amplify Partners in Menlo Park, where she's a principal today. Welcome, Sarah.
Sarah: Thanks, thanks for having me.
Ben: Absolutely. I've been really excited about talking to you, particularly after going back and reading about your career. You've done some very interesting things. So like we always do on this podcast, we start off getting to know the people we're interviewing, so talk just a little bit about ... I mean, how did you end up in data science, in technology? What led you in that direction?
Sarah: Yeah, absolutely. I probably have one of the weirder career paths of anybody I've met before. So, I actually started my career in the defense and intelligence sector.
Ben: Yeah, I saw that.
Sarah: Yeah, I had grown up in New York, including during 9/11, and I think I was really compelled by this question, why do people commit atrocities? Why do people engage in violence? And so during my time at Stanford, I worked on some of these research questions, really trying to get at the heart of why do terrorist groups organize themselves in the way that they do. What I found was that the statistical approaches could be used to get better insights into these organizational dynamics, so right after college, I ended up going to the Center for Advanced Defense Studies, and there the focus of the research program I was directing was on computational models of adversary behavior. So again, how can we use software, how can we use statistics, to understand an insurgent group, or understand a terrorist or criminal group of some other nature?
I wanted to, though, really have more of an opportunity to see the output of that research, not just to do these kinds of academic exercises. So I moved on to a larger defense contractor, and then ultimately to a role with Palantir, and again, really, really kind of focusing on this question of how do we use data to predict organizational behavior?
So when people then ask how I ended up in VC, how I ended up working with startup data, to me, it's actually kind of the same thread. I've always been trying to answer this question: how does this small, covert organization disrupt an incumbent? How do we predict their behavior with data? So again, after leaving the defense, the federal path, I ended up at Mattermark, where we were collecting data on other startups and selling it to investors. Ultimately one of our customers ended up recruiting me, which was my kind of path into the venture world.
Ben: You know, now hearing you actually explain it yourself, what I find interesting about that: one thing that's come up a lot on the podcast, even to the point where I made it our tagline, is bringing the human to data. But what was really behind that was that a lot of times the connection that gets missed is, you know, people go out and build algorithms, they collect data, they make these applications, but they're not really thinking about the human side of it. But what's really interesting is, it seems like you almost started with the human.
How did you kind of come to that? Was there something kind of in your background that made you attracted to the human behavior part of it, or what really attracted you, made you want to study human behavior from like, a statistical perspective?
Sarah: Yeah, so I think the hard thing about studying human behavior is maintaining objectivity. You're always inclined to think about what you would do in a certain position, or how you would act. It's just instinct, it's empathy. And so that, in a sense, pollutes the type of analysis that you can do, which is why I think the statistical methods ... particularly when coupled with more qualitative approaches ... are really powerful. Because they enable us to analyze human behaviors in a more objective way, to kind of put it under a microscope and subtract ourselves from the equation.
Ben: That's really interesting. Well, I mean, one thing, you know, we had talked about a few things we could talk about, and one of the things that came up was data bias. And a lot of times when it comes up, it's the human element where, you know, we really see it.
So I get to interview a couple authors, I don't know if you've heard of Cathy O'Neil or ... and Virginia Eubanks, and you know, both of them kind of came at it from different angles, that there are these kinds of unintended consequences. So how do you think about that? Because you've seen some areas that a lot of people don't get to see, in defense. You've been at different kinds of analytics firms, and now you're in the startup world. Where do you see the bias creep in, and how do you think about it?
Sarah: So I think what we need to remind ourselves always, is that data is a representation of the world. Data is a representation of human behavior. So, in a sense, there is no such thing as bias in our data, there is only bias in society, bias in individuals, and that gets reflected in our data sources. I think what worries me sometimes is that we see these approaches, we see these conversations around removing bias from our datasets. If we remove the bias from our datasets, then we have no way of actually identifying that bias in society, so in a sense, I'd much rather us use analytical methods to identify the bias that is reflected in our data, but exists in our society, rather than just trying to eliminate the bias that appears in our datasets.
Ben: What does that look like, practically? Like how do you do that?
Sarah: These are really, really hard questions, many of which I've been toying with myself. So let's just think about GDPR. On its face, I think privacy is so important, and protecting underrepresented populations is so important. But if, for example, we remove data about gender from ... let's just take something from my world ... from a dataset about female investors, about investors writ large. Well, now I no longer have the ability to look at the difference in behaviors between men and women, and the difference in outcomes between men and women. So I can't point to kind of the systemic or structural issue that again, may lead to this "bias" in my data.
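Sarah's GDPR example can be sketched in a few lines of Python. The dataset and numbers below are entirely hypothetical; the point is only that a disparity is measurable while a gender field exists in the data, and becomes invisible (rather than fixed) once that field is removed.

```python
# Hypothetical records of investor pitches: (gender, funded)
pitches = [
    ("F", True), ("F", False), ("F", False), ("F", False),
    ("M", True), ("M", True), ("M", False), ("M", True),
]

def funding_rate(records, gender):
    """Fraction of pitches by the given gender that were funded."""
    subset = [funded for g, funded in records if g == gender]
    return sum(subset) / len(subset)

rate_f = funding_rate(pitches, "F")  # 1/4 = 0.25
rate_m = funding_rate(pitches, "M")  # 3/4 = 0.75
print(f"Funded: women {rate_f:.0%}, men {rate_m:.0%}")

# If the gender column is stripped (e.g., for privacy), the same data
# can no longer surface this gap: only an overall rate of 0.5 remains.
anonymized = [funded for _, funded in pitches]
print(f"Overall funding rate: {sum(anonymized) / len(anonymized):.0%}")
```

Dropping the sensitive column removes the ability to point at the systemic pattern, which is exactly the trade-off Sarah is describing.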
Ben: I mean, do you feel like there's a recognition in the data science community in particular to think around those issues? 'Cause to some degree, it feels like it's coming to a head. You know, with the recent things with Facebook and other places, it's kind of coming out in full view, right?
Sarah: Yeah. I really wish I saw more efforts to be proactive about identifying the bias in our data and its underlying sources. And then potentially using analysis as a way to remedy those problems. I think right now, the level of concern is still a bit superficial. It really hinges much more on privacy, and around the ways in which data can be used as a weapon to really oppress, again, those underrepresented populations.
I see far fewer efforts to use data as a tool of enablement, or to use data to really highlight those gnarly elephants in the room that we're not talking about. When you can put data behind something, when you can make it objective, that's really powerful. No longer are you just talking about people being oppressed in this kind of generalized way; you're showing, like, this is happening. This is real. And you can no longer ignore it. So I'd much rather see data wielded in that way.
Ben: I mean, have you actually seen an example that you are able to talk about, where you've seen that work out well, where somebody's actually done that?
Sarah: I think, while it is not an example of a formal statistical analysis or data analysis, the Me Too movement was a great example. It's one thing to say that every woman experiences harassment at some point in her life, or that there are certain things that make it more difficult for women to really accelerate in their career. But there was something very, very impactful about seeing all of those Me Too hashtags. I think, you know, that's an example of where we had data that was making these kinds of insidious societal patterns palpable, that made us see just kind of the scale of this problem.
Ben: That's interesting, yeah, 'cause I mean I guess to some degree, people don't ... things will sit at the back of their mind, but until you actually really bring it home with like the enormity of the problems, and that is a good example. I mean one of the things you do at Amplify is that you work with startups, so you're mentoring startup leaders, and you're trying to help them make, you know, the right steps.
How do you go about helping some of these startups get this right? 'Cause then you know, now it seems like data is fueling a lot of new innovations, so I'm assuming you're probably dealing with, you know, at least issues kind of around this pretty often. How would you talk to a new startup leader, to help them navigate that?
Sarah: I think a lot of it has to take place at the strategic level, and it needs to be addressed intentionally. So, as a company, you need to understand what problems you're going to solve, and what sort of future you want to enable. And from that type of awareness, you can back up into what type of data you need to collect. What sort of analysis you need to perform. What type of futures you want to avoid, and therefore how you are going to structure both your dataset and your analysis, such that you can circumvent that negative outcome.
But I don't think that the best position for any of our portfolio companies will ever be reacting to what emerges in their data. It takes just a lot more intention around objectives and potential futures.
Ben: Well, it seems like, I mean, part of it is the level of diversity of lots of different sorts, you know, backgrounds, perspectives early on, would help ... 'cause it seems some of these things that we've seen have happened because the diversity wasn't there when they were making those strategic decisions. Even like a recent example, like this thing with Wilbur Ross during the shutdown, where he said he couldn't believe that these government workers couldn't go out and get loans. It was so tone deaf, but you know, those kinds of things where, because he had no experience with that, he couldn't even connect with it.
And I mean, some of this you can't get your head around unless you actually have people that can actually connect with it on a human level. Does that sound right to you?
Sarah: Absolutely. I love that you asked that question, because, you know, I think another problematic trend in Silicon Valley is this tendency to want to hire for diversity as an end in itself. In fact, diverse teams produce better analysis for the very reason that you described. When you have different types of perspectives, you can imagine different edge cases, you can imagine different outcomes, you can imagine different interpretations, and so in fact, diverse teams are just better. Especially in the world of data science.
Ben: Yeah, I mean are you seeing any movement that direction? Does it seem like it's getting better, from what you at least, the world that you get to see?
Sarah: I think it is, you know, I think we're moving more from diversity to inclusion, and in making that transition, we do need to think about why diversity matters. If you're not asking the question of why diversity matters, then you're not going to design your inclusion initiatives in such a way as to benefit the organization and the individual maximally. So I think we are making progress. I think Airbnb has done some great work, again, using analysis as a tool to better understand their problems with diversity, particularly in their hiring funnel, and then talking about why diverse data teams make them stronger.
Ben: That's cool. I mean, Airbnb has always been kind of ahead of the curve, I think, on that kind of thing. Well I mean, you made a pivot there. Talk a little bit more about you, so, you're a woman leader, and venture capital, you know, there's not many of you. And how's that experience been for you, being in the venture capital world as a woman?
Sarah: Yeah, I think there are advantages and disadvantages, as there are in any role. So, I won't pretend that it's easy. I think in fact as a female in venture capital I do need to hustle harder. I need to generate my own deal flow, because I don't have this network of male colleagues that are sending things to me that are kind of propping me up. But in fact, because I need to generate my own deal flow, because I need to find things out on my own, it forces me to develop these muscles that really have applications beyond just finding great companies, or finding interesting people. It positions me better to then win deals, to then support those portfolio companies. So, it's not easy being a woman in VC, you just don't have the same social structures that your male colleagues may have, but there's more opportunity to grow in a sense.
Ben: I mean even looking at Amplify's website, and looking at your team, it seems like you do have a more diverse team in general than some of the other venture capital firms I've seen.
Sarah: Yeah. Well, I think just as I was describing with Airbnb, at Amplify we really do value having diverse perspectives and diverse networks. And so, in fact, I think even from a purely rational, purely economic perspective, we see why having a diverse team is going to make us a better venture capital firm. If I was seeing the same deals, if I was interested in the same markets, if I had the same social network and professional network as all of my colleagues, then we'd be missing out on a ton of opportunities. But because we have a diverse team, we just see more.
Ben: That makes a lot of sense. And I even think back to, I think I told you earlier when we first talked, I got a chance to interview Sarah Guo over at Greylock, and I realized then, and I kind of sense this here too, that the way she talked about her relationships with the startups seemed different to me. And you could just hear it in the words she used and how she talked about them. So I think that's ... it really seems like it's been changing for the better over the last few years, 'cause I mean, I've been in the industry for the last 20 years. At least from my perspective, I can see a difference.
How long have you actually been here at Amplify now?
Sarah: So I've been at Amplify for about a year.
Ben: Okay. Yeah, and so, in general, when you look outside of Amplify, do you feel like you're seeing movement in the right direction? Do you feel there's recognition?
Sarah: So I think there's recognition now that diverse investment teams are important. And in fact, there is a bit more acceptance that diverse investment teams may in fact have better outcomes. We still probably need more data around that to sway some people. So frankly, I don't think, you know, that battle is yet won, but we're making progress there.
Where we need to use data science, where we need to use these computational tools to change things, is really in better understanding where the problem is. So, do we need more women in GP roles? Do we need to change promotion practices such that there's more upward movement internally? Do we need a bigger pool of female analysts and associates, or diverse analysts and associates? I don't think we yet know how to solve this problem. We're really just at the point of recognizing that there's a problem, and that it might in fact have a broader impact than we believed.
Ben: It's like going back to what you were saying before about using the data to uncover the bias. So you're wanting to basically use the same method you just talked about to uncover this same problem. That makes sense.
Sarah: Yeah, exactly. Just knowing that bias exists, does not absolve you of anything. The next step is thinking about how we can leverage the set of tools we have to change things. To get rid of that bias in society, or at least to mitigate it in some way. So acknowledging that there are not enough women in venture, and that that may in fact impact venture performance, that's step one. But step two is, how do we use the data that's available to us, and other tools at our service to change that? To get more women in venture, to get more women writing checks, and so on, and so forth.
Ben: Well I'd say, and I mean the position you have at a venture capital firm, you have a unique opportunity to help make that change happen, which is great. I mean the startups that you work with, and here at Amplify. So, I mean that's great. Well, maybe we'll make a little switch here, I mean you know, one thing that caught my interest when we were originally talking was, you know, we were talking a little bit about big data ...
I know one thing that I've said before is that it feels like to me that if you go back, you know, maybe it was like over a decade ago, but you know, when people first started talking about big data, there was all this data and they started talking about the amazing things you could do with it. And it seems like that went into that trough of disillusionment, you know, where it seemed like it had failed? And so you talked a little bit about what you think was skipped during that, what went wrong, so talk to me a little bit about that. I mean, do you feel like the big data movement was a failure? I mean how do you view that from your perspective?
Sarah: So I think the problem is that there's this gap between big data and AI. So, I'll give an anecdote for a prototypical company. First, they create their [inaudible 00:19:24] infrastructure, their big data infrastructure. So now they're storing, you know, a large volume of data. Then they hire a data scientist or ML researcher. That data scientist or ML researcher is expected to design a neural net to do something or other, and then that's expected to be productionized, and that's expected to deliver some sort of business outcome.
But frankly, like, that is way too simplistic. If the data scientist does not have clean data, then they will not be able to design any model to achieve the performance metrics that are expected. If they don't have, you know, robust data infrastructure, or production infrastructure, then deploying into production and operationalizing their model is going to be exceedingly challenging. If there's no sort of monitoring system or really engineering practice around that production ecosystem, then it's going to be very, very difficult to determine whether or not it is yielding some sort of business result.
And so, in making this transition from big data to AI, we almost treated AI as a silver bullet, and we kind of forgot the fact that data preparation, that analysis, that engineering practices, like these problems need to be solved. So I think a lot of organizations are looking at their AI efforts now, and thinking either like, this was just a failure, or we need to go back and figure all of these things out.
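The monitoring gap Sarah describes, where nothing watches a production model, can be illustrated with a minimal drift check. The metric, threshold, and numbers here are illustrative assumptions, not a production design: flag when a live input distribution has moved far from its training-time baseline.

```python
from statistics import mean, stdev

def drift_alert(baseline, live, z_threshold=3.0):
    """Flag drift when the live mean sits far from the training mean,
    measured in baseline standard deviations."""
    mu, sigma = mean(baseline), stdev(baseline)
    z = abs(mean(live) - mu) / sigma
    return z > z_threshold

# Training-time values of one input feature vs. what production now sees:
train_feature = [10.0, 11.0, 9.5, 10.5, 10.2, 9.8]
live_feature = [15.1, 14.8, 15.5, 15.0]

if drift_alert(train_feature, live_feature):
    print("ALERT: input distribution has drifted; model metrics may be stale")
```

Real systems track many features, prediction distributions, and business metrics, but even a crude check like this is the kind of "engineering practice around the production ecosystem" the conversation points at.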
Ben: I mean, it sounds like partly what you're saying is that basically we need to take these big data applications and make them more robust, inserting, like ... I mean, things that we've been doing for years in IT. You can't just throw something together and expect amazing insights to come out of it; you actually have to apply the things that we've learned about running big applications for years and years and years, and do that with your data and AI as well. Am I hearing you right?
Sarah: Yeah, exactly. I mean, let me put it this way. If you have a recipe you're not necessarily going to get a great meal. You need good ingredients. You probably need an oven, and grill, and these other tools. And if those are not working, then again, you're not going to get a great meal. But essentially, what organizations have done over the past five years is asked people to build recipes, and expected to have-
Ben: Amazing results.
Sarah: A Michelin star. Yeah, results. So, now my hope is that organizations are basically stubbing their toe. They're like, "Oh gosh, we hired a bunch of ML researchers, and we're not getting any sort of output." And thinking not, "Ah, this is a failed experiment," but, "What do we need to make this work?" And that's going to come from two directions. One is from the business strategy side of things.
So firstly, given a set of data, we can't expect just to "apply ML" and have goodness emerge. We need to understand what types of problems we should be solving with data and machine learning. So having additional clarity around ML strategy is going to really help accelerate enterprise adoption. And then we need to think about what is good practice in terms of understanding data quality, in terms of cleaning and preparing our data, in terms of monitoring and debugging our models in production. And also just in terms of organizing data science teams to remove the friction that may exist between data scientists, between data scientists and data engineers, and between data scientists, data engineers, and other business users.
Ben: When you said data engineer there, it kind of ... something clicked for me. I mean, it sounds like maybe this is a transition that even computer science in these large production systems went through. I mean, they started off with, you know, a bunch of CS graduates basically continuing their research in the professional world. And over time you developed these layers of people, like, you know, people who are really good at running things at scale, and you developed these kinds of engineering skillsets.
So it sounds like maybe that's probably what's happening with data science, is you're developing these kinds of ancillary skills, these data engineers that actually understand how to run these things, while you still have the data scientists who are kind of figuring out how to, you know, make the ML machinery work. Does that sound right?
Sarah: Yeah, I mean skills, products, and processes. So, imagine engineering without GitHub. It would be a lot sloppier. Likewise, without the right products and processes for data science, we end up with results that are difficult to reproduce, that may not be effective at all without clean data, and that ultimately are not going to deliver the best business outcomes.
Ben: Yeah, that makes sense. Well, and one thing you've mentioned before when we talked was about building up data science teams. I don't know, something I hear over and over again is the skills gap, you know, there's not enough people to do this, and I definitely know and ... Well, if you're not in Silicon Valley, I'm sure that's definitely the case. So what do you think about that? 'Cause one thing you mentioned was that data science adoption in general was slow. So, how do you think about that in your perspective? I mean, how do we work around that? How do we build these skillsets, how do we build teams? How do you think about that?
Sarah: Well, you brought up an interesting point before, talking about the specialization that occurred in engineering. I think certainly we need to understand the different roles that a data scientist may play, and different people will have different strengths. So, for example, some people really enjoy kind of data munging and data pipelining, and thinking about the data infrastructure. But they don't necessarily like the analytical component. That's perfectly normal. Others are really, really exceptional at analysis, but don't want to work with infrastructure. We do need to kind of get better at understanding what these different roles are, and also making people within each of those roles as effective as possible, again, given the right tools and products. A lot of the tools that data scientists are using today were designed for individual contributors, not for teams.
I think what we're going to see in the next couple of years is new product suites, new solutions that really take hold that remove some of the friction associated with individuals trying to piece all of their things together and get them to interoperate. Things around, you know, data provenance; things around collaborative notebooks. A lot of this will not necessarily solve the data science gap, but it will help make people more productive in their role, so you get more out of each data scientist, and more out of the team.
Ben: Yeah. Where do you think that's going to come from? 'Cause even as you're saying it now, I mean, there are, you know, open source things like Jupyter Notebook, and Apache Zeppelin, and things like that. But there are also, you know, all the major cloud providers. You know, AWS, Azure, and GCP now have these out-of-the-box ML solutions and stuff like that. So where do you think that consolidation is going to come from, where are the best tool sets going to come from? Is it going to be open source, or do you think it's really going to be these kinds of one-stop-shop, you know, service solutions out of a cloud provider?
Sarah: Frankly, I think it'll come from a range of places, and perhaps different products will serve different parts of the market. So, some will be more oriented around the needs of an enterprise, some will better serve the mid-market, some may serve academia, and so on and so forth. I think what we're seeing now, again, is that you have these, perhaps more sophisticated data organizations that have kind of established some sort of data science and machine learning competency. And now it's time for them to productionize their models.
And that's just not working well. You know, a colleague describes the, I think, endless cycles of data science misery that emerge when a data scientist is kind of tasked by a business analyst to answer some question, and then that data scientist needs to lean on a data engineer to create the right pipeline, or to do the ELT to get, you know, this one feature, and then the data engineer needs to lean on a DevOps person to get the right resources provisioned, and so on and so forth. I think even these more sophisticated data organizations are seeing all of this friction and building out their own end-to-end infrastructure that at least reduces some of these inefficiencies. What we're also beginning to see is that, in fact, some of the teams that built this infrastructure are spinning out their solutions and commercializing them, so that it's no longer just available to that one company, so that it can be leveraged by a broader audience.
Ben: Well, I guess that makes sense, 'cause the same thing happened with microservices and things like that, with Google and Facebook. So it sounds like, you know, it gets back to where we started: this is a human thing at the end of the day. You know, it's humans working together, and how can you remove the friction with tools, you know, have the right set of skillsets. And, you know, it sounds like there's maybe a developing set of skills in, like, connecting people. 'Cause I even think, like, I mean, I'm a former product manager, and that's a lot of what you do as a product manager, is connecting people that are not necessarily inclined to connect, and getting everybody kind of aligned in the same direction. Does that sound right to you?
Sarah: Yeah, I mean a lot of it is just orchestrating people, but again, there are ways that you can do this more effectively through teams, through products, and through processes. So I think it'll be a combination of those different levers that leads us to a better future.
Ben: And it seems like, even getting back to the AI thing, there's a lot of hype, there's a lot of, you know, even you know, now in the popular culture, people are thinking about what AI's gonna be used for. What do you think of the things that are gonna be over the next couple of years that are gonna surprise us about the applications to this technology? I mean, are there things that you're kind of tracking at a high level? Do you think that maybe people aren't seeing right now? 'Cause I mean, a lot of it's just like, you know, the robots taking over something, you know but what do you actually see from a practical perspective?
Sarah: So, one question that I like to ask myself often is, what is something I believe that other people don't necessarily believe? Like, what are the opinions I hold that are relatively contrarian? And frankly, if there was something I was very excited about that seemed rather science-fictiony, or perhaps not even necessarily overhyped, but just weird, it would be this area of augmented intelligence. If I unpack that and think about what precedes augmented intelligence, I think it is a more sophisticated understanding of what humans are good at, and what machines are good at, and how we develop the right interfaces between humans and machines.
So, when I think about, you know, the future of AI, and what we're going to see over the next couple of years, not necessarily the next, you know, 50 years, I think it will be these applications where we enable humans to excel in their role, and AI to excel in its role, and we're applying AI to augment human professions, not necessarily to replace them.
Ben: That's definitely interesting, 'cause I remember I've had a couple discussions, I think I even wrote something about, you know, this kind of difference between HAL 9000, you know, like the killer robot, and the kind of idea of an Iron Man. And so it sounds like what you're talking about is more about taking what a human can do and making them more effective at that. You know, it's more like wearing an Iron Man suit than some sort of, like, walking, talking robot that we're talking about here, right?
Sarah: Yeah, absolutely. I mean, it is about augmenting, about creating Iron Man, not from scraps of metal, but based on human experience. There are a lot of things that we do today that we're just not super well suited for. I mean, think about certain elements of memory, certain routine tasks that we do. Like, how can we fill in those gaps in the future?
Ben: That's really fascinating. So if we look at what you're focused on, what's next for you? What are you focusing on over this next year, now that we're at the start of 2019? What's kind of your big focus area?
Sarah: Well I think I've alluded to this already, but certainly thinking about the data ecosystem. About the suite of tools that we need to make data scientists more effective, to enable that transition from experiments to operationalization of models, and to also do this in a compliant and ethical manner as it relates to at least observing data bias, and having more insights into data quality, that's something that's really exciting to me.
I managed data science teams for most of my career before I went into venture, and I saw how hard it was, frankly. It's not easy for a data scientist to operate within an organization. It's not easy for them to accelerate or progress in their career, and there's just friction kind of up and down in every aspect of what they do, even as it relates to just being able to understand their output and how they contribute to the business. How they think about the business outcomes that they deliver. And I wanna make that better. Like, I don't want data scientists today to have the same headaches that teams I managed had years ago.
So, sure I could go back into managing data science teams, and help relieve some of those pains as a manager, but I think as an investor, I get to look at these opportunities to make data science work better, not just for people who are reporting to me, but for hundreds of organizations, thousands of organizations. Now that's still relatively new, I think there are interesting vertical applications too, so when I look at healthcare, and the life sciences, I think the way that we are discovering and bringing drugs to market needs to shift, and is shifting rather dramatically.
When I look at finance, and our ability to deal with truly big data, you know, petabytes of data, I think the way that financial markets operate, the way that we think about saving and investing, is also fundamentally changing. Those are just a couple of examples of places where there're gnarly data problems and big data sets that can have an impact on society.
Ben: Well it sounds like what you're working on is real exciting, and good thing that you don't have small ambitions. So that's good. Well, I mean we wish you luck, and I'm gonna be real excited to follow what you get involved in, and thank you for your time today. This was great.
Sarah: Course. Thank you, it was a pleasure.
Speaker 3: Masters of Data is brought to you by Sumo Logic. Sumo Logic is a cloud native, machine data analytics platform, delivering real time, continuous intelligence as a service to build, run, and secure modern applications. Sumo Logic empowers the people who power modern business. For more information, go to SumoLogic.com. For more on Masters of Data, go to MastersOfData.com, and subscribe, and spread the word by rating us on iTunes or your favorite podcast app.