The Problems of Sensitive Data and Leakage
Data science and machine learning have gotten a lot of attention recently, and the ecosystem around these topics is moving fast. One significant trend has been the rise of data science notebooks (including our own here at Sumo Logic): interactive computing environments that allow individuals to rapidly explore, analyze, and prototype against datasets.
However, this ease and speed can compound existing risks. Governments, companies, and the general public are increasingly alert to the potential issues around sensitive or personal data (see, for example, GDPR). Data scientists and engineers need to continuously balance the benefits of data-driven features and products against these concerns. Ideally, we’d like a technological assistance that makes it easier for engineers to do the right thing and avoid unintended data processing or revelation.
Furthermore, there is also a subtle technical problem known in the data mining community as “leakage”. Kaufman et al won the best paper award at KDD 2011 for Leakage in Data Mining: Formulation, Detection, and Avoidance, which describes how it is possible to (completely by accident) allow your machine learning model to “cheat” because of unintended information leaks in the training data contaminating the results. This can lead machine learning systems which work well on sample datasets but whose performance is significantly degraded in the real world. As this can be a major problem, especially in systems that pull data from disparate sources to make important predictions. Oscar Boykin of Stripe presented an approach to this problem at Scale By the Bay 2017 using functional-reactive feature generation from time-based event streams.
Information Flow Control (IFC) for Data Science
My talk at Scale By the Bay 2018 discussed how we might use Scala to encode notions of data sensitivity, privacy, or contamination, thereby helping engineers and scientists avoid these problems. The idea is based on programming languages (PL) research by Russo et al, where sensitive data (“x” below) is put in a container data type (the “box” below) which is associated with some security level. Other code can apply transformations or analyses to the data in-place (known as Functor “map” operation in functional programming), but only specially trusted code with an equal or greater security level can “unbox” the data.
To encode the levels, Russo et al propose using the Lattice model of secure information flow developed by Dorothy E. Denning. In this model, the security levels form a partially ordered set with the guarantee that any given pair of levels will have a unique greatest lower bound and least upper bound. This allows for a clear and principled mechanism for determining the appropriate level when combining two pieces of information. In the Russo paper and our Scale By the Bay presentation, we use two levels for simplicity: High for sensitive data, and Low for non-sensitive data.
To map this research to our problem domain, recall that we want data scientists and engineers to be able to quickly experiment and iterate when working with data. However, when data may be from sensitive sources or be contaminated with prediction target information, we want only certain, specially-audited or reviewed code to be able to directly access or export the results. For example, we may want to lift this restriction only after data has been suitably anonymized or aggregated, perhaps according to some quantitative standard like differential privacy. Another use case might be that we are constructing data pipelines or workflows and we want the code itself to track the provenance and sensitivity of different pieces of data to prevent unintended or inappropriate usage. Note that, unlike much of the research in this area, we are not aiming to prevent truly malicious actors (internal or external) from accessing sensitive data – we simply want to provide automatic support in order to assist engineers in handling data appropriately.
Implementation and Beyond
Depending on how exactly we want to adapt the ideas from Russo et al, there are a few different ways to implement our secure data wrapper layer in Scala. Here we demonstrate one approach using typeclass instances and implicit scoping (similar to the paper) as well as two versions where we modify the formulation slightly to allow changing the security level as a monadic effect (ie, with flatMap) having last-write-wins (LWW) semantics, and create a new Neutral security level that always “defers” to the other security levels High and Low.
- Implicit scoping Most similar to the original Russo paper, we can create special “security level” object instances, and require one of them to be in implicit scope when de-classifying data. (Thanks to Sergei Winitzki of Workday who suggested this at the conference!)
- Value encoding For LWW flatMap, we can encode the levels as values. In this case, the security level is dynamically determined at runtime by the type of the associated level argument, and the de-classify method reveal() returns a type Option[T] where it is None if the level is High. This implementation uses Scala’s pattern-matching functionality.
- Type encoding For LWW flatMap, we can encode the levels as types. In this case, the compiler itself will statically determine if reveal() calls are valid (ie, against the Low security level type), and simply fail to compile code which accesses sensitive data illegally. This implementation relies on some tricks derived from Stefan Zeiger’s excellent Type-Level Computations in Scala presentation.
Data science and machine learning workflows can be complex, and in particular there are often potential problems lurking in the data handling aspects. Existing research in security and PL can be a rich source of tools and ideas to help navigate these challenges, and my goal for the talk was to give people some examples and starting points in this direction.
Finally, it must be emphasized that a single software library can in no way replace a thorough organization-wide commitment to responsible data handling. By encoding notions of data sensitivity in software, we can automate some best practices and safeguards, but it will necessarily only be a part of a complete solution.