As discussed in our previous post, we recently had the opportunity to present some interesting challenges and proposed directions for data science and machine learning (ML) at the 2018 Scale By the Bay conference. While the excellent talks and panels at the conference were too numerous to cover here, I wanted to briefly summarize three talks in particular that I found to represent some really interesting (to me) directions for ML on the java virtual machine (JVM).
Talk 1: High-performance Functional Bayesian Inference in Scala
By Avi Bryant (Stripe) | Full Video Available Here
Probabilistic programming lies at the intersection of machine learning and programming languages, where the user directly defines a probabilistic model of their data. This formal representation has the advantage of neatly separating conceptual model specification from the mechanics of inference and estimation, with the intention that this separation will make modeling more accessible to subject matter experts while allowing researchers and engineers to focus on optimizing the underlying infrastructure.
Rainier in an open-source library in Scala that allows the user to define their model and do inference in terms of monadic APIs over distributions and random variables. Some key design decisions are that Rainier is “pure JVM” (ie, no FFI) for ease of deployment, and that the library targets single-machine (ie, not distributed) use cases but achieves high performance via the nifty technical trick of inlining training data directly into dynamically generated JVM bytecode using ASM.
Talk 2: Structured Deep Learning with Probabilistic Neural Programs
By Jayant Krishnamurthy (Semantic Machines) | Full Video Available Here
Machine learning examples and tutorials often focus on relatively simple output spaces:
- Is an email spam or not? Binary outputs: Yes/No, 1/0, +/-, …
- What is the expected sale price of a home? Numerical outputs – $1M, $2M, $5M, … (this is the Bay Area, after all!)
However, what happens when we want our model to output a more richly structured object? Say that we want to convert a natural language description of an arithmetic formula into a formal binary tree representation that can then be evaluated, for example “three times four minus one” would map to the binary expression tree “(- (* 3 4) 1)”. The associated combinatorial explosion in the size of the output space makes “brute-force” enumeration and scoring infeasible.
The key idea of this approach is to define the model outputs in terms of a probabilistic program (which allows us to concisely define structured outputs), but with the probability distributions of the random variables in that program being parameterized in terms of neural networks (which are very expressive and can be efficiently trained). This talk consisted mostly of live-coding, using an open-source Scala implementation which implements a monadic API for a function from neural network weights to a probability distribution over outputs.
Talk 3: Towards Typesafe Deep Learning in Scala
By Tongfei Chen (Johns Hopkins University) | Full Video Available Here
For a variety of reasons, the most popular deep learning libraries such as TensorFlow & PyTorch are primarily oriented around the Python programming language. Code using these libraries consists primarily of various transformation or processing steps applied to n-dimensional arrays (ndarrays). It can be easy to accidentally introduce bugs by confusing which of the n axes you intended to aggregate over, mis-matching the dimensionalities of two ndarrays you are combining, and so on. These errors will occur at run time, and can be painful to debug.
This talk proposes a collection of techniques for catching these issues at compile time via type safety in Scala, and walks through an example implementation in an open-source library. The mechanics of the approach are largely based on typelevel programming constructs and ideas from the shapeless library, although you don’t need to be a shapeless wizard yourself to simply use the library, and the corresponding paper demonstrates how some famously opaque compiler error messages can be made more meaningful for end-users of the library.
Aside from being great, well-delivered talks, several factors made these presentations particularly interesting to me. First, all three had associated open-source Scala libraries. There is of course no substitute for actual code when it comes to exploring the implementation details and trying out the approach on your own test data sets. Second, these talks shared a common theme of using the type system and API design to supply a higher-level mechanism for specifying modeling choices and program behaviors. This can both make end-user code easier to understand as well as unlock opportunities for having the underlying machinery automatically do work on your behalf in terms of error-checking and optimization. Finally, all three talks illustrated some interesting connections between statistical machine learning and functional programming patterns, which I found interesting as a longer-term direction for trying to build practical machine learning systems.