This year, at Sumo Logic’s third annual user conference, Illuminate 2018, we presented Sumo Logic Notebooks as a way to do data science in Sumo Logic. Sumo Logic Notebooks are an experimental feature that integrates Sumo Logic, notebooks and common machine learning frameworks. They are a bold attempt to go beyond what the current Sumo Logic product has to offer and enable a data science workflow leveraging our core platform.
In the data science world, notebooks have emerged as an important tool. Notebooks are active documents that are created by individuals or groups to write and run code, display results, and share outcomes and insights.
Like every other story, a data science notebook follows a structure that is typical for its genre. We usually have four parts. We (a) start with defining a data set, (b) continue to clean and prepare the data, (c) perform some modeling using the data, and (d) interpret the results. In essence, a notebook should record an explanation of why experiments were initiated, how they were performed, and then display the results.
Anatomy of a Notebook
A notebook segments a computation into individual steps called paragraphs. A paragraph contains an input and an output section. Each paragraph executes separately and modifies the global state of the notebook. State can be defined as the ensemble of all relevant variables, memories, and registers. Paragraphs need not contain computations; they can also contain text or visualizations to illustrate the workings of the code.
The input section (blue) contains the instructions to the notebook execution engine (sometimes called a kernel or interpreter). The output section (green) displays a trace of the paragraph’s execution and/or an intermediate result. In addition, the notebook software exposes some controls (purple) for managing and versioning notebook content as well as operational aspects such as starting and stopping executions.
Human Speed vs Machine Speed
The power of the notebook is rooted in its ability to segment and then slow down computation. Computer programs are commonly executed at machine speed. Machine speed means that when a program is submitted to the processor for execution, it runs from start to end as fast as possible and only blocks for IO or user input. Consequently, the state of the program changes so fast that it is neither observable nor modifiable by humans. Programmers typically attach debuggers, physically or virtually, to stop programs during execution at so-called breakpoints and to read out and analyze their state. In doing so, they slow down execution to human speed.
Notebooks make interrogating the state more explicit. Certain paragraphs are dedicated to making progress in the computation, i.e., advancing the state, whereas other paragraphs simply serve to read out and display the state. Moreover, it is possible to rewind state during execution by overwriting certain variables. It is also simple to kill the current execution, thereby deleting the state and starting anew.
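The interplay between paragraphs and state can be sketched in a few lines of Python. This is only a toy model, not how any particular notebook engine is implemented: the helper run_paragraph and the variable names are made up for illustration, and each paragraph is simply a code snippet executed against one shared namespace.

```python
# Toy model (an assumption, not a real notebook engine): every paragraph
# is a code snippet executed against one shared namespace, the "state".
state = {}


def run_paragraph(source, state):
    """Execute one paragraph against the shared notebook state."""
    exec(source, state)


# Paragraphs 1 and 2 advance the state.
run_paragraph("counts = [3, 5, 8]", state)
run_paragraph("total = sum(counts)", state)

# Paragraph 3 only reads out and displays the state.
run_paragraph("print('total =', total)", state)
first_total = state["total"]

# Rewind: overwrite a variable, then rerun only the dependent paragraph.
run_paragraph("counts = [3, 5]", state)
run_paragraph("total = sum(counts)", state)
rewound_total = state["total"]

# Kill the execution: the state is deleted and we start anew.
state.clear()
```

Each call to run_paragraph happens at human speed, so the state can be inspected or changed between any two steps.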
Notebooks as an Enabler for Productivity
Notebooks increase productivity because they allow for incremental improvement. It is cheap to modify code and rerun only the relevant paragraph. So when developing a notebook, the user builds up state and then iterates on that state until progress is made. Running a stand-alone program, by contrast, incurs more setup time and might be prone to side effects. A notebook will most likely keep all its state in working memory, whereas a stand-alone program needs to build up the state every time it is run.
This takes more time, and the required IO operations might fail. Working off a program state in memory and iterating on it has proved to be very efficient. This is particularly true for data scientists, as their programs usually deal with large amounts of data that have to be loaded in and out of memory, as well as computations that can be time-consuming.
From an organizational point of view, notebooks are a valuable tool for knowledge management. As they are designed to be self-contained, sharable units of knowledge, they lend themselves to:
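A minimal sketch of this workflow, assuming a made-up load_data function that stands in for an expensive load: the data is loaded once into memory, and only the cheap analysis step is rerun while the user iterates.

```python
import time

load_calls = 0  # counts how often the expensive load actually runs


def load_data():
    """Hypothetical stand-in for an expensive load from disk or network."""
    global load_calls
    load_calls += 1
    time.sleep(0.01)  # simulate slow IO
    return list(range(1, 101))


# Paragraph 1: build up the state once.
data = load_data()


# Paragraph 2: modify and rerun as often as needed; data stays in memory.
def analyze(values, threshold):
    """Keep only the values above the threshold."""
    return [v for v in values if v > threshold]


first_try = analyze(data, threshold=90)
second_try = analyze(data, threshold=95)  # rerun without reloading
```

A stand-alone program would call load_data on every run; here the counter stays at one no matter how often the second paragraph is changed and rerun.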
- Knowledge transfer
- Auditing and validation
Notebooks at Sumo Logic
At Sumo Logic, we expose notebooks as an experimental feature to empower users to build custom models and analytics pipelines on top of log and metrics data sets. The notebooks provide the framework to structure a thought process. This thought process can be aimed at delivering a special kind of insight or outcome. It could be drilling down on a search, or an analysis specific to a vertical or an organization. We provide notebooks to enable users to go beyond what Sumo Logic operators have to offer, and to train and test custom machine learning (ML) algorithms on their data.
Inside notebooks we deliver data using data frames as a core data structure. Data frames make it easy to integrate logs and metrics with third-party data. Moreover, we integrate with other leading data wrangling, model management and visualization tools/services to provide a blend of the best technologies to create value with data.
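As a rough illustration of why data frames make this integration easy, the following pandas sketch joins log records with third-party data. The column names and values are invented for the example; they do not reflect the schema that Sumo Logic Notebooks actually deliver.

```python
import pandas as pd

# Hypothetical log records, as they might look once a query result
# is available as a data frame (made-up columns and values).
logs = pd.DataFrame({
    "host": ["web-1", "web-2", "web-1", "db-1"],
    "status": [200, 500, 404, 200],
})

# Hypothetical third-party data: an inventory mapping hosts to teams.
inventory = pd.DataFrame({
    "host": ["web-1", "web-2", "db-1"],
    "team": ["frontend", "frontend", "storage"],
})

# Enrich every log record with the owning team.
enriched = logs.merge(inventory, on="host", how="left")

# Count error responses (status >= 400) per team.
errors_per_team = enriched[enriched["status"] >= 400].groupby("team").size()
print(errors_per_team)
```

The left join keeps every log record even if a host is missing from the inventory, which is usually the safer default when enriching logs.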
Sumo Logic Notebooks are an integration of several software packages to make it easy to define data sets using the Sumo Query language and use the result data set as a data frame in common machine learning frameworks.
Notebooks are delivered as a Docker container and can therefore be installed on laptops or cloud instances without much effort. The most common machine learning libraries such as Apache Spark, pandas, and TensorFlow are pre-installed, but others are easy to add through Python’s pip installer, or using apt-get and other package management software from the command line. Changes can be made persistent by committing the Docker image.
The core of Sumo Logic Notebooks is the integration of the Sumo Logic API data adapter with Apache Spark. After a query has been submitted, the adapter loads the data and ingests it into Spark. From there we can switch over to a Python/pandas environment or continue with Spark. The notebook software provides the interface to specify data science workflows.
Best Practices for Writing Notebooks
#1 One notebook, one focus
A notebook contains a complete record of procedures, data, and thoughts to pass on to other people. For that purpose, it needs to be focused. Although it is tempting to put everything in one place, this might be confusing for users. It is better to write two or more notebooks than to overload a single one.
#2 State is explicit
A common source of confusion is that program state gets passed on between paragraphs through hidden variables. The set of variables that represent the interface between two subsequent paragraphs should be made explicit. Referencing variables from paragraphs other than the previous one should be avoided.
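One possible way to make state explicit is sketched below, with invented names and data. Each paragraph states its input and output in a comment, does its work in a function that only touches its arguments, and hands exactly one named variable to the next paragraph.

```python
# Paragraph 1: define the data set.
# Output: raw_durations_ms
raw_durations_ms = [120, 95, None, 310, 101]  # made-up sample values


# Paragraph 2: clean and prepare the data.
# Input: raw_durations_ms. Output: durations_ms
def clean(values):
    """Drop missing measurements."""
    return [v for v in values if v is not None]


durations_ms = clean(raw_durations_ms)


# Paragraph 3: model the data.
# Input: durations_ms. Output: mean_ms
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


mean_ms = mean(durations_ms)
print(mean_ms)
```

Because the functions never read notebook variables directly, a reader can tell from the call alone what each paragraph depends on.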
#3 Push code in modules
A notebook integrates code; it is not a tool for code development. That would be an Integrated Development Environment (IDE). Therefore, a notebook should only contain glue code and maybe one core algorithm. All other code should be developed in an IDE, unit tested, version controlled, and then imported into the notebook via libraries. Modularity and all other good software engineering practices are still valid in notebooks. As with practice number one, too much code clutters the notebook and distracts from the original purpose or analysis goal.
#4 Use descriptive variable names and tidy up your code
Notebooks are meant to be shared and read by others. Others might have a hard time following our thought process if we do not come up with good, self-explanatory names. Tidying up the code goes a long way, too. Notebooks impose an even higher quality standard than traditional code.
#5 Label diagrams
A picture is worth a thousand words. A diagram, however, needs some words to label axes, describe lines and dots, and convey other important information such as sample size. Without that information, a reader can have a hard time grasping the scale or importance of a diagram. Also keep in mind that diagrams are easily copy-pasted from the notebook into other documents or chats, where they lose the context of the notebook in which they were developed.
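A labeled diagram might look like the following matplotlib sketch. The data is made up, and the off-screen Agg backend is chosen only so that the snippet also runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen; an assumption for portability
import matplotlib.pyplot as plt

latencies_ms = [120, 95, 310, 101, 87, 140]  # made-up sample

fig, ax = plt.subplots()
ax.plot(range(len(latencies_ms)), latencies_ms, marker="o",
        label="request latency")

# Label the axes, including units.
ax.set_xlabel("Request number")
ax.set_ylabel("Latency (ms)")

# Put the sample size in the title so it survives copy-paste.
ax.set_title(f"Request latency (n = {len(latencies_ms)})")

# Describe what the line and dots represent.
ax.legend()
```

Keeping units and sample size inside the figure itself means the diagram stays interpretable after it leaves the notebook.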
The segmentation of a thought process is what fuels the power of the notebook. Facilitating incremental improvements when iterating on a problem boosts productivity. Sumo Logic enables the adoption of notebooks to foster the use of data science with logs and metrics data.
- Visit our Sumo Logic Notebooks documentation page to get started
- Check out Sumo Logic Notebooks on DockerHub or Read the Docs
- Read our latest press release announcing new platform innovations, including our new Data Science Insights innovation