As I mentioned in one of my previous posts, here at Sumo Logic we believe cloud-based services provide excellent value due to their ease of setup, convenience and scalability, and we leverage them extensively to provide internal services that would be far more time, labor and cash intensive to manage ourselves. Today I’m going to talk about some of the services we use for collaboration, operations and I/T, why we use them, and how they simplify our lives.
Campfire is a huge part of our productivity and culture at Sumo Logic. While I would lump this and Skype together under something like “Managed Corporate Messaging” they fill two very different niches in our environment.
Campfire from 37 Signals is a fantastic tool for group conversations. Using the Campfire service, we have set up multiple chat rooms for various types of issues, including Production Issues, Development Issues, Sales/Customer-Support Issues, and of course, a free-for-all chat-room where we try to make one another spontaneously erupt into chaotic LOLs.
These group-chats provide a critical space where we can work together to troubleshoot and solve problems cooperatively. Campfire makes it very easy to upload pictures and share large amounts of information in real-time with co-workers who can be anywhere. The conversations are all archived for later reference, which allows us to use the Production Incidents room as a 24×7 conference call and canonical forum of record for anything happening to production systems. Our Production on-call devs are expected to echo their actions into the Production channel and keep up with events there as they transpire.
Campfire also has a cool feature which allows you to start a voice conference with participants if needed, which is a great option in certain situations. These calls can also be archived for later reference. One down side to the text and audio archives is that they are not easily searchable so it helps to know approximately when something happened, and we have found it necessary to consult other records to determine where to look.
Skype is, of course, the very popular IM and VOIP service that was purchased by Microsoft a while back. We use Skype extensively for 1:1 chatting and easy and secure file-transfers throughout the company. We also make extensive use of the wide array of available emoticons. (Stefan Zier is a particularly prolific and artistic user of these.)
We also use Skype video chat for interviews and to collaborate with team members abroad. We have a conference room with a TV and Skype camera just for this application.
Running a large-scale cloud-based service requires a lot of operational awareness. One of the ways we achieve this is through Cloudkick. Cloudkick was recently acquired by Rackspace and is evolving into Rackspace Cloud Monitoring. We are still on the legacy Cloudkick service, which we have come to use heavily.
We automatically install Cloudkick agents on all of our production instances and use them to collect a wide array of status codes from the O/S and through JMX as well as by running our own custom scripts which we use to check for the existence of critical processes and to detect if things like HPROF files exist.
The Cloudkick website has a “show only failures” mode which we call the “What’s Wrong? Page”. This is a very helpful tool that allows our EverybodyOps team to quickly assess issues with our production environment.
Of course, we also need to be proactively alerted to failures and crossed thresholds that could indicate trouble, and for this we rely on PagerDuty. (Affectionately known as P. Diddy to many of us, nickname coined by Christian). PagerDuty is another great tool which allows us to maximize the benefits of our EverybodyOps culture.
Within PagerDuty we have a number of on-call rotations. One for our Production Primary role and one for the Secondary role, as well as another role for monitoring test failures and a lesser-known role for those of us who monitor the temperature in the one small server room we do have. P. Diddy allows us easily cover for each other using exceptions or by simply switching the Primary and Secondary roles on the fly if the Primary needs to go AFK for a while.
P. Diddy allows each user to set their own personal escalation policy which can include texting, calling, and emailing with a configurable number of re-tries and timeouts. Another nice touch is that the rotation calendars can be imported into our personal calendars to remind us of when we are up next. This all makes the on-call rotation run pretty flawlessly from an administrative perspective with no gnarly configuration and management on our end.
I must admit, I do have a personal habit of “Joaning” my secondary when I am on call… To properly “Joan” your secondary you accidentally escalate an alert to them that you meant to resolve, (I blame the comma after “Resolv”!)
Like many companies of all sizes we rely on Google for our email service. While some Sumos (like myself and Stefan) use mail clients to read our email, most Sumos are happy with the standard web interface from Google. We also heavily use internal groups for team communications.
We also make good use of Google Docs for document authoring and sharing (this blog post was written and communally edited using Google Docs, in fact, due to the impressive real-time collaboration, Stefan Zier is watching me add this bit in order to resolve his comment right now!) We use Google Calendar for our scheduling needs (and calendar-stalking exercises!)
We also use Google Analytics to obsess over you.
Also, as Sumo Logic’s Director of Security, (which makes me partially responsible for managing the users and groups in Google Apps) I appreciate the richness of their security settings and especially the two-factor authentication and mobile device policy management.
These are just some of our SaaS providers. In an upcoming post I’ll talk more about some of the services that help us support and bill our customers and test and develop our product.
We have found all of these providers deliver valuable and even crucial services that it would be far more expensive and time consuming for us to manage ourselves. We hope you may find some of them helpful too!