January 13, 2015

Demo: Sumo Logic Next Generation Log Management & Analytics (Featured Video)

In this demo, we're going to walk through an example customer scenario and show how you can get visibility into business and technical metrics through dashboards. Then we're going to proactively identify a problem and determine its root cause using LogReduce. Let's get started. For the purposes of our demo, we're going to talk through an example application called Stock Trader, which is a pretty typical three-tier application with web, app and database tiers, as well as servers and network devices. Starting off with the business dashboard, we can see some examples of metrics that we might want to follow. In the top left, we can see revenue over the last hour. It was calculated by looking at the number of buy and sell transactions for the Stock Trader app and at the commission we might earn on each one of those. In the top right, we've taken the IPs of all the requests coming in and put them in a geolocation context, so I can see where my users are coming from. And you can see other examples: how many stocks we're selling, the top stocks that we're selling, purchased stocks, top sellers. So this puts everything in a business context, and we can see how the business is running. Now at this point, if I'm a business analyst, I'm getting pretty worried because my revenue, at the top left there, just took a nosedive. So I'm going to call my operations team and ask them what's going on. Let's switch to that view.
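The revenue panel described above can be approximated in a few lines. This is only a minimal sketch: the log line format (`action=... symbol=... shares=... price=...`), the flat 1% commission rate, and the function name `commission_revenue` are all assumptions made for illustration, since the demo does not show the Stock Trader log format or the actual query behind the dashboard.

```python
import re

# Assumed flat commission per trade; the demo does not state the real rate.
COMMISSION_RATE = 0.01

# Hypothetical Stock Trader log format, invented for this sketch.
TRADE = re.compile(
    r"action=(buy|sell)\s+symbol=(\w+)\s+shares=(\d+)\s+price=(\d+(?:\.\d+)?)"
)

def commission_revenue(lines, rate=COMMISSION_RATE):
    """Sum the commission earned on every buy or sell line; skip other lines."""
    total = 0.0
    for line in lines:
        match = TRADE.search(line)
        if match is None:
            continue  # not a trade record (for example, a plain web request)
        shares = int(match.group(3))
        price = float(match.group(4))
        total += shares * price * rate
    return total

sample_lines = [
    "2015-01-13 10:00:01 action=buy symbol=ACME shares=10 price=50.00",
    "2015-01-13 10:00:07 action=sell symbol=XYZ shares=4 price=25.00",
    "2015-01-13 10:00:09 GET /index.html 200",
]
print(f"{commission_revenue(sample_lines):.2f}")
```

Restricting `sample_lines` to the last hour of logs before calling the function would give a figure comparable to the "revenue over the last hour" panel.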

Now we can look at what the application operations team might be looking at. Right off the bat, I can see that my response time just went through the roof. It went from a mere quarter of a second to over seven seconds, which is not great. I can also see that the errors are increasing while my revenue is decreasing, so it's putting the situation in a business context for me in the operations center. I can also see 404 errors, so I'm getting errors on my web pages, which again is not great for the user experience. And it all seems to be coming down to these database exceptions on the bottom right. I can see I've got some database timeout errors here. This is probably a good place to start looking for my root cause. If I click on this monitor, it will launch me into the search console so I can take a detailed look at what's going on. Now I can see the search that generated the results we were just looking at on the dashboard. I can see that I'm looking for exceptions in the context of my application. There are really three stages to root cause analysis. One, we need to know what the problem is, and we've seen that in terms of the database errors and the slowdown for our customers. Next, I need to know when it started, and I can see on the bottom right here that it basically came out of nowhere, so that's probably right where it started. And then finally, I need to identify what the root cause of the issue is, and that's what we're going to be doing next.

I'm going to take a look at the detailed messages here, which you can see on the bottom left, and I'm going to look at a five-minute window on either side. What that's going to allow me to do is try to find exactly what initiated this problem. Now we're searching over a 10-minute time span, and in that 10-minute time span we're looking at 133 pages of logs. That's not something that I, as a human, am going to be able to go through very easily. With keyword searches, I would typically try to pare this down. I might be looking for something like an error that I already know, the database name, the IP, anything that is associated with this problem. And this is where LogReduce comes in, because we need something better. It allows us to use machine learning to look at all the patterns within your data without excluding anything, and to find those needles in the haystack that are usually at the root cause of most problems in the data center. Logs by nature are very noisy and very verbose. In this first line, we can see a stack trace, and while it's important, it has happened over 900 times, and we're able to condense that into one single line here. Next, we can see some access records for our PIX firewall, and we also see the asterisks that indicate the patterns that Sumo Logic has been able to find. It was smart enough to know that the IPs change but that the messages still belong to the same grouping.
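The idea of condensing many near-identical messages into one line with asterisks can be illustrated with a toy example. This is not how LogReduce works internally (the demo only says it uses machine learning); the sketch below swaps in simple regex masking of IP addresses and numbers, and the sample log lines and function names are invented for illustration.

```python
import re
from collections import Counter

# Assumed masking rules: values that vary between otherwise identical
# messages become "*". IPs are masked before bare numbers on purpose.
_IP = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")
_NUM = re.compile(r"\b\d+\b")

def signature(line):
    """Collapse a log line into a coarse pattern with wildcards."""
    return _NUM.sub("*", _IP.sub("*", line))

def reduce_logs(lines):
    """Return (signature, count) pairs, most frequent first."""
    return Counter(signature(line) for line in lines).most_common()

# Made-up messages, not taken from the demo.
sample = [
    "Accessed URL 10.0.0.5 from 192.168.1.20",
    "Accessed URL 10.0.0.7 from 192.168.1.31",
    "Accessed URL 10.0.0.9 from 192.168.1.44",
    "access-list deny ip any any applied",
]
for pattern, count in reduce_logs(sample):
    print(count, pattern)
```

Here the three access records collapse into a single `Accessed URL * from *` pattern, while the one-off message keeps its own line. Reading such a summary from the least frequent pattern upward is one way to surface rare events without filtering anything out beforehand.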

Next, we can see the GET statements from our IIS logs. Based on my own understanding of the logs, I might want to adjust that pattern, and I can do that with the edit function here by deselecting part of the URL and indicating that this is more of the grouping pattern that I'd like to see. I can also influence the relevancy with the thumbs up and thumbs down here, based on what I know is important. I can also see some Windows events here, interesting but not necessarily relevant. Now here's something interesting. It's a pretty draconian statement executed on my PIX firewall that denied access to a wide range of IPs. More likely than not, this is the root cause of my problem. So let's stop here and think about how we would have done this with keyword search. Again, we might have looked for the database name, the database IP, the port number, any number of things that I would have known to be a problem before. But if I did not specifically know to look for this command in my PIX firewall logs, I never would have found it. And that's the power of LogReduce: I was able to detect those patterns without excluding any data and find something that I wasn't even looking for.

So let's summarize. Without Sumo Logic, I would have started off with an application problem. More likely than not, I would have found out about it from a call from an angry user, which might have taken hours. After I find out, I'm going to start off with my keyword search, but it's going to take me a very long time because I don't know what I'm looking for, and I'm going to have to take a lot of wrong roads before I get to the right path. Once I find the cause, it's only going to take me minutes to restore the service. And how would this work with Sumo Logic? Well, I could have started off with a proactive alert telling me in seconds that something was out of the ordinary. As you saw in the demo, root cause analysis with LogReduce takes minutes because it allows me to get to those important details without excluding any of my data. And once I've found the problem, it still only takes minutes to restore service. Thank you for watching this video today, and we look forward to seeing you on Sumo Logic.
