Dominique Levin, Senior Vice President at LogLogic, helps us take a look at the past, present and future of log management for operations.
Log management is not a new concept; it has been around for a long time. In the 80s it was the primary mechanism for fault analysis and management of computer systems. The sheer success of log data nearly killed it off before it really took off. The cacophony of log formats and the sheer volume of messages generated – up to 40 terabytes a month for a mid-sized organisation or, shall we say, 100,000 log messages every second – make it virtually impossible for any human being to realistically track logs. Subsequently, large-scale event management systems such as HP OpenView, based on SNMP alerts and other event data including selected error log messages, emerged as the new kings of fault detection.
For a while, all was good, but then came the big compliance demands, from legislation to industry standards such as PCI DSS, which mandates that, in order to prove compliance, organisations must have knowledge of specific events and activities – for example, be able to track and monitor all access to network resources and cardholder data. They must also allocate responsibility for checking security log files for breaches, be able to report on the findings and take action to remedy any issues raised. In short, organisations need to implement better security practices in order to protect themselves and their customers' data.
These compliance drivers, along with increased security concerns, have revived log file management, which otherwise might not have made it back. The requirement to track user activity, to provide forensic data that could stand up in court, and to identify potential insider and outsider intrusions and transgressions of corporate networks has resulted in a new, updated form of log file analysis. There are now technologies that make easy work of capturing, analysing and storing the huge volumes of log data, allowing organisations to access the information in much the same way as they would use a search engine.
Now compliance is no longer the only driver: virtualisation and the ever-increasing cost of downtime in our networked economy have led system and network administrators to rediscover log data and the value it delivers. In surveys, more than 70% of organisations confess that their primary budget for log management still comes from compliance. However, this same group admits that, for years now, 70% of their use of log data has been driven by operational needs such as fault detection and problem isolation. This is no surprise, because operations use cases can drive true log management ROI. One minute of downtime could cost millions, so if automating log management can help to accelerate problem isolation, then companies are willing to pay. If giving help-desk employees access to normalised log data can off-load expensive third-level support personnel, then that is even better.
While log management for operations and log management for compliance or security are different applications, they share many of the same foundational requirements. Consequently, system administrators can benefit from recent advances inspired by security applications, such as:
- Collection
The ability to collect log data from a large variety of sources – with different protocols and different formats – through either an agent-less or an agent-based infrastructure. Near-real-time collection is also critical to both security and operations uses of logs. Such timely collection enables alerting that warns users of recent or even impending system failures.
- Normalisation
The ability to compare log data from disparate sources. One example is a user activity report that aggregates all login activity for a particular user, including logins to the VPN and the finance server. Another is a single report that shows all activity for a particular user, from e-mails sent to web sites visited. For operational use, performance measurement across different systems can only be done on normalised data.
- Summarisation
The ability to count and summarise the log messages collected, by log type, by message type and such. One failed login perhaps isn’t meaningful, but more than five in a row could be significant. The same logic applies to system errors and failures that need to be reviewed while using logs for maintaining and optimising system and network operations.
- Statistical analysis
Unusual patterns in log data, an unusual ratio between accepted and denied connections on a firewall for example, can be an indication of a security breach. In the future, statistical algorithms applied to log data may enable failure prediction and other advanced analysis that directly contributes to improved SLAs.
- Alerting
The ability to trigger (near) real-time alerts that are user configurable, based either on manually written rules or on automated statistical analysis. Such alerts serve to bring urgent issues to the attention of a system operator or security analyst.
- Search
Search is central to log-based investigations, whether for an operations use (such as system fault investigation) or a security use (a hacker or insider attack). The ability to search 100% of logs is key to all three uses of logs: security, compliance and operations. Such searches must be fast and easy, so that users are able to run them while under the pressure of a troubleshooting or security incident.
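To make the normalisation, summarisation and alerting ideas above a little more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a description of any product: the two log line formats, the `LoginEvent` record and the function names are all hypothetical, and real VPN or finance server logs will look different. The sketch maps both formats onto one common record, then flags any user with more than five failed logins in a row across the two sources.

```python
import re
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass
class LoginEvent:
    """Common (normalised) record for a login attempt from any source."""
    source: str
    user: str
    success: bool


# Hypothetical log formats, invented for this example only.
VPN_RE = re.compile(r"vpn: user=(?P<user>\S+) result=(?P<result>OK|FAIL)")
FIN_RE = re.compile(r"FINSRV login (?P<result>accepted|rejected) for (?P<user>\S+)")


def normalise(line: str) -> Optional[LoginEvent]:
    """Map a raw log line onto the common record, or None if unrecognised."""
    m = VPN_RE.search(line)
    if m:
        return LoginEvent("vpn", m["user"], m["result"] == "OK")
    m = FIN_RE.search(line)
    if m:
        return LoginEvent("finance", m["user"], m["result"] == "accepted")
    return None


def users_over_threshold(lines: Iterable[str], threshold: int = 5) -> Set[str]:
    """Return users with more than `threshold` consecutive failed logins."""
    streak = {}      # user -> current run of consecutive failures
    flagged = set()
    for line in lines:
        event = normalise(line)
        if event is None:
            continue  # not a login message we know how to read
        if event.success:
            streak[event.user] = 0  # a success breaks the run
        else:
            streak[event.user] = streak.get(event.user, 0) + 1
            if streak[event.user] > threshold:
                flagged.add(event.user)
    return flagged
```

Because the count runs on normalised records, three failures on the VPN followed by three on the finance server add up to one run of six for the same user, which neither source would reveal on its own.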
It is also important to note that log management for operations has its own unique requirements:
- Collection revisited
Faults are notoriously singular – this means that they occur once, but never again in quite the same manner. It is therefore very difficult to predict which log messages are going to be most useful for problem isolation, and most practitioners now admit it is best to keep all log data around for post-incident analysis. Consequently, the requirement to collect 100 per cent of the log messages from all log sources is even more important in operations than it is in security.
- Log browsing (data mining)
While for compliance an auditor may review the same report (say, failed logins) every quarter, no two troubleshooting sessions are quite the same. Problem isolation is an interactive process of trial and error. An administrator may look at the same data from many different angles before understanding the root cause – like examining a Rubik’s cube. Reports have to be customisable on the fly. Pre- and post-report filtering options are important to allow for dynamic report (re)-configuration. Search is important, but not sufficient, and you will likely want to combine search with access to normalised and cross-correlated information.
- Search (and reporting) speed
Speed truly matters when it comes to fault detection and problem isolation. Whether a forensic investigation takes one hour or one day or one week usually doesn’t really break the bank, but a down-time situation that persists for minutes or hours can be a matter of many millions of pounds in lost revenues. When troubleshooting a problem, every query must be very fast: whether indexed search or a report against normalised data, every second and every minute counts.
- GUI and Workflow
An external auditor looking at logs to verify that nobody improperly accessed credit card information is going to follow a very different workflow from an internal investigator examining a potential fraud case. Different again is the help-desk person who is trying to tell you why your e-mail isn’t being delivered or your VPN connection is so slow. For optimal functionality and productivity, the best graphical user interfaces and workflows are application specific.
- SOA-based portal or mash-up
The initial fault alarm will likely land with a help-desk employee, in the form of an HP Software (or equivalent) alert, a log alert or a phone call from an unhappy user. Either way, the first-level support person will attempt to perform some analysis. In many cases, truly understanding the problem requires access to log data. Without log automation, this could necessitate a phone call to a third-level support person, resulting in a long wait until the escalation manager returns the log analysis. However, in the brave new world of log analysis, the help-desk employee could access log data remotely with a single mouse-click, assuming the task is made easy enough. That probably means further customising the workflow and GUI to a particular customer’s situation. This is easy to do with today’s web 2.0 technologies and open web services APIs: a custom portal or mash-up can be created in days.
- SOA-based integration
Unlike the case of log management for security, mature consoles and dashboards exist for fault analysis. These event management systems even have correlation and alerting capabilities. Rather than replacing these systems with yet another console, most companies are going to look for the ability to integrate a new information source, log data in this case, into the existing fault management console. Web services likely will be the mechanism of choice.
- (Lack of) archiving
Keeping log data around for long periods of time is not a requirement. Data quickly loses its value after the fact. However, mining historical data patterns to predict future failures before they occur can be very valuable. This field is still in its infancy, but shows a lot of promise. Given enough data, both error data and fault data, predictive analysis is not far in the future.
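As a rough illustration of what mining historical patterns might look like in its simplest form, the Python sketch below compares the current ratio of denied firewall connections against a baseline built from past intervals. Everything here is an assumption made for the example: the function names, the per-interval (accepted, denied) counts and the z-score threshold of 3 are hypothetical, and a real system would need a far more careful statistical model.

```python
from statistics import mean, stdev
from typing import List, Tuple

Counts = Tuple[int, int]  # (accepted, denied) connections in one interval


def denied_ratio(accepted: int, denied: int) -> float:
    """Share of connections that were denied; 0.0 if there was no traffic."""
    total = accepted + denied
    return denied / total if total else 0.0


def is_unusual(history: List[Counts], current: Counts,
               z_threshold: float = 3.0) -> bool:
    """Flag the current interval if its denied ratio is far from the baseline.

    The threshold of 3 standard deviations is an illustrative default,
    not a recommendation.
    """
    ratios = [denied_ratio(a, d) for a, d in history]
    if len(ratios) < 2:
        return False  # not enough history to form a baseline
    mu = mean(ratios)
    sigma = stdev(ratios)
    cur = denied_ratio(*current)
    if sigma == 0:
        return cur != mu  # perfectly flat history: any change stands out
    return abs(cur - mu) / sigma > z_threshold
```

With a history in which roughly 5% of connections are denied, an interval where 40% are denied would be flagged, while one at 5.2% would not. The same shape of check could, in principle, be applied to error rates or other operational measures drawn from log data.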
It appears to me that the ideal technical architecture for log management recognises both similarities and differences of the various log management use cases (and there are many more than just security and operations). Would the ideal solution perhaps be a common log data platform that can collect, aggregate, summarise, normalise, index and apply basic analytics to log data once, while allowing for many different user experiences, depending on the use case?
So, as the sun is setting on HP OpenView (the name was changed to HP Software in 2007), a new dawn has broken for log management in operations! Hoorah!
