\section{Nagios Probe}
\label{s:nagios}
+Two Nagios probes exist: one checks the running \LB server, and the other performs qualitative checks on a standalone \LB Interlogger.
+
+\subsection{Nagios Probe for the \LB server}
+
There is a Nagios probe to check the service status of an \LB server node. It is distributed from the EMI repository and the name of the package is \texttt{emi-lb-nagios-plugins}.
-\subsection{Tests Performed}
+\subsubsection{Tests Performed}
Before starting the actual test the probe checks for existence and validity of a proxy certificate, and for availability of commands (essential system commands, various \LB Client commands and grid proxy manipulation commands).
The following tests are performed by the probe. They check the working status of the individual processes running on the \LB server node:
The test also tries to drop the test notification and purge the test job to clean up after itself. However, purging the job will probably not be allowed by the \LB server's policy, and the test job will remain registered on the \LB server until removed by a regular purge.
-\subsection{Return Values}
+\subsubsection{Return Values}
Return values follow the Nagios pattern:
\begin{description}
\item[3] The service status is unknown, probe could not run
\end{description}
-\subsection{Console Output}
+\subsubsection{Console Output}
Text output indicates the results of the probe and gives a more detailed description of failure causes.
The probe can return one of the following:
& \emph{Event Delivery Chain (Logger/Interlogger) Not Running} & The server process is running but events are not being delivered by \LB's local logger/interlogger. Check the Logger and the Interlogger. \\
& \emph{Notification Interlogger Not Running} & Events are being delivered correctly and the server responds properly to status queries, but it is not delivering notification messages. The notification interlogger is probably not running.\\
\hline
+\end{tabularx} %Quick and dirty solution
+\begin{tabularx}{\textwidth}{|c|p{5cm}|X|}
+\hline
\multirow{5}{*}{UNKNOWN} & \emph{Probe timed out} & The probe was unable to finish within the allotted time. Consider increasing the timeout with \texttt{-t}. The minimum reasonable value is 10\,s. \\
& \emph{No server specified} & Server address was not specified when running the probe. Give one with \texttt{-H}. \\
& \emph{Probe could not write temporary files} & The temporary directory was not writable. Check the default location or specify a new one with \texttt{-T}.\\
\hline
\end{tabularx}
-\subsection{Running the Probe}
+\subsubsection{Running the Probe}
-\subsubsection{Command Line Arguments}
+\paragraph{Command Line Arguments}
The probe recognizes the following command line arguments:
\begin{tabularx}{\textwidth}{l l X}
\texttt{-x} & \texttt{-{}-proxy} & User proxy file. It only needs to be specified if the proxy can be found neither at the default location nor through environmental variables. \\
\end{tabularx}
-\subsubsection{Environmental Variables}
+\paragraph{Environmental Variables}
In essence, the probe recognizes the same environmental variables as the \LB client. No environmental variables need to be set if the hostname is specified as a command line argument to the probe.
\begin{tabularx}{\textwidth}{p{4.5cm} X}
\texttt{X509\_USER\_PROXY} & Alternative location of the user's proxy certificate to use in the test.
\end{tabularx}
-\subsubsection{Sample Nagios Service Definition}
+\paragraph{Sample Nagios Service Definition}
Simple definition to be included in \texttt{/etc/nagios/commands.cfg}:
\begin{verbatim}
\end{verbatim}
+\subsection{Nagios Probe for the \LB InterLogger}
+
+This probe is intended for checking runtime parameters of an \LB interlogger. It is distributed from the EMI repository and the name of the package is \texttt{emi-lb-nagios-plugins}. It must be run on the target machine itself; it cannot perform measurements remotely.
+
+
+\subsubsection{Tests Performed}
+
+The following tests are performed:
+
+\begin{enumerate}
+\item The total size of interlogger files (waiting for processing) is measured and compared to pre-set thresholds.
+\item The age of the interlogger socket is read and compared to the socket timeout.
+\end{enumerate}
+
+\subsubsection{Return Values}
+Return values follow the Nagios pattern:
+
+\begin{description}
+\item[0] The service is running normally
+\item[1] The service is running but there were warnings
+\item[2] The service status is critical
+\item[3] The service status is unknown, probe could not run
+\end{description}
+
+\subsubsection{Console Output}
+Text output indicates the results of the probe and gives a more detailed description of failure causes.
+
+The probe can return one of the following:
+
+\begin{tabularx}{\textwidth}{|c|p{5cm}|X|}
+\hline
+WARNING & \emph{Total of InterLogger files exceeds bounds} & The total size of IL files is higher than the warning threshold. It is possible that the files are not being processed.\\\hline
+\multirow{2}{*}{CRITICAL} & \emph{Total of InterLogger files exceeds bounds} & The total size of IL files is higher than the critical threshold. It is possible that the files are not being processed.\\
+& \emph{Interlogger socket not being refreshed anymore} & The socket has not been refreshed for over 60~s. Under normal circumstances, a running Interlogger refreshes the socket every second.\\\hline
+UNKNOWN & \emph{Some commands are not available} & The probe could not run. Some of the required commands are not present on the system. Run the probe from the command line with \texttt{-v[vv]} and check the output. \\\hline
+\end{tabularx}
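+
+As an informal summary of the table above, the reported state can be sketched as a simple decision rule. The symbols are introduced here only for illustration and are not used by the probe itself: $S$ is the total size of the IL files, $w$ and $c$ are the warning and critical thresholds, $a$ is the age of the IL socket and $\tau$ is the socket timeout. The order in which the conditions are evaluated, and the OK state when none of them holds, are assumptions rather than documented behaviour.
+
+\[
+\mathrm{state} = \left\{
+\begin{array}{ll}
+\mathrm{UNKNOWN} & \mbox{if a required command is missing,} \\
+\mathrm{CRITICAL} & \mbox{if } S > c \mbox{ or } a > \tau, \\
+\mathrm{WARNING} & \mbox{if } S > w, \\
+\mathrm{OK} & \mbox{otherwise.}
+\end{array}
+\right.
+\]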
+
+\paragraph{Command Line Arguments}
+The probe recognizes the following command line arguments:
+
+\begin{tabularx}{\textwidth}{l l X}
+ \texttt{-h} & \texttt{-{}-help} & Print out simple console help \\
+ \texttt{-v[v[v]]} & \texttt{-{}-verbose} & Set verbosity level (\texttt{-{}-verbose} denotes a single \texttt{v}). \\
+ \texttt{-t} & \texttt{-{}-timeout} & Timeout in seconds.\\
+ \texttt{-f} & \texttt{-{}-file-prefix} & Path and prefix for event files \\
+ \texttt{-s} & \texttt{-{}-sock} & Path and prefix for IL socket \\
+ \texttt{-S} & \texttt{-{}-sock-timeout} & Timeout for the IL socket (default 60\,s) \\
+ \texttt{-w} & \texttt{-{}-warning} & Log file size limit (kB) to trigger warning (default 10\,MB) \\
+ \texttt{-c} & \texttt{-{}-critical} & Log file size limit (kB) to trigger state critical (default 128\,MB) \\
+% \texttt{-W} & \texttt{--t-warning} & Time (s) elapsed since last contact to trigger warning \\
+% \texttt{-C} & \texttt{--t-critical} & Time (s) elapsed since last contact to trigger state critical \\
+ \texttt{-P} & \texttt{-{}-proxy} & Use default prefix for Proxy Interlogger files rather than regular IL \\
+ \texttt{-N} & \texttt{-{}-notif} & Use default prefix for Notification Interlogger files rather than regular IL \\
+ \end{tabularx}
+