\section{Introduction}
-%This Section should give a brief summary of what the service described
-%is supposed to achieve and how it fits into the overall
-%architecture. It should contain subsections on the service internal
-%architecture and its relations to other services.
-
-The Logging and Bookkeeping (\LB) service tracks jobs managed by
-the gLite WMS (workload management system).
-It gathers events from various WMS components in a~reliable way
-and processes them in order to give a~higher level view, the
-\emph{status of job}.
-
-Virtually all the important
-data are fed to \LB\ internally from various gLite middleware
-components, transparently from the user's point of view.
-On the contrary, \LB\ provides public interfaces for querying the job
-information synchronously as well as registering for asynchronous
-notifications.
-API's for this functionality are described in this document in detail.
-
-\endinput
+
+%history: created for the WMS in EDG, 1st and 2nd versions (seq. numbers,
+%cache and state queries), in EGEE gLite---stabilization, proxy
+
+The Logging and Bookkeeping service (\LB\ for short) was initially
+developed in the EU DataGrid
+project\footnote{\url{http://eu-datagrid.web.cern.ch/eu-datagrid/}}
+as a~part of the
+Workload Management System (WMS).
+The development continues in the EGEE and EGEE-II projects,
+where \LB became an independent part of the gLite middleware~\cite{glite}.
+
+\LB's primary purpose is tracking WMS jobs as they are processed by
+individual Grid components, not counting on the WMS to provide this data.
+The information gathered from individual sources is collected, stored in
+a database and made available at a single contact point. The user gets a
+complete view of her job without the need to inspect several service logs
+(which she may not be authorized to see in their entirety, or whose
+existence she may not even be aware of).
+
+While \LB keeps track of submitted and running jobs, the information is
+retained by the \LB service even after the job has finished (successfully
+completed its execution, failed, or was canceled for any reason). The
+information usually remains available for several days after the last event
+related to the job arrived, to give the user an opportunity to check the
+final job state and possibly evaluate the reasons for a failure.
+
+As \LB also collects information provided by the WMS, the WMS services are
+no longer required to provide a job-state querying interface. Most of the
+WMS services can even be designed as stateless---they process a~job and
+pass it over to another service, keeping no state information about the job
+afterwards. During development and deployment of the first WMS version this
+approach turned out to be essential in order to scale the services to the
+required extent~\cite{jgc}.
+
+\LB must collect information about all important events in the life of a Grid
+job. These include transitions between components or services, results
+of matching and brokerage, waiting in queueing systems, or the start and end of
+actual execution.
+We decided to achieve this
+goal by providing an API (and the associated library) and
+instrumenting individual WMS services and other Grid components with direct
+calls to this API. But as \LB is a centralized service (there exists
+a single point where all information about a particular job must
+eventually arrive), direct synchronous transfer of data could have
+a prohibitive impact on the WMS operation.
+A temporary unavailability or overload of the remote \LB service
+must not prevent (nor block) the instrumented service from performing as usual.
+An asynchronous model with clearly defined \emph{asynchronous delivery
+semantics}, see Sect.~\ref{gathering}, is used to address this issue.
+
+As individual Grid components have only a local and transient view of a
+job, they are able to send only information about individual events. This
+raw, fairly complex information is not
+in a~form suitable to be presented to the user for frequent queries. It must
+be processed at the central service, and users are primarily presented with the
+processed form, which is derived from the \emph{job state} and its
+transitions, not from the job events themselves. The raw information is
+still available in case a more detailed insight is necessary.
+
+While the removal of state information from (some of) the WMS services
+helped to achieve the high scalability of the whole WMS, the state
+information is still essential for the decisions made within the resource
+broker or during the matchmaking process.
+\Eg a decision on job resubmission is usually affected by the number of
+previous resubmission attempts. This kind of information is currently
+available only in the \LB, so the next ``natural'' requirement has been
+to provide an interface through which WMS (and other) services can query
+the \LB for the state information. However, this requirement brings two
+complications: (i)~due to the asynchronous event delivery model, the \LB
+information may not be up to date and remote queries may lead to unexpected
+results
+(or even inconsistent ones---some older information may not be available for
+one query but may arrive before a subsequent query is issued),
+and (ii)~the dependence on a~remote service to provide vital state information
+may block the local service if the remote one is not responding.
+These problems are addressed by providing a \emph{local view} on the \LB data,
+see Sect.~\ref{local}.
+
+
+
+
+\subsection{Concepts}
+
+\subsubsection{Jobs and events}
+To keep track of user jobs on the Grid, we first need some reliable
+way to identify them. This is accomplished by assigning a unique
+identifier, which we call \emph{jobid} (``Grid jobid''), to every job
+before it enters the Grid. The jobid serves as the
+primary index that unambiguously identifies any Grid job. It is then
+passed between Grid
+components together with the job description as the job flows through
+the Grid; the components themselves may have (and usually do have) their
+own job identifiers, which are unique only within these components.
+
+Every Grid component dealing with the job during its lifetime
+may be a source of information about the job. The \LB gathers information
+from all the
+relevant components. This information is obtained in the form of
+\LB events, pieces of data generated by Grid components, which mark
+important points in the job lifetime (\eg the passing of job control
+between Grid components is an important milestone in the job lifetime,
+independently of the actual Grid architecture); see Appendix~\ref{a:events}
+for a~complete list. We collect these
+events, store them in a database and simultaneously process them to
+provide a higher-level view of the job's state. The \LB collects redundant
+information---the event scheme has been designed to be as redundant as
+possible---and this redundancy is used to improve resiliency in the
+presence of component or network failures, which are omnipresent on any
+Grid.
+
+The \LB events themselves are structured into \emph{attribute}~=
+\emph{value} pairs; the set of required and optional attributes is defined by the
+event \emph{type} (or scheme). For the purpose of tracking job status on
+the Grid, and with the knowledge of the WMS Grid middleware structure, we
+defined an \LB schema with specific \LB event
+types\footnote{\url{https://edms.cern.ch/document/571273/}}.
+The schema contains a common base: the attributes that must be assigned
+to every single event. The primary key is the jobid, which is also one of
+the required attributes. The other common attributes are currently the
+timestamp of the event origin, the name of the generating component, and the
+event sequence code (see Sect.~\ref{evprocess}).
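+
+For illustration, a single \LB event can be pictured as a flat list of
+attribute~=~value pairs. The following sketch (in Python, purely illustrative)
+shows a hypothetical \emph{Transfer}-like event in this spirit; the attribute
+names and values are simplified for the example and do not reproduce the exact
+\LB wire format defined by the schema cited above.
+
+\begin{verbatim}
+# A hypothetical "Transfer" event shown as attribute = value pairs.
+# Attribute names are illustrative only, not the literal L&B schema.
+event = {
+    # common attributes, required for every event type
+    "jobid":     "https://lb.example.org:9000/OirOGBzKevLGqvbq7r0Teg",
+    "timestamp": "2006-05-17T12:34:56Z",        # time of event origin
+    "source":    "NetworkServer",               # generating component
+    "seqcode":   "WM=0:CE=0",                   # simplified sequence code
+    # attributes specific to this event type
+    "type":        "Transfer",
+    "destination": "WorkloadManager",
+    "result":      "OK",
+}
+\end{verbatim}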
+
+While the necessary and sufficient condition for a global jobid is
+to be Grid-wide unique, an additional desired property relates to the
+transport of events through the network: all events belonging to the same
+job must be sent to the same \LB database. This must be decided on a~per-message
+basis, as each message may be generated by a different component.
+The same problem is encountered
+by users when they look for information about their job---they too need
+to know where to find the appropriate \LB database.
+While it is possible to devise a global service where each job registers
+its jobid together with the address of the appropriate database, such a
+service could easily become a bottleneck. We opted for another solution:
+to keep the address of the \LB database within the jobid. This way,
+finding the appropriate \LB database address becomes a local operation
+(at most parsing the jobid) and users can use the same mechanism when
+connecting to the \LB database to retrieve information about a particular
+job (users know its jobid). To simplify the situation even further,
+the jobid has the form of a URL, where the protocol part is
+``https'', the server and port identify the machine running the appropriate
+\LB server
+(database), and the path contains a base64-encoded MD5 hash of a random
+number, a timestamp, the PID of the generating process and the IP address of the
+machine where the jobid was generated. A jobid in this form can be
+used even in a web browser to obtain information about the job,
+provided the \LB database runs a web server interface. This jobid is
+reasonably unique---while in theory two different job identifications can
+have the same MD5 hash, the probability is low enough for this jobid to
+represent a globally unique job identification.
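+
+The structure of such a jobid can be illustrated by the following Python
+sketch. It is not the production implementation; the server name, port and the
+exact composition of the hashed data are merely examples of the scheme
+described above.
+
+\begin{verbatim}
+import base64, hashlib, os, random, socket, time
+
+def make_jobid(lb_server, lb_port=9000):
+    """Illustrative construction of a Grid jobid as an https URL.
+
+    The path is a base64-encoded MD5 hash of a random number, a timestamp,
+    the PID of the generating process and the IP address of the machine
+    where the jobid is generated."""
+    seed = "%d %f %d %s" % (random.randint(0, 2**31), time.time(),
+                            os.getpid(),
+                            socket.gethostbyname(socket.gethostname()))
+    digest = hashlib.md5(seed.encode()).digest()
+    # URL-safe base64 without padding, so it can form the URL path
+    unique = base64.urlsafe_b64encode(digest).decode().rstrip("=")
+    return "https://%s:%d/%s" % (lb_server, lb_port, unique)
+
+print(make_jobid("lb.example.org"))
+\end{verbatim}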
+
+%we are interested in the job, its global id, all data related to it, raw events
+%
+%multiple data sources for one job, redundancy, gathering at a single place
+
+\subsubsection{Event gathering}
+\label{gathering}
+%event sources, local logging semantics, store-and-forward
+
+As described in the previous section, information about jobs is
+gathered from all the Grid components processing the job, in the form
+of \LB events. The gathering is based on the \emph{push} model, where
+the components actively produce and send events. The push model
+offers higher performance and scalability than the pull model, where the
+components would have to be queried by the server. In the push model, the \LB
+server does not even have to know the event sources; it is sufficient
+to listen for and accept events on a defined interface.
+
+The event delivery to the destination \LB server is asynchronous and
+based on the store-and-forward model to minimize the performance
+impact on component processing. Only the local processing is synchronous:
+the \LB event is sent synchronously only to the nearest \LB component
+responsible for event delivery. This component
+is at worst located in the same local area network (LAN) and usually
+it runs on the same host as
+the producing component. The event is stored there (in persistent
+storage, a disk file) and a confirmation is sent back to the
+producing component. From the component's point of view, the
+event-sending operation is fast and reliable, but its success only means
+that the event was accepted for later delivery. The \LB delivery components
+then handle the event asynchronously and ensure its delivery to the
+\LB server even in the presence of network failures and host reloads.
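+
+The resulting semantics can be sketched as follows (Python, illustrative only;
+the file name and the retry period are arbitrary choices for the example):
+the logging call returns as soon as the event is appended to a local spool
+file, while a separate forwarder loop attempts the actual network delivery
+and simply retries later when the destination server is unreachable.
+
+\begin{verbatim}
+import json, time
+
+SPOOL = "events.spool"   # local persistent storage (a single file for brevity)
+
+def log_event(event):
+    """Synchronous part: store the event locally and return immediately.
+
+    Success only means the event was accepted for later delivery."""
+    with open(SPOOL, "a") as spool:
+        spool.write(json.dumps(event) + "\n")
+
+def deliver(event):
+    """Placeholder for the network transfer to the destination L&B server."""
+    raise NotImplementedError
+
+def forwarder_loop():
+    """Asynchronous part: store-and-forward delivery of spooled events.
+
+    Purging of already delivered events is omitted for brevity."""
+    while True:
+        with open(SPOOL) as spool:
+            pending = [json.loads(line) for line in spool]
+        for event in pending:
+            try:
+                deliver(event)
+            except Exception:
+                break            # server unreachable: keep events, retry later
+        time.sleep(60)
+\end{verbatim}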
+
+It is important to note that this transport system does not guarantee
+ordered delivery of events to the \LB server; it \emph{does} guarantee
+reliable and secure delivery, however. These guarantees are statistical
+only, as the protocol is not resilient to permanent disk or node crashes
+or to a complete purge of the data from the local disk. Being part of the
+trusted infrastructure, even the local \LB components should run on
+a trusted and maintained machine, where additional reliability may be
+obtained, \eg by a RAID disk subsystem.
+
+\subsubsection{Event processing}%
+\label{evprocess}
+
+%state diagram, mapping of events onto its edges
+
+%ordering of events -- seq. numbers, including shallow branches
+
+% ! abstract here, we do not have the components yet
+
+% events arrive, multiple sources, changed ordering (or even losses)
+% redundant information
+% motivation: spare the user, report an aggregated job state
+As described in the previous section, \LB gathers raw events from various
+Grid middleware components and aggregates them on a~single server
+on a per-job basis.
+The events contain very low-level, detailed information about the job
+processing at individual Grid components. This level of detail is
+valuable for tracking various problems with the job and/or the
+components, and as complementary events are gathered (\eg each job control
+transfer is logged independently by two components), the information is
+highly redundant. Moreover, the events may arrive in the wrong order,
+making the interpretation of the raw information difficult and not
+straightforward.
+Users, on the other hand, are interested in a much higher-level view, the
+overall state of their job.
+
+For these reasons the raw events undergo complex processing, yielding
+a~high-level view, the \emph{job state}, which is the primary type of data
+presented to the user.
+Various job states form nodes of the job state diagram (Fig.~\ref{f:jobstat}).
+See Appendix~\ref{a:jobstat} for a list of the individual states.
+
+% state machine
+% figure: state diagram
+
+\begin{figure}
+\centering
+\includegraphics[width=.8\hsize]{images/wms2-jobstat}
+\caption{\LB\ job state diagram}
+\label{f:jobstat}
+\end{figure}
+
+% event type -> changes of the state type
+\LB\ defines a~\emph{job state machine} that is responsible for updating
+the job state on receiving a~new event.
+The logic of this algorithm is non-trivial; the rest of this section deals
+with its main features.
+
+Transitions between the job states happen on receiving events of a particular
+type coming from particular sources.
+There may be several distinct events assigned to a~single edge of the state diagram.
+For instance, the job becomes \emph{Scheduled} when it enters the batch system
+queue of a~Grid computing element.
+This fact is witnessed either by a \emph{Transfer/OK} event reported by
+the job submission service or by an \emph{Accept} event reported by the computing
+element. Receiving any one of these events (in any order) triggers the
+state change.
+
+% fault tolerance
+This way, the state machine is highly fault-tolerant---it can cope with
+delayed, reordered or even lost events.
+For example, when a~job is in the \emph{Waiting} state and the \emph{Done}
+event arrives, it is not treated as an inconsistency; instead it is assumed that
+the intermediate events are delayed or lost, and the job state is switched
+to the \emph{Done} state directly.
+
+% events carry attributes, which are projected into the state
+The \LB events carry various common and event-type-specific attributes,
+\eg \emph{timestamp} (common) or \emph{destination} (\emph{Transfer} type).
+The job state record contains, besides the major state identification,
+similar attributes, \eg
+an array of timestamps indicating when the job entered each state,
+or \emph{location}---the identification of the Grid component which is currently
+handling the job.
+Updating the job state attributes is also the task of the state machine,
+employing the above-mentioned fault tolerance---although a~delayed event
+cannot switch
+the major job state back,
+it may still carry valuable information to update the job state attributes.
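+
+The following fragment (Python, grossly simplified; the states, triggers and
+attribute handling are reduced to a few illustrative cases) sketches the core
+of such an update step: a delayed event, detected by comparing logical
+timestamps, is not allowed to move the major state backwards, yet its
+attributes may still refine the job state record.
+
+\begin{verbatim}
+# Simplified major states in the order they are normally reached.
+ORDER = ["Submitted", "Waiting", "Ready", "Scheduled", "Running", "Done"]
+
+# Illustrative mapping of (source, event type) to the state it witnesses.
+TRIGGERS = {
+    ("JobSubmission", "Transfer/OK"): "Scheduled",
+    ("ComputingElement", "Accept"):   "Scheduled",
+    ("ComputingElement", "Run"):      "Running",
+    ("ComputingElement", "Done"):     "Done",
+}
+
+def update(state, event):
+    """Feed one event into the job state record (a plain dict here)."""
+    new = TRIGGERS.get((event["source"], event["type"]))
+    # logical timestamps: comparison of sequence codes
+    delayed = event["seqcode"] < state.get("seqcode", "")
+    if new and not delayed and ORDER.index(new) > ORDER.index(state["major"]):
+        state["major"] = new                 # move the major state forward
+    if not delayed:
+        state["seqcode"] = event["seqcode"]
+    # even a delayed event may still contribute attribute values
+    if new:
+        state.setdefault("timestamps", {}).setdefault(new, event["timestamp"])
+    return state
+
+state = {"major": "Submitted"}
+update(state, {"source": "ComputingElement", "type": "Run",
+               "seqcode": "WM=3:CE=2", "timestamp": "..."})
+assert state["major"] == "Running"
+\end{verbatim}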
+
+\subsubsection{Event ordering}%
+\label{evorder}
+
+As described above, the ability to correctly order arriving events is
+essential for the job state computation.
+As long as the job state diagram was acyclic (which was true for the
+initial WMS release), each event had its unique place in the expected sequence,
+hence event ordering could always be done implicitly from the
+context.
+However, this approach ceased to be applicable once job resubmission,
+which yields cycles in the job state diagram, was introduced.
+
+Event ordering relying on timestamps assigned to events upon
+their generation, assuming strict clock synchronization over the Grid,
+turned out to be a~naive approach.
+Clocks on real machines are not precisely synchronized and there are no reliable
+ways to enforce synchronization across administrative domains.
+
+To demonstrate how desynchronized clocks may lead to
+a wrong event interpretation, let us consider a~simplified example
+in Tab.~\ref{t:cefail}.
+We assume that the workload manager (WM) sends the job to a~computing element
+(CE)~A, where it starts running but dies in the middle of its execution.
+The failure is detected and the job is resubmitted back to the WM, which then sends it to CE~B.
+However, if A's clock is ahead of time and B's clock is correct (which
+means behind A's clock), the events in the right column are treated
+as delayed. The state machine will interpret the events incorrectly, assuming
+the job was run on B before being sent to A.
+The job would always (assuming A's events arrive at the \LB before B's events)
+be reported as ``\emph{Running} at A'' although
+the real state should follow the \emph{Waiting} \dots \emph{Running} sequence.
+Even the \emph{Done} event can be sent by B with a timestamp claiming
+it happened before the job was submitted to A, and the job state
+will end with a discrepancy---the job is reported to have finished on B while
+still being reported to run on A.
+
+\begin{table}[hb]
+\begin{tabular}{rlrl}
+1.&WM: Accept&
+6.&WM: Accept\\
+2.&WM: Match $A$&
+7.&WM: Match $B$\\
+3.&WM: Transfer to $A$&
+8.&WM: Transfer to $B$\\
+4.&CE~$A$: Accept &
+9.&CE~$B$: Accept \\
+5.&CE~$A$: Run &
+10.&CE~$B$: Run \\
+\dots & $A$ dies\\
+\end{tabular}
+\caption{Simplified \LB events in the CE failure scenario}
+\label{t:cefail}
+\end{table}
+
+Therefore we looked for a~more robust and general solution. We can
+detect a severe clock bias if the timestamp on an event lies in the future
+with respect to the time on the \LB server, but this is generally a dangerous
+approach (the \LB server clock could be severely behind the real time).
+We decided not to rely on absolute time as reported by timestamps, but to
+introduce a kind of \emph{logical time} that is associated with the logic
+of event generation.
+The principal idea is arranging the pass through the job state
+diagram (corresponding to a~particular job life), which may include
+loops, into an execution tree that represents the job history.
+Closing a~loop in the pass through the state diagram corresponds
+to forking a~branch in the execution tree.
+The scenario in Tab.~\ref{t:cefail} is mapped to the tree in
+Fig.~\ref{f:seqtree}.
+The approach is quite general---any finite pass through any state
+diagram (finite directed graph) can be encoded in this way.
+
+\begin{figure}
+\centering
+\includegraphics[scale=.833]{images/seqtree}
+\caption{Job state sequence in the CE failure scenario, arranged into a~tree.
+Solid lines form the tree, arrows show state transitions.}
+\label{f:seqtree}
+\end{figure}
+
+Our goal is augmenting \LB events with sufficient information that
+\begin{itemize}
+ \item uniquely identifies a~branch of the execution tree,
+ \item determines the sequence of events on the branch,
+ \item orders the branches themselves, which means that it determines
+ which one is more recent.
+\end{itemize}
+If such information is available, the execution tree can be
+reconstructed on the fly as the events arrive, and even delayed events
+are sorted into the tree correctly. An incoming event is considered
+for job state computation only if it belongs to the most recent
+branch.
+
+The situation becomes even more complicated when
+the \emph{shallow resubmission} advanced WM feature is enabled.
+In this mode the WM may resubmit the job before being sure the previous attempt
+is really unsuccessful, potentially creating multiple parallel instances
+of the job.
+The situation maps to several branches of the execution tree that
+really evolve in parallel.
+However, only one of the job instances finally becomes active (really running);
+the others are aborted.
+Because the choice of the active instance is made later,
+it may not correspond to the most recent execution branch.
+Therefore, when an event indicating the choice of the active instance arrives,
+the job state must be recomputed, using the corresponding active branch
+instead of the most recent one.
+
+Section~\ref{seqcode} describes the current implementation of the event
+ordering mechanism based on the ideas presented here.
+
+\subsubsection{Queries and notifications}\label{retrieve}
+
+According to the GMA (Grid Monitoring Architecture) classification, the user
+retrieves data from the infrastructure in two modes, called
+\emph{queries} and \emph{notifications} in~\LB.
+
+Querying \LB is fairly straightforward---the user specifies query
+conditions, connects to the querying infrastructure endpoint, and
+receives the results.
+For ``single job'' queries, where the jobid is known, the endpoint (the
+appropriate \LB server) is inferred from the jobid.
+More general queries must specify the \LB server explicitly,
+and their semantics is intentionally restricted
+to ``all such jobs known here''.
+We trade off generality for performance and reliability,
+leaving the problem of finding the right query endpoint(s), the right
+\LB servers, to higher-level information and service-discovery services.
+
+If the user is interested in one or more jobs, frequent polling of the
+\LB server may be cumbersome for the user and creates unnecessary load
+on the server. A notification subscription is therefore available,
+allowing users to subscribe to receive a notification whenever a~job
+starts matching user-specified conditions.
+Every subscription also contains the location of the user's
+listener;
+a successful subscription returns a time-limited \emph{notification handle}.
+During the validity period of the subscription, the \LB infrastructure
+is responsible for queuing and reliable delivery of the notifications.
+The user may even re-subscribe (providing the original handle) with a different
+listener location (\eg when moving from office to home), and \LB re-routes
+the notifications generated in the meantime to the new destination.
+The \LB event delivery infrastructure is reused for the notification
+transport.
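+
+In outline, the notification machinery can be pictured as follows (a Python
+sketch with invented helper names, not the actual \LB interfaces): a
+subscription stores the matching condition and the listener address under a
+time-limited handle, and every newly computed job state is matched against the
+active subscriptions.
+
+\begin{verbatim}
+import itertools, time
+
+subscriptions = {}                    # handle -> subscription record
+handles = itertools.count(1)
+
+def subscribe(condition, listener, validity=3600):
+    """Register a listener; return a time-limited notification handle."""
+    handle = "notif-%d" % next(handles)
+    subscriptions[handle] = {"condition": condition, "listener": listener,
+                             "expires": time.time() + validity, "queue": []}
+    return handle
+
+def resubscribe(handle, new_listener):
+    """Re-route pending and future notifications to a new listener."""
+    subscriptions[handle]["listener"] = new_listener
+
+def on_job_state_change(job_state):
+    """Called whenever an event arrives and the job state is recomputed."""
+    now = time.time()
+    for sub in subscriptions.values():
+        if now < sub["expires"] and sub["condition"](job_state):
+            sub["queue"].append(job_state)   # queued until reliably delivered
+
+# Example: notify when one particular job reaches a final state.
+h = subscribe(lambda s: s["jobid"].endswith("/OirOGBzKevLGqvbq7r0Teg")
+                        and s["major"] in ("Done", "Aborted"),
+              listener=("listener.example.org", 12345))
+\end{verbatim}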
+
+\subsubsection{Local views}\label{local}
+% motivace proxy
+
+%As outlined in Sect.~\ref{reqs}
+WMS components are, besides logging
+information into \LB, interested in querying this information back in order
+to avoid the need of keeping per-job state information themselves.
+However, although the required information is present in \LB,
+the standard mode of \LB operation is not suitable for this purpose for
+the following reasons:
+\begin{itemize}
+\item The query interface is provided by the \LB component which gathers
+events belonging to the same job but coming from different sources.
+Typically, this is a~remote service with respect to the event sources (WMS components).
+Therefore the query operation is sensitive to any network failure that may
+occur, blocking the operation of the querying service for an indefinite time.
+\item Due to the asynchronous logging semantics, there is a~non-zero time
+window between the successful completion of the logging call and the point in
+time when the logged event starts affecting the query result.
+This semantics may yield unexpected, seemingly inconsistent outcomes.
+\end{itemize}
+
+The problem can be overcome by introducing a \emph{local view} on job data.
+Besides forwarding events to
+the server, where events belonging to a~job are gathered from multiple sources,
+the \LB infrastructure can store the logged events temporarily
+on the event source, and perform the processing described
+in Sect.~\ref{evprocess} there.
+In this setup, the logging vs.\ query semantics can be synchronous---it is
+guaranteed that a~successfully logged event is reflected in the result of
+an immediately following query,
+because no network operations are involved.
+Only events coming from this particular physical node (but potentially
+from all services running there) are considered, hence the locality of the view.
+On the other hand, certain \LB events are designed to contain redundant
+information, therefore the local view on processed data (the job state)
+becomes virtually complete on a~reasonably rich \LB data source like
+the Resource Broker node.
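+
+The difference from the remote server can be summarized by a toy model
+(Python, illustrative): logging and querying share one local store, so there
+is no asynchronous delivery window between a successful logging call and a
+subsequent query.
+
+\begin{verbatim}
+class LocalView:
+    """Toy model of the local-view semantics: log and query share one store."""
+
+    def __init__(self):
+        self.events = {}             # jobid -> events logged on this node
+
+    def log(self, event):
+        # synchronous: the event is stored before the call returns
+        self.events.setdefault(event["jobid"], []).append(event)
+        return "OK"
+
+    def job_state(self, jobid):
+        # computed only from locally logged events -- hence a *local* view
+        evs = self.events.get(jobid, [])
+        return {"jobid": jobid, "n_events": len(evs),
+                "last": evs[-1]["type"] if evs else None}
+
+view = LocalView()
+view.log({"jobid": "https://lb.example.org:9000/abc", "type": "Accept"})
+# a successfully logged event is immediately visible to the query
+assert view.job_state("https://lb.example.org:9000/abc")["last"] == "Accept"
+\end{verbatim}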
+
+
+\subsection{Current \LB implementation}
+The principal components of the \LB service and their interactions
+are shown in Figures~\ref{f:comp-gather} (gathering and transferring
+\LB events) and~\ref{f:comp-query} (\LB query and notification services).
+
+\begin{figure}
+\centering
+\includegraphics[scale=.5]{images/LB-components-gather}
+\caption{\LB components involved in gathering and transferring the events}
+\label{f:comp-gather}
+\end{figure}
+
+\subsubsection{\LB API and library}
+Both logging events and querying the service are implemented via
+calls to a~public \LB API.
+The complete API (both logging and queries)
+is available as an ANSI~C binding; most of the querying capabilities are also available in C++.
+These APIs are provided as sets of C/C++ header files and shared libraries.
+The library implements the communication protocol with other \LB components
+(logger and server), including encryption, authentication, etc.
+
+We do not describe the API here in detail; it is documented in the \LB User's
+Guide\footnote{\url{https://edms.cern.ch/file/571273/1/LB-guide.pdf}},
+including a complete reference and both simple and complex usage examples.
+
+Events can also be logged with a~standalone program (which in turn uses the C~API),
+intended for use in scripts.
+
+The query interface is also available as a~web-service provided by the
+\LB server (Sect.~\ref{server}).
+
+\subsubsection{Logger}
+The task of the \emph{logger} component is taking over the events from
+the logging library, storing them reliably, and forwarding them to the destination
+server.
+The component should be deployed very close to each source of events---ideally on the
+same machine, or, in the case of computing elements with many
+worker nodes, on the head node of the cluster%
+\footnote{In this setup the logger also serves as an application proxy,
+overcoming networking issues like a private address space of the worker nodes,
+blocked outbound connectivity, etc.}.
+
+Technically the functionality is realized with two daemons:
+\begin{itemize}
+\item \emph{Local-logger} accepts incoming events,
+appends them to a~plain disk file (one file per Grid job),
+and forwards them to the inter-logger.
+It is kept as simple as possible in order to achieve
+maximal reliability.
+\item \emph{Inter-logger} accepts the events from the local-logger,
+implements the event routing (currently trivial as the destination
+address is a~part of the jobid), and manages
+delivery queues (one per destination server).
+It is also responsible for crash recovery---on startup, the queues are
+populated with undelivered events read from the local-logger files.
+Finally, the inter-logger purges the files when the events are delivered to
+their final destination.
+\end{itemize}
+
+\subsubsection{Server}
+\label{server}
+The \emph{\LB server} is the destination component where the events are delivered,
+stored and processed to be made available for user queries.
+The server storage backend is implemented using a MySQL database.
+
+Incoming events are parsed, checked for correctness, authorized (only the job
+owner can store events belonging to a~particular job), and stored in the
+database.
+In addition, the current state of the job is retrieved from the database,
+the event is fed
+into the state machine (Sect.~\ref{evprocess}), and the job state is updated
+accordingly.
+
+On the other hand, the server exposes the querying interface (Fig.~\ref{f:comp-query}, Sect.~\ref{retrieve}).
+The incoming user queries are transformed into SQL queries on the underlying
+database engine.
+The query result is filtered, authorization rules are applied, and the result is
+sent back to the user.
+
+Although an SQL database is used, its full query power is not made available
+to end users.
+In order to avoid either intentional or unintentional denial-of-service
+attacks, the queries are restricted in such a~way that the transformed SQL
+query must hit a~highly selective index on the database.
+Otherwise the query is refused, as a full database scan would yield an unacceptable
+load.
+The set of indices is configurable, and it may involve both \LB system
+attributes (\eg job owner, computing element,
+timestamps of entering a particular state,~\dots) and user-defined ones.
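+
+The idea behind the restriction can be sketched as follows (Python,
+simplified; the real server works with its own query-language parser, a
+configurable index list and MySQL): a query is accepted only if at least one
+of its conditions refers to an indexed attribute, otherwise it is refused
+before any SQL is issued.
+
+\begin{verbatim}
+# Example set of attributes for which a selective index is configured.
+INDEXED = {"jobid", "owner", "destination"}
+
+def to_sql(conditions):
+    """conditions: list of (attribute, operator, value) tuples (simplified).
+
+    Refuse the query unless it is guaranteed to hit an index, so that
+    a user query can never trigger a full table scan."""
+    if not any(attr in INDEXED for attr, _op, _val in conditions):
+        raise ValueError("query refused: no indexed attribute used")
+    where = " AND ".join("%s %s %%s" % (attr, op) for attr, op, _ in conditions)
+    params = [value for _, _, value in conditions]
+    return "SELECT jobid FROM states WHERE " + where, params
+
+sql, params = to_sql([("owner", "=", "/DC=org/DC=example/CN=Jane Doe"),
+                      ("status", "=", "Running")])
+\end{verbatim}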
+
+The server also maintains the active notification handles
+(Sect.~\ref{retrieve}), providing the subscription interface to the user.
+Whenever an event arrives and the updated job state is computed,
+it is matched against the active handles%
+\footnote{The current implementation enforces specifying an~actual jobid
+in the subscription, hence the matching has minimal performance impact.}.
+Each match generates a~notification message, an extended \LB event
+containing the job state data, the notification handle,
+and the current user's listener location.
+The event is passed to the \emph{notification inter-logger}
+via a persistent disk file as well as directly (see Fig.~\ref{f:comp-query}).
+The daemon delivers events in the standard way, using the specified
+listener as the destination.
+In addition, the server generates control messages when the user re-subscribes,
+changing the listener location.
+The inter-logger recognizes these messages, and changes its routing of all
+pending events belonging to this handle accordingly.
+
+
+% asi nepotrebujeme \subsubsection{Clients}
+
+\subsubsection{Proxy}
+The \emph{\LB proxy} is the implementation of the local view concept
+(Sect.~\ref{local}).
+When deployed (on the Resource Broker node in the current gLite middleware)
+it takes over the role of the local-logger daemon---it accepts the incoming
+events, stores them in files, and forwards them to the inter-logger.
+
+In addition, the proxy provides the principal functionality of the \LB server,
+\ie processing events into the job state and providing a~query interface,
+with the following differences:
+\begin{itemize}
+\item only events coming from sources on this node are considered; hence
+the job state may be incomplete,
+\item the proxy is accessed through a local UNIX-domain socket instead of a network
+interface,
+\item no authorization checks are performed---the proxy is intended for
+privileged access only (enforced by the file permissions on the socket),
+\item an aggressive purge strategy is applied---whenever a~job reaches
+a~known terminal state (which means that no further events are expected), it is purged
+from the local database immediately,
+\item no index checks are applied---we both trust the privileged parties
+and do not expect the database to grow, thanks to the purge strategy.
+\end{itemize}
+
+\subsubsection{Sequence codes for event ordering}%
+\label{seqcode}
+
+As discussed in Sect.~\ref{evorder}, sequence codes are used as logical
+timestamps to ensure proper event ordering on the \LB server. The
+sequence code counter is incremented whenever an event is logged and the
+sequence code must be passed between individual Grid components together
+with the job control.
+However, a single-valued counter is not sufficient to support the detection of
+branch forks within the execution tree.
+Considering again the Computing Element failure scenario described
+in Sect.~\ref{evorder}, there is no way to know that the counter value of
+the last event logged by the failed CE~A is 5 (Tab.~\ref{t:cefail}).
+
+\begin{table}[b]
+\begin{tabular}{rlrl}
+1:x&WM: Accept&
+4:x&WM: Accept\\
+2:x&WM: Match $A$&
+5:x&WM: Match $B$\\
+3:x&WM: Transfer to $A$&
+6:x&WM: Transfer to $B$\\
+3:1&CE~$A$: Accept &
+6:1&CE~$B$: Accept \\
+3:2&CE~$A$: Run &
+6:2&CE~$B$: Run \\
+\dots & $A$ dies\\
+\end{tabular}
+\caption{The same CE failure scenario: hierarchical sequence codes.
+``x'' denotes an undefined and unused value.}
+\label{t:cefail2}
+\end{table}
+
+
+Therefore we define a~hierarchical \emph{sequence code}---an array of
+counters, each corresponding to a~single Grid component class handling the job%
+\footnote{Currently the following gLite components: Network Server, Workload
+Manager, Job Controller, Log Monitor, Job Wrapper, and the application itself.}.
+Table~\ref{t:cefail2} shows the same scenario with a~simplified two-counter
+sequence code. The counters correspond to the WM and CE component classes
+and are incremented whenever the respective component logs an event.
+When the WM receives the job back for resubmission,
+the CE counter becomes irrelevant (as the job control is now at the WM),
+and the WM counter is incremented again.
+
+%Events on a~branch are ordered following the lexicographical order
+%of the sequence codes.
+%Branches are identified according to the WM counter as WM is
+%currently the only component where branching can occur.
+
+The state machine keeps the current (highest seen) code for the job,
+and is thus able to detect a~delayed event by a simple lexicographic comparison
+of the sequence codes.
+Delayed events are then not used for the major state computation.
+Using two additional assumptions (that hold for the current
+implementation):
+\begin{itemize}
+ \item events coming from a~single component arrive in order,
+ \item the only branching point is the WM,
+\end{itemize}
+it is safe to qualify events with a lower WM counter (than the already
+received one) as belonging to inactive
+branches, and hence ignore them even for the update of job state attributes.
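+
+The comparison logic can be sketched as follows (Python, using the simplified
+two-counter codes of Tab.~\ref{t:cefail2} rather than the full gLite
+component list): codes are compared lexicographically as tuples, and an event
+whose WM counter is lower than the highest WM counter seen so far is
+classified as belonging to an inactive branch.
+
+\begin{verbatim}
+def parse(code):
+    """Parse a simplified sequence code such as 'WM=6:CE=2' into a tuple."""
+    counters = dict(part.split("=") for part in code.split(":"))
+    return (int(counters["WM"]), int(counters["CE"]))
+
+def classify(event_code, current_code):
+    """Return 'current', 'delayed' or 'inactive-branch' for an incoming event."""
+    ev, cur = parse(event_code), parse(current_code)
+    if ev[0] < cur[0]:
+        return "inactive-branch"  # lower WM counter: earlier, abandoned branch
+    if ev < cur:
+        return "delayed"          # same branch, but older than already seen
+    return "current"
+
+# CE A's 'Run' event (WM counter 3) arriving after the resubmission
+# events (WM counter 6) have already been processed:
+assert classify("WM=3:CE=2", "WM=6:CE=2") == "inactive-branch"
+assert classify("WM=6:CE=1", "WM=6:CE=2") == "delayed"
+\end{verbatim}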
+
+\subsection{User interaction}
+\begin{figure}
+\centering
+\includegraphics[scale=.5]{images/LB-components-query}
+\caption{\LB queries and notifications}
+\label{f:comp-query}
+\end{figure}
+
+So far we have focused on the \LB internals and the interaction between its
+components.
+In this section we describe the interaction of users with the service.
+
+\subsubsection{Event submission}
+%implicit -- job registration, system events at middleware components
+The event submission is mostly implicit,
+\ie it is done transparently by the Grid middleware components
+on behalf of the user.
+Typically, whenever an important point in the job life is reached,
+the involved middleware component logs an appropriate \LB event.
+This process is not directly visible to the user.
+
+A specific case is the initial registration of the job.
+This must be done synchronously, as otherwise subsequent events logged for
+the same job may be refused with a ``no such job'' error report.
+Therefore the registration of a job upon its submission to the WMS is the only
+synchronous logging call; it does not return until the job is successfully
+registered with the \LB server.
+
+% explicit -- user tags, ACL
+However, the user may also store information in the \LB explicitly
+by logging user events---\emph{tags} (or annotations) of the
+form ``name = value''.
+Authorization information is also manipulated in this way; see
+the \LB User's Guide for details.
+
+
+\subsubsection{Retrieving information}
+From the user's point of view, information retrieval is the most
+important interaction with the \LB service.
+
+%queries on the job state
+The typical \LB usage is querying the high-level job state information.
+\LB supports not only single-job queries; it is also possible to
+retrieve information about all jobs matching a specific condition.
+The conditions may refer to both the \LB system attributes
+and the user annotations. Rather complex query semantics can be
+supported, \eg
+\emph{Which of my jobs annotated as ``apple'' or ``pear'' are already
+scheduled for execution and are heading to the ``garden'' computing element?}
+The \LB User's
+Guide\footnote{\url{https://edms.cern.ch/file/571273/1/LB-guide.pdf}} provides a~series of similar examples of
+complex queries.
+
+%queries on raw events
+As another option, the user may retrieve raw \LB events.
+Such queries are mostly used for debugging, identification of repeating
+problems, and similar purposes.
+The query construction refers to event attributes rather
+than job state.
+
+The query language supports common comparison operators, and it allows
+two-level nesting of conditions (logically \emph{and}-ed and \emph{or}-ed).
+Our experience shows that it is sufficiently strong to cover most user
+requirements while being simple enough to keep the query cost reasonable.
+A complete reference of the query language can be found in the \LB User's Guide.
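+
+For illustration, the ``apple or pear'' query above could be written down as a
+two-level condition structure roughly like this (a Python data structure used
+only to visualize the nesting, not the actual API syntax; the attribute names
+are indicative):
+
+\begin{verbatim}
+# Outer level: conditions AND-ed together.
+# Inner level: alternative values of one attribute OR-ed together.
+query = [
+    [("owner", "=", "/DC=org/DC=example/CN=Jane Doe")],
+    [("user_tag:fruit", "=", "apple"),        # "apple" OR "pear"
+     ("user_tag:fruit", "=", "pear")],
+    [("status", "=", "Scheduled")],
+    [("destination", "=", "garden")],         # target computing element
+]
+\end{verbatim}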
+
+\ludek{%notifications
+The other mode of user interaction is the \LB notifications
+(Sect.~\ref{retrieve}).}
+
+\subsubsection{Caveats}
+\LB is designed to perform well in the unreliable distributed Grid
+environment.
+An unwelcome but inevitable consequence of this design
+are certain counter-intuitive features of the system behavior,
+namely:
+\begin{itemize}
+\item
+Asynchronous, possibly delayed event delivery may yield a seemingly
+inconsistent view of the job state \wrt\ information that is available
+to the user via different channels.
+\Eg the user may know that her job has terminated because she monitors the
+application progress directly, but the \LB \emph{Done} events indicating
+job termination are delayed, so that \LB still reports the job to be
+in the \emph{Running} state.
+
+\item
+% sequence numbers -- wrongly ordered events are ignored for the state computation
+For the reasons described in Sect.~\ref{evorder}, \LB is rather sensitive
+to event ordering based on sequence codes.
+The situation becomes particularly complicated when there are multiple
+branches of job execution.
+Consequently the user may see an \LB event that is easily interpreted as one that
+should switch the job state; however, it has in fact no effect
+because it is (correctly) sorted into an already inactive branch.
+
+\item
+% purge -- data disappear from \LB
+\LB is not a~permanent job data storage. The data get purged from the
+server on a timeout, unrelated to any user action.
+Therefore, an \LB query may return a ``no such job'' error message (or not
+include the job in the list of retrieved jobs) even if the same
+\LB query had no such problems previously.
+
+\end{itemize}
+
+\subsection{Security issues}
+The events passed between the \LB components as well as the results of
+their processing provide detailed information about the corresponding job and
+its life. Being used by the users to check the status of their jobs and also by
+other Grid components to control the job, the information on jobs has to be
+reliable and reflect the jobs' real utilization of the Grid. Also, some user
+communities (\eg biomedicine researchers) often process sensitive data and
+require that the information about their processing is kept private, so that only the
+job owner can access not only the result of the computation but also all
+information about the job run. Last but not least, according to the legislation of
+some countries the information on users' jobs can be treated as the user's
+private data, which requires an increased level of protection. \LB therefore
+must pay special attention to security aspects and access control to the data.
+
+All \LB components communicate solely over authenticated channels. The TLS
+protocol~\cite{tls} is used as the authentication mechanism and each of the
+\LB components uses an X.509 public key certificate to establish a mutually
+authenticated connection. The users usually use their proxy
+certificates~\cite{proxycert} when accessing the \LB server and retrieving
+information about jobs.
+\ludek{Proxy certificates were introduced by the Grid Security
+Infrastructure~\cite{gsi} from the Globus project; they extend the concepts of
+standard public key certificates and identity providers and allow a user to
+issue a certificate for herself. Such a proxy certificate is used as a means
+for Single Sign-On and rights delegation~\cite{delegation}.}
+The proxy
+certificate can also contain other information necessary to create a secure
+connection, \eg information used for authorization. The \LB security layer is
+implemented using the Generic Security Service API~\cite{gssapi}, which
+makes it easier to port the application to an environment using a mechanism other
+than PKI (\eg Kerberos).
+
+Apart from providing an authentication mechanism, the TLS protocol also allows
+the communicating parties to exchange an encryption key that is used to encrypt
+all subsequent communication. The \LB components encrypt all network
+communication to keep the messages private. Therefore, together with the access control
+rules implemented by the \LB server, the infrastructure provides a very high
+level of privacy protection.
+
+By default, access to job information is only allowed to the user who
+submitted the job (the job owner). The job owner can also assign an access
+control list to her job in the \LB, specifying other users who are allowed
+to read the
+data from \LB. The ACLs are represented in the GridSite GACL
+format~\cite{gacl1,gacl2}, which is a simplified version of the common Extensible
+Access Control Markup Language (XACML)~\cite{xacml}. The ACLs are stored in
+the \LB database along with the job information and are checked at each access to
+the data. The GridSite XML policy engine is used for policy evaluation. The
+ACLs are under the control of the job owner, who can add and remove entries in the
+ACL arbitrarily using the \LB API or command-line tools. Each entry of an ACL
+can specify either a user subject name or the name of a VOMS group.
+VOMS~\cite{voms2} is a VO attribute provider service maintained
+by the EGEE project. It allows assigning group and role memberships to a user
+and issues to the users attribute certificates containing
+information about their current attributes. These attribute certificates are
+embedded in the user proxy certificate and checked by the \LB server during the
+handling of each user request.
+
+%Only the Read semantics is implemented so far, for the
+%future we consider more complex access control, \eg implementing the access
+%control for Write operations (per event) so that particular events could be
+%only logged by allowed components. That would allow to generate more
+%trusted results since currently each components can log arbitrary events.
+%Malicious component can make the \LB server produce inappropriate results.
+
+Besides using the ACLs, the \LB administrator can also specify a~set of
+privileged users with access to all job records on a particular \LB
+server. These privileged users can \eg
+collect information on usage and produce monitoring data based on the \LB
+information.
+
+%Data trustworthiness - the events aren't signed, no real non-reputability or
+%traceability of the event sources.
+
+Since the hostname of the \LB server is part of the job identification, it is
+easy for the user to check that the correct \LB server was contacted and no
+server spoofing took place, and thus the data received from the server can be
+trusted. The \LB server, on the other hand, has no means of checking that the
+logged events originated from an authorized component. Everyone on the Grid
+possessing a valid certificate from a~trusted CA can send an event to the \LB and
+let it store and process the event, and possibly change the status of the
+corresponding job. This way a malicious user or service can confuse the \LB
+server with forged events. This behavior is not a critical issue in the current
+model and the way in which \LB is used; however, we are designing a solution
+addressing this weakness. We plan to use VOMS attributes issued to
+selected components. These VOMS attributes must be presented when critical
+events are logged to the \LB server.
+
+\ludek{\subsection{Performance and scalability}
+The \LB service was designed with performance and
+scalability issues in mind. We have developed a series of tests of the
+individual \LB components to measure the actual behavior under
+stress conditions. These tests give us a good performance estimate of
+the \LB service and help us identify and remove possible bottlenecks.
+
+The testing itself is done by feeding the \LB components with events
+prepared beforehand, using whatever protocol is appropriate for the given
+component. The feeding program uses a set of predefined events of a
+typical job which we have chosen from the production \LB server
+database. A timestamp is taken before the first event is sent, then the
+feeding program begins sending events of individual jobs (the jobs are
+all the same, the only difference is the jobid; the number of jobs used is
+configurable). The tested component is instrumented in the source code
+to break normal event processing at selected points (\eg discard
+events immediately after being read to measure the input channel
+performance, or discard events instead of sending them to the next
+component, etc.). This segmentation of event processing makes it possible to
+identify places in the code which may slow down the event transfer.
+Optionally the events may be discarded by the next component in the
+logical path. The last event of the last job is a special termination
+event, which is recognized when being discarded; then the second
+timestamp is taken and the difference between the two gives us the total
+time necessary to pass the given number of jobs through.
+
+Note that due to the asynchronous nature of the \LB service, measuring for
+example the time it takes to send a given number of jobs does not give
+us the required result, namely the event (or job) throughput---when the
+producer receives an acknowledgment about a successful send operation, it
+is not guaranteed that the event has already passed through the component.
+
+The results shown in Table~\ref{perf:results} give the overall
+throughput of the components (events are discarded by the next component
+on the path), with the exception of the proxy, where the throughput to the
+database is measured. It can be seen that the majority of the code is fast
+enough to handle even high event rates and that most components are up
+to our goal of handling one million jobs per day. The first line
+indicates how fast we are able to ``generate'' events in the feeding
+program.
+
+\begin{table}[hbt]
+\begin{tabular}{l|r}
+{\bf Component} & {\bf Job throughput (jobs/day)} \\
+\hline
+Test producer & 153,600,000 \\
+Locallogger & 101,700 \\
+Interlogger & 5,335,100 \\
+Proxy & 1,267,110 \\
+\end{tabular}
+\caption{Performance testing results}
+\label{perf:results}
+\end{table}
+
+During the performance testing we have identified two possible
+bottlenecks:
+\begin{itemize}
+\item Opening connections and establishing SSL sessions is a very
+expensive operation. It hinders mainly the performance of the local-logger,
+because the current implementation uses one SSL session per
+event.
+\item Database operations. Storing events into the database is expensive,
+but inevitable; however, we were able to optimize for example the
+code checking for duplicated events.
+\end{itemize}
+
+In the current work we are addressing the issue of SSL operations by
+introducing the concept of SSL connection pools, which enables components
+to reuse existing connections transparently without the need to tear down
+and set up new SSL contexts. }
+
+\subsection{Advanced use}
+
+The usability of the \LB service is not limited to the simple tasks
+described earlier. It can be easily extended to support real-time job
+monitoring (not only via the notifications), and the aggregate information
+collected in the \LB servers is a valuable source of data for post-mortem
+statistical analysis of jobs and also of the Grid infrastructure behavior.
+Moreover, \LB data can be used to improve scheduling decisions.
+
+\subsubsection{\LB and real-time monitoring}
+The \LB server has been extended to provide the following data quickly and
+without any substantial load on the database engine:
+\begin{enumerate}
+\item number of jobs in the system grouped by internal status
+(\emph{Submitted}, \emph{Running}, \emph{Done},~\ldots),
+\item number of jobs that reached final state in the last
+hour,
+\item associated statistics like average, maximum, and minimum time spent
+by jobs in the system,
+\item number of jobs that entered the WMS system in the last hour.
+\end{enumerate}
+The \LB server can be queried regularly for these data, giving an
+overview of both the jobs running on the Grid and the behavior of the
+Grid infrastructure as seen from the job (or end-user) perspective.
+Thus \LB\ becomes a~data source for various real-time Grid monitoring tools.
+
+\subsubsection{R-GMA feed}
+The \LB server also supports streaming the most important data---the job
+state changes---to another monitoring system. It works like the
+notification service, sending information about job state changes to
+a~specific listener that serves as the interface to a monitoring system.
+As a~particular example of such a generic service, the R-GMA feed component
+has been developed. It supports sending job state changes to
+the R-GMA infrastructure that is part of the Grid monitoring
+infrastructure used in the EGEE Grid.
+
+Currently, only basic information about job state changes is provided
+this way, taking into account the security limitations of R-GMA.
+
+\subsubsection{\LB Job Statistics}
+Data collected within the \LB servers are regularly purged, thus complicating
+any long-term post-mortem statistical analysis. Without a Job
+Provenance, the data from the \LB must be copied in a controlled way and
+made available in an environment where even non-indexed queries can be
+asked.
+
+Using the \LB Job Statistics tools, one dump file per job is created
+when the job reaches a~terminal state. These dump files can be
+further processed to provide an XML-encoded Job History Record%
+\footnote{\url{http://egee.cesnet.cz/en/Schema/LB/JobRecord}} that
+contains all the relevant information from the job life. The Job History
+Records are fed into statistical tools to reveal interesting information
+about the job behavior within the Grid.
+
+This functionality is being replaced by the direct download of all the
+relevant data from the Job Provenance.
+
+\subsubsection{Computing Element reputability rank}
+Production operation of the EGEE middleware showed
+that misbehaving computing elements may have a significant impact on
+the overall Grid performance.
+The most serious problem is the ``black hole'' effect---a~CE that
+accepts jobs at a~high rate while all of them fail there.
+Such a CE usually appears to be free in Grid information services,
+so the resource brokers keep assigning further jobs to it.
+
+\LB data contain sufficient information to identify similar problems.
+By processing the incoming data, the information
+was made available as on-line auxiliary statistics like
+the rate of incoming jobs per CE, the job failure rate, the average job duration, etc.
+The implementation is lightweight, allowing a very high query rate.
+On the RB the statistics are available as ClassAd
+functions, allowing the user to specify that similarly misbehaving
+CEs should be penalized or completely avoided
+when the RB decides where jobs get submitted.
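+
+A back-of-the-envelope version of such statistics might look like the
+following Python sketch (the record attributes and the observation window are
+assumptions made for the example): a CE that receives jobs at a high rate
+while nearly all of them fail quickly is a ``black hole'' candidate and can be
+penalized in the brokerage.
+
+\begin{verbatim}
+from collections import defaultdict
+
+def ce_statistics(job_records, window_hours=1.0):
+    """job_records: dicts with 'ce', 'final_state' and 'duration' keys."""
+    per_ce = defaultdict(lambda: {"jobs": 0, "failed": 0, "duration": 0.0})
+    for record in job_records:
+        ce = per_ce[record["ce"]]
+        ce["jobs"] += 1
+        ce["failed"] += record["final_state"] != "Done"
+        ce["duration"] += record["duration"]
+    return {name: {"arrival_rate": ce["jobs"] / window_hours,   # jobs per hour
+                   "failure_rate": ce["failed"] / ce["jobs"],
+                   "avg_duration": ce["duration"] / ce["jobs"]}  # seconds
+            for name, ce in per_ce.items()}
+\end{verbatim}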
+
+