Glen Cowan
                                                              23.10.03


           Postgraduate Workshop on Statistical Data Analysis


1  Introduction
---------------

In the data analysis workshop we'll be working through parts of an
analysis of the search for the Higgs boson in e+e- collisions.  This
parallels very closely the real search that took place at LEP up to
the end of its operation in late 2000.  The exercises will give us a
chance to look at Monte Carlo event generation, detector simulation,
and the use of test variables to select signal events in the presence
of background.

The Higgs production process we will consider is

   e+e- -> HZ with H -> bbbar and Z -> qqbar

The main background to this is from

   e+e- -> ZZ with both Zs decaying to qqbar

We will look at highly simplified Monte Carlo generators for these
processes, and we will also simulate the response of a typical
detector in a very simplified way.  The physics behind the event
generators is described in 

   - V. Barger et al., Phys. Rev. D 49 (1994) 79
     (copy on http://www.pp.rhul.ac.uk/~cowan/barger_higgs.pdf),
   - D. Bardin et al., hep-ph/9406340,
   - Mikaelian et al., Phys. Rev. D 19 (1979).


2  Workshop notes
-----------------

Here are some rough notes on what to do for the data analysis workshop.

Log in to one of the Linux machines and copy the files (probably easiest
to copy the whole directory structure) from 

  www.pp.rhul.ac.uk/~cowan/stat/tut03

to your area.  There are three subdirectories:  toymc, evtanl and sigmatot.


2.1 The "Toy" Monte Carlo (toymc)
---------------------------------

toymc contains a Monte Carlo generator for:

  e+e- -> HZ with H -> bbbar and Z -> qqbar (the signal process) and
  e+e- -> ZZ with both Zs decaying to qqbar (a background process).

Both of these event types result in four jets of hadrons.

The program includes a simple routine that simulates the response of a
detector by smearing the momenta of the jets.  It also simulates the
tagging of jets that are initiated by heavy quarks, i.e., b or c, which
form long-lived hadrons.
The program generates for each jet a number called "btag".  This is
the p-value for the hypothesis that all of the tracks in the jet
originate from the primary vertex.  For u, d, and s jets this is
uniformly distributed in [0,1], since the hypothesis is correct.  For
c jets it is somewhat peaked towards 0 and for b jets even more so,
since these contain long-lived mesons whose decay products originate
from a secondary vertex.  For B mesons with a momentum of 30 to 40
GeV, the mean decay length is several mm.
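
The decay length quoted above follows from L = beta*gamma*c*tau =
(p/m)*c*tau.  Here is a minimal sketch of that arithmetic.  The
function name is made up for illustration and is not part of toymc,
and the B meson mass (about 5.28 GeV) and c*tau (about 0.46 mm) used
in the usage note are approximate values that you should check
against the PDG tables.

```cpp
#include <cassert>

// Mean decay length in mm for a particle with momentum p (GeV),
// mass m (GeV) and proper decay length ctau (mm):
//   L = beta * gamma * c * tau = (p / m) * c * tau
double meanDecayLength(double p, double m, double ctau) {
  assert(m > 0.0);
  return (p / m) * ctau;
}
```

With the approximate values above, meanDecayLength(30.0, 5.28, 0.46)
and meanDecayLength(40.0, 5.28, 0.46) give roughly 2.6 mm and 3.5 mm.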

Type gmake to build the program.  Run it by typing ./toymc and answer
the questions.  Generate two files of Monte Carlo data, one for HZ and
one for ZZ, with, say, Ecm = 220 GeV, M_higgs = 115 GeV, and 1000
events each.

Look at the histograms in the output file.  These give the distributions
of the various decay angles.  Compare these to the plots in the paper
by Barger et al.  (These histograms are filled before simulation
of detector effects.)

You can hack into the program and investigate the effect of the
detector simulation.  Try, for example, turning it off entirely.

The file also contains an ntuple with the four-vectors of the four
jets as well as the btag values.  Look at the distributions of the
various quantities with PAW.  Try to explain the distributions of the
btag values for the four jets.  The distributions of the momentum
components by themselves are not very informative.  To see something
more meaningful we need to form pairs of jets and calculate their
invariant masses.  In principle you can do this in PAW, but in practice
it calls for a more flexible programming environment.  One possibility
is ROOT; you can convert the HBOOK ntuple file to ROOT format
by typing, say,

h2root hz.hbook hz.root

An alternative is to read the HBOOK file in with a C++ or
FORTRAN program, unpack the ntuple one event at a time and do the
analysis there.  A simple program for this is evtanl.


2.2 The Event Analysis Program evtanl
-------------------------------------

evtanl is a simple C++ program which reads in the ntuple, unpacks it
and makes the variables available to the user.  From the four-vectors,
for example, you can compute the invariant masses of two-jet pairs,
and try to figure out which pair came from the Z decay and which came
from the Higgs.  Of course it will help to use the b-tagging
information.
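
The pairing step can be sketched as follows.  This is only an
illustration under stated assumptions: the FourVec struct, the
function names and the nominal Z mass of 91.19 GeV are choices made
here, not names taken from evtanl.  The rule used (pick the pairing
in which one jet pair is closest to the Z mass) is just one simple
option, and it ignores the btag values.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical four-vector layout; evtanl's own variable names may differ.
struct FourVec {
  double e, px, py, pz;  // energy and momentum components in GeV
};

// Invariant mass of the system formed by two jets: m^2 = E^2 - |p|^2.
double invMass(const FourVec& a, const FourVec& b) {
  const double e = a.e + b.e;
  const double px = a.px + b.px;
  const double py = a.py + b.py;
  const double pz = a.pz + b.pz;
  const double m2 = e * e - px * px - py * py - pz * pz;
  // Detector smearing can push m2 slightly negative; clamp at zero.
  return m2 > 0.0 ? std::sqrt(m2) : 0.0;
}

// Masses of the two jet pairs for one way of splitting four jets.
struct Pairing {
  double mZ;      // mass of the pair assigned to the Z
  double mOther;  // mass of the remaining pair (Higgs candidate)
};

// Try all three ways of splitting four jets into two pairs and keep the
// assignment in which one pair is closest to the assumed Z mass.
Pairing bestPairing(const FourVec jets[4], double mZnominal = 91.19) {
  assert(mZnominal > 0.0);
  static const int combos[3][4] = {{0, 1, 2, 3}, {0, 2, 1, 3}, {0, 3, 1, 2}};
  Pairing best = {0.0, 0.0};
  double bestDiff = -1.0;  // negative means nothing has been chosen yet
  for (int i = 0; i < 3; ++i) {
    const double m1 = invMass(jets[combos[i][0]], jets[combos[i][1]]);
    const double m2 = invMass(jets[combos[i][2]], jets[combos[i][3]]);
    const double d1 = std::fabs(m1 - mZnominal);
    const double d2 = std::fabs(m2 - mZnominal);
    const double d = d1 < d2 ? d1 : d2;
    if (bestDiff < 0.0 || d < bestDiff) {
      bestDiff = d;
      best = d1 < d2 ? Pairing{m1, m2} : Pairing{m2, m1};
    }
  }
  return best;
}
```

Calling bestPairing on the four jets of each event and histogramming
mOther separately for the HZ and ZZ files is a natural first check.
A better pairing rule would also require the Higgs candidate jets to
have small btag values.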

Try to figure out what variables provide the best discriminating
power between HZ and ZZ events.  Pick a set of selection criteria
and find out your efficiency for HZ (hopefully high) and ZZ (hopefully
low).
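
The efficiency is simply the fraction of generated events that pass
your cuts.  Below is a minimal sketch with a binomial estimate of its
statistical error.  The struct and function names are made up for
illustration, and the simple error formula is an approximation that
breaks down when the efficiency is very close to 0 or 1.

```cpp
#include <cassert>
#include <cmath>

struct Efficiency {
  double value;  // fraction of events passing the selection
  double error;  // binomial standard deviation of that fraction
};

// Efficiency and its binomial error from the number of events that
// pass the cuts and the number of events generated.
Efficiency selectionEfficiency(long nPassed, long nGenerated) {
  assert(nGenerated > 0 && nPassed >= 0 && nPassed <= nGenerated);
  const double eff = static_cast<double>(nPassed) / nGenerated;
  const double err = std::sqrt(eff * (1.0 - eff) / nGenerated);
  return {eff, err};
}
```

For example, if 600 out of 1000 HZ events pass, this gives an
efficiency of 0.60 with an error of about 0.015.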

If you have time, try to construct a simple Fisher discriminant
function (see the course notes).  Try to produce histograms of the
test statistic for HZ and ZZ events.
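
As a starting point, here is a sketch of a Fisher discriminant for two
input variables.  It assumes the usual construction in which the
coefficients are proportional to W^-1 (mu_sig - mu_bkg), where W is
the sum of the covariance matrices of the two event classes; check
the normalization and sign conventions against the course notes.  All
type and function names are made up for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct Point2 { double x, y; };   // two input variables for one event
struct Fisher2 { double a, b; };  // test statistic t = a*x + b*y

// Sample mean of one event class.
Point2 meanOf(const std::vector<Point2>& ev) {
  Point2 m = {0.0, 0.0};
  for (const Point2& p : ev) { m.x += p.x; m.y += p.y; }
  m.x /= ev.size();
  m.y /= ev.size();
  return m;
}

// Add the sample covariance matrix of one class to w = {Vxx, Vxy, Vyy}.
void addCovariance(const std::vector<Point2>& ev, const Point2& m,
                   double w[3]) {
  for (const Point2& p : ev) {
    const double dx = p.x - m.x;
    const double dy = p.y - m.y;
    w[0] += dx * dx / ev.size();
    w[1] += dx * dy / ev.size();
    w[2] += dy * dy / ev.size();
  }
}

// Coefficients W^-1 (mu_sig - mu_bkg), with W = V_sig + V_bkg and the
// 2x2 matrix inverted by hand.
Fisher2 fisherCoefficients(const std::vector<Point2>& sig,
                           const std::vector<Point2>& bkg) {
  assert(!sig.empty() && !bkg.empty());
  const Point2 ms = meanOf(sig);
  const Point2 mb = meanOf(bkg);
  double w[3] = {0.0, 0.0, 0.0};
  addCovariance(sig, ms, w);
  addCovariance(bkg, mb, w);
  const double det = w[0] * w[2] - w[1] * w[1];
  assert(std::fabs(det) > 0.0);  // W must be invertible
  const double dx = ms.x - mb.x;
  const double dy = ms.y - mb.y;
  return {(w[2] * dx - w[1] * dy) / det, (-w[1] * dx + w[0] * dy) / det};
}

// Value of the test statistic for one event.
double fisherStatistic(const Fisher2& f, const Point2& p) {
  return f.a * p.x + f.b * p.y;
}
```

Fill the two vectors from the HZ and ZZ files (for example with the
Higgs candidate mass and a btag-based quantity), compute the
coefficients once, and then histogram fisherStatistic for each sample.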


2.3 Total cross sections
------------------------

The number of events n_i of type i that one will obtain for a given
integrated luminosity L is a Poisson random variable with a mean value
nu_i.  This is given by

  nu_i = sigma_i * efficiency_i * L

where sigma_i is the total cross section, which you will need for both
reactions, i.e., i = HZ and i = ZZ.  These depend on parameters like
the centre-of-mass energy and on the Higgs mass.
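
The formula above, and the Poisson probability that goes with it, can
be sketched as follows.  The function names are made up for
illustration, and the units (pb for the cross section, pb^-1 for the
luminosity) are an assumption; use whatever units test_sigma_tot
actually returns.

```cpp
#include <cassert>
#include <cmath>

// Expected number of selected events: nu = sigma * efficiency * L.
// sigmaPb is in pb and lumiInvPb in pb^-1, so the result is a pure number.
double expectedEvents(double sigmaPb, double efficiency, double lumiInvPb) {
  assert(sigmaPb >= 0.0 && lumiInvPb >= 0.0);
  assert(efficiency >= 0.0 && efficiency <= 1.0);
  return sigmaPb * efficiency * lumiInvPb;
}

// Poisson probability to observe n events when nu are expected,
// evaluated with logarithms to avoid overflow in n!.
double poissonProb(int n, double nu) {
  assert(n >= 0 && nu >= 0.0);
  if (nu == 0.0) return n == 0 ? 1.0 : 0.0;
  return std::exp(n * std::log(nu) - nu - std::lgamma(n + 1.0));
}
```

As a purely illustrative example (these are not real cross sections),
a cross section of 0.1 pb with an efficiency of 0.5 and 100 pb^-1
gives nu = 5, and poissonProb(0, 5.0) is then exp(-5), i.e. about
0.7%.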

The directory sigmatot contains a simple program for computing total
cross sections: test_sigma_tot.cc.  There is a simple script for
compiling and linking it: test_sigma_tot.lnk.  Try to get this to give
you the cross sections for the Ecm and Higgs mass values that you
choose.


3 Putting it all together
-------------------------

At the end of the day what you want is to set the selection criteria
to maximize the expected lower limit that you would set on the Higgs mass.
A final statement would be of the form:  "With so-and-so much integrated
luminosity (say, 100 pb^-1), we would find so-and-so many Higgs
events at different values of the Higgs mass, and the expected
number of events from background processes is so-and-so many..."

Alternatively, you could set up a "mock data challenge" where you
prepare a sample of data with Standard Model processes mixed together
with Higgs events at a certain Higgs mass.  If your analysis finds a
significant signal you would compute the p-value of the hypothesis
that there is only background, to see if it could be rejected.  These
questions go beyond the scope of today's workshop but are discussed at
length in the many papers written by the LEP Higgs groups (see
e.g. their paper submitted to the Amsterdam conference ICHEP02).
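
For a simple counting experiment, the p-value of the background-only
hypothesis can be sketched as below.  It is the Poisson probability
to observe at least nObs events when only b background events are
expected.  This is a deliberately simplified illustration with a
made-up function name; it ignores uncertainties on b and is not the
method used by the LEP Higgs groups.

```cpp
#include <cassert>
#include <cmath>

// p-value of the background-only hypothesis in a counting experiment:
//   p = P(n >= nObs; b) = 1 - sum_{n=0}^{nObs-1} exp(-b) b^n / n!
double pValueBackgroundOnly(int nObs, double b) {
  assert(nObs >= 0 && b >= 0.0);
  double term = std::exp(-b);  // Poisson probability for n = 0
  double cumulative = 0.0;     // running sum of P(n; b) for n < nObs
  for (int n = 0; n < nObs; ++n) {
    cumulative += term;
    term *= b / (n + 1);       // P(n+1; b) from P(n; b)
  }
  const double p = 1.0 - cumulative;
  return p > 0.0 ? p : 0.0;    // guard against rounding below zero
}
```

For example, observing 5 events with an expected background of 1
gives a p-value of about 0.0037.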