Statistical Data Analysis

2023/24 University of London Postgraduate
Lectures for Particle Physicists

University of London MSci PH4515

 



Glen Cowan, Royal Holloway, University of London, phone: (01784) 44 3452, e-mail: g.cowan@rhul.ac.uk

Time & Place: The 2023/24 course takes place Mondays 2-5 pm, starting 25 September and ending 4 December. The lectures will take place in London in Stewart House, Room 2/3. The entrance to Stewart House is through Senate House, Malet Street, London WC1E 7HU -- here is a map. Once in Senate House, go up the big staircase, turn left and go through the corridor past the Royal Holloway sign.

The core material is presented in the first two hours; the third hour is used for examples and discussion.

Moodle page: U. of London MSc and MSci students should access the course through its RHUL Moodle page.

Aims: This series of lectures is intended for PhD students in Particle Physics and it also forms the University of London MSci course PH4515. The purpose of the lectures on probability and statistics is to present the basic mathematical tools needed for the analysis of experimental data. The methods will be practiced by writing and running short computer programs.

Although the examples used in the course often relate to particle physics, they are kept relatively simple, and MSci students from other areas of physics should not find this too great a difficulty.

Computing: The statistical methods will be practiced using computer programs in python or C++. Students should have some familiarity with at least one of these languages or be willing to use additional resources to acquire the needed computing skills.

Syllabus: A general outline of the course topics.

Slides and notes:

  • Week 1 slides, discussion notes
  • Week 2 slides, discussion notes
  • Week 3 slides, discussion notes, Monte Carlo code cauchyMC.py, cauchyMC.ipynb.
  • Week 4 slides, discussion notes
  • Week 5 slides, discussion notes
  • Week 6 slides, discussion notes
  • Week 7 slides, discussion notes
  • Week 8 slides, discussion notes. An intro to multiple regression can be found in the extra slides and here.
  • Week 9 slides, discussion notes. Python routines histFit for histogram fitting are here.
  • Week 10 slides, discussion notes. Materials for a Bayesian fit using MCMC can be found here.
  • Week 11 slides, discussion notes; Lectures 11-3 and 11-4 refer to G. Cowan, Eur. Phys. J. C (2019) 79:133 or arXiv:1809.05778; more details in this seminar; some simple related software.
  • Solutions to problem sheet 9.

Problem sheets: There are 9 problem sheets due on Mondays at 18:00 from weeks 3 through 11. Further info on these can be found in the slides for week 1 and part 1 of the corresponding video.

  • Problem Sheet 1, due 9 October 2023.
  • Problem Sheet 2, due 16 October 2023.
  • Problem Sheet 3, due 23 October 2023. Materials for problem 4 can be found here.
  • Problem Sheet 4, due 30 October 2023.
  • Problem Sheet 5, due 6 November 2023. Code for doing the problems using python/scikit-learn or C++/ROOT.
  • Problem Sheet 6, due 13 November 2023.
  • Problem Sheet 7, due 20 November 2023. For the warm-up problem 2 here are files to use iminuit with python or tminuit with root.
  • Problem Sheet 8, due 27 November 2023. The exercise uses the routine mlFit with iminuit/python or TMinuit/root.
  • Problem Sheet 9, due 4 December 2023.

Archived lecture videos and slides from 2020/21:

  • Week 1 slides and videos part 1 (course intro), part 2 (probability), part 3 (interpretation of prob., Bayes' thm.), part 4 (random variables, pdfs).
  • Week 2 slides and videos part 1 (functions of r.v.s), part 2 (expectation values), part 3 (error prop.), part 4 (catalog of distributions, 1).
  • Week 3 slides and videos part 1 (uniform, exponential), part 2 (Gaussian), part 3 (further pdfs), part 4 (Monte Carlo).
  • Week 4 slides and videos part 1 (hypothesis tests), part 2 (example of test), part 3 (test statistic, N-P lemma), part 4 (multivariate methods).
  • Week 5 slides, part 1 (neural nets), part 2 (network training), part 3 (pdf estimation), part 4 (BDTs).
  • Week 6 slides, part 1 (p-values), part 2 (examples of p-values), part 3 (chi-square test), part 4 (parameter estimation).
  • Week 7 slides, part 1 (large-sample MLEs), part 2 (variance of MLEs), part 3 (2-D numerical example), part 4 (Extended ML, Bayesian est.).
  • Week 8 slides, part 1 (method of least squares), part 2 (linear LS, bias and variance), part 3 (goodness of fit with LS), part 4 (LS example, averaging).
  • Week 9 slides, part 1 (LS with histograms), part 2 (LR test, Wilks thm.), part 3 (interval estimation), part 4 (interval from likelihood), histFit.py (for fitting histograms).
  • Week 10 slides, part 1 (Poisson upper limit), part 2 (Jeffreys prior), part 3 (nuisance parameters), part 4 (Bayesian treatment of NPs, MCMC).
  • Week 11 slides, part 1 (Bayes factors), part 2 (Finding marginal likelihoods), part 3 (Errors on errors pt. 1), part 4 (Errors on errors pt. 2), simple program for Student's t average. Lectures 11-3 and 11-4 refer to G. Cowan, Eur. Phys. J. C (2019) 79:133 or arXiv:1809.05778.
  • Revision Session slides (29apr21).

Books on statistical methods:

Books on multivariate methods:

Some additional notes/resources:

  • Python programs and slides from the 4th KMI School on Statistical Data Analysis and Anomalies (Nagoya, December 2022).
  • The materials from RHUL's year-3 introduction to statistics include a short program simpleFit.py for doing least-squares fits with the python routine curve_fit; also a root/C++ version simpleFit.C.
  • A note on the Jeffreys prior.
  • A note on the Poisson distribution and one on the exponential distribution.
  • See Sec. 40.5 of the PDG Statistics Review for a discussion of experimental sensitivity.
  • G. Cowan, Statistical Models with Uncertain Error Parameters, Eur. Phys. J. C (2019) 79:133 or arXiv:1809.05778
  • The "Asimov Paper", aka Asymptotic formulae for likelihood-based tests of new physics, by Cowan, Cranmer, Gross and Vitells, EPJC 71 (2011) 1554 or arXiv:1007.1727, for more on statistical tests for searches.
  • G. Cowan, Topics in statistical data analysis for high energy physics, arXiv:1012.3589 (2010).
  • G. Cowan, Statistics for Searches at the LHC, arXiv:1307.2487 (2013).
  • G. Cowan, Bayes Factors for Discovery (draft note).
  • Lectures at the Galileo Galilei Institute (January 2017).
  • An introductory paper on Bayesian statistics: G. Cowan, Data analysis: Frequently Bayesian. Physics Today, Vol. 60, No. 4. (2007), pp. 82-3.
  • The sections on probability, statistics, and Monte Carlo from the Review of Particle Physics, P.A. Zyla et al., Prog. Theor. Exp. Phys. 2020, 083C01 (2020), by the Particle Data Group.
  • G. Cowan, A Survey of Unfolding Methods in Particle Physics, in M. Whalley and L. Lyons (eds.), Advanced Statistical Techniques in Particle Physics (Proceedings) Durham, UK, March 18-22, 2002, Conf.Proc.C 0203181 (2002) 248-257.
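
As a rough illustration of the kind of least-squares fit mentioned above for simpleFit.py, the following is a minimal sketch using the scipy routine curve_fit. It is not the course's program: the straight-line model, the data values, and the names line, theta_hat and theta_err are assumptions made up for this example.

```python
import numpy as np
from scipy.optimize import curve_fit


def line(x, theta0, theta1):
    """Straight-line fit function (an assumed model, for illustration only)."""
    return theta0 + theta1 * x


# Made-up measurements y_i at points x_i, each with standard deviation sigma_i.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
sigma = np.array([0.3, 0.3, 0.3, 0.3, 0.3])

# absolute_sigma=True treats sigma as absolute standard deviations, so the
# returned covariance matrix is not rescaled by the fit's chi2/ndof.
theta_hat, cov = curve_fit(line, x, y, p0=[0.0, 1.0],
                           sigma=sigma, absolute_sigma=True)
theta_err = np.sqrt(np.diag(cov))  # standard deviations of the estimators

# Minimized chi-square and number of degrees of freedom for goodness of fit.
chi2 = np.sum(((y - line(x, *theta_hat)) / sigma) ** 2)
ndof = len(x) - len(theta_hat)

for i, (val, err) in enumerate(zip(theta_hat, theta_err)):
    print(f"theta{i} = {val:.3f} +/- {err:.3f}")
print(f"chi2 = {chi2:.2f} for {ndof} degrees of freedom")
```

With absolute_sigma=False (the scipy default) the quoted errors would instead be scaled so that chi2/ndof equals one, which is usually not what is wanted when the sigma values are known measurement errors.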

Computing:

Some more lectures on statistics I've given:

Archives -- Statistical Data Analysis old lectures:

    Information on computing setup: Some info on how to log into the RHUL particle physics Linux machine linappserv0 from the teaching lab or your own computer is available here.

    Once you have your account on linappserv0, you can connect from any other networked Linux machine with

    ssh -X username@linappserv0.pp.rhul.ac.uk

    where for "username" you substitute your login name, and then enter your password. You will have been given information on computer security and on how to change your password. It is your responsibility to read and follow these rules.

    The -X option above should allow you to open an X window. You can check this by typing at the prompt

    xclock &

    which should open up a clock in a small window. If it doesn't work, try using -Y or -XY.

    Your default shell is bash. Your account should have in the home directory a file called .bash_profile (check this with ls -la). If it isn't there, you can copy this .bash_profile to your home directory. This defines certain aliases and environment variables automatically when you log in. In particular, it defines the environment variable ROOTSYS, which you need for the ROOT programs we will use.

    You can also copy to your home directory the file .emacs, which will set some defaults for the emacs editor.


    Glen Cowan