
Notes on using RooWorkspaces

RooWorkspaces are a method of persistifying RooFit and RooStats objects. The basic approach to creating one is to import RooFit objects into it, such as pdfs, datasets and variables. Pdfs can be of any RooAbsPdf type, variables of any RooAbsVar type, and datasets of any RooAbsData type. This allows for a large degree of flexibility.
One side issue though is that multi-dimensional kernel estimation pdfs are not supported by RooWorkspaces and can therefore not be saved for use at a later date. It is for this reason that I have only used the 1D RooKeysPdf (this 1D version can be saved to a RooWorkspace, probably because it is based on a finely binned histogram, and ROOT loves histograms).
To access the workspace, the easiest method is probably through CINT, i.e. interactively or via a ROOT macro.
Load the file containing the workspace. After this point you can call the workspace directly by its name, e.g. w->Print(), which will list all the objects in the workspace.
To access the objects: w->data("dataname"); w->pdf("pdfname"); w->var("variablename");
Obviously if you want to use these objects later, the simplest method is to define a pointer of the appropriate type and assign it, e.g. RooRealVar* var1 = w->var("var1"); RooKeysPdf* pdf = (RooKeysPdf*)w->pdf("pdf"); etc. (Note that pdf() and data() return base-class pointers, RooAbsPdf* and RooAbsData*, hence the cast; more on casting below.)
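For example, a minimal sketch of a complete session (the file name myfile.root and the object names are assumptions for illustration):

TFile* f = TFile::Open("myfile.root");            // load the file
RooWorkspace* w = (RooWorkspace*)f->Get("w");     // retrieve the workspace by its name
w->Print();                                       // list all objects in the workspace
RooRealVar* var1 = w->var("var1");                // grab a variable for later use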
Combining workspaces makes use of the workspace factory (RooFactoryWSTool), which has a lot of built-in methods; however, the documentation appears to be limited, which is why I will try to record some of it here.
If you have two workspaces with pdfs you wish to combine:
RooWorkspace w("w", "combined workspace"); w.import("filename:workspace:objectname",);
eg: w.import("channelA:w1:pdfA",RenameAllVariablesExcept("A","mHiggs")); w.import("channelB:w5:pdfB",RenameVariable("m","mHiggs"));
w.factory("SIMUL::joint(chan[A,B],A=pdfA,B=pdfB)");
This creates a simultaneous pdf called joint, combining the two channels A and B using pdfA and pdfB. This object will now reside within the workspace and can be accessed in the same way as before, using w->pdf("joint"), to do with what you will.
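Putting the above together, a hedged sketch of a complete combination macro (the file names, workspace names and pdf names are made up for illustration):

using namespace RooFit;

void combine() {
  RooWorkspace w("w", "combined workspace");
  // import pdfA from workspace w1 in channelA.root, appending the suffix "A"
  // to every variable name except mHiggs
  w.import("channelA.root:w1:pdfA", RenameAllVariablesExcept("A","mHiggs"));
  // import pdfB from workspace w5 in channelB.root, renaming its variable m to mHiggs
  w.import("channelB.root:w5:pdfB", RenameVariable("m","mHiggs"));
  // build the simultaneous pdf over the two channels
  w.factory("SIMUL::joint(chan[A,B],A=pdfA,B=pdfB)");
  w.writeToFile("combined.root");
}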

Accessing data from a RooDataSet (Not as simple as it sounds!!!)

A RooDataSet is basically an n-dimensional array which can contain RooFit objects. Typically the RooDataSets we deal with contain RooRealVar objects.
My example is a real-life one. I constructed RooDataSets from unbinned data which was originally saved as a branch in a tree. It is relatively straightforward to create such a RooDataSet from a TTree branch, so long as you have created RooRealVars which effectively act as the keys (column labels).
The structure is like:
---------------------------
| Variable 1 | Variable 2 |
---------------------------
| evt1value  | evt1value  |
| evt2value  | evt2value  |

And so on...

To access these data points you must first use const RooArgSet* set = RooDataSet::get(index), where the index runs from 0 up to (but not including) the number of entries in the RooDataSet. The RooArgSet returned contains the same RooRealVars that define the columns of the RooDataSet.
This effectively selects a row in the dataset (which is tabulated as above).
You then need to select the data point you want, and to do this you use RooRealVar* var = (RooRealVar*)RooArgSet::find("nameofvariable") (the cast is explained below).
Once you have done this, you still don't actually have the data point, only a pointer to it. To access the value itself you need to use RooRealVar::getVal(). This will (finally!) give you the useful value with which you can do your work.
Note that even though you are dealing with a pointer at this stage, the getVal() method returns a double (not a pointer to a double).
So in a nutshell:
RooWorkspace*->(using RooWorkspace::data("datasetname"))->RooDataSet*->(using RooDataSet::get(row index))->RooArgSet*->(using RooArgSet::find("varname")->RooRealVar*->(using RooRealVar::getVal())->double/integer value to play with!

Additional Comments

So RooFit (or possibly CINT, or both) seems to treat the persistification of objects in a peculiar fashion. For example, it does not seem possible to create a pointer to a RooFit object directly (as far as I can tell). However, when you create a workspace, save it to a file (w.writeToFile("file.root")) and then access that file again, the resulting objects you have to play with ARE pointers to those objects, and as such require the pointer accessor (i.e. -> not .). This means that when you want to use the objects as a defined structure (at least in CINT) you need to create objects like RooDataSet* data = (RooDataSet*)w->data("datasetname"); and then use data-> to access the methods.
This in itself is not completely peculiar. However, there is a lot of casting of object types involved when using workspaces to access objects.
For example:
I had a dataset in a RooWorkspace from which I wanted to create a RooKeysPdf in an interactive ROOT session, to then be saved to the current workspace. The RooKeysPdf has the following constructor:
RooKeysPdf kpdf("name","title",variable,dataset); N.B. there are additional options which can be appended.
The constructor takes the variable and the dataset by reference, but this is where it gets a bit weird.
Because the RooWorkspace only returns pointers to the variable and the dataset, one might think that dereferencing the pointers inside the constructor would be the preferable course of action, i.e.:
RooKeysPdf kpdf ("name", "title",*(w->var("nHadrons")),*(w->data("data_set")));
However, this does not work. The constructor error still comes up, and if you just use the pointers instead of dereferencing you get the same error message. (The underlying reason is that w->data() returns a RooAbsData*, a base-class pointer, while the constructor wants a RooDataSet&, so a downcast is needed.)
This is not really good behaviour in CINT, because it is not enough to be able to save the data without being able to regenerate things from it. I personally wanted to make a new dataset and keys pdf from the hadron counts in the planes, to get a total hadron count per event. The downside was that if I couldn't do it from my already-saved datasets, I would need to rerun my ROOT analysis over the D3PDs, add an additional branch to my TTree, and then rerun the RooWorkspace derivations - not a good use of CPU time!

The way around this is to cast the objects into a non-pointer form! This is clearly odd. I'm not sure it would work in a compiled C++ program, but I wanted to do this through CINT and a ROOT macro, rather than write a whole new program, mainly because of the annoying filenames I had given to objects.

In any case, the following DOES work in CINT (and yes, it does appear that I am casting a pointer-to-object to an object):

RooKeysPdf kpdf ("name","title",(RooRealVar)w->var("nHadrons"),(RooDataSet)w->data("data_name"));

Another time I needed to use casting also came about in CINT while writing my macro.
When I wrote above about accessing the data points in a RooDataSet, there was still a problem I came up against.
On the first attempt at generating an object with RooRealVar* var = set->find("var_name"), there are no issues. However, when you try to overwrite this object, say with the variable from the next row (i.e. the next event), you get an error message saying the object is a RooAbsArg* (which is in fact the type that find() returns). This is clearly odd, as you had just used the same line of code and it worked.

The workaround is again to use casting of the form: var = (RooRealVar*)set->find(var->GetName());

This tells CINT to get the variable and cast it to a RooRealVar*, which is then assigned to var. You can then use var->getVal() easily.
You'll notice in the line above I used the GetName() method, as it returns the name as a const char* (a C string). This works because I had defined my var object with the same "name" and "title" as the one I was extracting from the RooDataSet.
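Putting the two workarounds together, a loop over every event in a dataset might look like this (a sketch, with the dataset and variable names assumed):

RooDataSet* data = (RooDataSet*)w->data("data_set");
for (int i = 0; i < data->numEntries(); ++i) {
  const RooArgSet* row = data->get(i);                    // select row i
  RooRealVar* var = (RooRealVar*)row->find("var_name");   // the cast avoids the RooAbsArg* complaint
  std::cout << var->getVal() << std::endl;                // the actual value for this event
}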

Binning is not defined in current scope

I keep running into this error when I am trying to call createHistogram() on a pdf.

The solution is to ENSURE YOU ARE USING NAMESPACE ROOFIT.

This is especially pertinent when you are investigating the contents of a workspace within an interactive ROOT session: it seems that even though RooWorkspaces need RooFit, you still need to explicitly state using namespace RooFit;.

Then a line like this: root [37] TH1* Hist = (TH1*)h1->createHistogram("Hist",xis,Binning(1000,0.1,0.9),YVar(hpm,Binning(100000,0,100000))) "should" work.
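In macro form, the fix is simply the using directive at the top; a minimal sketch (the file, workspace and variable names mirror the session above but are assumptions):

using namespace RooFit;   // without this, Binning and YVar are "not defined in current scope"

void plotPdf() {
  TFile* f = TFile::Open("myfile.root");
  RooWorkspace* w = (RooWorkspace*)f->Get("w");
  RooAbsPdf* h1 = w->pdf("h1");            // the pdf to histogram
  RooRealVar* xis = w->var("xis");         // x variable
  RooRealVar* hpm = w->var("hpm");         // y variable
  TH1* hist = (TH1*)h1->createHistogram("Hist", *xis,
                  Binning(1000, 0.1, 0.9),
                  YVar(*hpm, Binning(100000, 0, 100000)));
  hist->Draw();
}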



Related source files
Kernel estimation source code
Macro to sum RooDataSet entries
Page for any codes to refer to

Some notes I have collected

Authored by Kyle Cranmer, this response to an email held some useful information about interpolation. As I hadn't found this information in the HistFactory document yet (especially where to find out what InterpCode=4 is, seeing as that is likely to be used in combinations, and I might end up looking into interpolation methods), I thought I should repost the information. I'm pretty sure there is no sensitive information in it.


I took this opportunity to write a long description of what is being done and why. I answer Gia's question near the beginning, but you should make sure to read "New Default in HistFactory" below that. Most of the middle content is related to developments that happened in the LHC-HCG, but haven't made it into the documentation yet.

See Section 5 of the LHC-HCG document for more details. Note, the methods are not specific to the Higgs: http://cdsweb.cern.ch/record/1375842

Now that I've written it, I'll try to include it in the HistFactory documentation.

Initial HistFactory treatment: Gaussian constraints and linear interpolation. Imagine that you have a systematic like the JES. A +/- 1 sigma change in the JES might change your signal rate by 10%. For a background it might change the rate by 15%. These are statements about the effect a variation in the JES has on the rate. If alpha_JES is the nuisance parameter, then you might say that the signal rate is:
s(alpha_JES) = s_nominal * (1 + 0.10 * alpha_JES)
and your background rate is
b(alpha_JES) = b_nominal * (1 + 0.15 * alpha_JES)
where the nuisance parameter is scaled so that alpha_JES=0 is the nominal situation and alpha_JES=1 corresponds to the JES being "1-sigma" high.

Note, both signal and background rates are a function of the nuisance parameter alpha_JES and you can see there is a correlated effect on both rates when alpha_JES changes.

Next, the default for OverallSys (or HistoSys) is to add a Gaussian "constraint term": G(A | alpha_JES , 1)

Note that if alpha is sufficiently negative then s(alpha)<0, which is unphysical. This can happen a lot if the effect of the uncertainty is large. This is the motivation for using a different type of constraint term, one that ensures that the rates are always positive.

Consistent Bayesian/Frequentist modeling: HistFactory is aiming to produce a model that can be used in both Bayesian and frequentist situations. Interpreting the constraint term as Prob(alpha_JES) is Bayesian, so I took care to write Gaus(A | alpha,1) above, where the variable A is a scaled version of an auxiliary measurement (called a global observable in RooFit/RooStats). When we generate pseudo-experiments we randomize A (the unconditional ensemble in our frequentist recommendations). Note that if you assume a flat prior on alpha, the resulting posterior is still a Gaussian. Glen calls this original flat prior the "ur-prior". See table 2 in the LHC-HCG document.

HistFactory Adds Alternative Gamma & LogNormal Constraints (same interpolation): Systematics with large uncertainties are probably better described by a Gamma or a LogNormal constraint term. There are some standard conventions for mapping a Gaussian to a Gamma or LogNormal with the same relative uncertainty. But this is now a relative uncertainty on the source, e.g. how well do we know the JES? That is logically distinct from the effect it has on signals or backgrounds. In particular, the relative uncertainty on the source is a single number, independent of its effect on any particular signal or background. If the JES had a 1% uncertainty, then the corresponding Gamma and LogNormal constraints would look very Gaussian. However, if the JES had a 50% uncertainty, the Gamma and LogNormal constraints would be very asymmetric. A Gaussian always has the same shape, so the relative uncertainty on the source was not added originally. But when we added the Gamma and LogNormal constraints, we needed to add that notion (since it is global, it goes in the top-level XML file).

Reparametrization of alpha: Gamma and LogNormal constraints are for parameters that are >=0. Above we had alpha=0 as the nominal and alpha=-1 as the -1-sigma situation, so in HistFactory I have to reparametrize the model. I replace the parameter alpha with a linear function alpha(beta). Here alpha has the same meaning, but beta=1 is the nominal situation and beta=1+/-relUncertainty is the +/-1-sigma situation. The only problem with this approach is that you can't guarantee that s(alpha(beta))>0. More below.

Extended statistical note for Gamma: we have to do the same consistent Bayesian/Frequentist thinking regarding the "ur-prior" for the Gamma and LogNormal as well. The term "Gamma constraint" is really a description of a Bayesian posterior from a PDF like Pois(A | tau*beta), where tau is some fixed constant based on the relative uncertainty. Again, A should be randomized when generating pseudo-experiments. The Gamma distribution comes about as a Bayesian posterior if you multiply the likelihood by the same uniform "ur-prior" on beta (equivalently on alpha, b/c they are linearly related). See eqn 18 and table 2 in the LHC-HCG document.

Extended statistical note for LogNormal: in the case of the log-normal it's a bit odd. The justification for using a LogNormal is usually given in a frequentist way (e.g. many multiplicative errors). So that means that the PDF should be a LogNormal(A | beta, kappa), where again kappa is some constant determined by the relative uncertainty. If you follow the story above and multiply the resulting likelihood by the same uniform "ur-prior", you do not get a LogNormal posterior. You can show that what you need is an ur-prior that looks like 1/beta if you want to get a LogNormal posterior. But that's kind of strange thinking, b/c you don't usually pick a prior to make the posterior look the way you want. Anyway, that's the thinking behind the use of LogNormals. See eqn 17 and table 2 in the LHC-HCG document.

LogNormal constraint & linear interpolation = Gaussian constraint and exponential interpolation.
In the course of the LHC-HCG work we realized that if one reparameterizes beta in [0,inf] as alpha'=log(beta) in [-inf,inf], then a uniform ur-prior on log(beta) picks up a Jacobian factor, making it equivalent to the 1/beta prior I mentioned above. One can rewrite LogNormal(beta) as Gaussian(log(beta))=Gaussian(alpha) and then adjust the interpolation written at the top accordingly (see below). Note, when the effect of the uncertainty is different for different signal and background samples, these two approaches are not strictly equivalent, but they are close. This is now the default in HistFactory for normalization uncertainties. See the discussion around eqn 18 in the LHC-HCG document.

New Default Interpolation for Normalization in HistFactory (exponential interpolation): Based on the discussion above, the default in HistFactory is to have a Gaussian constraint on alpha, but the interpolation is no longer linear; it's exponential:
s(alpha) = s_nominal * (1+0.10)^alpha
b(alpha) = b_nominal*(1+0.15)^alpha
With this we can ensure that s(alpha) and b(alpha) remain >0 even as alpha -> -infinity. Also, the Taylor expansion of this has the same linear behavior near alpha=0 as in the original situation.
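As a quick numerical illustration of the difference (not HistFactory code, just the two formulas side by side for a 10% uncertainty):

double s_nominal = 100.0;
for (double alpha = -3.0; alpha <= 3.0; alpha += 1.0) {
  double s_lin = s_nominal * (1.0 + 0.10 * alpha);    // original linear: crosses zero at alpha = -10
  double s_exp = s_nominal * pow(1.0 + 0.10, alpha);  // new exponential: always positive
  printf("alpha = %+.0f : linear = %6.1f   exponential = %6.1f\n", alpha, s_lin, s_exp);
}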

In general I would suggest using the default of a Gaussian constraint and the exponential style interpolation as it ensures the rates are physical. When you write it up in your paper you have some freedom to describe it as a LogNormal constraint or the exponential style interpolation.

The code for the interpolation of the normalization constants is here: http://root.cern.ch/viewcvs/branches/dev/roostats/roofit/histfactory/src/FlexibleInterpVar.cxx?view=markup The default interpolation code is 1 (in the code it's called log interpolation, whereas in this email I called it exponential interpolation; it depends on which way you think about alpha<->beta). The original linear interpolation strategy is code 0. Everything is actually done in a piece-wise way to deal with asymmetric errors. You can change the interpolation strategy with this script: $ROOTSYS/tutorials/histfactory/ModifyInterpolation.C (see documentation inside).

The remaining 2 chapters of this saga are:

Asymmetric errors and discontinuities in the likelihood function: Piecewise linear (original default) and piecewise exponential (new default) interpolation have discontinuous 2nd derivatives, which can cause problems for MINUIT. It is common at the Tevatron to use a quadratic interpolation for alpha in [-1,1] and linear extrapolation beyond that (option 3 in the code above). That is smooth, but can have some strange behaviors; for instance, if changing the JES by +1 (-1) sigma results in 10% (15%) more events, then the quadratic interpolation will say that the number of events is lower for some range of alpha. That's not very desirable. We have yet to find a good interpolation strategy based on 3 numbers that is somewhat principled, ensures physical rates, and is smooth.

Interpolation of shapes - Problems and future directions: Currently HistFactory interpolates shape systematics in a piece-wise linear way, vertically for each bin of your histogram. If you have asymmetric shape uncertainties, then this has the same discontinuity problem as above and can lead to negative histogram bins. If you use the exponential style interpolation (or any other non-linear interpolation) to protect against negative histogram bins, then you will have to renormalize (integrate) your shape every time your nuisance parameters change. That might slow things down a lot. It hasn't been tried, but clearly something needs to be done, b/c several analyses have convergence problems in MINUIT that are related to this issue. You can make it a little bit better by normalizing all your histograms so that the overall rate change is dealt with as an OverallSys with the exponential interpolation mentioned above.

I hope that is useful for anyone brave enough to read it.

Kyle