# 3. Diagnostics and checks for analysis workspaces

This section aims to familiarize the reader with the different types of checks that are commonly run when performing statistical analyses with profile likelihoods. In particular, these checks are usually very helpful for diagnosing the origin of fit issues that may be encountered, as described in other sections of this document.

Therefore, this section recommends the precise plots and checks that should be performed at various stages of a statistical analysis, and explains how to read them to extract useful information. In a second part, some guidance is given on dealing with issues in the parametrization of systematic uncertainties, once their origin has been found with the previous checks.

The Twiki page on Profile Likelihood Guidelines by the exotics group also contains very relevant material on this topic. In addition, it shows how to perform some of the diagnostics discussed here in practice in the most common fit packages.

Part of this guide is also extracted from the Guide to parameterized Likelihood Analyses.

## 3.1. Diagnostics and checks

### 3.1.1. Inspecting workspace contents

Understanding the contents of a workspace in detail is the most basic check one can do, but also a very powerful one for eliminating many common sources of issues. Especially in the setup phase of a new statistical analysis, many fit issues down the chain (large pulls, constraints…) are often cured simply by carefully hunting for bugs in the workspace contents.

The tools to be used for this task differ from case to case. If you are building the workspace
yourself, you can make plots and other checks after all the preprocessing is done, right before calling `hist2workspace`
or any other tool to create the workspace. If you are given an existing workspace, you will have to dump all the information
from the workspace itself.

In order of increasing complexity:

- check that the expected analysis regions are included, with the desired distribution boundaries and binning (for histogram-based workspaces)
- check that all samples, signals and backgrounds, are present in all regions
- dump prefit yield tables
- check that all systematics have been included as intended, and that the parameters of interest are correctly defined as well
- check that the normalization impact of systematics in each region and on each sample makes sense
  - It can be useful to draw 2D plots showing the normalization impact of a given systematic vs samples and vs regions to see if things are consistent
  - If you are applying case-by-case pruning of systematics, also check that it is working as intended
- check the shape impact of systematics, by plotting the +/-1 sigma shape variations
  - Given the number of nuisance parameters (NPs) in typical ATLAS analyses, the number of histograms can be daunting. You don’t have to go through each of them, but have them ready at hand as they can be extremely useful to understand features of the fit down the chain.
  - If you are smoothing systematics that are usually noisy (energy scale / energy resolution ones), check that the smoothing procedure is working as intended
  - If you are applying case-by-case pruning of systematics, also check that it is working as intended
  - Finally, it is possible to create 2D plots of a sample shape vs the value of a given NP. This makes it possible to examine the effect of the interpolation/extrapolation strategy (especially for analyses based on HistFactory) for values of the NP between or beyond +/-1 sigma.
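To illustrate the effect of the interpolation/extrapolation strategy, the sketch below evaluates the normalization response of a sample as a function of the NP value for two common choices, piecewise-linear and piecewise-exponential. The function names and the +/-1 sigma factors are illustrative assumptions, not taken from any specific workspace, and real fit packages offer further options.

```python
def piecewise_linear(alpha, up, down):
    """Normalization factor from straight lines through the +/-1 sigma factors."""
    if alpha >= 0:
        return 1.0 + alpha * (up - 1.0)
    return 1.0 + alpha * (1.0 - down)


def piecewise_exponential(alpha, up, down):
    """Normalization factor from exponential interpolation; always positive."""
    if alpha >= 0:
        return up ** alpha
    return down ** (-alpha)


# Hypothetical systematic: +10% / -8% normalization effect at +/-1 sigma.
up, down = 1.10, 0.92
for alpha in (-15.0, -3.0, -1.0, 0.0, 1.0, 3.0):
    lin = piecewise_linear(alpha, up, down)
    exp = piecewise_exponential(alpha, up, down)
    print(f"alpha={alpha:+6.1f}  linear={lin:+.3f}  exponential={exp:+.3f}")
```

Both choices agree at 0 and at +/-1 sigma by construction. They differ in between and beyond, and the linear form can even turn negative far from the nominal value, which is the kind of feature the 2D plots are meant to reveal.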

### 3.1.2. Investigating simple fits

Once you are confident that your workspace contains what you intended to put in, the first step of the statistical analysis is to perform simple fits (minimizations) in certain conditions.

#### 3.1.2.1. Which fit to perform ?

There is no general rule about what is *the correct* fit to perform, as it depends on the particularities of the analysis, especially
on the blinding strategy chosen. Some rules based on common sense can however be stated:

- do not perform a fit with a floating parameter of interest and including the signal regions before you unblind, unless you have taken specific action to precisely blind its value.
- for searches, look at one or a few typical points in the mass range your analysis is probing.
- in general, before unblinding you want to use a fit model as close as possible to the full (unblinded) fit in order to avoid surprises when unblinding.

Some more practical advice is given in the section Advice in the choice of the fit setup for diagnostics and checks below, and a few common cases are discussed.

#### 3.1.2.2. What to look for ?

Once the fit is performed, the first thing to check is that it converged correctly: both `MIGRAD` and `HESSE` should terminate with status 0.
Typically in a MINUIT logfile, you will find these lines:

```
RooFitResult: minimized FCN value: X, estimated distance to minimum: Y
covariance matrix quality: Full, accurate covariance matrix
Status : MINIMIZE=0 HESSE=0
```

A non-zero status for `MIGRAD` must always be fixed. Usually it turns out to be a problem with the technical implementation of the workspace
and can be debugged using the plots described in the previous section. A non-zero status for `HESSE` may not be an issue for computing the results
of an analysis, but it will complicate the extraction of some of the diagnostic plots.
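When many fits are run, this status check can be automated. The sketch below parses a status line in the format shown above with a regular expression. The helper names `parse_status` and `fit_converged` are hypothetical, and the exact log format may differ between fit packages and versions.

```python
import re


def parse_status(line):
    """Return e.g. {'MINIMIZE': 0, 'HESSE': 0} from a 'Status : ...' log line."""
    return {name: int(code) for name, code in re.findall(r"(\w+)=(-?\d+)", line)}


def fit_converged(line):
    """True only if at least one status is reported and all of them are 0."""
    statuses = parse_status(line)
    return bool(statuses) and all(code == 0 for code in statuses.values())


print(fit_converged("Status : MINIMIZE=0 HESSE=0"))
print(fit_converged("Status : MINIMIZE=0 HESSE=3"))
```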

Plots of the post-fit value of the nuisance parameters (often referred to as *pulls* in the
ATLAS jargon) should be produced, along with their associated error, typically obtained by calling `HESSE`. The correlation matrix between the nuisance
parameters should also be looked at. In statistical analyses with many nuisance parameters, it is useful to group them in categories
(experimental, modelling, theory…) and to make one plot per category. Similarly, for readability it is advised to create a
sub-correlation matrix that focuses only on the nuisance parameters that have a correlation larger than some given threshold
(e.g. 20%) with any other.
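A minimal sketch of such a reduction is given below, assuming the correlation matrix has already been extracted from the fit result as a list of lists. The function name, the parameter names and the numbers are invented for illustration.

```python
def reduced_correlation(names, corr, threshold=0.2):
    """Keep parameters whose |correlation| with at least one other exceeds threshold."""
    n = len(names)
    keep = [
        i for i in range(n)
        if any(abs(corr[i][j]) > threshold for j in range(n) if j != i)
    ]
    sub_names = [names[i] for i in keep]
    sub_corr = [[corr[i][j] for j in keep] for i in keep]
    return sub_names, sub_corr


# Hypothetical 4-parameter correlation matrix.
names = ["JES", "JER", "ttbar_norm", "lumi"]
corr = [
    [1.00, 0.05, -0.45, 0.01],
    [0.05, 1.00, 0.10, 0.02],
    [-0.45, 0.10, 1.00, 0.03],
    [0.01, 0.02, 0.03, 1.00],
]
sub_names, sub_corr = reduced_correlation(names, corr)
print(sub_names)  # ['JES', 'ttbar_norm']
```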

Note that the internal documentation of any parameterized likelihood analysis should always show the full list of fitted nuisance parameters and their uncertainties, as well as the correlation matrix.

The simple pull plots show whether, and to what extent, *in-situ* constraining of nuisance parameters happens in your measurement. Three classes of nuisance parameters can be identified.

- Nuisance parameters with an estimated uncertainty of \(\approx\) 1: this represents nuisance parameters for which the likelihood of the data does not have any significant sensitivity to the systematic effect, and the uncertainty is thus constrained by the auxiliary measurements.
- Nuisance parameters with an estimated uncertainty of < 1: this occurs when the measurement provides a tighter constraint on the allowed variation than the auxiliary measurements do. The parameters are then said to be *constrained* in situ by the measurement.
- Nuisance parameters with an estimated uncertainty of > 1: this often points to an underlying problem in the measurement such as incorrect normalisation or non-convergence. This can also happen in cases where the +1 sigma and -1 sigma variations are on the same side with respect to the nominal (a detailed explanation is available in this notebook). Except in some very specific circumstances, this should not happen, therefore *underconstraints* should always be checked carefully.

Gaining confidence in the soundness of the statistical model therefore requires understanding the origin of any significant deviation from 0 of the postfit value of the nuisance parameters, of any significant constraint (postfit error < 1), and of large (anti-)correlations in the correlation matrix. This understanding can be obtained by performing a number of tests of the fit model, and by comparing the fit results of different configurations. It is therefore useful to have a tool that can easily produce plots comparing different fit results.

Before moving on to the description of the checks that can be performed, it is important to stress that the significance of the pull of a nuisance parameter does not only depend on its postfit value. The postfit value of a nuisance parameter is the result of the balance between the value preferred by the data and the (usually Gaussian) constraint applied to it. If the data has some sensitivity to a nuisance parameter, it will constrain it, so it is expected that a nuisance parameter pulled away from 0 will also be somewhat constrained. Conversely, a nuisance parameter which is pulled without being constrained indicates that the data has little sensitivity to it but still prefers a non-zero value: this is quite unlikely to happen, and such cases should be thoroughly investigated. It can be shown that for nuisance parameters with a Gaussian constraint of width 1, the significance of the pull, i.e. the compatibility between the data and the constraint, is given by:
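In the notation assumed here, with \(\hat{\theta}\) the postfit value of the nuisance parameter and \(\sigma_{\hat{\theta}}\) its postfit uncertainty, the expression commonly quoted for this pull significance is \(\hat{\theta}/\sqrt{1-\sigma_{\hat{\theta}}^{2}}\). The sketch below evaluates it for two invented cases. The function name is hypothetical, and returning `None` for unconstrained parameters is a choice made for this illustration.

```python
import math


def pull_significance(theta_hat, sigma_hat):
    """theta_hat / sqrt(1 - sigma_hat^2) for a unit-width Gaussian constraint.

    Returns None when sigma_hat >= 1, where the expression is undefined.
    """
    if sigma_hat >= 1.0:
        return None
    return theta_hat / math.sqrt(1.0 - sigma_hat ** 2)


# Same postfit value, very different significance:
print(pull_significance(0.5, 0.60))  # constrained NP: about 0.6 sigma
print(pull_significance(0.5, 0.98))  # barely constrained NP: about 2.5 sigma
```

The second case shows why a pull without an accompanying constraint deserves attention: the same shift of 0.5 is much less compatible with the constraint when the data has almost no sensitivity to the parameter.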

#### 3.1.2.3. Investigations from comparisons of fit results

A first check to be done is the comparison of the fit results between the fit to the data, and the same fit performed against an Asimov dataset. In well-behaved cases, the observed constraints on the nuisance parameters should be very close to the expected constraints from the Asimov fit.

A tighter constraint on a nuisance parameter in the data fit (smaller post-fit errors) is usually a sign of a tension in the fit, i.e. that different regions in the fit try to pull the nuisance parameter in different directions. Decorrelation tests should then be performed to understand the origin of this constraint (see below). There is however a very common special case: when the tighter constraints concern energy scale or energy resolution nuisance parameters, such as JES or JER. As the evaluation of those systematics results in moving events in the fitted distributions, the obtained +/-1 sigma templates can be noisy, with some bins being outliers. The fit can use these noisy variations to accommodate various fluctuations in the data, with the end result being the artificial constraints observed. A proper smoothing of the +/-1 sigma templates at the time of the creation of the workspace usually solves the problem.
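The comparison between observed and expected constraints can be scripted. The sketch below flags nuisance parameters whose postfit error in the data fit is smaller than in the Asimov fit by more than a relative tolerance. The function name, the 10% tolerance and all numbers are assumptions chosen for illustration.

```python
def flag_tighter_constraints(data_errors, asimov_errors, tolerance=0.1):
    """Return {name: (observed, expected)} for NPs constrained more in data than expected."""
    flagged = {}
    for name, expected in asimov_errors.items():
        observed = data_errors[name]
        if observed < (1.0 - tolerance) * expected:
            flagged[name] = (observed, expected)
    return flagged


# Hypothetical postfit errors from the two fits.
asimov = {"JES": 0.95, "JER": 0.90, "ttbar_norm": 0.60}
data = {"JES": 0.93, "JER": 0.65, "ttbar_norm": 0.58}
print(flag_tighter_constraints(data, asimov))  # {'JER': (0.65, 0.9)}
```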

If the analysis combines several channels, it is usually useful to compare the results of the fit performed on each individual channel to the combined fit results. Similarly if the analysis has dedicated CRs, the CR-only fit should be compared to the SR+CR fit. The comparison of the constraints and of the postfit values gives some understanding of the compatibility between the different analysis regions, and of the origin of the sensitivity to some nuisance parameters.

To investigate a pull on one specific nuisance parameter, it is recommended to perform the main fit with that nuisance parameter decorrelated between the different fit regions or between the different samples. This makes it possible to check efficiently for tensions in the fit on one parameter, while the others are correlated as usual.

The thoroughness of the investigations of the pulls and constraints of the nuisance parameters should be proportional to the impact of systematic uncertainties in the final result: a measurement or a search whose sensitivity is dominated by systematics requires an excellent control of their impact.

#### 3.1.2.4. When should you be happy with the profiling of nuisance parameters ?

When performing a profile-likelihood-based statistical analysis with a model consisting of several regions, some of which have large event yields, it is quite likely that some of the nuisance parameters are profiled significantly (pulled and/or constrained). Any significant pull or constraint shows that the data has enough sensitivity to give information on the nuisance parameter beyond the initial prior constraint. The checks described above should be sufficient to understand precisely the origin of the pull or constraint, but the question of its legitimacy is analysis dependent, and has to be answered with good physics judgement.

To help in this process, the Guide to parameterized likelihood analyses by Wouter Verkerke proposes a classification of the nuisance parameters depending on their source:

- *Good* systematic uncertainties have a clear cause and a clear evaluation strategy.
- *Bad* systematic uncertainties have a clear cause but no clear evaluation strategy.
- *Ugly* systematic uncertainties have no clear cause and no clear evaluation strategy.

and discusses the merits and pitfalls of profiling for the different cases. Some advice is also given in section Dealing with Bad and Ugly systematics at the end of this document.

### 3.1.3. Postfit plots and tables

Postfit plots and yield tables are an important part of the presentation of the results in the publication. They are also very useful checks that the profile likelihood is giving reasonable results. In particular, it is possible to make postfit plots of variables or regions other than the ones fitted, by propagating the pulls of the nuisance parameters to them:

- If your analysis has defined Validation Regions (VR), make postfit plots in them
- Make postfit plots of variables important to your analysis in your Signal Regions, other than the final discriminant that has been fitted (for instance variables on which you have applied cuts, or input variables of a multivariate discriminant)

Checking the data modelling by the fit model in these plots helps cross-check that the postfit values of the nuisance parameters make sense.

Note: while metrics exist to quantify the quality of the fit (goodness-of-fit tests), to our knowledge there is no simple and rigorous way to quantify the postfit agreement (i.e. to define a meaningful p-value) for a single analysis region inside a combined fit (because nuisance parameters correlate all regions), nor to quantify the postfit agreement of variables other than the one fitted (because of the statistical correlation of the data). Chi-2 like metrics can be defined and may be useful to compare the agreement between different fit models, for instance, but in order to quantify the agreement the p-value should be calibrated.

### 3.1.4. Likelihood profiles of nuisance parameters

To understand the behaviour of nuisance parameters for which the simple checks are not sufficient to find the cause, it can be useful to plot the profiled negative log-likelihood value as a function of the value of the nuisance parameter.

In well-behaved cases, this should exhibit a nice parabolic shape around the minimum. Bad features that may appear are typically two close-by minima, or kinks (non-smooth behaviour) around the minimum. In these cases, one should understand whether there is a physics reason behind the feature, or whether it is caused by technical issues in the fit model: typically a bad or spurious variation of a template for this systematic, or a bad choice of interpolation strategy. That can be checked by looking at the +/-1 sigma plots for this systematic, as well as the 2D ‘morphing’ plots mentioned in section Inspecting workspace contents.
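Once the scan points are available, a quick automated check for multiple minima is straightforward. The sketch below counts the interior local minima of a scanned profile. It assumes the scan values have already been obtained from the fit package, and the two toy profiles are invented shapes, not fit results.

```python
def local_minima(nll_values):
    """Indices of strict interior local minima in a list of scanned NLL values."""
    return [
        i for i in range(1, len(nll_values) - 1)
        if nll_values[i] < nll_values[i - 1] and nll_values[i] < nll_values[i + 1]
    ]


# NP values scanned from -2 to +2 in steps of 0.5.
theta = [0.5 * k for k in range(-4, 5)]
parabola = [t ** 2 for t in theta]                # well-behaved profile
double_well = [(t ** 2 - 1) ** 2 for t in theta]  # two close-by minima

print(local_minima(parabola))     # [4]
print(local_minima(double_well))  # [2, 6]
```

Kinks are harder to detect automatically and are usually easier to spot by eye on the plot itself.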

### 3.1.5. Impact of nuisance parameters on a measurement

Understanding the impact of systematics on a measured parameter of interest is of great value both for the presentation of the results and for the understanding of the soundness of the fit model.

#### 3.1.5.1. Impact of individual nuisance parameters

A simple way to evaluate the impact of a given nuisance parameter is to fix its value to +/-1 sigma of its postfit error (as evaluated with
`MINOS` or a similar technique), then check the value of the parameter of interest after profiling the likelihood with this fixed nuisance
parameter value.

When repeating this procedure for each nuisance parameter, and displaying the results ordered by the impact on the parameter of interest,
one obtains the so-called *ranking plot*. It is obviously quite useful to learn which are the most important nuisance parameters in an analysis:
this makes it possible to pay close attention to those parameters when performing the investigation of pulls described above in section Investigations from comparisons of fit results.

One word of caution: it may be tempting to limit the investigations of the pulls and constraints only to the nuisance parameters appearing
in the top ten of the ranking, and argue that the others don’t matter much. This reasoning can however be wrong, especially in the presence of
artificial constraints: after fixing them, the affected nuisance parameters can move significantly up in the ranking plot. To check which parameters
could have a significant impact if they were not constrained, the *prefit* ranking plot can be produced, where the nuisance parameters are fixed
to +/-1 sigma of their *prefit* uncertainty (which means +/-1 for all the parameters that have the usual Normal constraint).
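The ordering step itself is simple once the impacts have been computed by the fits. The sketch below ranks parameters by the larger of their up and down impacts, and compares a postfit and a prefit ranking. The function name and all impact values are invented for illustration.

```python
def rank_by_impact(impacts):
    """Sort NP names by max(|up impact|, |down impact|) on the POI, largest first."""
    return sorted(
        impacts,
        key=lambda name: max(abs(impacts[name][0]), abs(impacts[name][1])),
        reverse=True,
    )


# Hypothetical (up, down) impacts on the parameter of interest.
postfit = {"JES": (0.04, -0.05), "JER": (0.01, -0.01), "ttbar_norm": (-0.08, 0.07)}
prefit = {"JES": (0.05, -0.06), "JER": (0.09, -0.10), "ttbar_norm": (-0.12, 0.11)}

print(rank_by_impact(postfit))  # ['ttbar_norm', 'JES', 'JER']
print(rank_by_impact(prefit))   # ['ttbar_norm', 'JER', 'JES']
```

In this toy case the JER parameter is last in the postfit ranking but second in the prefit one, the pattern expected for a strongly constrained parameter that would matter if the constraint turned out to be artificial.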

The ranking plot can differ quite a bit before and after unblinding an analysis. In the case where the parameter of interest is measured to be significantly different from its expected value, some changes are to be expected, especially between nuisance parameters that affect the signal and nuisance parameters that affect the backgrounds. At first order:

- A nuisance parameter that affects the signal will always keep the same relative impact, whatever the measured value of the parameter of interest. For instance, a signal theory uncertainty of 5% will always have an impact of 5% on the parameter of interest.
- A nuisance parameter that affects the background has a relative impact that scales inversely with S/B, and will therefore change depending on the measured signal strength. For instance, if a background normalization has an expected impact of 10% on the parameter of interest, and the parameter of interest is measured to be twice its expected value, then the post-unblinding impact will be around 5%.
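These two scaling rules can be checked on a toy single-bin counting experiment, where the signal strength is estimated as (N - B)/S. The sketch below is a simplified illustration with invented yields, not a profile likelihood fit.

```python
def fitted_mu(n_obs, signal, background):
    """Signal strength estimate in a single-bin counting experiment."""
    return (n_obs - background) / signal


def relative_impact(mu_true, signal=50.0, background=100.0,
                    sig_shift=0.0, bkg_shift=0.0):
    """Relative change of the fitted mu when signal or background is shifted."""
    n_obs = mu_true * signal + background
    nominal = fitted_mu(n_obs, signal, background)
    shifted = fitted_mu(n_obs, signal * (1 + sig_shift), background * (1 + bkg_shift))
    return (shifted - nominal) / nominal


for mu in (1.0, 2.0):
    bkg = relative_impact(mu, bkg_shift=0.05)  # 5% background normalization shift
    sig = relative_impact(mu, sig_shift=0.05)  # 5% signal yield shift
    print(f"mu={mu}: background impact {bkg:+.1%}, signal impact {sig:+.1%}")
```

With these numbers the background impact goes from 10% at mu=1 to 5% at mu=2, while the signal impact stays close to 5% in both cases.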

#### 3.1.5.2. Impact of groups of nuisance parameters

When performing a measurement, it is common practice to report the uncertainty broken down into separate components: statistics, experimental uncertainties, theory uncertainties, MC statistics… This is also quite interesting as a complementary method to the ranking plot, to understand which parts of the fit model matter most in the result.

One difficulty arising in profile likelihood analyses is that all nuisance parameters tend to be correlated postfit. This implies that there is no unique way to compute the various components of the uncertainty, as different methods treat the correlations between groups of nuisance parameters differently. The existence of these correlations also means that in general the quadratic sum of the components will differ from the total error. This should be made explicit in the paper.

The statistical uncertainty is defined as the uncertainty on the parameter of interest computed when all nuisance parameters are fixed to their profiled value. Note that this definition is itself debated, as Standard Model measurements tend to quote the uncertainty coming from Poisson variations of the signal yield.

The recommended way to obtain the contribution from a group of nuisance parameters is to subtract in quadrature the uncertainty on the parameter of
interest computed when fixing the values of these nuisance parameters to their profiled values, from the total uncertainty. This procedure should be done
separately for the positive and negative errors, with the uncertainty on the parameter of interest computed with `MINOS`.
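The quadrature subtraction can be sketched as follows, assuming the total uncertainty and the uncertainty with the group fixed have already been obtained from the fits. The function name and the numbers are illustrative, and clamping slightly negative differences to zero is a choice made for this sketch.

```python
import math


def group_impact(total_error, error_with_group_fixed):
    """Quadrature difference sqrt(total^2 - fixed^2) for one side of the error."""
    diff = total_error ** 2 - error_with_group_fixed ** 2
    # Numerical noise can make the difference slightly negative; clamp to zero.
    return math.sqrt(max(diff, 0.0))


# Hypothetical errors on the parameter of interest (positive and negative sides).
total_up, total_down = 0.25, 0.22
fixed_up, fixed_down = 0.20, 0.19

impact_up = group_impact(total_up, fixed_up)        # about 0.15
impact_down = group_impact(total_down, fixed_down)  # about 0.11
print(f"group impact: +{impact_up:.3f} / -{impact_down:.3f}")
```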

When applied on a single nuisance parameter acting on backgrounds, this definition gives (at first order) the same impact as the one obtained in the ranking plot.

In the case of signal systematics, the degeneracy between a signal systematics nuisance parameter and the parameter of interest tends to lead to asymmetric results (difference between the positive and negative errors), even when the input uncertainties are symmetric. In this case, it may be appropriate to quote the average of the positive and negative errors, as the asymmetry is coming from the procedure used to compute the impact.

#### 3.1.5.3. Recommendations for searches

The previous paragraphs are very much focused on measurements, but evaluating the impact of nuisance parameters in searches is still relevant, all the more so for searches for low S/B signals in high-statistics regions, where systematics will probably play a significant role.

If the search scans through a relatively large phase space where S/B or background composition vary significantly, then a few reference points should be looked at.

For each point, a ranking plot can be produced. Given the dependence of the ranking plot on the signal strength, a reasonable signal cross-section should be chosen before unblinding: for instance the cross-section at the expected limit.

If computing time is not an issue, the impact of groups of nuisance parameters can be evaluated on the limit itself instead of on the uncertainty in the parameter of interest, using the same methodology as above (fixing all parameters under study to their profiled value). Otherwise, the uncertainty in the parameter of interest can be used, provided that the signal is scaled to a reasonable cross-section, as in the ranking plot case.

## 3.2. What to do in case of problems

### 3.2.1. Improving the parameterization of systematic uncertainties

Some care should be taken in understanding the effect of simplified modeling of systematic uncertainties even if no in-situ constraining occurs. The extent to which some under- or overestimation of a systematic uncertainty occurs depends on the analysis. Generally, for all systematic uncertainties which are not dominant, a simple but slightly conservative evaluation strategy can be well justified.

For compound systematic uncertainties (those with multiple sources of uncertainty, each with a distinct distortion of the distribution being fitted, such as JES) that are in-situ constrained, splitting the uncertainty into its components in the PLL fit should be considered, as this is generally more conservative (realistic) than modeling with a single component.

For calibration-type systematic uncertainties that are expected to exhibit an unknown degree of variation over the detector phase space, more than one nuisance parameter may be needed. The optimal number depends on the expected correlations of the systematic uncertainty over the phase space. A good starting point may be a breakdown provided by a performance group, but this is not necessarily optimal and parameterization choices should be discussed with the appropriate performance groups using the expected correlation matrix over the phase space as a guideline.

For certain analyses it is possible to use a simplified nuisance parameter model even for in-situ constraining of systematic uncertainties. Such scenarios usually arise in the domain of nuisance parameters that map phase-space variations of calibration systematic uncertainties. For example, if an analysis is only sensitive to JES in a very limited region of \(p_\mathrm{T}\), a single parameter may be sufficient to effectively describe the impact of JES on the analysis, even if the generic JES uncertainty has many degrees of freedom. To determine whether an analysis is sensitive to a sufficiently small range of the phase space to be able to effectively describe the uncertainty with a single parameter, one can rerun the analysis with a series of templates that implement different systematic uncertainty variations as a function of \(p_\mathrm{T}\): for example a flat variation, a variation with a JES increasing linearly with \(p_\mathrm{T}\), and a variation with a JES decreasing linearly with \(p_\mathrm{T}\). If these alternative JES descriptions have little effect on the measurement of the parameter of interest, one may conclude that the effective sensitivity to JES is in such a limited \(p_\mathrm{T}\) range that a simplified model can be justified.

If no obvious breakdown exists for in-situ constrained systematic uncertainties that are a priori expected to vary strongly over phase space, or the analysis is expected to be effectively sensitive only to a small fraction of the parameterized phase space, consider doing cross-checks where the effect of the systematic uncertainty across phase space is artificially varied (e.g. instead of 5% flat vs \(p_\mathrm{T}\), make a toy model where the effect is varied between 2% and 8% as a linear function of \(p_\mathrm{T}\)). If such a model results in very different in-situ constrained uncertainties, the shape of the constraint is clearly important and a simplified shape in the parameterized likelihood fit is possibly dangerous.
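A toy variation of this kind can be written in a few lines. The sketch below returns the relative size of the systematic effect as a linear function of \(p_\mathrm{T}\), going from 2% to 8%. The function name and the \(p_\mathrm{T}\) range of 20 to 200 GeV are assumptions made for illustration.

```python
def linear_variation(pt, pt_min=20.0, pt_max=200.0, low=0.02, high=0.08):
    """Relative systematic effect rising linearly from `low` to `high` over [pt_min, pt_max]."""
    frac = (pt - pt_min) / (pt_max - pt_min)
    frac = min(max(frac, 0.0), 1.0)  # hold the endpoint values outside the range
    return low + frac * (high - low)


def flat_variation(pt, size=0.05):
    """Reference model: the same effect at every pT."""
    return size


for pt in (20.0, 110.0, 200.0):
    print(f"pT={pt:5.0f} GeV: flat {flat_variation(pt):.0%}, linear {linear_variation(pt):.0%}")
```

Swapping `low` and `high` gives the decreasing variant. The resulting factors would then be used to build alternative templates for the cross-check fits.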

Note that a proper description of the observed data distribution by the template model is insufficient proof in itself that a simplified model is justified.

A common technique to avoid in-situ constraining is to artificially break the nuisance parameter into N nuisance parameters corresponding to different regions of phase space, choosing N sufficiently large so that each region has insufficient statistical power to constrain the uncertainty from the data. When pursuing this ‘artificial breaking’ strategy for in-situ constrained systematic uncertainties, it is important to demonstrate that the N components that are created represent uncorrelated systematic uncertainty contributions, as they are treated that way in the parameterized likelihood. If they do not represent uncorrelated components, neglecting the source correlations may result in an underestimation of the systematic uncertainty, even when each is evaluated at its nominal magnitude. The validity of this approach is strongly dependent on the choice of the phase space regions. If they are chosen badly, this approach can underestimate the systematic uncertainty significantly.
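The size of this underestimation can be illustrated with simple arithmetic. In the sketch below, a systematic shifts the yield of four phase-space regions by 5 events each. Treating the four components as uncorrelated adds them in quadrature, whereas a fully correlated source adds them linearly. The numbers and the function name are invented for illustration.

```python
import math


def combined_uncertainty(contributions, correlated):
    """Combine per-region yield shifts linearly (fully correlated) or in quadrature."""
    if correlated:
        return abs(sum(contributions))
    return math.sqrt(sum(c ** 2 for c in contributions))


# Hypothetical shift of 5 events in each of four regions (e.g. 5% of 100 events).
contributions = [5.0, 5.0, 5.0, 5.0]

print(combined_uncertainty(contributions, correlated=True))   # 20.0
print(combined_uncertainty(contributions, correlated=False))  # 10.0
```

If the underlying source is in fact correlated across the regions, the split model halves the total uncertainty on the summed yield in this example.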

### 3.2.2. Dealing with Bad and Ugly systematics

As mentioned in section When should you be happy with the profiling of nuisance parameters ?,
*Bad* systematic uncertainties have a clear cause but no clear evaluation strategy. *Ugly* systematic uncertainties have no clear cause
and no clear evaluation strategy. In general, constraining such nuisance parameters in a profile-likelihood fit should be avoided.
If such uncertainties dominate the total systematic uncertainty, one should strongly consider either redesigning the analysis, or investing
additional effort in understanding these systematics to the level where they can be expressed in terms of ‘good’ uncertainties.
There can be exceptions to this rule, but they are very analysis dependent and should be well justified.

## 3.3. Advice in the choice of the fit setup for diagnostics and checks

The choice of the right fit setup to use depends on the information that one expects to get out of this fit.

### 3.3.1. Basic fit investigations

As explained previously, one typically first wants to understand the fit model in terms of pulls and constraints prior to unblinding. The goal is then to use a fit setup as close as possible to the final one (that will be used to extract the signal), in a way that allows the backgrounds to be firmly understood, without being impacted by the actual value of the signal strength (since that would compromise the understanding of the background fit and give some hint of the unblinded value at the same time).

The most typical cases are therefore:

- if you expect a significant signal contribution in only a few bins (narrow resonance; signal present in only a few high BDT score bins; well-defined signal and control regions), it may be sufficient to remove only those few bins from the fit. This applies both when those bins form a separate signal region, and when they are part of a larger analysis category.
- for analyses with low S/B and where a signal is expected (measurements of rare SM processes…), it may be fine to fit your full distributions, but forcing mu=1.
- for searches with low S/B, it may be fine to fit your full distributions, but forcing mu=0. Indeed, at this stage the goal is only to understand the background model. You can check on Asimov data how much the nuisance parameters are pulled if you inject a signal at a cross-section equal to the previously published limit in the same channel, then fit with mu=0. If the pulls are very small, it means you can safely assume mu=0 in a fit to data: any significant pull will be related to the background composition, and cannot be explained by the signal.

### 3.3.2. “Postfit” expected results

When the fitted background is significantly different from the one expected from Monte Carlo, one often wants to evaluate the expected results based on the postfit background estimation, but still before unblinding. Compared to the previous case, the signal-enriched bins must therefore always be present in the fit, and the signal strength will be left floating to evaluate either expected limits, or significances, or impacts of systematics… Before unblinding, no real data can be used in the signal-rich bins, so an Asimov dataset should be used.

When the background studies show that the difference to the Monte Carlo concerns mainly normalizations, it may be simpler to change the cross-sections in the Monte Carlo and evaluate the expected results based on the nominal Asimov data built from these rescaled Monte Carlo datasets.

#### 3.3.2.1. Case of hybrid Asimov+data datasets

In other cases, one typically resorts to hybrid data-Asimov datasets, where Asimov data are used in the blinded bins, and data elsewhere. In these cases, care should be taken to build the hybrid dataset correctly. The goal is that the pulls of a simple fit do not change between the background fit (that includes only the bins not blinded) and the full fit (that includes the Asimov data in the blinded bins). To achieve this result, the Asimov data in the blinded bins should be built exactly using the pulls from the background fit: the nominal Asimov dataset from prefit Monte Carlo should never be used.

Indeed, when building the Asimov data this way the best fit from the fit to the Asimov data (not including the constraint terms) is by construction the same as the fit to the data in the bins not blinded, including the constraint terms. Therefore, leaving aside some possible numerical inaccuracies, the best fit of the combined fit, which is Asimov data in blinded bins, data in bins not blinded, and the constraint terms, will be the same.
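The construction of such a hybrid dataset can be sketched as follows for a binned analysis. The function name and all yields are invented. The `postfit_expectation` values stand for the expected yields in each bin evaluated at the nuisance parameter values obtained from the background fit, which are assumed to have been computed beforehand with the fit package.

```python
def build_hybrid_dataset(data, postfit_expectation, blinded_bins):
    """Observed counts in unblinded bins, postfit expectation in blinded bins."""
    return [
        postfit_expectation[i] if i in blinded_bins else data[i]
        for i in range(len(data))
    ]


# Hypothetical 5-bin distribution; the last two bins are blinded (data not inspected).
data = [120, 95, 60, None, None]
postfit_expectation = [118.2, 97.1, 58.4, 30.5, 12.3]

hybrid = build_hybrid_dataset(data, postfit_expectation, blinded_bins={3, 4})
print(hybrid)  # [120, 95, 60, 30.5, 12.3]
```

Refitting this hybrid dataset should then reproduce the pulls of the background fit, up to numerical precision, which is a useful closure test of the construction.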