How does normalisation work in Progenesis QI?
Background
Normalisation is required in differential experiments to calibrate data between different sample runs. This corrects for factors that result in experimental or technical variation when running samples. For example, a common factor to correct for would be sample loading. The effect of these systematic errors can be modelled by a unique gain factor for each sample. The gain factor is represented by a scalar multiplier that is applied to each ion abundance measurement. The key underlying assumption is that most compound ions do not change in abundance (the abundance distributions do not alter globally), and hence recalibration to globally adjust all runs to be on the 'same scale' is appropriate.
The scalar factor required can be represented as $\alpha_k$ for each sample:
$$y'_i = \alpha_k \, y_i$$
where $y_i$ is the measured abundance of compound ion $i$ on sample $k$, $\alpha_k$ is the scalar factor for sample $k$, and $y'_i$ is the normalised abundance of compound ion $i$ on sample $k$.
However, there are several means by which this scalar can be estimated, even within the parameters of the key underlying assumption (that overall, the distributions do not change).
By default, Progenesis QI uses ratiometric data in log space, along with a median and median absolute deviation (MAD) outlier filtering approach, to calculate the scalar factor. This is a robust approach, which is less influenced by noise in the data and by any biases owing to abundant species, as the absolute values of abundance are disregarded.
This default method is referred to as Normalise to all compounds at the Normalisation Method tab in the Review normalisation window at the Peak Picking stage of the workflow, and will be applied if no changes are made to the selection. There are several alternative methods available which will be covered later.
Process - Normalise to all compounds
Normalisation reference (the 'target')
One run is automatically selected as the normalisation reference (see detailed note [1]). Note that this may well not be the same as the alignment reference. Also, once selected, this run will not be re-evaluated if you add more runs to the experiment (so that your existing normalised data will not be altered).
Log10 ratio calculation
Because of the accurate alignment and aggregate co-detection, every run has a reading for all compound ions. Hence, for every run, a ratio can be taken for the value of the compound ion abundance in that run to the value in the normalisation reference:
$$R_{i,x} = \frac{Ab_{i,x}}{Ab_{i,NR}}$$
where $R_{i,x}$ is the ratio of the abundance of compound ion $i$ in run $x$ ($Ab_{i,x}$) to the abundance of compound ion $i$ in the normalisation reference $NR$ ($Ab_{i,NR}$).
However, such ratiometric data follow a skewed distribution (a 2-fold increase gives 2 and a 2-fold decrease gives 0.5; 3-fold changes give 3 and 0.33, etc.). To obtain a distribution that treats both directions equally, a log transformation is applied, which yields an approximately normal distribution. Progenesis QI carries out this transformation (base 10) on all ratio data within each run, and for all runs, to generate a series of approximately normal distributions. At this stage, these are offset: because of scalar differences in signal, the ratios will not centre on 1 (and the log ratios will not centre on zero), as they would if there were no global shift in the signal. This is the shift that must be addressed.
Note that this ratio calculation removes the influence of absolute abundance from the process, which is a major advantage over total-abundance-based methods.
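As an illustration only, the ratio and log steps can be sketched in a few lines of Python. The function name `log10_ratios` and the sample abundances are hypothetical and not part of Progenesis QI; skipping zero abundances mirrors detailed note [3].

```python
import math

def log10_ratios(run, reference):
    """Return log10(run[i] / reference[i]) for each compound ion i.

    Pairs where either abundance is zero are skipped, since the ratio
    (or its logarithm) would be undefined.
    """
    return [
        math.log10(a / r)
        for a, r in zip(run, reference)
        if a > 0 and r > 0
    ]

# Hypothetical abundances: three ions doubled, one ion halved.
reference = [100.0, 200.0, 400.0, 800.0]
run = [200.0, 400.0, 800.0, 400.0]
logs = log10_ratios(run, reference)

# In log space, a 2-fold increase and a 2-fold decrease are symmetric
# about zero, unlike the raw ratios 2 and 0.5.
print(logs)
```

Only the ratios enter the calculation, so an ion at abundance 100 and an ion at abundance 800 carry the same weight.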
Scalar estimation in log space
The next step is to centre the log10 ratio distributions onto that of the normalisation reference in each case. This is achieved by simply adding or subtracting the value required to shift the sample distribution over that of the normalisation reference. This additive or subtractive shift in log space is, of course, a multiplicative scalar in the sample abundance space.
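Written out, with $s_x$ denoting the log-space shift for run $x$ (a symbol introduced here purely for illustration), subtracting the shift from every log ratio is equivalent to multiplying every abundance in that run by $10^{-s_x}$:

```latex
\log_{10} R'_{i,x} = \log_{10} R_{i,x} - s_x
\quad\Longleftrightarrow\quad
Ab'_{i,x} = 10^{-s_x}\, Ab_{i,x}
```

Here $R'_{i,x} = Ab'_{i,x} / Ab_{i,NR}$ is the ratio after normalisation; the reference abundances are unchanged.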
There is a second improvement over traditional methods applied in this step. The median and the median absolute deviation (MAD) are used to estimate the centre and spread of the ratio distribution; this allows outlying ratio values to be filtered out so that they do not perturb the results. This process is carried out iteratively, to robustly remove the influence of outliers. See detailed notes [2] and [3].
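A minimal sketch of this kind of filtering is given below, using the limits from detailed note [2] (median ± 3 × 1.4826 × MAD). The function name, the stopping rule (repeat until no further values are rejected) and the iteration cap are assumptions; the exact iteration scheme used by the software is not described here.

```python
import statistics

def robust_mean_sd(log_ratios, max_iter=10):
    """Mean and SD of log ratios after median/MAD outlier rejection.

    Assumption: filtering is repeated until no further values are
    rejected, or until max_iter passes have been made.
    """
    kept = list(log_ratios)
    for _ in range(max_iter):
        med = statistics.median(kept)
        mad = statistics.median([abs(v - med) for v in kept])
        sigma = 1.4826 * mad  # MAD scaled to estimate the standard deviation
        lower, upper = med - 3 * sigma, med + 3 * sigma
        filtered = [v for v in kept if lower <= v <= upper]
        if len(filtered) == len(kept):  # nothing more to reject
            break
        kept = filtered
    mean = statistics.fmean(kept)
    sd = statistics.stdev(kept) if len(kept) > 1 else 0.0
    return mean, sd

# Hypothetical log10 ratios: six values near 0.30 and one clear outlier.
log_ratios = [0.30, 0.31, 0.29, 0.30, 0.32, 0.28, 1.50]
mean, sd = robust_mean_sd(log_ratios)
scalar = 10 ** (-mean)  # convert the log shift back to a multiplier
print(mean, sd, scalar)
```

In this example the value 1.50 falls outside the limits and is discarded, so the estimated shift stays close to 0.30 rather than being pulled upwards by a single ion.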
Scalar application
Once the scalar has been derived in log space and then returned to an ‘abundance-space ratio’, it can be applied to all values in the sample run being normalised, and this completes the process.
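For illustration, applying a derived shift to a run might look like the following sketch. The function name `normalise_run` and the numbers are hypothetical; the conversion 10^(-shift) follows detailed note [2].

```python
import math

def normalise_run(abundances, log_shift):
    """Scale every abundance in a run by 10 ** (-log_shift)."""
    scalar = 10 ** (-log_shift)
    return [scalar * a for a in abundances]

# Hypothetical run whose log10 ratios centre on log10(2), i.e. the run
# reads about twice as high as the normalisation reference.
run = [200.0, 400.0, 800.0]
normalised = normalise_run(run, math.log10(2))
print(normalised)
```

Every ion in the run receives the same multiplier, so relative abundances within the run are preserved.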
Illustrations
The process as visualised in the software is shown below.
Select Review normalisation at the Peak Picking stage:
This will bring up the Review normalisation screen. The normalisation reference is indicated, and the log10 mean distribution shifts (the log10 scalars) are indicated in the table for each run.
Detail for each run is shown on the right in the Normalisation Graphs tab, and the graph sizes can be adjusted using the slide-scale at the bottom of the page. The log abundance ratio is shown for each compound ion (ordered by ascending normalisation reference run abundance). The mean log abundance ratio and the robust estimation limits are shown as solid and dashed red lines, respectively. Hovering over a point will summon a tooltip with more information on that compound ion in that run and the normalisation reference, and hovering over the mean and robust estimation lines will provide more detail on their values.
For the run above, the red arrow represents the normalisation shift required in log space, and the yellow arrows highlight the robust estimation limits.
Detailed notes
[1] The normalisation reference is found by treating each sample as a putative normalisation reference, and calculating the robust standard deviation estimate (note [2]) for every other run. This set of standard deviation estimates is then used to calculate a pooled variance for each potential normalisation reference. The sample with the lowest pooled variance is used as the normalisation reference.
The pooled variance is not a measure of the 'total ratio distance' between one sample and all others, so that the scalar ratios you see may not be evenly 'up' and 'down'. Rather, it is a measure of how consistent its difference from all the other samples is across all the compound ions. A sample that has a consistent scalar shift across all its analytes relative to another will introduce the minimum possible propagated error across all the analytes when they are scaled together (the scalar introduced in normalisation will be more accurate for more of the compound ions, as more ratios are nearer to the mean value).
[2] Median and Median Absolute Deviation (MAD) are used for outlier rejection before the robust mean and standard deviation estimates are calculated.
The upper and lower outlier limits are calculated as:
Median + 3 * (1.4826 * MAD)
Median - 3 * (1.4826 * MAD)
where 1.4826 * MAD is an estimate of the standard deviation.
The robust mean and standard deviation measures are then the mean and standard deviation of all of the log(abundance ratios) falling within these limits. The robust mean is used to calculate the normalisation scaling factor. This factor is 10^(-mean), converting back from a log shift to a scalar multiplier. The robust standard deviation is also used in the normalisation reference selection (as per [1] above).
[3] Compound ions with an abundance of 0 in either the run being normalised or the normalisation reference are not included in the calculation.
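The reference selection described in note [1] can be sketched as follows. This is an illustration under stated assumptions rather than the software's implementation: the pooled variance is taken here as the unweighted mean of the squared robust standard deviations, the outlier filter is applied in a single pass, and all names and abundances are hypothetical.

```python
import math
import statistics

def robust_sd(values):
    """SD of the values within median +/- 3 * 1.4826 * MAD (single pass)."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    limit = 3 * 1.4826 * mad
    kept = [v for v in values if abs(v - med) <= limit]
    return statistics.stdev(kept) if len(kept) > 1 else 0.0

def pick_reference(runs):
    """Return the index of the run with the lowest pooled variance."""
    best_index, best_pooled = None, math.inf
    for c, candidate in enumerate(runs):
        variances = []
        for x, run in enumerate(runs):
            if x == c:
                continue
            # Zero abundances are skipped, as in note [3].
            logs = [math.log10(a / r)
                    for a, r in zip(run, candidate) if a > 0 and r > 0]
            variances.append(robust_sd(logs) ** 2)
        pooled = statistics.fmean(variances)  # assumption: equal weights
        if pooled < best_pooled:
            best_index, best_pooled = c, pooled
    return best_index

# Hypothetical runs: the middle run differs least consistently... from neither
# neighbour, while the outer two differ from each other by twice as much.
runs = [
    [110.0, 200.0, 440.0, 800.0],
    [100.0, 200.0, 400.0, 800.0],
    [90.0, 200.0, 360.0, 800.0],
]
reference_index = pick_reference(runs)
print(reference_index)
```

The selected run is the one whose log ratios to the other runs are most tightly clustered, regardless of how large the average shift to those runs is.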
Alternative methods
Along with the Normalisation Graphs tab, one can also select the Normalisation Method tab at the Review normalisation screen. This provides access to several alternatives to the default (Normalise to all compounds is the default, as described previously).
Normalise to a set of housekeeping compounds
This option allows you to normalise to either a 'spike' or to compounds you reasonably expect to be unchanging in the sample. It may be of benefit where the standard key assumption (that most compound ions do not change in abundance between samples) is known to be violated, so that the default method is no longer appropriate, but there are unchanging compound ions present that can instead be used to standardise between the runs. Examples might include a dilution series with a consistent spike added where the bulk sample is being intentionally uploaded in sequence, or samples without any loading standardisation (so that variable amounts may be present) but with a consistently loaded spike. Alternatively, the assumption may be made that specific housekeeping compounds present in the sample will be unchanging despite the sample as a whole potentially altering.
If you wish to apply this option, you must first have identified your compounds, so that you are able to find your chosen standard(s) among them.
There are then two ways to select the appropriate compounds.
The first is to go to the Review Compounds stage, and in the table, find your standard(s) using the search box and their identifications. These can all be selected and tagged using the relevant column in the table.
Back at the Review normalisation stage in the Normalisation Method tab, the tagged compounds can be selected in the filter dialog. These can then all be selected and ticked, and on returning to the Normalisation Graphs, the housekeeping normalisation will be applied.
The restriction of compound ions used in the graphs can be seen on the updated plots, and the normalisation factors will also be updated.
The second method of setting the housekeeping compounds is to simply search for their identifications at the Review normalisation stage in the Normalisation Method tab directly, using the search box provided. Tagging is now optional, but the process is otherwise identical.
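Conceptually, housekeeping normalisation restricts the ratio calculation to the selected compound ions. The sketch below illustrates that restriction only: the function name, the index-based selection and the use of a plain mean of the log ratios (with no outlier filtering) are assumptions made for brevity, not a description of the software's internals.

```python
import math
import statistics

def housekeeping_scalar(run, reference, housekeeping_idx):
    """Scalar for a run, estimated from the housekeeping ions only."""
    logs = [
        math.log10(run[i] / reference[i])
        for i in housekeeping_idx
        if run[i] > 0 and reference[i] > 0
    ]
    return 10 ** (-statistics.fmean(logs))

# Hypothetical data: ions 0 and 1 are a spike loaded consistently, and
# ions 2 and 3 genuinely change between the samples.
reference = [100.0, 50.0, 1000.0, 2000.0]
run = [200.0, 100.0, 300.0, 9000.0]
scalar = housekeeping_scalar(run, reference, housekeeping_idx=[0, 1])
print(scalar)
```

Because only the spike ions are used, the large changes in ions 2 and 3 do not influence the estimated scalar.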
Normalise using total ion intensity
With this option, the normalisation factors are calculated so that they scale the summed abundance of all compound ions in each sample to an equal value. This method is much more sensitive to high-abundance outliers than the ratiometric default and is not recommended.
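The sensitivity to abundant species can be seen in a small sketch. The function name is hypothetical, and the choice of the mean of the run totals as the common target value is an assumption, since the target is not specified here.

```python
def total_intensity_scalars(runs):
    """Scalars that bring every run's summed abundance to a common value."""
    totals = [sum(run) for run in runs]
    target = sum(totals) / len(totals)  # assumption: mean of the run totals
    return [target / t for t in totals]

# Hypothetical runs: two ions are identical in both runs, while one very
# abundant ion increases 11-fold in the second run.
runs = [
    [100.0, 200.0, 700.0],
    [100.0, 200.0, 7700.0],
]
scalars = total_intensity_scalars(runs)
print(scalars)
```

The two unchanged ions end up scaled by different factors in the two runs, so a single abundant ion introduces an apparent difference where none exists. A ratiometric approach would give those two ions log ratios of zero.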
Normalise to external standard
If your experiment uses an external standard compound, you can use the external standard concentrations to normalise your samples.
Don't use any normalisation
Selecting this option will not apply any normalisation to the data (all representations and calculations will use raw abundance). This may be appropriate in certain specific cases, as determined by the user.