False discovery rate

False Positive Rate

The False Positive Rate (FPR) is a conventional statistical measure of the likelihood of chance results being present in a given set of results. If, for a particular result (i), the expectation (Ei) that this result is stochastic can be determined, then the FPR (in percent) for N results is simply:

FPR(%) = 100×[ ∑(i=1..N) Ei ]/N.

The FPR does not require the use of thresholds or additional Monte Carlo calculations, as it is a property of any set of results that have valid E-values (or p-values) assigned.
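
As a minimal sketch of the FPR formula, the Python function below averages the per-result expectations and expresses the mean as a percentage. The function name and the example values are hypothetical and chosen only for illustration. The sketch assumes that each Ei is already a valid expectation that the result is stochastic.

```python
def false_positive_rate_percent(expectations):
    """Return 100 * sum(Ei) / N for a non-empty sequence of expectations Ei."""
    values = list(expectations)
    if not values:
        raise ValueError("at least one result is required")
    return 100.0 * sum(values) / len(values)


# Hypothetical expectations for four results (illustrative values only).
example_expectations = [0.01, 0.02, 0.03, 0.04]

# The mean expectation is 0.025, so the FPR is approximately 2.5 percent.
example_fpr = false_positive_rate_percent(example_expectations)
```

No score threshold appears anywhere in this calculation: the only inputs are the expectations assigned to the results themselves.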

Please see Gupta N, et al., J Am Soc Mass Spectrom. 2011 22:1111-20 (PubMed 21953092) for a good discussion of the merits of FPR as opposed to peptide-FDR.

False Discovery Rate (Decoy-to-Target Ratio)

Note: There is a well-defined statistical method used to correct p-values for effects caused by multiple sampling referred to as False Discovery Rate. This method is unrelated to the "FDR" used in proteomics and the two entities should not be confused. The peptide False Discovery Rate used in proteomics would be more properly called the Decoy-to-Target Ratio (FDR/DTR).

The peptide False Discovery Rate (FDR/DTR) is a calculated number frequently used in the proteomics literature, based on a single Monte Carlo simulation. It is generated by performing a spectrum-to-peptide sequence assignment process for a very large number of spectra, using peptide sequences that are very unlikely to be correct assignments, and collecting the resulting scores (DECOY). These scores are compared with those obtained from an assignment process for the same large set of spectra in which the desired peptide sequences are probably available (TRUE). The two distributions are compared by first setting an arbitrary threshold score for the TRUE values and taking as a hypothesis that all assignments with a score greater than this threshold are correct, thereby generating the number of putative true results (T). The same score threshold is then applied to the DECOY distribution, and the number of assignments above this threshold is taken to be the number of putative false results (D) generated in this data set with this threshold value. The FDR/DTR (in percent) is calculated as:

FDR/DTR(%) = 100×D/T.
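
The counting procedure can be sketched in Python as follows. This is an illustration of the simple method only, not of any particular search engine's implementation. The function name, the score lists and the threshold value are all hypothetical, and "above the threshold" is assumed to mean strictly greater than.

```python
def decoy_to_target_percent(target_scores, decoy_scores, threshold):
    """Return 100 * D / T, counting scores strictly greater than the threshold."""
    # T: putative true results, i.e. TRUE assignments above the threshold.
    t = sum(1 for score in target_scores if score > threshold)
    # D: putative false results, i.e. DECOY assignments above the same threshold.
    d = sum(1 for score in decoy_scores if score > threshold)
    if t == 0:
        raise ValueError("no TRUE assignments above the threshold")
    return 100.0 * d / t


# Hypothetical scores for the same spectra searched against TRUE and DECOY sequences.
example_true = [12.0, 35.5, 41.0, 58.2, 63.7, 70.1]
example_decoy = [8.0, 11.3, 15.9, 22.4, 31.0, 44.6]

# With a threshold of 40.0, T = 4 and D = 1, so the FDR/DTR is 25.0 percent.
example_dtr = decoy_to_target_percent(example_true, example_decoy, threshold=40.0)
```

The result depends entirely on the arbitrary threshold: moving it changes both counts, and therefore the ratio.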

There are a number of potential variations to this simple method that are in use and there is currently no single accepted way to calculate this number.

The FDR/DTR calculated in this way has no direct interpretation in terms of statistical theory and should be treated as a potentially unreliable estimate of the number of false positive peptide sequence assignments in a large data set.

The FDR/DTR approach is prone to significant errors in many of the special cases that are commonly found in proteomics, and it has no implications regarding the validity of any particular result: it is a property of the full set of results only. It should only be used when the identification algorithm uses a scoring system in which the high scores approach an asymptotic upper bound linearly. If such an algorithm is used and for some reason it is impossible to use proper statistical tests (e.g., those made available through PeptideProphet, OMSSA or X! Tandem), an FDR/DTR may have some use as a quick-and-dirty estimate of data quality. If statistical tests have been performed, the False Positive Rate should always be used instead.

There is no such thing as a protein equivalent of FDR/DTR when using peptide identification to assign protein sequences (bottom-up proteomics). It can be applied to top-down proteomics results, but only if appropriate statistical methods are unobtainable for some reason.
