P-values must still be used, but they should be reported as exact continuous numbers (e.g., P = 0.07), together with a clear description of their scientific or practical implications to aid interpretation. Moreover, rather than adopting rigid rules for presenting and interpreting continuous P-values, we need a thoughtful case-by-case interpretation that considers other factors, such as the certainty of the evidence, the plausibility of the mechanism, study design, data quality, and the costs and benefits that determine which effects are clinically or scientifically important. It is also important to remember that the clinical implications of results cannot be extrapolated to patient groups other than those included in a study [12].
There are many frequentist and Bayesian tools for providing a significance level [5], but the P-value should be interpreted in the context of the sample size and a meaningful effect size. Thus, we need to distinguish between statistical and clinical significance. Although we will use the term “clinical significance” in this text, it may be preferable to replace it with “clinical relevance” and to reserve the term “significance” for statistical issues. For example, a two-stage approach to inference requires both a small P-value and a pre-specified, sufficiently large effect size to declare a result “significant” [13] (see the sketch after this paragraph). Predetermining whether an effect size is relevant for the patient is much more important than statistical significance [14]. This is the minimal important difference (MID): the smallest change in a treatment outcome that an individual patient would identify as important and that would indicate a change in the patient's management. This term is preferred to the minimal clinically important difference (MCID) because the latter focuses attention on clinical aspects rather than on the patient's experience [15, 16]. The MID should be presented together with the minimum and maximum of the scale and its direction [17] to facilitate the interpretation of results [10, 13]. The term is generally applied to continuous outcomes, but it can be used for other types of outcomes as well.
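As a rough illustration of this two-stage logic, the following minimal Python sketch combines both criteria before declaring a result “significant”; the function name and the thresholds (alpha = 0.05, an MID of 1.0 on some outcome scale) are illustrative assumptions, not part of the proposal in [13].

```python
# Minimal sketch of a two-stage rule: a result is declared "significant"
# only when the P-value is small AND the observed effect reaches a
# pre-specified minimal important difference (MID). Thresholds below are
# illustrative assumptions, not recommendations.

def two_stage_significant(p_value: float, effect: float,
                          alpha: float = 0.05, mid: float = 1.0) -> bool:
    """Require both the statistical and the clinical criterion."""
    return p_value < alpha and abs(effect) >= mid

# P = 0.03 with an effect of 0.4 (MID pre-specified at 1.0): the P-value
# is small, but the effect is below the MID, so not declared "significant".
print(two_stage_significant(p_value=0.03, effect=0.4))  # False
print(two_stage_significant(p_value=0.03, effect=1.5))  # True
```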
Deciding whether the size of an effect is relevant depends on how critical an outcome is. For example, it is difficult to define a lower threshold for the clinical significance/relevance of mortality estimates, since any benefit of a new treatment, however small, is relevant [12, 18]. Conversely, the threshold will necessarily be higher for less important outcomes. Thus, the threshold should be based on how much the intended beneficiaries value each relevant outcome and what they would consider an important absolute effect. There are several recommended methods for determining the MID for patient-reported outcomes [19]. However, information on how much people value the main outcomes varies and is unreliable. Where such information is lacking, the authors should at least state that the MID is based on their own judgement [20].
A judgement and a rationale are required to decide what constitutes appreciable benefits and harms. Regardless of the type of outcome, an intervention with a small but clinically relevant beneficial effect will not be recommended if its adverse effects are relevant [12, 18]. Serious adverse effects, even if rare, may make the use of an otherwise beneficial intervention unjustified. Therefore, it is mandatory to assess harmful effects when determining the clinical significance of an intervention [12, 18]. Moreover, if a new intervention is classified as having “statistically significant” effects but its effect size is smaller than that of another intervention, the new intervention's effect might be considered “not clinically significant”. Hence, the effect sizes of alternative interventions for a condition can help to set clinical significance thresholds for an intervention of interest. Such a threshold should consider both relative and absolute effects, since it is difficult, if not impossible, to judge the importance of a relative effect alone. For example, a relative risk reduction of 20% for women with a 20% likelihood of abortion would mean a risk difference (an absolute effect) of 4%, or a Number Needed to Treat to Benefit (NNTB) of 25. However, the same relative effect for women with a 1% likelihood of abortion would mean an absolute risk difference of only 0.2%, or an NNTB of 500, which represents a much less important effect.
For a drug with no serious adverse effects, minimal inconvenience, and modest cost, even a small effect would warrant a strong recommendation. For instance, we might strongly recommend an intervention with an MID of at least a 0.5% absolute risk reduction of abortion (NNTB of 200). However, if the treatment is associated with serious toxicity, we might prefer a more demanding MID, such as 1% (NNTB of 100). This arithmetic is sketched below.
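The conversion used in the two examples above can be sketched as follows; the function name is hypothetical and the numbers simply reproduce the figures in the text.

```python
# Sketch of the absolute-effect arithmetic: converting a relative risk
# reduction (RRR) into an absolute risk reduction (ARR = baseline risk
# x RRR) and a Number Needed to Treat to Benefit (NNTB = 1 / ARR).

def absolute_effect(baseline_risk: float, rrr: float) -> tuple[float, float]:
    """Return (ARR, NNTB) for a given baseline risk and relative risk
    reduction, both expressed as proportions."""
    arr = baseline_risk * rrr
    return arr, 1 / arr

# A 20% relative risk reduction at a 20% baseline risk of abortion:
arr1, nntb1 = absolute_effect(0.20, 0.20)
print(f"ARR = {arr1:.1%}, NNTB = {nntb1:.0f}")  # ARR = 4.0%, NNTB = 25

# The same relative effect at a 1% baseline risk:
arr2, nntb2 = absolute_effect(0.01, 0.20)
print(f"ARR = {arr2:.1%}, NNTB = {nntb2:.0f}")  # ARR = 0.2%, NNTB = 500

# The MID thresholds above, expressed as NNTBs:
print(f"{1 / 0.005:.0f}, {1 / 0.01:.0f}")  # ARR 0.5% -> 200; ARR 1% -> 100
```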
We consider that reporting a point estimate and its CI is much more informative and should be the rule. Additionally, the P-value conveys the probability of observing an effect at least as extreme as the one found if chance alone were operating (i.e., under the null hypothesis). Therefore, it is much better to report the exact P-value than to adopt the binary approach of statistical significance based only on the arbitrary cut-off point of 0.05. A sketch of this reporting style follows.
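The following sketch illustrates this reporting style (point estimate, 95% CI, exact P-value) on simulated data; it assumes SciPy ≥ 1.10, which provides the confidence_interval method on the t-test result, and all numbers are artificial.

```python
# Report the estimate with its interval and the exact P-value, rather
# than a binary "significant / not significant" verdict. Data simulated
# purely for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
treatment = rng.normal(loc=1.0, scale=2.0, size=50)  # simulated outcomes
control = rng.normal(loc=0.0, scale=2.0, size=50)

result = stats.ttest_ind(treatment, control)
diff = treatment.mean() - control.mean()             # point estimate
ci = result.confidence_interval(confidence_level=0.95)

print(f"Mean difference = {diff:.2f} "
      f"(95% CI {ci.low:.2f} to {ci.high:.2f}), P = {result.pvalue:.3f}")
```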
However, a binary use of confidence or credible intervals (focused on whether such intervals include or exclude the null value) could lead to the same problems caused by the use of statistical significance. In fact, some authors propose the alternative term “compatibility intervals” to guard against overconfidence [21]. Authors should describe the practical implications of all values inside the interval that are compatible with the data, especially the observed effect (or point estimate), which is the value most compatible with the data. Even with large P-values or wide intervals, authors should discuss the point estimate as well as the limits of the interval. An interval containing the null value will often also contain non-null values of high practical importance that should not be left out of the conclusions. If imprecision is accepted and properly interpreted, we will embrace replication and the integration of evidence through meta-analyses, which will in turn give us more precise overall estimates.
One of the most contentious scenarios is a point estimate showing important clinical benefits, with a 95% CI compatible with both even greater benefits and important harms, and a P-value > 0.05 (e.g., P = 0.08). We have argued against interpreting and/or reporting such results as statistically non-significant. A better statement could be “the intervention did not demonstrate superiority vs the comparator”. Although this is true, this result could still be interpreted as indicating “no effect” [3], and it does not escape the binary logic of superiority. On the other hand, it is possible to report that the intervention might be superior to its comparator, while noting that the result is also compatible with both beneficial and detrimental effects. Supporters of the “non-superiority” statement argue against the latter option because it could be misinterpreted as a positive effect. However, the P-value in this example reflects the chance of concluding that there is a difference where, in reality, none exists (a Type I error or false positive). This proportion may be unacceptable for drawing firm conclusions, but it is not high in terms of probability. Additionally, the point estimate is the value with the maximum likelihood across the CI. To illustrate this point, consider an example in which the effect to be estimated is the difference between the means of two normally distributed populations. Two independent samples from these populations yield sample means, and their difference, with a 95% confidence interval, has been calculated. Figure 1 presents the probability density function of this difference (with the 95% confidence interval for the effect indicated). In real life, distributions are likely to deviate from normality, and the confidence interval for an effect might not be symmetric around the point estimate.
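To make the Figure 1 argument concrete, here is a sketch under the same normality assumption; the mean difference (1.75) and its standard error (1.0) are purely illustrative, chosen so that the two-sided P-value is approximately 0.08, as in the scenario above.

```python
# Sketch of the Figure 1 argument: under a normal approximation, the
# sampling density of the estimated mean difference is highest at the
# point estimate and much lower at the null value and the CI limits.
from scipy import stats

point_estimate = 1.75  # observed mean difference (assumed)
se = 1.0               # standard error of the difference (assumed)

p_value = 2 * stats.norm.sf(abs(point_estimate / se))  # two-sided test vs 0
sampling_dist = stats.norm(loc=point_estimate, scale=se)
ci_low, ci_high = sampling_dist.ppf([0.025, 0.975])    # 95% CI

print(f"P = {p_value:.2f}; 95% CI {ci_low:.2f} to {ci_high:.2f}")
# The interval includes the null value (0), yet the density is highest
# at the point estimate and far lower at 0:
print(f"density at estimate = {sampling_dist.pdf(point_estimate):.3f}, "
      f"at null = {sampling_dist.pdf(0.0):.3f}")
```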
Therefore, it seems fairer to report the point estimate as the most likely value, together with a very clear statement of the implications of the extremes of the confidence interval. In fact, this is the approach recommended in the Grading of Recommendations Assessment, Development, and Evaluation (GRADE) guidelines [22].