The Bland-Altman Limits of Agreement: How Often Have They Been Misapplied?
Catarina Carvalho; Cláudia Silva; Daniel Silva; Fernando Nogueira; Helena Greenfield; Inês Casais; João Pedro Costa; Mariana Liz; Sara Martins; Sérgio Sousa
Prof. Dr. Cristina da Costa Santos; Class 13
ABSTRACT
Background: Methods of clinical measurement are an essential tool in medical practice. Bland and Altman devised a simple and reliable statistical method to evaluate agreement between two such methods and published it in an article that has been cited 18360 times. However, not all applications of these limits of agreement have been carried out correctly.
Aims: We aimed to determine the percentage of articles in which the method was misapplied and the most common errors made. We also intended to evaluate whether this percentage varied with the year of publication, with the impact factor of the journal, and with whether the method was used for a primary or a secondary analysis.
Methods: We set out to analyze, according to a checklist, a sample of 70 randomly selected articles and proceedings papers indexed on ISI that cited the limits of agreement of Bland and Altman.
Results and Conclusions: We analyzed 56 articles, of which 51 were applications of the limits of agreement. None fulfilled all the points of our checklist: 39 verified or reported whether there was a relation between the differences and the averages, 7 verified or reported whether the differences followed a normal distribution, 36 presented the limits of agreement, 26 interpreted that 95% of the differences lie within the limits of agreement, and 35 correctly interpreted the outcome of the method according to the clinical needs. We found no statistically significant differences in the application of the method according to the impact factor of the journal where the articles were published, but we did find statistically significant differences according to the year of publication (articles published after 2004 fulfilled more points of the checklist; p=0.034, p=0.014, p=0.006 and p=0.017). Articles that used the method for a primary analysis also fulfilled more points of the checklist (p=0.036, p=0.050 and p=0.019).
KEYWORDS MeSH terms: Diagnosis; Statistical; Other terms: Agreement; Methods of clinical measurement.
INTRODUCTION
Owing to advances in technology, new methods of clinical measurement appear constantly, and they keep becoming more innovative and more diverse [1].
In the 1980s, Martin Bland and Douglas Altman became aware of how widely the correlation coefficient, among other methods, was being used to evaluate agreement between two methods of clinical measurement, when some cardiologists asked for their help in doing so. Realizing that no adequate method existed, they devised their own, which was easy to apply and became known as the Bland-Altman limits of agreement, and published it in the Lancet in 1986 [2]. It allows a new method of measurement to be validated by comparison with one that has already been accepted by the scientific community.
Establishing such agreement is very important: without it there is a high risk of diagnostic errors, which may have serious consequences, since the results of these methods influence doctors and may lead them to make poor decisions [3].
The authors of the method first reject the use of the correlation coefficient as a measure of agreement between two instruments. Instead, they suggest taking measurements with both instruments and, for each assessment, calculating the average of the two values and the difference between them. The mean of the differences is then calculated, and it immediately supports a conclusion: if it equals zero or is not significantly different from zero, there is no systematic error; otherwise, there is one.
By plotting the differences against the averages of the paired values and considering the limits of agreement, it is possible to judge whether the agreement is sufficient for the method of clinical measurement under test to be used safely. The limits of agreement are the mean of the differences minus twice their standard deviation and the mean plus twice their standard deviation; this interval is expected to contain the differences 95% of the time.
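The calculation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' own implementation: the function names, the parameter `k` and the sample readings are hypothetical, and the multiplier of 2 simply follows the "twice the standard deviation" wording used here.

```python
from statistics import mean, stdev


def limits_of_agreement(method_a, method_b, k=2.0):
    """Return (bias, lower, upper) for paired measurements.

    k=2.0 follows the "twice the standard deviation" wording of the text.
    """
    if len(method_a) != len(method_b):
        raise ValueError("measurements must be paired")
    diffs = [a - b for a, b in zip(method_a, method_b)]
    bias = mean(diffs)   # mean difference: estimate of the systematic error
    sd = stdev(diffs)    # sample standard deviation of the differences
    return bias, bias - k * sd, bias + k * sd


def plot_points(method_a, method_b):
    """(average, difference) pairs, i.e. the points of the scatter plot."""
    return [((a + b) / 2, a - b) for a, b in zip(method_a, method_b)]


# Hypothetical readings from an established device and a new device.
established = [10.0, 12.0, 14.0, 16.0]
new_device = [9.0, 12.0, 15.0, 16.0]
bias, lower, upper = limits_of_agreement(established, new_device)
print(f"bias={bias:.2f}, limits of agreement=({lower:.2f}, {upper:.2f})")
```

Whether the resulting interval is acceptable is not something the code can decide: it has to be compared with the largest difference that is tolerable for the clinical purpose at hand.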
If the limits of agreement are too wide, the measuring instrument is subject to random errors that make it unacceptable for clinical use. If, on the other hand, they are narrow but the mean of the differences is different from 0, there is a systematic error, which can be compensated for by calibrating the device.
Judging whether the limits of agreement are too wide or adequate can be somewhat subjective. It is therefore important that the maximum acceptable limits of agreement be defined according to the clinical needs.
The Bland-Altman method rests on two mandatory assumptions: the differences between the measured values must follow a normal distribution (Assumption 1) and their standard deviation must be constant across the range of measurements (Assumption 2) [2].
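As a numeric complement to inspecting a histogram and a scatter plot, a rough screen for both assumptions might look like the sketch below. The function names are hypothetical, and the two indicators (skewness of the differences for Assumption 1, correlation between the absolute deviations and the averages for Assumption 2) are illustrative choices made here, not formal tests and not part of the method as described in this text.

```python
from math import sqrt
from statistics import mean, pstdev


def pearson_r(x, y):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)


def assumption_screen(method_a, method_b):
    """Return (skewness, spread_trend) for paired measurements.

    skewness     : near 0 for a symmetric distribution of differences
                   (a crude indicator for Assumption 1, not a normality test).
    spread_trend : correlation between the averages and the absolute
                   deviations of the differences from their mean; values
                   far from 0 suggest the standard deviation is not
                   constant (Assumption 2).
    """
    diffs = [a - b for a, b in zip(method_a, method_b)]
    avgs = [(a + b) / 2 for a, b in zip(method_a, method_b)]
    d_bar = mean(diffs)
    sd = pstdev(diffs)  # population SD, as used in the moment coefficient
    skewness = mean(((d - d_bar) / sd) ** 3 for d in diffs)
    spread_trend = pearson_r(avgs, [abs(d - d_bar) for d in diffs])
    return skewness, spread_trend


# Hypothetical data in which the disagreement grows with the magnitude.
a = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
b = [9.0, 22.0, 27.0, 44.0, 45.0, 66.0]
skewness, spread_trend = assumption_screen(a, b)
print(f"skewness={skewness:.2f}, spread trend={spread_trend:.2f}")
```

In this made-up data set the differences are symmetric but their spread increases with the average, so only the second indicator raises a flag; a plot of the differences against the averages would show the same widening pattern.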
The method had a great impact on the scientific community: since its publication in the Lancet, the article has been cited 18360 times [4]. However, some of these applications of the method may not have been carried out correctly, as the authors themselves noted in later studies [5]. This leads us to the initial problem: the use of inadequate methods of clinical measurement whose agreement with previously accepted methods was incorrectly judged to be good enough [6].
RESEARCH QUESTION AND AIMS
Our research question was: “What is the percentage of articles in which the Bland-Altman method is applied correctly?”
In addition, our secondary aims were to verify:
· what percentage of the citing documents fell into each of the document types defined by ISI;
· at which level the method was misapplied (the assumptions, the method itself or the interpretation of the results obtained) and, where the error lay in the assumptions, which assumption was least often fulfilled;
· whether the percentage of articles applying the method incorrectly had varied through the years;
· whether the impact factor of a journal influenced the percentage of articles published in it that applied the method correctly;
· whether the percentage of articles applying the method correctly varied according to whether it was used for a primary or a secondary analysis.
METHODS
To accomplish our objectives, we set out to analyze 35 articles and proceedings papers indexed by ISI that cited the Lancet article in which Bland and Altman present their method [2]. These 35 were randomly chosen from a sample of 70 randomly selected articles and proceedings papers indexed on ISI that cite that article. The remaining 35 were analyzed by class 4, which was carrying out the same research with the same checklist, and the data obtained were shared between the two classes. Five articles were analyzed by two different people and the results were compared, in order to evaluate the reproducibility of our checklist.
The analysis followed a checklist (see Appendix) that evaluated each article on the following aspects:
· Verification of the assumptions;
· Application of the method itself;
· Interpretation of the obtained limit of agreement.
The checklist also recorded some relevant data about each article: the type of article, and the year and journal in which it was published.
To be considered a correct application of the method, an article must:
· present a scatter plot showing the relation between the averages and the differences and the constancy of the standard deviation, or report on these;
· present a histogram of the differences or report whether they follow a normal distribution;
· present the limits of agreement;
· correctly interpret the outcome, i.e. state whether the method of clinical measurement is adequate for use according to the clinical needs.
We searched ISI to find the exact number of citations of the Bland-Altman article and the types of document in which the citations were made.
We constructed tables summarizing the main results of the checklist.
Statistical Analysis
The data collected with the checklist were gathered and analyzed using SPSS.
We described the data as absolute values and percentages.
Using a chi-square test [7], we analyzed whether the percentage of articles committing a specific error differed according to:
· The categories of impact factor (“≤2.378” and “>2.378”);
· The categories of the year of publication (“2004 and before” and “after 2004”);
· The type of data the method is used to obtain (“primary data” or “secondary data”).
To define the impact factor and year categories, we calculated the median of the impact factor and of the year of publication and used it as the cut-off between the two groups.
We considered a significance level of 0.05.
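The comparison described above can be illustrated with a short Python sketch. The analysis in this study was run in SPSS; the code below is only an approximation under stated assumptions: the function names are hypothetical, and it uses the Pearson chi-square statistic for a 2x2 table without continuity correction, which is an assumption about the variant reported. The example counts are those of the first row of Table 2.

```python
from math import erfc, sqrt
from statistics import median


def median_split(values):
    """Return the median cut-off and a two-category label for each value."""
    cut = median(values)
    return cut, ["<=median" if v <= cut else ">median" for v in values]


def chi_square_2x2(yes_1, n_1, yes_2, n_2):
    """Pearson chi-square (no continuity correction) for two proportions.

    Returns (statistic, p_value) with 1 degree of freedom.
    """
    a, b = yes_1, n_1 - yes_1   # group 1: fulfilled / not fulfilled
    c, d = yes_2, n_2 - yes_2   # group 2: fulfilled / not fulfilled
    n = n_1 + n_2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("a row or column total is zero")
    stat = n * (a * d - b * c) ** 2 / denominator
    # With 1 degree of freedom the upper tail equals erfc(sqrt(stat / 2)).
    return stat, erfc(sqrt(stat / 2))


# Counts from Table 2, first row: 19 of 29 articles (2004 or before)
# versus 20 of 22 articles (after 2004).
stat, p = chi_square_2x2(19, 29, 20, 22)
print(f"chi-square={stat:.3f}, p={p:.3f}, significant={p < 0.05}")
```

For these counts the uncorrected statistic gives a p-value that rounds to the value reported for that row, which suggests, without proving it, that this is the variant SPSS was asked to report.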
RESULTS
On 19 March 2011 at 21:22 (the date on which our sample was selected), ISI reported 18360 citations of the Bland-Altman limits of agreement: 16230 articles, 291 reviews, 70 meeting abstracts, 2 reprints, 1059 proceedings papers, 121 notes, 2 corrections/additions, 1 correction, 471 letters and 113 editorial materials.
From our selection of 70 articles and proceedings papers, we were able to access 56 (the remaining 14 were either unavailable in full text or written in foreign languages).
Of those 56, 5 were not applications of the Bland and Altman limits of agreement, while 51 were.
Tables 1, 2 and 3 present the results of the analysis of the 51 articles as “absolute value (percentage)”.
Checklist item / Total (n=51) / Impact factor ≤2.378 (n=24) / Impact factor >2.378 (n=27) / p-value (Chi-square test)
Verify/report if there is a relation between the differences and the averages / 39 (76) / 19 (79) / 20 (74) / 0.669
Verify/report if there is a normal distribution of the differences / 7 (14) / 3 (12) / 4 (15) / 0.811
Present the LA / 36 (71) / 19 (79) / 17 (63) / 0.205
Interpret that 95% of the times the differences lie within the LA / 26 (51) / 15 (62) / 11 (41) / 0.121
Correctly interpret the outcome of the method according to the clinical needs / 35 (69) / 16 (67) / 19 (70) / 0.776
Table 1 – Number (percentage) of articles fulfilling each main point of the checklist, according to the impact factor of the journal where they were published. We used a Chi-Square test to compare the percentages between the two levels of impact factor. LA – Limits of agreement.
Checklist item / Total (n=51) / Published in 2004 or before (n=29) / Published after 2004 (n=22) / p-value (Chi-square test)
Verify/report if there is a relation between the differences and the averages / 39 (76) / 19 (66) / 20 (91) / 0.034*
Verify/report if there is a normal distribution of the differences / 7 (14) / 1 (3) / 6 (27) / 0.014*
Present the LA / 36 (71) / 16 (55) / 20 (91) / 0.006*
Interpret that 95% of the times the differences lie within the LA / 26 (51) / 13 (45) / 13 (59) / 0.313
Correctly interpret the outcome of the method / 35 (69) / 16 (55) / 19 (86) / 0.017*
Table 2 – Number (percentage) of articles fulfilling each main point of the checklist, according to the year in which they were published. We used a Chi-Square test to compare the percentages between the two periods of publication. LA – Limits of agreement. * - statistically significant.
Checklist item / Total (n=51) / Primary data (n=34) / Secondary data (n=17) / p-value (Chi-square test)
Verify/report if there is a relation between the differences and the averages / 39 (76) / 29 (85) / 10 (59) / 0.036*
Verify/report if there is a normal distribution of the differences / 7 (14) / 6 (18) / 1 (6) / 0.250
Present the LA / 36 (71) / 27 (79) / 9 (53) / 0.050*
Interpret that 95% of the times the differences lie within the LA / 26 (51) / 19 (56) / 7 (41) / 0.322
Correctly interpret the outcome of the method / 35 (69) / 27 (79) / 8 (47) / 0.019*
Table 3 – Number (percentage) of articles fulfilling each main point of the checklist, according to the type of data obtained by using the limits of agreement. We used a Chi-Square test to compare the percentages between the two types of data. LA – Limits of agreement. * - statistically significant.
Reproducibility of the checklist
For the 5 articles analyzed by two different students, agreement was 100% on all questions except the one asking whether the article had interpreted the outcome correctly according to the clinical needs, on which the students disagreed about 1 article. The two students who disagreed re-evaluated the question and came to an agreement.
DISCUSSION
Our main aim was to determine how many of the articles citing the Bland-Altman method applied it correctly. Strikingly, not one of the articles we analyzed applied the method correctly in its entirety. This is an important finding, since the correct application of this method is of the utmost importance and its incorrect use can have severe consequences, as explained in our introduction.