Data Entry Checking
Error checking was carried out in the databases of the BALTI-prevention, SARAH and DRAFFT trials. The purpose was to check whether the error rates were within the target levels of 1% for the primary outcomes and 5% for the secondary outcomes.
To work out the error rates, the number of questions on the forms is needed, as these make up the denominators of the primary and secondary error rates. Generally, totalling the entry points was straightforward for the outcomes in these datasets, with the exception of the post-operative outcomes for days 4-28 in the BALTI-prevention dataset. In this trial not all patients remain in hospital for 28 days, and the post-operative outcomes are only recorded for the days they remain in hospital. There are 13 questions for each of the 25 days from day 4 to day 28 on the post-operative forms, a total of 325 entry points. However, these forms would only be filled in for the days when the patient was still in hospital, so setting the denominator for this outcome at 325 for every patient would artificially deflate the error rate. For this reason the denominator for this outcome was unique to each patient and was calculated as 13 × the number of days that the patient was in hospital post-operatively.
It is issues such as this that make it difficult to generalise a process for checking datasets against the data source. In the case of the BALTI-prevention trial it meant that a tally of the data points needed to be recorded, as well as the number of errors, as part of the data entry checking. Nevertheless, a guide to what would be considered best practice for checking data input errors in trial datasets is considered in terms of when, how much and how.
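The per-patient denominator can be illustrated with a minimal Python sketch. The function names and the example patient are hypothetical, and the sketch assumes that the number of days supplied is the number of days within the day 4-28 window for which the patient was still in hospital.

```python
QUESTIONS_PER_DAY = 13   # questions per post-operative day (from the text)
MAX_FORM_DAYS = 25       # days 4 to 28 inclusive


def entry_points(days_recorded: int) -> int:
    """Denominator for one patient: 13 x days in hospital within days 4-28."""
    if not 0 <= days_recorded <= MAX_FORM_DAYS:
        raise ValueError("days_recorded must be between 0 and 25")
    return QUESTIONS_PER_DAY * days_recorded


def error_rate(n_errors: int, days_recorded: int) -> float:
    """Error rate for one patient's post-operative forms."""
    points = entry_points(days_recorded)
    if points == 0:
        raise ValueError("no entry points to check for this patient")
    return n_errors / points


# Hypothetical patient: 2 errors found, in hospital for 10 of the 25 form days.
per_patient = error_rate(2, 10)                   # 2 / 130
fixed = 2 / (QUESTIONS_PER_DAY * MAX_FORM_DAYS)   # 2 / 325
print(f"per-patient denominator: {per_patient:.4f}")
print(f"fixed denominator of 325: {fixed:.4f}")
```

With the fixed denominator the same two errors give a smaller rate, which is the deflation described above.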
When should data entry checking be carried out?
Error rates for data entry were highest in the BALTI-prevention dataset (1% for primary outcomes and 3.9% for secondary outcomes), where no formal data entry checking had been carried out prior to the end of follow-up for the trial. In the SARAH trial, data entry checking had been carried out at regular stages during the trial, and the error rates for the 10% sample of patients from the 491 recruited prior to this exercise were <0.5% for primary and secondary outcomes. These error rates were <0.5% at each stage of follow-up: baseline, 4 months and 12 months. This suggests that, if a single entry procedure is adopted, it can be beneficial to carry out data entry checking on a random sample of patients regularly throughout the trial so that data accuracy can be monitored.
The suggestion is that, had data entry checking been carried out at an earlier stage of the BALTI-prevention trial, issues with the data entry would have been raised and could have been addressed. For example, one of the main contributing factors to the increased error rate for the secondary outcomes in the BALTI-prevention trial was that in certain cases whole forms had not been entered for the post-operative outcomes for days 4-28. Out of the sample of 40 patients used for this error check there were 4 patients for whom the entire post-operative form for days 15-28 hadn’t been entered. In these instances every question that had been filled in on the form was recorded as an error, as it hadn’t been entered onto the database. Had this been raised earlier in the trial, through earlier data entry checking, the overall error rate could have been reduced.
For other trials this might have been picked up with a check on the level of missing data. However, because these outcomes aren’t always completed in the BALTI-prevention trial, it is more difficult to identify data that is invalidly missing, and so these types of errors can only be picked up through data entry checking. This suggests that the type of checks that are done at different time points should depend on the forms being checked. For trials where the majority of questions are applicable to all participants, range checks and missing value checks should identify the variables where there might be errors; forms will need to be checked where missing data is high if the reason for this isn’t already known. For trials where many questions are only required for certain patients and validly missing data is difficult to assess, regular data entry checking is important to identify any errors.
One way to reduce the amount of data entry checking required, by making range checks and checks of missing data levels more rigorous, would be to include indicator variables before all questions that are not always applicable. These would indicate whether certain forms and questions were required to be filled in and would make the level of missing data more easily quantifiable. In the case of BALTI-prevention, a variable giving the number of days each participant was in hospital post-operatively would indicate the number of days that were required to be completed for the post-operative outcomes and could be used to check for missing data where forms hadn’t been entered onto the database. This might also prompt the person undertaking the data entry that data should be input for a certain number of days and help to ensure that all forms are properly checked.
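As an illustration of how such an indicator variable could be used, the following Python sketch lists the post-operative days that should have been entered but were not. The function and argument names are hypothetical, and the sketch assumes the indicator holds the last post-operative day the patient spent in hospital, with forms expected for every day from day 4 up to that day (capped at day 28).

```python
def missing_days(days_in_hospital, entered_days, first_day=4, last_day=28):
    """Return the post-operative days expected on the database but not entered.

    days_in_hospital: hypothetical indicator variable, the last post-operative
        day the patient was in hospital.
    entered_days: the days for which post-operative data exist on the database.
    """
    last_expected = min(days_in_hospital, last_day)
    expected = set(range(first_day, last_expected + 1))
    return sorted(expected - set(entered_days))


# Hypothetical patient in hospital for 28 days whose day 15-28 form
# was never entered: days 4-14 are on the database, the rest are flagged.
flagged = missing_days(28, range(4, 15))
print(flagged)

# Hypothetical patient discharged on day 10 with all days entered:
# nothing is flagged, because days 11-28 are validly missing.
print(missing_days(10, range(4, 11)))
```

A check of this kind separates validly missing days from forms that were completed but not entered, which a simple count of missing values cannot do.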
Brunelle et al. suggest that a review of data quality should be undertaken after the data have been cleaned [1]. This means that it would be sensible to carry out missing value and range checks as well as data entry checking during soft lock of the database in interim periods of a trial prior to DMEC and TSC meetings.
How much data entry checking needs to be carried out?
The literature on double data entry is far from conclusive on its usefulness. Reynolds-Hartle et al. suggest that double data entry leads to a reduced error rate compared to single data entry, whereas the paper by Goldberg et al. found error rates using double data entry as high as 27% [2],[3]. Day et al. suggest that exploratory data analysis is more important than double data entry as a means of detecting errors, as it can detect errors made by the person completing the form and not just errors introduced at data entry [4]. It might also be worth noting that the data management services Veristat and Pharpoint opt for single data entry with 100% quality control and not double data entry. Double data entry clearly uses more resources, particularly in terms of time and finance, and currently it isn’t used routinely at Warwick CTU. In keeping with current practice, and due to a lack of conclusive support in favour of double data entry, the focus here remains on data entry checking of single entry data.
The amount of data entry checking required will also inform when it is best to carry out data entry checks. For example, in carrying out data entry checks for this exercise a random sample of participants was taken for each trial and all of their completed CRFs and questionnaires were checked. This meant that for the DRAFFT trial only eight follow-up questionnaires at 6 months and only one 12 month questionnaire were checked, from a 5% sample of patients. Although the checking of these forms can feed into the overall assessment of primary and secondary error rates, it means the assessment of the error rates at these follow-up time points, particularly 12 months, isn’t very precise. The suggestion is that data entry checking should be carried out by follow-up time point. A pre-defined number of patients would need to have completed each follow-up before checking of the data entry at that time point would be worthwhile. Shen (2006) suggests that at least 1830 data points are required to achieve at least 95% power [5]. This works out as follow-up forms for approximately 7 participants at 4 months and 12 months in the SARAH trial and approximately 10 follow-up forms at 3 months, 6 months and 12 months in the DRAFFT trial. Since a 10% sample of 100 patients gives 10 forms, a sensible number of participants to have completed a follow-up before CRFs at that follow-up are checked would be 100.
For single data entry, data entry checking would take place on a 10% random sample when the data had been cleaned, and follow-up forms would only be checked if 100 patients had completed that follow-up, giving a starting sample of 10 patients. Then, at subsequent soft locks, another 10% random sample would be checked on the follow-up forms where ≥100 patients had completed follow-up. Although more time consuming, it would be ideal if all data checking was carried out on the entire set of patients to have completed that follow-up stage and not just on those who have completed follow-up since the initial sample. This means the error checking at each follow-up will continually become more precise, and forms with a high level of errors that weren’t selected in the initial random sample can still be picked up. The more forms that are checked, the more precise the estimate of the error rate will be. If an approach is taken of only checking follow-up forms that have been completed since the previous check, then this should be done when ≥100 additional patients have completed that follow-up.
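The sampling rule described above can be sketched in Python as follows. This is an illustration only: the function name and the completion counts are hypothetical, and rounding the 10% sample up to a whole patient is an assumption, as the text does not say how fractions are handled.

```python
import math


def sample_size(n_completed, percent=10, min_completed=100):
    """Number of patients to check at one follow-up time point.

    Returns 0 if fewer than min_completed patients have completed the
    follow-up; otherwise percent% of them, rounded up (an assumption).
    """
    if n_completed < min_completed:
        return 0
    return math.ceil(n_completed * percent / 100)


# Hypothetical numbers of patients who have completed each follow-up
# at a soft lock of the database.
completed = {"baseline": 320, "4 months": 100, "12 months": 85}
to_check = {timepoint: sample_size(n) for timepoint, n in completed.items()}
print(to_check)
```

In this hypothetical soft lock the 12 month forms would not be checked yet, because fewer than 100 patients have completed that follow-up.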
If a sample of CRFs produces an error rate that is above a pre-defined level then a second 10% random sample will need to be taken and checked. Continual need for re-checking or a persistently high error rate at certain time points might necessitate double data entry. It might also be a good idea to check all CRFs entered by a new individual after they have entered data for approximately 10 forms. This will highlight any discrepancies in their interpretation of the data and also check that they are entering data correctly.
How should data entry checking be carried out?
The purpose of data entry checking is to check data that has been entered onto a database against the original CRF. Although for the purpose of this exercise the on-screen data was checked against the CRF, it might be possible to check a printout of the screens against the CRF so that there is a physical record of the check which can then be signed.
As has been done for the SARAH trial, an Excel table should be used to record errors in the sections of the CRF where they occur. Error rates can then easily be calculated for primary, secondary and demographic data. In general, the error rate is the number of errors divided by the number of data points that have been checked. The number of data points is the number of fields that require an entry in the database, and this is easiest to calculate by counting the required fields on-screen. Generally, a question where only one answer is required, e.g. a ‘Yes’ or ‘No’ question, is one data point and not two, whereas if a question can have multiple options selected then each option is a data point. However, there are certain cases, as described above for the BALTI-prevention trial, where the number of data points can differ for each participant, and so the number of data points needs to be recorded individually for each form along with the number of errors.
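The calculation can be sketched in Python as below, assuming the tally table has one row per form and section giving the category, the number of errors and the number of data points checked. The function name and all values are hypothetical and are not taken from any of the trials.

```python
from collections import defaultdict


def error_rates(tallies):
    """Error rate per category: total errors / total data points checked.

    tallies: iterable of (category, n_errors, n_data_points) rows, one row
        per checked form section, so the denominator can differ by form.
    """
    errors = defaultdict(int)
    points = defaultdict(int)
    for category, n_errors, n_points in tallies:
        errors[category] += n_errors
        points[category] += n_points
    # Categories with no data points checked are left out of the result.
    return {c: errors[c] / points[c] for c in points if points[c] > 0}


# Hypothetical tally for two checked patients.
tally = [
    ("primary", 0, 120),
    ("primary", 1, 80),
    ("secondary", 3, 130),    # e.g. 13 questions x 10 days in hospital
    ("secondary", 2, 325),    # e.g. 13 questions x 25 days in hospital
    ("demographic", 0, 20),
    ("demographic", 1, 20),
]
for category, rate in error_rates(tally).items():
    print(f"{category}: {rate:.2%}")
```

Summing errors and data points before dividing means forms with more data points carry more weight than they would if per-form rates were averaged.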
Given a list of study numbers a random sample can be obtained by Statisticians or Programmers using statistical software. It is also relatively straightforward to obtain a random sample using Excel.
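For example, a 10% random sample could be drawn in Python along the following lines. This is a sketch only: the function name and the study number format are hypothetical, the sample size is rounded up (an assumption), and the seed is fixed so that the same sample can be reproduced and documented.

```python
import math
import random


def draw_sample(study_numbers, percent=10, seed=None):
    """Draw a simple random sample of study numbers without replacement."""
    pool = sorted(set(study_numbers))           # remove any duplicates
    k = math.ceil(len(pool) * percent / 100)    # rounding up is an assumption
    rng = random.Random(seed)                   # seeded for reproducibility
    return sorted(rng.sample(pool, k))


# Hypothetical study numbers for 100 patients.
study_numbers = [f"P{n:03d}" for n in range(1, 101)]
sample = draw_sample(study_numbers, percent=10, seed=2024)
print(sample)
```

Recording the seed alongside the list of study numbers allows the sample to be regenerated later if the check needs to be audited.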
The only requirement for who carries out the data entry checking is that it isn’t the person who entered the data. It might be beneficial for somebody who is familiar with the trial and the CRFs to carry out data entry checking. However, there is no reason why somebody independent of the trial couldn’t undertake the task, and they might be able to bring a new perspective on any issues with the CRF or the database.
There isn’t clear guidance as to whether or not erroneous data should be corrected. Where errors in the data have been identified through exploratory data analysis, range checks and validation checks, the data should be corrected, as these checks cover the entire dataset at that time point. However, correction of erroneous data in samples from a 10% data entry check means that the data is clean of data entry errors for that 10% only. Any subsequent 10% samples would therefore have to exclude patient forms that were in any previous sample, or the error rate wouldn’t reflect the true error rate for single entry data. It might therefore be more sensible not to correct errors found during data entry checking, so that if the error rate is found to be higher than acceptable then the whole dataset would require appropriate cleaning or double entry.
The method used to estimate the error rate should be documented in the data management master file. There also needs to be documentation of the pre-defined acceptable error rate for primary and secondary outcomes, or for critical and non-critical endpoints. This will vary from trial to trial and should be specified in the statistical analysis plan. Along with a specification of the acceptance level for error rates, there needs to be consideration of what to do if the error rate exceeds the pre-defined level. Higher than acceptable error rates are unexpected for primary outcomes, and where secondary error rates only exceed the pre-defined limit by a small amount a further 10% check would normally be the next course of action. Again, this will vary depending on the trial, as in certain trials further data checking might only confirm what is already known about the level of data entry error, and so addressing this is more important than deriving further error rates. Similarly, there will always be a point when further checking of random samples is no longer informative and second data entry will need to be considered. Another possible next step following an unsatisfactory error rate might be to check the particular variables or types of variables that mainly contribute to the error rate instead of whole forms. An approach to data entry checking based on variables instead of patients is discussed by Brunelle et al. [1].
Guidelines for error avoidance and checking
-Ideally one person will be responsible for the data entry throughout the trial.
-The data entry of each new person entering data on a trial will be checked early on i.e. after first 10 forms.
-Exploratory statistics covering outliers, the range of variables and levels of missing data need to be run and documented at regular intervals. Out-of-range values should be corrected, and apparent outliers and variables with high levels of missing data should be checked. Once the statistician has set these checks up they should be able to re-run them at any stage as long as the database doesn’t change.
-Data entry checking is still needed, as exploratory statistics won’t pick up erroneous values that are plausible and within range but could still affect the conclusions. For example, any categorical value that is out of range is an obvious error, but a range check doesn’t pick up values that have been assigned to the wrong category within range.
-If feasible, databases should have in-built checks that flag out-of-range data as it is entered, prompting the person entering the data to check and correct it immediately.
-The constraints used for range checks could be made less broad so that a greater number of potential errors are highlighted and can be re-checked.
-From the data checking exercise, dates seemed to be the type of variable that was entered incorrectly most frequently. Dates should be checked against those at other time points to confirm that they follow on as they should, e.g. checking that treatment date is after injury date.
-It is suggested that data entry checking will be carried out during interim periods when the data is cleaned and will be carried out separately for each follow-up form. Error rates across follow-ups can then be combined.
-Data entry checking for follow-up forms should be carried out on samples of 10 or more. Ideally data entry checking will be carried out on a sample of all patients to have completed follow-up whenever data entry checking takes place so that the error rates will be more precise.
-If data entry checking is only carried out on the patients to have completed follow-up since the last check then calculation of the error rate should be combined with data from the previous check. This will improve precision.
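The date check suggested in the guidelines above can be sketched in Python as follows. The field names and the example record are hypothetical. The sketch flags a date only when it falls before the preceding date, so two events on the same day are accepted; that choice is an assumption and may need tightening for a particular trial.

```python
from datetime import date


def dates_out_of_order(record, ordered_fields):
    """Return pairs of consecutive date fields that do not follow on.

    record: mapping of field name to a date (or None if not recorded).
    ordered_fields: field names in the order the events should occur.
    """
    present = [(f, record[f]) for f in ordered_fields if record.get(f) is not None]
    problems = []
    for (earlier, d1), (later, d2) in zip(present, present[1:]):
        if d2 < d1:   # the later event is dated before the earlier one
            problems.append((earlier, later))
    return problems


# Hypothetical record in which the treatment date precedes the injury date,
# for example because the day and month were transposed at data entry.
record = {
    "injury_date": date(2011, 3, 10),
    "treatment_date": date(2011, 3, 1),
    "followup_3m_date": date(2011, 6, 12),
}
order = ["injury_date", "treatment_date", "followup_3m_date"]
print(dates_out_of_order(record, order))
```

Missing dates are skipped in this sketch, so they would need to be picked up separately by the missing value checks.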