Ooh Ooh Ooh! They Finally Did It!
I have been waiting a while for a paper like this. Someone finally went and did a rather comprehensive analysis of outcome measures used in autism research. They evaluated whether or not the outcome measures were valid as endpoint targets for intervention studies.
To give away the ending, there aren’t any valid endpoints for autism research at present. What concerned me more, however, is the number of studies they had to exclude for invalid statistics, underpowered designs, or too few participants. I find this terrifying because there are a lot of drug treatment studies for autism, and if this manuscript is correct (and I feel it is), not a single one of these studies is valid.
The paper I am going to use as a springboard for discussion is entitled: “Systematic review of tools to measure outcomes for young children with autism spectrum disorder”, and it can be freely downloaded Here.
I will review the findings of this paper, but in this post I will also go deeper into some problems with scientific research right now, particularly as related to clinical populations. By and large, my impression is that there is a lot of cherry picking of subjects and convenient removal of inconvenient outlying/strange datapoints that do not support hypotheses. We ignore and actively avoid data that contradict what we want to hear. And the paper I will talk about addresses these topics, albeit in a much more diplomatic way than I will in this post.
“Everyone is entitled to his own opinion, but not to his own facts.”
– a statement commonly attributed to Daniel Patrick Moynihan
A Précis of the Manuscript
The first thing that struck me about the manuscript was the scope of the reviewed research. The authors systematically reviewed around 10,154 scientific articles published between 1992 and 2013 (i.e., they limited themselves to studies using ICD-10 and DSM-IV diagnoses of conditions included under “Autism Spectrum Disorders”). From these papers they identified 131 assessment tools that were used in multiple studies and were not lab-specific, and they set out to evaluate the appropriateness and efficacy of those tools. From an additional 2,665 articles, they were able to extract sufficient data to evaluate 57 tools more closely (the rest of the 131 did not have sufficient data available to support an analysis of efficacy, so these were discarded). To evaluate the measurement properties of these tools, the authors used the COnsensus-based Standards for the selection of health Measurement INstruments (COSMIN). From their website, here is the intent of COSMIN:
The COSMIN initiative aims to improve the selection of health measurement instruments. As part of this initiative, the COSMIN group developed a critical appraisal tool (a checklist) containing standards for evaluating the methodological quality of studies on the measurement properties of health measurement instruments. The COSMIN checklist was developed in an international Delphi study as a multidisciplinary, international collaboration with all relevant expertise involved. The focus was on Health-Related Patient-Reported Outcomes (HR-PROs), but the checklist is also useful for evaluating studies on other kind of health measurement instruments, such as performance-based tests or clinical rating scales.
The COSMIN checklist can be used to evaluate the methodological quality of studies on measurement properties, for example in systematic reviews of measurement properties. In systematic reviews it is important to take the methodological quality of the selected studies into account. If the results of high quality studies differ from the results of low quality studies, this can be an indication of bias.
The COSMIN checklist can also be used as a guidance for designing or reporting a study on measurement properties.
Students can use the COSMIN checklist when learning about measurement properties.
Reviewers or editors of journals can use the COSMIN checklist to appraise the methodological quality of submitted studies on measurement properties and to check whether all important design aspects and statistical methods have been clearly reported.
There was a set of domains that the study evaluated, each with associated tools that made the final cut to be formally evaluated (other measures lacked empirical support or did not have data available to allow for validation):
Autism symptom severity: Autism Behavior Checklist (ABC); Autism Diagnostic Interview-Revised (ADI-R); Autism Diagnostic Observation Schedule (ADOS, including Toddler Module and Calibrated Severity Score); Autism Observation Scale for Infants; The Baby and Infant Screen for Children with aUtIsm Traits-Part 1 (BISCUIT); Behavioral Summarized Evaluation (BSE-R; including Revised and Infant); Childhood Autism Rating Scale (CARS); Gilliam Autism Rating Scale (GARS and GARS-2); Modified Checklist for Autism in Toddlers; Parent Observation of Early Markers Scale; Pervasive Developmental Disorders Rating Scale; Social Communication Questionnaire; Social Responsiveness Scale (SRS).
Global measure of outcome: Autism Treatment and Evaluation Checklist; Pervasive Developmental Disorders Behavior Inventory (PDDBI).
Social awareness: Imitation Battery; Preschool Imitation and Praxis Scale (PIPS).
Restricted and repetitive behaviour and interests: Repetitive Behavior Scale-Revised.
Sensory processing: Sense and Self-Regulation Checklist; Sensory Profile including Short Sensory Profile.
Language: MacArthur–Bates Communicative Development Inventories (MCDI); Preschool Language Scale-Fourth Edition.
Cognitive ability: Leiter International Performance Scale-Revised; Mullen Scales of Early Learning; Stanford–Binet Intelligence Scales-Fifth Edition.
Emotional regulation: Baby and Infant Screen for Children with aUtIsm Traits-Part 2 (BISCUIT-Part 2); Children’s Global Assessment Scale; Infant–Toddler Social–Emotional Assessment (including Brief form).
Play: Test of Pretend Play (ToPP).
Behaviour Problems: Child Behavior Checklist (CBCL 1.5–5 and CBCL 6–18); Aberrant Behavior Checklist; BISCUIT-Part 3; Home Situations Questionnaire-Pervasive Developmental Disorders (HSQ-PDD) version; Nisonger Child Behavior Rating Form.
Global measure of functioning: Behavior Assessment System for Children-Second Edition (BASC-2); Psychoeducational Profile-Revised (and Third Edition); Scales of Independent Behavior-Revised; Vineland Adaptive Behavior Scales (VABS; including Classroom and Screener versions).
Parent stress: Autism Parenting Stress Index; Parenting Stress Index-Short Form (PSI-SF); Questionnaire on Resources and Stress-Friedrich Short Form.
The authors’ conclusion of their work:
The review has provided, for the first time, not only a list of tools used in measuring outcomes for children with ASD up to the age of 6 years, but also a systematic evaluation of their measurement properties and qualities. A tension between the diagnostic process in ASD, and the focus on parent and professional valued outcomes, was evident. The synthesis of evidence took into account the availability of tools, stakeholder views about the presentation of tools, the age range covered and the extent of the positive evidence about measurement properties in use with children with ASD. In summary, just 12 tools were considered the most valid overall; however, given their scope and limitations, these should not be considered a ‘recommended battery’. These tools were ADOS; BSE-R; CARS; SRS; PDDBI; PIPS; MCDI; BISCUIT-Part 2 (co-occurring symptoms); CBCL; HSQ-PDD version; PEP; and the PSI-SF.
Great, but what does all that mean?
The take-home message, as stated in the actual paper, was that the above 12 measures (ADOS; BSE-R; CARS; SRS; PDDBI; PIPS; MCDI; BISCUIT-Part 2 (co-occurring symptoms); CBCL; HSQ-PDD version; PEP; and the PSI-SF) had the strongest supporting evidence for their usage. However, and this is a huge however, the evidence was patchy at best and the scope of the outcome measures was extremely limited, which means that none of the tools can comprise a recommended battery. There was also very limited evidence that these 12 tools would be useful for detecting change in any type of behavioral or pharmacological intervention study. In plain English, even though these are the best tools we have at present, none of them should be used during clinical studies of medication or behavioral therapy to measure improvement. Additionally, these tools lack patient well-being and participation outcomes for the children being studied, and they lack quality-of-life outcomes for patients and their families. Based on these results, I further suggest that none of these tools are appropriate for educational placement purposes either. New, more reliable tools are needed.
We are not entitled to our own facts
We now enter into the meat of the discussion. It is my decided opinion that, to a large degree, all of these “evidence based” outcomes are only evidence based because the authors of the manuscripts wanted the measures to work. And so did the rest of us, because we depend upon them. As an example, the researchers explain a number of times that they had to exclude tools and articles from their analysis for irregularities. These irregularities include: underpowered studies/analyses (factor analyses in experiments including only 32 participants, effect sizes well below accepted norms, etc.), data from the tool being widely variable across experiments, and/or statistical analyses that could not be reproduced from the data.
The reason I find these irregularities concerning is that a lot of the measures that were classified as unreliable by the authors are considered “gold standard” measures in clinical studies and are heavily used for classification in school districts across the United States. The low reliability observed in the BASC-2 and Vineland is appalling given the wide usage of these tools. The low reliability of the GARS and GARS-2 relative to the CARS is similarly troublesome. The fact that the Mullen, Stanford–Binet, Bayley and other tests of cognitive function are unreliable scares me. All of the above tools are widely relied upon as primary outcomes in intervention studies. They are also used to provide the diagnostic data that support diagnoses in the clinic. Even more terrifying, the ADI-R and ADOS tend to be confirmed as powerful tools for diagnostic purposes in studies deemed to be “of low quality” based on COSMIN criteria, but are shown to be only moderately useful (if not variable) measures in experiments deemed of higher quality. I am worried for one simple reason: these “low quality” tests are precisely the tests we rely upon to correctly classify children as autistic or not.
So which is it? Are these gold standards valid or not? Right now, I cannot say. What I can say based on the data I have seen and the papers I have read is that it is a mixed bag. And mixed bags do not gold standards make.
Why is this a problem?
So far as I can tell, when we select outcomes for our studies (or educational interventions), we look up studies that support what we want to do. We actively ignore the ones we do not like. We find reasons to accept that the tools we are used to are infallible and that anyone who says otherwise is biased or out to gain from making waves. To the last point, fair enough, but we refuse to apply the same scrutiny to the studies we agree with. Specifically, in what universe is it okay to take only 32 autistic kids, give them a standardized measure, and proclaim loudly that the Aberrant Behavior Checklist is valid because a factor analysis was able to pull out 5 factors that match the Aberrant Behavior Checklist rubric? This is bad because the subject-to-variable ratio should be at least 20:1 (20 participants for each factor, such that 5 factors would require 100 subjects to be valid). I pull this example because (this study) has been used as a validation of the Aberrant Behavior Checklist and as evidence for its wide usage. A number of studies exist that have not shown data as compelling for the utility of the Aberrant Behavior Checklist, but they go relatively ignored.
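The arithmetic behind that objection is simple enough to sketch. The snippet below only illustrates the 20:1 rule of thumb described above; the function names and the default ratio are my own illustrative choices, not anything taken from the review or from the study in question.

```python
def required_participants(n_factors: int, ratio: int = 20) -> int:
    """Minimum sample size under a participants-per-factor rule of thumb."""
    return n_factors * ratio


def meets_ratio(n_participants: int, n_factors: int, ratio: int = 20) -> bool:
    """True if the sample is at least as large as the rule of thumb asks for."""
    return n_participants >= required_participants(n_factors, ratio)


# A 5-factor solution under a 20:1 rule of thumb
print(required_participants(5))  # 100

# The example above: 32 participants, 5 factors
print(meets_ratio(32, 5))  # False
```

Under this rule, a sample of 32 falls 68 participants short of what a 5-factor analysis would call for.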
I have also watched, both up close and from afar, scientists selectively pick subjects to maximize the efficacy of their test, not to represent the larger group. There was a time in my life when my brother did not qualify for a lot of research projects because he was nonverbal. Because he was nonverbal, he was specifically excluded from studies. Now, this could be for a couple of reasons: it could be because as scientists we are lazy and demand verbal answers to our questions and a nonverbal kid doesn’t fit that model, or else because we just assume nonverbal=dumb, or mentally retarded in the parlance of the DSM-IV. However, when some researchers finally relented and let my brother participate in a study, he blew the verbal participants out of the water. I relate that to get to my next point: I have watched researchers gleefully remove data points that do not fit what “other papers” have reported. For example, data come in for a participant that look weird. Rather than formally evaluating the assumptions for later statistics and verifying whether the data points are legitimate outliers or just influential/extreme data points, individuals switch back and forth between different criteria for outliers (e.g., 2.5 standard deviations beyond the mean vs. 1.5 times the interquartile range beyond the quartiles), depending on whether they want certain data points to remain part of the dataset or to get removed, with the desire to keep or remove data based solely upon whether they support pet hypotheses. Now, there are legitimate reasons to remove outliers, but this process needs to be done by an unbiased statistician using previously determined criteria, not on the fly by the PI of the study in a manner that can allow bias to enter the analysis. This is problematic because when scientists “clean” their data, they make it look more like their expectation, and thus less representative of reality.
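To make the outlier point concrete, here is a minimal sketch using made-up scores (these are not data from any study). It applies the two criteria mentioned above to the same dataset. The function names, the cutoffs as defaults, and the choice of Python's `statistics.quantiles` with its default quartile method are all my own assumptions; other quartile conventions will move the fences slightly.

```python
import statistics


def sd_outliers(values, k=2.5):
    """Flag values more than k sample standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > k * sd]


def iqr_outliers(values, k=1.5):
    """Flag values more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return [v for v in values if v < q1 - k * iqr or v > q3 + k * iqr]


# Hypothetical test scores for ten participants; one looks "weird"
scores = [10, 11, 12, 12, 13, 13, 14, 14, 15, 20]

print(sd_outliers(scores))   # []
print(iqr_outliers(scores))  # [20]
```

The same participant stays in the dataset under one rule and gets thrown out under the other. That is exactly why the criterion has to be fixed before anyone looks at the data.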
An example of a similar phenomenon from my behavioral neuroscientist past is a long frustration I have had with the wide use of the water maze and contextual fear conditioning. Despite research showing extreme variability within and among labs in the performance of rodents on the water maze/contextual fear conditioning, they are still considered gold standards for testing hippocampus-based memory function (see a thorough analysis Here and Here). This is despite >30 years of researchers saying the water maze and fear conditioning have major confounds and often do not test what the scientist thinks they test. However, scientists ignore a preponderance of evidence and simply cite the review papers that support their preconception that the water maze or fear conditioning are the best tasks out there. They then creatively interpret their data in a Procrustean manner to fit their hypothesis and show they were right all along. This Procrustean activity also includes, much as mentioned above, removing outliers to “clean” the data. In fact, very often the scientists go out of their way to attack those who raise doubt regarding the efficacy of their chosen task (I have 15 years of manuscript reviews to back up this statement).
Now, I am 100% cool with someone choosing to use any task that they want; in fact, I have used fear conditioning in my research. I am not cool when scientists try to use cherry-picked science to defend their position. Optimally, the task selection process would go as follows: The scientist has a hypothesis to test and needs to choose a task. The scientist reads the evidence regarding the pros and cons of different behavioral paradigms as related to their hypothesis. They make a decision based upon a thorough literature review. In publication, they defend their choice scientifically. Repeat. But do not hide behind “evidence” or “data” that you know do not tell the whole story to defend a position. It is intellectually and scientifically dishonest.
Why does this happen? I think the reason is the same for both clinical research using the tools mentioned earlier and rodent research: money talks. Research funding (especially NIH funding in the United States) to develop, test, and refine better diagnostic tools is not easy to get. In the present funding situation, such endpoints have to be developed as side projects because the bulk of the grant needs to be used to pharmacologically treat patient populations. As a scientific community, we are more interested in finding the magic bullet that fixes autism than we are in identifying an appropriate endpoint needed to quantify the effects of any treatments we give.
Everyone is entitled to his own opinion, but not to his own facts.
Now, a synthesis of my above 2 points. What we do in clinical science/neuropsychology/school placement, etc. is important. We are using carefully selected tools to help determine the placement and facilitate the academic success of our students. We are using these outcome measures as targets for drug studies that may take a medication from a limited pilot study into a full clinical trial involving hundreds, if not thousands, of children. We are diagnosing kids as having autism or not. These are important decisions. Critical, in fact. If we let our biases enter into our decision making, we run the risk of hurting kids. We can change their lives for the worse. All because we “knew” something was true despite evidence to the contrary that we chose to ignore; or else we chose not to use a test because we did not want to deal with evidence contrary to our hypothesis/desires.
All in all, we as scientists need to remember the responsibilities we have to the populations we serve. We have every right to prefer certain measures. We have a right to our opinions. We have a right to do everything in our power to help those we serve. We do not have the right to ignore evidence that we are wrong. We do not have the right to let our intuition regarding which studies are valid and which are invalid allow us to ignore data that contradict our opinion.
Doing so renders all our data invalid.
Doing so means we aren’t scientists; it means we are charlatans.
Doing so hurts kids.
And we are doing so on purpose.
So what is the fix?
So to avoid pontificating doom without providing a possible solution to the problem, here we go:
- Scientists need to read the literature. This means researchers need to read up on the outcome measures they use. They need to read both the complimentary papers and the papers that emphasize the limitations of the methods. In other words, scientists need to know their methods inside and out. This is particularly important when the researcher has not developed the method in house. The review paper I link to in the intro to this post provides a great starting point.
- Scientists need to choose a “battery” of methods. This one would require a slight change in how clinical trials work, unfortunately. At present a single outcome measure has to be identified and used as the endpoint. This invites bias. It demands homogeneity in the data. A better alternative would be to select a domain as an outcome and use converging data from slightly overlapping tools as the metric (e.g., ADI-R + ADOS together used as measures of communication deficits in autism).
- Scientists need to employ statisticians. Many of the main problems identified in the article were statistical: researchers performing complex analyses with too few research subjects, for example. Obtaining the help of a statistician will help keep the researcher’s bias from entering into any analyses (e.g., convenient removal of outliers).
- Scientists need to rigorously develop and validate outcome measures. In other words, we need to exert ourselves to overcome the limitations described in the linked article. A specific effort should be undertaken to evaluate each of the 131 identified tools (and any new tools as they become available) in as many autistic individuals as possible across a wide age range. Preferably, these should be multi-site studies as well, to increase the sample size. And if a popular test doesn’t work, we reject it. If a relatively unknown test works reliably, we adopt it.
- Education needs to adopt only methods that are truly evidence-based. This means that as tools are shown to be invalid, they need to be discarded. As methods are shown to be highly reliable, they are to be included. Yes, I understand this makes continuity across students difficult, but we have trained clinical psychologists in school districts who can identify analogous norms across tests. This will reduce the probability that we cling to popular, yet invalid, tests that misclassify students and do them harm rather than provide much-needed assistance.
These are not easy steps. But they are the steps that are necessary to move our understanding of autism forward. We need a broad spectrum of tasks to evaluate autism because autism presents along a spectrum. We need to exert ourselves as a scientific and educational community to improve our methods and leave our preconceptions at the door. They only hurt the kids we are professing to help.
In selecting tools we need to remember:
“If you’ve met one person with autism, you’ve met one person with autism.”
– attributed to Dr. Stephen Shore (among many others).