2018 AANS Annual Scientific Meeting
Prospective Clinical Trials in Neurovascular Surgery: The Good, The Bad & The Ugly
Video Transcription
Okay, we'd like to introduce Dr. Kevin Cockroft to give the lecture on Prospective Clinical Trials in Neurovascular Surgery: The Good, the Bad, and the Ugly. All right. Thank you, Robert. Thanks to the organizers for inviting me to give this little talk here. So, as you probably heard from the sessions this morning, evidence-based medicine, and how you grade and assess evidence, has become an important part of being a neurosurgeon, in deciding whether you can believe what you read and what you can do with what you read in the journals these days. So what I'd like to give you is my perspective on some of this. And this is from the perspective not of somebody who is doing the clinical trials themselves, but of somebody who has spent the past several years reviewing trials, grading evidence, and reviewing guidelines, and also participating in guideline writing. So the subtitle is The Good, the Bad, and the Ugly, and that's basically giving you examples of what I think were some good trials, some not-so-good trials, and then one that is, I think, a little bit confusing in how it's set up. And that's what the ugly part is. A couple of financial disclosures; none of these are relevant to this particular talk. So you heard from Jacques earlier about art and life. Well, this is "life imitates art far more than art imitates life," and we'll look at these examples in a bit. So what I want to cover is to give you an idea of the typical levels of evidence that are used to classify clinical research studies, so that you can identify the key design characteristics that impact these trials, particularly in neurovascular surgery, and then recognize the limitations associated with prospective randomized trials in neurovascular surgery in particular. So we're going to frame the discussion by first talking briefly about the different levels of evidence. Typically there are four levels of evidence, often labeled 1 through 4 or A through D, and they're based on the type and quality of the study that was used to generate the evidence. So level 1, or level A, evidence, depending on which criteria you're using, is a good quality prospective randomized controlled trial: a high quality design, good execution, avoidance of bias. These are often referred to as the gold standard of clinical trials. So a good PRCT should have random sequence generation. The allocation should be concealed from all parties. It should be analyzed on an intent-to-treat basis. There should be blinded or independent assessment of the outcomes, and the interventions in both groups should be applied similarly. You want to have good follow-up, and you want an adequate, representative sample size. A level 2 study is a moderate or poor quality RCT; just because you have an RCT doesn't necessarily mean it's level 1. A good quality cohort study can also be level 2 evidence. These have some potential for bias, but the idea is that the bias here is not so significant that it's going to really affect the conclusions. For a good cohort study, prospective studies should have blinded or independent assessment, and retrospective studies ought to be based on a really reliable database. Interventions, again, should be applied equally, you ought to have good follow-up, and you need an adequate, representative sample size.
And you want to control as much as possible for the known confounders. Level 3 evidence covers moderate or poor quality cohort studies; case-control studies would fit into this category. These have significant flaws in the design, which may potentially impact or invalidate the results of the study. And then level 4, these are case series. Case reports are sometimes put in at level 4 and sometimes moved to a separate level 5. But these studies have significant potential for bias, and the main problem here is that there's no control group, so you have difficulty making any comparisons. And as you can imagine, most of the neurosurgical literature is rife with level 4 evidence, right, case series. So what do you do with the evidence? Well, you usually make recommendations based on the evidence, right? So in evidence-based medicine, you want to try and link the evidence to the level of the recommendation. And the classes of recommendations vary according to how important or strong you think the recommendation ought to be. A typical EBM guideline will have evidence tables. These allow you to see what the authors were thinking when they decided how to grade the evidence and how to decide on the levels of the recommendations. So this would be a typical class 1 recommendation: IV tPA, based on a prospective randomized trial of a drug. It doesn't matter who gives the drug; there's no skill involved in that. The patients were randomized, with equal distribution, to an outcome that was very important to the patients, whether they're independent or not. And this gets a level of evidence of A from the prospective randomized trial, and it gets a class 1 recommendation. The problem, though, is: what happens if the strength of the evidence doesn't match the strength of the recommendation? And the classic example for this is the parachute example. This is an article in the British Medical Journal back in 2003 by these two guys. It's tongue-in-cheek, but it illustrates the point: "Parachute use to prevent death and major trauma related to gravitational challenge: a systematic review of randomised controlled trials." You can imagine there probably aren't many randomized controlled trials in this area. And what they found was that the perception that parachutes are a successful intervention is based largely on anecdotal evidence. True, right? Observational data have shown that their use is actually associated with morbidity and mortality; just because you have a parachute doesn't mean you're going to survive when you pull the chute. And natural history studies, quote, have indicated that sometimes, even if you fail to deploy the parachute, you might still survive. So do we take from this that we shouldn't use a parachute? Well, obviously, no. We assume that there's some intuitive benefit to the use of a parachute, so we will use it. But the intuitive benefit also gets us into trouble. In neurovascular surgery, the intuitive argument goes: you've got this large artery going to the brain, and it's blocked, right? The patient's having symptoms, and the symptoms are related to the large artery that's blocked, so why don't we just create a bypass around that large blocked vessel? Well, guess what? That's EC-IC bypass. And what happened with that?
Well, we had one, two, three prospective randomized controlled trials that showed that it actually didn't benefit patients. Now, as you heard in the talk earlier this afternoon, maybe there's a subset of patients that do benefit, and people are selecting those and perhaps still treating them and having good outcomes, but we've not been able to demonstrate that in a prospective randomized trial. So where do we go from here? Is the prospective randomized trial really the gold standard, particularly for evaluating a procedure or a device? I would say that surgical devices or surgical procedures rarely meet all the criteria to qualify as a good level 1 prospective randomized trial. And we'll look at some of those advantages and disadvantages. Some of the disadvantages of a PRCT for surgical device evaluation: the result depends a lot on the surgeon and the device. As you heard in the talk earlier this afternoon, there's the importance of the art of surgery. It's very difficult to double-blind a surgical or device trial; somebody's going to know. The user certainly knows, and the patient often knows, although we do have some sham surgical trials, these are pretty rare these days. The technology is constantly changing: when the trial starts, there's one technology in use, and by the time it finishes, it's often different. And oftentimes our strict inclusion and exclusion criteria can limit the generalizability of the trial. So here is the list I showed you before of what's supposed to make a good PRCT. If you look at these, how do they actually apply to a device or surgical procedure? Well, allocation concealment is pretty difficult, as we saw. Intent-to-treat analysis is also not that important. For a drug, you want an intent-to-treat analysis because you want to know if the drug can't be tolerated; if the patient gets ill when they take it, or they don't like the taste, they're not taking it, and compliance is an issue. Whereas with a device or a surgical procedure, you actually want to know what happens when they get the device or when they actually have the procedure, so the as-treated analysis is probably more important. And then it's very hard to do blinded or independent assessment of outcomes because, again, somebody often knows; the patient often knows what they had. And then adequate, representative sample sizes can be difficult, because many of these diseases that we're talking about in neurovascular are actually not very common, and so it can be hard to accrue a large sample size in a short period of time. So there are some important design choices which go into the decision-making of how you're going to put together a trial. I'm not going to cover all of them here; this is just to show you a few of them, the ones that, as a reviewer of trials and as a person creating guidelines, really impact how we look at what the trial is showing us. So the first thing you need to do is determine the relevant inclusion criteria. You need to define what the intervention is going to be, identify an appropriate control group, and select meaningful outcomes for the disease process; there are lots of different outcomes you could look at, and you want to find something that's going to be actually meaningful. And then plan a statistical analysis and define your safety endpoints and any stopping parameters you may need.
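As a rough illustration of that last step, here is a minimal sketch of a two-proportion sample-size calculation using Python's statsmodels. The outcome rates are hypothetical, loosely inspired by the MR CLEAN figures discussed below; a real trial would power for its own prespecified effect size and analysis plan.

```python
# Minimal two-proportion sample-size sketch (hypothetical rates,
# loosely inspired by the MR CLEAN good-outcome figures below).
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

p_treatment, p_control = 0.326, 0.191  # assumed good-outcome rates
effect = proportion_effectsize(p_treatment, p_control)  # Cohen's h

n_per_arm = NormalIndPower().solve_power(
    effect_size=effect,
    alpha=0.05,              # two-sided type I error
    power=0.80,              # 80% power
    ratio=1.0,               # equal allocation between arms
    alternative="two-sided",
)
print(f"~{n_per_arm:.0f} patients per arm")  # roughly 163 per arm
```

Real trials typically enroll well beyond such an optimistic minimum, to allow for crossovers, losses to follow-up, and smaller-than-hoped effects; MR CLEAN, for instance, enrolled about 500 patients.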
So I'm not going to go through all of these, but there are three that I'm going to concentrate on and show you, in these examples, how they influence the actual results and, therefore, what we decide to do with those results. So, the good. Let's start with this one: MR CLEAN. He fits pretty well with that picture. MR CLEAN, as you probably know, is a prospective randomized controlled trial done in the Netherlands. It's open label. These were acute ischemic stroke patients, and they had to have an occlusion on CTA that would be feasible for intra-arterial treatment within six hours. Patients were randomized to IV tPA alone or mechanical intervention with or without IV tPA. And the outcome was a modified Rankin score of 0 to 2 at 90 days, so a good functional outcome. The results: about 500 patients enrolled. Most of the patients actually did receive IV tPA, and most of the patients were treated with a retrievable stent. Good outcome at 90 days was 32.6% in the treatment group and 19.1% in the control group. There were no significant differences in the other outcomes. So this favors the treatment group: more than twice as likely to have a good outcome. So what were the design choices here? Well, they had relatively broad, representative inclusion criteria, and they had enforced participation. This is an unusual thing about this study, but it's very important in this particular case. Unlike in the United States, in the Netherlands they have a single-payer system, and it's a relatively small country. So what they were able to say was: you cannot get paid for doing mechanical thrombectomies, or thrombectomies in general, unless you participate in the study. So they were able to enforce inclusion, which eliminates a lot of selection bias. We had problems in the U.S. where people were cherry-picking patients: because they felt they knew these people were going to do better, they would treat the young, healthy patients and then not enroll them in the trial. So that enforced participation was important. And they had a relatively consistent intervention; almost everybody got the same kind of device. And they picked a meaningful outcome for the disease. This is an acute disease process, and the outcome at 90 days is important: whether the patient is disabled or not disabled, something that the patient can relate to. So it's good, but it's not perfect, right? It's relatively generalizable, but not completely: you still had to have a large vessel occlusion, and you still had to come in within six hours. And, in fact, most patients in the treatment group actually didn't benefit from the treatment; only 32.6% had a good outcome. And if you look later on at the subsequent studies, which had stricter inclusion criteria, the benefit increased. So in ESCAPE, where ASPECTS scores were used to help define the patient population, the good-outcome rate went up to 53%, and in EXTEND-IA, where CT perfusion was used as well, it went up to 71%. So you see here how just the choice of the inclusion criteria makes a big difference in the treatment effect. The intervention didn't change, just the inclusion criteria. And then the other question that came up here: was this result device-specific, or was it procedure-specific? One device was used for 97% of the patients, but does that translate into meaning that the same thing would work if you used an aspiration device? The question is not answered.
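To put a number on "more than twice as likely": a back-of-the-envelope check from the reported rates looks like this (the trial's published estimate was a covariate-adjusted odds ratio, so this crude figure is only approximate):

$$
\mathrm{OR} \approx \frac{0.326/(1-0.326)}{0.191/(1-0.191)} = \frac{0.484}{0.236} \approx 2.0,
\qquad
\mathrm{RR} \approx \frac{0.326}{0.191} \approx 1.7 .
$$

So "twice as likely" is a statement about odds; the simple risk ratio from the same figures is closer to 1.7.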
So how about the bad? Well, most of you have heard me talk about evidence-based medicine before and can guess which trial I'm going to show for bad. It's ARUBA, although that's probably not the right picture; it's probably that one. ARUBA was an NIH-sponsored prospective trial. It's unblinded. These were adult patients with unruptured AVMs. The diagnosis was made by CT or MR; it did not require angiography. There was a one-to-one randomization between any kind of treatment aiming at eradication and observation alone. The primary endpoint was stroke or death. They did have a clinical impairment endpoint based on the modified Rankin score, but that was a secondary one. So enrollment was reduced from an original plan of 800 to 400 because of issues with enrollment, and randomization was stopped in April 2013 by order of the DSMB because the prespecified boundary for the primary endpoint had been crossed. So the outcome data were published based on 223 patients with a mean follow-up of only 33 months. And what was the main result? Well, you had 11 patients in the control, or medical, group that reached the primary endpoint versus 35 patients in the interventional group, and that was highly significant; the hazard ratio showed the risk in the medical group reduced to about a quarter of that in the interventional group. So what were some of the design choices here in ARUBA? Well, they had relatively broad inclusion criteria, but they had no ability to enforce the inclusion of patients. Now, what you can do if you don't have the ability to enforce it is track what happens to the people who are not in the trial. But unfortunately, in ARUBA they didn't make a very strong effort to do that. They had a list, but they didn't know what happened to those patients. So you had no detailed screening logs; you didn't know what happened to the hundreds of patients who were not included. There was a wide variety of interventions allowed, and practitioners were not adjudicated. Adjudicating practitioners is a common way to try and improve your chances of a good outcome with a procedure in a trial, because you try and stack the deck in favor of people who are good at whatever procedure you're doing. The follow-up was one of convenience, really based on NIH funding; they couldn't get follow-up for 20 or 30 years. And then the primary endpoint was an event rather than a condition. What I mean by that is the event is stroke or death; it's not the condition, not whether the patient is disabled or not disabled. So if you have a stroke, whether you get better or not doesn't matter. You had a stroke, and that's the primary endpoint. Now, these design choices are just choices. You can make them or not make them, but there are investigator biases that go into determining which choices you make. And when you see a publication like this that came out before the trial, by the primary investigators, you begin to wonder if there's not only some unconscious bias, but perhaps some conscious bias, in how the trial was designed in order to get the desired result. So let's look at a couple of these things in a little bit more detail. First, selection bias in terms of the patients. About 1,700 patients were screened. Of those, the eligible population came to just over 700. Only 226 were actually randomized. Most enrollments were actually from Europe. Of the 13 enrolling centers, many U.S. centers which had busy AVM practices either didn't enroll any patients or enrolled very few. So this is a clear indication of potential selection bias.
Now, one of the things you look at for selection bias is: how did you evaluate disease severity? Can you tell what the disease severity is in your population? And so the investigators said, well, we had a lot of grade 1 through 3 AVMs in this group, so we know that we selected optimal patients for intervention. Well, I'd say that's not necessarily the case, because the Spetzler-Martin grade has nothing to do with the disease severity; it's a predictor of the outcome with surgical intervention. So you don't really know: did they have intranidal aneurysms? Did they have venous outflow obstruction? Did they have other risk factors that practitioners decided would be indications for treatment, so that they got treated outside of the trial and were not randomized? So, again, selection bias is a huge issue here. Let's talk about the impact of outcome selection and the short follow-up. Thirty-three months, I'd say, is not really adequate follow-up for a lifelong disease, right? In ARUBA, the average age of the patients was 44 or 45, depending on which group you're in. Now, what does that mean? Well, the actuarial tables from the IRS, who are very interested in how long you're going to live, say you've got about 34 to 38 more years, depending on whether you're a man or a woman at those ages. The rupture rate in the medical group was actually 2.2 percent per year, which is higher than the investigators had predicted. So what was the impact of that? Well, look now at the modified Rankin, the clinical outcome. Initially, a modified Rankin score of 2 or greater, those are the impaired patients, was seen in 15 percent of the medical group versus 46 percent of the interventional group. So it definitely favors the medical group. Now, at five years, which is two years later than the original report, you'll see there are a couple of changes. Not only has the rate in the interventional group gone down, so there are fewer disabled patients, but the percentage of disabled patients in the medical management group has gone up. So two things are happening here. The patients in the interventional group actually get better, because that's usually what happens if you survive your stroke, and more events are occurring in the group that's not treated. So the outcome is changing in two ways. This shows you the impact not only of the duration of follow-up, but also of what you select as your endpoint. I can also look at their primary endpoint, which was stroke or death. At 33 months, roughly three years, the primary endpoint was reached by 11 in the medical group and 35 in the interventional group. So now assume that the medical group is going to continue to have the same event rate, because they weren't treated, so it probably will be about the same: that's 11 every three years. As for the interventional group, well, some people might say their event rate is going to be zero because they're treated, they're completely cured. I'd say that's probably not the case. They may have strokes for other reasons, they may die from other reasons, so there's going to be an event rate, and I just picked three per three years for argument's sake. You can vary that number a little bit, but it's almost certainly going to be substantially lower than in the medical group. So at 15 years, you'll have had 55 events in the medical group and 47 in the interventional group, using these numbers at least, and that will give you 50% versus 41%, for a 9% absolute risk reduction and a relative risk reduction of 18%. So well in line with the typical risk reductions seen in a lot of trials for asymptomatic diseases. And this is only projecting out 15 years, which is less than half the life expectancy of the average patient in ARUBA. So this, again, illustrates how what you choose as your outcome, and the duration over which you assess people, can impact your results.
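Spelled out explicitly, using the published analysis groups of roughly 109 medical and 114 interventional patients, and the assumed rates of 11 and 3 events per three-year period (the 3 being the speaker's illustrative guess):

$$
\text{medical: } 5 \times 11 = 55 \text{ events} \;\Rightarrow\; 55/109 \approx 50\%
$$
$$
\text{interventional: } 35 + 4 \times 3 = 47 \text{ events} \;\Rightarrow\; 47/114 \approx 41\%
$$
$$
\text{ARR} = 50\% - 41\% = 9 \text{ percentage points}, \qquad \text{RRR} = 9/50 \approx 18\%.
$$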
So, the last example: the ugly. This is perhaps a bit of a stretch, but it keeps the metaphor going, I guess. We'll look at the CREST-2 trial. And I say it's the ugly here not because I think the trial is necessarily bad, but because it's confusing and complex. It's actually two trials in one, not the most simple, streamlined trial design you'll ever see. So there are two trials here. One of them is randomizing endarterectomy patients; the other is randomizing angioplasty and stenting patients. Medical management is the same in both. The primary outcome is the composite of stroke and death at 44 days, and then going out to ipsilateral stroke at up to four years. It started in December 2014 and is planned to end in December 2020. The enrollment target is around 2,500; it's just over 1,000 right now, so they're about on pace, maybe a little bit behind. So, some of the design choices here. Well, there are varying but overlapping inclusion criteria, so it's not the same for both trials. This is actually an example of the criteria: on the right here you have the CEA inclusion criteria, and this is the angioplasty and stenting exclusion criteria. So very different. Now, there's some overlap: some patients are going to be eligible for both, some are going to be eligible for only one. And this, you know, makes it a little bit more difficult to decide sometimes. Participation here cannot be enforced, so you run into this concern of selection bias again; a patient can be treated outside of the trial. There's variable site participation: some sites are doing both trials, some sites are only doing one. And then you do have adjudication of practitioners. So, again, this tends to stack the deck in favor of the intervention, but it's different for each arm: for endarterectomy you have to show 50 cases, for angioplasty and stenting you only have to show 25. The assessment of illness severity here is very rudimentary: it's carotid stenosis, the same thing we did in ACAS over 20 years ago. There's a lot of new technology for looking at plaque morphology and friability, and we're not using any of it in this trial; just stenosis. So you worry that centers that are using these more complex assessments are then going to make decisions about which patients to treat or not treat, and which patients to put in the trial or not. The primary outcome includes any stroke, so this would bias you toward the non-intervention group, right, because you're including strokes that you would not expect to be able to reduce by an intervention. And the follow-up is relatively short for an asymptomatic disease.
Again, the life expectancy in these patients is probably going to be shorter than in the AVM trial, but you're still probably talking about 10 to 20 years, so four years is probably not going to be long enough. So what's the impact of all of this? Well, the impact can range from a new guidelines update for ischemic stroke, which is this one here, which gave a class 1, level of evidence A recommendation for using a stent retriever for endovascular treatment, to simply a scientific statement, not a guideline, for AVMs, which in fact says that management is still a subject of debate because of insufficient high-quality evidence. So there can be varying impact. Now, that's the scientific impact. The actual social and clinical impact can be great even if it's just a scientific statement. So the take-home message from ARUBA becomes that no asymptomatic AVM should be treated, and that's what gets reported in the lay press. And I would submit to you that even if ARUBA had been positive, it would not mean that every asymptomatic AVM should be treated. So this is not good either way. So this demonstrates some of the limits of PRCTs. Randomization eliminates confounding, but the problem is that variations in equipoise can lead to selection bias. And then narrow inclusion criteria may help you improve the outcomes, but in the end they limit your generalizability. And the choices you make with regard to outcomes really do impact the results, and thereby impact your conclusions. So the PRCT challenge: randomization works best either when a given practice exists with no real knowledge about its benefit or when no strong preference exists. If multiple strong preferences exist, then randomization is going to be difficult, because people have already decided what they're going to do. And finally, randomization is practically impossible if there's only one strong preference. If everybody's doing the same thing, they're not going to be willing to randomize patients; there's no equipoise. For randomization to be effective, you really have to minimize selection bias. So the main randomization challenge can be summarized as: someone needs to agree. And obviously there are certain things nobody's going to agree to, like the parachute example, which makes randomization very difficult. So is there another way? Well, I'd say one thing to consider is using a registry to answer some of these questions. The registry argument is that a large observational trial, using things like propensity score analysis or multivariable analysis (a sketch of what such an analysis might look like follows below), will give you some real-world information about the question you're asking, and will provide it in a more representative sample of patients, rather than the homogeneous group you'd have in a randomized controlled study. The registry challenge, though, is that a registry is not inherently comparative; it's only going to be comparative if certain conditions exist. A registry works when there are multiple strong preferences: if people are doing lots of different things in your registry, then chances are you're going to get a good comparison of different groups. It can also be useful when no knowledge or no preference exists, because, again, people are going to be doing lots of different things. The problem is that it becomes of limited value when there's only one strong preference: if everybody's doing the same thing, you're not going to end up with the ability to create any kind of comparison.
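Here is a minimal, illustrative sketch of what a propensity-score comparison from registry data might look like in Python. All of the variable names and the data file are hypothetical, and a real analysis would need careful covariate selection, balance diagnostics, a matching caliper, and appropriate variance estimation; this only shows the shape of the technique.

```python
# Illustrative propensity-score matching on hypothetical registry data.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

df = pd.read_csv("registry.csv")  # hypothetical registry extract
covariates = ["age", "lesion_size", "presenting_mrs"]  # assumed confounders

# 1. Model the probability of receiving the intervention given confounders.
ps_model = LogisticRegression(max_iter=1000).fit(df[covariates], df["treated"])
df["ps"] = ps_model.predict_proba(df[covariates])[:, 1]

# 2. Match each treated patient to the untreated patient with the closest
#    propensity score (1:1 nearest-neighbor matching, with replacement).
treated = df[df["treated"] == 1]
control = df[df["treated"] == 0]
nn = NearestNeighbors(n_neighbors=1).fit(control[["ps"]])
_, idx = nn.kneighbors(treated[["ps"]])
matched_control = control.iloc[idx.ravel()]

# 3. Compare outcomes in the matched cohorts.
diff = treated["good_outcome"].mean() - matched_control["good_outcome"].mean()
print(f"Matched difference in good outcome: {diff:.1%}")
```

Note that step 1 can only balance the confounders you actually measure, which is exactly the internal-validity problem discussed next.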
So the main problem with a registry is internal validity. And this results from an imbalance of confounders, either those that are known or, in the case where you've got multiple strong preferences, confounders that are unmeasured or unknown. But not all registries are created equal. You can do some things to make a registry better and improve your chances of getting good data out of it: prospective data entry is key to this, having the sites be open to audit, and having independent adjudication of the outcomes. So the newest thing in the registry realm is the prospective registry-based randomized trial. This is where you include a randomization module in the registry, enrollment is consecutive and unselected, and patients are randomized from that consecutive enrollment. This can reduce the cost compared to a conventional prospective randomized trial, and you can get improved efficiency because you're enrolling a lot more patients than you would in a standalone prospective study. And this was actually discussed in a New England Journal article as perhaps the new disruptive technology in clinical research: the prospective registry-based randomized trial. So just a couple of words, then, about what you can do in neurovascular surgery for registry work, and that's QOD Neurovascular. This was launched in 2014. It's managed by the NeuroPoint Alliance and was designed in conjunction with CV Section representatives. The purpose of this is to track surgical care for the most common neurovascular procedures and then provide an infrastructure that people can use to look at their own outcomes and report quality outcomes to payers or the government. Data are collected for the main subgroups of neurovascular disease: aneurysms, AVMs, intra-arterial thrombolysis procedures, and intraparenchymal hemorrhage. And there are 24 active sites currently, with over 3,800 patients enrolled and full accrual, that is to say, complete follow-up, on over 3,600 patients. Right now we're working on integration with NVQI, which is the registry from SNIS, and there's also a collaboration underway with the FDA looking at devices for acute ischemic stroke; that registry is called DAISY. So, in conclusion, just to remind you: the design choices that you make dramatically impact PRCT results, and these choices are subject to investigator bias, whether that bias is conscious or unconscious. There are significant limitations to RCTs, particularly for procedures and for devices that are dependent on the operator. And finally, registries, including the idea of a prospective registry-based randomized trial, can be a viable alternative to PRCTs for some of these difficult clinical questions. Thank you.
Video Summary
Dr. Kevin Cockroft gives a lecture on prospective clinical trials in neurovascular surgery, focusing on good, bad, and ugly examples. He starts by emphasizing the importance of evidence-based medicine in neurosurgery and the need to assess the quality of evidence in clinical trials. He discusses the four levels of evidence used to classify clinical research studies and explains the characteristics of each level. Level 1 evidence is considered the gold standard and involves high-quality randomized controlled trials. Level 2 evidence includes moderate or poor quality randomized controlled trials and good quality cohort studies. Level 3 evidence consists of moderate or poor quality cohort studies, while level 4 evidence includes case series and case reports. Dr. Cockroft also highlights the importance of linking evidence to the strength of recommendations in evidence-based medicine guidelines. He then presents examples of good, bad, and ugly trials in neurovascular surgery. The good example is the MR CLEAN trial, a high-quality randomized controlled trial that showed favorable outcomes for acute ischemic stroke patients treated with mechanical intervention. The bad example is the ARUBA trial, which had significant limitations in design and yielded controversial results on the treatment of unruptured arteriovenous malformations. The ugly example is the CREST-2 trial, which is complex and challenging due to variations in patient inclusion criteria and potential selection bias. Dr. Cockroft concludes by suggesting that prospective registries can be a viable alternative to prospective randomized controlled trials for certain clinical questions in neurovascular surgery.
Asset Caption
Kevin M. Cockroft, MD, FAANS
Keywords
prospective clinical trials
neurovascular surgery
evidence-based medicine
levels of evidence
randomized controlled trials
strength of recommendations