Science of Neurosurgical Practice
The Science of Practice - A New Algorithm for EBM
Video Transcription
What I'd like to do tonight is to give a presentation on a new algorithm for evidence-based medicine in neurosurgery. And this isn't my new algorithm. What I'm hoping is that by getting residents involved early on, by getting people enthused and interested about evidence-based medicine, biostatistics, trial design, that we can really improve how we do evidence-based medicine in neurosurgery, that we can make it better than it is right now. And I'm just so happy that the faculty and the residents showed up for this, because I think it is going to be very important for our future. What I hope to do with this is cover a number of the topics that we discussed today, maybe expand on a few of the things a little bit, and bring in a few concepts that we're going to talk more about over the next day and a half. So sorry to interrupt your dinner. I would suggest that you drink heavily. This talk gets much better the more you drink. So this is my disclosure slide that shows that I'm adequately bought and paid for. None of these represent any conflict of interest with this topic. I did put the NeuroPoint Alliance in red, because I will be discussing the NeuroPoint Alliance tonight and how I think that fits in with this new algorithm for evidence-based medicine. So I'm going to start off with a question, and that is, why can't we reach consensus on how to treat patients? And I think anybody who's honest will say, we haven't reached consensus. Here are just a few recent issues. General neurosurgery, the Barrow Ruptured Aneurysm Trial, fighting tooth and nail with the ISAT study. Carotid artery stenting versus carotid endarterectomy, multiple randomized controlled trials. We're still arguing about that. Stereotactic radiosurgery plus whole brain radiation therapy versus stereotactic radiosurgery alone for brain metastases. If you look at just the treatment of brain metastases, there are all kinds of options available with people advocating for one treatment paradigm or another. So the point is that we've been doing evidence-based medicine for quite a while now. We've done a lot of RCTs. We've generated guidelines. And for many of our neurosurgical disease processes, many of our neurosurgical treatments, we haven't reached consensus. And I don't think it's just because we're recalcitrant, that it's just because neurosurgeons are unwilling to accept the evidence. I think we have to figure out why we can't reach consensus and then try to develop something that will help. So let's talk a little bit about the present evidence-based medicine algorithm. You've heard a lot about it today. You're going to hear more about it in the next day and a half. But essentially, it's an algorithm for clinical decision making. And it's an algorithm that says we're going to use clinical research published in peer-reviewed journals as the primary, or, for the real strong advocates, the only source of evidence. We're then going to define a hierarchy of the strength of evidence based on the methodology of the data collection and analysis. And randomized controlled trials are appropriately accepted as the gold standard in that arena. We're then going to derive standards, guidelines, and options from a methodologically rigorous analysis of this evidence. And then the latest twist on this, and this is going to become more and more important for us: we're going to determine the quality of our health care by how closely we adhere to these established practice parameters.
So I think in a nutshell, that's the present evidence-based medicine algorithm. And the rationale for this is really compelling. We say if we apply this algorithm, we'll improve patient care by fostering better clinical decision making. And that's because the decisions will be made on the basis of scientifically validated data rather than on intuition, or this is how we did it where I trained, or other non-verifiable factors. And if we make better clinical decisions, that should result in improved outcomes. And the final goal is the improved outcomes. But I think we fall short of this. The evidence-based medicine algorithm we have now is really compelling. But if it worked as well as it should based on what we've just gone over, we would have consensus on how to treat a number of things that we don't have consensus on now. So I think that the present evidence-based medicine algorithm isn't right or wrong. I think it's just inadequate. And I'm going to try to make the case for that as to how we can improve this going forward. So why are RCTs considered a gold standard? Steve Haines gave a very nice talk today about this. And I'm just going to recap it quickly, that all clinical studies are prone to error from three things: bias, confounding, and chance. My residents get that drilled into them week after week, that the three things that are going to give you problems in clinical studies are always bias, confounding, and chance. And we have accepted the RCT as a gold standard because the ideal RCT minimizes all of those errors. Just as a brief example, because it's prospective, it decreases information and recall bias. Because it's randomized and controlled, it will equally distribute the confounders. And as Steve pointed out, that can be confounders we don't even know about. It will eliminate internal selection bias if it's appropriately randomized. And then we can use the kind of frequentist statistics that we've talked about today to calculate things like the p-value and power calculations and confidence intervals. And we can really, with some precision, determine the likelihood of a type 1 and a type 2 error. So all of those things we've done very well. But one of the problems is that, you know, all RCTs are not created equal. And one of the points I'm going to try to make is that we've put an awful lot of effort into looking at the data and refining the precision of our data and understanding type 1 and type 2 errors. And we can do an awful lot with that using the statistical methods you'll learn here. But we don't put much effort into determining the accuracy of the data, how much any study reflects the real-world situation. And this is really challenging because in a diagnostic test, for instance, you can tell about the accuracy of a diagnostic test if you have a gold standard. You say, how close are the results of this diagnostic test to the gold standard? When we're talking about clinical research, we often don't know what the right answer is. We don't have that gold standard to which we compare our data. But I think there are ways to get closer to approximating that, which we'll talk about at the end of the presentation. So at any rate, the ideal randomized controlled trial. And when we talk about a gold standard, it's not really that the RCT is the gold standard; it's the ideal RCT that is the gold standard. Because the ideal RCT is almost irrefutable, and it needs three essential components. One is concurrent comparison to eliminate temporal bias.
So you randomize people to one treatment or another, and they move forward at the same time. You need unbiased observation of clear and clinically meaningful endpoints to assure that statistical significance equals clinical significance. And then you need to randomize a representative population, one that represents the target population that you're worried about, of adequate size in order to equally distribute confounders and reduce the chance for sampling errors. So those are the three essential components. And one way to sort of keep this in your head, instead of trying to remember all these steps, is, you know, the ideal RCT would be something like a double-blind study of aspirin versus placebo following myocardial infarction with death as the outcome of interest. Because it's double-blind, you get rid of all of the bias, you randomize people to aspirin or placebo, you move ahead at the same time, there's no temporal bias. Dead is a very clear and clinically meaningful endpoint that isn't going to be difficult to determine. And if you do this study in an adequate and representative population, and you find out that aspirin is significantly better than placebo for these patients in preventing death, you can take that to the bank, that this is basically irrefutable evidence. But we do neurosurgery, and there are real problems with surgical RCTs. It doesn't mean that RCTs are bad, it's just there are real problems and we should face up to them. If we don't, then we're going to kid ourselves. The first one is concurrent comparison. And we talked earlier today about intention to treat analysis and crossovers. And if you want to have a randomized trial, in order to preserve the benefits of randomization, you need to analyze the patients in their assigned groups. You have to do an intention to treat analysis. If you do an as-treated analysis, that may be valuable information, but it's not a randomized trial. So the intention to treat analysis is essential if you want to have the value of an RCT. And as Steve pointed out, that kind of makes your head spin if surgical patients never receive surgery and non-surgical patients do. And there is another problem with surgical trials and crossovers, and that is that the crossover periods are often asymmetrical. Let me explain that. So let's take something like the SPORT trial, and we'll talk about this in a minute, but let's say the lumbar disc herniation sub-study in the SPORT trial. So the patient comes in and has a herniated disc, and they're deemed to be a surgical candidate. At that point, they're randomized, and if they're randomized to non-surgical treatment, they can have physical therapy, chiropractic, injections, all the things, and that can go on for months. If then they feel they're not doing well, they can cross over to surgery, and if they cross over to surgery at three months, and then at a year they're doing well, legitimately that good outcome at one year is attributed to the medical group, because it's an intention to treat analysis. That same patient who randomizes to surgery is going to be scheduled for surgery within a week or 10 days. They have a much shorter period of time in which to change their mind and decide not to have surgery. So not only do you have the problem of people who aren't treated as assigned, but the crossover periods are asymmetrical. We have to ask ourselves how that affects the outcome and how much weight we can put in those sorts of studies.
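A minimal sketch in Python, with entirely hypothetical outcome and crossover rates (not taken from SPORT or any other trial), of the dilution described here: when patients assigned to non-operative care cross over to surgery, the intention-to-treat comparison shrinks while an as-treated regrouping does not.

```python
# Hypothetical illustration: how crossovers from the non-operative arm to
# surgery dilute an intention-to-treat (ITT) comparison of a binary
# "good outcome" endpoint, while an as-treated regrouping does not.

def compare_analyses(n_per_arm, p_good_surgery, p_good_medical, crossover_rate):
    """Return expected good-outcome rates under ITT and as-treated analyses."""
    n_cross = round(n_per_arm * crossover_rate)   # assigned medical, had surgery
    n_stay = n_per_arm - n_cross                  # assigned medical, stayed medical

    # Expected numbers of good outcomes in each assigned arm.
    good_in_medical_arm = n_stay * p_good_medical + n_cross * p_good_surgery
    good_in_surgical_arm = n_per_arm * p_good_surgery   # assume full adherence

    itt = {
        "surgery": good_in_surgical_arm / n_per_arm,
        "medical": good_in_medical_arm / n_per_arm,      # crossovers still counted here
    }
    as_treated = {
        "surgery": (good_in_surgical_arm + n_cross * p_good_surgery) / (n_per_arm + n_cross),
        "medical": (n_stay * p_good_medical) / n_stay,
    }
    return itt, as_treated


if __name__ == "__main__":
    # Hypothetical: surgery yields 80% good outcomes, medical care 50%,
    # and 45% of the medical arm crosses over to surgery.
    itt, as_treated = compare_analyses(250, 0.80, 0.50, 0.45)
    print(f"ITT:        surgery {itt['surgery']:.0%} vs medical {itt['medical']:.0%}")
    print(f"As-treated: surgery {as_treated['surgery']:.0%} vs medical {as_treated['medical']:.0%}")
```

Under these made-up assumptions the as-treated gap stays at 30 percentage points while the intention-to-treat gap shrinks to about 17; that is the direction of the effect being described, and it says nothing about the actual SPORT results.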
And there are statistical methods to help ameliorate this problem, but they don't eliminate it. And I think realistically, if you get enough crossovers, the honest thing to say is, you know, this study, we spent the money, we didn't learn much from it. I think that's what happened with SPORT. Here's the SPORT trial. This was the sub-study in lumbar disc herniation published in JAMA in 2006. Of 256 patients who were assigned to non-operative care, so they randomized to non-operative care, 30% had surgery at three months, 45% had surgery within two years. If they did well at their follow-up and they randomized to medical treatment, even if they had surgery, that was attributed to the medical arm. If you looked at the as-treated analysis, again, this isn't a randomized comparison if it's an as-treated analysis, but the as-treated analyses demonstrated a strong and statistically significant advantage for surgery at all follow-up times. And the intention to treat analysis, on the other hand, revealed only a slightly better outcome for the group assigned to surgery at each time point, and it didn't reach statistical significance. So based on the intention to treat analysis, outcomes with surgery were not statistically significantly better than outcomes with non-surgical treatment. But if you looked at the as-treated group, there was a significant benefit for surgery at every time point. So how do you interpret that? Well, I would interpret that as saying what this trial really did was to say if someone comes in, they have symptoms and findings consistent with, you know, herniated lumbar disc, that you should probably treat those patients conservatively initially, and then if they don't get better, operate on them, which is exactly what we did before we spent $30 million on the trial to prove that that's what we should do. I just don't think that's a real good use of the funds. More importantly, what got published, and this is the New York Times assessment, was "Study Questions Need to Operate on Disc Injuries." I don't think that's what this study showed at all. I think the study showed pretty clearly that if you're looking at the initial assignment, it made perfect sense to say let's try something non-surgical, but if they're not doing well with that, surgery seems to have a real benefit in the as-treated groups. One of the quotes in this New York Times article was, everyone was hoping the study would show which was better. Everyone was surprised by the tremendous number of crossovers. In the end, the study could not provide definitive results on the best course of treatment because so many patients chose not to have the treatment to which they had been randomly assigned. Actually, we had published a critique of the design of the SPORT trial before it was done. When you do it afterwards, it always sounds like special pleading. You just didn't like the outcome, so you're arguing about it. Our critique was that, one, it's really hard to believe these patients will be representative. I don't think they were. Two, that there are going to be a horrendous number of crossovers, and you're not going to be able to interpret the data, which indeed was the case. How about essential component number two? That's unbiased observation of clear endpoints. In surgical trials, masking is difficult, sometimes impossible. Endpoints may be vague. It's not as clear as dead or alive, and that can introduce bias for or against surgery. I want to make a point, and that is that unmasked trials are biased trials.
You can have someone other than the surgeon evaluate the outcomes, but if that evaluator is not blinded, not masked, then you simply substitute somebody else's bias for the surgeon's bias. If you can mask the treatment from the observer, you can get an unbiased trial. But if you can't, an unmasked trial is a biased trial. And remember, one of the reasons that an RCT is a gold standard is it helps get rid of bias. And so if it's unmasked, it doesn't do that in the way that a masked trial can. So let's talk about a trial that Neil brought up, and that's the ISAT study. Now, I don't want this to be interpreted as a rant against endovascular treatment. You can ask our faculty and residents. The vast majority of the aneurysms at our place are treated endovascularly. Actually, I think that's the right thing to do, particularly in the ruptured aneurysms. If there's a good endovascular treatment, I think that's the better option. All I'm saying is I think that the ISAT was not a very good study, and yet it had a profound effect on how we treat patients that we may yet come to regret. But anyway, the ISAT study, published in Lancet, 2002, was an open study, not blinded. A randomized controlled trial, clipping versus coiling for ruptured aneurysms. Primary endpoint was functional health status at one year, and it was measured by the modified Rankin scale, mRS, sent as a postal questionnaire. So this went out to the patients and the patients' families and was sent back in. And the endpoints are as follows: 23.7 percent of patients treated with coiling had mRS scores of three to six at one year, and that's a bad outcome, versus 30.6 percent of patients treated with clipping who had an mRS score of three to six at one year. So there's an absolute risk reduction with coiling of a bad outcome of about 7 percent. Of course, when this was published, the people who actually make the catheters and coils and wanted to sell a lot more catheters and coils were poised, and it came out not as the absolute risk reduction, but as a 20-plus percent relative risk reduction with coiling. This is clearly the better treatment. Everybody should get the coils. And I have two questions I want you to really think about as you go through this study. Was the functional health status reporting reliable, and was it biased in favor of coiling? Those are very important questions. This, more so in Europe than in this country, really overnight changed the treatment of patients with ruptured aneurysms. Did it deserve that kind of credibility? Well, let's look at whether ISAT was reliable. There's a very good study in Stroke 2007 looking at outcomes, validity, and reliability of the modified Rankin scale and its implications for stroke clinical trials. And this was a nice paper, a literature review and synthesis, and what they found was that the inter-rater reliability with modified Rankin is poor to moderate. It gets to moderate with structured interviews. So if you actually sit down with a patient or a family member and say, what can this person do on a day-to-day basis, then you can assign that person to one of the modified Rankin scale numbers with moderate reliability. It's poor if you just send out a postal questionnaire, ask them to fill it out, and send it back in. That wasn't discussed in regard to ISAT.
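Before continuing with the reliability question, a quick arithmetic check on the absolute versus relative risk reduction framing quoted above, sketched in Python from the ISAT endpoints as stated; the number needed to treat is added only as a standard companion figure, not a number cited in the talk.

```python
# ISAT endpoints as quoted above: proportion of patients with mRS 3-6 at one year.
p_bad_clipping = 0.306
p_bad_coiling = 0.237

arr = p_bad_clipping - p_bad_coiling   # absolute risk reduction with coiling
rrr = arr / p_bad_clipping             # relative risk reduction with coiling
nnt = 1 / arr                          # number needed to treat (standard companion measure)

print(f"Absolute risk reduction: {arr:.1%}")   # ~6.9%, the "about 7 percent" above
print(f"Relative risk reduction: {rrr:.1%}")   # ~22.5%, the "20-plus percent" above
print(f"Number needed to treat:  {nnt:.0f}")   # roughly 14 patients coiled per bad outcome avoided
```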
Another study in Stroke 2009, again looking at the reliability of the modified Rankin scale in a systematic review, had the same assessment, that there was real uncertainty regarding this scale, its reliability, that inter-observer studies closest in design to large-scale clinical trials demonstrate a lot of variability. So to take these numbers as somehow, you know, revealed truth when we find that the variability is very high, at least calls this into question. But you could say, well, okay, so maybe inter-observer reliability isn't great, but there was a benefit of, you know, coiling over clipping. How can you explain that? Well, I think the other question is, was this biased? And this is an important question. This isn't something that we can say, well, you can't blind these studies very well, so we don't have to worry about it. This is an unblinded study. The docs knew what the patients had done. The endovascular specialists and the surgeons knew what the patients had done. The patients knew, and the patients' families knew. So this is a completely unblinded study, and if you look at the Cochrane Collaboration criteria for a high risk of bias, there are a couple of things. No blinding of outcome assessment, check for ISAT, and that the outcome may be influenced by the lack of blinding, check. Well, why would the outcome be influenced? Well, it was pretty clear in this study that the patients were given the option of the old treatment or the new treatment. We have a real belief in Western civilization that it's new and improved, that the new treatment is better. And there are a number of ways that you can show how, you know, that kind of expectation changes, you know, how you perceive the outcome. One of my favorites is the Ford-Mercedes comparisons, and that is if you take somebody, blindfold the person, and put them in a Mercedes or a Ford Taurus, but you don't tell them, take them on a ride, and ask them to grade the quality of the ride, they can't differentiate between the Mercedes and the Taurus. If you tell them they're in a Mercedes or a Taurus, the Mercedes always gets ranked significantly higher. Elias gave me another great example recently, and that's skilled violinists asked to judge the sound quality of a new violin or a Stradivarius. When they knew it was a Stradivarius, they always picked the Stradivarius. When they were blinded, there was no difference. As a matter of fact, the new violins actually had a bit better sound quality as judged by the same musicians. So to think that somehow we're not biased by our expectations, I think, is wrong. And there was a good article in the British Medical Journal in 2012, looking at bias in randomized clinical trials with binary outcomes, so it would fit this. And it was in trials where they had both blinded and non-blinded outcome assessors. And the conclusion is that non-blinded assessors judged the outcomes very differently than the blinded group did. So keep this in mind. If you're going to do a randomized controlled trial in surgery, do whatever you can to blind the assessors, because it makes a big difference. How about essential component number three, and that's randomization of an adequately sized and representative population? This is another problem, and it's a problem we have to pay attention to. The surgeon has an implicit contract with the patient to offer the best care available, the therapeutic imperative. So when you sit down with the patient, you want to do what you think is best for that patient.
And so if the surgeons don't believe that both treatment arms are equally efficacious, they don't have equipoise in regard to the treatments. And Dr. Barker has done some really wonderful work in assessing equipoise, seeing how we can improve equipoise in trials. But this equipoise issue is something that doesn't get enough attention. And what surgeons will often do is offer surgery to patients who are most likely to benefit and randomize those least likely to benefit. So let's go back to the SPORT trial. And I know this happened. My wife is a neurosurgeon. She was in a private practice outside of Dartmouth when I was at Dartmouth when the SPORT trial was being done. And it was very common for the primary care docs to have a patient with miserable leg pain, an MRI scan showed a free disc fragment, and they said, oh, this person really needs surgery. We'll refer them to the private practice group. And then the one who wasn't quite so clear-cut would go to the academic center and maybe get in this trial. I know we did the same thing within our group, where someone who came in with a clear-cut disc herniation, maybe a little bit of weakness in the foot, young, good surgical candidate, and a big free fragment of disc on the MR, you'd say, well, you know, I know if I operate on this guy, he'll be pain-free tomorrow. He'll go home tomorrow or today. He'll be back at work in a few weeks. So I'm going to tell this guy he needs surgery. The next patient comes in. He has a bulging disc, back pain, and leg pain. You say, eh, I'm not really sure if this guy is a good surgical candidate or not. Oh, I've got it. I'll put you in a randomized trial. Well, that's fine. But if you do that, then you don't have equipoise for that study. And you've taken your best surgical cases out before the trial ever randomizes anyone. So you can't then generalize those results to the general population. The ISUIA study, the International Study of Unruptured Intracranial Aneurysms, had a proposal to the NINDS a few years ago. They wanted to do a randomized study of the treatment of unruptured intracranial aneurysms. And it ranged from like, you know, 7 millimeters to 15 millimeters over a wide age range. And their study was going to be as follows. Observation, endovascular treatment, or surgery. And they were going to do their outcomes at one and three years. And I was asked to comment on that, as were a number of other people. And the response was, don't bother to do the study. I can tell you what the results of that study will be. In three years, it will unquestionably be that observation is your best option. But if you have a 40-year-old patient who, you know, has a 40-year lifespan ahead of them, and you do something like a decision analysis using the best data we have, you find that the crossover for treatment and observation is at somewhere around 12 years. Running through the lifetime of that patient, there's a clear benefit for surgical intervention. But at one year or three years, unquestionably, observation will be the best. Because, you know, the annual risk of rupture is low enough that you will not get up to the complication rate with any of your interventions. Now think of what that does, though. Right now, when we see unruptured aneurysms, we take into account a lot of different things that I think increase the risk of rupture. We may or may not be able to prove it, but aneurysm size, location, irregularity, family history, smoking, et cetera, et cetera.
I take into account a whole range of things in regard to the surgical risk of that particular patient. And then I would pick out a pretty small fraction of the unruptured aneurysms where I say, I think we should really treat this one, and the others get surveillance. Do that randomized study with outcomes at one and three years, and guaranteed, it will be that observation is the best thing to do. And now, even those patients that probably really needed surgery, so, you know, the 35-year-old woman with the 12-millimeter irregular basilar apex aneurysm, positive family history, that person won't be able to get treated because you'll have level one evidence that proves that observation is the better option. We need to be very careful about that. Anyway, if surgeons don't believe that both treatment arms are equally efficacious, if there isn't equipoise, then they will offer surgery to patients most likely to benefit. They'll randomize those least likely to benefit. And I would say that randomizing patients unlikely to benefit from treatment is really analogous to performing diagnostic tests on patients unlikely to have a disease. And this takes us into, you know, the Bayesian analysis that was brought up earlier today. Thomas Bayes, actually one of my intellectual heroes, was an English theologian and mathematician. And what he really investigated was the probability of an event given a condition and how that related to the probability of a condition given an event. In other words, if you have a cold, what's the likelihood that you're going to sneeze and how does that relate to, if you hear someone sneeze, what's the likelihood that that person has a cold? And where Bayes' theorem has been applied most often in medicine is in diagnostic tests. And one of the things, and I think Elias brought this up today in the questioning, one of the things that Bayes' theorem demonstrates is that to know the predictive accuracy of a test result, you really need to know not only the sensitivity and specificity, which we're going over so well today, but you also need to know the prior probability of disease in the population being tested. So let's do another one of our sort of thought experiments here. So now we're going to do a Bayesian inference on a diagnostic test. And you can think of this test as a carotid duplex, but the test specifics don't matter. But for the purpose of this thought experiment, our diagnostic test is going to be for greater than 60% carotid stenosis. And we're going to say our test has a specificity of 0.95 and a sensitivity of 0.8, and you'll soon see why those were chosen. Now we're going to give this test to two groups of 1,000 patients each. One is a high-risk group, and let's say these are elderly smokers. They've had repetitive TIAs in the same vascular territory. So let's say the prior probability of these people having carotid stenosis is 0.9, a 90% chance that they have a carotid stenosis. The second low-risk group, young asymptomatic athletes, say 1% of those people have a healed dissection and have carotid stenosis. Now when you talk about the priors, you can think of this in two ways. One is, if you know with certainty what the prevalence of the disease process in the population is, then that's really your gold standard. So if you know for sure that this high-risk group has 90% carotid stenosis, then you can evaluate your test against that gold standard. But another way to look at it is, what do you think?
What's your best guess before you do the test as to the likelihood of this degree of carotid stenosis? So in the high-risk group, you say, well, I think there's a 90% chance. In the low-risk group, I think there's a 1% chance. Both of those you need to keep in mind because they're used a little bit differently. So let's look at this. These are the results in 1,000 high-risk patients. Sensitivity 0.8, specificity 0.95, prior probability 0.9. So if you look at the right-hand column of the 1,000 patients in this high-risk group, 900 of them actually have carotid stenosis greater than 60%. The sensitivity of our test is 0.8. So we correctly identify 720 out of the 900. So that's a positive test in people who actually have the disease, 720 true positives. A hundred of these patients don't have stenosis. Our test has a specificity of 0.95. So we correctly identify 95 out of that hundred as not having stenosis. But we have five false positives in that group. So at the end of screening these patients, we have 725 positive tests. Of that 725, 720 are true positives. So if you get a positive test in this patient population, the chance that that person actually has carotid stenosis is 99 plus percent. But you suspected it was the case at 90% before you did the test. So what this test has done is improved your accuracy by about 9%. So that's the high-risk group. Now what happens if you use exactly the same test in the low-risk group? So we haven't changed anything. Sensitivity 0.8, specificity 0.95, except this group of patients only has 1% real stenosis. So of the 1,000 patients, 10 actually have the stenosis. We pick out 80% of those. So we have eight true positives. 990 do not have stenosis. And even with the specificity of 0.95, you still get 49 positive tests that are false positives. Now you have 57 positive tests, but for predictive accuracy, only eight of those 57 are true positives. So the chance in this group that that person really has carotid stenosis is 14%. Same test, but you apply it in the wrong population. You get a wildly different predictive accuracy. Everyone clear on that? That's one of the reasons why it makes no sense to go to the mall and have these guys do your duplex studies. You're always going to get more false positives than true positives. So I think that's important from the standpoint of diagnostic tests, but I'm going to try and make the case that there is really a very close analogy between diagnostic tests and clinical trials. So in a diagnostic test, we're looking for the absence of disease, that's a negative test result, or the presence of disease, a positive test result. We need to know the sensitivity and the specificity of that test. And we've just seen how the prior probability of disease is going to profoundly affect the predictive value and predictive accuracy of the test and our belief in the results. So in that high-risk group, where we already thought that that person likely had carotid stenosis, if we get a positive test, now we're almost certain. In that low-risk group, even if we get a positive test, we're going to say, well, I still don't think this is right, I don't think this is correct, and the numbers bear that out as a reasonable conclusion. So now let's look at clinical trials. We're looking for the absence of a treatment effect, a negative study result, or the presence of a treatment effect, a positive study result. The sensitivity of a diagnostic test is exactly the same as the power in a clinical trial.
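Here is a minimal Python sketch of the Bayes calculation just worked through, reproducing the roughly 99 percent and 14 percent positive predictive values. The last line applies the same arithmetic to the trial analogy developed next in the talk, treating power 0.8 as sensitivity and a p-value threshold of 0.05 as one minus specificity; the prior probability of 0.10 for a real treatment effect is purely a hypothetical number chosen for illustration.

```python
# Bayes' theorem for the predictive value of a positive result.

def positive_predictive_value(sensitivity, specificity, prior):
    """P(condition is present | positive result)."""
    true_positives = sensitivity * prior
    false_positives = (1 - specificity) * (1 - prior)
    return true_positives / (true_positives + false_positives)

# Diagnostic test from the thought experiment: sensitivity 0.8, specificity 0.95.
print(f"High-risk group (prior 0.90): {positive_predictive_value(0.80, 0.95, 0.90):.1%}")  # ~99.3%
print(f"Low-risk group  (prior 0.01): {positive_predictive_value(0.80, 0.95, 0.01):.1%}")  # ~13.9%

# The analogy to a clinical trial: power 0.8 plays the role of sensitivity and
# 1 - 0.05 (the p-value threshold) plays the role of specificity. The prior
# probability of a real treatment effect used below is hypothetical.
print(f"'Positive' trial (prior 0.10): {positive_predictive_value(0.80, 0.95, 0.10):.1%}")  # ~64%
```

On those assumptions, a statistically positive trial run in a setting where a real effect was unlikely to begin with is wrong roughly a third of the time, which is the point of the analogy.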
If there's a real difference there, how likely am I to find it? And that's why I picked a sensitivity of 0.8, because we often pick a power of 0.8 for clinical trials. Furthermore, the specificity of a diagnostic test is exactly analogous to 1 minus the p-value. So if you have a specificity of 0.95 in a diagnostic test, it's like looking for a p-value of 0.05 or less in your clinical trial. And I would submit that a prior probability of a treatment effect will profoundly affect the predictive value, the accuracy of the study, and our belief in the results the same way it does for diagnostic tests. And that's, I think, an important point. So if there is this precise analogy between diagnostic tests and clinical trials, and if surgeons operate on those patients most likely to benefit from surgery and randomize patients least likely to benefit, and if we're not very careful about equipoise, that will happen, then the likelihood of a randomized clinical trial demonstrating a benefit from surgery is reduced. So we have to pay attention to the equipoise issue. More importantly, perhaps, I think we may be able to use a Bayesian analysis approach to determine the accuracy of clinical trials in the same way that we now use it to determine the accuracy of diagnostic tests. And that's the rest of the talk. I'm not going to belabor this, but a couple of things to keep in mind with the randomized trials, that you can get statistical significance without clinical significance. We all understand that. My personal favorite is Gliadel. I don't know if there are any Gliadel advocates in the audience, but you can, you know, have people with Gliadel live, you know, three weeks longer than people who don't get Gliadel, and that's statistically significant. You can randomize them. And then at the end of the whole thing, you have to ask yourself, is that three weeks, most of which is spent in the hospital, of clinical significance for the investment? You have to make that call. Another problem, and this is a huge problem, is that randomized controlled trial surgeons may not be representative. If you go back to the ideal RCT, aspirin versus placebo, it doesn't matter if it's the intern on the service or the chief of cardiology who puts that order in the chart. But it matters a lot if it's the intern or the chief of neurosurgery who's doing the operation. I mean, we know that makes a big difference. In a randomized controlled trial, the surgical risk is a given. Now, if you take that result and your surgical risk is considerably different from the randomized controlled trial, to think that those results apply to your practice is wrong. And I think the carotid studies are the best example of that. One of the things that Steve pointed out earlier today is we often, you know, say if the study kind of comes out our way, we think, oh, yeah, that's a great study. And if it doesn't, then we say, oh, I don't think that study's worth anything. Well, this is going to refute that. I think, you know, the randomized trials of carotid surgery have resulted in a whole bunch of people getting surgery who should never have been operated on, for two reasons. One is if you look at, let's take the ACAS study, that is, the asymptomatic study published in the early 90s. If you look at the Medicare database, and most patients in that database are asymptomatic patients, and you look at the mortality in that group of patients and compare that to the mortality in ACAS, it's about 18 times higher. So we did ACAS.
We tortured the data until it confessed. We made believe that there was a benefit of surgery versus medical management, but only in the hands of surgeons who had very, very low complication rates. We published that data. Everybody now starts operating on these patients, and they're operating on them with 18 times the mortality of the study surgeons. Were more people helped or hurt by that randomized trial? I guarantee more people were hurt. Not that we wanted it to be that way, but you can't, if you're a golfer, you can't quote Tiger Woods' handicap. You have to know what your complication rate is to make sure that that makes sense in applying that treatment in a discipline where surgical skill varies widely. We've talked about some of the other stuff. One of the technology issues is, you know, as the technology changes, you know, the RCTs are called into question. I mean, that's happening with the carotid studies now. If you look at the medical management of carotid stenosis, the stroke rates with good medical management in asymptomatic carotid stenosis are so low, it is hard to imagine how any surgical intervention is going to improve on that. That was not the case when the studies were done. Now, one option would be to simply repeat the studies with the modern medical management, and that is being considered. But here's my point, is we can't reach consensus because we disagree on how accurate the RCTs are. It's not about, you know, P values or confidence intervals or any of that. We have that down and nobody argues about those numbers. We argue about the accuracy of these studies in reflecting the patients we see. And we've come to agree on standard procedures to interrogate the precision of clinical trial data. Much of what we're going to hear in this two and a half days is to refine the precision of the data, but we really haven't agreed on any such mechanism for accepting the accuracy of the clinical trials. Does that make sense? And that's where I think we have to go if we want to improve our evidence-based medicine algorithm. It's not a rant against randomized trials. It's not that this is better than that. It's we need to do all of the above, but we really have to think about the accuracy and not just the precision. So what do we want to accomplish? Now RCTs evaluate the difference between populations. You have a population treated with A, a population treated with B. Overall, the population treated with B does better. It reaches statistical significance and we all understand that. And we've got a lot of expertise in refining the precision of that analysis. But no matter how much we refine that precision, it won't increase our confidence in the validity of the study, in its accuracy. That's a different question. And part of the issue, of course, is the overlap. That there are a bunch of patients who did better with the inferior treatment. So at a population level, we can determine a better treatment. At the individual patient level, I'm not sure we can. So this leads to the science of practice algorithm. This is a little different than the evidence-based medicine algorithm we've sort of grown up with. The science of practice algorithm is habitual and systematic collection of data. It should be inseparable from practice. And I think the best way to do it is via audited registries. And this is not to replace randomized trials. It's not to replace cohort studies. It's not to replace anything. It's in addition to all of this work. 
But the idea behind this is that the data are collected to allow analysis of patient characteristics, processes of care, and outcomes. And these data will then generate new knowledge about the effectiveness of care and about operative indications. If we can actually look at a pattern to say, if a patient has these characteristics, this process of care, the chance of a good outcome in a clinically meaningful way is 90%. With a different set of characteristics, the same process of care, the chance of a good outcome is 10%. That's a huge difference. We need to be able to develop those predictors. Some of the work that Tony Asher's done with N2QOD is, I think, going to be really kind of revolutionary, and I won't steal your thunder, Tony, but the early analysis of some of the spine surgery says that there are 15 to 20% of patients who aren't doing well following their surgery. They didn't benefit from the surgery. If we could pick those patients out, if we could change our good outcomes from 85% to 99%, we would do more for quality improvement in neurosurgery than all the times you wash your hands and give people, you know, DVT prophylaxis and all the other stuff that we do because that's supposed to improve quality. I mean, if we can refine our operative indications and really offer surgery only to those patients who benefit, that's a huge plus for us. We can take this knowledge and improve the quality of care. And then the other thing that's nice about this approach is it really is very good for innovation. One of the things I'm terribly worried about is we're being driven toward protocol-driven care, where we say this is the best way to do it. Now whether or not you're a good doctor is going to be determined by how closely you adhere to that protocol. And if you don't, if you do something outside the protocol, then you get dinged for it. And I understand all the reasons for that. But if there's one way to kill innovation, it's to say we know all the answers. This is the right way to do it. This is what you have to do. This science of practice approach wouldn't do that. Because if you do have a new innovation, a new treatment option, a new way of selecting patients, add it to the registry, see if it makes a difference down the road. So just to look at sort of a side-by-side comparison, evidence-based medicine, really perfect for population-level studies. Science of practice is really geared more to the individual. Null hypothesis testing is the basis for our evidence-based medicine approach. Bayesian inference may have a lot more value in the science of practice. RCTs, cohort studies, all those things we talked about are, you know, at the top of the list in evidence-based medicine for very good reasons. RCTs have a real place in the science of practice. Evidence-based medicine is a discontinuous process. So we're going to do a cohort study or an RCT. We start the study. We finish the study. We analyze the data. We publish the data. And then, if it's a good study, that becomes level one evidence until the next study. The registries are an iterative process. There really is no start and stop. The data are collected continuously. Evidence-based medicine, surgical skill is uniform. You know, it's a given. Surgical skill is a variable in a registry. You compare your outcomes to those of other surgeons. Evidence-based medicine, I think, is very nicely set up to do comparative effectiveness. You know, if we're going to compare two different treatments, this is a good way to do it.
For a personalized medicine approach, not genetics, but actually being able to pick out the best treatment option for an individual, I think there's a lot to be said for the science of practice. Evidence-based medicine is prescriptive and proscriptive. It says, you should do this. You should not do that. And the science of practice really encourages innovation. Try something different, see if it has an effect on the quality of care. And then, right now, with the system we have, we're determining quality by compliance. You're a good doctor based on how closely you adhere to the protocol. In the science of practice, it's really your outcomes that determine the quality. So there are differences, and I'm not saying one should be done instead of the other. I think we need to do both. If we really want to improve the quality of care for our patients, we need both of these to come together. So I want to talk for just a minute, and Tony is going to talk a lot more about this tomorrow, about the power of patient care registries. They're cost-effective, they're easily scaled, they do document care in real-world environments. Fred mentioned some things like propensity analysis, and there are some other biostatistical tools that can be used for pseudo-randomized studies. Now you do that by putting patients into quintiles of risk. We know a lot about what factors lead to a good or bad outcome, and so you can match patients with treatment A versus treatment B for those risk factors and do a propensity analysis. What Steve mentioned is what about those confounders that you don't know about that may be affecting the outcome? This doesn't help that a bit. The only way to address those is randomization. So this isn't, I mean, nothing's perfect, and that's why we need to combine things. So there are real problems with this algorithm as well. Data collection, analysis, and feedback infrastructure is needed. We need to set up the basis to collect this data, analyze it, and feed it back to the individuals. I think registries, to be of any value, need to be risk-adjusted, and they need to be audited. If people are only going to put their good outcomes in a registry, then it's a meaningless exercise, and to have confidence in the registry, it needs to be audited. There must be a commitment to continuous data collection. This isn't something where you say, oh, this is really fun, for a few months and then stop. And as I mentioned before, propensity analysis can't account for unrecognized confounders. Finally, if we want to do comparative effectiveness research with this approach, then we do need interdisciplinary registries, so that you look at medical versus surgical treatment for a given problem, for instance. And this is where the NeuroPoint Alliance has come in. I've put probably 15 years of my life into trying to get this up and running. Tony Asher has made a whole career out of getting the N2QOD to the point where it is. I think you put in about 40 hours a week on that, and then the other 80 hours a week in your surgical practice. But the NeuroPoint Alliance has as its mission that we want to facilitate this practice science for neurosurgeons. We want to be able to collect both clinical and economic data using an internet-based technology, and we can use that then for a whole bunch of things that are listed here. And we've actually made, I think, a lot of progress toward achieving that. Finally, and I'm coming to the end, I promise, why can't we reach consensus?
And I think this comes down to a frequentist versus a Bayesian approach to how we understand what the precision and the accuracy of our data is. Frequentist inference, and most of the things we've talked about so far have been frequentist statistics, views the locus of uncertainty as the observations. How precise can we get the data? How likely is it that this is just a sampling error? We interrogate the data because we view that as the locus of uncertainty. Does that make sense to everybody? I mean, that's what we've been doing so far. And we do all these tests to evaluate the precision of the data. And as we collect more data, we get a much better chance of rejecting the null hypothesis. You know, if you have huge numbers in a randomized study, the chance of, you know, finding some statistically significant difference is much higher than if you have small numbers. Everything else being equal. Bayesian inference views the locus of uncertainty as the observer. We're not questioning the data. We're questioning what you believe. That's where the uncertainty is. And it evaluates the perceived accuracy of the findings. Now, if we had God's view, if we knew the truth, then we could evaluate all of our studies as to how far from accurate they are. But we don't. But with a Bayesian approach, what happens is that as you gather more data, whether it be from randomized trials, registries, or whatever, widely varying initial beliefs converge. We're not testing precision here, we're testing beliefs: you know, one person started out saying, I think there's a 20% chance this works. The other person says, I think there's an 80% chance this works. As you gather more data, those beliefs converge in regard to both the likelihood and the magnitude of a treatment effect. That's the value of a Bayesian analysis. And a combination of both a frequentist analysis to interrogate the data, to look at the precision, and a Bayesian analysis to interrogate beliefs I think would really help us reach consensus. And I think it's within our power to do that. But it's hard. You've seen the little targets. These are darts instead of bullet holes, so it's less violent. But, you know, Fred has talked about something that's accurate but not precise. So if you look at all of those darts around the target, if we knew what the real value was, if we have a target to shoot for, those are all around the target. But they're in every quadrant. You can be precise but not accurate. So you can have your data looking very, very nice, very nice precise data, and you can refine that. But if there's some bias in the study, if the population that you're evaluating doesn't match the population in your clinic, et cetera, et cetera, no matter how much you refine the precision, it won't help. You're not going to believe the results of that study. And what we're shooting for is accurate and precise. And I think accurate and precise needs frequentist and Bayesian analysis. And another way of demonstrating that is just the graph. So Bayesian evidence-based medicine, science of practice evidence-based medicine, however we'd like to phrase it, would combine a frequentist and a Bayesian analysis of clinical research data. I think we need a hierarchy of evidence that addresses both precision and accuracy. And to do that, we need to stratify the evidence, not only to say that, you know, RCTs are a higher level of evidence than, you know, a cohort study, which is higher than a case series, which is higher than a case report. I think we need to stratify the RCTs.
And some of the things, like the CONSORT criteria that Steve was talking about, are on the way to doing that. But we need to look at the RCTs and say, what's the likelihood of bias in this study? How generalizable are these patients? Was there equipoise for treatment? Were objective and clinically meaningful endpoints chosen? And so we can begin to stratify the randomized trials. But probably more importantly, I think we can also stratify our non-randomized data along the same lines. And once we do this, if we can combine both of these, we'll have a lot more data to work with. And I think we will be able to refine both the precision and the accuracy of our clinical data and maybe actually reach consensus on what makes sense for neurosurgical treatment. So, in summary, I think the present evidence-based medicine algorithm is clumsy for QI efforts. It hasn't allowed us to reach consensus in many areas of neurosurgery, despite doing this for a few decades now. This Bayesian evidence-based medicine or science of practice evidence-based medicine may help us improve the quality of neurosurgical care and gain consensus. But it's the young people like you who can get enthused and interested in this that can really make that happen. With the NeuroPoint Alliance working with the AANS, the Board, the Congress, the Senior Society, and other neurosurgical organizations, I think there's a real commitment from organized neurosurgery now to make this happen. And I think this is an exciting time. So, thank you for your attention.
Video Summary
In this video, the speaker discusses the need for an improved algorithm for evidence-based medicine in neurosurgery. They emphasize the importance of getting residents involved early on and getting people interested in evidence-based medicine, biostatistics, and trial design. The speaker presents the limitations of the current evidence-based medicine algorithm, including the field's persistent inability to reach consensus on how to treat patients. They suggest that the present algorithm is inadequate and propose a "science of practice" algorithm that incorporates both frequentist and Bayesian analysis to improve the accuracy and precision of clinical trial data. The speaker argues for the use of patient care registries to collect data on patient characteristics, processes of care, and outcomes. They highlight the need for risk-adjusted and audited registries to ensure the accuracy and quality of the data. The speaker concludes by discussing the power of patient care registries and their potential to improve the quality of care and facilitate innovation in neurosurgery. Overall, the speaker calls for a combination of evidence-based medicine and the science of practice approach to improve evidence-based medicine in neurosurgery and reach consensus on treatment approaches.
Asset Subtitle
Presented by Robert E. Harbaugh, MD, FAANS
Keywords
evidence-based medicine
neurosurgery
improved algorithm
residents
biostatistics
trial design
consensus
clinical trial data
patient care registries
innovation in neurosurgery