Beth Israel Deaconess Medical Center and Harvard Medical School recently released a study in which researchers built a deep learning algorithm capable of assessing, via image-based classification, whether a cluster of lymph node cells contains cancer, achieving a 92 percent success rate. While a 92 percent diagnostic success rate is obviously preferable to, say, a 90 percent one, in a field with stakes as high as healthcare's, the nature of the remaining mistakes matters.

For simplicity's sake, let's consider a similar (but fictional) algorithmically trained, 92 percent accurate diagnostic study covering exactly 10,000 patients. The researchers' deep learning algorithm would return a correct diagnosis in 9,200 cases and an incorrect diagnosis in 800 cases. Each of these 800 errors would fall into one of two categories: a false positive, in which the algorithm reported that a patient had cancer when in reality they did not, or a false negative, in which the algorithm reported that a patient was cancer-free when in reality they were not.

Needless to say, a return of 800 false negatives would have the potential to be far more catastrophic than a return of 800 false positives. A false positive causes a great deal of unnecessary stress, to be sure, but the error is likely to be spotted fairly quickly. A false negative, on the other hand, amounts to a missed diagnosis, raising the risk that a patient will fail to undergo potentially life-saving treatment in a timely manner.

As such, "accuracy" (how frequently a deep learning algorithm makes the correct assessment) is, in and of itself, an insufficient indicator of algorithmic success in the context of medical diagnosis.
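The 10,000-patient illustration above can be made concrete with a minimal sketch. The patient counts below are invented purely for illustration; they show two classifiers with identical accuracy but opposite error compositions:

```python
# Hypothetical numbers: two classifiers, both 92 percent accurate on
# 10,000 patients, but with very different error compositions.

def accuracy(tp, tn, fp, fn):
    """Fraction of all cases classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

# Classifier A: all 800 errors are false positives (no missed cancers).
a = dict(tp=1_000, tn=8_200, fp=800, fn=0)

# Classifier B: all 800 errors are false negatives (800 missed cancers).
b = dict(tp=200, tn=9_000, fp=0, fn=800)

print(accuracy(**a))  # 0.92
print(accuracy(**b))  # 0.92 -- same accuracy, radically different risk
```

Accuracy alone cannot distinguish the two: a single number averages away exactly the detail that matters here.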
When the outcome of an algorithm's errors matters with respect to real-world decision making (especially in a medical context), we must evaluate not only statistical accuracy but statistical "sensitivity" and "specificity" as well.

The critical insight provided by sensitivity and specificity

In a diagnostic context, sensitivity (also commonly referred to as "recall" or the "true positive rate") measures the percentage of sick patients who are correctly identified as having the disease in question. Specificity (the "true negative rate") measures the percentage of healthy patients who are correctly identified as being disease-free.

For the reasons discussed above, most medical screening tests are designed to be highly sensitive, that is, unlikely to deliver a false negative, with the knowledge that the patient will also have to undergo a confirmatory diagnostic test (which is usually highly specific). That said, different biases are built into different tests based on both what the test is screening for and what other tests will be conducted thereafter.

Research from Kidney International uses the detection of chronic kidney disease as an example of an instance in which a highly sensitive screening test is preferable. In detecting chronic kidney disease, an inexpensive dipstick test could be preferred as a screening test, allowing many individuals to be tested. In this test, it is important that all patients with chronic kidney disease have a positive test result (high sensitivity), whereas the number of patients with false-positive results (low specificity) is considered somewhat less important, as they would be quickly identified by a subsequent test.

Alternatively, a diagnostic test with lower sensitivity and specificity could be preferable when the subsequent test is either invasive or carries a high risk of complications.
As the report's authors suggest, performing renal arteriography to test for renal artery stenosis is an invasive diagnostic method with potential complications. In many cases it might be preferable to replace arteriography with Doppler testing, which has 89 percent sensitivity and 73 percent specificity. In this way, HCPs and researchers are almost always working to contextualize and balance the positive and negative outcomes of different diagnostic tests.

Why context matters

In the end, statistical sensitivity is hugely consequential for some algorithms and entirely inconsequential for others. If we're told that an image recognition algorithm pinpoints pictures of cats with 92 percent accuracy, the algorithm's sensitivity should be of little interest. It doesn't really matter whether the 8-in-100 mis-recognitions are cases of the algorithm identifying dogs as cats, overlooking cats in pictures that contain other items, or a mix of both. In this case, the cost of an incorrect prediction, while somewhat annoying, is trivial with respect to outcomes.

As a less trivial example, consider my field: marketing. In marketing, the ultimate arbiter of success is whether you achieve your stated KPIs. If, for example, a specific ad is served to a member of your target audience 85 percent of the time, you'll probably be satisfied. Figuring out why the ad was improperly served 15 percent of the time will help you refine future campaigns, and thus spend your budget more wisely down the line. Either way, a percentage of the ads were served to an inherently uninterested audience; which uninterested audience they reached doesn't make your wasted spend any worse or any better.

This, clearly, is not the case for deep learning algorithms deployed in a diagnostic setting.
A missed diagnosis can quite literally be a matter of life and death, placing a tremendous onus on healthcare stakeholders to carefully consider not just a diagnostic algorithm's accuracy but its sensitivity as well.

An algorithm that is 90 percent accurate but highly sensitive presents less risk in a healthcare environment than an algorithm that is 95 percent accurate but not sensitive at all. This will be essential to remember as algorithmic diagnostics slowly trickle to market over the coming years and decades, and, hopefully, it will be the decisive factor in determining which tools are exposed as high-risk novelties and which go on to make a significant positive impact on the way diagnosticians approach their craft.
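The trade-off just described can be sketched with hypothetical numbers (a cohort of 1,000 patients, 100 of them sick, with error compositions invented for illustration):

```python
# Hypothetical comparison: a 90-percent-accurate algorithm that misses no
# sick patients versus a 95-percent-accurate one that misses half of them.

def metrics(tp, tn, fp, fn):
    """Return (accuracy, sensitivity) for one confusion matrix."""
    total = tp + tn + fp + fn
    return (tp + tn) / total, tp / (tp + fn)

# Algorithm X: 90 percent accurate, catches every sick patient.
acc_x, sens_x = metrics(tp=100, tn=800, fp=100, fn=0)

# Algorithm Y: 95 percent accurate, misses half of the sick patients.
acc_y, sens_y = metrics(tp=50, tn=900, fp=0, fn=50)

print(acc_x, sens_x)  # 0.9 1.0
print(acc_y, sens_y)  # 0.95 0.5
```

By the accuracy column alone, Algorithm Y looks superior; by the sensitivity column, it misses 50 of 100 sick patients while Algorithm X misses none, which is precisely the distinction the headline accuracy figure hides.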