Solving Healthcare’s Big Data Analytics Security Conundrum
HIPAA understandably makes it hard for organizations to obtain personal health information and even harder to use that information for the purpose of data analysis. Empowering patients to own and share their own data -- and then assuring them that it's being properly de-identified -- can ease this process.
Big data holds much promise for healthcare. Analytics use cases — which focus on heady tasks such as giving physicians more information at the point of care, reducing hospital readmissions and better treating chronic diseases — continue to emerge, while vendors such as SAP and Oracle increasingly pitch their in-memory platforms as the answer to healthcare’s exceedingly complex problems.
Most of medicine’s data is unstructured, though. It exists largely in free-form physician notes fields in electronic health record (EHR) systems or, worse, in manila folders. On top of that, the complexities of interoperability and health information exchange make it difficult for healthcare organizations to share information, structured or otherwise.
There’s another, often overlooked wrinkle: Much of that data is personal health information strictly protected by the Health Insurance Portability and Accountability Act, which the HIPAA omnibus rule recently strengthened to bring PHI security into the 21st century.
This means tomorrow’s data scientists, not to mention today’s, must make the task of keeping patient data secure as much of a priority as actually analyzing that data in order to improve outcomes and reduce costs.
Go Straight to Patients Willing to Share
Under HIPAA, notes David Harlow, a healthcare attorney and consultant and founder of The Harlow Group LLC, any institution’s use of PHI for purposes other than treatment, payment or healthcare operations requires patient consent. This provision prevents organizations from using patient information in marketing or selling it to a third party, but it’s worth noting that “data analysis” doesn’t meet those criteria, either.
Such strict safeguards make sense, Harlow says. PHI as well as genetic data — increasingly prevalent thanks to advances in genomic research — is far more valuable to ne’er-do-wells than a Social Security number or credit card information, as it opens the door to healthcare fraud as well as potential discrimination based on one’s medical condition.
Because HIPAA makes it hard to get information from healthcare providers, Harlow says those interested in analyzing PHI for both individual care needs and population health management could consider another source — patients themselves.
Admittedly, for this to happen, healthcare needs to do nothing less than develop “an ecosystem based on patient-controlled data,” but Harlow says it’s a “viable alternative” to the status quo.
Luckily, the later stages of the federal government’s meaningful use incentive program start to provide some answers. Stage 2 of meaningful use, which goes into effect in 2014, requires providers to document that 5 percent of unique patients have viewed, downloaded or transmitted their electronic PHI.
Harlow also points to the Blue Button initiative as a patient enabler. The initiative — which began in the U.S. Department of Veterans Affairs but now includes more than 450 payers, providers, pharmacies and medical labs — lets patients view online, download and share any electronic PHI held by an entity that displays the blue button on its website.
Put that information in a patient’s hands, Harlow says, and those who are willing are empowered to share it. Go beyond just the Blue Button, as patient advocates suggest, and it could be possible for patients to be more specific about who gets what information: A Green Button for anonymized data, for research purposes, or a White Button for encrypted PHI. (Data need not be de-identified, though, if that’s the patient’s particular preference.)
“Can there be a parallel universe of sharing, [of] providing information that can be analyzed … and also shared on an individual basis?” Harlow asks. “Let’s create this critical mass.”
Anonymizing Health Data for HIPAA-Compliant Analysis
HIPAA permits two methods of de-identifying PHI: expert determination and safe harbor. Expert determination applies “generally accepted statistical and scientific principles for rendering information not individually identifiable” in such a way that “the risk is very small” that a person could be re-identified.
Safe harbor removes 18 specific identifiers that range from name, address and phone number to license plate number and IP address.
Most of the data that’s covered under safe harbor is a direct, or unique, identifier, says Khaled El Emam, CEO of data anonymization system vendor Privacy Analytics, and would therefore be removed from a data set prior to analysis anyway. (El Emam and Harlow spoke at the recent Strata Rx conference in Boston.) What needs de-identification, then, are the quasi-identifiers — the bits of information that can’t identify a person on their own but can when combined with other data.
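The split between direct and quasi-identifiers can be sketched in a few lines. In this illustrative Python snippet, the field names, the ZIP-truncation rule and the 10-year age bands are assumptions for demonstration, not rules drawn from Privacy Analytics’ product or the HIPAA text:

```python
# Illustrative sketch: drop direct identifiers outright, then coarsen
# quasi-identifiers so they identify a group rather than a person.
# Field names and generalization rules here are assumptions, not HIPAA's list.

DIRECT_IDENTIFIERS = {"name", "address", "phone", "ssn", "email", "ip_address"}

def deidentify(record):
    """Remove direct identifiers and generalize quasi-identifiers in a record."""
    out = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    if "zip" in out:
        # Keep only the first three ZIP digits (a common generalization).
        out["zip"] = out["zip"][:3] + "**"
    if "age" in out:
        # Bucket exact age into a 10-year band.
        decade = (out["age"] // 10) * 10
        out["age"] = f"{decade}-{decade + 9}"
    return out

record = {"name": "Jane Doe", "zip": "02134", "age": 47, "diagnosis": "J45"}
print(deidentify(record))  # → {'zip': '021**', 'age': '40-49', 'diagnosis': 'J45'}
```

The point of the sketch is the distinction itself: deleting the name is trivial, but a full ZIP code plus an exact age can single a person out when joined with outside data, so those fields are coarsened rather than kept verbatim.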
This can get tricky. To illustrate, El Emam describes how the State of Louisiana’s CajunCodeFest de-identified the 6.7 million Medicaid claims and 4 million immunization records it used in its recent hackathon.
Say you’re looking at a large patient population. The vast majority will visit a hospital only once or twice a year, but that data set will include a small minority in the long tail who made many visits. The same is true for claims data: Most patients will file but a handful of claims annually, but those in the long tail could file hundreds. Here, you likely need to “truncate the tail” so those individuals don’t stand out, El Emam says.
Pay attention to the dates that a patient’s claims were filed, too, El Emam says. Randomizing the sequence of dates could shift the order and suggest, for example, that a person was admitted the fifth time before he was admitted the fourth time. A fixed shift for a set of dates will keep the intervals intact, but it’s hard to make the case that that actually de-identifies the data. In this case, randomized generalization — converting all dates to intervals from a first “anchor date,” then randomizing the intervals within a seven-day range — will add noise but maintain order, El Emam says.
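The randomized-generalization idea described above can be sketched as follows. The ±7-day noise range, the re-sort step and the helper name are assumptions made for illustration; El Emam’s exact procedure may differ in its details:

```python
# Sketch of randomized generalization for claim dates: express each date as an
# interval from the first ("anchor") date, jitter each interval within a
# seven-day range, then re-sort so the visit sequence is preserved.
# The noise range and re-sort step are illustrative assumptions.
import random
from datetime import date, timedelta

def generalize_dates(dates, max_noise_days=7, seed=None):
    """Return noisy dates that preserve the original chronological order."""
    rng = random.Random(seed)
    anchor = min(dates)
    intervals = [(d - anchor).days for d in sorted(dates)]
    noisy = [i + rng.randint(-max_noise_days, max_noise_days) for i in intervals]
    noisy.sort()  # guard against jitter swapping two nearby admissions
    return [anchor + timedelta(days=n) for n in noisy]

claims = [date(2013, 1, 5), date(2013, 2, 10), date(2013, 3, 1)]
print(generalize_dates(claims, seed=1))
```

A fixed shift would keep the intervals exactly intact, which is why it de-identifies poorly; here each interval moves independently, yet the final sort guarantees the fifth admission still comes after the fourth.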
Brian Eastwood is a senior editor for CIO.com with more than 10 years of experience writing, editing and producing content for newspapers and the Web. He is primarily responsible for working with CIO.com's contributors and columnists, who cover topics such as cloud computing, big data, development and architecture, personal tech, the IT channel, business applications, BYOD, consumerization and business / project management. Brian's specific area of interest and expertise is healthcare IT. Prior to CIO.com, Brian was an editor at TechTarget and a newspaper reporter in the Boston suburbs. Outside the office, Brian is a history buff with a particular interest in postwar Europe and a runner who recently finished his 11th marathon.