by Brian Eastwood

Solving Healthcare’s Big Data Analytics Security Conundrum

Oct 07, 2013 | 5 mins
Analytics | Big Data | Data and Information Security

HIPAA understandably makes it hard for organizations to obtain personal health information and even harder to use that information for the purpose of data analysis. Empowering patients to own and share their own data -- and then assuring them that it's being properly de-identified -- can ease this process.

Big data holds much promise for healthcare. Analytics use cases — which focus on heady tasks such as giving physicians more information at the point of care, reducing hospital readmissions and better treating chronic diseases — continue to emerge, while vendors such as SAP and Oracle increasingly pitch their in-memory platforms as the solution to healthcare’s exceedingly complex problems.

Most of medicine’s data is unstructured, though. It exists largely in free-form physician notes fields in electronic health record (EHR) systems or, worse, in manila folders. On top of that, the complexities of interoperability and health information exchange make it difficult for healthcare organizations to share information, structured or otherwise.

There’s another, often overlooked wrinkle: Much of that data is personal health information strictly protected by the Health Insurance Portability and Accountability Act, which the HIPAA omnibus rule recently strengthened to bring PHI security into the 21st century.

This means tomorrow’s data scientists, not to mention today’s, must make the task of keeping patient data secure as much of a priority as actually analyzing that data in order to improve outcomes and reduce costs.

Go Straight to Patients Willing to Share


Under HIPAA, notes David Harlow, a healthcare attorney and consultant and founder of The Harlow Group LLC, any institution’s use of PHI for purposes other than treatment, payment or health care operations requires patient consent. This provision prevents organizations from using patient information in marketing or selling it to a third party, but it’s worth noting that “data analysis” doesn’t meet those criteria, either.

Such strict safeguards make sense, Harlow says. PHI, along with genetic data (increasingly available thanks to advances in genomic research), is far more valuable to ne’er-do-wells than a Social Security number or credit card information, as it opens the door to healthcare fraud as well as potential discrimination based on one’s medical condition.

Because HIPAA makes it hard to get information from healthcare providers, Harlow says those interested in analyzing PHI for both individual care needs and population health management could consider another source — patients themselves.

Related: Experts Say Health Information Exchange Privacy Concerns Overblown

Admittedly, for this to happen, healthcare needs to do nothing less than develop “an ecosystem based on patient-controlled data,” but Harlow says it’s a “viable alternative” to the status quo.

Luckily, the later stages of the federal government’s meaningful use incentive program start to provide some answers. Stage 2 of meaningful use, which goes into effect in 2014, requires providers to document that 5 percent of unique patients have viewed, downloaded or transmitted their electronic PHI.

Harlow also points to the Blue Button initiative as a patient enabler. The initiative — which began in the U.S. Department of Veterans Affairs but now includes more than 450 payers, providers, pharmacies and medical labs — lets patients view online, download and share any electronic PHI held by an entity that displays the blue button on its website.

Put that information in a patient’s hands, Harlow says, and those who are willing are empowered to share it. Go beyond just the Blue Button, as patient advocates suggest, and it could be possible for patients to be more specific about who gets what information: A Green Button for anonymized data, for research purposes, or a White Button for encrypted PHI. (Data need not be de-identified, though, if that’s the patient’s particular preference.)

“Can there be a parallel universe of sharing, [of] providing information that can be analyzed … and also shared on an individual basis?” Harlow asks. “Let’s create this critical mass.”

Anonymizing Health Data for HIPAA-Compliant Analysis

When data does need to be de-identified before analysis, HIPAA’s de-identification standard, spelled out in the HIPAA Privacy Rule, gives an organization two options:

  • Expert determination applies “generally accepted statistical and scientific principles for rendering information not individually identifiable” in such a way that “the risk is very small” that a person could be re-identified.
  • Safe harbor removes 18 specific identifiers that range from name, address and phone number to license plate number and IP address.
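In practice, safe harbor amounts to stripping those identifier fields from each record before analysis. A minimal sketch, assuming dictionary-style records with hypothetical field names (this covers only a subset of the rule’s 18 identifier categories):

```python
# Hypothetical subset of HIPAA safe-harbor identifiers; the actual rule
# enumerates 18 categories (names, geographic subdivisions smaller than
# a state, most dates tied to an individual, and so on).
SAFE_HARBOR_FIELDS = {
    "name", "address", "phone", "email", "ssn",
    "license_plate", "ip_address", "medical_record_number",
}

def apply_safe_harbor(record: dict) -> dict:
    """Return a copy of the record with direct identifiers removed."""
    return {k: v for k, v in record.items() if k not in SAFE_HARBOR_FIELDS}
```

A record such as `{"name": "Jane Doe", "ip_address": "10.0.0.1", "diagnosis": "flu"}` would come back with only the diagnosis field intact.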

Most of the data that’s covered under safe harbor is a direct, or unique, identifier, says Khaled El Emam, CEO of data anonymization system vendor Privacy Analytics, and would therefore be removed from a data set prior to analysis anyway. (El Emam and Harlow spoke at the recent Strata Rx conference in Boston.) What needs de-identification, then, are the quasi-identifiers — the bits of information that can’t identify a person on their own but can when combined with other data.
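Quasi-identifiers are typically generalized rather than dropped outright. As a sketch (the field names and bucket sizes are assumptions, not anything El Emam prescribes), one might coarsen an exact age into a five-year band and a five-digit ZIP code into its first three digits:

```python
def generalize_quasi_identifiers(record: dict) -> dict:
    """Coarsen quasi-identifiers so individual records blend into larger
    groups: age becomes a five-year band, ZIP keeps only its prefix."""
    out = dict(record)
    if "age" in out:
        lo = (out["age"] // 5) * 5
        out["age"] = f"{lo}-{lo + 4}"
    if "zip" in out:
        out["zip"] = str(out["zip"])[:3] + "XX"
    return out
```

For example, `{"age": 37, "zip": "02134"}` generalizes to `{"age": "35-39", "zip": "021XX"}` — still useful for population-level analysis, but far harder to link back to one person.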

This can get tricky. To illustrate, El Emam describes how the State of Louisiana’s CajunCodeFest de-identified the 6.7 million Medicaid claims and 4 million immunization records used in its recent hackathon.

Related: Coding Contest Shows How Big Data Can Improve Healthcare

Say you’re looking at a large patient population. The vast majority will visit a hospital only once or twice a year, but that data set will include a small minority in the long tail who made many visits. The same is true for claims data: Most patients will file but a handful of claims annually, but those in the long tail could file hundreds. Here, you likely need to “truncate the tail” so those individuals don’t stand out, El Emam says.
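The article doesn’t specify how the CajunCodeFest data set was truncated; one common approach is to cap per-patient counts at a chosen percentile of the population, sketched here under that assumption:

```python
def truncate_tail(visit_counts: dict, percentile: float = 0.99) -> dict:
    """Cap each patient's visit count at the given percentile of the
    population, so heavy utilizers in the long tail no longer stand out."""
    counts = sorted(visit_counts.values())
    cap = counts[int(percentile * (len(counts) - 1))]
    return {patient: min(n, cap) for patient, n in visit_counts.items()}
```

With 99 patients at two visits apiece and one outlier at 300, the outlier’s count is clamped down to two — the analysis keeps the record, but the record no longer stands out.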

Pay attention to the dates that a patient’s claims were filed, too, El Emam says. Randomizing the sequence of dates could shift the order and suggest, for example, that a person was admitted the fifth time before he was admitted the fourth time. A fixed shift for a set of dates will keep the intervals intact, but it’s hard to make the case that that actually de-identifies the data. In this case, randomized generalization — converting all dates as intervals from a first “anchor date,” then randomizing the intervals within a seven-day range — will add noise but maintain order, El Emam says.
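El Emam’s exact procedure isn’t given in the article, but here is a minimal sketch of randomized generalization under one plausible reading: jitter each inter-event gap by up to seven days, clamped so the sequence stays in order.

```python
import random
from datetime import date, timedelta

def generalize_dates(dates, max_jitter_days=7, seed=None):
    """Randomized generalization: treat each date as an interval from the
    first ("anchor") date, jitter every inter-event gap by up to
    max_jitter_days, then rebuild the sequence. Clamping keeps each new
    gap positive, so event order is preserved while exact dates change."""
    rng = random.Random(seed)
    dates = sorted(dates)
    out = [dates[0]]
    for prev, curr in zip(dates, dates[1:]):
        gap = (curr - prev).days
        jitter = rng.randint(-max_jitter_days, max_jitter_days)
        new_gap = max(1 if gap >= 1 else 0, gap + jitter)
        out.append(out[-1] + timedelta(days=new_gap))
    return out
```

Note that this sketch leaves the anchor date itself unchanged; a real pipeline would likely also shift or coarsen the anchor, since a first-admission date is itself a quasi-identifier.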

Brian Eastwood is a senior editor. He primarily covers healthcare IT. You can reach him on Twitter @Brian_Eastwood or via email. Follow @CIOonline on Twitter, Facebook, Google+ and LinkedIn.