Big data holds much promise for healthcare. Analytics use cases, which focus on heady tasks such as giving physicians more information at the point of care, reducing hospital readmissions and better treating chronic diseases, continue to emerge, while vendors such as SAP and Oracle increasingly pitch their in-memory platforms as the answer to healthcare's exceedingly complex problems.

Most of medicine's data is unstructured, though. It exists largely in free-form physician notes fields in electronic health record (EHR) systems or, worse, in manila folders. On top of that, the complexities of interoperability and health information exchange make it difficult for healthcare organizations to share information, structured or otherwise.

There's another, often overlooked wrinkle: Much of that data is personal health information (PHI) strictly protected by the Health Insurance Portability and Accountability Act, which the HIPAA omnibus rule recently strengthened to bring PHI security into the 21st century.

This means tomorrow's data scientists, not to mention today's, must make keeping patient data secure as much of a priority as actually analyzing that data to improve outcomes and reduce costs.

Go Straight to Patients Willing to Share

Under HIPAA, notes David Harlow, a healthcare attorney and consultant and founder of The Harlow Group LLC, any institution's use of PHI for purposes other than treatment, payment or healthcare operations requires patient consent. This provision prevents organizations from using patient information in marketing or selling it to a third party, but it's worth noting that "data analysis" doesn't meet those criteria, either.

Such strict safeguards make sense, Harlow says.
PHI, along with genetic data (increasingly prevalent thanks to advances in genomic research), is far more valuable to ne'er-do-wells than a Social Security number or credit card information, as it opens the door to healthcare fraud as well as potential discrimination based on one's medical condition.

Because HIPAA makes it hard to get information from healthcare providers, Harlow says those interested in analyzing PHI for both individual care needs and population health management could consider another source: patients themselves.

Related: Experts Say Health Information Exchange Privacy Concerns Overblown

Admittedly, for this to happen, healthcare needs to do nothing less than develop "an ecosystem based on patient-controlled data," but Harlow says it's a "viable alternative" to the status quo.

Luckily, the later stages of the federal government's meaningful use incentive program start to provide some answers. Stage 2 of meaningful use, which goes into effect in 2014, requires providers to document that 5 percent of unique patients have viewed, downloaded or transmitted their electronic PHI.

Harlow also points to the Blue Button initiative as a patient enabler. The initiative, which began in the U.S. Department of Veterans Affairs but now includes more than 450 payers, providers, pharmacies and medical labs, lets patients view online, download and share any electronic PHI held by an entity that displays the blue button on its website.

Put that information in a patient's hands, Harlow says, and those who are willing are empowered to share it. Go beyond just the Blue Button, as patient advocates suggest, and it could be possible for patients to be more specific about who gets what information: a Green Button for anonymized data, for research purposes, or a White Button for encrypted PHI.
(Data need not be de-identified, though, if that's the patient's particular preference.)

"Can there be a parallel universe of sharing, [of] providing information that can be analyzed … and also shared on an individual basis?" Harlow asks. "Let's create this critical mass."

Anonymizing Health Data for HIPAA-Compliant Analysis

When data does need to be de-identified before analysis, HIPAA's de-identification standard, spelled out in the HIPAA Privacy Rule, gives an organization two options:

Expert determination applies "generally accepted statistical and scientific principles for rendering information not individually identifiable" in such a way that "the risk is very small" that a person could be re-identified.

Safe harbor removes 18 specific identifiers that range from name, address and phone number to license plate number and IP address.

Most of the data that's covered under safe harbor is a direct, or unique, identifier, says Khaled El Emam, CEO of data anonymization system vendor Privacy Analytics, and would therefore be removed from a data set prior to analysis anyway. (El Emam and Harlow spoke at the recent Strata Rx conference in Boston.) What needs de-identification, then, are the quasi-identifiers: the bits of information that can't identify a person on their own but can when combined with other data.

This can get tricky. To show how it can be done, El Emam describes how the State of Louisiana CajunCodeFest de-identified the 6.7 million Medicaid claims and 4 million immunization records it used in its recent hackathon.

Related: Coding Contest Shows How Big Data Can Improve Healthcare

Say you're looking at a large patient population. The vast majority will visit a hospital only once or twice a year, but that data set will include a small minority in the long tail who made many visits. The same is true for claims data: Most patients will file but a handful of claims annually, but those in the long tail could file hundreds.
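One common fix is to clip those extreme counts so long-tail patients no longer stand out. The sketch below illustrates the idea only; the cap of 12 annual claims is an entirely made-up threshold, not a figure from El Emam or the CajunCodeFest data set.

```python
# Sketch of clipping the long tail of a per-patient claims-count
# distribution. CLAIM_CAP is an illustrative assumption, not a
# value cited in the article.
CLAIM_CAP = 12

def truncate_tail(claim_counts, cap=CLAIM_CAP):
    """Clip extreme per-patient counts so outliers can't be singled out."""
    return [min(count, cap) for count in claim_counts]

# Most patients file a handful of claims; a few in the long tail file hundreds.
counts = [2, 1, 4, 3, 250, 1, 0, 187]
print(truncate_tail(counts))  # the two outliers collapse to the cap of 12
```

The trade-off is deliberate: analysis loses the exact magnitude of the heaviest users, but those rare, highly identifiable patients become indistinguishable from everyone at the cap.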
Here, you likely need to "truncate the tail" so those individuals don't stand out, El Emam says.

Pay attention to the dates that a patient's claims were filed, too, El Emam says. Randomizing the sequence of dates could shift the order and suggest, for example, that a person was admitted the fifth time before he was admitted the fourth time. A fixed shift for a set of dates will keep the intervals intact, but it's hard to make the case that that actually de-identifies the data. In this case, randomized generalization (converting all dates to intervals from a first "anchor date," then randomizing the intervals within a seven-day range) will add noise but maintain order, El Emam says.

Brian Eastwood is a senior editor for CIO.com. He primarily covers healthcare IT. You can reach him on Twitter @Brian_Eastwood or via email. Follow everything from CIO.com on Twitter @CIOonline, Facebook, Google+ and LinkedIn.
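The randomized generalization El Emam describes can be sketched in a few lines of Python. This is a hedged illustration, not his implementation: the function name, the jitter window of minus three to plus three days (a seven-day range), and the choice to floor each gap at zero are all assumptions made for the example.

```python
import random
from datetime import date, timedelta

def randomize_dates(dates, jitter=3):
    """Sketch of randomized generalization: treat the first date as the
    anchor, jitter each gap between consecutive events within a seven-day
    range (-3..+3 days), and floor gaps at zero so event order survives."""
    # For simplicity the anchor is kept as-is; a real de-identification
    # pass would shift or generalize the anchor date as well.
    noisy = [dates[0]]
    for prev, curr in zip(dates, dates[1:]):
        gap = (curr - prev).days + random.randint(-jitter, jitter)
        noisy.append(noisy[-1] + timedelta(days=max(gap, 0)))
    return noisy

random.seed(42)
visits = [date(2013, 1, 5), date(2013, 2, 1), date(2013, 6, 30)]
shifted = randomize_dates(visits)
assert shifted == sorted(shifted)  # admission order stays intact
```

Flooring each noisy gap at zero is what prevents the problem the article warns about: no amount of jitter can make a fifth admission appear to precede the fourth, yet the exact intervals no longer match the original record.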