Recently, HHS released a proposed rule regarding new regulations for health insurance companies. The specific document is called:
Patient Protection and Affordable Care Act: HHS Notice of Benefit and Payment Parameters for 2016
In that proposed rule were two open data concepts that are worth noting:
- A suggestion that insurance companies be required to release their formulary data as machine readable data sets.
- A suggestion that insurance companies be required to release data about their current provider directory as machine readable data sets.
As you might imagine, the DocGraph Journal consistently advocates for open data and indeed, we did submit comments regarding this issue…
We had one of our part-time researchers (thanks Armie!!) search all of the comments for mentions of “machine readable” and/or “data” to see who had commented on this matter besides us. Then we created a Google Sheets page with all of the relevant comments in one place. We are now releasing this data to the public.
Read on to access the data, and to read our first-pass analysis of what we found!
Fred Trotter has been working with the PubMed API, related to WikiProject Medicine and the UCSF 2014 Elective class. Both have goals to evaluate and improve the medical content on Wikipedia.
You can learn how to search for and download data about specific articles using the PubMed API on Fred Trotter’s blog post here.
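As a rough sketch of what such a search might look like, assuming that “the PubMed API” here means the NCBI E-utilities `esearch` endpoint (the blog post itself may take a different approach). The function names and the search term are made up for illustration:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# NCBI E-utilities search endpoint (assumed to be the API in question).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_search_url(term, max_results=20):
    """Build an esearch URL that asks PubMed for matching article ids as JSON."""
    params = {
        "db": "pubmed",
        "term": term,
        "retmax": max_results,
        "retmode": "json",
    }
    return ESEARCH_URL + "?" + urlencode(params)


def search_pubmed(term, max_results=20):
    """Run the search over the network and return the list of PubMed ids."""
    with urlopen(build_search_url(term, max_results)) as response:
        payload = json.load(response)
    return payload["esearchresult"]["idlist"]
```

Calling `search_pubmed("some search term")` needs network access and returns a list of PubMed ids as strings, which can then be used to download details about each article.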
If you spend much time in the patient community, you will meet someone who has been burned badly by the “out of network” game that insurance companies play with/against healthcare providers.
It’s simple: you get insurance plan A from company Z. Then you go to a specialist or get a scan or something and you ask, “Do you take company Z insurance?” They say “sure”. You hand them the insurance card. What they don’t tell you is that they will be billing “out of network”, which means you will hardly be covered at all.
You go to the insurance company, they point to the provider. You go to the provider, they point to the insurance company. Who is left with the huge bill? The patient.
Sometimes this gets really bad. In the worst cases, important treatments to relieve suffering are delayed.
Are you tired of this? In order to fix this, we need to be able to build systems that tell us for sure which providers are in a given plan at a given time. We need to have that system available when we purchase our health insurance so that we can buy insurance that covers the doctors that we already use, or the ones that we want to use. We can imagine a theoretical tool called JustShowMeTheDoctorNetwork.com that solves this problem in a user friendly way.
There are lots of companies and journalists in the DocGraph community that would love to be able to build such a tool. DocGraph would love to provide the data for such a tool but right now that would require that we scrape the websites of every insurance company provider directory in the country. Those websites are really unfriendly to such efforts. The following text was taken from the user agreement of the doctor finder tool for Aetna:
Provider information contained in this directory is updated 6 days per week, excluding holidays, Sundays, or interruptions due to system maintenance, upgrades or unplanned outages. This information is subject to change at any time. Therefore please check with the provider before scheduling your appointment or receiving services to confirm he or she is participating in Aetna’s network. Participating physicians, hospitals and other health care providers are independent contractors and are neither agents nor employees of Aetna. The availability of any particular provider cannot be guaranteed, and provider network composition is subject to change. Notice of the change shall be provided in accordance with applicable state law.
The underlines are mine.
First, Aetna does not want anyone scraping their website. They do not want people like DocGraph to create these data sets. They view their list of providers as a protected information asset that only they can leverage.
But more importantly, they put the responsibility for “who is in what plan” squarely on the doctors. Which really means the patients, because the doctors’ websites will just say “check the insurance company website”. See what I mean about finger pointing?
Insurance companies and healthcare providers need to be held accountable for their in vs out status. The only way to do this is to create an open data set that maps Plans to Providers so that projects like JustShowMeTheDoctorNetwork.com are really easy to build.
The policy wonks at HHS/CMS/ONC et al get this. They have recently added the following text to the rules for the 2016 insurance plans.
…we propose that a QHP issuer must publish an up-to-date, accurate, and complete provider directory, including information on which providers are accepting new patients, the provider’s location, contact information, specialty, medical group, and any institutional affiliations, in a manner that is easily accessible to plan enrollees, prospective enrollees, the State, the Exchange, HHS and OPM. As part of this requirement, we propose that a QHP issuer must update the directory information at least once a month, and that a provider directory will be considered easily accessible when the general public is able to view all of the current providers for a plan on the plan’s public Web site through a clearly identifiable link or tab without having to create or access an account or enter a policy number….(blah blah)…We also are considering requiring issuers to make this information publicly available on their Web sites in a machine-readable file and format specified by HHS.
underlines are mine…
This would solve the problem. Anyone who wanted to create a website that showed what plans any given provider accepted would be able to do so easily.
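To make that concrete, here is a minimal sketch of the lookup such a website would need. HHS has not specified a file format, so everything here is an assumption for illustration: the record layout, the field names, the plan and provider ids, and the dates are all invented.

```python
from datetime import date

# Hypothetical plan-to-provider records. A real machine-readable directory
# would use whatever format HHS ends up specifying.
DIRECTORY = [
    {"plan_id": "PLAN-A", "provider_id": "1111111111",
     "start": date(2014, 1, 1), "end": date(2014, 12, 31)},
    {"plan_id": "PLAN-B", "provider_id": "1111111111",
     "start": date(2014, 7, 1), "end": None},  # None means still participating
    {"plan_id": "PLAN-A", "provider_id": "2222222222",
     "start": date(2014, 1, 1), "end": None},
]


def plans_for_provider(directory, provider_id, on_date):
    """Return the sorted plan ids that list this provider on the given date."""
    return sorted(
        record["plan_id"]
        for record in directory
        if record["provider_id"] == provider_id
        and record["start"] <= on_date
        and (record["end"] is None or on_date <= record["end"])
    )


def is_in_network(directory, plan_id, provider_id, on_date):
    """Answer the question patients actually ask: is this doctor in my plan today?"""
    return plan_id in plans_for_provider(directory, provider_id, on_date)
```

The dates matter: a provider who is in a plan this month may not be next month, which is why the lookup takes a date rather than just a plan and a provider.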
But the key word here is “propose”. Insurance companies in this country benefit greatly from the confusion about in network and out of network, and so do some unethical healthcare providers. There will be lots of people who oppose this proposal.
I hope that I have made the case that this information needs to be open and machine readable. If you’re convinced, then you can find the comment page to support this policy here. If you disagree with us, and you still want to submit a comment, you can use this page.
Please take a few moments and write in to support this policy change. The comments are due Dec 22nd, 2014, which is basically tomorrow.
If you would like to read the in-progress comments from the DocGraph Journal you can go here. Feel free to cut and paste from our comments into your own comments; we would be flattered.
Feel free to tell them that I sent you.
The DocGraph Summit was a great success, a big thanks to everyone who made it down. We filled our day discussing current open health data initiatives, questions, and goals. We are grateful to Houston Technology Center for providing an excellent venue, and we are already scheming for the 2nd annual Summit next year! Check out the Storify here: https://storify.com/fredtrotter/docgraph-summit-2014 .
The DocGraph Summit is just around the corner!
This “unconference” will include short presentations on current projects of the participants, and discussions on the topics, challenges, and ideas deemed most relevant and paramount to the open health data community. Our goal is to set an atmosphere conducive to in-depth dialogue, concept mapping, networking, and brainstorming.
The Summit will also review DocGraph’s open healthcare data initiatives. These projects include food, medical, doctor, and hospital data, as well as other fun topics that are not easily categorized.
Currently we have academics, corporate delegates, researchers and entrepreneurs attending the Summit. Their areas of focus include data analytics, open source drug databases, EHRs, gene/drug interactions, VistA, Health IT, ACOs, statistics, etc. Attendees are coming from Rice, Stony Brook, UTHSC, e-mds, PwC, Baylor Medicine, the DocGraph community, and more.
Email email@example.com for university student and faculty discount codes.
The DocGraph Summit is being held alongside the International Conference on Biomedical Ontology (ICBO) 14.
DocGraph Omni was a website that we used to display a merged set of the open data that is available on healthcare providers.
It was a good idea, but it did not work. Or at least, it is used so infrequently that it is not worth the resources that DocGraph is spending on it. Omni was interesting only to the degree that it could serve as a crowdsourcing mechanism for even more awesome open data about doctors and hospitals. Omni is just not doing its job as a crowdsourcing tool.
More importantly, two of our informal journalism partners, Propublica and US News, have both begun offering more popular consumer facing systems, using our data.
We would rather invest in doing a better job providing US News and Propublica with data than offer a clearly inferior consumer facing product ourselves. We will do our best to ensure that both Propublica and US News at least have the option of replicating all of the functionality of DocGraph Omni.
More importantly, CareSet Systems, the sister company to DocGraph which focuses on healthcare system analytics, is offering a commercial product called Patch that does far more than Omni ever did. But Patch functionality is focused on the needs of healthcare organizations, like Hospitals, ACOs, SNFs and LTACs.
We have decided to retire Omni, and invest in our relationships with other data journalists and in the CareSet Patch service. We will leave the Omni server up for the next few days, but expect that site to forward to DocGraph.org soon.
The DocGraph Journal creates multiple, unprecedented datasets to improve healthcare. It is focused on building an open community of data scientists primed to share analysis of the torrential amount of new healthcare data posted by federal and state governments. The DocGraph Journal serves as an interface between government affairs (HHS and CMS), journalism organizations (O’Reilly Media, US News, ProPublica), academics, and entrepreneurs. The journal is supported by research grants from Merck, athenahealth, and Robert Wood Johnson Foundation.
The 2014 Summit will review DocGraph’s open data healthcare projects. These projects include food, medical, doctor, and hospital data, as well as other fun topics that are not easily categorized.
Join fellow health data enthusiasts for an engaging day of unconference-style discussions and presentations, as well as meals and happy hour within walking distance of the venue.
Date: October 8, 2014
Location: Houston Technology Center
The DocGraph Summit is being held alongside the International Conference on Biomedical Ontology (ICBO) 14.
ICBO 14 runs Oct 6-9 and we are encouraging DocGraph Summit participants to attend the first two days of ICBO (Oct 6, 7), which will feature workshops discussing the options for an Open Source Medication Database.
Today, word came out that NY released taxi data that has been entirely reidentified.
The technique and concepts to conduct the attack can be found here, and I also found the slashdot discussion interesting.
The result is that the identity and paths of specific named taxi cabs are now public information. This is not entirely bad, since now the data set will be extensively used to detect specific bad actors. Still, it was more than the NY government intended and will probably result in a lawsuit.
That lawsuit will be mostly justified, since it is well understood among security professionals how to do de-identification right, and the rules were not followed. If you are doing this with health data, I can recommend fellow O’Reilly author Khaled El Emam, who wrote both Anonymizing Health Data and Guide to the De-Identification of Personal Health Information. You can hire him through Privacy Analytics. He is the de-identification expert that I know best and can endorse, but he is far from the only one.
Generally, hashing can be a reasonable approach as long as salts are used in combination with a secure hash algorithm. I prefer to use a different salt for every id, which makes a rainbow table attack (like this one) pretty hard to do.
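Here is a minimal sketch of the per-id salt idea. It is an illustration under stated assumptions, not a vetted de-identification procedure: the class name, the `TAXI0042`-style id format, and the choice of SHA-256 are all mine. It also shows why an unsalted hash over a small id space is easy to reverse with a precomputed table.

```python
import hashlib
import secrets


def unsalted_hash(raw_id):
    """The weak approach: a plain hash that anyone can recompute."""
    return hashlib.sha256(raw_id.encode("utf-8")).hexdigest()


class SaltedPseudonymizer:
    """Hashes each raw id together with a random salt unique to that id."""

    def __init__(self):
        # raw id -> salt. This table must stay private and never be released.
        self._salts = {}

    def pseudonym(self, raw_id):
        if raw_id not in self._salts:
            self._salts[raw_id] = secrets.token_bytes(16)
        salt = self._salts[raw_id]
        return hashlib.sha256(salt + raw_id.encode("utf-8")).hexdigest()


# An attacker can enumerate a small id space (hypothetical format: TAXI0000
# through TAXI9999) and precompute every unsalted hash.
lookup_table = {unsalted_hash(f"TAXI{n:04d}"): f"TAXI{n:04d}" for n in range(10000)}

pseudonymizer = SaltedPseudonymizer()
released = pseudonymizer.pseudonym("TAXI0042")
```

Looking up `unsalted_hash("TAXI0042")` in `lookup_table` recovers the original id immediately, while `released` does not appear in the table at all. The same id still maps to the same pseudonym on every call, so trips by one cab can be linked without naming the cab.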
More importantly, it is also entirely appropriate to simply use a randomly generated number instead of a hash. Hashes are convenient when you need to rely on a dynamic and extensible process, rather than static data. Hashing also allows you to throw away the original data, and know that you can reliably repeat the process given new data. That is why it is used so frequently in password storage.
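The random-number alternative can be sketched just as briefly. Again this is only an illustration: the function name and the sample ids are invented, and the mapping it returns has to be kept private (or destroyed) after the release.

```python
import secrets


def build_random_id_map(raw_ids):
    """Assign every distinct raw id a random token with no relation to the id."""
    mapping = {}
    used_tokens = set()
    for raw_id in raw_ids:
        if raw_id in mapping:
            continue  # the same id always gets the same token
        token = secrets.token_hex(8)
        while token in used_tokens:  # vanishingly unlikely, but keep tokens unique
            token = secrets.token_hex(8)
        used_tokens.add(token)
        mapping[raw_id] = token
    return mapping


# Hypothetical ids as they might appear in a trip log, with one repeat.
trip_ids = ["TAXI0042", "TAXI0107", "TAXI0042"]
id_map = build_random_id_map(trip_ids)
released_ids = [id_map[raw_id] for raw_id in trip_ids]
```

Because the tokens are not derived from the ids, there is nothing for an attacker to precompute. The tradeoff is the one described above: the process cannot be repeated on new data unless the mapping is retained.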
Unfortunately, this will have a chilling effect on open data releases, but I am glad it happened. This is a relatively unimportant data set. Which is to say, this could have been much worse. This could have happened with patient data. I work with stuff like HIV and TB infection data, as well as EHR notes containing infidelities etc. I hate to say it, but it’s better for governments to learn on taxi cabs.
Lastly, I would encourage those who are considering doing data releases like this to reach out to organizations like Propublica and/or DocGraph. If you cannot afford to hire Khaled, we can at least help to ensure that you avoid the basic mistakes. Believe it or not, data journalists like myself are not interested in violating legitimate privacy rights (although we can have a healthy debate around the word “legitimate”) and we would be more than happy to help ensure that a data release is free from reidentification drama.
Part of me wonders why they didn’t just release the taxi data with the taxi numbers intact. I strongly prefer real-name accountability in data sets like this. It might be because by learning the identity of the taxi, you might be able to infer the identity of the passenger, who has a legitimate privacy concern.
Accidents like this will happen, and NY was right to make a release rather than hold back a release because there “might” be a way to reidentify a data set. My hat is off again to NY state/city… innovators in open data.