The DocGraph Journal creates multiple, unprecedented datasets to improve healthcare. It is focused on building an open community of data scientists primed to share analysis of the torrential amount of new healthcare data posted by federal and state governments. The DocGraph Journal links government affairs (with HHS and CMS) to journalism organizations (O’Reilly Media, US News, ProPublica), academics, and entrepreneurs. The journal is supported by research grants from Merck, athenahealth, and the Robert Wood Johnson Foundation.
The 2014 Summit will review DocGraph’s open data healthcare projects. These projects include food, medical, doctor, and hospital data, as well as other fun topics that are not easily categorized.
Join fellow health data enthusiasts for an engaging day of unconference-style discussions and presentations, as well as meals and happy hour within walking distance of the venue.
Date: October 8, 2014
Location: Houston Technology Center
The DocGraph Summit is being held alongside the International Conference on Biomedical Ontology (ICBO) 14.
ICBO 14 runs Oct 6-9 and we are encouraging DocGraph Summit participants to attend the first two days of ICBO (Oct 6, 7), which will feature workshops discussing the options for an Open Source Medication Database.
Today, word came out that NY released taxi data that has been entirely reidentified.
The technique and concepts used to conduct the attack can be found here, and I also found the Slashdot discussion interesting.
The result is that the identity and paths of specific named taxi cabs are now public information. This is not entirely bad, since now the data set will be extensively used to detect specific bad actors. Still, it was more than the NY government intended and will probably result in a lawsuit.
That lawsuit will be mostly justified, since it is well understood among security professionals how to do de-identification right, and the rules were not followed. If you are doing this with health data, I can recommend fellow O’Reilly author Khaled El Emam, who wrote both Anonymizing Health Data and Guide to the De-Identification of Personal Health Information. You can hire him through Privacy Analytics. He is the de-identification expert that I know best and can endorse, but he is far from the only one.
Generally, hashing can be a reasonable approach as long as salts are used in combination with a secure hash algorithm. I prefer to use a different salt for every id, which makes a rainbow table attack (like this one) pretty hard to do.
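To make the per-id salt idea concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the medallion-style ids are made up, SHA-256 stands in for "a secure hash algorithm", and the function name pseudonymize is hypothetical. It is not a description of what NY actually did or should have done.

```python
import hashlib
import secrets

def pseudonymize(ids):
    """Map each raw id to a salted SHA-256 digest, using a fresh random salt per id."""
    salts = {}   # raw id -> secret salt; must never be published
    tokens = {}  # raw id -> pseudonym that is safe to publish
    for raw in ids:
        if raw not in salts:
            salts[raw] = secrets.token_bytes(16)
            digest = hashlib.sha256(salts[raw] + raw.encode("utf-8"))
            tokens[raw] = digest.hexdigest()
    return tokens, salts

# Hypothetical ids, for illustration only.
tokens, salts = pseudonymize(["5X55", "7B12", "5X55"])
```

Because every id gets its own unpublished salt, an attacker cannot precompute one table of hashes for the whole (small) space of possible ids. The cost is that the salt table has to be stored and protected for as long as the mapping needs to be repeatable.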
More importantly, it is also entirely appropriate to simply use a randomly generated number instead of a hash. Hashes are convenient when you need to rely on a dynamic and extensible process, rather than static data. Hashing also allows you to throw away the original data, and know that you can reliably repeat the process given new data. That is why it is used so frequently in password storage.
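The random-number alternative can be sketched in a few lines of Python. The ids and the function name assign_random_ids are hypothetical, and the 8-byte surrogate length is an arbitrary choice for the example.

```python
import secrets

def assign_random_ids(ids):
    """Give each distinct raw id a random surrogate that carries no information about it."""
    mapping = {}
    used = set()
    for raw in ids:
        if raw in mapping:
            continue
        surrogate = secrets.token_hex(8)
        while surrogate in used:  # collisions are extremely unlikely, but cheap to rule out
            surrogate = secrets.token_hex(8)
        used.add(surrogate)
        mapping[raw] = surrogate
    return mapping

# Hypothetical trip records keyed by a raw taxi id.
trips = ["5X55", "7B12", "5X55"]
mapping = assign_random_ids(trips)
released = [mapping[raw] for raw in trips]
```

The same taxi still maps to the same surrogate within the release, so the data stays analyzable, but there is nothing to reverse: the surrogate is not derived from the original id at all. The trade-off is that the mapping table has to be kept if later releases must use the same surrogates.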
Unfortunately, this will result in a chilling effect for open data releases, but I am glad it happened. This is a relatively unimportant data set. Which is to say, this could have been much worse. This could have happened with patient data. I work with stuff like HIV and TB infection data, as well as EHR notes containing infidelities etc. I hate to say it, but it’s better for governments to learn on taxi cabs.
Lastly, I would encourage those who are considering doing data releases like this to reach out to organizations like ProPublica and/or DocGraph. If you cannot afford to hire Khaled, we can at least help to ensure that you avoid the basic mistakes. Believe it or not, data journalists like myself are not interested in violating legitimate privacy rights (although we can have a healthy debate around the word “legitimate”) and we would be more than happy to help ensure that a data release is free from reidentification drama.
Part of me wonders why they didn’t just release the taxi data with the taxi numbers intact. I strongly prefer real-name accountability in data sets like this. It might be because by learning the identity of the taxi, you might be able to infer the identity of the passenger, who has a legitimate privacy concern.
Accidents like this will happen, and NY was right to make a release rather than hold back a release because there “might” be a way to reidentify a data set. My hat is off again to NY state/city… innovators in open data.
The DocGraph Alliance is a new group of organizations committed to supporting data journalism and data science community efforts. Three global leaders in healthcare, athenahealth, CareSet, and Merck (known as MSD outside the United States and Canada), have signed on as founding members of the Alliance.
The DocGraph Alliance’s community mission is to encourage an ecosystem of innovators to collaborate and share tools and research methodologies around open healthcare datasets. This Alliance will help further develop technical analysis and methods around data released by federal, county, and state entities, as well as those originated by the community.
The DocGraph Alliance is a project of The DocGraph Journal, which shares data with a community of quantitatively minded professionals who mine publicly available clinical datasets to uncover interesting and meaningful insights. Support from the Alliance members means the DocGraph Journal can continue providing support for the growing community of data scientists focused on leveraging initiatives of transparency in healthcare. As a result of the community’s work, specific news coverage has incorporated DocGraph data, including work from US News, ProPublica and the Kansas City Star.
“The DocGraph project created a platform for data scientists to collaborate openly on publicly available health data sources where nothing existed before”, said James Ciriello, Associate Vice President of Merck IT Strategy and Innovation, “and as we watched this community become more and more active in trying to address significant problems, we wanted to support it and help it grow. As publicly available healthcare data continues to grow at a fast pace, coordination and comparatives of care become commonplace, and insights on therapy start to drive novel innovation.”
“We are thrilled to partner with the DocGraph Alliance. Fred Trotter in particular has taken on ambitious and important work to socialize open data assets in healthcare and to leverage data in meaningful ways to advance the industry,” said Todd Rothenhaus, chief medical officer, athenahealth, Inc. “At athenahealth, we believe healthcare could benefit from more data openness and transparency. Access to expanded and new types of data through the DocGraph Alliance will support our work to improve our cloud-based services and further innovate based on evidence-based insights and industry trends.”
“Our business, as well as countless others, rely on the availability of Open Healthcare datasets. Our healthcare system modeling tools improve with every Open Data release”, said Ashish Patel, founder of CareSet Systems. “We want to ensure that DocGraph continues to flourish! The healthcare system needs a cadence of Open Data in order to effectively pursue the Triple Aim.”
DocGraph will work to grow and nurture an open community of data professionals through a series of trainings and events with a focus on further use of open health datasets and development of new methods and tools to analyze those datasets.
About The DocGraph Journal
The DocGraph Journal seeks to create and disseminate new open healthcare data sets, and to foster a community of data scientists who contribute tools and expertise to the analyses of open healthcare data. The Journal was founded after Fred Trotter’s crowdfunding of the first DocGraph data set demonstrated a demand for open healthcare data. The original data set, created from a FOIA request, showed how physicians and other healthcare providers collaborate to deliver care to Medicare patients. This original DocGraph data set remains the largest real-name social graph available to the public.
Alma Trotter, email@example.com
Fred Trotter, co-founder of The DocGraph Journal, will be speaking at Health Datapalooza in D.C. this week! Health Datapalooza is a national conference focused on liberating health data, and bringing together the companies, startups, academics, government agencies, and individuals with the newest and most innovative and effective uses of health data to improve patient outcomes.
If you are attending the conference, be sure to attend one of his panels listed below, and watch for tweets from @DocGraph and @fredtrotter.
2:45 – 3pm
4:15-5:45pm Tech Track:
HealthCare Entrepreneurs BootCamp- Strategy, Practice & Games for Using Public Data to Build, Scale and Deliver Value
1:30-2:30pm Consumer Track:
Data Scientist- Extracting Data Forcefully From Bureaucracies
3-4pm Tech Track
“What if It Actually Works?” A World with People Using Open Health Data- Dystopia or the Singularity?
To catch everyone up, here is the brief sequence of events:
With that in mind, we are happy to announce that DocGraph Omni has been selected as one of the winning entries into the Code-a-Palooza!
Now it’s time for a look at the competition!
- Arcadia Solutions Arcadia looks like a top of the line Health IT consulting shop, and they have previously won a Surescripts hackathon.
- DocSpot DocSpot is an advanced doctor search tool. They are also an active member of the DocGraph community and have done some innovative work with chargemaster data in hospitals.
- karmadata karmadata is an advanced API to lots of healthcare data sets. They seem to have lots of international data, and solid data sets for clinical trials. They should be able to come up with something really easy using their other data sources!
- Lyfechannel Develops advanced patient intervention mobile apps.
- Medecision Population management with Big Data.
- Team FloriDUH Another DocGraph community member Mandi Bishop is leading a team of top thinkers!
- University of Wisconsin-Madison Can’t find a link for this one, but it’s one of the two academic teams!!
- Zynx Health Another Big Data player, this time with expertise in Clinical Decision Support
Lots of Big Data expertise, experience designing software, even AI and robotics experience. You can expect some crazy good applications and a fierce competition. Which we plan on winning.
The new procedure data set was just uploaded by CMS. In major points for style, they released the data at 12:01 am.
You are probably wondering: How do I work with this data? If you are comfortable with MySQL or MariaDB this will help:
Unless you have been living under a rock (or you are just not “into” healthcare data journalism) you know that CMS is planning on releasing a massive data set about how doctors provide healthcare to Medicare patients (of course patient privacy will be protected).
This is a very exciting day. The Obama administration, HHS and CMS should all be applauded for taking Obama’s commitment to open government seriously! They have already taken heat, and will continue to take heat, from doctors who believe that the data will be used to hurt them. The AMA had a press release opposing the data drop, which they hastily removed (but which is still available on Google’s cache). This is what they originally had to say about the matter:
“The American Medical Association (AMA) is committed to transparency and supports the release of physician data to improve quality of care. However, we also believe that certain safeguards are necessary to ensure that information is accurate and reliable for patients and other stakeholders.
“The AMA is concerned that CMS’ broad approach to releasing physician payment data will mislead the public into making inappropriate and potentially harmful treatment decisions and will result in unwarranted bias against physicians that can destroy careers. We have witnessed these inaccuracies in the past.
“To guarantee that information is accurate, complete, and helpful, the AMA strongly recommends that physicians be permitted to review and correct their information prior to the data release. This safeguard is not only practical but was recognized and included in other data release proposals, including bi-cameral and bi-partisan legislation supported by the AMA. Additionally, any analysis of the data released should note methodologies to ensure understanding of its limitations.
“Taking an approach that provides no assurances of accuracy of the data or explanations of its limitations will not allow patients to draw meaningful conclusions about the quality of care.”
Ardis Dee Hoven, M.D.
President, American Medical Association
Now the AMA has reversed course. They have removed the above press release from their news site, and an anonymous official has apparently spoken with an Associated Press reporter indicating that the AMA will not seek to enjoin the release of the data. For the AMA, “not officially not opposing” something is as close to an endorsement as it gets. I expect that they will have something permanent to say on the matter pretty soon, and I will link to that here once they do.
Of course, we here at DocGraph disagree with most of the AMA’s brief opposing position. Generally, what a doctor has billed to Medicare is what they have billed to Medicare. The notion that a doctor is going to “correct” the billing record is a little silly. Even if the billing record were wrong for a doctor, it’s not likely they would go and engage with CMS to fix data. OIG has already documented the degree to which doctors ignore their responsibility to update their NPPES and PECOS records, and they can go and change that data at any time. So that point is pretty silly.
Most of the other points that the AMA makes here are pretty valid, however. These criticisms are directed towards the press and the Internet community. Bad stories about doctors (as opposed to good stories about bad doctors) have destroyed many careers, and if this data is presented poorly on the Internet, it could lead patients to make poor decisions.
The journalism and blogging communities have a responsibility to treat this data, and the doctors represented therein, with respect. That means not jumping to conclusions. Using this data, it will be possible to see lots of new information about how doctors practice and how they are paid. But this data is extraordinarily complex; it will be very difficult to draw secondary conclusions from it consistently. Here are some core caveats for those seeking to work with the data set when it is released:
- The first thing to understand about the data is that it is blinded to protect patient privacy. When fewer than 11 patients were treated by a given doctor, using a given procedure, that data is withheld by Medicare. Ensuring that at least 11 patients were involved in every publicized transaction ensures that this is a data set about doctors, rather than identifiable patients. I am not sure why 11 patients is the threshold instead of 10 or 9, but that is a common standard. Someone explained it to me once… my take away from that conversation was “because math”.
- Some doctors have 90% of their patients paid for by Medicare, others have 9%. Some doctors do a lot of Medicaid (which will not be shown in this data release). Some primarily work with a single commercial payer, some work with lots of payers. You can think of all of the procedures that a doctor performs on all of his/her patients, no matter who is paying for the care, as a pie. Each source of income for the doctor is a slice of that pie.
- The way the pie is sliced, and the size of each piece going to each payer is generally referred to as the “payer mix” in the healthcare industry. Payer mix makes analyzing this data especially difficult for doctors who do not bill Medicare much. In many cases, these doctors will have trouble reaching the 11-patient-per-procedure threshold required to even have data in this release.
- For many doctors this will mean that Medicare won’t have anything to say about how they practice at all, in order to protect patient privacy. These “low billing doctors” will randomly cross the threshold of 11 patients per procedure for some procedures, but not for others. It will be very difficult to perform accurate analysis on these doctors, in terms of their practice patterns as a whole. We should be very careful to not draw any conclusions at the low end of the spectrum. That doctor who “only” performed procedure X eleven times? That probably means nothing. What the doctor is actually doing with his/her patients is just not showing up at all.
- The payer mix is further complicated because of the possibility that a doctor might only do one procedure for one payer. One can imagine a brain surgeon that only does brain surgery for Blue Cross Blue Shield patients, but does lots of “neurological consults” for Medicare patients. Typically the services offered by a doctor are relatively consistent between payers, but only “typically”, and at least some variation will be very normal.
- Doctors that have lots of Medicare business will be easier and more productive to analyze. But be careful drawing conclusions about the “top billers” too. For many procedures (especially complex surgery) there is evidence that suggests that there is a “quality threshold effect”. If a doctor/surgeon does not get enough volume in a particular procedure, then in some cases it is difficult to maintain competency in that procedure. For some procedures this really matters. For others it doesn’t matter at all.
- Because of the “payer mix problem” it will not be possible to reach a conclusion like “Surgeon A does three times as many X procedures as Surgeon B, therefore Surgeon A is probably better”. Surgeon B might be doing exactly the same number of procedures overall, but have far fewer Medicare patients. If you call both Surgeon A and Surgeon B and ask “what is your payer/procedure mix” then you have a much better chance at getting an accurate picture.
- This data should include charge data. Charge data is an interesting topic that could use some careful investigation. The charge is like an “opening offer” that a doctor makes to Medicare, and then Medicare actually pays a completely different number. Sometimes charge data is used when billing patients without insurance, and sometimes it is used to calculate patient copay. There have been a lot of interesting and detailed articles recently about the interactions between “charge” prices and what patients ultimately pay. This data should make those analyses more robust, but in reality Medicare and Medicaid have strict policies on what expenses can actually be passed back to a patient. As a result the charge data might have more implications for those who are on commercial plans.
- Remember that this is not all of Medicare. Medicare Advantage plans are not covered here, and there are lots of people who opt for those plans.
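The suppression threshold and the payer mix problem from the caveats above can be made concrete with a small Python sketch. Everything in it is hypothetical: the two surgeons, their volumes and their Medicare shares are invented numbers, and the row layout is an assumption for illustration, not the format of the actual CMS file.

```python
SUPPRESSION_THRESHOLD = 11  # rows with fewer than 11 patients are withheld

def released_rows(rows):
    """Keep only (doctor, procedure, patient_count) rows that meet the threshold."""
    return [row for row in rows if row[2] >= SUPPRESSION_THRESHOLD]

# Two hypothetical surgeons with the same total volume of procedure X
# but a different share of patients paid for by Medicare.
total_patients = 100
medicare_percent = {"Surgeon A": 30, "Surgeon B": 10}

rows = [
    (name, "X", total_patients * percent // 100)
    for name, percent in medicare_percent.items()
]
visible = released_rows(rows)
print(visible)
```

In this made-up case both surgeons perform procedure X on 100 patients a year, yet the release shows Surgeon A with 30 patients and shows nothing at all for Surgeon B, whose 10 Medicare patients fall under the threshold. A naive "sort by volume" reading would conclude that Surgeon B does not perform the procedure.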
I hope I have adequately scared you away from drawing simple “sort by column Y” type conclusions on this data. Medical Billing generally and Medicare specifically are extremely complex fields and it is easy to get lost. You might be asking “well what conclusions can be drawn here?”. I think the most interesting thing about this data is that we have really never had a solid picture of what doctors actually do for a living. Most of what the average public health student learns about the healthcare system is based on hearsay. Someone they know, told someone they studied with, once took a course from a guy whose wife worked at a single Cardiology practice and saw some interesting things in the data. Only people “behind the curtain” in healthcare have ever been able to look at this data. This will not be a surprise to an EHR company, or a claims clearing house or an insurance company… but for the rest of us, for data scientists generally, this will be the most accurate picture of the healthcare system as a whole that has ever been revealed.
This data release will allow us to examine the most foundational partnership in our healthcare system, the collaboration between CMS and the AMA. I suspect that on some level, the AMA must be aware that this data release will serve as the ultimate testing ground for its CPT codeset. Currently HIPAA (yep, that healthcare privacy law) gives CMS the authority to dictate how doctors and payers communicate. This website lists their choices: X12 for bill formatting, ICD9 (becoming ICD10) for diagnostic codes, and CPT 4 for procedures. CMS mandates that everyone pay the AMA for CPT copyrights in order to gain the right to transact healthcare business. This is not just for Medicare/Medicaid, the HIPAA rule covers -any- electronic healthcare transaction, including those between doctors and third party insurance companies. The CEO of Sermo, a doctor social network at one time in partnership with the AMA, frequently rants against the way that the AMA uses its control of CPT codes to bully doctors, insurance companies and even CMS.
For the first time ever, it will now be possible to ask the fundamental question: Are CPT codes working? Is the AMA living up to the monopoly status granted it by the Federal Govt? Using the data from this release, as well as publicly verifiable data analysis methods, we can finally start to tease apart this question. Never before has the effectiveness of CPT codes been subject to public scrutiny. For years, industry insiders have been railing against the CPT code approval process, as well as the controversial Relative Value Units (RVU) process, which ultimately dictates how much a medical procedure is worth (through the lens of the CPT encoding of that procedure). I cannot imagine that anyone other than the AMA believes that the CPT/RVU scheme is actually working well, but for the first time, we can quantify and categorize the problem. That is the first step on the road to something better.
Part of the mission of The DocGraph Journal is to support Healthcare Journalists as they write data-driven stories about the healthcare system. As a result, if you are a healthcare journalist and you are planning on doing a story on this data, I would be happy to provide you with a free copy of my book Hacking Healthcare. David Uhlman (my coauthor) and I both have an extensive background in Medical Billing and chapters 1, 2, 3 and 10 of the book cover medical billing concepts carefully. If you would like a copy, please send me a shout on Twitter. If you are not a Healthcare Journalist, I will see if I can get the O’Reilly folks to offer a sale on my book so you can save a few bucks. If you do not already know what “CPT, HCPCS and ICD9” mean… then you are really going to be in over your head when looking at this data set.
More importantly, the entire DocGraph community is willing to help Journalists to make better stories with this data release. If you have a question about a particular provider, or about the data regarding a whole city or state, feel free to join the DocGraph mailing list and ask a question. We are here to help.
HHS is taking a huge risk in releasing this data in this manner. I don’t think that the AMA, as a whole, is against data transparency, but there are certainly detractors in that organization. If this data comes out and people publish a lot of half-baked stories or blog posts based on this data, it is going to give the “data hoarders” within the AMA and other organizations ammunition to prevent further data releases. The downside of transparency is that it creates the opportunity for careless slander. Let’s not do that. If you need help to write a good story about this, get in touch with me, and I will help. I would be happy to provide quotes, but also to help “sanity-check” conclusions. Believe me, there will be plenty of good-old-fashioned dirt that gets revealed by this data release, there is no reason to manufacture drama prematurely.
Update: Charles Ornstein wrote a piece over at Covering Health about this issue.
The DocGraph Journal is sponsoring Cajun Codefest 2014 (April 23-25 in Lafayette, LA) and simultaneously holding a virtual codeathon focused on the recently released DocGraphRX – Medicare D prescribing data. Since our friends in Louisiana themed this year’s event “Aging in Place” and we’ve spoken with dozens of community members about DocGraphRX in the last few months, we thought the combined opportunity could not be better. Further, we’re providing $2500 in cash prizes for the virtual and in-person competitions. Each team will receive access to the prescribing patterns for physicians in Louisiana.
Register today to be included in the activities leading up to and during the DocGraph challenge!
Follow the event @cajuncodefest #ccf3…
Note: Registration below is for the DocGraph virtual challenge only. If you would like to also register for the Cajun Codefest main event please visit cajuncodefest.org.