Check out Janos Hajagos’ post “Visualizing the DocGraph for Wyoming Medicare Providers”. Beautiful work Janos!
Today, word came out that NY released taxi data that has been entirely reidentified.
The result is that the identities and paths of specific named taxi cabs are now public information. This is not entirely bad, since the data set will now be used extensively to detect specific bad actors. Still, it was more than the NY government intended and will probably result in a lawsuit.
That lawsuit will be mostly justified, since it is well understood among security professionals how to do de-identification right, and the rules were not followed. If you are doing this with health data, I can recommend fellow O’Reilly author Khaled El Emam, who wrote both Anonymizing Health Data and Guide to the De-Identification of Personal Health Information. You can hire him through Privacy Analytics. He is the de-identification expert that I know best and can endorse, but he is far from the only one.
Generally, hashing can be a reasonable approach as long as salts are used in combination with a secure hash algorithm. I prefer to use a different salt for every id, which makes a rainbow table attack (like this one) pretty hard to do.
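To make the salting point concrete, here is a minimal Python sketch. The four-digit identifier space, the function names, and the choice of SHA-256 are illustrative assumptions, not details of the actual taxi release. It shows that an unsalted hash over a small id space can be reversed by hashing every candidate, while a secret per-id salt defeats that precomputed table.

```python
import hashlib
import secrets


def unsalted_hash(identifier: str) -> str:
    """Hash an identifier with no salt (the risky approach)."""
    return hashlib.sha256(identifier.encode("utf-8")).hexdigest()


def salted_hash(identifier: str, salt: bytes) -> str:
    """Hash an identifier prefixed with a secret salt."""
    return hashlib.sha256(salt + identifier.encode("utf-8")).hexdigest()


# Hypothetical id space: every 4-digit number. Small spaces like this
# are what make precomputed lookup tables practical for an attacker.
candidates = [f"{n:04d}" for n in range(10000)]
lookup = {unsalted_hash(c): c for c in candidates}

# An attacker who sees the unsalted hash recovers the original id.
released = unsalted_hash("1234")
recovered = lookup[released]

# One random salt per id, kept private by the data publisher.
_salts = {}


def pseudonymize(identifier: str) -> str:
    """Return a stable pseudonym using a different secret salt per id."""
    if identifier not in _salts:
        _salts[identifier] = secrets.token_bytes(16)
    return salted_hash(identifier, _salts[identifier])


print(recovered)                            # the unsalted id is exposed
print(lookup.get(pseudonymize("1234")))     # the salted one is not in the table
```

The salt table has to stay private and must never ship with the release. If it leaks, the attacker can simply rebuild the lookup with the salts included.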
More importantly, it is also entirely appropriate to simply use a randomly generated number instead of a hash. Hashes are convenient when you need to rely on a dynamic and extensible process, rather than static data. Hashing also allows you to throw away the original data and know that you can reliably repeat the process given new data. That is why it is used so frequently in password storage.
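The random-number alternative can be sketched in a few lines of Python. The class name and the 16-character token length are assumptions made for illustration. The trade-off described above is visible here: the mapping itself is the static data you have to keep if you want to process new records consistently later.

```python
import secrets


class SurrogateMap:
    """Assign each real identifier a random token unrelated to its value."""

    def __init__(self) -> None:
        self._forward = {}   # real id -> random token (keep this private)
        self._used = set()   # tokens already handed out

    def surrogate(self, identifier: str) -> str:
        """Return the token for an id, creating one on first sight."""
        if identifier not in self._forward:
            token = secrets.token_hex(8)
            # Collisions are very unlikely, but retry so tokens stay unique.
            while token in self._used:
                token = secrets.token_hex(8)
            self._used.add(token)
            self._forward[identifier] = token
        return self._forward[identifier]


mapping = SurrogateMap()
first = mapping.surrogate("cab-0001")    # hypothetical identifiers
again = mapping.surrogate("cab-0001")
other = mapping.surrogate("cab-0002")
print(first == again, first != other)
```

Because the token is not derived from the identifier at all, there is nothing for an attacker to enumerate. The cost is that losing the mapping means you cannot link future releases to this one.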
Unfortunately, this will have a chilling effect on open data releases, but I am glad it happened. This is a relatively unimportant data set. Which is to say, this could have been much worse. This could have happened with patient data. I work with stuff like HIV and TB infection data, as well as EHR notes containing infidelities etc. I hate to say it, but it’s better for governments to learn on taxi cabs.
Lastly, I would encourage those who are considering doing data releases like this to reach out to organizations like Propublica and/or DocGraph. If you cannot afford to hire Khaled, we can at least help to ensure that you avoid the basic mistakes. Believe it or not, data journalists like myself are not interested in violating legitimate privacy rights (although we can have a healthy debate around the word “legitimate”) and we would be more than happy to help ensure that a data release is free from reidentification drama.
Part of me wonders why they didn’t just release the taxi data with the taxi numbers intact. I strongly prefer real-name accountability in data sets like this. It might be because by learning the identity of the taxi, you might be able to infer the identity of the passenger, who has a legitimate privacy concern.
Accidents like this will happen, and NY was right to make a release rather than hold back a release because there “might” be a way to reidentify a data set. My hat is off again to NY state/city… innovators in open data.
The DocGraph Alliance is a new group of organizations committed to supporting data journalism and data science community efforts. Three global leaders in healthcare, athenahealth, CareSet, and Merck (known as MSD outside the United States and Canada), have signed on as founding members of the Alliance.
The DocGraph Alliance’s community mission is to encourage an ecosystem of innovators to collaborate and share tools and research methodologies around open healthcare datasets. This Alliance will help further develop technical analysis and methods around data released by federal, county, and state entities, as well as those originated by the community.
The DocGraph Alliance is a project of The DocGraph Journal, which shares data with a community of quantitatively minded professionals who mine publicly available clinical datasets to uncover interesting and meaningful insights. Support from the Alliance members means the DocGraph Journal can continue providing support for the growing community of data scientists focused on leveraging initiatives of transparency in healthcare. As a result of the community’s work, specific news coverage has incorporated DocGraph data, including work from US News, Propublica and Kansas City Star.
“The DocGraph project created a platform for data scientists to collaborate openly on publicly available health data sources where nothing existed before”, said James Ciriello, Associate Vice President of Merck IT Strategy and Innovation, “and as we watched this community become more and more active in trying to address significant problems, we wanted to support it and help it grow. As publicly available healthcare data continues to grow at a fast pace, coordination and comparatives of care become commonplace, and insights on therapy start to drive novel innovation.”
“We are thrilled to partner with the DocGraph Alliance. Fred Trotter in particular has taken on ambitious and important work to socialize open data assets in healthcare and to leverage data in meaningful ways to advance the industry,” said Todd Rothenhaus, chief medical officer, athenahealth, Inc. “At athenahealth, we believe healthcare could benefit from more data openness and transparency. Access to expanded and new types of data through the DocGraph Alliance will support our work to improve our cloud-based services and further innovate based on evidence-based insights and industry trends.”
“Our business, as well as countless others, relies on the availability of Open Healthcare datasets. Our healthcare system modeling tools improve with every Open Data release,” said Ashish Patel, founder of CareSet Systems. “We want to ensure that DocGraph continues to flourish! The healthcare system needs a cadence of Open Data in order to effectively pursue the Triple Aim.”
DocGraph will work to grow and nurture an open community of data professionals through a series of trainings and events with a focus on further use of open health datasets and development of new methods and tools to analyze those datasets.
About The DocGraph Journal
The DocGraph Journal seeks to create and disseminate new open healthcare data sets, and to foster a community of data scientists who contribute tools and expertise to the analyses of open healthcare data. The Journal was founded after Fred Trotter’s crowdfunding of the first DocGraph data set demonstrated a demand for open healthcare data. The original data set, created from a FOIA request, showed how physicians and other healthcare providers collaborate to deliver care to Medicare patients. This original DocGraph data set remains the largest real-name social graph available to the public.
Alma Trotter, email@example.com
Fred Trotter, co-founder of The DocGraph Journal, will be speaking at Health Datapalooza in D.C. this week! Health Datapalooza is a national conference focused on liberating health data, and bringing together the companies, startups, academics, government agencies, and individuals with the newest and most innovative and effective uses of health data to improve patient outcomes.
2:45 – 3pm
4:15-5:45pm Tech Track:
1:30-2:30pm Consumer Track:
3-4pm Tech Track
To catch everyone up, here is the brief sequence of events:
With that in mind, we are happy to announce that DocGraph Omni has been selected as one of the winning entries into the Code-a-Palooza!
Now it’s time for a look at the competition!
Lots of Big Data expertise, experience designing software, even AI and robotics experience. You can expect some crazy good applications and a fierce competition. Which we plan on winning.
The new procedure data set was just uploaded by CMS. In major points for style, they released the data at 12:01 am.
You are probably wondering: How do I work with this data? If you are comfortable with MySQL or MariaDB this will help:
Unless you have been living under a rock (or you are just not “into” healthcare data journalism) you know that CMS is planning on releasing a massive data set about how doctors provide healthcare to Medicare patients (of course patient privacy will be protected).
This is a very exciting day. The Obama administration, HHS and CMS should all be applauded for taking Obama’s commitment to open government seriously! They have already taken heat, and will continue to take it, from doctors who believe that the data will be used to hurt them. The AMA had a press release opposing the data drop, which they hastily removed (but which is still available on Google’s cache). This is what they originally had to say about the matter:
“The American Medical Association (AMA) is committed to transparency and supports the release of physician data to improve quality of care. However, we also believe that certain safeguards are necessary to ensure that information is accurate and reliable for patients and other stakeholders.
“The AMA is concerned that CMS’ broad approach to releasing physician payment data will mislead the public into making inappropriate and potentially harmful treatment decisions and will result in unwarranted bias against physicians that can destroy careers. We have witnessed these inaccuracies in the past.
“To guarantee that information is accurate, complete, and helpful, the AMA strongly recommends that physicians be permitted to review and correct their information prior to the data release. This safeguard is not only practical but was recognized and included in other data release proposals, including bi-cameral and bi-partisan legislation supported by the AMA. Additionally, any analysis of the data released should note methodologies to ensure understanding of its limitations.
“Taking an approach that provides no assurances of accuracy of the data or explanations of its limitations will not allow patients to draw meaningful conclusions about the quality of care.”
Ardis Dee Hoven, M.D.
President, American Medical Association
Now the AMA has reversed course. They have removed the above press release from their news site, and an anonymous official has apparently spoken with an Associated Press reporter indicating that the AMA will not seek to enjoin the release of the data. For the AMA “not officially not opposing” something is as close to an endorsement as it gets. I expect that they will have something permanent to say on the matter pretty soon, and I will link to that here once they do.
Of course, we here at DocGraph disagree with most of the AMA’s brief opposing position. Generally, what a doctor has billed to Medicare is what they have billed to Medicare. The notion that a doctor is going to “correct” the billing record is a little silly. Even if the billing record were wrong for a doctor, it’s not likely they would go and engage with CMS to fix data. OIG has already documented the degree to which doctors ignore their responsibility to update their NPPES and PECOS record, and they can go and change that data at any time. So that point is pretty silly.
Most of the other points that the AMA makes here are pretty valid, however. These criticisms are directed towards the press and the Internet community. Bad stories about doctors (as opposed to good stories about bad doctors) have destroyed many careers, and if this data is presented poorly on the Internet, it could lead patients to make poor decisions.
The journalism and blogging communities have a responsibility to treat this data, and the doctors represented therein, with respect. That means not jumping to conclusions. Using this data, it will be possible to see lots of new information about how doctors practice and how they are paid. But this data is extraordinarily complex; it will be very difficult to draw secondary conclusions from it consistently. Here are some core caveats for those seeking to work with the data set when it is released:
I hope I have adequately scared you away from drawing simple “sort by column Y” type conclusions on this data. Medical Billing generally and Medicare specifically are extremely complex fields and it is easy to get lost. You might be asking “well what conclusions can be drawn here?”. I think the most interesting thing about this data is that we have really never had a solid picture of what doctors actually do for a living. Most of what the average public health student learns about the healthcare system is based on hearsay. Someone they know told someone they studied with, who once took a course from a guy whose wife worked at a single Cardiology practice and saw some interesting things in the data. Only people “behind the curtain” in healthcare have ever been able to look at this data. This will not be a surprise to an EHR company, or a claims clearing house or an insurance company… but for the rest of us, for data scientists generally, this will be the most accurate picture of the healthcare system as a whole that has ever been revealed.
This data release will allow us to examine the most foundational partnership in our healthcare system, the collaboration between CMS and the AMA. I suspect that on some level, the AMA must be aware that this data release will serve as the ultimate testing ground for its CPT codeset. Currently HIPAA (yep, that healthcare privacy law) gives CMS the authority to dictate how doctors and payers communicate. This website lists their choices: X12 for bill formatting, ICD9 (becoming ICD10) for diagnostic codes, and CPT 4 for procedures. CMS mandates that everyone pay the AMA for CPT copyrights in order to gain the right to transact healthcare business. This is not just for Medicare/Medicaid, the HIPAA rule covers -any- electronic healthcare transaction, including those between doctors and third party insurance companies. The CEO of Sermo, a doctor social network at one time in partnership with the AMA, frequently rants against the way that the AMA uses its control of CPT codes to bully doctors, insurance companies and even CMS.
For the first time ever, it will now be possible to ask the fundamental question: Are CPT codes working? Is the AMA living up to the monopoly status granted it by the Federal Govt? Using the data from this release, as well as publicly verifiable data analysis methods, we can finally start to tease apart this question. Never before has the effectiveness of CPT codes been subject to public scrutiny. For years, industry insiders have been railing against the CPT code approval process, as well as the controversial Relative Value Units (RVU) process, which ultimately dictates how much a medical procedure is worth (through the lens of the CPT encoding of that procedure). I cannot imagine that anyone other than the AMA believes that the CPT/RVU scheme is actually working well, but for the first time, we can quantify and categorize the problem. That is the first step on the road to something better.
Part of the mission of The DocGraph Journal is to support Healthcare Journalists as they write data-driven stories about the healthcare system. As a result, if you are a healthcare journalist and you are planning on doing a story on this data, I would be happy to provide you with a free copy of my book Hacking Healthcare. David Uhlman (my coauthor) and I both have an extensive background in Medical Billing, and chapters 1, 2, 3 and 10 of the book cover medical billing concepts carefully. If you would like a copy, please send me a shout on Twitter. If you are not a Healthcare Journalist, I will see if I can get the O’Reilly folks to offer a sale on my book so you can save a few bucks. If you do not already know what “CPT, HCPCS and ICD9” mean… then you are really going to be in over your head when looking at this data set.
More importantly, the entire DocGraph community is willing to help Journalists to make better stories with this data release. If you have a question about a particular provider, or about the data regarding a whole city or state, feel free to join the DocGraph mailing list and ask a question. We are here to help.
HHS is taking a huge risk in releasing this data in this manner. I don’t think that the AMA, as a whole, is against data transparency, but there are certainly detractors in that organization. If this data comes out and people publish a lot of half-baked stories or blog posts based on this data, it is going to give the “data hoarders” within the AMA and other organizations ammunition to prevent further data releases. The downside of transparency is that it creates the opportunity for careless slander. Let’s not do that. If you need help to write a good story about this, get in touch with me, and I will help. I would be happy to provide quotes, but also to help “sanity-check” conclusions. Believe me, there will be plenty of good-old-fashioned dirt that gets revealed by this data release, there is no reason to manufacture drama prematurely.
The DocGraph Journal is sponsoring Cajun Codefest 2014 (April 23-25 in Lafayette, LA) and simultaneously holding a virtual codeathon focused on the recently released DocGraphRX – Medicare D prescribing data. Since our friends in Louisiana themed this year’s event “Aging in Place” and we’ve spoken with dozens of community members about DocGraphRX in the last few months, we thought the combined opportunity could not be better. Further, we’re providing $2500 in cash prizes for the virtual and in-person competitions. Each team will receive access to the prescribing patterns for physicians in Louisiana.
Register today to be included in the activities leading up to and during the DocGraph challenge!
Follow the event @cajuncodefest #ccf3…
Note: Registration below is for the DocGraph virtual challenge only. If you would like to also register for the Cajun Codefest main event please visit cajuncodefest.org.
After many months of mild government employee harassment, and delays based mostly on “other projects” that HHS has been working on, (have I ever told you how many of my friends at HHS were pulled on to the healthcare.gov launch… almost all of them), I am proud to announce that a new, improved update of the DocGraph Edge data set has been released.
This is not just an update, but a dramatic improvement in what data is available. After working with the original DocGraph, we thought of several fundamental improvements, and our FOIA request was much meatier this time. If you liked DocGraph before, you are going to love it now.
First, let’s talk about the contents of the data. The original data set had three columns:
FirstNPI, SecondNPI, SharedTransactionCount
The SharedTransactionCount was the number of times that FirstNPI had seen a given patient first, and SecondNPI had seen the same patient later, within a 30 day window. (If that is tough to follow, you can read the full documentation of the original version of the data set). SharedTransactionCount was a measure of overlapping patient transactions, but we did not know how many patients were included. There was a threshold of patient count of at least 11 patients that had to be met. So if the SharedTransactionCount was 1100 there was no way to know if that meant 1100 patients, or 11 patients 100 times each. At least, that’s how the previous data set worked.
The new data set includes the actual number of patients in the patient sharing relationship. The new data set has the following data structure:
FirstNPI, SecondNPI, SharedTransactionCount, PatientTotal, SameDayTotal
The PatientTotal field is the total number of the patients involved in a treatment event (a healthcare transaction), which means that you can now tell the difference between high transaction providers (lots of transactions on few patients) and high patient flow providers (a few transactions each but on lots of patients).
In the original data set, you knew that the two treatment events happened somewhere between “on the same day” and within 30 days. In this new data set, you can differentiate treatment events that happened on the same day, using the SameDayTotal field. Now you can see how often the services were provided on the same day, which is really a whole new graph, with a 0-day window.
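Here is a minimal Python sketch of how the new columns might be used. It assumes the file is comma-separated with no header row and the five columns in the order listed above; the NPIs and counts in the sample are made up for illustration, echoing the 1100-transaction example from earlier. The `Edge` class and `read_edges` helper are hypothetical names, not part of any DocGraph tooling.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Edge:
    first_npi: str
    second_npi: str
    shared_transaction_count: int
    patient_total: int
    same_day_total: int

    @property
    def transactions_per_patient(self) -> float:
        """High values mean many transactions on few patients."""
        return self.shared_transaction_count / self.patient_total


def read_edges(handle):
    """Yield Edge records from a headerless five-column CSV (assumed layout)."""
    for row in csv.reader(handle):
        first, second, shared, patients, same_day = row
        yield Edge(first, second, int(shared), int(patients), int(same_day))


# Fabricated sample rows: same transaction count, very different patient flow.
SAMPLE = """\
1000000001,1000000002,1100,11,0
1000000003,1000000004,1100,1100,40
"""

edges = list(read_edges(io.StringIO(SAMPLE)))

# Edges with any same-day activity form the "0-day window" graph.
same_day_edges = [e for e in edges if e.same_day_total > 0]

for e in edges:
    print(e.first_npi, e.second_npi, e.transactions_per_patient)
```

The first sample row is a high transaction relationship (11 patients, 100 transactions each) and the second is a high patient flow relationship (1100 patients, one transaction each), which the original three-column data could not tell apart.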
But wait…there’s more!! We also got additional “windows” beyond 30 days. We have data for 60, 90, 180 and 365 day windows. These data sets are much larger. The data is spread between 2012 and the middle of 2013 (which is not actually what we asked for, but we will take it). These data sets are enormous:
|Window|Size|
|---|---|
|30 day|73 Million Edges|
|60 day|93 Million Edges|
|90 day|107 Million Edges|
|180 day|132 Million Edges|
|365 day|154 Million Edges|
This means that for every edge in the database, we now have three weights instead of just one, and we have more than double the number of edges in our largest-window data set. I look forward to the DocGraph community doing a much more detailed analysis of this data set.
Probably the most significant announcement that we have to make is that we are releasing this data set for free and without any restriction. We have started a new DocGraph Alliance in which large companies pay the DocGraph Journal to reveal new and more interesting data sets, and to support the DocGraph community in analyzing open data. We will probably still crowdfund data sets when they are “brand new” but for older datasets like the DocGraph Edge data set, we are moving towards a fully open model, sponsored by Alliance Members. With that in mind, we asked HHS to go ahead and publish the newest version of DocGraph data directly on their site for everyone to see. This means that the data can be used by anyone, for any reason, without a license restriction. We hope to be announcing the initial DocGraph Alliance members soon, but you can thank them for sponsoring this model!
With that in mind, please find the data below:
What kinds of amazing things can you do with this new dataset?
Let us know on the DocGraph Community Google Group.