DocGraph Edge Data Documentation
If you have questions about this documentation, please do not hesitate to bring them up on the DocGraph Mailing List.
DocGraph is very likely the largest open and real-named Social Graph of any kind. There are almost 1 million entities (either individuals or organizations) that appear in the 2011 data. Each of these entities is either a specific person or organization that provides health care services to Medicare patients. Of course, the graphs found in the Facebook, Zynga, Twitter and LinkedIn datasets are far more expansive, but they are also the closely held property of those companies. When other organizations interact with those graphs, they are given access to only slivers of the whole dataset. If you are not an employee of one of the previously named companies, this is probably the largest named graph that you will have access to. (Let me know if you know of an open real-named graph dataset that is bigger …)
This data is keyed using the National Provider Identifier (NPI). This is a universal identifier for doctors and hospitals and was mandated for the purpose of medical billing by HIPAA as a replacement for the UPIN system. (HIPAA did a lot of administrative things beyond patient privacy regulations.) Any doctor who bills Medicare, or prescribes medication, must have an NPI number. The release and adoption of the NPI was an important component of health care reform in its own right, since the NPI was intended to ensure that doctors had only one identifier rather than maintaining a separate identifier for each insurance company that they billed. This number is basically the equivalent of a social security number for doctors. It would be fairly difficult for a health care provider to provide care without one, and as a result they are fairly ubiquitous. I have been working on NPI data for years, at the prompting of the folks from NPIdentify. It is a rich and messy dataset all on its own!
The core NPI database is already an open dataset. You can use the government’s NPI search tool, but it sucks. So I built a better one (at DocNPI.com). If you are a health care provider, you can update your NPI record here. This is a good time to remind doctors that it is a bad idea to list your home address in the NPI data, because the entire NPI database is public information. You can download the core NPI release file here. For years the NPI database was updated on a monthly basis, but now it is updated weekly. There are two parts to any graph data set: nodes and edges. Our data set is the edge database and the NPPES download is the node database. Sometimes working with the entire node database is bothersome (it is too big to work with in Gephi, for instance). If you would like the list of NPIs filtered by state, we sell these to help support our project. Click here to buy state-level NPI databases.
The DocGraph dataset is fairly large, with exactly 49,685,586 pairs of referring parties. Of course, even with this many links, the actual dataset that I will be providing is relatively small. The 2011 file is 1.3 GB, which includes about 1 million providers participating in the graph at least once. To provide context, there are about 3.7 million entries in the core NPI file.
I sometimes call the DocGraph dataset the “referral” dataset, and interactions that are traditionally understood as referral relationships make up the bulk of the data. But strictly speaking the data should be considered a “teaming” dataset, which shows when providers work on the same group of patients within the same time frame. I frequently refer to these teaming relationships as “referrals” because A. they usually are referrals, and B. this is easier to say than “participating in the same teaming coupling instance.”
Specifically, the data represents the number of times that two providers billed Medicare for the same patient within a sliding 30-day window, where at least 11 patients were involved in the transaction. If Provider A sees a patient on January 15, and Provider B sees the same patient on February 15, then that counts as “+1.”
In order to ensure that this dataset does not provide any data about patients, we enforce a minimum number of patients involved in a given teaming relationship. So for every entry in this dataset, at least 11 patients (which is a standard for CMS somewhere) are involved. This is intended to address the Elvis problem. Imagine that everyone knows who Elvis’ doctor is and everyone knows that Elvis is that doctor’s only patient. Therefore, if Elvis’ doctor is “referring” to a cardiologist in this dataset, then everyone would know that Elvis has heart problems. This problem goes away once you require a minimum number of patients in a transaction, so 11 patients is the floor. So we know that at least 11 patients took part in any given “referral count.” Because the patient data is both de-identified and aggregated, there should be no patient privacy concerns in this dataset. (Do let me know if you find evidence that I am wrong about this.) (11/2013 Update): While it is not possible to reverse engineer which patients are involved in a particular transaction within the DocGraph data set, there is some evidence it might be possible to use DocGraph data as part of a re-identification attack on some other data set. So while there is no re-identification risk in DocGraph itself, it should be considered in the overall de-identification landscape.
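The 11-patient floor can be pictured with a small sketch. This is not the code used to build DocGraph; it assumes a hypothetical list of (first NPI, second NPI, patient id) teaming events, and all NPIs and patient ids below are made up for illustration.

```python
from collections import defaultdict

MIN_PATIENTS = 11  # the floor described in the text


def build_edges(teaming_events, min_patients=MIN_PATIENTS):
    """Aggregate (first_npi, second_npi, patient_id) events into scored edges.

    Pairs involving fewer than `min_patients` distinct patients are dropped,
    and only the score (never the patient count) is kept for the rest.
    """
    scores = defaultdict(int)
    patients = defaultdict(set)
    for first_npi, second_npi, patient_id in teaming_events:
        pair = (first_npi, second_npi)
        scores[pair] += 1
        patients[pair].add(patient_id)
    return {
        pair: score
        for pair, score in scores.items()
        if len(patients[pair]) >= min_patients
    }


# Hypothetical events: one pair shares 11 patients, the other shares only "elvis".
events = [("1112223334", "2223334445", f"patient-{i}") for i in range(11)]
events += [("3334445556", "2223334445", "elvis")] * 5
edges = build_edges(events)
```

In this sketch the single-patient pair never reaches the output, no matter how many times that one patient was seen.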
To further protect patient privacy we cannot tell anything about the number of patients involved beyond the fact that there were at least 11 involved in a given provider-provider relationship. If the same patient sees Doctor A on January 15, Doctor B on February 15 and then again on June 15 and July 15, then that counts as “2” referrals in this dataset. When a referral relationship has a score of 1,100 we cannot know if this was 11 patients with 100 referral instances each, or 1,100 patients with 1 referral instance each, or 10 patients with 10 referral instances each and 1 patient with 1,000 referral instances. The whole point here is that we have a score that approximates the strength of the relationship between two entities in the NPI database, and for that purpose it does not really matter what kind of patient flow is being indicated.
Entries in the DocGraph edge dataset take the form of two NPI numbers followed by a score.
The first number is the NPI of the entity that saw the patient first in time, the second number is the NPI of the entity that saw the patient second in time, and the score is the number of times this happened within a 30-day period during a given year (2011).
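As a minimal sketch of reading such entries, assuming the file is comma-separated with no header row (an assumption, so check the actual download), something like the following would work. The `Edge` type, the `read_edges` helper, and the sample rows are hypothetical.

```python
import csv
import io
from typing import Iterator, NamedTuple


class Edge(NamedTuple):
    first_npi: str   # entity that saw the patient first in time
    second_npi: str  # entity that saw the patient second in time
    score: int       # number of times this happened during the year


def read_edges(lines) -> Iterator[Edge]:
    """Yield one Edge per three-column row, skipping blank or malformed rows."""
    for row in csv.reader(lines):
        if len(row) != 3:
            continue
        first_npi, second_npi, score = (field.strip() for field in row)
        yield Edge(first_npi, second_npi, int(score))


# Made-up rows standing in for the real file.
sample = io.StringIO("1112223334,2223334445,120\n\n2223334445,1112223334,15\n")
edges = list(read_edges(sample))
```

NPIs are kept as strings here because they are identifiers, not quantities. For the full 1.3 GB file, iterating over an open file handle rather than building a list keeps memory use flat.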
I have uploaded the entire referral graph for the Methodist Hospital in Houston, Texas to Pastebin as an example of what you can find in the larger file.
Usually, patients go see a primary care doctor and then get referred to specialists. Usually this translates to the primary care doctor being seen first between the two doctors. But frequently a specialist will be seen and then a patient will return to a primary care provider. In fact, it might be possible to use the relative “directionality” of the graph to automatically guess which provider was the primary care provider and which was the specialist. For instance, if 1112223334 appears as the first provider far more often than as the second provider in its pairing with another NPI, it might be reasonable to assume that 1112223334 was a primary care provider.
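The directionality guess could be sketched as below. This is only a heuristic under stated assumptions: the scores are invented, the second NPI is hypothetical, and `likely_primary_care` is an illustrative helper, not part of any DocGraph tooling.

```python
def likely_primary_care(edges, npi_a, npi_b):
    """Return the NPI that is more often seen first, or None when tied.

    `edges` maps (first_npi, second_npi) to the score for that direction.
    """
    a_first = edges.get((npi_a, npi_b), 0)
    b_first = edges.get((npi_b, npi_a), 0)
    if a_first == b_first:
        return None
    return npi_a if a_first > b_first else npi_b


# Invented scores: 1112223334 is seen first far more often than second.
edges = {
    ("1112223334", "2223334445"): 120,
    ("2223334445", "1112223334"): 15,
}
guess = likely_primary_care(edges, "1112223334", "2223334445")
```

A missing direction is treated as a score of zero, which also covers the case where the reverse pair fell under the 11-patient floor and was never published.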
Doctor and organization types
Beyond just being a “wide” graph with lots of nodes with real-named entities, this dataset is incredibly deep. It is deep because the core NPI public release file contains a tremendous amount of detailed information, which is usually even right!
The first thing that the NPI file contains is at least one and possibly many different types of provider-type taxonomy. These provider types are coded in a provider ontology maintained by the American Medical Association’s National Uniform Claim Committee. (I would like to thank the members for performing this usually thankless task. Committee participation gives me a headache.) You can download this Health Care Provider Taxonomy, or you can browse it online.
The good news here is that the NPI database uses a provider-type taxonomy. However, there is little justification that this should be a “tree” style taxonomy. The assumption in a hierarchical taxonomy is that leaves can only have one “parent.” This means that a given “doctor type” can be listed under either the “cardiology” group or the “pediatrics” group, but not both. As a result there are lots of very arbitrary groupings for doctor types here. Doctors cannot find a sensible way to navigate this “tree” style taxonomy when they sign up. Since they have to choose something, they usually get at least one “type” correct, but often this database does not correctly represent the breadth of a given doctor’s actual specializations.
Still, it is a good assumption that the provider taxonomy field in the core NPI file is usually correct, and as a result, it is possible to distinguish effectively between hospitals, primary care doctors, specialist types, and laboratories in the referral dataset. In fact, one of the most frequent “referrals” in the data is the referral to get lab work done at LabCorp, Quest or one of the local lab providers. Referrals to hospital emergency departments (which are not referrals at all, of course) and treatment facilities like DaVita are also very common. I have uploaded the top 100 organizations by the number of entries that they have in the dataset to Pastebin so that you can clearly see the types of relationships that will be most common in the data. (11/2013 UPDATE) Ryan Weld has taken this to the next level with his analysis of “What doctor will you see next.”
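A ranking like the “top 100 by number of entries” list can be approximated with a simple count. The sketch below assumes that an “entry” means any row in which an NPI appears on either side, which may not be how the Pastebin list was built. The NPIs are hypothetical, and restricting the result to organizations would additionally require joining against the node file.

```python
from collections import Counter


def top_by_entries(edges, n=100):
    """Rank NPIs by the number of edge rows they appear in (either side)."""
    counts = Counter()
    for first_npi, second_npi, _score in edges:
        # A set avoids double counting a row that pairs an NPI with itself.
        for npi in {first_npi, second_npi}:
            counts[npi] += 1
    return counts.most_common(n)


# Hypothetical rows: 9990000001 stands in for a large lab provider.
sample_edges = [
    ("1112223334", "9990000001", 50),
    ("2223334445", "9990000001", 30),
    ("3334445556", "9990000001", 12),
    ("1112223334", "2223334445", 20),
]
top = top_by_entries(sample_edges, n=1)
```

Note that this counts rows, not scores. Summing the score column instead would rank by referral volume rather than by number of relationships.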
The core NPI database also contains two addresses for each NPI record, the practice location address and a mailing address. I have done queries against the Open Street Map database and about 80% of the addresses are already coded to latitude and longitude. There are zip codes that can be used to detect general location for the other 20%.
This means that it is going to be possible to run all kinds of geo-data queries against this dataset. There are all kinds of other geo databases that can be overlaid against this referral database in order to reach interesting conclusions. You could easily, for instance, study referrals to allergy doctors in relationship to geo-recorded air quality scores. Let me know if you make some pretty maps and I will try to give you a shout out on Twitter or on my blog.
We have been hesitant to go too deeply into the process of mapping addresses for the node data set, because we know that many of the addresses in the data are not accurate. In fact the Office of the Inspector General in the United States has issued a report indicating that 34% of all of the address records were inaccurate. The full report is titled, in typical understated government style, “Improvements Are Needed To Ensure Provider Enumeration and Medicare Enrollment Data Are Accurate, Complete, and Consistent.” Here is the summary page for the report.
Now that this has been quantified to this degree, we should advocate a “data scientists beware” label on all of the address related data in the NPI data set.
Quality and performance data for individual doctors is pretty hard to come by. However, there has been an explosion in the availability of hospital data that details how hospitals perform on critical issues like readmission rates and central line infection rates. Frequently this quality data is coded natively using NPIs, and when the NPI is not directly available it is usually pretty simple to convert the data to an NPI-coded form.
This hospital data will be part of what we are trying to merge together for our improved DocGraph project.
There are lots of interesting questions that you can ask regarding this dataset. For instance, using the DocGraph, you can determine which cardiologists are referring to hospitals with poor central line infection rates.
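A question like this comes down to a join between the edge data, provider types, and hospital quality scores. The sketch below is only an illustration: the provider-type labels, the infection-rate figures, the threshold, and every NPI are hypothetical, and real data would use taxonomy codes from the NPI file rather than a plain “cardiologist” label.

```python
def cardiologists_referring_to_flagged_hospitals(
    edges, provider_type, infection_rate, threshold
):
    """Map each cardiologist NPI to the (hospital NPI, score) pairs where the
    hospital's infection rate is above `threshold`."""
    flagged = {}
    for first_npi, second_npi, score in edges:
        if provider_type.get(first_npi) != "cardiologist":
            continue
        rate = infection_rate.get(second_npi)
        if rate is not None and rate > threshold:
            flagged.setdefault(first_npi, []).append((second_npi, score))
    return flagged


# All values below are invented for illustration.
edges = [
    ("1112223334", "7770000001", 40),  # cardiologist -> hospital with high rate
    ("1112223334", "7770000002", 25),  # cardiologist -> hospital with low rate
    ("2223334445", "7770000001", 60),  # not a cardiologist
]
provider_type = {"1112223334": "cardiologist", "2223334445": "primary care"}
infection_rate = {"7770000001": 2.4, "7770000002": 0.6}
result = cardiologists_referring_to_flagged_hospitals(
    edges, provider_type, infection_rate, threshold=1.0
)
```

Hospitals with no quality score are skipped rather than treated as good or bad, since missing data says nothing either way.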
State level credentialing data
Every state has a state level medical board that releases data on individual doctors. This data usually includes what medical school a doctor attended, what board certifications they maintain, and any disciplinary actions that the state board has undertaken against this doctor. Unfortunately this data is expensive (between $50 and $500 per state) and rarely coded using an NPI.
This is the largest single disconnected data source from the current NPI database. Buying and normalizing this data is the first goal of our Next Level crowdfunding effort.
Using this data it would be simple to determine if attending the same medical school was an important part of how doctors refer. It will also be possible to determine if board certification has an impact on referral patterns.
I have just discovered through BoingBoing that resource.org will be performing data extraction on its enormous cache of non-profit tax filings. Once it is possible to import this as a data source, it should be possible to figure out what the NPIs for different non-profit hospital systems are. Once this is done, it would be possible to see how executive compensation works with the graph. This is not something we will be doing with our initial improvement project, but this is obviously where we would like to take this!
As with all rich datasets, this data is convoluted and can be confusing. Already we have seen patterns that do not make sense unless you understand how the data was built. This is based on administrative data and not clinical data. So this shows not how patients were “treated” together, but how they were “billed” together. There are several important artifacts to consider as the result of this.
First, individual providers frequently have two NPIs in the database: one for them as an individual and another for an organization that exists only for that individual. Seeing an NPI for “Dr. Smith” and another for “Dr. Smith, LLC” is not uncommon. As a result, a given provider will frequently not appear in the DocGraph under their individual NPI at all. Some digging is often required to determine how a given doctor is actually interacting with Medicare. Frequently, when an individual NPI and an organizational NPI share an address, this means they are working as one unit.
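The shared-address heuristic can be sketched as a grouping step. This is a rough illustration, not a recommended matching method: the record layout, the `normalize_address` rule, and every NPI and address below are hypothetical, and real addresses are messy enough that this simple normalization will miss many matches.

```python
from collections import defaultdict


def normalize_address(address):
    """Uppercase, drop commas and periods, and collapse whitespace."""
    cleaned = address.upper().replace(",", " ").replace(".", " ")
    return " ".join(cleaned.split())


def linked_npis(records):
    """Pair individual and organizational NPIs that share a normalized address.

    `records` is an iterable of (npi, entity_type, address), where entity_type
    is either "individual" or "organization".
    """
    by_address = defaultdict(lambda: {"individual": [], "organization": []})
    for npi, entity_type, address in records:
        by_address[normalize_address(address)][entity_type].append(npi)
    pairs = []
    for group in by_address.values():
        for person in group["individual"]:
            for org in group["organization"]:
                pairs.append((person, org))
    return pairs


# Made-up records: "Dr. Smith" and "Dr. Smith, LLC" at the same address.
records = [
    ("1112223334", "individual", "12 Main St., Houston, TX"),
    ("5556667778", "organization", "12 MAIN ST HOUSTON TX"),
    ("2223334445", "individual", "99 Oak Ave, Houston, TX"),
]
pairs = linked_npis(records)
```

A large building with many unrelated practices would produce false pairs under this rule, so the output is a list of candidates to check, not confirmed links.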
Sometimes, it is the organization that is hidden in the billing. We have already seen cases where a given primary care provider is referring to more than 20 different cardiologists. A little further digging shows that these cardiologists are all part of a cardiology service. Obviously the primary care doctor is referring to the cardiology service, and does not actually have relationships with the individual cardiologists. However, there is no way, using just the DocGraph, to tell the difference between a “service” where doctors bill as individuals and a group of unaffiliated doctors.
There are cases where cardiologists list themselves as cardiac surgeons in the NPI database and vice versa. Doctors frequently rely on administrative staff members to fill out the data in the NPI database, and they often get it wrong. This issue impacts the quality of addresses and countless other fields. The current NPI registration form attempts to normalize addresses, but the older forms did not. Keep in mind that this data is just as subject to user entry errors as anything else.
In many cities, a given hospital will win the contract for the main emergency room. This often makes it seem like every provider in the city has a “referral” relationship with that hospital. Of course, this is not strictly true. A good sign is a strong “referral” relationship with the local fire department: an organization that has one probably runs the local emergency room and therefore “teams” with every doctor in the city.
Because Medicare typically covers people over the age of 65, there is not very much information about doctors who exclusively treat children, or women who are having children. This means that there is a lack of data about pediatrics or ob/gyn doctors. There is also no information on doctors who do not take Medicare patients. Also there is no data in this dataset for providers who do not bill Medicare in a transactional fashion. Fully capitated plans like Kaiser Permanente will not have data in this dataset. If you put this data on a map, there should be a big hole over southern California as a result of this.
This dataset, like any messy data, should not be considered the “truth” but it can be used to help generate and reject hypotheses about how a given community delivers health care. We have lots of ideas about how to make the data more reliable and more accessible, but in order to do that, we need your help paying for improvements to the data. Even if you do not yourself want to have access to richer data, consider supporting us as we provide data for those who wish to have access to high quality versions of open doctor data. This dataset should deliver on the overall promise of open data: transparency improves performance.
There is now a DocGraph Google Group that you can join if you would like to ask specific questions about this dataset.