Partners HealthCare

Biobank Disease Challenge

Frequently Asked Questions

Why do some of the observation facts have concept codes that do not show up in concept_dimension? (e.g. concept_cd = ‘c93010’)

All concept_cds in observation_fact are in concept_dimension (the export query requires this). However, note that the database compares these codes case-insensitively, whereas joins in R or Python are case-sensitive, so the same code can fail to match there. For example, c93010 is listed as C93010 in concept_dimension.
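
One way to avoid the mismatch is to normalize the case of concept_cd on both sides before joining. The sketch below uses plain Python dictionaries; the rows, the patient_num and name_char columns, and the normalize_code helper are illustrative assumptions, not part of the challenge export.

```python
# Illustrative rows only: the values and the patient_num / name_char
# columns are assumptions for this sketch, not challenge data.
observation_fact = [
    {"patient_num": 1, "concept_cd": "c93010"},
    {"patient_num": 2, "concept_cd": "MRRADPETALZ"},
]
concept_dimension = [
    {"concept_cd": "C93010", "name_char": "placeholder name A"},
    {"concept_cd": "mrradpetalz", "name_char": "placeholder name B"},
]


def normalize_code(code):
    """Return a case-normalized join key (hypothetical helper)."""
    return code.strip().upper()


# A naive, case-sensitive lookup finds no matches for these rows.
exact_codes = {row["concept_cd"] for row in concept_dimension}
exact_matches = [f for f in observation_fact if f["concept_cd"] in exact_codes]

# Index concept_dimension by the normalized code, then join on that key.
concept_by_code = {normalize_code(r["concept_cd"]): r for r in concept_dimension}

joined = []
for fact in observation_fact:
    concept = concept_by_code.get(normalize_code(fact["concept_cd"]))
    joined.append({**fact, "name_char": concept["name_char"] if concept else None})

# Facts that still have no concept after normalization.
unmatched = [row for row in joined if row["name_char"] is None]
print(len(exact_matches), len(unmatched))
```

With a data-frame library the same idea applies: upper-case (or lower-case) the key column in both tables before merging.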

We also have questions regarding parts 1 and 2 of the challenge. Each asks for a Jupyter/R markdown notebook. Does this mean one notebook for each part (two notebooks total)? And, if there are two notebooks, do they have to work completely independently of one another, or can part 2 rely on output of part 1? We ask this, because if there are two notebooks, each can be run in a different environment, making visualization much easier for part 2.

It is fine to tie the part 1 notebook to the part 2 submission. However, note that part 1 requires submission of 4 phenotype predictions, whereas part 2 requires an analysis notebook for only one phenotype.

How many patients have an ICD code for each of the phenotypes?

abbr           n
AD          2369
AFIB       10894
MHA        12721
MI          8360


When using i2b2 in Firefox I get the message 'User not registered'.

Clear the Firefox cache on the Ubuntu instance: click the icon on the right side with 3 vertical bars and a backslash, select History, then 'Clear Recent History...', and choose 'Everything'.
Then close Firefox and go back into i2b2 by using the following URL


What is the recommended way to close Horizon Client?

Just exit out of the application; do not log out or shut down the machine. When you return, all your applications will be where you left them.


Is it correct that “N” in the gold standard does not necessarily mean that a patient does not have a disease? In other words, “N” does not mean that the BDC clinician found evidence that a patient is disease-free. Instead, “N” just means that, even if it is likely that a patient has the disease, the clinician did not find enough evidence to confirm the diagnosis. Let’s say that for some patient there is a 90% likelihood that he or she has the disease and there is a diagnosis in the database. However, if the BDC clinician does not see gold standard tests for the disease in the database, then this clinician will not put “Y” into the training labels (because it is not possible to make a definitive diagnosis according to clinical guidelines). At the same time, another clinician who actually saw the patient may have had additional information which allowed him or her to come up with a definitive diagnosis. Is this correct?

An N label means that the patient did not have the disease, either because there was explicit information in the chart ruling out the disease or because there was no mention of the disease even under full physical examination. Yes, there is always some ambiguity in doing chart reviews where there is no 100% definitive diagnosis. For the purposes of the challenge, however, you can consider the training labels as gold standards. Both training and test labels are created based on the same criteria by the same reviewer.


Did the clinicians who assigned the “gold standard” labels use the same information that is available to us from the database? Or did they have access to additional information that is not available in this database?

The chart reviews are based on detailed clinical narratives that are not available in the Biobank challenge database.


Can you please explain in detail what you mean by the numeric score (Algorithm_score)? Should it reflect the probability that a patient has a certain phenotype?

It can be any numeric score that can be ranked and can be used to calculate an AUC. A probability (0..1) is fine.
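
To illustrate why any rankable score works, here is a minimal sketch of a rank-based AUC in plain Python. The auc helper, the labels, and the scores are made up for illustration and are not the organizers' evaluation code; the point is only that AUC depends on how the scores order the patients, not on their scale.

```python
def auc(labels, scores):
    """Rank-based AUC: the fraction of (Y, N) pairs in which the Y patient
    has the higher score, counting ties as half. Hypothetical helper."""
    pos = [s for label, s in zip(labels, scores) if label == "Y"]
    neg = [s for label, s in zip(labels, scores) if label == "N"]
    if not pos or not neg:
        raise ValueError("need at least one Y and one N label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Made-up labels and scores for five hypothetical patients.
labels = ["Y", "N", "Y", "N", "N"]
probabilities = [0.9, 0.2, 0.6, 0.7, 0.1]   # scores on a 0..1 scale
raw_scores = [40, -30, 10, 20, -40]         # same ordering, arbitrary scale

print(auc(labels, probabilities))
print(auc(labels, raw_scores))
```

Both calls give the same value because the two score lists rank the patients identically.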


Is there a way to see the result of a test like an MRI or CT scan? In the database we can see that the test was performed, but we do not see any result of this test. A good example is a test with concept_cd = ‘MRRADPETALZ’.

Imaging and radiology reports are not available as part of the challenge. Diagnosis codes assigned by the radiologists as a result of the imaging procedures are in the database.


Should we output our predictions about the disease in general (like AFIB) or should it be a specific ICD code (like ICD10:I48.1)?
Which of the two outputs below looks closer to what you want us to generate:

Option 1:
Option 2:

Option 1


Do I need to have the DUA signed for registration? 

Yes, you and your authorized institution representative need to have the signed form ready for upload to be able to complete the registration.


Can I change any information on my registration form after the registration? 

Once the registration is submitted, it cannot be updated. Because the request has to be evaluated, we cannot change the information on it.


What information do I need to register as a group?

You would need to select a group leader and include his/her email on the registration form.


Do I have to be part of a US institution to be able to participate?

Yes, your institution has to have United States representation for you to be able to participate.


Can a group be formed with participants from different institutions?

Yes. Each member has to have their own institution's approval for the DUA, but the group can have members from different institutions.