Welcome to the Wednesday Afternoon Lecture Series. This is the final lecture of the season. I'm Dr. Francis Collins. [ laughter ] Or I claim to be; Dr. Collins couldn't be here today. I'm the chief of the Laboratory of Informatics Development, which is partially the reason for inviting our guest speaker today, a friend of mine from way back. I'm going to read his intro because I can't memorize all these titles.
Dr. Atul Butte is the chief of the Division of Systems Medicine at Stanford University School of Medicine. He is an associate professor of pediatrics, and of computer science, immunology, and rheumatology. As for his background: he has a bachelor's in computer science and an MD from Brown, did a residency in pediatrics and a fellowship in pediatric endocrinology at Children's Hospital, and then went to MIT for a master's in medical informatics and a Ph.D. in medical engineering and physics. Since the 1990s, he has been developing knowledge discovery methods for clinical and laboratory databases, using techniques like relevance networks, and has had remarkable success using publicly available data resources to discover novel genomic relationships. He has a number of honors. He gave a lecture that is up on YouTube now; it's maybe the short version of this, so you'll get the longer version. He has also been on the board of directors of the American Medical Informatics Association and is a fellow of the American College of Medical Informatics, and in 2008 he started and ran the first translational bioinformatics conference in San Francisco, which has taken off and become a major annual meeting in the field of translational bioinformatics.
So with that, I'd like to welcome Atul Butte. [ applause ] >> Excellent. Thank you for having me here, thank you for putting up with the heat, and, for those of you here live, thank you for making the trek over to Building 10.
I also want to make one other point, to introduce myself to the high school kids and the college kids here in the audience: I was in your seats back in 1991, as a summer student here. I spent the summer here. I came back for a year during medical school, living in the Cloister, and then I got an institutional NRSA, and then a T grant, then a K, then an R01, and then U's and P's. And everything started with that summer. It's an honor to be able to give one of these lectures, since I benefited from listening to them at one point in my life. The usual disclosures: I started a bunch of companies, I consult for a bunch of companies; you shouldn't believe another word I'm about to say here. So, if you haven't heard, we are in the middle of a data deluge,
according to The Economist; this came out about two summers ago. The human species generates two zettabytes of data per year. If you don't know your metric prefixes, don't worry: next year we will generate four zettabytes, because it's growing. We generate so much data, zettabytes of it, that my favorite article from The Economist says the scientific method itself is starting to become obsolete. What do they mean by that?
Think about the scientific method, going back, let's say, 400 years: we ask an interesting question or generate a hypothesis, and then we go make measurements to answer that question or address that hypothesis. But what happens in a world where we have so much data already? We already have the measurements. In many fields (physics, astronomy, the life sciences) there is so much data that the new magic is figuring out: what is the cool question I want to ask of this data? And that is really 99% of the work in my lab. What is the cool question we want to ask, where the whole world is waiting for the answer, and nobody else realizes the question is askable because we already have the measurements? In that way, the scientific method is running backwards: we get the data first, and then decide what we want to ask of it. It is happening in life science research as well. The article at the top here
comes from the Harvard Business Review. It's interesting; it's a business journal, and it's great for junior faculty to read. It talks about how to organize your lab and things like that. In 2010, they said data-driven science is the next big scientific revolution: we started with experimentation in the 1400s, we went to theory in the 1600s and 1700s, we went to simulation and models in the 1900s, and now data-driven science in the 2000s and beyond. Nature says it is shameful if we don't make use of the data we put out there. Science says we need to make data maximally available. And The Lancet says it's an impediment to public health if we don't share that data. This data comes from devices
like this one. Of course I've got to bring my props. This is a standard gene chip; this is what it looks like in real life. It's kind of small, the size of your thumbnail. This lets us quantitate every gene in the genome, the blueprints from DNA to proteins. And you all know, especially any of you who have spent time at the bench, we love to scale: when we get one of them to work, we love to make 96 wells of them. So this is the 96-well version of that; you can see one version of it there, and each well has one of those microarrays at the bottom. It's kind of amazing that with a chip like that, we can quantitate every gene in the genome. And it's really amazing that this is a 15-year-old device now. We have had this for 15 years! It's a commodity item, right? I don't know about you, but for us at Stanford, the question is which vendor is cheapest today. They don't like to say it, but we really buy these things on price: $200 or $300. That's it on the left there. It still looks absolutely amazing to me. We actually make scientists
share this data on the internet, right? And you know, it's not perfect; not everyone shares, and not everyone gives away everything they are supposed to give away. But it's amazing how much data we have on the internet from this one high-throughput measurement methodology today. So one place to go to get data is, of course, the NCBI; I have a lot of friends there, and my research career wouldn't exist if not for the NCBI, so for all of you in the audience, I'm putting out a shout-out for you. NCBI GEO is the repository for these microarrays, and you can see, as of a couple of weeks ago, we are up to 752,000 publicly available microarrays at NCBI GEO. That's just the U.S. repository. The Europeans run one called ArrayExpress; if you don't count the overlap, that's another 213,000 arrays. Suffice it to say these are growing like crazy. We are just a few weeks away from having one million publicly available microarrays.
That's up from zero in 2002! And the data continues to grow exponentially; it's doubling every two years now. We have finally slowed down to Moore's law: computational power doubles every 18 months, right? I think this is only doubling at that rate now because it was tripling or quadrupling earlier. And it's only slowing down because of the next repository I'm going to show you. You can guess which one that is going to be.
Here is the most amazing point of this, and it's directed at certain junior folks in the audience. In the middle of this website, just in the middle, is a very simple search box. And in that search box right here in the middle (I'll get the pointer out here), a high school kid today who needs to do a science fair project can type "breast cancer," click Go, and find and download 31,000 samples of breast cancer, about as easily as she can find a song on iTunes today. 31,000 samples of breast cancer, digitally available. If you look, that's more than 1,000 independent experiments on breast cancer.
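For readers who want to try that search programmatically rather than through the web page, here is a minimal sketch using Biopython's Entrez client against the GEO DataSets index; the email address is a placeholder, and the exact record counts will differ from the numbers quoted in the talk.

```python
# A hedged sketch: query NCBI GEO DataSets for breast cancer records,
# the programmatic equivalent of typing "breast cancer" into the search box.
from Bio import Entrez

Entrez.email = "student@example.org"  # placeholder; NCBI asks for a real address

handle = Entrez.esearch(db="gds", term="breast cancer", retmax=0)
record = Entrez.read(handle)
handle.close()
print(record["Count"], "GEO records currently match 'breast cancer'")
```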
And what kind of question can a student ask? Well, there are breast cancer researchers who are religious about their use of human cancers and biopsies. You've got folks who say it's only real science if it's done on cell lines. We've got those who look at mouse models. And even a high school kid can ask: what is common across all 1,000 experiments? I don't care if it was a biopsy, a cell line, or a mouse; what is the core common denominator in breast cancer today? And the thing about 30,000 samples of breast cancer: that is more samples of breast cancer than any breast cancer researcher will ever have in their lab, because by definition, if they want to publish in a good journal, each investigator has to put out their data, and so do their competitors.
So folks who get to the repository and know what to do with it will have more samples than anyone else in the field. And if it's not breast cancer, it's colon, prostate, hundreds of diseases. If you're skeptical about it: I'm proud of these five high school kids from my lab who placed in the Intel, Westinghouse, or Siemens competitions in the past three years. Each of these kids was in the top 300 science kids in the country. High school kids today get to meet Obama, they get an asteroid named after them, and all that. A high school kid today has more digital access to samples than any researcher in the field, if we choose to empower them with the questions they should be asking of the data. Now, if that's RNA, the DNA repositories are growing like
crazy too; dbGaP is growing faster than GEO in some ways. But here it's not so easy: a high school kid can't just click to download the data. We are finicky about privacy when it comes to genotypes. Here you print the permission forms and get IRB approval. And here the stick is harder: it's not the journals, it's NIH and the trust that makes scientists share this data, even before publication. So we have to tolerate these embargo dates; we have to give the producers a fair shot at publishing before we do.
Why is it worth all that hassle? I want to show the second one here, which you can't read. This is the entire Framingham Heart Study, okay? The reason why we know the word "cholesterol" in the United States. Now, I think it's kind of nifty that, with permission, I can download 14,000 people's genotypes. But I think it's more amazing that I can just download 10 or 20 years of phenotypes: clinical measurements and research measurements. And those folks at Framingham are pretty nifty with this; you might think they have found every possible predictor for cardiovascular disease. You would be wrong. Sitting in there might be the next big diagnostic for a disease they just haven't thought about looking at yet. In the old days, you essentially had to work there to get to this data. Now we can all access it, because of the data transparency and reproducibility rules. So there are rules here, and it is an amazing world. And that's number 2 on a list of hundreds of such resources waiting for you to ask an interesting question. Now we have this data; it goes back to GenBank, 29 years ago, this idea that we should share research data.
And we can look at this and start to get ideas: this disease could be treatable with this drug (we do a lot of this repositioning work), or this diagnostic could predict this particular disease. And where we used to get stuck is this: as a computational researcher, how do I convince the biologist or clinician-scientist to test my idea? I'll give you an example. We had a prediction for a serum marker for AML; that's a nasty form of leukemia. You might want to detect who is recurring sooner than you otherwise would, because it's lethal; you start thinking about bone marrow transplantation. So we go to our cancer center (I love my cancer center folks; I'm a member of the cancer center) and I say: do you have any saved serum or plasma from AML? We have a biobank there; do we have saved samples? And they said, you know what? Microarrays were invented at Stanford; we saved the tumor cells and extracted RNA, but we never saved the liquid around the cells, the serum or plasma. They said: get your IRB approval, put up the posters, and in about a year you should have enough samples to test your computational prediction. So what do I do instead? You just google for it and find
a company like this: Conversant Bio. Browse by disease. What are we studying? Bladder cancer? Brain cancer? Breast cancer? Here is leukemia. What do I want to buy? Bone marrow samples? What type of cells? Here is plasma from peripheral blood in leukemia. Each patient has an ID, age, race, sex, alcohol, tobacco, all the drugs, all the samples available for this patient. $65 each. We buy them all, and we validate our markers this way. It's not a big deal to get cancer serum, plasma, pathology samples, tissue microarrays. Where does this come from? These are big hospitals in the middle of the country, not affiliated with medical schools. This is material on its way to the trash. They strip off a label, put on a new one, remove the identifiers, and some interesting companies resell this to researchers. It comes in 48 hours on dry ice. I can't beat this price. I can't even beat this kind of turnaround time at my own institution today. So I love biobanks and biorepositories, but they had better be at least as efficient as what we can get from a company on a
website today. But then the purists, the biologists, say: it's descriptive. Who cares about this? I don't believe it unless you show it in a mouse model; it's got to be necessary and sufficient. And then I show them Assay Depot. Kind of sounds like Home Depot: AssayDepot.com. Do I want to run biology, chemistry, drug metabolism, pharmacokinetics, pharmacology? What kind of mouse model do I want to run? Bone, infectious disease, inflammation, neurology, cancer, eye, ear, pain, respiratory. You'll hear a diabetes story from me, so let's click on diabetes. Here is a standard mouse model; this is a 16-mouse experiment. This mouse has been eating a lot and getting diabetes since the 1970s. Biopsy these however you want, two groups of 8, 4 groups of 4. Test any mouse you want, try any drug you want. The study is here. What am I covering up? The price: $9,000 for the service, a 9-week turnaround time, and literally "add to shopping cart." An entire translational mouse experiment is purchasable with a credit card today. I'm not trying to disrespect anyone who does mouse studies, but if this is all you do, there is a company that will do it for you.
Nobody pours their own sequencing gels anymore, remember? This is the logical extension of that. Now, some of you with very sharp eyes are saying: this company is in China, so you're not just outsourcing this, you may be offshoring it as well. Is that really cool? When you click "add to shopping cart," all 133 labs and companies compete for your business, including those in Maryland and Wisconsin. In fact, it's easier to order from this than from Amazon. Let's say I just want FDA-approved ones, or GMP, or AAALAC, or USDA. Say I have to have GMP: that's 20 of the 133. Whatever certifications you want, this experiment gets done. Now, I know some of you are dying to criticize me on the quality of the output from these companies. I'll simply answer: if it's this cheap, run two of them, from two companies! And this is what we do. How many biologists go across the hall to get their neighbor to do the same experiment? They don't do that. We'll order two of them. So this is a world where the validation methods are commodities. This isn't rocket science anymore; people sell these services even in the United States.
So what I'm going to argue is that it's about the translational pipeline. We talk a lot about NCATS and the others. Think about it: we make clinical and molecular measurements at high throughput, we ask interesting questions, we run a trial, we apply statistical and computational methods to that data, and then we validate a drug or biomarker. I'm going to argue that most of those steps are commoditized today. We have plenty of data. For the statistical and computational methods in my field, we put out so many websites and tools that it's hard to distinguish signal from noise. And the validation side, the testing of that drug or biomarker, is completely commoditized today. But I'm going to argue that nobody is ever going to outsource asking good questions. If there is one take-home point for the summer kids and for the college students and trainees: you're never going to go out of business if you know how to ask good questions. In my lab, we say: outsource everything but the question.
Asking good questions will never go out of style. Another way to think about it: I live in Silicon Valley. We are so used to starting companies; kids start companies like Facebook in a dorm room. Yahoo! and Google started in dorm rooms. Apple started in a garage, and HP too. I think we have so much going for us that the next Amgen, the next Genentech, is going to start in a garage. Where are my garage biotechs? Maybe the next Pfizer. I have a million arrays available from my garage. I have every mouse model purchasable from my garage. All you need is enough room to get going, to get the ideas flowing; that's all you need in your garage. And this is what we practice: we spin companies out of my lab. I want to practice what I preach in this way. So, I'm going to spend most of my time talking about new findings.
And with no disrespect to any of the other investigators in my lab or the papers we published in the last 10-15 years, this is my favorite paper of them all. Many of you might not have seen it; I want to explain how we got to this point. It addresses a particular disease called type 2 diabetes, the public health menace of the world: 12% of American health care dollars, 20 million people, a world health problem. We have a lot of drugs for diabetes, and we still need new drugs. The latest ones, the DPP-4 inhibitors, affect insulin secretion. We can affect how insulin is released, affect how insulin has its effects, and these drugs reach billion-dollar markets because we need better therapies. Yet we still don't know why people get type 2 diabetes, nor do we really have a universal biological mechanism.
So let's address one of the hardest conditions: let's take our informatics approach and apply it to type 2 diabetes. A lot of this got inspired by a paper in Science in April 2009, and I want to take a minute to explain what it showed. It comes from the field of mouse psychiatry. Think about what it means for a mouse to have anxiety or depression; sometimes that's the only way to study the genetics of these kinds of psychiatric diseases. And these investigators were annoyed and depressed because none of their answers were matching, because one group had one way of defining anxiety in a mouse, and another group said depression in a mouse means this. How do you agree? So what they did, 10 years prior: they said, we are all going to agree. This is what it means to have depression in a mouse, anxiety in a mouse, ADD in a mouse. We are going to agree on the standards.
And lo and behold, all of their answers started to match. Everyone started to get the same answers. But why this paper? Because the minute they tweaked any one parameter in the experimental model, changed what kind of water they used or what kind of chow it was or what strain it was, none of the answers generalized. The entire field had essentially overfit their model. They all agreed on one model, they all standardized on it, they all got the same answers, and it had zero relevance for anything else. And what this paper is saying is: maybe we should deliberately do different things. Why standardize? Let's try different experimental conditions that are all logical, and if we end up with something in common,
maybe that's the most resilient of the answers out there; that is the causal mechanism. So let's take this to heart and study type 2 diabetes. To study type 2, I'm going to look at microarray experiments; there are almost a million publicly available, coming up soon. Almost a million. When I look for experiments, I'm looking for experiments with samples from diabetes and samples from controls. Diabetes and controls, diabetes and controls. But I'm looking at fat, muscle, liver, and beta cells (the cells that make the insulin in the pancreas) from rat, mouse, or human. Three species, four tissues, and I'm simply asking: what is in common? What is in common?
So how do we do this? Let's make a list of every gene in the genome and count how many times that gene shows up as being different in one of the 130 experiments. Here is a graph: it starts at 0, 130 would be where the flags are, and it ends at 75. Start with the list of every gene: 25,000 genes in the background, almost every gene in the genome. A quarter of the genes change in expression level in at least one of the 130 experiments; it's not rare for some gene to change in some experiment, and there is a long tail with a bulge around 20 or so, because genes change in certain tissues more than in others. But you can see there is a big curve to the left: most genes don't change in many experiments.
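As a minimal sketch of that tally, assume each of the 130 experiments has already been reduced to a set of differentially expressed gene symbols; the names and data below are illustrative, not the lab's actual pipeline.

```python
from collections import Counter

def tally_differential_genes(experiments):
    """For every gene, count how many experiments called it
    differentially expressed between disease and control."""
    counts = Counter()
    for genes in experiments:
        counts.update(set(genes))  # each experiment votes at most once per gene
    return counts

# Illustrative input: the real analysis has 130 such lists.
experiments = [
    ["CD44", "SPP1", "TCF7L2"],
    ["CD44", "LEPR"],
    ["CD44", "PPARG"],
]
for gene, n in tally_differential_genes(experiments).most_common(3):
    print(f"{gene}: changed in {n} of {len(experiments)} experiments")
```

Genes that sit far out on the right of that histogram, like the red dot at 78 of 130 described below, are the candidates worth staring at.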
Now let's take the genes we know have something to do with diabetes: the ones from the so-called GWAS studies, TCF7L2; the leptin receptor, which is not really a GWAS hit but another kind of genetic hit; PPAR-gamma. And you can see the known genes seem to be changing in more of these microarray experiments than the background list of every gene in the genome. So we are staring at this, and staring at this, and staring at this. How do we simply ask: what is this little red dot over here? There is a gene changing in 78 out of 130 experiments. 78 out of 130, and it's not one of the gold-standard ones; no one has pursued it for type 2 diabetes yet. So I'm going to call that gene A for the moment. I'll tell you all the answers at the end; let's keep the suspense here.
We're looking at gene A, and it has a very interesting symbol. The symbol starts with CD, so it's a CD molecule: a gene for a cell surface protein that someone somewhere sorted cells on. So it's a cell surface protein; in fact, it's a functioning receptor. We look lower on our list at gene B, and this red dot is the ligand for that receptor, the thing that binds it. Now we are really interested, because this pair was never even pursued. This ends up being a lot of collaborative work with many labs across Japan, at the University of Tokyo and other universities, all run by my postdoc, now a staff member, who had a whole network of friends helping with this study. And where we didn't have a collaborating lab, you now know how we purchased the experiments, because we were not going to let that slow us down. So we look at our database, and it says: if you are going to pick a tissue, chase this down in adipose. The fat really sees gene A change the most, followed by islet,
and then liver and muscle. This graph says that in 90% of the adipose experiments where the receptor changed, the ligand also changed. That means there might be what we call an autocrine effect: the ligand is being made in the same place the receptor has its effect. So now we get wildtype mice, C57 Black 6 mice; you can see the graphs here. Give them a high-fat diet and the receptor level goes up. We start to make pictures of the fat and stain for the CD molecule, and you can see, interestingly, it's not the fat cells; it's the inflammatory cells around the fat cells. We think they are macrophages, but they could be other types of inflammatory cells as well. So this is not a fat cell issue; it's the inflammatory cells around them making these particular CD molecules. The graph shows the ligand and the receptor are co-expressed: some correlation between them at the RNA level. So we look at the receptor and think it's interesting, and it is in the fat tissue.
We start to think: maybe we should knock this gene out in a mouse. Then we go to Jackson Labs and realize this gene was knocked out in a mouse 11 years ago, by the immunologists. The immunologists were interested in the CD molecule; this receptor was cloned in 1985, and it's a low CD number, not even triple digits. It had been knocked out for 11 years, but no immunologist ever put a glucometer on the mouse. They didn't look at the sugar levels in the mouse; they didn't do a fasting study or any of that. So it's about a $6,000 experiment to get this mouse from Jackson Labs. We get the knockout mouse and stain for macrophages, and you can see, even from the back of the room,
there are fewer inflammatory cells. It's not zero, and over here you can quantitate that. How does the mouse do functionally? The mouse without the receptor does better than the wildtype: it's more insulin sensitive. In other words, missing this receptor, it doesn't die from diabetes; it does better than the regular wildtype mouse, whether we look at fasting blood sugar or insulin, on a normal or high-fat diet, with no change in weight. No matter which one you look at, this mouse is insulin sensitive compared to the wildtype. So this tells us lower is better: if we can knock out this function, boy, maybe this is a new therapeutic target for type 2, at least to treat diabetes in the mouse. Let's connect it back to humans.
You can buy slides of fat from humans; there is plenty of fat to go around from liposuction and cosmetic surgery. Companies out there put these on slides and sell them for a couple hundred dollars. Here is a 57-year-old woman with a BMI of 36, almost 37, chock full of macrophages making this receptor. So it's not just a mouse thing; human fat has it as well. And when we look, we see this CD molecule isn't just on the cell surface; there is a soluble form in the blood. You can measure it in the blood, in the serum; that's a $500 kit. And the level in the blood correlates with something called hemoglobin A1c, the blood test we use in diabetics; that is what we follow once you're diagnosed, to see how your blood sugars are doing. Lower is better: the less of the receptor in the blood, the lower your hemoglobin A1c. These 55 folks didn't even have diabetes yet, so already, as this level goes up, your hemoglobin A1c is going up; higher is worse. It also correlates with HOMA-IR, a proxy measurement for insulin resistance. Everything is telling us lower is better; this is a great therapeutic target. So hell, let's just design a drug against this one.
So how do you make a drug? The simplest possible drug: here is the receptor; let's just treat with anti-receptor antibodies. We treat wildtype mice with anti-receptor antibodies and, as a control, with other antibodies. So what happens to the mice? These are wildtype mice on a high-fat diet. In about a week, we can wipe out the inflammatory response (not 100%), and in about a week, we can lower their blood sugar. So let's recap here for a second. We have a prototype new drug, a serum companion diagnostic, the knockout mouse, and human and mouse immunohistochemistry: 18 months of work, and we did it with the same data any high school kid can get to today.
And just sitting in that public data are many other findings. We have now started doing this for type 1 diabetes, scleroderma, lupus, six cancers. The kids call it crowdsourcing when you get your friends to help; this is retroactive crowdsourcing. We had great scientists generating this data; now let's pretend we told them to do it, and use the data as if it had been planned that way. The data is sitting there. It's high value. For some reason, we think that if it's free and on the internet, it must be valueless, because we've seen plenty of videos of cats playing the piano. I'm telling you: NCBI is free and on the internet, and it's extremely valuable. Podunk University doesn't get grants to run microarrays; it's the best universities, the best investigators, our peers who generated this data. But for some reason, because it's free at NCBI, we think it's valueless. This is what is possible with the data we put out today.
So: this is CD44, and it binds many more things than hyaluronic acid. The ligand here is osteopontin, to connect the dots. And the osteopontin knockout was published in a JCI paper; it's insulin sensitive. Everything points in the same direction here. But CD44 binds 40 different ligands, and people never made the connection to this particular receptor. Anti-CD44 is in drug development today because, of all things, it happens to be a marker for cancer stem cells, on top of everything else. Now, it might not be the same CD44. It's a complicated receptor: 20 exons, 10 splice variants, radically different shapes with alternative splicing. Who knows if the one on the cancer stem cell is the same one affecting the fat and beta cells. But there are people trying to develop this as a therapeutic in the cancer field, and maybe we should peek and see what is happening to the blood sugar of some of those patients as well. So there is a lot of future for this molecule, and many others we are
finding this way. Now, obviously, you know that diabetes isn't just going to be a problem of genes and gene expression. Why are we having obesity and diabetes epidemics? It's the environment. You can blame it on Cheesecake Factory portions, thousands of channels of TV with a sofa in front of them, kids not having recess, standardized testing. There are a lot of reasons for diet and obesity today. But are we really so sure that there is nothing else in the environment that is leading to the diabetes epidemic?
So: the people who study the environment are epidemiologists, and I have great friends and colleagues who are epidemiologists. But in their field, they look at one environmental factor, and if they show it's positive in a big cohort, they get a JAMA paper or a New England Journal of Medicine paper out of that. Meanwhile, the last time a geneticist looked at one gene to start a study was like 15 years ago. Geneticists say: let's look at everything. Phase I, get some hits; phase II, validate; and subsequent papers for deep sequencing. They look at everything. So we said: why don't the epidemiologists and environmental scientists look at everything? Because everything in the environment is huge. It's unbounded. It's not just some 6 billion base pairs; it's a huge number of things that could be in the body. People have started to think about the exposome and the envirome. These are not my words; people don't like people making up new "ome" words. These were in an article that asked: why can't we start to think about environmental causes the same way we think about genetic causes? Instead of going after them one by one, start with as many as we are measuring and try to associate
them with a particular condition. And so we did this as a prototype. We started an experiment; a grad student in my lab did this. We borrowed from a publicly available repository known as NHANES, the National Health and Nutrition Examination Survey, put out by the CDC. It's a function of the CDC; it's not research. This is how they inform Congress as to the health of the United States. When you were born, whether you were born in the United States or your children were, your heights and weights were plotted on growth charts; those come from the CDC's cross-sectional data from NHANES. Few people realize that they have been measuring thousands of pesticides, heavy metals, and toxins on all of these Americans. These are randomly selected cohorts every two years, and they run blood and urine tests on hundreds of thousands of individuals from those cohorts. Can you believe it? All that data is publicly
available for you to do any science you want with it. So we said: let's look at the fasting blood sugars, classify people as having diabetes or not, and look at all of these factors. The way to read this: in the 1999-2000 cohort, they had 13 PCBs measured. These had 16 heavy metals. These had phthalates and phenols and viruses and volatile compounds. Each cohort had a different set, and to be clear, these aren't static numbers; sometimes it takes the CDC five years to finally release the measurements made on a cohort five years ago, so these numbers keep changing. Let's run every one of these in our equation, controlling for age, sex, socioeconomic status, ethnicity, and BMI, and see if any of these factors associate with fasting blood sugar, with type 2 diabetes. This looks like 200 boring logistic regression equations. The reason it was written up in 100 newspapers is that we called it an EWAS: an environment-wide association study.
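A minimal sketch of one of those regressions, assuming a tidy NHANES-style table with one row per participant; every column name and the file name here are hypothetical stand-ins, and covariates are assumed to be numerically coded in this toy version.

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("nhanes_cohort.csv")  # hypothetical pre-merged NHANES extract

# One logistic regression per environmental factor, each adjusted the same way.
factors = ["pcb153", "heptachlor_epoxide", "gamma_tocopherol"]  # hypothetical columns
for factor in factors:
    model = smf.logit(
        f"diabetes ~ {factor} + age + sex + ses + ethnicity + bmi", data=df
    ).fit(disp=False)
    beta = float(model.params[factor])
    print(f"{factor}: beta = {beta:.3f}, p = {model.pvalues[factor]:.2e}")
```

Repeat that a couple hundred times, once per measured factor in each cohort, and you have the 200 boring logistic regressions.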
And we plotted it on a Manhattan plot, which you're used to seeing chromosome by chromosome; now we put the nutrients here, the phthalates here, and we have a band in red here. We follow the Nature Genetics rules and will only talk about the ones that are positive in two cohorts or more. And a pesticide banned in the 1980s for cancer risk is elevated in the blood and urine of Americans with higher fasting blood sugar, with odds ratios of 1.8 and 3.2. Those numbers are higher than for
any gene in the genome for type 2 diabetes today. Now, we have done this for lipids, we have done it for blood pressure; you will see a lot of these papers coming out. And this is the way we think about it now. There are folks in Berkeley starting to come up with enviro-chips: let's measure 1,000 of these environmental factors. Some of you may have purchased a Fitbit, a thing that sits in your pocket and counts the flights of stairs you walk up and down. You'll be able to get measurements, to get people to sample that environment. We shouldn't be ignoring that environment. And bringing the whole story together: I don't want to miss the environmental factors anymore either.
So now let's put it all together. You know we are in an era of sequencing excitement; next-generation sequencing is now previous-generation sequencing, and we will get whole genome sequences on a lot of our patients, if not all. I will talk about Helicos, with its $30,000 genome; the company is already delisted. Many others are coming and going. PacBio is on track to sequence a human genome in 15 minutes, prompting Discover magazine to come up with an article entitled "The Jiffy Lube of the Genome." Those of you who drive a car know Jiffy Lube promises to change your oil in 30 minutes; this will be faster and cheaper than that. They generate 20 terabytes of data every 15 minutes, so you get a sense of the scale of the data.
So: two companies are already offering $1,000 genomes, plus labor and reagents. Illumina is close to that; the current academic pricing, if we order more than 100 genomes, is about $3,000 per genome today. What an amazing sentence that is: people order these in units of more than 100. When you think about it, it's stunning. Complete Genomics is on track to sequence 80 genomes a day, if they don't lay everyone off. We are on track, with linear trends, to reach $33 per genome at the end of this decade. Some of you might pay more than that just to park around here in Bethesda. So, suffice it to say, it will soon be zero dollars, and I'm just going to say it: it will be negative.
There will be companies that think it's worth it to them to get your genome, if you're willing to accept some discounts. There are car insurance companies already that pay you, or give you a discount on car insurance, if you place a GPS in your car to show you're a good driver. Why would this be any different? I'm not even going to show you the Moore's law graph on this one that everyone always shows. One quick take-home point: this is not theoretical. If you want to start playing with genomes,
go to the Complete Genomics website and download 69 human genomes; that way it's not theoretical. One of the problems in the class I teach every spring: go download all 69 human genomes, unpack them, and cluster them, and the right answer had better be that you get the migration of people on planet Earth. I should be able to see that migration when you cluster them. You have one week to do this, because your final projects are due the following week. I can expect undergrads to download, unpack, and cluster 69 human genomes, and they do it. We shouldn't be scared of this, because this is what the next generation will be doing; we can get with the program or just be scared of it. If you have any interest in what this stuff is, you can get it; I think you need to sign a permission form, but it's free data. Just go play with it.
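A minimal sketch of that class exercise, assuming the downloaded variant files have already been condensed into a genotype matrix; the .npy file name is a hypothetical preprocessing output.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical preprocessed input: 69 genomes x shared SNP sites,
# coded 0/1/2 by alternate-allele count.
genotypes = np.load("cg69_genotypes.npy").astype(float)
genotypes -= genotypes.mean(axis=0)  # center each SNP column

coords = PCA(n_components=2).fit_transform(genotypes)
print(coords.shape)  # (69, 2): plot these, colored by reported ancestry
```

The "right answer" in the class is exactly that: the first two principal components separate the genomes by ancestral population, the migration of people on planet Earth.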
So, we were working with Steve Quake. Steve Quake has given talks here, and you know he's a very entrepreneurial guy. He published a paper as one of the first faculty members to get his genome sequenced, back in 2009; he had started a company, and paid $30,000 for his own company to sequence his genome. It went into another repository, the Short Read Archive. So this was a landmark paper, but it wasn't clear: what is the medical relevance of having your genome sequenced? So Steve Quake presented as a patient. Let me walk you through this. This is the paper we wrote on him presenting as a patient:
Euan Ashley and his team (he gave a talk here earlier this year), me and my team, Steve Quake, ethicists, and George Church, who is on every paper in this field. So: I'm a doctor; let me present a case to you. Here is a 40-year-old male presenting in good health, no complaints or symptoms. Why is he presenting? A family history of sudden death. Here is Steve Quake, and here is his nephew, who died at 19 years of age: didn't wake up one morning, sudden cardiac death. Steve Quake now presents to a cardiac geneticist and says: doc, am I at risk for sudden death too? Here is my genome. The patient has a heart rate, blood pressure, blah, blah, blah. The patient presents with 2.8 million SNPs and 752 copy number variants, and by the way, your next patient is waiting for you in the waiting room. No one is ever going to give us more than 15 minutes to deal with this. So this paper became a prototype exercise of what it could be like in the future for doctors to deal with a patient with a genome in a 15-minute encounter. We did some things right,
and we did a lot of things wrong, but it became a prototype in that way. So, first, the cardiac risk. Euan Ashley could give an hour-long talk on this; this is one slide from him. Suffice it to say, it was not an easy answer. There are many genes known to have variants associated with sudden death. Well, good, lucky: Steve Quake doesn't have any of those known variants in those known genes. But he has other mutations in those same genes, and they are private: we haven't sequenced enough people to know how many people have them. We can run all the computational predictions, and they all give different answers. In the end, what will you tell Steve? Instead of extreme exercise, why don't you limit yourself to moderate exercise?
If that doesn't sound wishy-washy to you, it definitely sounds that way to me. The other problem is coronary artery disease, heart attack: there is a history of coronary artery disease in his family, and he has an LPA variant that sets him up for coronary disease. Plug in his age and lipid levels and phenotypes, and the risk score would say not to start statins. But given the LPA, we said: we should start you on statins, Steve. It's been written about in many newspapers: he decided not to start on them. We have yet to find the gene in the genome for compliance with therapy, and he undoubtedly has the risk allele. This could be an hour-long talk on the cardiac side; this is one slide. Another hour-long talk, the pharmacogenomics one, would be from Russ Altman;
I'm sure he gave it earlier in this series. Long story short: given the genome, we could say something about 150 drugs. If he decides to take statins, he won't get the kind of lethal myopathy some people get; warfarin, he might respond to a different dose. Suffice it to say, for 150 drugs we could say something about the dose, whether the drug might work, whether there might be adverse events. That could be an hour-long talk; I'm not going to give it. My lab was tasked with all the rest of medicine.
We said: this must be simple. There must be a master database somewhere with all the disease SNPs, all the variants in the genome, and we just intersect that with someone's genome. So these are the ones we know and love. The GWAS catalog: great. I talk about it in every talk I give. People reuse this data; 5,000 different diseases. The problem is, almost 1/3 of the records there didn't really record which was the risk version of the gene and which was the protective version. Which allele was which? That makes it unusable for personalized medicine. Now, I love this database; we use it like crazy, and it was done with next to zero resources, but this is the best we have. And it's only GWAS: when we have a positive hit and 50 groups validate it, those 50 don't show up in the GWAS catalog, nor does the candidate gene sequencing done prior to GWAS. There is another great database here, 66,000 papers curated, but it's gene-based, unfortunately. You can see which diseases are associated, but it wasn't curated down to the dbSNP level; we just know the gene was involved. And for the diseases that are not the common stuff, there is a commercial, pseudo-academic database with many mutations curated, and even there they didn't write down what the odds ratio was and which genotype was which: which was the worse and which was the better. So, we knew this day was coming, in my lab. And so, four years ago, we said:
we have got to just start curating, rereading every paper in human disease genetics. We're just going to read them all. Here is an example paper: 2008, the journal Inflammatory Bowel Diseases, a good example of a typical GWAS paper. Crohn's disease, nine loci, 700 Finnish patients. From this paper and others, we extract more than 100 different features: whether it's GWAS or non-GWAS, what the exact disease is, fasting blood sugar, population, gender, ethnicity; there are 800 terms for ethnicity in this database. Of course we wrote down the alleles and p-values and odds ratios, the technology, case-control versus longitudinal. And we map everything to something awesome called the Unified Medical Language System. Every one of you knows some piece of it, whether it's the Gene Ontology, ICD-9 and ICD-10, SNOMED, or gene symbols; it's a 23-year-old initiative from the National Library of Medicine that intersects those terminologies together.
How does the data look? A lot of it arrives in annoying tables like this: this percent, that percent, and you have to multiply them out to figure out how many people were in each bucket. A journal of hematology, 8 SNPs, this receptor, cases and controls. And this table is sideways; you have to rotate it. I love computer science and I love natural language processing, but you can see why this is hard: you forget where the tabs go, and there is no standard way of laying these tables out. Now, some of you with really sharp eyes noticed that I showed the same SNP in two papers.
The first paper says the alleles are C and T, and this paper, sort of a Wellcome Trust paper, says it's an A. But most SNPs have one choice or the other. So how do you get an A? What are the alleles for this? Look each one of these up in dbSNP, and the alleles are C and T. Unless dbSNP is wrong, it turns out 11% of all papers in human disease genetics report the alleles of the probe they used, and not the double-stranded DNA of the patient. So now we have to fix them all. And we have.
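The core of that fix reduces to a strand check against dbSNP. A toy version, with illustrative alleles only:

```python
# If a paper's reported alleles are the reverse complement of dbSNP's,
# the authors likely reported the microarray probe, not the patient's DNA.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def strand_consistent(reported, dbsnp):
    """Return 'ok', 'flipped', or 'mismatch' for two allele sets."""
    reported, dbsnp = set(reported), set(dbsnp)
    if reported == dbsnp:
        return "ok"
    if {COMPLEMENT[a] for a in reported} == dbsnp:
        return "flipped"  # report is on the opposite strand; fixable
    return "mismatch"     # something else is wrong; send to a human curator

print(strand_consistent({"A", "G"}, {"C", "T"}))  # 'flipped'
print(strand_consistent({"C", "T"}, {"C", "T"}))  # 'ok'
```

Note that A/T and C/G SNPs are ambiguous under this check (their complement is themselves), which is part of why human curation is still needed.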
Now you see why this is such a damn hard problem for computers to do. So we have done it; that's one problem, and there are a thousand like it. We have read more than 10,000 papers: 140,000 SNPs, 3,000-plus diseases, way over half a million records. And we will keep going like crazy. How we do this was published in a paper in 2010. Now that we have this, we can start to make all sorts of inferences about patients. So for example, Steve Quake: we
know he is a 40-year-old white male in California. He starts with a 9% chance of getting Alzheimer's disease. Now let's add back the SNPs, with the most believable ones at the top and the least at the bottom; this is the number of people in each of those papers. If you're a conservative doctor or patient, you can draw the line here and say: I only believe it if there are 3 papers. Then the odds are 3% instead of 9% of getting Alzheimer's. If I'm a young whippersnapper doctor who believes everything in the literature, I follow this all the way down, through the study with 170 people, and my odds are 1%. I'm not going to tell you which of these is right or wrong; you can decide, given the confidence levels and your conservative nature, or not. So that's what we call a
riskogram. Each arrowhead represents a probability, starting from the prevalence. Just figuring out where these start is the hardest part: we don't have any master list of how many Americans have each disease. We don't. We have the DoD and the VA, some HMOs, Medicare and Medicaid, but we don't have a master list of how many people have which disease. So just getting the prevalence is next to impossible; we googled each one of these to figure out someone's best guesstimate. That's the pretest probability. And by the way, for the real aficionados in this field: because we think in pretest and post-test probabilities, and not odds ratios, that whole scary incidentalome problem goes away.
By sequencing genomes, I'm going to see random stuff that will make me spend a lot of health care dollars to prove you don't have some rare thing. Let me be really crystal clear: if there is a one-in-a-million disease, a one-in-a-million chance of getting it, and now I tell you you're 100 times more likely to get it, it's still a 1-in-10,000 chance that you're going to get it; that's close to rounding down compared to the stuff that will actually kill you some day. If I spend money on preventing anything, it will be on the things at the top of the list.
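The arithmetic behind that point is just a pretest-to-post-test update. A toy version, with illustrative numbers only:

```python
def posttest_probability(pretest, likelihood_ratios):
    """Convert a pretest probability to odds, apply each study's
    likelihood ratio, and convert back to a probability."""
    odds = pretest / (1 - pretest)
    for lr in likelihood_ratios:
        odds *= lr
    return odds / (1 + odds)

# A 9% baseline risk, revised downward by two believable studies.
print(posttest_probability(0.09, [0.6, 0.55]))  # ~0.032, i.e., about 3%
# A one-in-a-million disease, even at "100x more likely":
print(posttest_probability(1e-6, [100]))        # ~1e-4, still 1 in 10,000
```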
So what are we going to do about the risk? I talked about the environment; I talked about pesticides. I think we can start to figure out which environmental influences affect our risk of getting these diseases. I think the environment is the new prescription pad for the doc. In other words: what can I do to compensate for my genome? I love gene therapy and changing my DNA and all that, but that's still years away, the last time I looked. What can I do to my environment to change or influence my risk? We came up with a figure like this, and the way to read it is: the diseases are sized by font, so a bigger font size means a higher likelihood of getting it, and if you end up with this one, you might end up with that one. All along the edge here are known published factors that influence the etiology of each disease.
We came up with a trick to mine the hell out of the MeSH terms in PubMed. Everyone knows about the MeSH terms; some are about etiology and adverse events. We mine those to figure out, in an automated way, how to draw this figure.
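A toy version of that MeSH trick, with illustrative records standing in for real PubMed annotations:

```python
from collections import Counter

# Each record: (disease the paper is about, MeSH descriptor, MeSH qualifier).
records = [
    ("Parkinson Disease", "Pesticides", "adverse effects"),
    ("Parkinson Disease", "Pesticides", "toxicity"),
    ("Parkinson Disease", "Smoking", "etiology"),
]
ETIOLOGIC_QUALIFIERS = {"etiology", "adverse effects", "toxicity"}

edges = Counter(
    (factor, disease)
    for disease, factor, qualifier in records
    if qualifier in ETIOLOGIC_QUALIFIERS
)
for (factor, disease), n in edges.most_common():
    print(f"{factor} -> {disease}: {n} supporting papers")
```

Count enough of those edges across PubMed and you can draw the figure automatically.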
So: Steve Quake, you shouldn't smoke. You should exercise. You should watch your diet. You shouldn't drink too much alcohol. You probably didn't need a genome for those four. But Steve Quake, you have a slightly higher risk of Parkinson's disease, and in 10 papers, pesticides influence the etiology of Parkinson's. This is crude, this is qualitative, this is a first pass. But this is a way I think docs and patients could talk about their risks: we're doing great on exercise; let's work on your diet and come back in six months, and then we'll work on your alcohol. I think doctors and patients are supposed to talk this way with each other. So, just in closing, to leave time for some questions, let me skip forward a little bit.
Take-home points: molecular, clinical, and epidemiological data are out there, and the tools exist. We can get to diagnostics, therapeutics, and disease mechanisms. Integration is an incredibly powerful mechanism: why trust the results from one lab when you can see what is in common across 100? Personalized medicine is great when you think about DNA, but it's greater than or equal to DNA: it includes the environment, it includes clinical measurements and molecular measurements besides DNA. We shouldn't equate those two. And I have to encourage you to think of bioinformatics as more than just building tools. Think carefully about whether the world needs yet another database-backed website doing something that people have already done. If you know your tool best, I encourage you to use the tool yourself. Don't just show people the URL of the website; show the world what you found with your tool. I think you'll get uptake, and you'll be able to change the world that way.
The hardest part for me is this: how do we find new investigators in this field, willing to bet a career on publicly available data? Because unfortunately, when I see the biology grad students going to their first rotation in grad school, they go to the lab, start a rotation project, pick up someone else's data, and they get frustrated when they can't get the experiment to work. And what does the principal investigator say? "If you didn't collect that data, you can't trust it." What a negative way to do science today, when we have thousands, tens of thousands of well-funded labs generating data on taxpayer dollars, and the data is already publicly available. Again, it's not perfect. Many people are still not sharing,
and they should be. But the data is out there, and I just encourage you to think about that data as high-value data to launch these kinds of questions. I'm always looking for postdocs, and we are also hiring faculty. I have a long list of collaborators: a lot of labs in Tokyo and at other universities, and many others. And I can say I'm blessed with 16 NIH grants from these 9 institutes of NIH. These four give me more; these five give me less, but I still love them. And I still don't hit that $1.5 million threshold that brings me in front of the inquisition committee. So, a lot of this is good citizenship, participating in other people's grants: the March of Dimes, HP, internal cancer center grants and seed grants. And I thank my admin and tech staff wherever I go; I'd never get a good grant or paper out the door without them. And I always thank my wife, who starts companies with me now, reads every major paper and grant that comes out of my lab, and lets me go all over the place to give talks like this.
>> So, you told us before about the innovations you made, and some of the people you worked with, the innovations they made in text mining that you alluded to. Can you speak more about that? >> We are not strictly using NLP; we use more crude tools. How do you figure out what is in a million microarrays? You can't human-eyeball those. So we use concept identification to figure out what data is in those; the postdocs have to write that annotation when they upload. So we use classifications like SNOMED and ICD-9 and human-eyeball them to figure that out. We use the terminologies for the identification. But some of these things: by the time you figure out how to read the tables, we will have already curated the literature. So I said, I can't solve this computationally; I'm just going to hire a team to do it. We want to do the science with the results of that thing, not just figure out how to get it done in a computational way.
Let's go to you next. >> You said publicly available datasets and publicly available information, and you create derivative datasets from that, like the 130 microarray experiments. Do you post those as publicly available datasets for other people to download and ask questions of? >> Yes, that's a great question. The question is: do we reissue the 130 datasets? In general, we don't, because, to be honest, there is no real place to put new annotations on that data. To be strictly clear, an investigator might say: here is an ob/ob mouse and a wildtype. We read this as diabetes and normal. So we have our own annotations, and right now there is no place for people to just reannotate that kind of data. I think folks have been talking about building distributed annotation systems for data. As for the raw data itself: to help reproduce experiments, we give out the R code to do this, but we don't want to step on the original investigators' data; we want people to cite their data, really. But I don't know if I answered your question. >> Not really. >> Let me try again. >> The original microarray datasets were not in a sophisticated database. They were just posted on
websites and made available to other researchers, and I think that would be extremely valuable, because it's really hard to get large datasets out, with the normalizing and all that. >> I agree. You're right. A lot of times, for a lot of the papers, we put things like the expression matrix on our lab wiki, because there is no other official place to put that kind of stuff. I think what you're getting at is that it is so hard to do; I'll admit I made this seem easier than it is. Mapping the annotations of a chip from six years ago to today: these things are hard. In general, our methodology is to share the tools we used for our findings. So we give out the tools to map the arrays, and we have a GEO-based browser; we give out those tools. But you're right: in general, this is still too hard to do. We were just talking about this earlier; there is still no easy tool to just intersect this data. All I'm saying is, I'm not really going to say it's easy to do today. What I'm trying to convince you of is that it is possible to do, and I think if we can see the value of doing this, people will come and build the tools. I want more science to be done this way, to even justify those tools, beyond just my lab doing it.
>> You mentioned, when you built your database, that you were using the NHGRI GWAS catalog and recent studies. So the shortcomings, the things we didn't find in the GWAS catalog: did they fix those? And the second question is: as you are processing the papers, are there any advances in how we publish papers and how we tag them better? >> That's a great question. The GWAS catalog itself: I gave a static snapshot of it, but it continues to grow and expand. We were just talking earlier today with folks from the GWAS catalog about how to improve the methodology and learn from what we have done. And NIH just embarked on an initiative to fund, extramurally, a consortium to build a database that goes beyond GWAS: so, for example, the variants that have some clinical testing, or pre-approval, or research-based approval, or beyond that to the research tests. There is an initiative coming out imminently to fund a larger group of extramural folks to do this. In the meantime, the GWAS catalog is great because at least it's the set of GWAS studies that are there, but we all have problems in how to
curate this data. Let me make a point: I have the highest respect for the GWAS catalog. A lot of our problems come from the researchers publishing the papers, and some quite deliberately don't even tell us which SNPs they are finding in their studies. They show a Manhattan plot and don't mention which allele was which. Neither the GWAS catalog nor any initiative, automated or not, will fix it where the investigators are deliberately trying not to share that data. I think, if I had to put attention on anything, it would be that: that researchers publish correct papers, with the right genotypes and the right alleles, and that we put that data out there so other initiatives can start to use
this data as well. Long answer. >> I've got two questions. The first one relates to the diabetes work: is there going to be a diabetes cure? And the second one: how do you read 100,000 papers? >> So, we read 10,000, and we will read 100,000 soon enough. Is this a cure for diabetes? No one wants to use the word "cure"; we like "therapies." I think it's interesting that there is something global out there, across these species, that we missed.
And it's another thing I've realized: when people run microarray experiments, a lot of times those biologists are expecting particular answers from that experiment. Then they see something there they were not expecting, and they say: that's inflammatory schmutz, or something wrong with the sample. And that was the signal here. So I'm pretty high on this particular molecule, and I think we should develop drugs against it. There is no published small molecule for CD44. There is maybe something in a pharmaceutical company's library somewhere. NIH is getting particularly good at releasing and opening up compound libraries, for repositioning, through NCATS and Chris Austin's efforts, so we can get to the compounds and screen. Is it a cure? I don't know. We could use more therapies. In the meantime, the reason I'm not a diabetes doctor is that I'm busy working on the next 10 diseases, because I want to be agnostic that way, and hope that at some
point the pharmaceutical companies will want to do something with these. What was the second question? >> How do you do the 10,000 papers? >> Manually; we eyeball them. You can crowdsource this, outsource this, offshore this. People all over the world want to read these papers for you; it's just a fact of life. The younger folks know about, for example, Amazon Mechanical Turk. We didn't use that; this needs a higher level of curation. To be really crystal clear, I tried twice to get NIH funding to build this resource and couldn't get any funding for it. So I used my endowment funds to build it and hired a company to start reading these papers. And I didn't make a big deal of it. I said: we need this database; I'm not going to figure out how to do this in an automated way; I need this data. >> Is there a website yet for reading the papers? >> How to read these papers is tricky. You want to know the truth? We curated all of these four times. We started with a high school kid.
A high school kid can read 50 papers over a summer, and you learn what you think you're going to get out of this, because we didn't even start with a data model. He said the papers were interesting, and we decided it was possible to do. So we hired a group to do the next thousand. Then: this is great, but we should have written down this and this. So do it again, and again. Why am I saying this? Because sometimes, when we try to set up the perfect data model, it takes us 5 years of committees; meanwhile, we have curated 5 times. At some point, just start doing it some way, and don't let the perfect be the enemy of the good. I think I'm blocking you from the drinks next door, so we'll call it quits. >> Thank you. And I'd also like to thank FAES for sponsoring this; there is also a reception in the library right now. So join us, and we'll have further conversation. Thank you.