WEBVTT 1 00:00:04.660 --> 00:00:17.940 E3410 x8539 Conf Room: Good morning, everyone. My name's Simon Malcomber. I'm the Deputy Assistant Director for the Biological Sciences here at the National Science Foundation. It's my pleasure to welcome you all to this morning's 2 00:00:17.940 --> 00:00:35.630 E3410 x8539 Conf Room: BIO Distinguished Lecture. So the NSF Biological Sciences Directorate supports cutting-edge biological research that spans the different scales, 3 00:00:36.651 --> 00:00:56.659 E3410 x8539 Conf Room: temporal and geographical scales, and it also supports the physical and human infrastructure to be able to conduct that research. So the aim of the Biological Sciences Distinguished Lecture Series is to bring in some researchers who have cutting-edge research to share with us. 4 00:00:56.660 --> 00:01:25.769 E3410 x8539 Conf Room: And often this research is not completely aligned with the biological sciences, as we have today, recognizing that a lot of the research that happens in other disciplines can also play a role in biology. So it's my pleasure today to welcome today's lecture, which is presented by the Division of Molecular and Cellular Biosciences, and we're going to be bringing in 2 lecturers today. We have Dr. Scott Jackson and Dr. Ethan Pickering, 5 00:01:26.151 --> 00:01:55.530 E3410 x8539 Conf Room: both of whom are at Bayer Crop Science. And did you have a celebration when Bayer Leverkusen won the Bundesliga or not? Okay, good. I'm pleased you celebrated that. So Dr. Jackson is the genetic pipeline design lead at Bayer, where he leads a team of researchers who work to design optimal crop improvement strategies. He holds an MS and a PhD from the University of Wisconsin-Madison 6 00:01:55.690 --> 00:02:16.910 E3410 x8539 Conf Room: and conducted a postdoctoral fellowship with the University of Minnesota. 
Prior to joining Bayer, he held faculty positions at Purdue University and the University of Georgia, leading the Center for Applied Genetic Technologies at UGA, and his research is focused on understanding the evolutionary history of plant genomes, allowing us to better engineer crops for the future. 7 00:02:16.910 --> 00:02:27.059 E3410 x8539 Conf Room: And this work has been funded by several NSF awards, not surprisingly, including those focused on soybeans, rice and peanuts. Peanuts: always a good idea when you're in Georgia. 8 00:02:27.060 --> 00:02:44.150 E3410 x8539 Conf Room: Dr. Ethan Pickering leads Bayer's AI genomics modeling team, where his work focuses on building novel AI models, architectures, and other tools that overcome challenges in crop genomics. He is a lecturer at MIT, where he previously had a postdoc, 9 00:02:44.465 --> 00:03:02.469 E3410 x8539 Conf Room: and he also holds degrees from Case Western Reserve University and Caltech. So, as I mentioned, the research in biology can be advanced by tools developed in other disciplines. And one such tool, which we're hearing a lot about these days, of course, is the focus of today's lecture, which is artificial intelligence, or AI. 10 00:03:02.822 --> 00:03:15.520 E3410 x8539 Conf Room: So today Scott and Ethan are going to discuss how they combine the fields they cover, AI and biology, to advance plant and animal breeding, and how the 2 fields can work in concert. 11 00:03:15.660 --> 00:03:29.510 E3410 x8539 Conf Room: And during the talk they will weave in their career journeys and work that highlights the ways in which AI and biology can work together. So it's my pleasure to welcome Scott and Ethan to NSF. 12 00:03:29.780 --> 00:03:31.430 E3410 x8539 Conf Room: Okay, thank you. 13 00:03:32.560 --> 00:03:33.420 E3410 x8539 Conf Room: Her pitch 14 00:03:34.050 --> 00:03:44.569 E3410 x8539 Conf Room: perfect. Great to be back here. 
I guess the one thing that was not in my bio, and it's actually not on my bio slide either, is that I did a couple of stints here at NSF as a rotating program officer. 15 00:03:44.660 --> 00:03:46.969 E3410 x8539 Conf Room: Some of the people in the room were 16 00:03:47.240 --> 00:03:54.203 E3410 x8539 Conf Room: here when I did those. They were in Arlington at the old place, where security was much easier. But just 17 00:03:56.750 --> 00:04:13.409 E3410 x8539 Conf Room: there's some message at the top there. Maybe just a quick, you know, I spent 19 years in academia and the last 5 years in industry. And one thing I learned about industry is, you start almost every talk or presentation with a bio slide: your journey, and how you got there. 18 00:04:13.820 --> 00:04:23.879 E3410 x8539 Conf Room: And so I have this up, and I didn't have the soccer club Bayer Leverkusen, which just won, for the first time in their 100-plus-year history, the German championship, football slash soccer, 19 00:04:24.100 --> 00:04:27.040 E3410 x8539 Conf Room: which is really cool. So 20 00:04:27.310 --> 00:04:38.349 E3410 x8539 Conf Room: I did my graduate work at Wisconsin, worked on potato and a number of other things, working on chromosome biology, cytogenetics, but within a plant breeding program. So my PhD's in plant breeding, but I did a lot of work on chromosome biology. 21 00:04:38.830 --> 00:04:44.630 E3410 x8539 Conf Room: I did a postdoc at Minnesota. I've got the football, or rather the mascots, for all the places I've been up there. 
22 00:04:44.920 --> 00:04:47.520 E3410 x8539 Conf Room: So I was a Gopher for 2 years. 23 00:04:48.084 --> 00:04:58.609 E3410 x8539 Conf Room: For those that haven't lived in Minneapolis, it's cold in the winter, very cold, and up there I worked on wild rice, and that started 24 00:04:58.650 --> 00:05:06.420 E3410 x8539 Conf Room: part of my future lab that worked on rice species, wild relatives of cultivated rice, and how you might use information from those uncultivated relatives to improve rice. 25 00:05:07.310 --> 00:05:14.519 E3410 x8539 Conf Room: I got a faculty position at Purdue in 2001, started the exact same day that Cliff Weil started on the faculty at Purdue. 26 00:05:15.177 --> 00:05:17.440 E3410 x8539 Conf Room: I know he looks much older. 27 00:05:17.867 --> 00:05:29.960 E3410 x8539 Conf Room: And they hired me to work on soybean. The only thing I knew about soybean when I went there is that when you drove down the highway in the Midwest, the tall stuff was corn and the short stuff was probably soybean, and that's literally all I knew about soybean. 28 00:05:30.100 --> 00:05:36.199 E3410 x8539 Conf Room: And they took a chance and hired me, and I spent the next 15-plus years working on soybean and other legumes. 29 00:05:36.370 --> 00:05:37.600 E3410 x8539 Conf Room: And 30 00:05:38.790 --> 00:05:52.650 E3410 x8539 Conf Room: one interesting aspect: this was the early days of sequencing. We were part of the group that helped sequence soybean and some other legumes. And it sort of ties into some discussions we've been having this morning around workforce training and machine learning and AI, and bringing that to bear on biological questions and problems. 31 00:05:52.780 --> 00:05:55.010 E3410 x8539 Conf Room: As we were generating all this genomic data, 32 00:05:55.370 --> 00:06:06.280 E3410 x8539 Conf Room: biologists were generating it, but didn't know what to do with it. 
So how do you get the mathematicians and computer scientists and data scientists to have an interest in this biological problem? So we went through the same learning curve, but 20 years ago. 33 00:06:07.100 --> 00:06:14.850 E3410 x8539 Conf Room: I went to Georgia, became a Bulldog in 2011 after a year here, my first stint as a rotating program officer, 34 00:06:16.093 --> 00:06:25.598 E3410 x8539 Conf Room: and finally I was at a place that won a national championship with their football team, and I worked on peanuts, another legume. Why is that that bear correct? Uga! 35 00:06:26.430 --> 00:06:36.690 E3410 x8539 Conf Room: I joined Bayer 5 years ago this August, and I think my bio is a little bit off. I actually lead the North America soybean and cotton pipelines now, as of a year and a half ago. 36 00:06:37.150 --> 00:06:49.109 E3410 x8539 Conf Room: So on the R&D scale, I'm more on the development side of it. I now spend a lot of time with commercial partners and the growers talking about our products, what it is that they need, and how we're gonna deliver those from a genetic perspective. 37 00:06:49.680 --> 00:07:10.650 E3410 x8539 Conf Room: So this is my academic career in one slide: 19 years, a bunch of students and postdocs, not all of them. My training is in plant breeding, my passion was chromosome biology, and, as you start sequencing genomes, being able to tie that to the sequence to understand how chromosomes behave, how they function, what the structure is. 38 00:07:10.660 --> 00:07:16.440 E3410 x8539 Conf Room: And then one of my other passions is polyploidy, which is prevalent in plants. So we did a lot of sequencing of polyploid plants, 39 00:07:16.480 --> 00:07:30.152 E3410 x8539 Conf Room: looking at the structural fate of polyploid genes, or the fate of genes in polyploids, and other aspects of how polyploids evolve, and then trying to use that information to understand how to improve 
crops in a more efficient way. 40 00:07:33.170 --> 00:07:36.110 E3410 x8539 Conf Room: I moved to Bayer in 2019. 41 00:07:36.810 --> 00:07:42.509 E3410 x8539 Conf Room: And this is getting to the purpose of the talk today. So, talking about 42 00:07:43.890 --> 00:07:54.240 E3410 x8539 Conf Room: a background in plant breeding and genomics for a number of years: when I was first hired at Bayer, I was leading a group in R&D focused on, how do we use genomic information? 43 00:07:54.590 --> 00:08:03.430 E3410 x8539 Conf Room: Where do we generate that genomic information? And how do we use it more efficiently? What tools do we put on top of that to make better decisions in breeding pipelines, to get the products to our growers that they want? 44 00:08:04.260 --> 00:08:13.730 E3410 x8539 Conf Room: And I very quickly realized that the scale, the scope, the pace of everything that happens in industry is dramatically different than in academia. 45 00:08:14.216 --> 00:08:29.909 E3410 x8539 Conf Room: When you think about a genetic experiment in academia, it's 3 reps, 3 locations, 3 years. We don't talk about those numbers at all. You know, we're talking 60 to 80 reps a year and tens of thousands of genetic entities within those reps, 46 00:08:30.450 --> 00:08:41.730 E3410 x8539 Conf Room: and having genetic information on all those. And so what we have is a massive pipeline pushing millions, hundreds of thousands, of progeny through on an annual basis 47 00:08:41.870 --> 00:08:44.530 E3410 x8539 Conf Room: in various steps of that pipeline. 48 00:08:44.720 --> 00:08:45.850 E3410 x8539 Conf Room: And if you 49 00:08:46.310 --> 00:08:56.360 E3410 x8539 Conf Room: think about breeding, basically, it's a large funnel. You create a bunch of progeny. You take them through various cycles of testing to get down to the very few that you want at the end. 
That's sort of like looking for a needle in the haystack. 50 00:08:56.840 --> 00:09:11.110 E3410 x8539 Conf Room: You create a huge pile of hay, and you want to find that one winner. So you spend the next 10 years after you create this huge pile trying to figure out which of these hundreds of thousands you created is going to be the one that's going to be a successful variety or hybrid. 51 00:09:11.820 --> 00:09:18.830 E3410 x8539 Conf Room: And we generate a lot of data, genotyping things along the way, sequencing things along the way, collecting phenotypic data. 52 00:09:18.870 --> 00:09:23.959 E3410 x8539 Conf Room: And you can begin to build automation and tools around that to connect things together and be able to 53 00:09:24.290 --> 00:09:32.140 E3410 x8539 Conf Room: impute genetic information, infer what the phenotype might be based on relatives, progeny, or grandparents of that entity. 54 00:09:32.480 --> 00:09:39.960 E3410 x8539 Conf Room: And so we built a lot of resources. We hired a lot of data scientists and computer scientists to help build this infrastructure, these models, and tie these things together. 55 00:09:40.200 --> 00:09:42.939 E3410 x8539 Conf Room: But at the end of the day we're still looking for that needle. 56 00:09:43.070 --> 00:09:55.410 E3410 x8539 Conf Room: And so we get a little bit more efficient using these things to find the needle. We're still making hundreds and hundreds of thousands of progeny, we're genotyping them, we're testing them, trying to get down to those few needles that we want to move forward. 57 00:09:56.740 --> 00:10:06.769 E3410 x8539 Conf Room: So maybe just on this slide here, that thing that looks like a cross section of a brain is actually a representation of our maize germplasm based on genetic information, 58 00:10:06.990 --> 00:10:15.010 E3410 x8539 Conf Room: and it looks like 2 lobes of a brain. 
Those are the male and female heterotic pools. So we create hybrids, and those are the 2 pools that we breed within. 59 00:10:17.020 --> 00:10:27.299 E3410 x8539 Conf Room: So as you can imagine, over the past 20 years, as we built and scaled this infrastructure to try and find these needles in this massive amount of plant entities that we generate, 60 00:10:27.470 --> 00:10:36.279 E3410 x8539 Conf Room: we created a lot of automation to collect the data we need, everything from the genetics all the way down to how they perform in the field. 61 00:10:37.350 --> 00:10:49.180 E3410 x8539 Conf Room: And so we have centers where the seeds are sent and the seeds are chipped: they take a small section out of a seed, they genotype that section, and then we move that seed forward either into 62 00:10:49.350 --> 00:10:54.760 E3410 x8539 Conf Room: a waste can, because we don't want to plant it, or we put it in a greenhouse or field, based on the genetic information that we get from that chip. 63 00:10:55.080 --> 00:10:59.049 E3410 x8539 Conf Room: And this is all automated in central lab facilities. 64 00:11:00.880 --> 00:11:21.199 E3410 x8539 Conf Room: Once we know, of the millions of seeds that we chip annually and get genetic information on, which of the hundreds of thousands we want to actually plant, those get sent to a central packaging facility, which looks a lot like an Amazon warehouse. It's conveyor belts. It's automation. These things come in. They get packaged into what we call cassettes. 65 00:11:21.440 --> 00:11:26.219 E3410 x8539 Conf Room: The cassettes then get sent out to planting centers around the world, 66 00:11:26.760 --> 00:11:38.300 E3410 x8539 Conf Room: and they're planted in the fields. Bottleneck assessment. We know where every plot, every seed is; we know the genotype of everything in that field. 
And we know where it is geographically, geolocated. 67 00:11:39.090 --> 00:11:55.070 E3410 x8539 Conf Room: And then we collect data throughout the season. So how does it perform? How does it perform under stress? How does it perform with various disease pressures? We fly UAVs, or drones, to collect that. When does it flower? When does it mature? When is it setting seed? All these other things. We collect all this data. 68 00:11:55.920 --> 00:12:07.769 E3410 x8539 Conf Room: So we start with millions of progeny which we genotype, plant hundreds of thousands to start collecting phenotypic information, and over the next 7, 8, 9 years winnow those hundreds of thousands down to the 10 or 20 that we're gonna move towards commercial products. 69 00:12:08.750 --> 00:12:10.290 E3410 x8539 Conf Room: It's an expensive process. 70 00:12:10.390 --> 00:12:12.789 E3410 x8539 Conf Room: It generates lots and lots and lots of data. 71 00:12:13.590 --> 00:12:18.130 E3410 x8539 Conf Room: A lot of this is automated within large greenhouses. So the one here in Marana 72 00:12:18.270 --> 00:12:23.319 E3410 x8539 Conf Room: is 5 or 10 acres, I can't remember. It's 10 acres under glass, 73 00:12:24.227 --> 00:12:35.619 E3410 x8539 Conf Room: all automated to start cycling populations more rapidly, to move the genetics of a population by doing multiple cycles per year rather than one cycle per year planting in the field. 74 00:12:35.975 --> 00:12:40.090 E3410 x8539 Conf Room: So we can move the genetics of a population more quickly and then move them out into the field for testing. 75 00:12:42.020 --> 00:12:53.179 E3410 x8539 Conf Room: So if you think about breeding over time, going back to domestication thousands of years ago, where people were picking things where the seeds didn't fall on the ground, so we got non-shattering. Those are sort of major changes 76 00:12:53.430 --> 00:13:03.690 E3410 x8539 Conf Room: to breeding. 
In the early 1900s we started applying statistical models. Hybrid seed was first developed in the 1920s, 1930s, and commercially in the 1940s, 1950s. 77 00:13:04.330 --> 00:13:06.990 E3410 x8539 Conf Room: We started applying 78 00:13:09.030 --> 00:13:19.480 E3410 x8539 Conf Room: modern harvesting tools, capturing yield as it comes off the harvester. We started doing molecular markers in the nineties, and really full blast in the 2000s. 79 00:13:20.050 --> 00:13:23.489 E3410 x8539 Conf Room: And those are sort of like evolutions in how we've done plant improvement. 80 00:13:24.510 --> 00:13:26.640 E3410 x8539 Conf Room: At Bayer, Monsanto, 81 00:13:27.190 --> 00:13:30.530 E3410 x8539 Conf Room: Bayer bought Monsanto 5 years ago, so it's Bayer now, 82 00:13:31.040 --> 00:13:42.449 E3410 x8539 Conf Room: but they sort of break it into breeding 1.0, 2.0 and 3.0. 1.0 is just: they acquired a lot of genetics and germplasm and companies to get the genetics, to get those tools to start creating those winning varieties. 83 00:13:43.150 --> 00:13:52.980 E3410 x8539 Conf Room: Breeding 2.0 and 3.0 are really about increasing the precision: knowing where you're planting things, predicting where you want to plant them based on what the expected performance is. 84 00:13:53.360 --> 00:14:01.900 E3410 x8539 Conf Room: Breeding 3.0 is really around the digital enablement: all the automation around seed chipping, getting genetic information on all the millions of progeny at the very beginning 85 00:14:01.910 --> 00:14:05.680 E3410 x8539 Conf Room: to know which ones you want to plant in those initial stages of testing. 86 00:14:06.660 --> 00:14:15.080 E3410 x8539 Conf Room: And the phase we're in now, and this is where Ethan's gonna take over here in a minute, is really thinking more about design. 
87 00:14:15.160 --> 00:14:29.509 E3410 x8539 Conf Room: So can we flip this breeding strategy from creating millions of progeny and trying to get down to those 10 that are gonna be the winners? Can we think more intentionally about how we create those populations at the beginning, knowing what our growers need? And can we design the genetics more intentionally, 88 00:14:29.570 --> 00:14:33.190 E3410 x8539 Conf Room: using modern tools and all the data that we've generated over the past 10 years, 89 00:14:33.220 --> 00:14:37.719 E3410 x8539 Conf Room: to increase the chances 90 00:14:37.890 --> 00:14:42.402 E3410 x8539 Conf Room: and reduce the haystack, to get those needles that are gonna be the winners in the growers' fields? 91 00:14:43.362 --> 00:14:47.797 E3410 x8539 Conf Room: So with that, I'm gonna turn it over to Ethan. 92 00:14:49.294 --> 00:14:57.205 E3410 x8539 Conf Room: All right, thanks. Really excited to be here. I've written a number of, you know, NSF proposals and things like that, and seeing this 93 00:14:57.520 --> 00:15:06.962 E3410 x8539 Conf Room: all over the place, and having the opportunity to actually come and talk is so nice. Oh, thanks, that's gonna help a lot. So 94 00:15:07.500 --> 00:15:31.920 E3410 x8539 Conf Room: let's see, I think we have a couple of slides to push through here. I just wanna quickly acknowledge: I get to lead an AI genomics research team right now at Bayer, with a number of different PhD researchers who've done phenomenal work over the last year or two. Just wanna make sure I mention them: Bobby and Katie Alexis Katiana, shiny but kouchering. So, a little bit about myself, since we always have these timelines and 95 00:15:33.025 --> 00:15:45.030 E3410 x8539 Conf Room: Scott gave a little bit of background, so I'll do it as well. Even though we look the same age, mine's a lot more abbreviated in time, and let me move something here real fast. 
96 00:15:45.030 --> 00:16:09.310 E3410 x8539 Conf Room: So my youth was actually in agriculture. I grew up on a vegetable farm in Ohio, and we were primarily growing sweet corn. I really enjoyed it a lot. But I started to recognize that biology was very unpredictable, complex, and going all over in different directions. But some of the machines that we were using, and kind of the engineering that was around agriculture, 97 00:16:09.310 --> 00:16:34.050 E3410 x8539 Conf Room: was much more predictable, and something that you could fix. You could really fix a broken implement or something, whereas the biology had a little bit more luck involved. So I decided to take a route into engineering at Case Western, and then Caltech and MIT. And this was really looking at being much more based in math and physics, with calculus to explain the physics, and then AI to explain some of the other 98 00:16:34.050 --> 00:16:45.019 E3410 x8539 Conf Room: components. And this was all about trying to learn: how are we going to be able to predict something, build some model to predict an outcome? And if you can predict it, 99 00:16:45.090 --> 00:17:10.160 E3410 x8539 Conf Room: then you can start designing very intentionally. And then I had the unpredictable move: I got a call one day about a position at Bayer, and whether or not I'd be interested in starting to go back into these messy, complex biological problems that are not so predictable. So it's been a very uncomfortable jump into the unpredictable aspect, but it's been a lot of fun. And so one of the things about 100 00:17:10.190 --> 00:17:12.640 E3410 x8539 Conf Room: jumping into the biological domain: 101 00:17:12.849 --> 00:17:21.630 E3410 x8539 Conf Room: one of the questions that I get very often, it's consistent, and I have to wrestle with it every day. I'll get the question: 102 00:17:22.480 --> 00:17:47.827 E3410 x8539 Conf Room: can you interpret your model? Can you give us the interpretation of your model? 
And generally that answer today is going to be no. It's a nonlinear AI model. No, I cannot give you an interpretation, not today. But that's not necessarily the purpose. It's for prediction, not necessarily interpretation. So I'm gonna make a couple of arguments about why that's particularly important here. 103 00:17:48.320 --> 00:17:50.240 E3410 x8539 Conf Room: Oh, this isn't clicking anymore. 104 00:17:51.020 --> 00:17:51.725 E3410 x8539 Conf Room: So 105 00:17:52.660 --> 00:18:08.979 E3410 x8539 Conf Room: with a background of playing around in physics for a long time and being very interested in physics and calculus, I think it's interesting to look back at how physics changed over time and how it developed. So for most of history, physics was a field of philosophy. 106 00:18:09.609 --> 00:18:19.520 E3410 x8539 Conf Room: So there were 3 branches. You had physics, then you had logic and ethics, and if you were to propose anything in physics, you had to reason with that between ethics and 107 00:18:19.520 --> 00:18:44.419 E3410 x8539 Conf Room: logic, and human experience. And so you were not able to propose something unless you could interpret it and explain it within all 3 parts of the field. And so this was a very qualitative over quantitative approach to how physics was described, and that was for 2 millennia, starting with Aristotle and the Aristotelian physics all the way up until Copernicus. Copernicus and Galileo were starting 108 00:18:44.420 --> 00:18:50.249 E3410 x8539 Conf Room: to change some things, but it was really Newton and Leibniz, when they introduced calculus, 109 00:18:50.270 --> 00:18:57.060 E3410 x8539 Conf Room: and calculus absolutely transformed the way that physics moved forward and how things were designed. 110 00:18:58.260 --> 00:19:06.960 E3410 x8539 Conf Room: But calculus was not necessarily seen as golden. It wasn't perfect right off the bat. 
So 111 00:19:07.170 --> 00:19:35.409 E3410 x8539 Conf Room: like neural networks and AI, this lack of interpretability also plagued calculus when it was originally introduced. And I really like this quote here, that calculus is often taught as if it is a pristine thing, emerging Athena-like, complete and whole, from the head of Zeus. It is not. It took over 200 years for us to actually create the foundations of modern calculus, and there was a lot of concern about how it worked. 112 00:19:35.410 --> 00:19:59.869 E3410 x8539 Conf Room: So in particular, Newton and Leibniz said, hey, here's a tool. It predicts particularly accurately. It works very well, and it works very effortlessly. But they couldn't articulate or explain or interpret this to the various philosophers and physicists of the seventeenth century. And so a lot of people pushed back on this, and really what Newton and Leibniz said back was, well, this wasn't exactly our goal. 113 00:20:01.340 --> 00:20:08.203 E3410 x8539 Conf Room: But the engineers, and I'm an engineer, so I really like this kind of approach, said, well, whatever, that's fine. 114 00:20:08.640 --> 00:20:33.749 E3410 x8539 Conf Room: We don't necessarily care about the interpretability, but if we can predict accurately, we can design. And this is gonna be really nice, and we can move forward. And it was this interaction between those new designs, those new steps that engineers took, that provided a lot of the data that essentially created the foundations of calculus, which took about 200 years before we had the modern calculus that we 115 00:20:33.750 --> 00:20:38.709 E3410 x8539 Conf Room: work with now. So I think the purpose here is to really mention that, I think, 116 00:20:39.670 --> 00:21:03.910 E3410 x8539 Conf Room: and this is a provocative statement, that interpretability is not necessarily the goal of what we're trying to do with AI, but that prediction is the goal. 
And here's my statement: I believe neural networks, or AI, will be to biology what calculus was to physics. It provides us a way to start interpreting, or predicting, from some input variables, some downstream output variables. 117 00:21:04.398 --> 00:21:31.510 E3410 x8539 Conf Room: And there's a particular reason why neural networks, I think, are unique and useful for biology, versus calculus with physics. That's because physics has a ton of classical laws, and the universe is always seeking equilibrium. So it's kind of a ball that's rolling down the hill the entire time. It's relatively elegant, and calculus is also very elegant. 118 00:21:31.510 --> 00:21:44.479 E3410 x8539 Conf Room: When we look at biology, we don't have all these laws. And I really like this. This is something pulled out of a dissertation from 2022 from a Caltech student: 119 00:21:44.770 --> 00:22:12.149 E3410 x8539 Conf Room: that life perpetuates its existence out of equilibrium, against the will of the second law of thermodynamics. And I think that aspect there, against the will of what thermodynamics wants to do, is why biology is so complex, and why we've had such a hard time understanding it with other tools like calculus, because it is fighting. And if you've ever seen a fight, it's never elegant. It's always something crazy. It's going up this 120 00:22:12.220 --> 00:22:16.580 E3410 x8539 Conf Room: incline. So, as complex as biology is, 121 00:22:16.690 --> 00:22:23.229 E3410 x8539 Conf Room: neural nets and AI are as complex, so I think it's the right tool for us to start predicting. 122 00:22:24.979 --> 00:22:48.370 E3410 x8539 Conf Room: So now this kind of gets more into just the general motivation of why we're doing this in agriculture. We know that agriculture must adapt faster than ever. We have a number of different pressures going on. We have a massive population increase. It's gonna require a 60% increase in agricultural production. 
We have ever-changing growing conditions that we have to deal with. We have 123 00:22:48.890 --> 00:23:05.810 E3410 x8539 Conf Room: larger spreads of disease due to globalization. We need to make sure, with regulations, that we meet the societal demands for how food is produced. And finally, we have to do all this somehow, the 60% increase and all those other constraints, without blowing up the planet. 124 00:23:06.328 --> 00:23:27.319 E3410 x8539 Conf Room: And when we look at generally what we have with respect to data in agriculture, I think we have a really great opportunity to start accelerating even faster how we start designing, because of all the different data sets that are popping up across the planet and the different opportunities that we can hopefully pull from that data. 125 00:23:29.240 --> 00:23:34.309 E3410 x8539 Conf Room: Now, Scott was mentioning this, so I'll go somewhat quickly here. But 126 00:23:34.450 --> 00:23:55.669 E3410 x8539 Conf Room: when we look at agricultural data, we see it increasing in a number of different ways. So it's not only in scale, but it's in resolution, and in source and type. And so this demands that we likewise have advancements in modeling capabilities, in particular on the AI side of things. So this is kind of a nice example of 127 00:23:55.670 --> 00:24:12.584 E3410 x8539 Conf Room: what the genomic resolutions were that you could get at scale at a big company. And we're very close to seeing the ability to look at full assemblies for a lot of the different lines that we're producing. 128 00:24:12.960 --> 00:24:36.229 E3410 x8539 Conf Room: Now, we see it not only in base-pair resolution, but also in transcriptomics, gene expression data that's coming online, and other advancements in gene ontology as we continue to learn about gene interactions. And there's a similar story here between weather, soil, management, and imaging. 
So we're getting all this data. It's all increasing in scale, 129 00:24:36.230 --> 00:24:59.269 E3410 x8539 Conf Room: and it all has different data types. And so that traditionally would be a problem, because we have base pairs here, we have time-series weather data, we have categorical management approaches, we have scalar variables that we see in the soil. All of these are very different data sources. 130 00:24:59.886 --> 00:25:13.119 E3410 x8539 Conf Room: And so AI provides a really unique, flexible opportunity, in that you can start synthesizing all these different multimodal data streams into one particular architecture to help you design. 131 00:25:14.180 --> 00:25:40.000 E3410 x8539 Conf Room: Today, I'll show a couple of quick examples, just focusing on the G part. So we're just gonna focus on the genomics and what we can do in modeling genomics to a phenotype. The phenotypes that I'll mention here are some observations, say yield, height, disease resistance, and we'll be using a genotype vector of some sort, at some resolution, to map to that phenotype. And then we, of course, have some noise. 132 00:25:40.650 --> 00:26:01.520 E3410 x8539 Conf Room: And there are 4 pieces of this approach of going from genotype to phenotype that we'll care about. The first one is the architecture of an AI model. The architecture is the bones. It gives the structure; most of the properties that we can expect out of a model will be embedded in the design of the architecture. 133 00:26:01.520 --> 00:26:17.839 E3410 x8539 Conf Room: And we'll show a kind of cool, well, I think it's a cool, approach where we start putting biologically informed components into our architecture to increase prediction accuracy. 134 00:26:17.960 --> 00:26:46.739 E3410 x8539 Conf Room: The second one is loss functions. Loss functions are the learning criteria that you can use for your AI model, and they're very important because they define the design. 
The design question that you care about. And so we should make sure that our learning and our loss functions align with those. And then I'll also show 2 quick other approaches here. Active learning is an idea in AI, and it's very similar to genomic selection, 135 00:26:46.960 --> 00:27:07.869 E3410 x8539 Conf Room: where you have an AI model and you have your system, and you're gonna allow them to interact with each other. So they get to talk, and they get to update and continue to progress towards some downstream goal. And then we'll say a couple of quick things about large language models and their applications right now. So, jumping into the architecture. 136 00:27:08.830 --> 00:27:24.560 E3410 x8539 Conf Room: So one of the questions that we wanted to answer was, could we start embedding domain knowledge into our models? And so first, when we look at the left side, at the data that we have at scale, at Bayer we have 137 00:27:24.560 --> 00:27:46.030 E3410 x8539 Conf Room: tens of millions of phenotypes, these being in yield, disease, etc., and we have perhaps over 100,000 unique genotypes. But these are at marker levels. So we have very coarse information. It might only be 10,000 base pairs or something along those lines. So we're missing a lot of what's really going on in the genotypes that we care about. 138 00:27:46.690 --> 00:28:03.619 E3410 x8539 Conf Room: Now, on the other hand, when we look at domain knowledge and things like, say, gene regulatory networks or gene ontology terms, these provide some really high-fidelity information, things that we clearly know, or at least at this point in time believe are particularly important. 139 00:28:03.720 --> 00:28:16.319 E3410 x8539 Conf Room: Those are really high-fidelity pieces of information. But the problem is that we have very little data to model. So if we have gene expression data, typically, we might only have a couple of different genotypes. So you can't really make a design model with that.
140 00:28:16.880 --> 00:28:36.599 E3410 x8539 Conf Room: So we said, well, what if you could combine those 2? So you could take the general structure of a neural net with all of these parameters, and we can embed that domain knowledge in the center of it and make the model have to learn to predict through this particular graph. And so 141 00:28:36.870 --> 00:28:48.456 E3410 x8539 Conf Room: to give a couple more reasons why we think this is a good idea, not only from a biological standpoint, but from a mathematical standpoint: graphs are very attractive for this. 142 00:28:49.353 --> 00:29:11.349 E3410 x8539 Conf Room: One is off-the-shelf AI models. We see a lot of off-the-shelf AI models being used, and that's a bit of a concern. I would say we wanna be very particular about how we're using our AI models. And so with those we're gonna get overparameterization. Now, if we build a graph, we can reduce that complexity substantially. 143 00:29:11.750 --> 00:29:15.319 E3410 x8539 Conf Room: The other problem with off-the-shelf AI models, or 144 00:29:15.540 --> 00:29:27.749 E3410 x8539 Conf Room: pretty much all AI models, is that they struggle with understanding very long-range interactions. So if we know that we have some gene on, say, chromosome 1 and another gene on chromosome 10, they're billions of base pairs away. 145 00:29:27.940 --> 00:29:29.669 E3410 x8539 Conf Room: An AI model 146 00:29:30.180 --> 00:29:46.519 E3410 x8539 Conf Room: generally is never going to be able to pick that up. It's never gonna be able to understand that. If we have a graph, we can call out those known interactions very quickly and very explicitly. And so that provides a very big advantage. 147 00:29:47.470 --> 00:30:06.049 E3410 x8539 Conf Room: So here's an example of building one of these bio GNNs. So yeah, we call them bio-informed GNNs. And what we're building, this is all open source data, actually.
So we built the graph from the Gene Ontology resource. So we asked, okay, here are various genes that we have in the maize genome, 148 00:30:06.050 --> 00:30:18.820 E3410 x8539 Conf Room: build us a graph of all the different interactions. And then we took that graph, and we linked that graph up to the marker sets that we have. So that way, base pairs within a certain distance are going to be linked to that gene. 149 00:30:18.850 --> 00:30:38.449 E3410 x8539 Conf Room: And then there's some that were just really far away. We didn't necessarily need to do this, but they're really far away, and so we put them into their own little neural net. And this was using the Genomes to Fields data set. And we were able to see somewhere around 15 to 20 [percent] improvements in our root mean squared error 150 00:30:38.960 --> 00:31:06.169 E3410 x8539 Conf Room: with yield, plant height and ear height. What I'm most excited about with this approach is that this is organism agnostic. So there's a ton of other gene ontology graphs that you could build for a number of other different data sets that exist out there, and start to continuously learn through other organisms about what these graphs could look like. These graphs are not unique. There's not one silver bullet graph, most likely. But you could tune these to 151 00:31:06.320 --> 00:31:24.659 E3410 x8539 Conf Room: explicit questions that you care about. So here we cared about yield, so we kind of just have to have everything. But if we cared about something much more specific, say something like flowering time, we could build a graph that's very explicitly defined for flowering time, and we don't really care about a number of other interactions, perhaps. 152 00:31:26.680 --> 00:31:44.259 E3410 x8539 Conf Room: So now, to loss functions. This is gonna be the most mathematical component of this. I'll go a little bit quicker through it. I think we have so much time.
So to talk about loss functions, which are learning functions. The general goal of creating a loss function, or whenever you have any model, is that 153 00:31:44.300 --> 00:32:10.050 E3410 x8539 Conf Room: you want your observed values to align with your predicted values. So you want to be along this perfect prediction line, and so anything above this line is over predicted, anything below this line is under predicted. And the goal is that you want to push these as close together as possible. So typically when we train a model, we'll use something like mean squared error or mean absolute error. And generally, this is just gonna take all the points and try to squish them. 154 00:32:10.610 --> 00:32:21.289 E3410 x8539 Conf Room: But when we look at a lot of the data that we work with, and the design goal that we care about for genomic selection and crop improvement, if we look at all the data that we have, 155 00:32:21.686 --> 00:32:43.130 E3410 x8539 Conf Room: typically, if this is yield, we're trying to improve yield, and most of our data does not sit anywhere near the upper bounds of the things that we really care about, the products that we want to design. So what can this lead to? It can lead to very poor tail-wise predictions, because the mean 156 00:32:43.700 --> 00:32:51.750 E3410 x8539 Conf Room: means that we tend to emphasize the region where all the data points exist. And 157 00:32:51.820 --> 00:32:57.369 E3410 x8539 Conf Room: if these are anti-correlated in any way to the tail events, they will just spread out. 158 00:32:57.410 --> 00:33:02.840 E3410 x8539 Conf Room: And that means that for what we're trying to design for, we're not gonna be very good at predicting. 159 00:33:02.940 --> 00:33:32.290 E3410 x8539 Conf Room: There's a second case of this. The first one I don't observe too often, but I observe this one all the time, especially in agricultural data.
And I'd argue this is perhaps even worse. This is compression, where we have observed data that extends over a pretty long span, and our model is only able to predict over a much shorter span. So it doesn't even understand the edges whatsoever in both of those tails, whether that be yield or, say, disease resistance. 160 00:33:32.470 --> 00:33:36.570 E3410 x8539 Conf Room: So what we can do, if we're thinking about this from a design perspective, 161 00:33:37.223 --> 00:33:51.800 E3410 x8539 Conf Room: is we can actually create loss functions that target only learning about the tails, or, not only, but prioritize learning about the tails, while at the same time giving up a little bit on the mean. So there's no free lunch. But you're allowed to 162 00:33:51.800 --> 00:34:11.870 E3410 x8539 Conf Room: pivot yourself towards what you actually want to design for. And so there's some interesting work, this comes out of the MIT postdoc lab, where we were working with extreme events: how do you tease out extreme and rare events from different systems with AI? And one of the ways is to build these loss functions. 163 00:34:12.730 --> 00:34:17.489 E3410 x8539 Conf Room: So I'm gonna jump past some of the, I had a proof, but 164 00:34:17.969 --> 00:34:30.560 E3410 x8539 Conf Room: I don't know. We'll pass up the proof. How's the right crowd? It's a very elegant way to build in 165 00:34:31.000 --> 00:34:33.010 E3410 x8539 Conf Room: known constraints. Is that 166 00:34:33.618 --> 00:34:36.009 E3410 x8539 Conf Room: exactly? Yeah. Yup. 167 00:34:36.270 --> 00:34:43.870 E3410 x8539 Conf Room: With the loss function, I'm just trying to understand what you're telling us. 168 00:34:44.520 --> 00:35:10.520 E3410 x8539 Conf Room: One option would be just to ignore half of the data and only focus on the data that's in the region that you want. That's not what you're doing. You're just weighting the data somehow.
Yeah, so there can be cases where that data in the middle is very useful for understanding the extremes. 169 00:35:10.680 --> 00:35:29.229 E3410 x8539 Conf Room: But sometimes there's data that comes at the expense of understanding those extremes. So that's why we weight it that way. And we weighted it in this very particular way to make sure it's a continuous distribution. And so in the case that everything's perfectly correlated, 170 00:35:29.320 --> 00:35:30.580 E3410 x8539 Conf Room: it still works 171 00:35:30.700 --> 00:35:41.104 E3410 x8539 Conf Room: great across the entire span. So yeah, we still don't want to throw data away. By just completely ignoring it we'd likely be missing a lot of information. 172 00:35:42.639 --> 00:35:57.799 E3410 x8539 Conf Room: So here's an example of using that for disease. And so disease is a great goal. Any problematic disease by definition means that resistance is going to be rare. If it wasn't problematic, then it wouldn't be rare, and we wouldn't really care so much. 173 00:35:58.210 --> 00:36:17.520 E3410 x8539 Conf Room: And in all these cases, resistance is over here. This is this tail. Very limited data, mostly everything here is susceptible. And if you use the standard genomics model, you get this compression effect. So you see everything being compressed to the mean, 174 00:36:17.844 --> 00:36:43.189 E3410 x8539 Conf Room: the average value. And so your model is just giving you tons of average values left and right. But you can start pulling and teasing out these resistance components of the genetics by adding in one of these loss functions. And it's really hard to see with the green here, but this ends up removing the compression and gives you more of a diagonal line on your predicted 175 00:36:43.190 --> 00:36:47.600 E3410 x8539 Conf Room: and observed. And so here we don't have this over prediction of the means.
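[Editorial sketch, not part of the recording.] The idea of weighting the data toward the tails, rather than discarding the middle, can be illustrated roughly as follows. This is a minimal sketch under stated assumptions, not the speaker's actual loss function: the names `tail_weights` and `weighted_mse` are hypothetical, and the inverse-histogram-density weighting is just one simple way to up-weight rare observations.

```python
import numpy as np


def tail_weights(y, bins=20, eps=1e-12):
    """Hypothetical weighting: inverse of a histogram density estimate of y.

    Rare (tail) observations get large weights, common ones get small weights.
    Weights are normalized so that their mean is 1.
    """
    density, edges = np.histogram(y, bins=bins, density=True)
    # Map each observation to its histogram bin (0 .. bins-1).
    idx = np.clip(np.digitize(y, edges[1:-1]), 0, bins - 1)
    w = 1.0 / (density[idx] + eps)
    return w / w.mean()


def mse(y_true, y_pred):
    """Plain mean squared error: every observation counts equally."""
    return float(np.mean((y_true - y_pred) ** 2))


def weighted_mse(y_true, y_pred, weights):
    """Mean squared error with per-observation weights."""
    return float(np.mean(weights * (y_true - y_pred) ** 2))


# Synthetic "observed" phenotype: most values near the mean, few in the tails.
rng = np.random.default_rng(0)
y = rng.normal(0.0, 1.0, size=2000)
w = tail_weights(y)

# A "compressed" model: predictions squeezed toward the mean.
compressed = 0.3 * y

print("plain MSE:   ", mse(y, compressed))
print("weighted MSE:", weighted_mse(y, compressed, w))
```

In this toy setup the compressed predictions are penalized more heavily under the weighted loss than under plain MSE, because their largest errors sit in the rare tails. All observations still contribute, which matches the point that the middle of the distribution is down-weighted rather than ignored.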
176 00:36:47.600 --> 00:37:06.230 E3410 x8539 Conf Room: So now we're telling our model that you need to focus very explicitly on what makes things rare, in this case for diseases. This means that we can now move way faster when we see these problematic diseases, in terms of finding the right germplasm and then breeding those. 177 00:37:08.627 --> 00:37:18.792 E3410 x8539 Conf Room: One other part here, on genomic selection. That's very useful for teasing that out for genomic selection, but we can also start implementing some ideas of active learning. 178 00:37:19.530 --> 00:37:36.370 E3410 x8539 Conf Room: I think many people are probably familiar with genomic selection. We typically will go test some set of genetics and observe the phenotypes. We then train some model, and then we try to use that model to choose the next set of genotypes to put out in the field. 179 00:37:36.992 --> 00:37:54.589 E3410 x8539 Conf Room: Now, this has traditionally taken this approach where, this is what we call the acquisition function, and it tells you which new genetics you want to put out in the field, and traditionally, we just exploited. So the model says, this is the best, let's put that out there. 180 00:37:54.590 --> 00:38:14.350 E3410 x8539 Conf Room: But that doesn't allow the model to ever learn about other interesting ideas that are out there. So we need to make sure we start embedding some exploratory terms. So when we put it this way, we're not biasing our model just to one particular solution, but allowing it to search the space much more dynamically. 181 00:38:14.490 --> 00:38:36.840 E3410 x8539 Conf Room: And we've done a little bit of analysis of various different genomic data sets. And really, all this GIF is trying to show is that this blue blob is all the data, and the model is picking out all these red points, which are the high performers. They can do it at extremely efficient levels.
If it has an exploration term, and so this is perhaps maybe one tenth of the data. 182 00:38:36.840 --> 00:38:44.589 E3410 x8539 Conf Room: So there's massive accelerations that we potentially see if we do appropriate exploration and active learning techniques. 183 00:38:45.440 --> 00:38:49.810 E3410 x8539 Conf Room: And then the final thing, large language models. I have to say it because everyone's doing it. 184 00:38:50.607 --> 00:38:53.609 E3410 x8539 Conf Room: So one of the things that we're interested in 185 00:38:54.149 --> 00:39:20.960 E3410 x8539 Conf Room: there is that you have a massive genome, and you need to find what are interesting regions for us to go and edit. And so we've been using some of the large language models to find segments that have unmethylated regions, accessible chromatin, conserved non-coding sequences, and transcription factor binding sites. And so we use these models to try to figure out where that is, and then say, hey, that's a good 186 00:39:21.249 --> 00:39:35.700 E3410 x8539 Conf Room: high-value editing target. And then we go try to collect that data. Now, we have the nice advantage that we have a ton of data on our very specific germplasm that we wanna make specific edits in. So that really helps with building some of these models. 187 00:39:37.633 --> 00:40:03.029 E3410 x8539 Conf Room: And just to start wrapping up here. I talked all about genetics, but there's so much opportunity in the soil and weather and management components here, as well as imaging, to either image some of these things like weather or management practices, or imaging that gives you much better high-resolution phenotyping, and phenotypes that we have yet to even start modeling or observing. 188 00:40:03.475 --> 00:40:07.830 E3410 x8539 Conf Room: And those will all fit very nicely in the AI architectures. 189 00:40:09.211 --> 00:40:26.049 E3410 x8539 Conf Room: I like showing this slide that we have.
I didn't build this slide, somebody else did, but it's kinda nice to show the progress of what we've done. But I think, even though we've come a very long way, the next steps are gonna have to go beyond bushels per acre. It's not just gonna be about efficiency, but about other things, of 190 00:40:26.408 --> 00:40:43.629 E3410 x8539 Conf Room: how can we make sure that we meet the livelihoods of farmers, and also the regulations and societal pressures on how food's produced, and other sustainability metrics. And I think being able to synthesize all these different data streams is gonna be very critical in going beyond 191 00:40:43.630 --> 00:40:45.798 E3410 x8539 Conf Room: the traditional bushel per acre. 192 00:40:46.380 --> 00:40:49.910 E3410 x8539 Conf Room: So a last couple comments here about 193 00:40:50.140 --> 00:41:06.270 E3410 x8539 Conf Room: where I think, maybe on the educational training side, how must that shift to recognize the opportunity of these data-driven levels. I would say first that AI in ag requires a bit of a perspective change, and this is 194 00:41:06.270 --> 00:41:29.580 E3410 x8539 Conf Room: that interpretability and explainability, which are important, and things that we should continue to ask questions about, should not undermine the capability of prediction and design. And sometimes you see that if something can't be interpreted or explained, we don't move forward with it. But for prediction and design, we don't necessarily need interpretability or explainability, at least not today. We'll give it some time. 195 00:41:29.830 --> 00:41:42.507 E3410 x8539 Conf Room: The next part is formalizing quantitative design goals and really making sure that our design goals are aligning perfectly with what we're doing with the tools that we have. 196 00:41:42.990 --> 00:42:05.180 E3410 x8539 Conf Room: And that's more of an engineering perspective here, of trying to teach these creative solutions with clear assumptions and hypotheses and boundaries.
That we wanna operate in. And the third one is that we still need to make sure we identify problems from deep biological domain knowledge. I think one of the most interesting things over the last 2 years being at Bayer is 197 00:42:05.220 --> 00:42:18.239 E3410 x8539 Conf Room: the very critical conversations that I've had with a lot of career biologists, and they've been amazing in terms of figuring out what are the problems we can solve. 198 00:42:18.240 --> 00:42:39.770 E3410 x8539 Conf Room: So this deep biological domain knowledge can't go away here in this discussion of these kind of 3 items going forward. And maybe if I leave one last thing, this is kind of how I see it: this is gonna be a work of art of some sort. And I think the engineering mindset really comes in building the frame, setting the boundary conditions and the design goal. 199 00:42:40.174 --> 00:42:51.910 E3410 x8539 Conf Room: AI is really the tool, and biology provides all the different colors and interesting components that we can use to start painting this picture going forward. 200 00:42:51.930 --> 00:42:55.690 E3410 x8539 Conf Room: So, I think that was the end of what we had. 201 00:42:56.130 --> 00:42:57.220 E3410 x8539 Conf Room: Thanks. Bye, bye. 202 00:43:04.470 --> 00:43:16.129 E3410 x8539 Conf Room: I can hear. Yeah. Thank you, Scott. Thank you, Ethan. So we now have a little bit of time for some questions. Do we have any questions in the room? 203 00:43:17.170 --> 00:43:21.649 E3410 x8539 Conf Room: Please, do we need a microphone down here? 204 00:43:21.760 --> 00:43:24.310 E3410 x8539 Conf Room: So the people in here is also in. Yeah. 205 00:43:27.370 --> 00:43:29.260 E3410 x8539 Conf Room: Here it comes. 206 00:43:30.802 --> 00:43:34.477 E3410 x8539 Conf Room: Yeah. Oh, this microphone is working. 207 00:43:38.990 --> 00:43:57.220 E3410 x8539 Conf Room: Yeah. Hi, yeah, thanks. I'm Chris Aguissa.
I'm a plant physiologist in IOS. So my question to you is, I imagine that in your data you're looking at yield and disease resistance. But you probably are, I assume, integrating data from the environment as well. 208 00:43:57.420 --> 00:44:06.320 E3410 x8539 Conf Room: And I imagine that you guys have amazing sensors and, you know, measurements of all the differences in environmental conditions during the day, during the seasons. 209 00:44:06.380 --> 00:44:26.499 E3410 x8539 Conf Room: So how hard is it to integrate all this into yield and disease resistance, or just yield? And is it better, with the precision that I assume you guys have either in greenhouses or fields, is it better to look at things very specifically? Or is it better to look at all the changes, all the complex changes in the environment? Is it 210 00:44:26.660 --> 00:44:47.199 E3410 x8539 Conf Room: in a way better to look at all the noise at once? Or is it better to be very specific? So it will depend on your design goal. So in the case where we want a germplasm that operates really well in a very select region, then we can be very specific for that. If we want this to be many broad acres, 211 00:44:47.200 --> 00:44:58.590 E3410 x8539 Conf Room: then we're no longer going for a very specific performance, but now a distribution of performances. So we wanna make sure that that germplasm is gonna operate in a number of different environments. 212 00:44:58.912 --> 00:45:19.900 E3410 x8539 Conf Room: And so that changes your design goal. You're still specific, but it's just a different set. Now you're specific over a wide range of topics, whereas before you were specific over a smaller range of topics. So yeah, whenever you're training these models they have.
They have a finite set of 213 00:45:20.060 --> 00:45:43.039 E3410 x8539 Conf Room: when you have your architecture and your data, there's a finite amount of learning that can be achieved. And you have to choose exactly where you want that learning to explicitly go. And so I think it brings a lot more to the table if you define that very clearly. But on the concept of just more data that's coming through with environment, 214 00:45:43.040 --> 00:45:54.440 E3410 x8539 Conf Room: there is a little bit of a caveat to that one. So, for example, I did a lot more fluid mechanics in my PhD and postdoc. And 215 00:45:54.930 --> 00:46:14.609 E3410 x8539 Conf Room: those are really complex systems. If you look over the last 30 years of weather, 30 years of weather is nowhere near enough weather to really understand how weather is operating. So we need a lot more data on the environmental side to be particularly accurate or high fidelity with what's going on. 216 00:46:14.690 --> 00:46:26.308 E3410 x8539 Conf Room: So I think it's great that we're continuously getting more information about the environment. But the total weather scenarios, we probably still have to box those in just a little bit more. 217 00:46:34.880 --> 00:46:36.700 E3410 x8539 Conf Room: So just wondering 218 00:46:37.186 --> 00:46:48.983 E3410 x8539 Conf Room: if you're introducing anything new into the equation, you know, with synthetic biology, synthetic genes. That's what I said. Because what occurs to me, you're bringing all this, 219 00:46:49.620 --> 00:46:52.220 E3410 x8539 Conf Room: you know, 1 million dollar technologies. 220 00:46:52.510 --> 00:46:59.720 E3410 x8539 Conf Room: All this information. But preceding you has been millions of years of evolution and 4,000 years of farming. 221 00:47:00.010 --> 00:47:09.319 E3410 x8539 Conf Room: Who I know.
I wonder if you're just using the same set of genes, how much design space there is to actually move into. 222 00:47:09.340 --> 00:47:10.629 E3410 x8539 Conf Room: With all these. 223 00:47:10.950 --> 00:47:13.639 E3410 x8539 Conf Room: you know, high high tech approaches. 224 00:47:13.750 --> 00:47:16.980 E3410 x8539 Conf Room: and as you generate new 225 00:47:17.800 --> 00:47:25.310 E3410 x8539 Conf Room: types, I suppose not upon biologists, but new types of different species. Were you sacrificing 226 00:47:25.530 --> 00:47:33.470 E3410 x8539 Conf Room: in terms of, for example, taste right? Because you don't have a new design space to move into? 227 00:47:33.730 --> 00:47:44.270 E3410 x8539 Conf Room: Yeah. So defining that problem, we will sacrifice. There's a potential that we, you might have a better answer for this one. Okay. 228 00:47:44.890 --> 00:48:09.630 E3410 x8539 Conf Room: so so yeah, does it, if if we only care about yield, and that's the only thing that we're measuring, and that's what the model is going after. There. It's not guaranteed that everything else goes away. But it is definitely a risk that everything goes away. Now we have a lot more typically than just yield that we're designing, for there's a number of other metrics. That exist, and all of those kind of go into the calculation of a multi objective 229 00:48:10.153 --> 00:48:16.199 E3410 x8539 Conf Room: design principle. Maybe you were getting at a another, a different point there about. 230 00:48:16.580 --> 00:48:19.890 E3410 x8539 Conf Room: Have we pretty much seen most of the genomic 231 00:48:20.240 --> 00:48:41.059 E3410 x8539 Conf Room: we squeezed everything out of there. Maybe from an I I don't. I don't think it's true, but we could say from a traditional breeding standpoint, let's assume that is true. 
I think editing just by itself, and what we can do there is going to completely change that and introduce a whole new set of variations that is going to continue to move 232 00:48:41.460 --> 00:48:50.300 E3410 x8539 Conf Room: to move the boundaries. So even if that were the case, I think the new technology is going to do that. We often think of GM as a very static. 233 00:48:50.310 --> 00:49:02.369 E3410 x8539 Conf Room: Yeah, they're not. They continue to evolve even within breeding programs. So you get, you know, newer comments get gene duplications the genomes, dynamic transposons moving changing, how genes work. 234 00:49:02.530 --> 00:49:08.490 E3410 x8539 Conf Room: And that continues to drive the variation that they're gonna have to continue capturing these models because that continues to to evolve over time. 235 00:49:08.970 --> 00:49:15.230 E3410 x8539 Conf Room: I'm reminded of a paper back in the late 90 s. For my postdoc advisor, who was a chief science officer at Usda for a while. 236 00:49:15.510 --> 00:49:31.519 E3410 x8539 Conf Room: There's a breeding program in barley at Minnesota. They have the same genetics, I think, 60, some years, and they continue to make yield improvements. And the question was, Where's that? Come? Where's it coming from? Just remodel, it should stop, but it keep. It keeps moving. And so there's all these other processes. It's a dynamic genome. 237 00:49:32.060 --> 00:49:33.480 E3410 x8539 Conf Room: There's things happening. 238 00:49:40.160 --> 00:49:57.269 E3410 x8539 Conf Room: So for first, I just have to say, love the talks all the way through there, and I'm so happy. I get to go to dinner with you so I can pick your brain. It's just limited to one question is is difficult. 
But there was one thing in one of your slides I thought was really interesting, as it combines here 239 00:49:57.270 --> 00:50:15.679 E3410 x8539 Conf Room: with what is showing on this slide, which is that importance around computational thinking, engineering thinking, and biology thinking. There's one more piece that I think you're in a really cool spot for, which is genetics thinking, and specifically with maize researchers, cause 240 00:50:15.950 --> 00:50:31.219 E3410 x8539 Conf Room: you work with folks that have to sort of plan experiments years in advance, and have this really limited number of iteration cycles that don't constrain as much on the data, computational thinking or the engineering thinking. But you had this number up there, 241 00:50:31.240 --> 00:50:45.890 E3410 x8539 Conf Room: 26. We have 26 more years before we hit 2050 and all of these dire warnings that come out there, and I was just sort of wondering, how do you think about what you can do within that timeframe 242 00:50:45.890 --> 00:51:04.369 E3410 x8539 Conf Room: as you start today, and specifically around thinking about how you set yourself up for the most success in the future? And if you have any predictions about, you know, where you'll be in 26 years, where we'll be in 26 years, as we sort of think forward, both in the technology and the programs that are in play right now. Just what are your thoughts, and how do you think about that? 243 00:51:04.400 --> 00:51:23.859 E3410 x8539 Conf Room: Yeah, on the 26-year piece. So this is one of my concerns coming to Bayer originally. I was used to experiments where we had simulations that we would have the AI work with. And so 244 00:51:23.860 --> 00:51:52.520 E3410 x8539 Conf Room: you get results back within a couple of minutes. And now you go.
Okay, well, we're gonna if we had 2 inbred lines ready to go and put that in the field as a hybrid, you have to wait a whole year to get that data back. Now, if you wanna design a new inbred and you have to cross it, you have to cross it. Go through a number of different processes, like the the shortest timeframe, I think, is 3 years to get a hybrid data point. So if we were today to do something, we're not gonna get that data point in 3 years. 245 00:51:52.780 --> 00:52:00.330 E3410 x8539 Conf Room: So so I think that's a, really, it's a really interesting question. And I think it, it really underlines the aspect of 246 00:52:00.910 --> 00:52:29.630 E3410 x8539 Conf Room: we. We have to look really far downstream and ask the question, are we exploring enough? Because I know. And this is something that maybe the public sector will be better at because we get down to points with, we have to be able to provide a certain set of products that are gonna be high performing. And sometimes we just have to exploit. We have to take the things that we have right now and sometimes can't look downstream. So I think that's a that's a big question of risk 247 00:52:29.710 --> 00:52:37.440 E3410 x8539 Conf Room: that I hope we can solve. But yeah, there's a lot of things asking someone to make a decision 3, 4, 5 years downstream 248 00:52:38.050 --> 00:52:41.736 E3410 x8539 Conf Room: stuff. I don't know if I answer that in any useful way, but 249 00:52:42.360 --> 00:52:46.790 E3410 x8539 Conf Room: so I generally think of, you know, if you're modeling, or that 250 00:52:47.150 --> 00:53:12.880 E3410 x8539 Conf Room: AI modeling that you're mostly being able to look for predictions. Well, you don't extrapolate. You can look within your data set. But you're not really being able to predict far outside of your data set. Is that true? Given the constraints and the loss functions? That you're incorporating in the models. 
So that if I'm really looking for something new that's gonna enable me to, you know. Do agriculture in, you know. 251 00:53:13.000 --> 00:53:17.949 E3410 x8539 Conf Room: 2050 will I be able to find that? Or do I need to really 252 00:53:18.318 --> 00:53:21.819 E3410 x8539 Conf Room: how do I push those models? So I get that extrapolation? 253 00:53:21.900 --> 00:53:28.330 E3410 x8539 Conf Room: Yeah. Yeah. So when you do this active learning approach. The goal is to be 254 00:53:28.420 --> 00:53:45.109 E3410 x8539 Conf Room: going on the edge. You're gonna try to find the edges continuously. And so you are trying to extrapolate. And one of the things that definitely is hard for me to discuss multiple times is your model, and extrapolation is, gonna be wrong. So many times you might be. If you're working in extreme events. 255 00:53:45.240 --> 00:54:04.960 E3410 x8539 Conf Room: it's so rare that 99.5% of the time you're gonna be wrong. And a lot of people don't like that answer. But it's a it's a reality of you do have to. You're gonna be wrong most of the time. But if you run the statistics and you run a number of different simulations. You can see that being wrong is worth it because you're gaining 256 00:54:05.100 --> 00:54:29.333 E3410 x8539 Conf Room: understanding and data assets that are much more interesting because they're diverse and they are solving. They are answering a question in the space of genetics. Essentially so. I think that that kinda pairs well, with that question is that most of the time an extrapolation? The models are wrong, and that's a really hard discussion to have. But it's very you. It's useful, wrong versus 257 00:54:29.690 --> 00:54:35.088 E3410 x8539 Conf Room: being right and not moving anything. So 258 00:54:36.944 --> 00:54:48.404 E3410 x8539 Conf Room: we need 3 years to prove that you were wrong. To to do this, and you have to convince this to your leadership. 
259 00:54:49.630 --> 00:54:50.720 E3410 x8539 Conf Room: So 260 00:54:50.810 --> 00:55:03.229 E3410 x8539 Conf Room: fantastic talk. Thank you so much. You're very fortunate in having the luxury of all of these data and years and years, decades, of research on maize. 261 00:55:03.410 --> 00:55:09.390 E3410 x8539 Conf Room: So given your experience of working with maize, you know, recognizing that 262 00:55:09.400 --> 00:55:13.809 E3410 x8539 Conf Room: as the climate continues to change, we're going to have to bring in additional crops, 263 00:55:13.830 --> 00:55:17.169 E3410 x8539 Conf Room: you know, be they orphan, be they new crops, 264 00:55:17.560 --> 00:55:32.910 E3410 x8539 Conf Room: what recommendations would you give to researchers who are working on some of these orphan or new crops that we're now developing, to be able to leverage AI to the best extent possible and improve them as quickly as possible? 265 00:55:33.410 --> 00:55:59.960 E3410 x8539 Conf Room: Well, I'd say maybe the first point, on this kind of active learning question, is being willing to spread out your data and allow the model to paint the lines in between. That's gonna mean that your limited resources are going to be spent as efficiently as possible towards building something that can predict well. So I think that's something to do if you're starting from scratch and you have that opportunity. 266 00:55:59.960 --> 00:56:10.326 E3410 x8539 Conf Room: Have a mindset that data is an asset that you have to take a risk analysis approach to, to really get the best data. And if you do that, 267 00:56:10.920 --> 00:56:30.339 E3410 x8539 Conf Room: for most of the data sets that exist out there, if you do kind of a historical analysis on them, you only need about 5 to 10% of the data that's out there to get the accuracy, and you could throw away the other 80.
So there's a ton of experiments, a ton of wasted data. So taking an approach like that, I think, is very critical from the AI side. 268 00:56:30.880 --> 00:56:32.360 E3410 x8539 Conf Room: And 269 00:56:32.790 --> 00:56:35.810 E3410 x8539 Conf Room: to some extent this is happening. So obviously, corn, 270 00:56:36.130 --> 00:56:44.249 E3410 x8539 Conf Room: for Bayer, is the most profitable crop, with the most resources going to it. So a lot of things get developed in corn, maize, and then propagated 271 00:56:44.400 --> 00:56:46.929 E3410 x8539 Conf Room: down to soy, and then rice and wheat, 272 00:56:47.480 --> 00:56:53.070 E3410 x8539 Conf Room: and even into the vegetables. We have a veg division that has like 70 different vegetables that they breed. 273 00:56:53.110 --> 00:57:04.510 E3410 x8539 Conf Room: So over time, these things, you know, we first introduced all the genotyping, SNP chipping, that started in corn and then made its way down to the other crops. Same with all the genomic resources. 274 00:57:05.470 --> 00:57:16.470 E3410 x8539 Conf Room: And if you think about orphan crops, cover crops, CoverCress, and things like that, which we're investing in and other companies are investing in, those technologies that he's talking about are also making their way into those 275 00:57:16.530 --> 00:57:18.650 E3410 x8539 Conf Room: orphan crops as well. So yeah, 276 00:57:18.790 --> 00:57:35.680 E3410 x8539 Conf Room: it's happening slowly. Alright. Well, thank you again. It's a wonderful presentation. And I'm sure there'll be additional opportunities to meet with NSF staff for the rest of the day. So let's thank 277 00:57:38.070 --> 00:57:39.050 E3410 x8539 Conf Room: sure. Yep. 278 00:57:42.750 --> 00:57:49.599 E3410 x8539 Conf Room: okay, that was great. Nice.