Video: Orchestrating Large Scale Omics Workflows with Seqera at Arcus Biosciences | Duration: 1160s | Summary: Orchestrating Large Scale Omics Workflows with Seqera at Arcus Biosciences | Chapters: Welcome and Introduction (3s), Omics Data Architecture (31s), Transitioning to Nextflow (251s), Nextflow Architecture Overview (373s), Data Processing Pipeline (463s), Nonclinical Data Explained (773s), Cloud Bursting Methods (844s), Data Management Practices (922s), Pipeline Chaining Complexity (961s), Concluding Remarks (1065s)
Transcript for "Orchestrating Large Scale Omics Workflows​ with Seqera at Arcus Biosciences​": Hello, and welcome back to Nextflow Summit infrastructure track. I hope you enjoyed the the world's shortest coffee break. We're keeping things powering through here today. We've got lots of talks to get through. I'm really excited about the next one. We're gonna hear from Renee and Stav who work at Arcus Biosciences, hearing about their automation of large scale omics. So, without delay, we'll hear the talk. Thanks so much. Hi, everyone. We're from Arcus Biosciences. I'm Vinay, a DevOps engineer, with my colleague, Stav, our bioinformatics engineer. At Arcus, our mission is to pursue a cure for cancer. The images that you see on the right represent something really special to us. It's an initiative started by our design team who use AI to bring our patients' stories to life. We think that these procreate beautifully capture what drives us every day. Today, we want to show you how Seqera is also helping us advance that mission by enabling us to build and orchestrate large scale omics for growth. So let's dive in. I will cover the part about what we do at Arcus Biosciences and the breadth of our genomics architecture, and then Stav will dive deep into our whole exome pipeline and how we analyze that data out of us. So who are we? We are a clinical biopharmaceutical company, which means we not only use small molecules, but we also use monoclonal antibodies to develop combination therapies. This is for several diseases, but especially cancer. And the cancers are of multiple and diverse types, such as kidney cancer, lung cancer, as well as pancreatic cancer. Currently, we are we have about seven active molecules that we're studying in 13 clinical trials, and we collect data from over 400 biopsies from various patients. And then we also have nonclinical data from thousands of patients. Today, we will only cover how we process clinical data. As part of these studies, we need to leverage omics data because it helps us understand the biology of the targets. So that way, we would understand how the pathway dependencies work. We understand the proposed mechanism of action. So when a drug binds a target, what kind of molecular changes happen. And then we also look at biomarkers, which are specific to cancer. For example, PDL one or other gene signatures from the pathways that we're exploring. Now let's talk about the diverse type of data that we collect, and each of them is important because they answer the question. The first is the whole exome sequencing data, which helps us understand mutations, RNAseq for gene expression at bulk level, and then SC RNAseq for, gene expression at a singular cell. And then we also do proteomics, which is a bit challenging to collect, but it helps quantify the functional protein units within a sample. So, essentially, the flow is DNA, RNA, and protein. We also operate in a hybrid cloud, which means we not only have an on prem HPC with over 2,500 cores and some GPU, but we have the ability to get more compute through any of the big players like AWS, Azure, or GCP. Or if there's a specific service that they provide, we would like to use that as well. So we needed something that could fit into this paradigm. And it was important that we needed an interface that is not only user friendly, but it has a CLI or a new API to integrate with other systems that we use. So this is kinda like our problem statement. 
Now let's look at how we tried to solve it. When we first began processing large volumes of sequencing data, we found that, out of the box, bcbio-nextgen offered a great solution. It supported multiple modalities off the bat, like variant calling, RNA-seq, and ChIP-seq; it did extensive benchmarking of best practices, showing the performance of its tooling in comparison with other tools; it integrated well with other bioinformatics tools; and it was easy to launch: you had a YAML config file and could launch it on Slurm or on an EC2 instance. It helped us standardize our workflows and come up with best practices for our use case. But as our sequencing demands grew and our data got more complex, some challenges emerged: limited cloud and containerization support, difficulty customizing the pipelines, maintenance issues around dependencies and updates, and a CLI-first design, which made monitoring and troubleshooting difficult and made it less accessible than something with a nicer GUI.

This is when we transitioned to Nextflow and the Seqera Platform. Nextflow is a Groovy-based DSL built for scientific workflows. It's highly scalable, it offers parallelization right off the bat, and it has all the container and cloud support we were looking for, on top of executor support, for example for your local HPC. It was also very user friendly: with the GUI out of the box, anyone without much programming experience can launch pipelines for their samples. In short, Nextflow met our evolving requirements at Arcus Biosciences.

Now let's look at where Nextflow fits into our architecture. This diagram gives a bird's-eye view of all of our components. Our Seqera Platform is deployed as an OpenShift Kubernetes cluster, with accessory executors for our use cases. Let's take it one step at a time. The raw data gets dropped into S3 as FASTQ files. It gets executed on one of the executors, for example Slurm on-prem if we don't need a lot of compute, or we burst into AWS. We also have Parabricks, which Stav will cover. The processed data goes back into S3; this includes the gene expression data and the variant call format files. Those get data-engineered into Parquet files by merging with other modalities, and that becomes the source for our downstream analysis using R, Python, or, in today's world, even MCP. Let's look at that architecture again: the data flows through here, we launch based on the use case, the data gets processed and merged with other data, and this becomes the place where we do our analysis.
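For readers who want a concrete picture of the hybrid setup described above, here is a minimal, hypothetical nextflow.config sketch showing how a single pipeline could target an on-prem Slurm cluster or burst to AWS Batch through profiles. The queue names, bucket paths, and region are illustrative assumptions, not Arcus's actual configuration; in practice the speakers select compute environments through the Seqera Platform.

```groovy
// Hypothetical nextflow.config: one pipeline, two compute targets.
// All names (queues, buckets, region, paths) are placeholders.

profiles {

    // On-prem HPC via the Slurm executor
    onprem {
        process.executor = 'slurm'
        process.queue    = 'general'            // assumed Slurm partition name
        workDir          = '/scratch/nextflow'  // shared-filesystem work directory
    }

    // Cloud bursting via AWS Batch
    awsburst {
        process.executor  = 'awsbatch'
        process.queue     = 'nf-batch-queue'    // assumed AWS Batch job queue
        workDir           = 's3://example-omics-bucket/work'
        aws.region        = 'us-west-2'
        aws.batch.cliPath = '/home/ec2-user/miniconda/bin/aws'  // AWS CLI path in the AMI
    }
}

// Containers keep the software stack identical on both targets
docker.enabled = true
```

Launching the same pipeline revision from source control with `-profile onprem` or `-profile awsburst` is one way to "flip the switch" between environments, which mirrors what the speakers describe doing through Seqera compute environments.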
I will now pass on to Stav, who will cover our whole-exome pipeline and the data lake more extensively.

Thank you, Vinay, for the comprehensive overview of Arcus Biosciences and our tech stack. At Arcus Biosciences, one of our primary objectives is to use somatic mutations and assess their relationship to clinical outcomes. We do this by running nf-core/sarek, a widely cited best-practices variant calling pipeline. Sarek's flowchart is shown on the left: it runs a series of steps covering preprocessing, variant calling, and annotation.

What's nice about Sarek is that the annotation resources are packaged in, which reduces development time and gives out-of-the-box running capability. This matters to Arcus Biosciences because it gives us consequence information for our variants, from which we can assess the clinical significance of a patient's somatic mutations. With Sarek, we can process hundreds of samples in a matter of days, bursting onto AWS when we need more power or using our on-premises HPC.

Our post-Sarek pipeline optimizes somatic variant calling by extending nf-core/sarek to our needs. Essentially, it takes Sarek's output files as input. We run an additional variant caller, VarScan, and we also run microsatellite instability detection. We then merge the VCF files using ensemble variant calling, an approach we learned from bcbio, to ensure we get the most confident variant calls for our tumor patients. We convert the outputs to TSVs, annotate them, concatenate them across samples, and publish to an output directory. With Nextflow, we have been able to add a custom Arcus pipeline that meets our goals for whole-exome sequencing by manually chaining from Sarek.

The problem is that Sarek and post-Sarek are two different pipelines, and Nextflow does not support importing or chaining pipelines together, so we built a workaround. We use an accessory R script in the Nextflow post-Sarek pipeline that lists the files of interest from Sarek into a data frame, joins it to a sample sheet via sample ID, and then creates a combined channel for downstream processing in the post-Sarek workflows. This can be triggered automatically by multiple approaches; we specifically use seqerakit and parameter substitution. While this is a reasonable workaround, it's still not seamless and requires a hacky solution. We think that importing pipelines as a built-in feature would be valuable to others in the scientific community as well.
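To make that workaround more concrete, here is a minimal Nextflow DSL2 sketch of the idea described above: collect files of interest from a Sarek results directory, key them by a sample ID parsed from the file name, and join them to a sample sheet to form a combined channel for post-Sarek processes. The paths, file-name pattern, and sample sheet columns are assumptions for illustration, and the actual Arcus implementation uses an accessory R script rather than this pure-Nextflow version.

```groovy
// Hypothetical sketch of joining Sarek outputs to a sample sheet.
// params.sarek_results and the sample sheet columns are placeholder names.

params.sarek_results = 's3://example-bucket/sarek/results'
params.samplesheet   = 'samplesheet.csv'   // columns assumed: sample,patient,status

workflow {

    // Sample sheet rows keyed by sample ID
    samples_ch = Channel
        .fromPath(params.samplesheet)
        .splitCsv(header: true)
        .map { row -> tuple(row.sample, row) }

    // Sarek VCFs, keyed by a sample ID parsed from the file name
    // (assumes names like <sample>.mutect2.filtered.vcf.gz)
    vcf_ch = Channel
        .fromPath("${params.sarek_results}/variant_calling/**/*.vcf.gz")
        .map { vcf -> tuple(vcf.name.tokenize('.')[0], vcf) }

    // Combined channel of [ sample_id, metadata, vcf ] for post-Sarek steps
    combined_ch = samples_ch.join(vcf_ch)
    combined_ch.view()
}
```

In practice the combined channel would feed processes such as the VarScan and MSI steps; the point is only that file-name metadata plus a sample-sheet join is enough to bridge two pipelines until chaining is supported natively.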
Another challenge we face in whole-exome sequencing is the high run time of certain steps. Genome alignment is a key bottleneck in sequencing analysis even when work is split across different workers. One solution is NVIDIA Parabricks, a GPU-accelerated genomic analysis toolkit; an initial test run showed a roughly 10x reduction in run time for our test sample. We're testing NVIDIA Parabricks on our GPU-enabled on-premises HPC, and we plan to work with Seqera to implement NVIDIA Parabricks for the other steps of whole-exome sequencing analysis.
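As an illustration of how a GPU-accelerated alignment step might be dropped into a Nextflow workflow, here is a hedged sketch of a process wrapping Parabricks' fq2bam, the GPU counterpart of BWA-MEM alignment and sorting. The container tag, accelerator request, and resource values are assumptions, not the configuration Arcus runs.

```groovy
// Hypothetical Parabricks alignment process; resources and container tag are placeholders.
process PARABRICKS_FQ2BAM {

    // Request a GPU from the executor (e.g. Slurm GRES or an AWS GPU instance)
    accelerator 1
    cpus 16
    memory '64 GB'
    container 'nvcr.io/nvidia/clara/clara-parabricks:4.3.0-1'   // example tag, verify before use

    input:
    tuple val(sample_id), path(fastq_r1), path(fastq_r2)
    path reference_dir    // directory holding the FASTA and its indexes

    output:
    tuple val(sample_id), path("${sample_id}.bam")

    script:
    """
    pbrun fq2bam \\
        --ref ${reference_dir}/genome.fa \\
        --in-fq ${fastq_r1} ${fastq_r2} \\
        --out-bam ${sample_id}.bam
    """
}
```

A process like this, scheduled onto the GPU nodes of the on-prem HPC or a GPU compute environment in the Seqera Platform, is where the roughly 10x speedup quoted in the talk would come from.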
I've just talked about the challenges we faced, but I'd like to wrap up by showing what we do with the data after processing. With all this data, we build a data lake for streamlining clinical trial analysis. We merge modalities from different departments, including omics, histopathology, and clinical data, and we store these files, after transforming them into Parquet, in S3 for querying via Athena, R, or other tools. Now that we have our data lake, we adhere to FAIR data practices to enable our scientists to find and access our data, easily pull across the different modalities, and create reproducible analyses, which lets our bioinformatics scientists query and create plots on the fly.

In conclusion, Arcus works with complex data from a large number of patients and trials, and Seqera provides us a modern, powerful solution for bioinformatics pipeline orchestration at scale. Nextflow's flexibility has allowed us to extend nf-core pipelines beyond their out-of-the-box running capability, which enables seamless handling from sample biopsies to impact analysis, and pipeline chaining will be critical for more complex workflows. We look forward to working with the community on improving Nextflow's capabilities. I'd just like to take a second to acknowledge the Bioinformatics and IT teams at Arcus Biosciences and the Seqera partnership we've had since 2021, and a big thank you to Nextflow and the nf-core community for building pipelines and enabling science at scale. Thank you, and we'll now take questions in the Q&A.

Thanks very much, both. That was a fantastic talk. I like these talks that really go into the weeds and talk about what's behind the scenes powering these things; it's really interesting to see the different setups people have. There are a couple of questions, and I've got one or two of my own as well. Vinay, maybe I'll start with you because you gave the first part of the talk. You mentioned something at the start that wasn't a big part of your talk, and I was curious what you meant by it. Do you know what that data is? Can you say?

Sorry, I think I lost the initial part of the question. Could you repeat it?

At the start of your talk, you mentioned you have a bunch of clinical data, which is what you talked about, but you said you also have a lot of nonclinical data, and I wonder what that is.

Yeah. Nonclinical is mostly things like drug stability analysis. It's not clinical data about the patient, but how the drug behaves under various conditions: for example, is it stable at 25 degrees, at room temperature, under refrigeration, or in a hot and humid climate? We have a lot of data under that as well.

Sure. I'm also very interested in the method you mentioned for bursting to cloud and selecting different compute environments. Are you doing that outside Nextflow or within Nextflow? I know you can do hybrid cloud solutions within Nextflow, but I guess you're controlling the execution of the Nextflow process. How does that work?

Yeah, that is using compute environments in the Seqera Platform. In most cases we run on-prem, but if we need to burst out, we use AWS. We've done that in two ways: we also keep an instance set up on AWS Elastic Kubernetes Service as a safety net, because sometimes there's throttling and you have to add wait times when you burst out from your on-prem cluster into AWS. So we have both options and use whichever works at the time; when that problem comes up, we just launch everything off the EKS instance. And since you can source the pipeline from GitHub or any other source control, we don't really have to change the pipeline much. It's the same code, just run at a different pace. You just flip a switch.

And are you doing that manually, or is that logic automated?

That depends on the use case. For now, we haven't automated it, so it's manual. We're like, okay, we probably need 5,000 cores, and it all depends on the deadlines; if we need it faster, we decide that before we launch.

And one more of mine before I go on to the audience questions. What kind of data lake are you using? Is there a product under the hood for the data lake?

No, it's just a bunch of Parquet files in S3, and then we use Athena to query them and things like that.

Cool, nice. Right, as predicted, I thought you might get lots of questions about pipeline chaining, because it's a hot topic. There are lots of people who are interested in it, but it's a bit of a gnarly problem. I'm going to start with Maxime. He asks, how complex is it to work downstream of a pipeline as complex as Sarek? Because Sarek does a lot. How difficult was it to chain onto it?

Yeah, I can talk about that. The main issue with taking downstream data as input from Sarek is being able to join all the different files together, from the germline files to the somatic files, and making sure that maps correctly. We run a bunch of different tools, each on either the BAM files or the VCF files, and that creates complexity in making sure the channel output is what you expect it to be. So a critical part of that is parsing the metadata directly from the file names, which is what I used to build that kind of pipeline. But from there, I just use classic Nextflow. I adopted some nf-core tools as well and put them as processes into our code base. I tried to keep it as simple as possible, because Sarek is a very complex pipeline, so selecting the specific inputs I need, running only on those, and making the pipeline as sequential as possible helps with that. Does that answer your question?

Yeah. I think there's a lot of interest in this; there's a running joke about deleting Sarek and breaking it up into smaller pipelines once pipeline chaining is unlocked. But it's great that you've managed to make building on top of it work. If people are interested in pipeline chaining, tune in to the keynote again tomorrow; we have a few things coming for you on the Nextflow side of things, so hopefully we're moving in the right direction. One more question: how seamless is migrating a workflow between AWS and GCP, and do you have a preference for one versus the other?

So, when I mentioned GCP, it was more that our problem statement includes having access to GCP. We haven't actually run Nextflow on GCP, so it's AWS and on-prem for now. That was more a statement that, hey, we do have the ability to use any of the cloud providers, and which solution would work best for our needs. Sorry if I didn't state that clearly.

That does answer it. And, to be honest, if all of your data sits in AWS S3 buckets, that's a pretty good reason to keep your compute there as well. Okay, I'm going to try not to run over time, so thank you very much, both of you. Really great talk, and lots of good feedback from the audience. Thanks very much, and see you soon.

Yeah, thanks for having us. Thank you. Bye.