Outputs

Deliverables

D1.1 - Data Management Plan (DMP) - M6 

The Data Management Plan (DMP) is required by the LIGATE project’s obligation to adhere to the European Commission’s Open Research Data Pilot, which enables open access to research data where possible. The DMP describes the data and program repositories that third parties can use for purposes such as data mining, exploitation or validation of the project’s results. It describes the data that will be acquired or produced during the project, how the data will be managed, annotated and stored, the standards to be used, and how the data will be handled and protected during and after the project. The Plan will be updated at regular intervals, in line with the project’s activities. Related documents include D1.1, “Requirements and Specifications and Integration Plan”. In general, the data and tools are or will be made freely available; the few exceptions involve licensed or proprietary software, for which the restrictions on use are described.

D1.2 – Requirements and Specifications and Integration Plan – M9 

D1.1 describes the requirements and specifications of the project’s software components together with the constraints imposed by the available hardware architectures. Related documents include the Data Management Plan (D1.2). It lists the hardware resources provided by HPC centres and other partners, including computer systems which will become available in the coming months. It details the specifications and requirements for the lead users of the LIGATE solution, DOMPE’ and TOFMOTION, followed by other components such as the LiGen docking engine, the GROMACS molecular modelling software and the HyperQueue workflow engine. It also provides a description of the integration plan and configuration management.

D1.3 - Initial Validation Result - M18 

Initial validation of the different solution components and of the solution as a whole. The validation tests have been defined for the following components:

  • LiGen. Reference datasets and scripts have been defined which allow the testing of the LiGen components separately or their integration in a LiGen workflow. Thus, it is possible to run a thorough validation of all the software or a quicker test to check for errors after a software update. To facilitate the validation of the software on a new system, all the test data and run scripts are maintained in a dedicated repository: https://gitlab.hpc.cineca.it/ligate/ligen-testbed.
  • HyperQueue. The software is shipped with a testing suite which can be rapidly built via Python scripts.
  • GROMACS. The program relies extensively on detailed testing. All changes are subject to unit testing in a continuous integration (CI) fashion, and new code is not accepted without unit tests. In fact, the CI system is so effective that it has detected more than a dozen bugs that were due to compilers rather than to the simulation code.
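
The distinction drawn above for LiGen, between a thorough validation and a quicker check after a software update, can be pictured as one comparison routine applied either to all reference cases or to a small subset. The following Python sketch is only an illustration under assumptions: the function name, the case names and the score values are hypothetical and are not taken from the ligen-testbed repository.

```python
import math


def compare_to_reference(results, reference, rel_tol=1e-6, subset=None):
    """Return the names of cases that are missing or deviate from the reference.

    `subset` restricts the check to a few cases (quick test); when it is None,
    every reference case is checked (thorough validation).
    """
    names = subset if subset is not None else sorted(reference)
    failures = []
    for name in names:
        if name not in results:
            failures.append(name)
        elif not math.isclose(results[name], reference[name], rel_tol=rel_tol):
            failures.append(name)
    return failures


# Hypothetical reference scores and freshly computed scores (made-up values).
reference = {"ligand_a": -7.25, "ligand_b": -6.10, "ligand_c": -8.40}
results = {"ligand_a": -7.25, "ligand_b": -6.10, "ligand_c": -8.10}

quick = compare_to_reference(results, reference, subset=["ligand_a"])
full = compare_to_reference(results, reference)
print("quick check failures:", quick)
print("full validation failures:", full)
```

In this toy setup the quick check passes while the full validation flags the one deviating case, which is the trade-off between speed and coverage that the two test modes imply.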

    D1.4 Solution Validation

    This deliverable reports on the work done by the LIGATE project to ensure the robustness and scalability of the Computer Aided Drug Discovery (CADD) solution, and of its components, developed in the project. Various components support the workflows (Pose Selector and Binding Affinity Predictor) that make up the complete CADD solution. The process involved defining tests and validating them, both at component level and for the solution as a whole. The validation was carried out on HPC systems at IT4I and CINECA. The report also provides a gap analysis of the portability of the solution to other HPC systems, with a focus on the European ecosystem, including an evaluation against the European Processor Initiative (EPI) roadmap.

    D1.5 Computer Vision Simulation Use Case Porting and Validation

    Description of the optimization and porting of a computer vision simulation to demonstrate that LIGATE is a generic platform for real-world applications.

    D1.6 Structured Grid Simulation Use Case Porting and Validation

    This document describes the source code and application, Cronos, ported to SYCL and Celerity within Task T1.5. The document is organized as follows: Section 2 describes prerequisites as well as instructions for building and running the application and associated tests. Section 3 elaborates on the functionality of the application and on the main challenges and considerations encountered during the porting process. Section 4 discusses verification and correctness, while Section 5 presents performance results obtained through benchmarking. Finally, Section 6 concludes the document and summarizes the lessons learned while working on this part of the project.

    D2.1 – Application Code Accelerated with SYCL  - M9

    The document accompanying D2.1 describes the contents of the two codebases released for the deliverable, the main challenges and the solutions adopted in developing the SYCL-accelerated versions, and code portability in terms of SYCL implementations and supported target architectures. It also introduces a strategy for future updates and maintenance of the codebases. In more detail, a SYCL implementation of LiGen is being developed, starting from ligen-geodock and ligen-score. Each CUDA kernel in the CUDA implementation is mapped to a SYCL kernel in the SYCL implementation.
    The document contributes to the deliverable “Ligate Software Release” released for the first milestone MS1 and follows the indications provided by the data plan (D1.2) and the requirements, specifications and integration plan (D1.1). 

    D2.2 – Application Code Accelerated with Celerity - M18 

    Three Celerity porting options have been identified, the most promising being to use Celerity to distribute the existing GPU workload per node. However, during the latest discussion in WP2 it emerged that, based on POLIMI's most recent work, there might be a fourth hypothesis, which is under review by UniSA, POLIMI and UIBK.

    D2.3 - Intermediate runtime and autotuning framework - M18

    This intermediate report describes the progress made on runtime system optimization, including autotuning, scheduling and data distribution improvements, as well as energy optimization.

    D2.4 Final runtime and autotuning framework 

    The final runtime and autotuning report describes the state of the programming model and runtime system after the completion of the autotuning, heterogeneity and energy optimization tasks. It will summarize the framework features, the advancements made in the optimization tasks, and evaluate its performance.

    D3.1 – Specification of data/ API requirements - M9 

    The D3.1 report contains the results of the analysis of the API needs for the modules to be properly integrated. Most interfaces concern data formats and parameters. It describes the LiGen modules, with an overview of each module’s goal, input/output and command line arguments. The interfaces of the LiGen modules have been updated to build flexible workflows that can also include external tools. Most of the existing LiGen modules have been refactored to use the newly defined interfaces. For GROMACS, work has focused on the requirements for automatic topology and parameter generation, and for automatically deciding the simulation length needed to reach a target precision. The document also shows examples of module composition for more complex workflows, including virtual screening, pre-processing, docking, scoring and free energy calculations, and includes a preliminary analysis of the HyperQueue tool for managing workflow submission. In summary, API needs have been analyzed and reviewed, interfaces for module development have been defined, and common input/output standard formats have been agreed. Consistency between application concepts (e.g., data types) has been verified, and a plan for API evolution has been defined that will pave the way for the subsequent D3.2 and D3.3 deliverables.

    D3.2 - Data translators and code

    Extensions to existing data translators where possible, and implementation of new ones, either as scripts, as standalone programs or within the original applications, to allow data exchange and interaction within the solution.

    This work has been highly successful. While there will be additional efforts in the second half of the project to fully integrate with newly developed interfaces, in particular to better identify and handle broken, incorrect or simply unsupported molecules (for a particular parameter set the user has selected), the achievements here have allowed us to execute our first trial benchmark tests of large-scale free energy and docking workflows.

    D3.3 - Release of New APIs and data formats

    Implementation of new interfaces and data formats to describe jobs and results when the application APIs cannot be extended or adapted (e.g. when they break application legacy).

    D3.4 - Release of Enhanced SW modules  

    Release of modifications in the existing codes with support for fully automated parametrization of target compounds, analysis and clustering of docking poses, automated setup of free energy calculations based on structures with docked compounds, and enabling users to specify a requested target precision of free energy calculations.
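
As a rough illustration of what a user-requested target precision could mean in practice, the sketch below keeps adding simulation blocks until the standard error of the mean estimate drops below the requested value. It is a minimal sketch under assumptions, not the GROMACS implementation: `run_block` is a hypothetical stand-in that draws random numbers instead of running a simulation, the blocks are assumed to be statistically independent, and the numerical values are made up.

```python
import math
import random
import statistics


def run_block(rng):
    """Hypothetical stand-in for one simulation block.

    Returns a noisy free energy estimate in kJ/mol (made-up mean and spread).
    """
    return rng.gauss(-30.0, 2.0)


def simulate_to_precision(target_sem, min_blocks=10, max_blocks=10_000, seed=0):
    """Add blocks until the standard error of the mean is at most target_sem.

    Returns (mean estimate, standard error, number of blocks used).
    The loop also stops at max_blocks so that it always terminates.
    """
    rng = random.Random(seed)
    estimates = [run_block(rng) for _ in range(min_blocks)]
    while True:
        sem = statistics.stdev(estimates) / math.sqrt(len(estimates))
        if sem <= target_sem or len(estimates) >= max_blocks:
            return statistics.mean(estimates), sem, len(estimates)
        estimates.append(run_block(rng))


mean, sem, n_blocks = simulate_to_precision(target_sem=0.25)
print(f"estimate = {mean:.2f} +/- {sem:.2f} kJ/mol after {n_blocks} blocks")
```

Because the standard error shrinks roughly with the square root of the number of independent blocks, halving the requested error costs about four times as much sampling; correlated samples in a real simulation would require a more careful error estimate than the one used here.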

    D4.1 - Initial analysis on the machine learning module - M18

    The strategy for binding affinity prediction and pose selection was shared, and the requirements for successful integration with the other virtual screening modules were collected. A new data-generation phase for the pose selection task has just finished and has produced significantly more poses to train the model. The model has been retrained with promising results. For the pose selector, different strategies are being explored, e.g. 3D Convolutional Neural Networks, Graph Neural Networks and Mixture Density Networks. Three open source software packages with very good results on the CASF datasets are being tested and adapted to the use case at hand. Complex alignment is guaranteed within the LiGen workflow, so there is no need to rely on specific methods (such as Frobenius norm evaluation) to ensure rototranslation and permutation invariance with respect to changes of the reference system.

    D4.2 - Release of the machine learning module 

    Final analysis, selection and implementation of Machine Learning techniques.

    D4.3 - Intermediate solution for data management  - M18

    This deliverable reports the architecture of the first version of the IO storage platform optimized for LIGATE use cases. CINECA is testing ETL and a basic ML pipeline (PCA) with RAPIDS and NVTabular, using three trajectories (.xtc binaries) of 1, 5 and 10 GB. Currently the framework does not seem stable; debugging is ongoing both at CINECA and with NVIDIA developers.
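
To make the PCA step of such a pipeline concrete, the sketch below computes the first principal component of a small table of frames. It is a CPU-only illustration in plain Python and does not use RAPIDS or NVTabular; the synthetic data (two strongly correlated features plus one noise feature) and all function names are assumptions made for this example, not part of the CINECA tests.

```python
import math
import random


def center(rows):
    """Subtract the per-column mean so that every feature has zero mean."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def covariance(rows):
    """Sample covariance matrix (d x d) of rows that are already centered."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
            for i in range(d)]


def first_component(cov, iterations=200):
    """Dominant eigenvector and eigenvalue of cov, found by power iteration."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    cov_v = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
    eigenvalue = sum(v[i] * cov_v[i] for i in range(d))
    return v, eigenvalue


# Synthetic frames: features 0 and 1 share a common signal, feature 2 is noise.
rng = random.Random(0)
frames = []
for _ in range(200):
    t = rng.gauss(0.0, 3.0)
    frames.append([t + rng.gauss(0.0, 0.1), t + rng.gauss(0.0, 0.1), rng.gauss(0.0, 0.1)])

cov = covariance(center(frames))
pc1, var1 = first_component(cov)
explained = var1 / sum(cov[i][i] for i in range(len(cov)))
print("first component:", [round(x, 3) for x in pc1])
print("explained variance ratio:", round(explained, 4))
```

On real trajectories the table would have one row per frame and one column per coordinate, which is where GPU-accelerated libraries become relevant; the arithmetic, however, is the same as in this small example.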

    D4.4 - Final solution for data management  

    Revised architecture description of the first version of the I-O storage platform optimized for LIGATE use cases.

    D4.5 - Workflow benchmark specification and initial results - M18

    The HyperQueue framework for job management was presented. Efficient multi-GPU and multi-node execution of AI applications and frameworks on the GPU nodes of the Karolina supercomputer was demonstrated. An experimental API for the GROMACS pipeline is under development to make it more robust, maintainable and scalable.

    D4.6 - Final version of scalable workflow  

    Stable release of the scalable end-to-end data processing pipeline for LIGATE use cases. 

    D5.1 - Integrated Platform  

    This task starts the development and integration of the CADD platform, based on the specifications defined and updated in Task 1.1 and on the integration plan defined in Task 1.2 (both documented in D1.1). Its purpose is twofold: to deploy the CADD platform on the EuroHPC target systems and on the customized computing nodes provided by E4.

    D5.2 - Validated Platform on Known Dataset  

    This task is devoted to functionally validating the outcome of the CADD platform enhancement. It includes the preparation of a dataset composed of known compounds derived from the literature and from publicly available databases.
    This dataset is used as a test case for the continuous validation and performance assessment of the platform. It defines the baseline for the CADD platform validation process and is used to assess the improvement achieved in terms of the selected metrics. Specific actions carried out:

    • Definition of benchmark datasets for validation of ligand poses and binding affinity calculations; quality control of structures datasets (DOMPE, UNIBAS).
    • Evaluation of the baseline of the CADD platform.
    • Evaluation and definition of the novel baseline of the intermediate and final tool version (Alpha, Beta, Candidate, Release).

    D5.3 - Validated Industrial Use Case   

    This use case is implemented on the CINECA pre-exascale machine using the entire set of nodes. In this task, industrial validation and assessment phases have been carried out, taking the performance, energy efficiency and scalability of the CADD platform as target metrics. Compared with Deliverable 5.2, the aim of this task is to use the platform on a specific industrial case in order to assess its limits in an industrially and socially relevant scenario. Specific progress was made on:

  • Collaboration with a national research program on microbial drug resistance in Switzerland on selecting relevant targets and testing hits (UNIBAS).
  • Definition of a use case (e.g. AMR drug targets); structural modelling of target receptor variation (e.g. resistance mutations, emerging pandemic strains) (UNIBAS).
  • Extreme-scale structure-based virtual screening using the released version of LIGATE to select a privileged library of putative antibiotics (DOMPE, POLIMI, CINECA, IT4I, KTH).
  • Synthesis and acquisition of chemical library (DOMPE).
  • Experimental validation of hits on selected use cases in vitro assays (DOMPE, UNIBAS).

    D6.1 Setup and Maintenance of LIGATE Website & Social Media

    Coordination of dissemination (LIGATE webpage, workshops, conferences, articles, and book) and communication (use of social networks such as LinkedIn, Facebook, Twitter and YouTube) activities.

    D6.2 Initial Dissemination and Communication Plan  

    This document describes the coordination of dissemination and communication activities (LIGATE webpage, workshops, conferences, articles, and book).
    It shows in detail the steps, stages, messages and tools we are using to spread the progress and results of the project widely.

    D6.3 Initial Exploitation Plan   

    This deliverable defines the strategy for exploiting the project results. The activity, led by CHELONIA in close collaboration with DOMPE, analyzed strategies and implemented actions to maximize the use of the LIGATE CADD platform and its results beyond the project partners and time frame.

    D6.4 Dissemination and Communication Plan   

    This document reports on the communication and dissemination activities conducted throughout the life of the LIGATE project. It describes the methods used to spread the information and knowledge generated by the project widely, among and beyond the members of the consortium. KPIs are provided to assess impact.

    D6.5 Final Exploitation Plan   

    Outlined herein are the strategies for maximizing the utility of the LIGATE platform, including potential market opportunities, collaborative initiatives, and approaches to navigate foreseeable challenges. With a comprehensive grasp of the platform’s versatility and its market potential, we position LIGATE as a pivotal force in the next generation of computational drug discovery.
    Building on the modular approach of the LIGATE project, we examine specific exploitable results that are relevant to various sectors. This part of the exploitation plan outlines key outcomes such as HyperQueue, LiGen, GROMACS, and Celerity, detailing their potential applications and integration paths. The analysis of these technologies confirms their alignment with the goals of our building block strategy and supports the broader objective of promoting and implementing these solutions effectively in the post-project phase.

    D7.1 - Lessons Learnt from the LIGATE Project  

    In this deliverable we draw conclusions from LIGATE’s experience and provide lessons learnt for further research opportunities and collaboration.

    Milestones

    A SYCL version of LiGen running on GPUs was a critical achievement, necessary to compare the software's performance with that of its CUDA version. Since this result had not yet been attained, the partners agreed to postpone the release; as this was an internal milestone, the delay had no impact on the project evaluation by the EC.
    Two SYCL versions of LiGen modules (LiGen Dock and LiGen Score) for a single molecule, running smoothly on NVIDIA GPUs, were produced. Since neither POLIMI nor UniSA had AMD GPUs available, no tests were carried out on that hardware.

    Afterwards, a first integrated version of the CADD platform and a first version of the programming tools associated with the platform were released, and on April 23rd 2024 the consortium assessed the candidate release of the CADD platform, integrated and ready for the functional experiments in WP5. The CADD workflow has been executed on EuroHPC machines.

    ADDRESS

    Ligate
    c/o Dompé farmaceutici
    Via Tommaso De Amicis, 95
    80145 Napoli, Italy

    This project has received funding from the European High-Performance Computing Joint Undertaking (JU) under grant agreement No 956137. The JU receives support from the European Union’s Horizon 2020 research and innovation programme and Italy, Sweden, Austria, Czech Republic, Switzerland.