Publications


Abstract

Computational workflows describe the complex multi-step methods that are used for data collection, data preparation, analytics, predictive modelling, and simulation that lead to new data products. They can inherently contribute to the FAIR data principles: by processing data according to established metadata; by creating metadata themselves during the processing of data; and by tracking and recording data provenance. These properties aid data quality assessment and contribute to secondary data usage. Moreover, workflows are digital objects in their own right. This paper argues that FAIR principles for workflows need to address their specific nature in terms of their composition of executable software steps, their provenance, and their development.
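
To make the provenance point concrete, here is a minimal sketch of a workflow step that records what it consumed and produced as machine-readable metadata; run_step is a hypothetical helper, and no specific standard (such as W3C PROV) is implied:

```python
# Minimal sketch of a workflow step that records provenance as it runs.
# run_step is a hypothetical helper; no specific standard (e.g. W3C PROV)
# is implied.
import hashlib
import json
import time

def run_step(step_name, infile, outfile, transform):
    with open(infile, "rb") as f:
        data = f.read()
    result = transform(data)
    with open(outfile, "wb") as f:
        f.write(result)
    # The provenance record: what ran, on which input, producing which output.
    return {
        "step": step_name,
        "input": {"path": infile, "sha256": hashlib.sha256(data).hexdigest()},
        "output": {"path": outfile, "sha256": hashlib.sha256(result).hexdigest()},
        "ended_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    }

with open("raw.txt", "wb") as f:
    f.write(b"hello")
print(json.dumps(run_step("uppercase", "raw.txt", "clean.txt", bytes.upper), indent=2))
```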

Authors: Carole Goble, Sarah Cohen-Boulakia, Stian Soiland-Reyes, Daniel Garijo, Yolanda Gil, Michael R. Crusoe, Kristian Peters, Daniel Schober

Date Published: 2020

Publication Type: Journal

Abstract

This paper presents a performance evaluation of a phylogenetic networks framework in the environment of the Santos Dumont supercomputer. The work reinforces the benefits of parallelizing the framework using parallel approaches based on High-Throughput Computing (HTC) and High-Performance Computing (HPC). The results of the parallel execution of the proposed framework demonstrate that this type of bioinformatics experiment is well suited to HPC environments, even though not all of the framework's component tasks and programs were designed to exploit scalability in HPC environments or multi-level parallelism techniques. A comparative analysis of executing the five pipelines sequentially (as originally designed and used by bioinformaticians) gave an estimated time of 81.67 minutes, whereas executing the same experiment through the framework, which runs the five pipelines in parallel and benefits from better task management, yielded a total execution time of 38.73 minutes. This improvement of approximately 2.11x in execution time suggests that using an optimized framework reduces computational time, improves resource allocation, and shortens allocation wait times.
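
The reported improvement follows directly from the two measured times; a quick check in Python:

```python
# Quick check of the reported speedup from the two measured times.
sequential_min = 81.67  # five pipelines executed one after another
parallel_min = 38.73    # the same five pipelines run through the framework
print(f"speedup: {sequential_min / parallel_min:.2f}x")  # -> speedup: 2.11x
```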

Authors: Rafael Terra, Kary Ocaña, Carla Osthoff, Lucas Cruz, Philippe Navaux, Diego Carvalho

Date Published: 19th Oct 2022

Publication Type: InProceedings

Abstract

In recent years, the development of technologies such as next-generation sequencing and high-performance computing has allowed the execution of bioinformatics experiments of high complexity that are computationally intensive. Different bioinformatics fields need high-performance computing platforms to take advantage of parallelism and task distribution through specialized scientific workflow management systems. One bioinformatics field that needs high-performance computing is phylogeny, which expresses the evolutionary relationships between genes and organisms, establishing which of them are most closely related evolutionarily. Phylogeny is used in several approaches, such as species classification, the discovery of individuals’ kinship, the identification of the origins of pathogens, and even conservation biology. One way of representing these phylogenetic relationships is with phylogenetic networks. However, the construction of these networks uses computationally intensive algorithms that require the constant manipulation of different input data. This work develops a framework for the construction of explicit phylogenetic networks, modeling a scientific workflow that brings together different methods for constructing the networks and the required treatment of input data. The framework was developed to allow multiple flows of the workflow to run in an automated, parallel, and distributed manner in a single execution and to be executable in high-performance computing environments, a challenging task since the tools used were not developed with this environment in mind. To orchestrate the workflow tasks, the scalable parallel programming library Parsl was used, enabling optimizations in the execution of workflow tasks and better management of resources. Two versions of the framework were developed, called Single Partition and Multi Partition, which differ in the manner in which resources are used. In tests, execution time improved by a factor of about five compared with the sequential execution of a flow without the optimizations. The framework was validated using public Dengue virus genome data, which were processed, annotated, and executed in the framework on the Santos Dumont supercomputer. The construction of the genomes’ explicit phylogenetic networks indicates that the framework is a functional, efficient, and easy-to-use tool.
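
The abstract names Parsl as the orchestration layer. Below is a minimal sketch, assuming hypothetical flow names and a local-threads configuration, of how independent flows can be submitted concurrently as Parsl apps; it is not the framework's actual code:

```python
# Minimal sketch of launching independent pipeline flows as Parsl apps.
# Flow names and bodies are placeholders, not the framework's code.
import parsl
from parsl import python_app
from parsl.configs.local_threads import config

parsl.load(config)  # local threads; an HPC run would load a cluster config

@python_app
def run_flow(name, genomes):
    # Stand-in for one flow: alignment, model selection, network construction...
    return f"{name} finished for {genomes}"

# All five flows are submitted at once and run concurrently.
futures = [run_flow(f"flow-{i}", "dengue_genomes.fasta") for i in range(5)]
print([f.result() for f in futures])  # wait for every flow to complete
```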

Authors: Rafael Terra, Kary Ocaña, Carla Osthoff, Diego Carvalho

Date Published: 18th Feb 2022

Publication Type: Master's Thesis

Abstract

Evolutionary processes and the dispersal of Dengue genomes in Brazil are relevant to the impact and to the endemo-epidemic and social surveillance of emerging arboviruses. Phylogenetic trees and networks make it possible to display evolutionary and reticulate events in viruses arising from high diversity and mutation rates and frequent homologous recombination. We present a parallel and distributed scientific workflow for phylogenetic networks, designed to work with the diversity of tools and resources of computational biology experiments and coupled to high-performance computing environments. We report an improvement in execution time of approximately 5x compared with sequential execution in analyses of Dengue genomes, with identification of recombination events.

Authors: Rafael Terra, Micaella Coelho, Lucas Cruz, Marco Garcia-Zapata, Luiz Gadelha, Carla Osthoff, Diego Carvalho, Kary Ocaña

Date Published: 18th Jul 2021

Publication Type: InProceedings

Abstract

Background A new era of flu surveillance has already started based on the genetic characterization and exploration of influenza virus evolution at whole-genome scale. Although this has been prioritized by national and international health authorities, the demanded technological transition to whole-genome sequencing (WGS)-based flu surveillance has been particularly delayed by the lack of bioinformatics infrastructures and/or expertise to deal with primary next-generation sequencing (NGS) data. Results We developed and implemented INSaFLU (“INSide the FLU”), which is the first influenza-oriented, free, web-based bioinformatics suite that deals with primary NGS data (reads) towards the automatic generation of the output data that are actually the core first-line “genetic requests” for effective and timely influenza laboratory surveillance (e.g., type and sub-type, gene and whole-genome consensus sequences, variants’ annotation, alignments and phylogenetic trees). By handling NGS data collected from any amplicon-based schema, the implemented pipeline enables any laboratory to perform multi-step, software-intensive analyses in a user-friendly manner without previous advanced training in bioinformatics. INSaFLU gives access to user-restricted sample databases and project management, being a transparent and flexible tool specifically designed to automatically update project outputs as more samples are uploaded. Data integration is thus cumulative and scalable, fitting the need for continuous epidemiological surveillance during flu epidemics. Multiple outputs are provided in nomenclature-stable and standardized formats that can be explored in situ or through multiple compatible downstream applications for fine-tuned data analysis. This platform additionally flags samples as “putative mixed infections” if the population admixture involves influenza viruses with clearly distinct genetic backgrounds, and enriches the traditional “consensus-based” influenza genetic characterization with relevant data on influenza sub-population diversification through an in-depth analysis of intra-patient minor variants. This dual approach is expected to strengthen our ability not only to detect the emergence of antigenic and drug-resistance variants but also to decode alternative pathways of influenza evolution and to unveil intricate routes of transmission. Conclusions In summary, INSaFLU supplies public health laboratories and influenza researchers with an open “one size fits all” framework, potentiating the operationalization of a harmonized multi-country WGS-based surveillance for influenza virus.
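
As an illustration of the kind of flagging described above, the sketch below marks a sample as a putative mixed infection when many sites show intermediate minor-variant frequencies; the thresholds, input format, and decision rule are assumptions for illustration, not INSaFLU's actual criteria:

```python
# Illustrative only: flag a sample as a "putative mixed infection" when many
# sites carry minor variants at intermediate frequency. Thresholds and input
# format are assumptions, not INSaFLU's actual decision rule.
def flag_mixed_infection(minor_allele_freqs, band=(0.20, 0.80), min_sites=20):
    """minor_allele_freqs: per-site minor allele frequencies in [0, 1]."""
    intermediate = [f for f in minor_allele_freqs if band[0] <= f <= band[1]]
    return len(intermediate) >= min_sites

sample = [0.02, 0.45, 0.50, 0.38, 0.01] * 10  # toy per-site frequencies
print(flag_mixed_infection(sample))  # True: 30 sites fall in the band
```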

Authors: Vítor Borges, Miguel Pinheiro, Pedro Pechirra, Raquel Guiomar, João Paulo Gomes

Date Published: 1st Dec 2018

Publication Type: Journal

Abstract

This report reviews the current state of the art in automated tools, services and workflows for extracting information from images of natural history specimens and their labels. We consider the potential for repurposing existing tools, including workflow management systems, and areas where more development is required. This paper was written as part of the SYNTHESYS+ project for software development teams and informatics teams working on new software-based approaches to improve mass digitisation of natural history specimens.

Authors: Stephanie Walton, Laurence Livermore, Olaf Bánki, Robert W. N. Cubey, Robyn Drinkwater, Markus Englund, Carole Goble, Quentin Groom, Christopher Kermorvant, Isabel Rey, Celia M Santos, Ben Scott, Alan Williams, Zhengzhe Wu

Date Published: 14th Aug 2020

Publication Type: Journal

Abstract

Predicting physical or functional associations through protein-protein interactions (PPIs) is an integral approach for inferring novel protein functions and discovering new drug targets during repositioning analysis. Recent advances in high-throughput data generation and multi-omics techniques have enabled large-scale PPI predictions, promoting several computational methods based on different levels of biological evidence. However, integrating multiple results and strategies to optimize and automatically extract interaction features, and to scale up the entire PPI prediction process, is still challenging. Most procedures do not offer an in-silico validation process to evaluate the predicted PPIs. In this context, this paper presents the PredPrIn scientific workflow, which enables PPI prediction based on multiple lines of evidence, including the structure, sequence, and functional annotation categories, by combining boosting and stacking machine learning techniques. We also present a pipeline (PPIVPro) for the validation process based on cellular co-localization filtering and a focused search for PPI evidence in scientific publications. Our combined approach thus provides the means for large-scale training or prediction of new PPIs and a strategy to evaluate prediction quality. PredPrIn and PPIVPro are publicly available at https://github.com/YasCoMa/predprin and https://github.com/YasCoMa/ppi_validation_process.
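
A minimal sketch of the boosting-plus-stacking combination the abstract describes, using scikit-learn with synthetic stand-ins for the structure, sequence, and annotation features; this is not PredPrIn's actual code, which lives in the repositories above:

```python
# Sketch of combining boosting and stacking for PPI prediction; synthetic
# features stand in for structure, sequence and annotation evidence.
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=30, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

stack = StackingClassifier(
    estimators=[("boosting", GradientBoostingClassifier()),
                ("forest", RandomForestClassifier())],
    final_estimator=LogisticRegression(),  # meta-learner over base outputs
)
stack.fit(X_train, y_train)
print(f"held-out accuracy: {stack.score(X_test, y_test):.2f}")
```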

Authors: Yasmmin Côrtes Martins, Artur Ziviani, Marisa Fabiana Nicolás, Ana Tereza Ribeiro de Vasconcelos

Date Published: 6th Sep 2021

Publication Type: Journal

Abstract

Point cloud datasets provided by LiDAR have become an integral part in many research fields including archaeology, forestry, and ecology. Facilitated by technological advances, the volume of these datasets has steadily increased, with modern airborne laser scanning surveys now providing high-resolution, (super-)national scale, multi-terabyte point clouds. However, their wider scientific exploitation is hindered by the scarcity of open source software tools capable of handling the challenges of accessing, processing, and extracting meaningful information from massive datasets, as well as by the domain-specificity of existing tools. Here we present Laserchicken, a user-extendable, cross-platform Python tool for extracting statistical properties of flexibly defined subsets of point cloud data, aimed at enabling efficient, scalable, distributed processing of multi-terabyte datasets. We demonstrate Laserchicken’s ability to unlock these transformative new resources, e.g. in macroecology and species distribution modelling, where it is used to characterize the 3D vegetation structure at high resolution (<10 m) across whole countries or regions. We further discuss its potential as a domain agnostic, flexible tool that can also facilitate novel applications in other research fields.
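
Conceptually, the core operation is computing statistical features over flexibly defined subsets of a point cloud; the sketch below illustrates that idea with plain numpy and toy data, and does not use Laserchicken's actual API:

```python
# Conceptual sketch with plain numpy (not Laserchicken's API): compute
# statistical features for a flexibly defined subset of a point cloud.
import numpy as np

rng = np.random.default_rng(0)
points = rng.uniform(0.0, 100.0, size=(100_000, 3))  # toy x, y, z cloud

# Subset: points inside one 10 m x 10 m cell of a target raster.
mask = (points[:, 0] < 10) & (points[:, 1] < 10)
z = points[mask, 2]

features = {  # height-distribution metrics of the kind used for vegetation
    "point_count": int(z.size),
    "max_height": float(z.max()),
    "mean_height": float(z.mean()),
    "height_p95": float(np.percentile(z, 95)),
}
print(features)
```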

Authors: C. Meijer, M.W. Grootes, Z. Koma, Y. Dzigan, R. Gonçalves, B. Andela, G. van den Oord, E. Ranguelova, N. Renaud, W.D. Kissling

Date Published: 1st Jul 2020

Publication Type: Journal

Abstract

Quantifying ecosystem structure is of key importance for ecology, conservation, restoration, and biodiversity monitoring because the diversity, geographic distribution and abundance of animals, plants and other organisms are tightly linked to the physical structure of vegetation and associated microclimates. Light Detection And Ranging (LiDAR) — an active remote sensing technique — can provide detailed and high resolution information on ecosystem structure because the laser pulse emitted from the sensor and its subsequent return signal from the vegetation (leaves, branches, stems) delivers three-dimensional point clouds from which metrics of vegetation structure (e.g. ecosystem height, cover, and structural complexity) can be derived. However, processing 3D LiDAR point clouds into geospatial data products of ecosystem structure remains challenging across broad spatial extents due to the large volume of national or regional point cloud datasets (typically multiple terabytes consisting of hundreds of billions of points). Here, we present a high-throughput workflow called ‘Laserfarm’ enabling the efficient, scalable and distributed processing of multi-terabyte LiDAR point clouds from national and regional airborne laser scanning (ALS) surveys into geospatial data products of ecosystem structure. Laserfarm is a free and open-source, end-to-end workflow which contains modular pipelines for the re-tiling, normalization, feature extraction and rasterization of point cloud information from ALS and other LiDAR surveys. The workflow is designed with horizontal scalability and can be deployed with distributed computing on different infrastructures, e.g. a cluster of virtual machines. We demonstrate the Laserfarm workflow by processing a country-wide multi-terabyte ALS dataset of the Netherlands (covering ∼34,000 km² with ∼700 billion points and ∼16 TB of uncompressed LiDAR point clouds) into 25 raster layers at 10 m resolution capturing ecosystem height, cover and structural complexity at a national extent. The Laserfarm workflow, implemented in Python and available as Jupyter Notebooks, is applicable to other LiDAR datasets and enables users to execute automated pipelines for generating consistent and reproducible geospatial data products of ecosystem structure from massive amounts of LiDAR point clouds on distributed computing infrastructures, including cloud computing environments. We provide information on workflow performance (including total CPU times, total wall-time estimates and average CPU times for single files and LiDAR metrics) and discuss how the Laserfarm workflow can be scaled to other LiDAR datasets and computing environments, including remote cloud infrastructures. The Laserfarm workflow allows a broad user community to process massive amounts of LiDAR point clouds for mapping vegetation structure, e.g. for applications in ecology, biodiversity monitoring and ecosystem restoration.
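
The modular, horizontally scalable design described above can be pictured as independent tiles flowing through the same chain of stages; in the sketch below, the stage names mirror the abstract but the bodies are placeholders and a local process pool stands in for cluster-scale distribution (this is not the Laserfarm API):

```python
# Conceptual sketch of Laserfarm-style modular pipelines; stage names mirror
# the abstract but the bodies are placeholders, not the library's API.
from concurrent.futures import ProcessPoolExecutor

def retile(tile):           return f"{tile}:retiled"
def normalize(tile):        return f"{tile}:normalized"
def extract_features(tile): return f"{tile}:features"
def rasterize(tile):        return f"{tile}:raster"

def process_tile(tile):
    # Every tile flows through the same chain of stages.
    for stage in (retile, normalize, extract_features, rasterize):
        tile = stage(tile)
    return tile

if __name__ == "__main__":
    tiles = [f"tile_{i:03d}" for i in range(8)]
    # Tiles are independent; a pool stands in for cluster-scale distribution.
    with ProcessPoolExecutor() as pool:
        print(list(pool.map(process_tile, tiles)))
```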

Authors: W. Daniel Kissling, Yifang Shi, Zsófia Koma, Christiaan Meijer, Ou Ku, Francesco Nattino, Arie C. Seijmonsbergen, Meiert W. Grootes

Date Published: 1st Dec 2022

Publication Type: Journal
