AccBD – Accelerated Biomarker Candidate Discovery
Idea and relation to VEDLIoT
In our context, features represent potential biomarker candidates. Biomarkers are measurable indicators of a medical risk factor or a biological condition; they are used to study a disease, to predict a diagnosis, and to determine the state of a disease or the effectiveness of a treatment. Potential biomarker candidates (features) can come from multiple sources. In addition to classical laboratory measurements, sociodemographic values and imaging data, modern high-throughput technologies like mass spectrometry or next-generation sequencing provide large amounts of additional measurements. In “omics” fields like proteomics, metabolomics and transcriptomics, thousands of proteins, metabolites, and genes can be measured. Although such measurements provide valuable information for diagnostic and prognostic purposes, they usually lead to high-dimensional data with very small sample sizes. Sample sizes are small because sample processing for high-throughput technologies can be laborious and expensive. For example, biomarker pilot studies may include well over 10,000 biomarker candidates (features) but fewer than 100 samples.
Advanced analysis of such high-dimensional data is computationally expensive and currently cannot be completed in a reasonable time frame. As a result, medical insights may remain undiscovered. AccBD therefore proposes to accelerate our biomarker candidate discovery workflow, an ensemble of embedded feature selection methods. The project aims to achieve this acceleration through optimal parallelization on the heterogeneous VEDLIoT hardware platform, distributing the computation across different computing substrates, such as CPU, GPU or FPGA. Parallelization on heterogeneous hardware and a focus on reducing energy consumption are not yet common practice for method development in bioinformatics. The project aims to show that using alternative hardware to save time and reduce energy consumption is nevertheless possible.
While this is not a typical IoT application by itself, the identification of biomarker candidates acts as an enabler for smart health applications, such as smart watches or smart homes. The COVID-19 pandemic has shown that wastewater monitoring for pathogens – not only the coronavirus – can help to contain infection outbreaks. The workflow could also be applied to such monitoring data.
- The Yeo-Johnson power transformation is a necessary preprocessing step for the data analyses. To replace its basic Python version, an accelerated C version will be developed and optimized.
- The developed C version of the Yeo-Johnson power transformation will be adapted to the requirements of an FPGA pipeline. Runtime and energy consumption will be optimized. The accelerated C and FPGA versions of the Yeo-Johnson power transformation will be connected to Python. Both options will be published as open-source software.
- Parallelization of the overarching bioinformatics workflow in Python. Analysis of the achieved acceleration and of different parallelization options in relation to their respective overhead. Comparison of the respective energy consumption.
- Performance analysis of GPU versus CPU usage in model training for machine learning models with small sample sizes. For model training, the Python library LightGBM is used in either its GPU or its CPU version.
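For orientation, the Yeo-Johnson power transformation named above can be sketched for a fixed transformation parameter as follows. This is a minimal pure-Python illustration of the standard piecewise definition, not the project's Python, C or FPGA implementation; the function names `yeo_johnson` and `yeo_johnson_column` and the parameter name `lmbda` are assumptions made for this example. In practice the parameter is usually estimated from the data by maximum likelihood, which this sketch leaves out.

```python
import math


def yeo_johnson(y: float, lmbda: float) -> float:
    """Yeo-Johnson transform of a single value for a fixed parameter lmbda."""
    eps = 1e-12  # tolerance for treating lmbda as exactly 0 or 2
    if y >= 0:
        if abs(lmbda) < eps:
            # Limit case lmbda -> 0 for non-negative values
            return math.log1p(y)
        return ((y + 1.0) ** lmbda - 1.0) / lmbda
    if abs(lmbda - 2.0) < eps:
        # Limit case lmbda -> 2 for negative values
        return -math.log1p(-y)
    return -(((1.0 - y) ** (2.0 - lmbda)) - 1.0) / (2.0 - lmbda)


def yeo_johnson_column(values, lmbda):
    """Apply the transform to every measurement of one feature."""
    return [yeo_johnson(v, lmbda) for v in values]
```

With `lmbda = 1` the transformation is the identity, which makes a convenient sanity check when comparing an accelerated version against a reference. Because each value is transformed independently of all others, the computation lends itself to a pipelined or data-parallel implementation.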
Data analysis for modern medical research, where a very small number of samples faces a large number of features, poses specific challenges. One of them is related to the curse of dimensionality. In addition, skewed data distributions and outliers are common in these biological datasets. Algorithms designed specifically for this use case are computationally intensive. To accelerate them, AccBD aims to investigate the optimal use of the heterogeneous VEDLIoT hardware platform for our complete workflow, employing heterogeneous and reconfigurable computing. To achieve acceleration and energy savings, the different compute kernels of the application are analyzed and mapped to the respective optimal hardware substrates (or micro-servers) of the VEDLIoT platform. In this way, both performance and energy efficiency are increased by combining parallel computing with optimization of the compute architecture itself.
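To illustrate why an ensemble workflow of this kind is a natural target for parallel computing, the following sketch runs independent bootstrap-based feature selection runs through an optional executor and counts how often each feature is selected. It is a simplified illustration under stated assumptions, not the AccBD workflow: the function names, the task layout and the scoring rule (absolute difference of class means, used here as a stand-in for a real embedded feature selection method) are all hypothetical.

```python
import concurrent.futures
import random
from collections import Counter


def score_feature(column, labels):
    """Stand-in relevance score: absolute difference of the two class means."""
    pos = [v for v, y in zip(column, labels) if y == 1]
    neg = [v for v, y in zip(column, labels) if y == 0]
    if not pos or not neg:
        return 0.0  # bootstrap sample contains only one class
    return abs(sum(pos) / len(pos) - sum(neg) / len(neg))


def select_on_bootstrap(task):
    """One ensemble member: resample the samples, rank features, keep top_k."""
    columns, labels, top_k, seed = task
    rng = random.Random(seed)  # per-run seed keeps runs reproducible
    n = len(labels)
    idx = [rng.randrange(n) for _ in range(n)]
    boot_labels = [labels[i] for i in idx]
    scores = [score_feature([col[i] for i in idx], boot_labels) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)
    return ranked[:top_k]


def ensemble_selection(columns, labels, n_runs=20, top_k=2, executor=None):
    """Count how often each feature index is selected across bootstrap runs.

    The runs are independent, so they can be mapped over any
    concurrent.futures executor; without one they run serially.
    """
    tasks = [(columns, labels, top_k, seed) for seed in range(n_runs)]
    mapper = executor.map if executor is not None else map
    counts = Counter()
    for selected in mapper(select_on_bootstrap, tasks):
        counts.update(selected)
    return counts
```

Passing a `concurrent.futures.ProcessPoolExecutor()` as `executor` distributes the runs across CPU cores; on platforms that spawn worker processes, that call should sit under an `if __name__ == "__main__":` guard. Whether the speed-up outweighs the overhead of starting processes and copying the data to them depends on the cost of a single run, which is the kind of trade-off the comparison of parallelization options is meant to quantify.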
All parts of the software developed and enhanced during the project will be publicly available as open-source software. From a medical point of view, this research will enable the improvement of machine learning for biomarker pilot studies. Only an accelerated workflow will make it possible to study larger multi-omics or high-throughput datasets. It will be possible to obtain more reliable results by applying state-of-the-art feature selection ensemble techniques based on, e.g., bagging and boosting. Enhanced reliability will increase the likelihood of finding more relevant biomarker candidates.
However, because it is available as a public good, the software can be used in other contexts as well.