Papers by Arinze Akutekwe
Development of Dynamic Bayesian Network for the Analysis of High-Dimensional Biomedical Data
Inferring gene regulatory networks (GRNs) from time-course expression data is a major challenge in bioinformatics. Advances in microarray technology have made high-dimensional biological datasets cheap and easy to produce; however, accurate analysis and prediction have been hampered by the curse of dimensionality, whereby the number of features is far larger than the number of samples. ... were significantly increased by the cisplatin and oxaliplatin platinum drugs, while expression levels of the Polo-like kinase and Cyclin B1 genes were both decreased by the platinum drugs. The methods and results obtained may be useful in the design of drugs and vaccines.

Inferring gene regulatory networks (GRNs) from time-course expression data is a major challenge in bioinformatics. Advances in microarray technology have made high-dimensional biological datasets cheap and easy to produce; however, accurate analysis and prediction have been hampered by the curse of dimensionality, whereby the number of features is far larger than the number of samples. The need for better statistical and predictive methods therefore continues to grow. The main aim of this thesis is to develop dynamic Bayesian network (DBN) methods for the analysis and prediction of temporal biomedical data. A two-stage computational bionetwork discovery approach is proposed. In the ovarian cancer case study, 39 out of 592 metabolomic features were selected by the Least Absolute Shrinkage and Selection Operator (LASSO) with the highest accuracy of 93%, and 21 chemical compounds were identified. The proposed approach is further improved by the appli...

Hypertension is a chronic medical condition in which the blood pressure in the arteries is elevated. Hypertension can lead to damaged organs, as well as several illnesses, such as renal failure (kidney failure), aneurysm, heart failure, stroke, or heart attack. In our investigation, ten subsets were designed for the male hypertension patient and control groups. In this paper we apply t-test and entropy feature selection methods, using 2-fold and 5-fold cross-validation as our model selection methods, with a K-nearest neighbour classifier. Among these groups, three biomarker sets (1, 3, 9) were chosen for four tables (t-test with 2-fold and 5-fold; entropy with 2-fold and 5-fold). From these biomarker sets, the one with the highest accuracy, the measure used to assess the classifier, was analysed and taken as the best model for each subset table. The subset tables were then compared with one another to find the most appropriate biomarker. The defined biomarker was searched within da...

2015 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), 2015
Inferring gene regulatory networks (GRNs) from time-course expression data is a major challenge in systems biology, and a comprehensive understanding of their dynamics is difficult. Most temporal inference methods for the dynamics of GRNs assume linear dependencies among genes, but this strong assumption does not truly represent the dynamics of GRNs, which are inherently nonlinear. Other parametric and non-parametric methods for modeling nonlinear dynamical systems, such as S-systems and causal structure identification (CSI), have been proposed for modeling time-course nonlinearities in GRNs; however, these methods are statistically inefficient and analytically intractable, especially in high dimensions. To overcome these problems, we propose an algorithm based on an optimized recurrent neural network (RNN) and a dynamic Bayesian network (DBN), called RNN-DBN. The inference algorithm for our DBN is based on a nonlinear state-space Elman recurrent neural network. Results on a Drosophila melanogaster nonlinear time-course benchmark dataset show that our method outperforms the G1DBN inference method, which is based on linear model assumptions. The algorithm is further applied to a time-course ovarian cancer dataset. The results show that the expression levels of three of five significant hub genes (flap structure-specific endonuclease 1, kinesin family member 11 and CDC6 cell division cycle 6 homolog (S. cerevisiae)) were decreased by oxaliplatin but remained constant with cisplatin. These may therefore be potential drug candidates for ovarian cancer.

2014 IEEE 6th International Conference on Adaptive Science & Technology (ICAST), 2014
Computational Intelligence methods have been applied to the automatic discovery of predictive models for the diagnosis of Hepatocellular Carcinoma (a.k.a. liver cancer). Evolutionary algorithms have proved to be efficient and robust methods for evolving the parameter values that optimize feature selection methods. Different computational methods for discovering a more robust set of molecular features for liver cancer have been proposed. These include methods combining other nature-inspired evolutionary algorithms, such as Particle Swarm Optimization, with classifiers like the Support Vector Machine (SVM). In this paper, we apply different variants of the Differential Evolution algorithm to optimize the parameters of feature selection algorithms using a proposed two-stage approach. Stage one fine-tunes the parameters of the feature selection methods and selects high-quality features. In stage two, a Dynamic Bayesian Network (DBN) is applied to infer temporal relationships among the selected features. We demonstrate our method using gene expression profiles of liver cancer patients. The results show that the SVM-based predictive model with the radial basis function kernel yielded a predictive accuracy of 100%. This model uses a subset of only eight features (genes), regarded as the most informative set for the diagnosis of the disease. In addition, among these eight genes, the DBN model of the selected features reveals that the SPINT2 gene inhibits HGF activator, which prevents the formation of active hepatocyte growth factor; hepatocytes make up over 80% of liver cells.

IET Systems Biology, 2015
Accurate and reliable modelling of protein-protein interaction networks for complex diseases such as colorectal cancer can help to better understand disease mechanisms and potentially discover new drugs. Different machine learning methods, such as empirical mode decomposition combined with least squares support vector machine, and discrete Fourier transform, have been widely utilised as classifiers and for automatic discovery of biomarkers for the diagnosis of the disease. The existing methods are, however, less efficient as they tend to ignore interaction with the classifier. In this study, the authors propose a two-stage optimisation approach to effectively select biomarkers and discover interactions among them. At the first stage, particle swarm optimisation (PSO) and differential evolution (DE) are used to optimise the parameters of the support vector machine recursive feature elimination algorithm; at the second stage, a dynamic Bayesian network is used to predict temporal relationships between biomarkers across two time points. Results show that the 18 and 25 biomarkers selected by the PSO-based and DE-based approaches, respectively, yield the same accuracy of 97.3% and F1-scores of 97.7% and 97.6%, respectively. The stratified analysis reveals that Alpha-2-HS-glycoprotein was a dominant hub gene with multiple interactions with other genes, including Fibrinogen alpha chain, which is also a potential biomarker for colorectal cancer.
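
The parameter-tuning stage described in this abstract can be illustrated with a basic particle swarm optimiser. The following is a minimal sketch, not the authors' implementation: the function names, the swarm settings and the toy objective `toy_error` are all assumptions made for illustration. In a real pipeline the objective would be the cross-validated error of the feature selection method for a given parameter setting.

```python
import random


def pso_minimize(objective, bounds, n_particles=20, n_iters=200, seed=0,
                 inertia=0.7, c_personal=1.5, c_global=1.5):
    """Minimise objective(list_of_floats) inside box bounds with a basic PSO."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]                 # personal best positions
    best_val = [objective(p) for p in pos]         # personal best values
    g = min(range(n_particles), key=lambda i: best_val[i])
    gbest_pos, gbest_val = best_pos[g][:], best_val[g]

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity: inertia + pull to personal best + pull to global best.
                vel[i][d] = (inertia * vel[i][d]
                             + c_personal * r1 * (best_pos[i][d] - pos[i][d])
                             + c_global * r2 * (gbest_pos[d] - pos[i][d]))
                lo, hi = bounds[d]
                pos[i][d] = min(max(pos[i][d] + vel[i][d], lo), hi)
            val = objective(pos[i])
            if val < best_val[i]:
                best_val[i], best_pos[i] = val, pos[i][:]
                if val < gbest_val:
                    gbest_val, gbest_pos = val, pos[i][:]
    return gbest_pos, gbest_val


def toy_error(params):
    """Hypothetical stand-in for a cross-validated error surface.

    Its minimum is placed at log_c = 1 and log_gamma = -2 purely for illustration.
    """
    log_c, log_gamma = params
    return (log_c - 1.0) ** 2 + (log_gamma + 2.0) ** 2


best_params, best_error = pso_minimize(toy_error, [(-5.0, 5.0), (-5.0, 5.0)])
print(best_params, best_error)
```

Searching over the logarithm of each parameter is a common design choice when the useful values span several orders of magnitude; the abstract does not say whether the authors did so.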

Comprehensive understanding of gene regulatory networks (GRNs) is a major challenge in systems biology. Most methods for modeling and inferring the dynamics of GRNs, such as those based on state space models, vector autoregressive models and the G1DBN algorithm, assume linear dependencies among genes. However, this strong assumption does not truly represent the time-course relationships across genes, which are inherently nonlinear. Nonlinear modeling methods such as S-systems and causal structure identification (CSI) have been proposed, but are known to be statistically inefficient and analytically intractable in high dimensions. To overcome these limitations, we propose an optimized ensemble approach based on support vector regression (SVR) and dynamic Bayesian networks (DBNs). The method, called SVR-DBN, uses nonlinear kernels of the SVR to infer the temporal relationships among genes within the DBN framework. The two-stage ensemble is further improved by SVR parameter optimization using Particle Swarm Optimization. Results on eight in silico-generated datasets, and on two real-world datasets of Drosophila melanogaster and Escherichia coli, show that our method outperformed the G1DBN algorithm by a total average accuracy of 12%. We further applied our method to model the time-course relationships of ovarian carcinoma. From our results, four hub genes were discovered. Stratified analysis further showed that the expression levels of the Prostate differentiation factor and BTG family member 2 genes were significantly increased by the cisplatin and oxaliplatin platinum drugs, while the expression levels of the Polo-like kinase and Cyclin B1 genes were both decreased by the platinum drugs. These hub genes might be potential biomarkers for ovarian carcinoma.
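
To make the idea of a first-order temporal network concrete, the sketch below scores a candidate edge from gene i at time t to gene j at time t+1 by their lag-1 Pearson correlation. This is deliberately the simple linear view that the abstract argues against, and it is neither SVR-DBN nor G1DBN. The gene names, expression values and threshold are invented for illustration.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)


def lag1_edges(series, threshold=0.9):
    """Return (source, target, r) for pairs whose lag-1 correlation is strong.

    series maps a gene name to its expression values over consecutive time points.
    Self-loops are skipped to keep the sketch small.
    """
    edges = []
    for src, x in series.items():
        for dst, y in series.items():
            if src == dst:
                continue
            r = pearson(x[:-1], y[1:])   # source at time t against target at t+1
            if abs(r) >= threshold:
                edges.append((src, dst, round(r, 3)))
    return edges


# Hypothetical time course: B follows A one step later, C is repressed by A.
toy_series = {
    "A": [1, 2, 4, 3, 5, 2],
    "B": [3, 2, 4, 8, 6, 10],
    "C": [5, 9, 8, 6, 7, 5],
}
edges = lag1_edges(toy_series)
print(edges)
```

A positive r suggests activation and a negative r suggests inhibition under this linear view. Replacing `pearson` with the fit quality of a nonlinear regressor is the kind of change the abstract motivates.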

Machine learning and statistical techniques have been applied in identifying biomarkers and constructing predictive models for the early diagnosis of various diseases, including hepatocellular carcinoma. These include the method of direct random walk for feature selection and classification using logistic regression. In this paper, we apply a two-stage approach for the discovery of a novel bio-network in the diagnosis of hepatocellular carcinoma. The results show that seven features selected by the Least Absolute Shrinkage and Selection Operator (LASSO) using 10-fold cross-validation yielded the best class discriminatory performance, with the highest accuracy of 96.25%. The dynamic Bayesian network modeling stratified the following biomarkers as associated with the disease: the transient receptor potential cation channel, subfamily V, member 3 gene might inhibit E2F transcription factor 4, p107/p130-binding, and superkiller viralicidic activity 2-like 2 might also play an inhibitory role against protein tyrosine phosphatase, non-receptor type 4 (megakaryocyte), with all having their cDNA sources from the liver.

Hypertension is a chronic medical condition in which the blood pressure in the arteries is elevated. Hypertension can lead to damaged organs, as well as several illnesses, such as renal failure (kidney failure), aneurysm, heart failure, stroke, or heart attack. It is therefore important to understand the mechanisms of the disease and identify relevant molecular markers. In this study, we used a hypertension data set collected from male subjects in Taiwan [3] to identify potential biomarkers. The data set includes 77 non-medicated young-onset male hypertensive cases and 82 male controls. The expression level of 22184 genes was also measured for each subject. For the statistical assessment, ten different subsets of the samples, each containing roughly the same number of samples, were constructed for the male hypertension patient and control groups. To analyse the data set and identify potential genomic biomarkers, two statistical feature selection methods were applied, namely the t-test and entropy, using 2-fold and 5-fold cross-validation. To validate the selected genes, a K-nearest neighbour classifier was used. Among these groups, three biomarker sets (1, 3, 9) were chosen for four tables (t-test with 2-fold and 5-fold; entropy with 2-fold and 5-fold). From these biomarker sets, the one with the highest accuracy, the measure used to assess the classifier, was analysed and taken as the best model for each subset table. The subset tables were then compared with one another to find the most appropriate biomarker. The identified biomarker was searched for in a database to find its relationship with the disease. Consequently, highly recurrent and highly accurate candidate genes can be further analysed. Further analysis (both database and wet-lab study) is suggested for the highly recurrent genes such as Hs. 683236 (null), Hs. 475902, 420541, 656129, 647705 and 657792, as they may be potential biomarkers that could help explain the mechanism of the disease and be used for its early diagnosis.
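
The t-test ranking, K-nearest neighbour validation and k-fold cross-validation described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the study's code: it uses a Welch t statistic, omits the entropy criterion, and runs on a small invented data set rather than the 22184-gene expression matrix. All function names are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Absolute Welch t statistic for two groups of values."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return abs(mean(a) - mean(b)) / se if se > 0 else 0.0


def top_features(X, y, n_keep):
    """Indices of the n_keep features that best separate cases (1) from controls (0)."""
    scores = []
    for j in range(len(X[0])):
        cases = [row[j] for row, label in zip(X, y) if label == 1]
        controls = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append((welch_t(cases, controls), j))
    return [j for _, j in sorted(scores, reverse=True)[:n_keep]]


def knn_predict(train_X, train_y, x, k=3):
    """Majority vote among the k training rows closest to x (squared Euclidean)."""
    by_distance = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), label)
        for row, label in zip(train_X, train_y)
    )
    votes = [label for _, label in by_distance[:k]]
    return 1 if sum(votes) * 2 > len(votes) else 0


def cv_accuracy(X, y, n_folds, n_keep, k=3):
    """k-fold accuracy with feature selection repeated inside every training fold."""
    correct = 0
    for fold in range(n_folds):
        train_idx = [i for i in range(len(X)) if i % n_folds != fold]
        test_idx = [i for i in range(len(X)) if i % n_folds == fold]
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        keep = top_features(train_X, train_y, n_keep)
        reduced = [[row[j] for j in keep] for row in train_X]
        for i in test_idx:
            x = [X[i][j] for j in keep]
            correct += knn_predict(reduced, train_y, x, k) == y[i]
    return correct / len(X)


# Invented data: feature 0 differs between groups, features 1 and 2 are noise.
X = [[1.0, 3.1, 0.2], [1.2, 2.9, 0.9], [5.1, 3.0, 0.4], [4.8, 3.2, 0.7],
     [0.9, 2.8, 0.6], [1.1, 3.3, 0.1], [5.0, 3.1, 0.8], [5.3, 2.7, 0.3],
     [1.3, 3.0, 0.5], [0.8, 3.2, 0.7], [4.9, 2.9, 0.2], [5.2, 3.1, 0.6]]
y = [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1]

print(top_features(X, y, 1))
print(cv_accuracy(X, y, n_folds=2, n_keep=1), cv_accuracy(X, y, n_folds=5, n_keep=1))
```

Ranking features inside each training fold, rather than once on the full data, keeps the held-out samples from influencing the selection. The abstract does not state which of the two orderings the study used.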

IEEE-EMBS International Conference on Biomedical and Health Informatics (BHI), Jun 2014
Machine learning and other computational techniques have been applied in identifying biomarkers and constructing computational predictive models for the early diagnosis of ovarian cancer. Most studies focus on large biopolymers such as DNA, RNA and proteins, while small metabolic molecules have received significantly less attention. In addition, studies have focused only on the classification performance of the biomarkers selected by various feature selection methods and do not consider possible temporal relationships among feature subsets. In this paper, we propose a two-stage bio-network discovery approach for ovarian cancer metabolites. At the first stage, feature selection is carried out using four different selection methods. The best features are selected based on overall best classification performance. At the second stage, a Dynamic Bayesian Network (DBN) is used to model the temporal relationships among the stratified features. The results show that 39 features, out of a total of 592 metabolomics features, selected by the Least Absolute Shrinkage and Selection Operator (LASSO) feature selection method yielded the highest predictive accuracy of 93%. Two DBN methods are then used to model the temporal relationships among the 39 features. The results show consistently significant relationships between the cancer biomarkers at features 219 and 225, and between features 543 and 219, across time points.
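
LASSO selects features by shrinking the coefficients of weak predictors to exactly zero. The sketch below shows this with cyclic coordinate descent on a squared-error objective. It is a simplified illustration on invented, centred data, not the classification pipeline used in the paper. The penalty value is an assumption and would normally be chosen by cross-validation.

```python
def soft_threshold(value, penalty):
    """Shrink value towards zero by penalty, returning 0.0 inside [-penalty, penalty]."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso_coordinate_descent(X, y, penalty, n_sweeps=200):
    """Minimise (1 / 2n) * ||y - X b||^2 + penalty * ||b||_1 without an intercept.

    Columns of X and the response y are assumed to be centred beforehand.
    """
    n, p = len(X), len(X[0])
    coef = [0.0] * p
    residual = list(y)  # equals y - X b while every coefficient is zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0:
                continue
            # Covariance of feature j with the residual that excludes feature j.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * coef[j])
                      for i in range(n)) / n
            new_coef = soft_threshold(rho, penalty) / col_sq[j]
            delta = new_coef - coef[j]
            if delta:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                coef[j] = new_coef
    return coef


# Invented centred data: y = 3 * x0 + 0.2 * x1, with x2 unrelated to y.
X = [[-2, 1, 1], [-1, -2, 0], [0, 1, -1], [0, 1, -1], [1, -2, 0], [2, 1, 1]]
y = [-5.8, -3.4, 0.2, 0.2, 2.6, 6.2]

coef = lasso_coordinate_descent(X, y, penalty=0.5)
selected = [j for j, c in enumerate(coef) if c != 0.0]
print(coef, selected)
```

With this penalty the weak contribution of the second feature falls below the threshold and is dropped, while the strong first feature is kept with a coefficient shrunk below its true value of 3.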

Conference proceedings : ... Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE Engineering in Medicine and Biology Society. Annual Conference, 2014
Computational and machine learning techniques have been applied in identifying biomarkers and constructing predictive models for the diagnosis of hypertension. Strategies such as improved classification rules based on decision trees have been proposed. Other techniques such as Fuzzy Expert Systems (FES) and Neuro-Fuzzy Systems (NFS) have recently been applied. However, these methods lack the ability to detect temporal relationships among biomarker genes that would aid better understanding of the mechanism of hypertension. In this paper we apply a proposed two-stage bio-network construction approach that combines the power and computational efficiency of classification methods with the well-established predictive ability of the Dynamic Bayesian Network. We demonstrate our method by analysing a male young-onset hypertension microarray dataset. Four key genes were identified by the Least Absolute Shrinkage and Selection Operator (LASSO) and three Support Vector Machine Recursive Feature Elimination (SVM-RFE) methods. Results show that the cell regulation gene FOXQ1 may inhibit the expression of fucosyltransferase 6 (FUT6), and that ABCG1 (ATP-binding cassette sub-family G) may also play an inhibitory role against NR2E3 (nuclear receptor sub-family 2) and CGB2 (chorionic gonadotrophin).

2014 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 2014
Machine learning techniques for automatic discovery of biomarkers and construction of predictive models have been applied to the diagnosis of colorectal cancer. Strategies such as Empirical Mode Decomposition (EMD) combined with Least Squares Support Vector Machine (LS-SVM) have been proposed. Other methods using Discrete Fourier Transform (DFT) and a Support Vector Machine classifier have also been applied. However, these methods adopt the filter approach to feature selection, which ignores interaction with the classifier and results in poor selection of features. They are also unable to detect temporal relationships among biomarkers that would aid better understanding of the disease. In this paper, we apply a two-stage bio-discovery approach that is a hybrid of classifier-dependent embedded feature selection methods and modelling of the temporal association of selected features using a Dynamic Bayesian Network. Particle Swarm Optimization is also used to tune the parameters of the feature selection algorithms for improved generalization performance. We demonstrate our method using the serum protein profiles of colorectal cancer patients. Results show that 21 features selected by Support Vector Machine Recursive Feature Elimination with a linear kernel had 99.09% generalization performance, which outperforms that of previous studies. In addition, the analysis indicated that Angiotensinogen (serpin peptidase inhibitor, clade A, member 8) might inhibit IgA-inducing protein homolog (Bos taurus), and that Gem (nuclear organelle) associated protein 2 might play an inhibitory role against Alpha-2-HS-glycoprotein, which is also associated with liver cancer, with all having their cDNA sources from the bowel.
