Papers by Milan Vukicevic

Group Decision and Negotiation
Crowdsourcing and crowd voting systems are increasingly used for societal, industry, and academic problems (labeling, recommendations, social choice, etc.) because they can exploit the "wisdom of the crowd" and obtain good-quality solutions and/or voter satisfaction with high cost-efficiency. However, decisions based on crowd vote aggregation do not guarantee high-quality results because of the quality of crowd voter data. Additionally, such decisions often fail to satisfy the majority of voters because of data heterogeneity (multimodal or uniform vote distributions) and/or outliers, which cause traditional aggregation procedures (e.g., central tendency measures) to propose decisions with low voter satisfaction. In this research, we propose a system for the integration of crowd and expert knowledge in a crowdsourcing setting with limited resources. The system addresses the problem of sparse voting data by using machine learning models (matrix factorization and regression) to estimate crowd and expert votes/grades. The problem of vote aggregation under multimodal or uniform vote distributions is addressed by including expert votes and aggregating crowd and expert votes with optimization and bargaining models (Kalai-Smorodinsky and Nash) usually used in game theory. Experimental evaluation on real-world and artificial problems showed that bargaining-based aggregation outperforms traditional methods in terms of the cumulative satisfaction of experts and the crowd. Additionally, the machine learning models showed satisfactory predictive performance and enabled cost reduction in the vote collection process.
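
A minimal sketch of the bargaining idea, assuming grades on a 1-10 scale and group satisfaction measured as negative mean absolute distance (both assumptions of this sketch, not details from the paper); the Nash solution maximizes the product of the two groups' gains over their disagreement points:

```python
import numpy as np

# Hypothetical vote data (values and scale are illustrative assumptions).
crowd_votes = np.array([2, 2, 3, 8, 9, 9, 10])   # bimodal crowd
expert_votes = np.array([6, 7, 7, 8])

def satisfaction(votes, decision):
    # Group satisfaction as negative mean absolute distance to the decision.
    return -np.mean(np.abs(votes - decision))

candidates = np.arange(1, 11)

# Disagreement point: the worst satisfaction each side could receive.
d_crowd = min(satisfaction(crowd_votes, c) for c in candidates)
d_expert = min(satisfaction(expert_votes, c) for c in candidates)

def nash_product(c):
    return (satisfaction(crowd_votes, c) - d_crowd) * \
           (satisfaction(expert_votes, c) - d_expert)

best = max(candidates, key=nash_product)
print("Nash bargaining decision:", best)
print("Median of pooled votes:", np.median(np.concatenate([crowd_votes, expert_votes])))
```

The Kalai-Smorodinsky variant would instead pick the decision that equalizes the two sides' relative gains toward their ideal points.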

With the accumulation of large amounts of health-related data, predictive analytics could stimulate the transformation of reactive medicine towards Predictive, Preventive and Personalized Medicine (PPPM), ultimately affecting both the cost and quality of care. However, the high dimensionality and complexity of the data involved prevent data-driven methods from being easily translated into clinically relevant models. Additionally, the application of cutting-edge predictive methods and data manipulation requires substantial programming skills, limiting their direct exploitation by medical domain experts. This leaves a gap between potential and actual data usage. In this study, we address this problem by focusing on open, visual environments suited to be applied by the medical community. Moreover, we review code-free applications of big data technologies. As a showcase, a framework was developed for the meaningful use of data from critical care patients by integrating the MIMIC-II database...

Modern medical research and clinical practice are more dependent than ever on multi-factorial data sets originating from various sources, such as medical imaging, DNA analysis, patient health records and contextual factors. This data drives research, facilitates correct diagnoses and ultimately helps to develop and select the appropriate treatments. The volume and impact of this data have increased tremendously through technological developments such as high-throughput genomics and high-resolution medical imaging techniques. Additionally, the availability and popularity of different wearable health care devices have allowed the collection and monitoring of fine-grained personal health care data. The fusion and combination of these heterogeneous data sources have already led to many breakthroughs in health research and show high potential for the development of methods that will push current reactive practices towards predictive, personalized and preventive health care. This potential...

Recent Applications in Data Clustering, 2018
In this chapter, we propose a methodology for behavior variation and anomaly detection from acquired sensory data, based on temporal clustering models. Data are collected from five prominent European smart cities and Singapore, which aim to become fully "elderly-friendly" through the development and deployment of ubiquitous systems for assessing and predicting early risks of Mild Cognitive Impairment (MCI) and frailty in the elderly, and for supporting the generation and delivery of optimal personalized preventive interventions that mitigate those risks, utilizing smart city datasets and IoT infrastructure. Low-level data collected from IoT devices are preprocessed into sequences of activities, with temporal and causal variations in the sequences classified as normal or anomalous behavior. The goals of the proposed methodology are to (1) recognize significant behavioral variation patterns and (2) support early identification of pattern changes. Temporal clustering models are applied to the detection and prediction of the following variation types: intra-activity (single activity, single citizen) and inter-activity (multiple activities, single citizen). Identified behavioral variations and anomalies are further mapped to MCI/frailty onset behavior and risk factors, following the developed geriatric expert model.
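
As a rough illustration of the temporal clustering step, the sketch below clusters daily activity sequences by their activity-transition counts and scores each day by its distance to the nearest cluster centroid; the activity encoding and data are invented for the example, not taken from the chapter:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy daily activity sequences (activity IDs are illustrative; real input
# would come from preprocessed IoT sensor events).
ACTIVITIES = 4  # e.g. sleep, meal, outing, hygiene
days = [
    [0, 1, 2, 1, 3, 0], [0, 1, 2, 1, 3, 0], [0, 1, 1, 1, 3, 0],
    [0, 3, 3, 3, 3, 0],  # unusual day
]

def transition_features(seq, n=ACTIVITIES):
    # Flatten the activity-transition count matrix into a feature vector.
    m = np.zeros((n, n))
    for a, b in zip(seq, seq[1:]):
        m[a, b] += 1
    return m.ravel()

X = np.array([transition_features(d) for d in days])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
print("anomaly score per day:", dist.round(2))  # large = behavioral variation
```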

Decision Support Systems X: Cognitive Decision Support Systems and Technologies, 2020
This paper deals with the role of experts and crowds in solving important societal issues. The authors argue that both experts and crowds are important stakeholders in collective decision making and should jointly participate in the decision-making process to improve it. Crowds and experts are usually studied in different research areas, and few models integrate them in a joint model. The authors give an overview of the advantages and disadvantages of crowd and expert decision making and highlight possibilities to connect these two worlds. They position the research in the area of Computational Social Choice (COMSOC) and crowd voting, emerging fields that bring great potential for collective decision making. COMSOC focuses on improving social welfare and the quality of products and services through the inclusion of the community or clients in the decision-making process. Despite these altruistic goals, there are several shortcomings that call for the engagement of experts in voting procedures. The authors propose a simple participatory model for weighting and selecting voters and votes through the integration of expert rankings into crowd voting systems.
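
A minimal sketch of such a participatory weighting scheme, under the assumption (ours, not the paper's) that voter weights are derived from Kendall tau agreement with an expert ranking:

```python
import numpy as np
from scipy.stats import kendalltau

# Illustrative data: each crowd voter ranks 5 candidate solutions
# (rank 1 = best). Values below are invented for the sketch.
expert_ranking = np.array([1, 2, 3, 4, 5])
crowd_rankings = np.array([
    [1, 2, 3, 5, 4],
    [2, 1, 3, 4, 5],
    [5, 4, 3, 2, 1],   # voter who disagrees with the expert panel
])

# Weight each voter by agreement with the expert ranking (clipped at 0).
weights = np.array([max(kendalltau(r, expert_ranking)[0], 0.0)
                    for r in crowd_rankings])

# Weighted Borda-style aggregation: lower weighted mean rank wins.
scores = np.average(crowd_rankings, axis=0, weights=weights)
print("aggregated ranking (best first):", np.argsort(scores) + 1)
```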

Proceedings of the 8th International Conference on Web Intelligence, Mining and Semantics, 2018
Traditionally, machine learning extracts knowledge solely from data. However, a huge volume of knowledge is available in other sources and can be included in machine learning models; still, domain knowledge is rarely used in machine learning. We propose a framework that integrates domain knowledge in the form of hierarchies into a machine learning model, namely logistic regression. The hierarchies are integrated using stacking (stacked generalization). We show that the proposed framework yields better results than a standard logistic regression model. The framework is tested on the binary classification problem of predicting 30-day hospital readmission. Results suggest that the proposed framework improves AUC (area under the curve) over domain-knowledge-unaware logistic regression models by 9% on average.
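
A small sketch of the stacking idea on assumed data: per-concept logistic regressions produce out-of-fold probabilities that serve as higher-level features for a meta-level logistic regression. The two-group hierarchy is an illustrative stand-in for a real code hierarchy such as ICD-9-CM:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)

# Toy binary indicator features for leaf diagnosis codes.
X = rng.integers(0, 2, size=(200, 6)).astype(float)
y = (X[:, 0] + X[:, 3] + rng.normal(0, 0.5, 200) > 1).astype(int)
hierarchy = {"groupA": [0, 1, 2], "groupB": [3, 4, 5]}  # invented hierarchy

# Level 1: one logistic regression per hierarchy concept; out-of-fold
# predictions become higher-level "concept" features (stacked generalization).
meta_features = np.column_stack([
    cross_val_predict(LogisticRegression(), X[:, cols], y,
                      cv=5, method="predict_proba")[:, 1]
    for cols in hierarchy.values()
])

# Level 2: meta-model over the concept scores.
meta = LogisticRegression().fit(meta_features, y)
print("stacked model train accuracy:", meta.score(meta_features, y).round(3))
```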

ADBIS, TPDL and EDA 2020 Common Workshops and Doctoral Consortium, 2020
Modern medical research and clinical practice are more dependent than ever on multi-factorial data sets originating from various sources, such as medical imaging, DNA analysis, patient health records and contextual factors. This data drives research, facilitates correct diagnoses and ultimately helps to develop and select the appropriate treatments. The volume and impact of this data have increased tremendously through technological developments such as high-throughput genomics and high-resolution medical imaging techniques. Additionally, the availability and popularity of different wearable health care devices have allowed the collection and monitoring of fine-grained personal health care data. The fusion and combination of these heterogeneous data sources have already led to many breakthroughs in health research and show high potential for the development of methods that will push current reactive practices towards predictive, personalized and preventive health care. This potential...

Gaussian Conditional Random Fields (GCRF) are a type of structured regression model that incorporates multiple predictors and multiple graphs. This is achieved by defining quadratic feature functions in Gaussian canonical form, which makes the conditional log-likelihood function convex and hence allows the optimal parameters to be found by learning from data. In this work, the parameter space of the GCRF model is extended to facilitate joint modelling of positive and negative influences. This is achieved by restricting the model to a single graph and formulating linear bounds on convexity with respect to the model's parameters. In addition, our single-network formulation allows gradients to be calculated much faster than in alternative implementations. Lastly, we extend the model one step further and incorporate a bias term into the link weight; this bias is solved for as part of the convex optimization. Benefits of the proposed model in terms of improved accuracy and speed are...
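
For the single-graph case, the standard GCRF conditional is Gaussian, so prediction reduces to a linear solve. The sketch below uses the commonly published form with one unstructured predictor and one similarity graph; parameter values are illustrative, and the paper's exact convexity bounds are not reproduced here:

```python
import numpy as np

# Minimal GCRF sketch (standard canonical form, single graph).
R = np.array([3.0, 2.5, 4.0, 1.0])          # unstructured predictor outputs
S = np.array([[0, 1, 1, 0],                 # similarity graph (adjacency)
              [1, 0, 0, 1],
              [1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)
alpha, beta = 1.0, 0.4                       # beta < 0 would model negative
                                             # influence, as long as Q stays
                                             # positive definite
L = np.diag(S.sum(axis=1)) - S               # graph Laplacian
Q = 2 * (alpha * np.eye(len(R)) + beta * L)  # precision matrix
b = 2 * alpha * R
mu = np.linalg.solve(Q, b)                   # posterior mean = prediction
print("GCRF predictions:", mu.round(3))
```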

Data Mining, 2018
We propose a method for quantitative analysis of the predictive power of laboratory tests and early detection of mortality risk, using predictive models and feature selection techniques. Our method allows automatic feature selection, model selection, and evaluation of predictive models. Experimental evaluation was conducted on patients with renal failure admitted to ICUs (medical intensive care, surgical intensive care, cardiac, and cardiac surgery recovery units) at Boston's Beth Israel Deaconess Medical Center. Data were extracted from the Multiparameter Intelligent Monitoring in Intensive Care III (MIMIC-III) database. We built and evaluated different single (e.g., logistic regression) and ensemble (e.g., random forest) learning methods. Results revealed high predictive accuracy (area under the precision-recall curve (AUPRC) values >86%) from day four, with acceptable results on the second (>81%) and third day (>85%). Random forests seem to provide the best predictive accuracy. The feature selection techniques Gini and ReliefF scored best in most cases. Lactate, white blood cells, sodium, anion gap, chloride, bicarbonate, creatinine, urea nitrogen, potassium, glucose, INR, hemoglobin, phosphate, total bilirubin, and base excess were most predictive of hospital mortality. Ensemble learning methods are able to predict hospital mortality with high accuracy based on laboratory tests, and provide a ranking of their predictive priority.
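
A scaled-down sketch of such a pipeline on synthetic data (ReliefF is not available in scikit-learn, so mutual information stands in as the selection criterion here, and the synthetic features stand in for day-k laboratory values):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.metrics import average_precision_score
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline

# Synthetic, imbalanced stand-in for laboratory-test features.
X, y = make_classification(n_samples=500, n_features=30, n_informative=8,
                           weights=[0.8], random_state=0)

pipe = make_pipeline(
    SelectKBest(mutual_info_classif, k=15),           # feature selection
    RandomForestClassifier(n_estimators=200, random_state=0),
)
proba = cross_val_predict(pipe, X, y, cv=5, method="predict_proba")[:, 1]
print("AUPRC:", round(average_precision_score(y, proba), 3))
```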

In health care predictive analytics, limited data is often a major obstacle to developing highly accurate predictive models. The lack of data is related to various factors: minimal data availability (as in rare diseases), the cost of data collection, and privacy regulations concerning patient data. In order to enable data enrichment within and between hospitals while preserving privacy, we propose a system for data enrichment that adds a randomization component on top of existing anonymization techniques. To prevent the information loss (including loss of the algorithm's predictive accuracy) associated with randomization, we propose a technique for data generation that exploits fused domain knowledge and available data-driven techniques as complementary information sources. Such fusion allows the generation of additional examples through controlled randomization and increases the privacy protection of personally sensitive information when data are shared between sites. The initial e...
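
A toy illustration of the controlled-randomization idea, where a plausible-range table stands in for fused domain knowledge (all values and column choices are invented for the sketch):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy patient records (columns: age, systolic BP, creatinine).
X = np.array([[64, 130, 1.1], [71, 145, 2.3], [58, 120, 0.9]])
plausible_range = np.array([[40, 90], [90, 200], [0.5, 8.0]])

def randomize(X, scale=0.05, copies=3):
    """Generate extra examples by controlled noise, clipped to
    domain-knowledge ranges, so shared data reveals less about
    any individual record."""
    span = plausible_range[:, 1] - plausible_range[:, 0]
    out = []
    for _ in range(copies):
        noisy = X + rng.normal(0, scale * span, size=X.shape)
        out.append(np.clip(noisy, plausible_range[:, 0], plausible_range[:, 1]))
    return np.vstack(out)

print(randomize(X).round(1))
```
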
In this paper, we propose a methodology for behavior variation and anomaly detection from acquired sensory data, based on temporal clustering models. Data are collected from smart cities that aim to become fully "elderly-friendly" through the development and deployment of ubiquitous systems for assessing and predicting early risks of Mild Cognitive Impairment (MCI) and frailty in the elderly. Our results show that Hidden Markov Models (HMMs) allow efficient (1) recognition of significant behavioral variation patterns and (2) early identification of pattern changes.
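
A minimal sketch of HMM-based scoring, assuming the third-party hmmlearn package and invented activity encodings; days with low per-observation log-likelihood under a model fitted on normal days are flagged as anomalous:

```python
import numpy as np
from hmmlearn import hmm  # third-party package; an assumption of this sketch

# Invented activity sequences encoded as integer observations.
normal_days = [[0, 1, 2, 1, 3, 0], [0, 1, 1, 2, 3, 0], [0, 2, 1, 1, 3, 0]] * 7
X = np.concatenate(normal_days).reshape(-1, 1)
lengths = [len(d) for d in normal_days]

model = hmm.CategoricalHMM(n_components=3, random_state=0, n_iter=50)
model.fit(X, lengths)

def day_score(day):
    # Per-observation log-likelihood; low values flag anomalous behavior.
    seq = np.asarray(day).reshape(-1, 1)
    return model.score(seq) / len(day)

print("typical day:", round(day_score([0, 1, 2, 1, 3, 0]), 2))
print("unusual day:", round(day_score([0, 3, 3, 3, 3, 0]), 2))
```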

Analyzing human trajectories based on sensor data is a challenging research topic. It has been approached from many angles, such as clustering and process mining. Still, less attention has been paid to analyzing this data in terms of the hidden factors that drive people's behavior. We therefore adapt the standard matrix factorization approach and reveal factors that are interpretable and soundly explain the behavior of a dynamic population. We analyze the motion of a skier population based on RFID records of skiers entering ski lift gates. The approach is applicable to other similar settings, such as shopping malls or road traffic. We further applied recommender system algorithms to test how well the distribution of ski lift usage (number of ski lift visits) can be predicted from the hidden factors, and compared them with benchmark algorithms. The matrix factorization algorithm proved to be the best recommender score predictor, with an RMSE of 2.569 ± 0.049 and an ...
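
A small sketch of the factorization step on an invented visit-count matrix, using scikit-learn's NMF in place of whatever exact factorization the paper uses; the factor-to-lift profiles in H are what would be inspected for interpretability:

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Toy skier x ski-lift visit-count matrix (real data would come from
# RFID gate records).
V = rng.poisson(3, size=(50, 8)).astype(float)

nmf = NMF(n_components=3, init="nndsvda", max_iter=500, random_state=0)
W = nmf.fit_transform(V)      # skier loadings on hidden behavior factors
H = nmf.components_           # factor-to-lift profiles (interpretable)
rmse = mean_squared_error(V, W @ H) ** 0.5
print("reconstruction RMSE:", round(rmse, 3))
```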

2018 26th International Conference on Software, Telecommunications and Computer Networks (SoftCOM), 2018
The global human population is aging rapidly; however, living longer does not necessarily mean living a healthy, active and independent life. Emerging disruptive technologies such as the Internet of Things (IoT) are proving instrumental in addressing this prominent societal challenge. Urban IoT infrastructures, designed to support the Smart City vision, enable the capture of personal data for analyzing the behaviour of elderly people. Activities within the Horizon 2020 City4Age project aim to show that behavioural analysis can help detect and mitigate the risks of Mild Cognitive Impairment (MCI) and frailty problems in the elderly. This paper presents the latest developments in extending the configurability and flexibility of the comprehensive City4Age computational model for risk detection. The proposed model extensions have demonstrated seamless adaptation to the specific characteristics of various urban contexts, as well as seamless "pluggable" integration of various evolving, extendable, parameterized algorithm implementations and methods for behaviour variation and risk recognition, based on relevant statistical and machine learning techniques.

Journal of Chromatography A, 2020
In micellar liquid chromatography (MLC), the addition of a surfactant to the mobile phase in excess is accompanied by an alteration of its solubilising capacity and a change in the stationary phase's properties. As a consequence, predicting the analytes' retention in MLC mode becomes a challenging task. Mixed Quantitative Structure-Retention Relationship (QSRR) modelling represents a powerful tool for estimating the analytes' retention. This study compares 48 successfully developed mixed QSRR models with respect to their ability to predict the retention of aripiprazole and its five impurities from molecular structures and factors that describe the Brij-acetonitrile system. The development of the models was based on automatically combining six attribute (feature) selection methods with eight predictive algorithms and optimizing hyper-parameters. The feature selection methods included Principal Component Analysis (PCA), Non-negative Matrix Factorization (NMF), ReliefF, Multiple Linear Regression (MLR), Mutual Info and F-Regression. The series of investigated predictive algorithms comprised Linear Regression (LR), Ridge Regression, Lasso Regression, Artificial Neural Networks (ANN), Support Vector Regression (SVR), Random Forest (RF), Gradient Boosted Trees (GBT) and k-Nearest Neighbours (k-NN). A sufficient amount of data for building the models (78 cases in total) was provided by conducting 13 experiments for each of the six analytes and collecting the target responses afterwards. Different experimental settings were established by varying the concentration of Brij L23, the pH of the aqueous phase and the acetonitrile content in the mobile phase according to a Box-Behnken design. In addition to the chromatographic parameters, the pool of independent variables was expanded with 27 molecular descriptors from all major groups (physicochemical, quantum chemical, topological and spatial structural descriptors). The best model was chosen by considering the Root Mean Square Error (RMSE) and cross-validation (CV) correlation coefficient (Q2) values. Interestingly, the comparative analysis indicated that a change in the set of input variables had a minor impact on the performance of the final models. On the other hand, different regression algorithms showed great diversity in their ability to learn patterns conserved in the data. In this regard, testing many regression algorithms is necessary to find the most suitable technique for model building. In this specific case, GBT-based models demonstrated the best ability to predict the retention factor in MLC mode. Steric factors and dipole-dipole interactions proved to be relevant to the observed retention behaviour. This study, although of a smaller scale, is a promising starting point for comprehensive MLC retention prediction.
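
A scaled-down sketch of the combinatorial model search on synthetic data (3 selectors x 3 regressors instead of the paper's 6 x 8; Q2 is approximated here by cross-validated R^2, and the synthetic table stands in for the 78-case descriptor data):

```python
from itertools import product

from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.feature_selection import (SelectKBest, f_regression,
                                       mutual_info_regression)
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the descriptors -> retention-factor table.
X, y = make_regression(n_samples=78, n_features=30, n_informative=10,
                       noise=5.0, random_state=0)

selectors = {"PCA": PCA(n_components=8),
             "F-Regression": SelectKBest(f_regression, k=8),
             "MutualInfo": SelectKBest(mutual_info_regression, k=8)}
regressors = {"Ridge": Ridge(),
              "RF": RandomForestRegressor(random_state=0),
              "GBT": GradientBoostingRegressor(random_state=0)}

# Exhaustively cross-validate every selector/regressor combination.
for (sname, sel), (rname, reg) in product(selectors.items(), regressors.items()):
    pipe = make_pipeline(sel, reg)
    q2 = cross_val_score(pipe, X, y, cv=5, scoring="r2").mean()
    print(f"{sname:>12} + {rname:<5} CV R^2 = {q2:.3f}")
```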

International Journal on Artificial Intelligence Tools, 2019
It is commonly understood that machine learning algorithms discover and extract knowledge from the data at hand. However, a huge amount of knowledge is available in machine-readable form, ready for inclusion in machine learning algorithms and models. In this paper, we propose a framework that integrates domain knowledge in the form of ontologies/hierarchies into logistic regression using stacked generalization. Namely, relations from the ontology/hierarchy are used in a stacking manner to obtain higher, more abstract concepts, which are then used for prediction. The problem we solve is unplanned 30-day hospital readmission, considered one of the major problems in healthcare. The proposed framework yields better results than Ridge, Lasso, and Tree Lasso logistic regression. Results suggest that the proposed framework improves AUC by up to 9.5% on pediatric datasets and up to 4% on morbidly obese patients' datasets, and also improves AUPRC b...

Scientific Reports, Jan 12, 2018
Intrinsically disordered proteins (IDPs) are characterized by the lack of a fixed tertiary structure and are involved in the regulation of key biological processes via binding to multiple protein partners. IDPs are malleable, adapting to structurally different partners, and this flexibility stems from features encoded in the primary structure. The assumption that universal sequence information will facilitate coverage of the sparse zones of the human interactome motivated us to explore the possibility of predicting protein-protein interactions (PPIs) that involve IDPs based on sequence characteristics. We developed a method that relies on features of interacting and non-interacting protein pairs and utilizes machine learning to classify and predict IDP PPIs. Consideration of both sequence determinants specific to conformational organizations and the multiplicity of IDP interactions in the training phase ensured a reliable approach that is superior to current state-of-the-art methods...
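
An illustrative sketch of sequence-based pair classification, using plain amino-acid composition as the pair descriptor (the paper's actual feature set is richer; the sequences and labels below are invented):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

AA = "ACDEFGHIKLMNPQRSTVWY"

def composition(seq):
    # 20-dim amino-acid composition vector for one sequence.
    seq = seq.upper()
    return np.array([seq.count(a) / len(seq) for a in AA])

def pair_features(seq_a, seq_b):
    # Concatenate per-protein compositions into one pair descriptor.
    return np.concatenate([composition(seq_a), composition(seq_b)])

# Toy interacting (1) / non-interacting (0) pairs.
pairs = [("MKQLED", "AAKDEE", 1), ("MKQLED", "WWFYLL", 0),
         ("SPQEEK", "DDKEEA", 1), ("SPQEEK", "FFWLYV", 0)]
X = np.array([pair_features(a, b) for a, b, _ in pairs])
y = np.array([label for _, _, label in pairs])

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print("P(interaction):", clf.predict_proba(X)[:, 1].round(2))
```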
Annals of Translational Medicine, 2018
Artificial Intelligence in Medicine, 2016
Highlights
• Integration of domain knowledge (in the form of the ICD-9-CM hierarchical nomenclature of diseases) with a learning algorithm (Tree-Lasso logistic regression) increased the interpretability of predictive models without significantly affecting predictive performance.
• A quantitative analysis of interpretability is given, based on the information loss caused by dimensionality reduction.
• The method is evaluated and analysed for hospital readmission prediction on SID pediatric patient data from California.
• The resulting models are interpreted for the general pediatric population as well as several important subpopulations, and the interpretations comply with existing medical understanding of pediatric readmission.