Prediction of metabolic ageing in higher education staff using machine learning: A pilot study

Detecting individuals with obesity or overweight helps to anticipate health risks such as premature death, disability and chronic disease. This study describes a pilot conducted among the staff of a higher education institution in the city of Matehuala, Mexico. It involved processing anthropometric measurements, health indicators and the results of bioelectrical impedance analysis using machine learning techniques. The goal was to identify the metabolic aging of individuals. The recorded data were used to create a database that was subsequently employed in four different classification models: decision tree, random forest, artificial neural networks and adaptive boosting. Additionally, statistical techniques were used to determine variable importance scores: the Pearson, Chi² and ANOVA tests, the recursive feature elimination method and the variance inflation factor. The variable importance score was employed to identify the features that were most consistently repeated across methods. This analysis concluded that both anthropometric measurements and the results of bioelectrical impedance analysis provide valuable references for identifying obesity and overweight in individuals. The anthropometric measurements and health indicators with the greatest impact on the models' predictions were waist-to-height ratio, hip and arm circumferences, body mass index, systolic and diastolic blood pressure and heart rate. Body fat and muscle mass also contributed significantly.


INTRODUCTION
In 2022, the World Health Organization (WHO) reported that more than 1 billion people in the world have obesity, of whom 650 million are adults. In Mexico, obesity is a significant concern. The 2020 National Health and Nutrition Survey on Covid-19 found that 76% of adult women are overweight or obese, compared to 72.1% of men (Shamah-Levy et al., 2021). Mexican adults aged between 29 and 69 engage in approximately 300 min per week of moderate to vigorous physical activity. However, about 29% do not meet the minimal recommendation of 150 min per week (Medina et al., 2013).
On the other hand, nearly 50% of Mexican adults have metabolic syndrome due to sedentary behaviors, physical inactivity, unhealthy dietary habits and poor sleep patterns (Macias et al., 2021). A study on the sociodemographic and anthropometric characteristics of adults aged 20 to 69 years in Mexico City revealed that the prevalence of participants classified in the highest sitting time category (≥ 420 min/day) increased by 8% over nine years. This increase was linked to a rise of 5.4% in overweight/obesity and a 1.3% increase in the diagnosis of diabetes (Medina et al., 2017).
In higher education institutions, staff work activities are predominantly centered in offices, inducing sedentary behaviors. Given this, monitoring the health condition of the staff is crucial for detecting potential health risks that could contribute to the development of chronic diseases.
Previous studies focused on higher education staff have shown that poor nutritional habits and lack of physical activity promote the prevalence of overweight/obesity. Consequently, it is important to implement strategies aimed at reducing obesity and promoting well-being among the teaching population (Rodriguez-Guzman, 2006; Freedman et al., 2010; He et al., 2014; Rodrigues-Rodrigues et al., 2018).
According to the WHO, obesity and overweight can be identified from the anthropometric measurements of an individual. For adults, body mass index (BMI) and waist circumference (WA) serve as reliable indicators to discern obesity and overweight. Anthropometric measurements encompass a variety of body metrics, including weight, height, standing length, skin folds, circumferences (head, waist, hip, etc.), length of limbs and widths (shoulder, wrist, etc.). The Official Mexican Standard (NOM) defines the parameters and anthropometric criteria considered to determine abdominal obesity within the Mexican population (Shamah-Levy et al., 2017). Table 1 shows the BMI and WA values used for classifying obesity in Mexican adults.
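The classification criteria can be illustrated with a short Python sketch. It assumes the WHO BMI ranges and the NOM waist thresholds (male ≥ 90 cm, female ≥ 80 cm) reported with Table 1; the function names and the example subject are hypothetical, and range boundaries are treated as half-open intervals for simplicity.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index in kg/m^2: weight divided by height squared."""
    return weight_kg / height_m ** 2


def who_bmi_class(bmi_value: float) -> str:
    """Map a BMI value to a WHO category (boundaries assumed half-open)."""
    if bmi_value < 18.5:
        return "underweight"
    if bmi_value < 25.0:
        return "normal"
    if bmi_value < 30.0:
        return "overweight"
    if bmi_value < 35.0:
        return "obese class I"
    if bmi_value < 40.0:
        return "obese class II"
    return "obese class III"


def abdominal_obesity_nom(waist_cm: float, sex: str) -> bool:
    """Abdominal obesity by waist circumference: male >= 90 cm, female >= 80 cm."""
    threshold = 90.0 if sex == "male" else 80.0
    return waist_cm >= threshold


# Hypothetical subject, not a record from the study database.
example_bmi = bmi(70.0, 1.65)
example_class = who_bmi_class(example_bmi)
example_abdominal = abdominal_obesity_nom(92.0, "male")
```

The low-height adjustment of the NOM (a lower BMI threshold for short adults) is left out of this sketch.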
In this study, we aimed to determine the prevalence of obesity and overweight among the staff of a higher education institution situated in the city of Matehuala, Mexico. The assessment was based on BMI and WA measurements. Considering WA, our observations indicate that 72% of males and 60.5% of females among the staff members are classified as abdominally obese. Evaluating the BMI of the observed group, we found that 60% of males and 31.5% of females are categorized as overweight, while 16% of males and 5.2% of females fall under the classification of obesity. Among those identified as obese, 12% of males and 5.2% of females belong to the obese class I category, and 4% of males are in the obese class II category, while no females fall within this category. None of the individuals belong to the obese class III category. When considering the entire monitored staff as a collective, 65% are abdominally obese, 42.8% are overweight, and 9.5% are obese. In total, 52.3% of the observed group are categorized as overweight or obese.
While the use of BMI, WA and waist-to-height ratio (WHtR) for predicting mortality remains effective, an alternative approach involves analyzing the outcomes of bioelectrical impedance analysis (BIA). This method has been suggested for monitoring and tracking the health status of individuals, including those with chronic conditions such as obesity (Ricciardi and Talbot, 2007; Heydari et al., 2011; de-Mateo-Silleras et al., 2019; Aldobali et al., 2022). BIA results have previously been used to estimate body fat percentage (BF) and relate it to the risk of disease or mortality (Böhm and Heitmann, 2013). In 2021, the significance of the association between fat, visceral fat (VF) and muscle mass (MM) obtained through BIA in identifying metabolic syndrome as a health concern was recognized (Pouragha et al., 2021).
Machine learning (ML) is the science of programming computers so that they can learn from the information that has been provided to them (Geron, 2019). There are multiple ML techniques that can be used to build projects related to healthcare, with the aim of improving medical diagnosis or assisting health staff in the process of identifying a patient's condition (Sprogar et al., 2001; Javaid et al., 2022; Manickam et al., 2022; Payal et al., 2022). Common ML techniques used for these purposes are decision trees (DT), Logistic Regression (LR) and Support Vector Machine (SVM). Classification algorithms are effective in predicting syndromes related to the prevalence of overweight and obesity (Chatterjee et al., 2020; Gutierrez-Esparza et al., 2020; Safaei et al., 2021; Crowson et al., 2022; Dhabarde et al., 2022; Strzelecki and Badura, 2022).
The classification process uses the features that have the biggest impact on the prediction of the objective. Feature selection is especially relevant in classification problems with few samples (Mohd and Awang, 2021). Archer and Kimes (2008) evaluated the effectiveness of using the variable importance score in the Random Forest (RF) technique, concluding that this methodology is applicable in classification problems when the objective is to produce an accurate classifier. Also using RF, Chen et al. (2020) reduced the number of features based on the identification of the variable importance measures (VIMs); the authors evaluated and compared the accuracy of specific RF, SVM, K-Nearest Neighbors (KNN) and Linear Discriminant Analysis (LDA) classification models. Additionally, in Gregorutti et al. (2017) and Senan et al. (2021), Recursive Feature Elimination (RFE) is used to identify VIMs. Misra and Singh Yadav (2020) suggest that a less complex algorithm can improve the accuracy of the classification, so they propose a method that analyzes each of the features and registers its importance with a predictor variable. This supports the statement that using several methods to obtain VIMs in a classification problem offers more reliability and consistency in the classification of the objective (Kiang, 2003; Nithya and Ilango, 2019). The selection of features has helped to obtain important results in ML biomedical applications. For example, Gutierrez-Esparza et al. (2020) used VIMs in the prediction of metabolic syndrome in a Mexican population. McLaren et al. (2019) used VIMs to predict malignant lesions in the breast with magnetic resonance imaging as features. Also, Ganggayah et al. (2019) used VIMs to identify the factors that predict the survival of patients with breast cancer.
Wilson et al. (2012) and Sparling et al. (2007) state that the factors contributing to overweight/obesity are diverse and require a comprehensive approach that takes into account environmental and cultural influences. They also emphasize the significance of early intervention in effectively reducing rates of overweight and obesity.
The problem addressed in this paper centers on the high prevalence of obesity and overweight. In Mexico, both the rate of obesity and the prevalence of metabolic syndrome are alarming, potentially leading to long-term health conditions. The sedentary lifestyle, unhealthy dietary habits and poor sleep patterns among Mexican adults contribute to the high rates of obesity and metabolic syndrome. The problem is worsened by the increasing prevalence of overweight/obesity among higher education staff, who are primarily engaged in sedentary activities.
The motivation behind this study is based on the need to reduce the prevalence of obesity and metabolic syndrome among higher education staff in Mexico. By utilizing ML techniques, the paper aims to contribute to the development of effective strategies for identifying health risks and promoting well-being. The study's focus on higher education staff underlines the importance of creating interventions tailored to specific work environments to mitigate the adverse health effects associated with sedentary behaviors.
In the present study we use data obtained in a health condition monitoring initiative involving the staff of a higher education institution situated in Matehuala, Mexico. The aim is to identify health risks through the application of ML techniques. The features include individual records comprising anthropometric measurements, glucose levels, and results obtained from BIA. Python and Scikit-Learn were used to implement four ML classification algorithms and four statistical techniques, which were used to compute the VIMs of the features when predicting an individual's risk of obesity from the body age or metabolic age.

MATERIALS AND METHODS
In biomedical applications based on supervised learning, medical data are used to train the algorithm in accordance with its relation to the target. In the present pilot study, a database was created with 63 records, identifying anthropometric measurements, glucose levels and BIA results as features (Fig. 1(a)). Fig. 1(b) depicts the block diagram representing the process flow: during the Extract Transform Load (ETL) phase, data are retrieved from the database and cleaned by replacing missing values with the computed mean. Additionally, feature scaling is performed at this stage. In the Exploratory Data Analysis (EDA) phase, statistical analyses are conducted using the univariate methods (Pearson, Chi² and ANOVA), as well as the RFE and variance inflation factor (VIF) methods. Within the ML model block, the following classifiers are implemented: DT, RF, artificial neural networks (ANN) and adaptive boosting (AdaBoost), aiming to obtain VIMs through Shapley additive explanations (SHAP) values. For quality assessment of the classifiers, the F1 score, the Area Under the Receiver Operating Characteristic curve (AUC-ROC), and the confusion matrix were utilized as metrics.
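As an illustration of the ETL step, mean imputation followed by feature scaling could be sketched with Scikit-Learn as follows. The array values are invented for the example, and the choice of `StandardScaler` is an assumption, since the specific scaling method is not stated.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Invented records (weight in kg, height in m, waist in cm) with missing values.
X = np.array([
    [70.0, 1.65, 88.0],
    [82.0, np.nan, 95.0],
    [np.nan, 1.58, 76.0],
    [64.0, 1.72, 80.0],
])

# Replace each missing value with its column mean, then standardize each
# column to zero mean and unit variance (assumed scaler).
etl = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler())
X_clean = etl.fit_transform(X)

# Column means that were used to fill the gaps.
column_means = etl.named_steps["simpleimputer"].statistics_
```

In a train/test setting, the pipeline would normally be fitted on the training subset only and then applied to the test subset, so that test records do not influence the imputed means.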
The recorded data include the following anthropometric measurements: age (AG), weight (WE), height (HE), BMI, WA, WHtR, arm circumference (AR), hip circumference (HP), systolic blood pressure (SBP), diastolic blood pressure (DBP), and heart rate (HR). Additionally, the health indicator includes glucose (DX), and the following functional fitness parameters: MM, VF, body fat (BF) and body age. Ageing (AGG) is defined as the ratio body age/age.
Weight, BMI, MM, MA, VF, BF and body age were derived from the BIA results obtained using an Omron HBF-514C body monitor. This device sends electrical currents through the hands via electrodes that the individual holds with both hands and through the feet via electrodes placed on the scale's surface. This combination allows for an analysis of both the upper and lower body (Pribyl et al., 2011). Participants were instructed not to exercise and to fast on the test day, including refraining from coffee. Blood pressure (BP) was measured using an inflatable cuff with a gauge around the arm, providing measurements in millimeters of mercury (mmHg) for DBP and SBP. Waist (WA), hip (HP) and arm (AR) circumferences were measured using a tape measure in centimeters. Heart rate (HR) or pulse was measured at the wrist on the radial artery in beats per minute.
The WHtR is calculated by dividing the waist circumference by the height, both in centimeters. Glucose (DX) measurements were taken using a blood sugar meter, with blood samples collected from fingertip pricks, reported in millimoles per liter (mmol/L). The status of "aged" was used as the objective or label, determined from the ratio of body age to the subject's real age. Specifically, if AGG > 1.0, the subject is considered "aged." Table 2 displays the anthropometric measurements, blood glucose levels, and functional physical fitness indices obtained by BIA for the staff members. The data are presented with average values, standard deviations, as well as maximum and minimum values for each characteristic. On average, the staff members are 40 years old, with an average weight of 70 kg and height of 1.65 meters. According to Table 1, the staff is classified as overweight with an average BMI of 25.58 kg/m² (> 25), and they exhibit normal glucose levels (< 99) and normal blood pressure (109/73). The BIA results gave an AGG of 1.13, indicating that the staff members are "aged" with a BF percentage of 32.9%.
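The derived feature and the label can be written out in a few lines of Python. This is a minimal sketch that takes AGG as body age divided by chronological age, the reading consistent with the AGG > 1.0 rule; the function names and the example values are hypothetical.

```python
def waist_to_height_ratio(waist_cm: float, height_cm: float) -> float:
    """WHtR: waist circumference divided by height, both in centimeters."""
    return waist_cm / height_cm


def ageing_ratio(body_age: float, age: float) -> float:
    """AGG: body age reported by the BIA monitor divided by chronological age."""
    return body_age / age


def is_aged(body_age: float, age: float) -> bool:
    """Label used as the classification target: 'aged' when AGG > 1.0."""
    return ageing_ratio(body_age, age) > 1.0


# Hypothetical subject, not a record from the study database.
whtr = waist_to_height_ratio(88.0, 165.0)
agg = ageing_ratio(45.0, 40.0)
label = is_aged(45.0, 40.0)
```

For this hypothetical subject the body age exceeds the chronological age, so AGG is 1.125 and the record is labeled "aged".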

Machine Learning
One of the goals of applying ML techniques to large datasets is to discover patterns among features. As listed above, some of the most important algorithms used in supervised learning are SVM, DT and RF. SVM and DT can be used for classification and regression tasks on complex datasets, while the RF algorithm is built from many individual DTs. A DT learns the best way to divide the training dataset into ever smaller subsets until reaching the target prediction. In RF, the predictions from all the trees are used to make the final prediction of the target.
In ML algorithms, the relative importance of each feature is scored after training the algorithm. This is helpful for understanding which characteristics are more important when a selection of features is required, in addition "to discovering complex relationships between predictors corresponding to interaction terms". In these algorithms, variable importance can be measured by observing the decrease in model accuracy if the values of a variable are randomly permuted (Peter et al., 2020). In this work, the following models were used: DT, RF, ANN and AdaBoost.
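The permutation idea can be sketched with Scikit-Learn's `permutation_importance`. The data below are synthetic (one informative feature that fully determines the label and one pure-noise feature), so the example only illustrates the mechanism and does not reproduce any result of the study.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Synthetic data: the label depends only on the first feature.
rng = np.random.default_rng(0)
n_samples = 200
informative = rng.normal(size=n_samples)
noise = rng.normal(size=n_samples)
X = np.column_stack([informative, noise])
y = (informative > 0).astype(int)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Shuffle one column at a time and record the mean drop in accuracy.
result = permutation_importance(
    model, X, y, scoring="accuracy", n_repeats=10, random_state=0
)
importance_informative, importance_noise = result.importances_mean
```

Shuffling the informative column destroys its link with the label, so accuracy falls sharply, whereas shuffling the noise column leaves accuracy essentially unchanged.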

Statistical Analysis
In addition to the ML algorithms, other methods exist to identify VIMs.We implemented univariate analysis, the RFE method and the VIF calculation.
The univariate analysis method involves analyzing each variable in the dataset using the Pearson, Chi² and ANOVA tests. The p-value is used as a criterion to determine the degree of importance of each characteristic. The Chi² test determines whether the variables are related to the objective. The RFE method employs an ML model to iteratively remove the variables with the least impact on the target prediction. Various models can serve as a basis for this technique, such as linear models, SVM and DT, among others. The VIF provides a measure of collinearity that assesses whether two variables in the model are highly correlated and convey similar information about the dataset's variance. In multiple regression, this helps to identify the most significant predictor variables.
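The three families of statistical methods can be sketched on synthetic data as follows. The feature names, the DT base model for RFE, and the min-max rescaling before the Chi² test are assumptions made for the example (Scikit-Learn's `chi2` requires non-negative inputs and is designed for counts, so its use on continuous features is only approximate). The VIF of feature j is computed from its definition, 1 / (1 − R²), where R² comes from regressing feature j on the remaining features.

```python
import numpy as np
from sklearn.feature_selection import RFE, chi2, f_classif
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: 'signal' is related to the label, 'collinear' is a
# near-copy of 'signal', and 'noise' is unrelated to everything.
rng = np.random.default_rng(0)
n_samples = 120
y = rng.integers(0, 2, size=n_samples)
signal = y + rng.normal(scale=0.5, size=n_samples)
collinear = signal + rng.normal(scale=0.05, size=n_samples)
noise = rng.normal(size=n_samples)
X = np.column_stack([signal, collinear, noise])

# Univariate scores: Pearson r, ANOVA F-test, and Chi2 on rescaled features.
pearson_r = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
f_stat, f_pvalues = f_classif(X, y)
chi2_stat, chi2_pvalues = chi2(MinMaxScaler().fit_transform(X), y)

# RFE: drop the least important feature at each step (DT base model assumed).
rfe = RFE(DecisionTreeClassifier(random_state=0), n_features_to_select=1).fit(X, y)


def vif(features: np.ndarray, j: int) -> float:
    """Variance inflation factor of column j: 1 / (1 - R^2)."""
    others = np.delete(features, j, axis=1)
    r2 = LinearRegression().fit(others, features[:, j]).score(others, features[:, j])
    return 1.0 / (1.0 - r2)


vifs = [vif(X, j) for j in range(X.shape[1])]
```

In this construction the two collinear columns receive large VIF values, while the noise column stays close to 1, which is the pattern the VIF is meant to expose.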
Python and Scikit-Learn were utilized for the statistical analysis of the dataset, data processing, and modeling using ML techniques. Data normalization was performed before conducting the data analysis. For the classification modeling, the dataset was randomly split into two subsets: the training dataset (80% of the data) and the testing set (20% of the data). The prediction of whether an individual is aged or not aged was based on the AGG ratio.
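A minimal sketch of the split and of the four classifiers in Scikit-Learn is given below. The feature matrix is a synthetic stand-in with the same number of records (63), and all hyperparameters (number of trees, hidden layer size, iteration limit, random seeds) are assumptions, since they are not reported here.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 63 scaled records; the label rule is invented.
rng = np.random.default_rng(0)
X = rng.normal(size=(63, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Random 80/20 split: with 63 records this yields 50 training and 13 test rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Hyperparameters below are illustrative assumptions.
models = {
    "DT": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "ANN": MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
}

predictions = {
    name: model.fit(X_train, y_train).predict(X_test)
    for name, model in models.items()
}
```

With only 13 test records, a single misclassification changes the accuracy by about 0.077, so scores obtained on a split of this size should be read with caution.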

RESULTS AND DISCUSSION
Table 3 shows the characteristics of the population by group: aged (AGG > 1.0) or not aged (AGG ≤ 1.0). According to the table, 60.3% of the population had a body age of 1.27 years older than the mean age of the staff. This group has an average weight of 74.37 kg, a BMI of 26.94 kg/m² and a BF of 33%. Likewise, 39.6% of the population had a body age of 1.91 years younger than the mean age of the staff, with an average weight of 63.52 kg, a BMI of 23.51 kg/m² and a BF of 32.88%. Both groups display normal glucose levels (< 99) and normal blood pressure (< 120/80).
The Pearson correlation coefficient was used to compare the characteristics of the two groups; the coefficient varies between -1 and 1, with 0 indicating that there is no correlation. The values of the anthropometric measurements WE, BMI, WHtR, AR, HP, SBP and DBP are higher for the aged population, as are the DX health indicator and the MM, VF and BF measurements obtained by BIA.

Classification Models
The F1-score from Scikit-Learn was used as a measure of accuracy for the classification tasks. The score is normalized, and a value approaching 1.0 indicates the best performance. The score of all the classification models was above 0.9, as follows: DT (1.0), RF (0.923), ANN (0.923), AdaBoost (1.0). Additionally, as a measure of performance, the AUC-ROC was computed for each classification model. A value close to 1.0 implies that the model is accurate. The ROC curve is shown in Fig. 3. The AUC of all the classification models was 0.9 or above: DT (1.0), RF (0.9), ANN (0.95), AdaBoost (1.0). A confusion matrix was also used to visualize the specific accuracy for each class (aged or not aged). The confusion matrix showed that all the models correctly classified 100% of the aged individuals. In addition, the RF and ANN models correctly classified only 80% of the not aged labels, while the remaining models scored 100%.
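To make the relation between these metrics concrete, the following sketch evaluates them on a hypothetical test set of 13 records. The class composition (8 aged, 5 not aged) and the single misclassified record are assumptions chosen for illustration; they are not taken from the study's confusion matrices.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, roc_auc_score

# Hypothetical test set: 1 = aged, 0 = not aged.
y_true = np.array([1] * 8 + [0] * 5)

# Hypothetical predictions: every aged record is correct and one
# not-aged record is predicted as aged (4 of 5 not-aged correct).
y_pred = y_true.copy()
y_pred[-1] = 1

cm = confusion_matrix(y_true, y_pred)   # rows: true class, columns: predicted
acc = accuracy_score(y_true, y_pred)    # 12 correct out of 13
f1 = f1_score(y_true, y_pred)           # F1 of the "aged" class
auc = roc_auc_score(y_true, y_pred)     # AUC computed from hard labels
```

In this hypothetical case the accuracy is 12/13 ≈ 0.923, the F1-score of the aged class is 16/17 ≈ 0.941, and the AUC computed from hard labels reduces to the mean of sensitivity and specificity, (1.0 + 0.8)/2 = 0.9. Accuracy and F1 therefore need not coincide, and an AUC computed from predicted probabilities would generally differ from one computed from labels.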

Variable Importance
Since DT, RF and AdaBoost performed better among the prediction models based on ML, SHAP values were used to identify the importance of each feature and its impact on the prediction. A SHAP value near zero indicates little contribution to the prediction; the further the value is from zero, the higher the contribution. For the Pearson and Chi² tests, ANOVA, the RFE method and the VIF, the features are ordered by importance according to the scores given by the statistical analysis. Table 4 shows the top nine features characterizing metabolic age as a result of applying each method.
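The quantity that SHAP values estimate can be illustrated with an exact, brute-force Shapley computation on a toy model. This sketch does not use the SHAP package; the linear toy model, its weights, the feature values and the baseline are all hypothetical, and absent features are simply replaced by baseline values.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict at x; absent features take baseline values."""
    n = len(x)

    def value(subset):
        # Model output when only the features in `subset` take their real values.
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight of a coalition of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_model(z):
    """Hypothetical linear score using BMI, BF and HR (weights are invented)."""
    bmi_value, bf_value, hr_value = z
    return 0.04 * bmi_value + 0.02 * bf_value + 0.0 * hr_value


x = [30.0, 38.0, 72.0]          # hypothetical subject
baseline = [25.0, 33.0, 70.0]   # hypothetical reference values
phi = shapley_values(toy_model, x, baseline)
```

For this toy model the contributions are approximately 0.2 for BMI, 0.1 for BF and 0 for HR, and they add up to the difference between the prediction for the subject and the prediction for the baseline. A feature that the model ignores receives a value of zero, which matches the reading of SHAP values given above.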
The anthropometric measurements that exhibit a greater impact on the prediction of obesity and overweight include WHtR, SBP and DBP, HP and AR, HR and BMI. Additionally, the BIA results, specifically BF and MM, also show a significant impact on the models' predictions.
Within the complete dataset, the classification models identified the most significant features as BMI, BF, WHtR, DBP and HE. Meanwhile, the statistical methods highlighted BF, SBP, WHtR and HP as the most important features. The features that demonstrated greater significance in predicting metabolic aging, considering the eight proposed methods, include BMI, BF and WHtR.
These findings align with previous studies conducted on populations of various ethnicities, where BMI, WA and WHtR are suggested indicators for assessing abdominal obesity and cardiometabolic risk. However, the authors of these studies have acknowledged certain limitations when using each parameter separately.
Devajit and Haradhan (2023) studied BMI as one of the most popular anthropometric tools to measure body fitness, with the intention of uncovering its limitations in accurately assessing obesity in individuals of different ethnicities. The authors found that BMI does not effectively capture overweight/obesity status across all populations, regardless of sex, age, socioeconomic standing and ethnic background. Ashwell et al. (2011) completed a study on individuals of different ethnicities about the use of WHtR in detecting abdominal obesity, along with the possible health risks associated with it. The study's results indicate that WHtR surpasses WA as a predictor for diabetes, dyslipidemia, hypertension and the risk of cardiovascular disease, and that measures of abdominal obesity are more effective than BMI for discerning the cardiometabolic risks linked to obesity. On the other hand, previous research has studied the relation of BF with obesity and metabolic ageing. Sandeep et al. (2010) produced comprehensive gene expression profiles across both visceral and subcutaneous fat stores in Asian Indian individuals with and without diabetes. Additionally, the researchers assessed multiple intermediary phenotypic traits related to diabetes, including distinct anthropometric attributes, indicators of insulin resistance and secretion, glycemic control metrics and distribution of BF, among others. The authors conclude that adipose tissue pathology is linked to diabetes in both subcutaneous and VF deposits, which hold a crucial role in the development of metabolic syndrome.
Regarding BF as an indicator of obesity, Jensen (2008) studied the roles of distinct fat deposits in the storage and release of fatty acids in both healthy individuals and those with obesity. The aim was to discuss the disagreement regarding the claims that upper body or visceral obesity increases the risk of conditions such as type 2 diabetes, that elevated quantities of lower BF are independently linked to a decreased risk of metabolic issues, and that VF mass has a more pronounced correlation with an abnormal metabolic profile than subcutaneous fat in the upper body. The study concludes that abdominal fat accumulation in individuals with overweight is highly associated with the metabolic complications of obesity.
Previous studies performed on the Mexican population also discuss the use of BMI, WA, BF and WHtR as indicators for obesity. Sanchez Soto et al. (2012) found that 80% of people with obesity had a high percentage of BF.
In Gutierrez-Esparza et al. (2020), the authors used ML algorithms to prioritize health parameters, aiming to identify the most suitable variables for classifying Metabolic Syndrome (MetS) within the Mexican population of the city of Tlalpan. They used Correlation-based Feature Selection (CFS) and Chi² filter methods to identify pertinent features for diagnosing MetS. In their results, WHtR, coupled with the Adult Treatment Panel III (ATP III) variables (excluding waist measurement), outperforms WA and BMI in terms of classification accuracy in the prediction of metabolic syndrome in the Mexican population.
In Barquera et al. (2020), the authors analyzed the data of 16,256 individuals to study the prevalence of obesity among Mexican adults while considering various physical and sociodemographic factors and, subsequently, to assess trends in these prevalence rates over time. The classification considered obesity (according to WHO standards), abdominal adiposity (as per IDF criteria), and short stature (following NOM-008-SSA3-2017). The researchers used LR models to identify the correlation between obesity and various risk factors. The results showed that height plays an important role in the identification of obesity in Mexican women and men, although it was more pronounced in women, along with WA as a complementary index that allows the evaluation of VF accumulation. The authors recognize BMI as an indicator of the risk of comorbidities associated with excessive adipose tissue, although they state that this indicator is not very accurate for assessing adiposity at an individual level.
BMI, WHtR, WA and BF are useful to assess cardiovascular disease risk, metabolic syndrome and obesity. Also, BMI is a relevant predictor associated with mortality due to chronic kidney disease and cardiovascular risk in diabetic patients (Sanabria-Arenas, 2015; Mendoza-Niño et al., 2023; Russo et al., 2023).
The findings emerging from this investigation could offer valuable insights for shaping healthcare initiatives for the Mexican population, especially those working in higher education institutions. Including the staff's behaviors in future studies, such as sedentary lifestyles, reduced sleeping hours, lack of health awareness and long working hours, may enhance the efficiency of healthcare supervision and the design of strategies for preemptive measures and active involvement.

CONCLUSION
Sedentary behaviors can lead to obesity or overweight. Therefore, monitoring an individual's health condition is essential for detecting potential health risks that could progress into chronic diseases. This work described the results of data modelling focused on anthropometric measurements collected from the staff of a higher education institution. The anthropometric measurements included age; waist, hip and arm circumferences; height, BP, HR and BMI, among others. Additionally, the results of BIA such as BF, VF, MM and body age were incorporated. The health indicator glucose was also considered. These parameters were used as features in four classification models. Also, the data were analyzed using the univariate method, RFE and VIF. The objective was to determine the variable importance to identify which features played a more crucial role in predicting metabolic aging within the group.
The contributions of this work, which collectively enrich the understanding of obesity, its assessment, and its links to metabolic aging, particularly within the Mexican population and higher education staff, are:
• An in-depth analysis of a specific population's health characteristics is provided. Detailed statistics about the population's body age in relation to their mean age, along with their average weight, BMI, BF percentage, glucose levels and BP, are presented. This comprehensive exploration emphasizes the variations and potential health implications within the studied population.
• A correlation analysis using Pearson correlation coefficients to identify relationships between various characteristics of the population is conducted. This analysis reveals which attributes are positively or negatively correlated and offers insights into potential connections between different health indicators.
• The performance of different ML classification models for predicting metabolic aging is evaluated. The F1-score and AUC-ROC are applied as measures of accuracy and performance. All classification models showed excellent discrimination, achieving high accuracy scores (above 0.9): DT (1.0), RF (0.923), ANN (0.923), AdaBoost (1.0). A ROC curve is also provided to visualize the accuracy of each model, supporting the effectiveness of ML techniques in predicting metabolic aging.
• SHAP values were introduced to interpret the importance of features in the prediction models. They are used to measure the impact of each feature on the prediction, and the results are compared with the variable importance obtained from the statistical methods to find coincidences.
Both anthropometric measurements and the results of BIA provide valuable references for identifying obesity and overweight in individuals.
Fig. 1. Block diagram that describes the work process

Fig. 3. ROC curve of the different classification models

Table 1. Obesity classification by BMI and WA in Mexican adults, according to the Official Mexican Standard (NOM) and the WHO
BMI ≥ 25 in low-height adults*. Abdominal obesity according to the Official Mexican Standard: male ≥ 90 cm, female ≥ 80 cm. BMI = actual weight (kg) / height (m)². * Low height = less than 1.50 meters in adult females and less than 1.60 meters in adult males. Source: INSP (2018).

Table 2. General description of the anthropometries, blood glucose measurements, and functional physical fitness indices obtained by BIA for the staff members

Table 4. Importance of the features characterizing metabolic ageing by method.