In a paper published in the journal Digital Chemical Engineering, researchers used machine learning (ML) to predict the solubility of salicylic acid (SA) in 13 solvents. They applied six algorithms (neural network, linear regression, logistic regression, decision tree (DT), random forest (RF), and k-nearest neighbors (KNN)) to 217 samples described by 15 variables (13 solvents, temperature, and pressure). The RF algorithm achieved the lowest total error and KNN the highest, and the results indicate that ML can predict solubility accurately.
Related Work
Previous research on SA, a natural phenolic compound used extensively in treating skin disorders, highlights its historical and medicinal importance. Known for its exfoliating and comedolytic properties, SA is used in conditions such as acne and photodamage.
As the primary metabolite of aspirin, salicylic acid also contributes to aspirin's anti-inflammatory and cancer-preventive effects. Determining SA solubility in various solvents is crucial but traditionally costly and time-consuming. Thermodynamic methods have been used, but they involve complex calculations and struggle with large data sets.
Predicting SA Solubility
This study employed six ML algorithms to predict salicylic acid solubility in various solvents: DT, KNN, linear regression, logistic regression, RF, and neural network. For each model, the analysts assigned the data and labels to variables X and T. The DT model was trained using MATLAB's `fitrtree` function with default parameters, the KNN model using the `fitcknn` function with the number of neighbors set to 3, and the linear regression model using the `fitlm` function.
Logistic regression was implemented using the `fitglm` function, specifying a binomial distribution and the logit link function. The RF model was trained using the `TreeBagger` function with 90 trees. Lastly, the neural network model was constructed with 10 hidden neurons, with the data split into training, testing, and validation sets (60%, 30%, and 10%, respectively). Training used the `train` function with default activation functions.
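The 60/30/10 split described above can be illustrated with a short Python sketch. This is not the authors' MATLAB code: the function name `split_indices`, the fixed random seed, and the rounding rule for the 217 samples are all assumptions made for illustration.

```python
import random


def split_indices(n_samples, fractions=(0.6, 0.3, 0.1), seed=0):
    """Shuffle sample indices and split them into train/test/validation lists.

    The rounding rule and the seed are illustrative assumptions; the article
    only reports the 60% / 30% / 10% proportions.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = round(fractions[0] * n_samples)
    n_test = round(fractions[1] * n_samples)
    train = indices[:n_train]
    test = indices[n_train:n_train + n_test]
    validation = indices[n_train + n_test:]  # whatever remains
    return train, test, validation


train_idx, test_idx, val_idx = split_indices(217)
print(len(train_idx), len(test_idx), len(val_idx))  # 130 65 22
```

With 217 samples the proportions cannot be met exactly, so the validation set here simply receives the remainder.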
The performance of each algorithm was assessed by calculating the total error between predicted and experimental values. Experimental data were gathered from reliable sources, ensuring comprehensive coverage by including multi-component and single-component systems.
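The article does not give the formula behind "total error". As a minimal sketch, the Python function below assumes it is the sum of absolute differences between predicted and experimental mole-fraction solubilities; the paper may use a different definition (for example a squared or averaged error), and the sample values are invented purely for illustration.

```python
def total_error(predicted, experimental):
    """Sum of absolute differences (an assumed definition of 'total error')."""
    if len(predicted) != len(experimental):
        raise ValueError("predicted and experimental must have the same length")
    return sum(abs(p - e) for p, e in zip(predicted, experimental))


# Made-up mole-fraction values, used only to exercise the function.
example_predicted = [0.10, 0.20, 0.05]
example_experimental = [0.12, 0.17, 0.05]
print(total_error(example_predicted, example_experimental))
```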
Special attention was given to temperature variations during data collection. The details of input and output variables, including the solvents, number of samples, temperature ranges, and solubility ranges for each solvent, were summarized in a comprehensive table.
The solvents studied included water, methanol, ethanol, ethyl acetate, PEG 300, 1,4-dioxane, and 1-propanol. Data for solvents such as ethanol and water were taken across multiple temperatures, and solubility was measured in mole fractions. The independent parameters in this study were the 13 different solvents, temperature, and pressure.
These parameters served as inputs to the machine learning models, which then predicted the solubility of salicylic acid. The dependent parameter was therefore the solubility of salicylic acid under the given conditions. This approach provided a computational modeling framework with potential value for pharmaceutical applications.
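One plausible way to arrange the 15 inputs is a row with one column per solvent followed by temperature and pressure. The Python sketch below assumes each solvent column holds that solvent's fraction in the mixture (1.0 for a single-component system), with temperature in kelvin and pressure in an unspecified unit; none of this layout is confirmed by the article. Only seven of the 13 solvents are named in the text, so the remaining six appear as explicitly hypothetical placeholders.

```python
# Solvents named in the article.
NAMED_SOLVENTS = [
    "water", "methanol", "ethanol", "ethyl acetate",
    "PEG 300", "1,4-dioxane", "1-propanol",
]
# Placeholders for the six solvents the article does not name (hypothetical).
SOLVENTS = NAMED_SOLVENTS + [f"unnamed_solvent_{i}" for i in range(8, 14)]


def encode_sample(composition, temperature, pressure):
    """Build a 15-element input row: 13 solvent fractions, temperature, pressure.

    `composition` maps solvent name -> fraction (assumed encoding).
    """
    unknown = set(composition) - set(SOLVENTS)
    if unknown:
        raise ValueError(f"unknown solvents: {sorted(unknown)}")
    row = [composition.get(name, 0.0) for name in SOLVENTS]
    row.extend([temperature, pressure])
    return row


pure_ethanol = encode_sample({"ethanol": 1.0}, temperature=298.15, pressure=1.0)
mixture = encode_sample({"ethanol": 0.4, "water": 0.6}, temperature=308.15, pressure=1.0)
print(len(pure_ethanol), len(mixture))  # 15 15
```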
SA Solubility Prediction
The experimental dataset included 217 samples across solvents such as methanol, water, ethanol, ethyl acetate, and PEG 300, among others. ML methods offer greater flexibility than traditional thermodynamic methods, adapting to diverse problems and data sets. During implementation, a command was added to ensure all predicted values were positive, correcting any negative values produced by the algorithms.
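The article does not say how negative predictions were handled (clipping, taking absolute values, or something else). As one hedged possibility, the Python sketch below clips predictions at a floor of zero; the function name `clip_negative` and the choice of floor are assumptions.

```python
def clip_negative(predictions, floor=0.0):
    """Replace any prediction below `floor` with `floor` (assumed correction)."""
    return [max(p, floor) for p in predictions]


raw_predictions = [-0.01, 0.20, 0.0, 0.035]  # made-up values for illustration
print(clip_negative(raw_predictions))  # [0.0, 0.2, 0.0, 0.035]
```

A solubility expressed as a mole fraction cannot be negative, which is why a regression model's raw output may need such a correction.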
The team evaluated the performance of each algorithm based on the total error between predicted and experimental values. The total error for the neural network model was 0.0096964, with high R values for training, testing, and validation sets indicating effective model performance.
The linear regression model achieved a total error of 0.015122, while logistic regression had a total error of 0.020409. The KNN algorithm resulted in a total error of 0.024768, the highest of the six but still indicating acceptable prediction quality. The decision tree algorithm performed well with a total error of 0.0066577, and the RF algorithm had the lowest total error of 0.00016835, showcasing its superior predictive capability.
Overall, the ML approach to predicting salicylic acid solubility based on input variables like solvents, temperature, and pressure proved effective. Across the extensive experimental dataset, all six algorithms showed desirable performance, with the RF algorithm exhibiting the highest accuracy and best agreement with experimental results. The results underscore the potential of ML models in enhancing pharmaceutical research and development by providing accurate solubility predictions.
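For a quick side-by-side comparison, the total errors reported above can be sorted in a few lines of Python. The numbers are those quoted in this article; the dictionary layout and variable names are illustrative only.

```python
# Total errors as reported in the article.
reported_total_error = {
    "neural network": 0.0096964,
    "linear regression": 0.015122,
    "logistic regression": 0.020409,
    "k-nearest neighbors": 0.024768,
    "decision tree": 0.0066577,
    "random forest": 0.00016835,
}

# Sort from lowest (best) to highest (worst) total error.
ranking = sorted(reported_total_error.items(), key=lambda item: item[1])
for rank, (model, error) in enumerate(ranking, start=1):
    print(f"{rank}. {model}: {error}")
```

Sorted this way, the random forest comes first, followed by the decision tree, the neural network, linear regression, logistic regression, and KNN.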
Conclusion
In summary, solubility is crucial in drug development, impacting absorption and clinical response. This study explored the solubility of salicylic acid across 13 solvents under varying temperature and pressure conditions. Because of the extensive experimental data, the researchers turned to ML and employed six algorithms: linear regression, logistic regression, neural network, DT, RF, and KNN.
Overall, all algorithms performed well in predicting solubility, with the RF algorithm yielding the best results. These findings underscore the significance of ML in enhancing the crystallization process of salicylic acid production in the pharmaceutical industry.