Water resources are essential to human, animal, and plant life. Recent efforts combine artificial intelligence methods such as Random Forest (RF) with logical-mathematical models to predict water quality (WQ). In a recent paper published in the journal Water, researchers developed a rule-based inference method to generate WQ labels, then combined RF, physicochemical parameters, and expert insights to build a predictive water quality model.
Background
The critical importance of water as a life-sustaining resource cannot be overstated, particularly in arid landscapes facing escalating water scarcity. Despite the Earth's surface being predominantly covered by water, the availability of high-quality water suitable for agricultural, livestock and human use is becoming increasingly limited. WQ assessment involves various parameters, including temperature, pH, REDOX potential, electrical conductivity (EC), and dissolved oxygen (DO). Additionally, elements such as arsenic (As), copper (Cu), boron (B), and lead (Pb) contribute significantly to this assessment.
The Loa River, located within the Atacama Desert, presents a complex scenario. Its geologically rich setting gives rise to distinctive physicochemical attributes, marked by elevated concentrations of arsenic and boron, as well as salinity fluctuations from the river's origin to its mouth. Amid intense mining, heightened water stress, human impact, and a shifting climate, this research pioneers the application of AI and data science to predict WQ in this intricate context.
Exploring ecological environments
The research zone encompasses a segment of the Loa River basin within Chile's Antofagasta Region, through which the river runs from its high-altitude source to its outlet on the Pacific Ocean. This study focuses on seven WQ monitoring stations near the city of Calama. The area under investigation extends from a point upstream of the Loa River's confluence with its main tributary, the Salado River, to a point downstream of where the river leaves Calama.
As Chile's longest river (440 km), the Loa River originates at an elevation of 3950 meters above sea level (m.a.s.l.) and traverses the arid core of the Atacama Desert, acting as a vital green corridor. The Loa River basin stands as the largest in this intensely arid desert. Nevertheless, intensive mining activities have led to a decline in the river's flow due to aquifer depletion, adversely affecting indigenous communities and local flora and fauna. The Salado River, a significant tributary, is known for elevated arsenic (As) levels and is fed by the El Tatio geothermal field.
Modeling WQ and classification in the Loa River basin
The research begins by establishing WQ labels and classifications based on the physical and chemical properties of the Loa River basin. A set of numbered WQ labels is generated using physicochemical data from the Directorate General of Water (DGA), delineating specific ranges by applying threshold values and expert insights. These labels are further categorized into three levels: low, medium, and high, relevant to diverse water use scenarios.
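The threshold-based labeling described above can be illustrated with a minimal Python sketch. The parameter names, the threshold values, and the "worst parameter wins" rule are all hypothetical choices made for illustration; they are not the ranges or production rules used in the study.

```python
# Hypothetical thresholds for illustration only; NOT the values used in the study.
# Each entry maps a parameter to (upper limit for "high" quality,
# upper limit for "medium" quality). Anything above the second limit is "low".
THRESHOLDS = {
    "arsenic_mg_l": (0.01, 0.10),
    "boron_mg_l": (0.75, 4.00),
    "ec_us_cm": (750.0, 3000.0),
}

LEVELS = ("high", "medium", "low")  # ordered from best to worst quality


def grade(parameter, value):
    """Return 0, 1 or 2 (an index into LEVELS) for a single measurement."""
    high_limit, medium_limit = THRESHOLDS[parameter]
    if value <= high_limit:
        return 0
    if value <= medium_limit:
        return 1
    return 2


def label_sample(sample):
    """Assumed production rule: a sample is only as good as its worst parameter."""
    worst = max(grade(name, value) for name, value in sample.items())
    return LEVELS[worst]


sample = {"arsenic_mg_l": 0.05, "boron_mg_l": 0.5, "ec_us_cm": 600.0}
print(label_sample(sample))  # medium (arsenic falls in the middle band)
```

In practice the rule set would presumably be richer than a single "worst parameter" rule, since the labels are meant to reflect different water use scenarios.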
Subsequent steps involve the construction and validation of a WQ prediction model at different observation points. This model harnesses the power of the RF methodology, utilizing significant physicochemical parameters as predictors and WQ as the target variable. The process includes stages such as feature selection, dataset division for training and testing, and the creation of RF models. Model outcomes are assessed using essential metrics such as accuracy, precision, and recall.
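To make the evaluation metrics concrete, the sketch below computes accuracy, precision, and recall by hand for a three-level WQ label. The true and predicted labels are invented for illustration and do not come from the study; the per-class (one-versus-rest) treatment of precision and recall is also an assumption about how the metrics might be applied to a multi-class target.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def precision_recall(y_true, y_pred, positive):
    """One-versus-rest precision and recall for the class `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Invented labels for illustration only.
y_true = ["low", "low", "medium", "medium", "high", "high", "high", "low"]
y_pred = ["low", "medium", "medium", "medium", "high", "low", "high", "low"]

print(accuracy(y_true, y_pred))  # 0.75
for level in ("low", "medium", "high"):
    print(level, precision_recall(y_true, y_pred, level))
```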
Comparing the generated labels with the RF outcomes highlights the importance of the predictor variables and of the sample distribution. Scatter matrices are used to visualize the significance of the physicochemical parameters. The overall research methodology adheres to established practices for modeling and analyzing WQ dynamics.
Results of predictive modeling and variable analysis
The study confirms pairwise independence among the chosen predictive variables. WQ ranges specific to the Atacama Desert are generated, informed by Chilean WQ regulations and expert insights. Importantly, independent variable ranges are defined by meticulously analyzing expert knowledge, existing literature, and established WQ standards.
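One common way to screen predictors for pairwise dependence is to compute the Pearson correlation for every pair of variables. The sketch below does this in plain Python. It is an assumption that a correlation-based check resembles what the study did; the measurements are invented, and a low Pearson coefficient only rules out linear association, not dependence in general.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def pairwise_correlations(columns):
    """Map each unordered pair of column names to its Pearson coefficient."""
    return {
        (a, b): pearson(columns[a], columns[b])
        for a, b in combinations(columns, 2)
    }


# Invented measurements for illustration only.
columns = {
    "ph": [7.8, 8.1, 8.0, 7.9, 8.3],
    "ec_us_cm": [2100.0, 3500.0, 2900.0, 4100.0, 2600.0],
    "arsenic_mg_l": [0.12, 0.35, 0.08, 0.40, 0.22],
}

for pair, r in pairwise_correlations(columns).items():
    print(pair, round(r, 3))
```

Pairs whose coefficient is close to 1 or -1 would be candidates for dropping one of the two variables before training.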
Based on these independent variable ranges, WQ values are generated to serve as the dependent variable in the training datasets for the predictive models. The RF training process involves seven datasets, with a thorough comparison to determine optimal algorithm parameters. Precision is highest with the "information gain" and "Gini index" split criteria, reaching 75% with 50 trees.
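For readers unfamiliar with the two criteria, the sketch below shows how Gini impurity and information gain (based on entropy) score a candidate split inside a decision tree, which is the building block of an RF. The labels are invented for illustration, and the function names are hypothetical rather than taken from any particular library.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy of the class distribution, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def split_gain(parent, left, right, impurity):
    """Impurity reduction from splitting `parent` into `left` and `right`.

    With impurity=entropy this is the information gain;
    with impurity=gini it is the Gini gain.
    """
    n = len(parent)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - weighted


# Invented labels: a balanced parent node and an imperfect split of it.
parent = ["low"] * 4 + ["high"] * 4
left = ["low", "low", "low", "high"]
right = ["low", "high", "high", "high"]

print(split_gain(parent, left, right, gini))     # 0.125
print(split_gain(parent, left, right, entropy))  # information gain in bits
```

At each node a tree picks the split with the largest gain under the chosen criterion, so the two criteria can produce different trees from the same data.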
Conclusion
In summary, the study introduces an innovative approach to identifying significant physicochemical parameter values and ranges within arid desert environments. These ranges are tailored to the characteristics of the Loa River basin and are used to generate the labels for water quality classification. The proposed method leverages production rules derived from threshold values and expert knowledge adapted to the region's specific context.
The study also pioneers the use of RF to predict water quality, circumventing the need for the traditional WQ Index. This approach holds the potential for broader application in diverse contexts, paving the way for predictive models attuned to varied geochemical conditions.