Is AI's future at risk? Scientists warn that poorly structured data is undermining AI's potential—and they have a plan to fix it.
Research: The Logic and Architecture of Future Data Systems.
In a recent Views & Comments column published in Engineering, researchers Jinghai Li and Li Guo from the Chinese Academy of Sciences examine the future development of data science, focusing on its crucial role in artificial intelligence (AI).
The article begins by highlighting the increasing significance of scientific data systems in research and development (R&D). Data has become the linchpin of AI's rapid progress, influencing every stage of AI model development, from training to evaluation and optimization. However, scientific data stems from long-term research on multi-level complex spatiotemporal dynamic processes and presents numerous challenges. The current incomplete understanding of these complex spatiotemporal structures leads to issues in data accumulation, modeling, and application.
For instance, image data has a hierarchical structure, and convolutional neural networks (CNNs) leverage this structure for image recognition. However, if the logic and architecture of a data system do not align with the data's inherent characteristics, the result can be model prediction errors, poor generalization ability, and increased computational costs. This not only affects AI and data science but also poses a challenge to scientific research. Different researchers may obtain varying data for the same phenomenon, and inappropriate averaging techniques for complex spatiotemporal structures can overlook crucial relationships.
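The point about averaging can be illustrated with a minimal Python sketch. The numbers and the regime labels ("dilute", "dense") are invented for illustration and do not come from the column; the sketch only shows how a single global average can describe a state that no measurement actually occupies, while grouping by regime keeps the structure visible.

```python
from statistics import mean

# Hypothetical measurements as (regime label, value) pairs.
# All values are made up for illustration only.
samples = [
    ("dilute", 0.1), ("dense", 0.9), ("dilute", 0.2),
    ("dense", 0.8), ("dilute", 0.1), ("dense", 0.9),
]


def global_mean(data):
    """Average every value, ignoring which regime it belongs to."""
    return mean(value for _, value in data)


def mean_by_regime(data):
    """Average values separately for each regime label."""
    groups = {}
    for regime, value in data:
        groups.setdefault(regime, []).append(value)
    return {regime: mean(values) for regime, values in groups.items()}


overall = global_mean(samples)
per_regime = mean_by_regime(samples)

print("global mean:", round(overall, 3))
for regime, value in per_regime.items():
    print(f"{regime} mean:", round(value, 3))
```

In this toy data set the global mean sits between the two regimes and is far from every individual sample, which is the kind of lost relationship the authors warn about.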
The researchers propose that future data collection and processing should adhere to certain principles to address these issues. Given complex systems' multi-level and multi-scale nature, data collection should clarify multi-level characteristics, spatiotemporal structural characteristics, and key variables. Additionally, it should define the critical conditions for regime transitions and annotate unobtainable data.
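As a rough sketch of what such a record might look like in practice, the following Python data structure stores levels, key variables, regime-transition conditions, and explicit annotations for unobtainable data. Every class name, field name, and example value here is a hypothetical chosen for illustration; the column does not prescribe a schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Variable:
    """A key variable; value is None when it could not be obtained."""
    name: str
    unit: str
    value: Optional[float] = None
    unobtainable_reason: Optional[str] = None  # annotation for missing data


@dataclass
class Level:
    """One level of the system with its characteristic scales."""
    name: str
    length_scale_m: float
    time_scale_s: float
    key_variables: List[Variable] = field(default_factory=list)


@dataclass
class RegimeTransition:
    """Critical condition under which the system changes regime."""
    from_regime: str
    to_regime: str
    condition: str


@dataclass
class Record:
    system: str
    levels: List[Level]
    transitions: List[RegimeTransition] = field(default_factory=list)

    def missing(self) -> List[Tuple[str, str]]:
        """Return (level, variable) pairs that have no measured value."""
        return [
            (level.name, var.name)
            for level in self.levels
            for var in level.key_variables
            if var.value is None
        ]


# Hypothetical example record with invented values.
record = Record(
    system="example_system",
    levels=[
        Level("micro", 1e-6, 1e-6, [Variable("velocity", "m/s", 0.3)]),
        Level("meso", 1e-3, 1e-3, [
            Variable("cluster_size", "m", None, "no sensor at this scale"),
        ]),
        Level("macro", 1.0, 1.0, [Variable("flow_rate", "kg/s", 2.0)]),
    ],
    transitions=[RegimeTransition("regime_A", "regime_B", "flow_rate > 5.0")],
)

print(record.missing())
```

Keeping the reason for a missing value next to the variable, rather than dropping the variable, lets downstream users distinguish "not measured" from "not relevant".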
The article also emphasizes the importance of rearranging AI models into a multi-level architecture. Taking large language models (LLMs) as an example, by integrating text data's inherent logic and structure into their construction, LLMs can better capture semantic information, enhancing text comprehension, sentence generation, and logical reasoning capabilities.
Currently, data collection and processing principles are often neglected, restricting the development of data systems and AI. The authors call on researchers and practitioners to fully recognize the significance of data system logic and architecture. A global standard protocol framework and operation guide for hierarchically structured data are needed to foster a high-quality data ecosystem and promote the healthy development of AI. Applying the principle of mesoscale complexity in data-related processes also shows promise for data science and AI.
In conclusion, the new research paradigm requires attention to the multi-level structures of complex systems during data-related activities and AI analysis. This means strictly adhering to the principle that data behavior and functional relationships should match the research object, which in turn places higher demands on interdisciplinary research.