Oracle-MNIST Dataset Unveils Challenges for ML in Ancient Chinese Character Recognition

In an article published in the journal Scientific Data, researchers from the Beijing University of Posts and Telecommunications, China, developed an innovative dataset of ancient Chinese characters called Oracle-MNIST for benchmarking, testing, and improving machine learning (ML) algorithms. Their dataset consists of 30,222 grayscale images of 10 categories of oracle characters, which are the oldest hieroglyphs in China.

Study: Oracle-MNIST Dataset Unveils Challenges for ML in Ancient Chinese Character Recognition. Image credit: Gusztav Bartfai/Shutterstock

Background

ML is a branch of artificial intelligence (AI) that enables computers to learn from data and perform tasks such as regression, clustering, anomaly detection, speech recognition, computer vision, and natural language processing. It can also be used to recognize text characters: neural network architectures such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are commonly employed to identify text within documents.

These algorithms often require large, specialized datasets for training and performance evaluation. A frequently used dataset in computer vision is MNIST, which contains 70,000 images of handwritten digits from 0 to 9. However, this dataset has become too easy for modern ML algorithms to serve as a discriminating benchmark. Therefore, there is a need for more realistic and challenging datasets that capture the requirements and variations of real-world scenarios.

About the Research

In the present paper, the authors introduced the Oracle-MNIST dataset, built from oracle characters collected from the YinQiWenYuan website, a large oracle-bone platform constructed by AnYang Normal University. Oracle characters are an ancient Chinese script found on turtle shells and animal bones, and they serve as a historical record of the Shang Dynasty (around 1600-1046 B.C.). This writing system is a valuable resource for understanding early Chinese civilization, culture, history, and language, and scholars use it to gain insights into past societal norms.

The oracle bone script helps experts reconstruct details of ancient Chinese life and plays a significant role in the analysis of historical writings. However, recognizing oracle characters is difficult for both experts and machines, as many of them have been damaged, distorted, or corrupted over the centuries. Moreover, the characters exhibit high intra-class variance and high inter-class similarity because of the variety of writing styles used in ancient China.

The Oracle-MNIST dataset comprises 28 × 28-pixel, 8-bit grayscale images of 30,222 ancient characters from 10 categories, selected from the characters commonly used in oracle-bone inscriptions. The images suffer from unusual and extremely severe noise caused by thousands of years of aging and burial. They also contain a variety of writing styles, which makes them more realistic and more difficult for ML models. The dataset is divided into a training set of 27,222 images and a test set of 300 images per class (3,000 in total). A version with a resolution of 224 × 224 and the original red-green-blue (RGB) images is also provided. The dataset is intended to serve as a benchmark for pattern classification, with particular challenges related to image noise and distortion.
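The split figures in the paragraph above can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted in this article; the variable names are illustrative and do not come from the paper.

```python
# Figures quoted in the article; the names are illustrative only.
TOTAL_IMAGES = 30_222
TRAIN_IMAGES = 27_222
NUM_CLASSES = 10
TEST_PER_CLASS = 300
SIDE = 28  # pixels per side; 8-bit grayscale means one byte per pixel

# The test set holds 300 images for each of the 10 classes.
test_images = NUM_CLASSES * TEST_PER_CLASS
assert TRAIN_IMAGES + test_images == TOTAL_IMAGES

# Raw storage needed for the 28 x 28 version, ignoring any file headers.
bytes_per_image = SIDE * SIDE
raw_size_mib = TOTAL_IMAGES * bytes_per_image / 2**20

# A non-zero remainder means the training classes cannot all be equal in size.
train_remainder = TRAIN_IMAGES % NUM_CLASSES

print(f"test images:      {test_images}")
print(f"test share:       {test_images / TOTAL_IMAGES:.1%}")
print(f"bytes per image:  {bytes_per_image}")
print(f"raw pixel data:   {raw_size_mib:.1f} MiB")
print(f"train remainder:  {train_remainder}")
```

Because 27,222 is not divisible by 10, the training classes cannot all be the same size, which is consistent with the class imbalance the researchers mention.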

Research Findings

The study evaluated the proposed dataset using several ML algorithms, including CNNs, support vector machines (SVMs), and k-nearest neighbors (k-NN). It compared the results with those obtained on the original MNIST dataset and on two MNIST-style datasets, namely EMNIST and Fashion-MNIST. The authors found that Oracle-MNIST is more difficult than the other datasets: classifiers achieve lower accuracy on it and confuse its classes more often. They also analyzed the sources of errors and difficulties in the Oracle-MNIST dataset, such as the noise level, the stroke number, the stroke direction, and the character structure.
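To make the evaluation terms concrete, here is a minimal sketch of a k-NN classifier scored with a confusion matrix. It is not the authors' evaluation code: the 2-D points are toy stand-ins for flattened images, and the function names are hypothetical.

```python
from collections import Counter

def knn_predict(train_x, train_y, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    # Squared Euclidean distance is sufficient for ranking neighbours.
    ranked = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, query)), y)
        for x, y in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

def confusion_matrix(true_labels, pred_labels, num_classes):
    """Rows index the true class, columns index the predicted class."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_labels, pred_labels):
        matrix[t][p] += 1
    return matrix

def accuracy(matrix):
    """Fraction of samples on the diagonal (correct predictions)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Toy data: two well-separated clusters, plus one test point of class 0
# that sits closer to the class 1 cluster and will be misclassified.
train_x = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = [0, 0, 0, 1, 1, 1]
test_x = [(0.5, 0.5), (5.5, 5.5), (3.1, 3.1)]
test_y = [0, 1, 0]

preds = [knn_predict(train_x, train_y, q, k=3) for q in test_x]
matrix = confusion_matrix(test_y, preds, num_classes=2)
print(matrix)
print(f"accuracy: {accuracy(matrix):.3f}")
```

Off-diagonal entries count confusions between classes, so a harder dataset shows up as more mass away from the diagonal and a lower accuracy.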

The newly designed dataset has potential applications in both ML and ancient literature fields. For ML research, the dataset can provide a realistic and challenging testbed for developing and testing new algorithms and models, as well as for comparing and benchmarking existing ones. The dataset can also stimulate new research directions and innovations in ML, such as noise robustness, style invariance, and character recognition.

For ancient literature research, the dataset can facilitate the study and interpretation of oracle characters and ancient civilizations, as well as the preservation and dissemination of cultural heritage. It can also encourage collaboration between the ML and ancient literature communities and foster interdisciplinary research and education.

Conclusion

In summary, the presented dataset is open-source and well suited to benchmarking ML models. It is based on oracle characters and is designed to be a direct drop-in replacement for the original MNIST dataset. However, it poses more challenges because of the image noise and distortion caused by thousands of years of burial and aging, and because of the large variance in ancient Chinese writing styles. The authors evaluated the dataset with different algorithms, such as random forest (RF), k-NN, and decision tree (DT), among others. They emphasized that Oracle-MNIST is more difficult and realistic than MNIST and can become a valuable asset for archaeologists and paleographers.
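As an illustration of what "drop-in replacement" can mean in practice, the sketch below parses an image file in the IDX layout used by the original MNIST files (a big-endian header of magic number 2051, image count, rows, and columns, followed by one byte per pixel). That Oracle-MNIST files follow this layout is an assumption here, not something the article states; the function name and the synthetic two-image buffer are hypothetical.

```python
import struct

IDX_IMAGE_MAGIC = 2051  # 0x00000803, the magic number of MNIST image files

def parse_idx_images(buf: bytes):
    """Parse an MNIST-style IDX image buffer into a list of pixel rows.

    Returns (images, rows, cols), where each image is a bytes object
    of rows * cols 8-bit grayscale values in row-major order.
    """
    magic, count, rows, cols = struct.unpack(">IIII", buf[:16])
    if magic != IDX_IMAGE_MAGIC:
        raise ValueError(f"unexpected magic number: {magic}")
    pixels = buf[16:]
    size = rows * cols
    if len(pixels) != count * size:
        raise ValueError("pixel data length does not match the header")
    images = [pixels[i * size:(i + 1) * size] for i in range(count)]
    return images, rows, cols

# Synthetic stand-in for a real file: one all-black and one all-white image.
header = struct.pack(">IIII", IDX_IMAGE_MAGIC, 2, 28, 28)
fake_file = header + bytes(28 * 28) + bytes([255]) * (28 * 28)

images, rows, cols = parse_idx_images(fake_file)
print(len(images), rows, cols)
```

Under that assumption, a loader written for MNIST would only need to be pointed at different files, which is the practical benefit of keeping the same image size and class count.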

The researchers acknowledged challenges and limitations, including data scarcity and imbalance, abrasion and noise, and large variance and similarity in the oracle character data. They suggested that future work can explore more robust and discriminative features, more comprehensive evaluation, and more practical applications such as understanding ancient texts, historical chronology, and cultural heritage preservation.


Written by

Muhammad Osama

Muhammad Osama is a full-time data analytics consultant and freelance technical writer based in Delhi, India. He specializes in transforming complex technical concepts into accessible content. He has a Bachelor of Technology in Mechanical Engineering with specialization in AI & Robotics from Galgotias University, India, and he has extensive experience in technical content writing, data science and analytics, and artificial intelligence.

Citations

Please use one of the following formats to cite this article in your essay, paper or report:

  • APA

    Osama, Muhammad. (2024, January 25). Oracle-MNIST Dataset Unveils Challenges for ML in Ancient Chinese Character Recognition. AZoAi. Retrieved on December 22, 2024 from https://www.azoai.com/news/20240125/Oracle-MNIST-Dataset-Unveils-Challenges-for-ML-in-Ancient-Chinese-Character-Recognition.aspx.

