In an article recently submitted to the arXiv* preprint server, researchers presented an open-source gaze-tracking solution for smartphones, emphasizing the significance of eye tracking across diverse domains such as vision research and usability assessment. The primary goal was accurate tracking without the need for extra hardware. By harnessing machine learning techniques, the approach achieved eye-tracking accuracy on smartphones comparable to that of far more expensive dedicated mobile trackers.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
Using the Massachusetts Institute of Technology (MIT) GazeCapture dataset, the method replicated key findings related to ocular behavior and saliency analysis. Smartphone-based gaze tracking holds significant potential, particularly for addressing reading comprehension challenges and facilitating broader research participation. This scalability not only contributes to advancements in vision research but also extends its benefits to areas such as accessibility enhancement and healthcare applications.
Background
In recent years, eye tracking has gained prominence across vision research, linguistics, and usability assessment. Yet much of the focus has been on desktop displays with specialized and costly hardware, limiting accessibility and scalability. Meanwhile, smartphones have transformed human-computer interaction, prompting a need to understand eye movement on mobile devices. Despite extensive smartphone use, ocular motion patterns on these devices remain understudied.
Proposed method
The gaze-tracking model employs a multilayer feed-forward convolutional neural network (ConvNet) with distinctive components for precise gaze prediction. The process begins by extracting essential facial features from input images via MobileNets-based face detection. The base model is trained on the MIT GazeCapture dataset to predict gaze locations. Eye regions, scaled to 128x128x3 pixels, are each processed through an individual ConvNet tower whose convolutional layers use progressively smaller kernel sizes. Rectified Linear Units (ReLUs) introduce nonlinearity, and horizontal flipping ensures symmetry in learning. The ConvNet outputs are then merged with fully connected layers that handle eye landmarks.
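The dual-tower layout described above can be sketched in PyTorch. This is a minimal illustration, not the authors' implementation: the channel counts, kernel schedule (7, 5, 3), landmark dimensionality, and head widths are assumptions chosen only to show the structure (two eye towers, a landmark branch, and a merged regression head producing x/y coordinates); the right-eye flip is one common way to realize the symmetry mentioned in the text.

```python
import torch
import torch.nn as nn

class EyeTower(nn.Module):
    """One ConvNet tower for a 128x128x3 eye crop, with kernel
    sizes decreasing through the stack (7 -> 5 -> 3)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=7, stride=2), nn.ReLU(),
            nn.AvgPool2d(2),
            nn.Conv2d(32, 64, kernel_size=5, stride=2), nn.ReLU(),
            nn.AvgPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )

    def forward(self, x):
        return self.features(x).flatten(1)  # (N, 128) feature vector

class GazeNet(nn.Module):
    """Two eye towers plus a small branch over eye-corner landmarks,
    merged into a regression head that outputs (x, y) screen coords."""
    def __init__(self):
        super().__init__()
        self.left, self.right = EyeTower(), EyeTower()
        self.landmarks = nn.Sequential(nn.Linear(8, 16), nn.ReLU())
        self.head = nn.Sequential(
            nn.Linear(128 + 128 + 16, 64), nn.ReLU(),
            nn.Linear(64, 2),
        )

    def forward(self, left_eye, right_eye, lm):
        z = torch.cat([
            self.left(left_eye),
            # Mirror the right eye so both towers see the same orientation.
            self.right(torch.flip(right_eye, dims=[3])),
            self.landmarks(lm),
        ], dim=1)
        return self.head(z)

model = GazeNet()
out = model(torch.randn(4, 3, 128, 128), torch.randn(4, 3, 128, 128),
            torch.randn(4, 8))
print(out.shape)  # torch.Size([4, 2])
```

The key design point is that each eye gets its own tower while the landmark branch supplies coarse head/eye geometry, and only the concatenated features see the final regression layers.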
The final regression head predicts x and y screen coordinates. Fine-tuning with calibration data and a lightweight support vector regression (SVR) model further refines the gaze predictions. The combined approach yields accurate gaze tracking that follows Google's methodology; the implementation includes careful data preparation, post-training quantization, adaptive learning-rate strategies, distinctive loss functions, and evaluation metrics. SVR personalization improves accuracy, particularly in scenarios with varied gaze positions, underlining the value of tailored adjustments based on dataset characteristics.
Experimental Results
The PyTorch-trained models show promising predictive performance, with results that differ slightly from Google's reported figures but remain valid within the experimental context. SVR personalization, applied to the output of the model's second-to-last layer, aims to improve gaze-tracking precision. Its impact is not uniformly positive, but it yields promising outcomes: notably, SVR benefits from the larger training set in the 70/30 split scenario, where it produces a significant improvement. Visual comparisons illustrate SVR's influence on predictions and its nuanced outcomes. Overall, the approach highlights the potential of SVR for personalized gaze-tracking enhancement.
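The personalization step above can be sketched as follows. This is a toy example, not the paper's code: the 64-dimensional feature vectors stand in for the network's penultimate-layer outputs on a user's calibration frames, the synthetic linear gaze mapping replaces real ground-truth targets, and the `C` value is an arbitrary choice.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Stand-ins for penultimate-layer features of 40 calibration frames
# (in practice these come from the trained network) and the known
# on-screen calibration targets.
features = rng.normal(size=(40, 64))
true_xy = features @ rng.normal(size=(64, 2))  # synthetic gaze mapping

# One SVR per screen coordinate, fit on the user's calibration data.
svr_x = SVR(kernel="rbf", C=20.0).fit(features, true_xy[:, 0])
svr_y = SVR(kernel="rbf", C=20.0).fit(features, true_xy[:, 1])

# Personalized predictions replace the base model's regression head.
pred = np.stack([svr_x.predict(features), svr_y.predict(features)], axis=1)
err = np.mean(np.linalg.norm(pred - true_xy, axis=1))
print(f"mean calibration error: {err:.3f}")
```

Fitting one regressor per coordinate on features from the frozen network is what makes the scheme lightweight: only the SVR, not the ConvNet, is retrained per user.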
Incorporating affine transformations emerged as another promising avenue for refining gaze-tracking accuracy. Using the network's predictions as a starting point, the researchers explored correcting them with affine transforms involving shifts, scales, and rotations. This approach significantly reduced the base model's error, demonstrating its potential even though its impact was less pronounced than that of SVR training.
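An affine correction of this kind can be fit by least squares on calibration data. The following is a minimal sketch under assumed conditions (the article does not specify how the transform was fit): `fit_affine` is a hypothetical helper that finds the 2D affine map (shift, scale, rotation/shear) taking raw predictions to calibration targets, verified here on synthetic points.

```python
import numpy as np

def fit_affine(pred, target):
    """Least-squares 2D affine map taking raw network
    predictions to calibration targets."""
    ones = np.ones((len(pred), 1))
    A = np.hstack([pred, ones])          # (N, 3) design matrix [x, y, 1]
    M, *_ = np.linalg.lstsq(A, target, rcond=None)
    return M                             # (3, 2) affine parameters

# Synthetic check: predictions that are a rotated, scaled, shifted
# version of the true gaze points should be corrected almost exactly.
rng = np.random.default_rng(1)
target = rng.uniform(0, 10, size=(30, 2))
theta = 0.1
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
pred = 1.2 * target @ R.T + np.array([0.5, -0.3])

M = fit_affine(pred, target)
corrected = np.hstack([pred, np.ones((30, 1))]) @ M
print(np.abs(corrected - target).max())  # ~0, up to numerical precision
```

Because the correction has only six parameters, it needs far fewer calibration points than the SVR and cannot overfit as easily, which may explain why its gains, while real, are smaller.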
In parallel, enhancements to the PyTorch model were introduced, including adjustments to Batch Normalization's epsilon value and to the learning-rate scheduling parameters. These modifications were designed to improve the model's training dynamics, overall performance, and convergence speed. Visual comparisons between the current and previous implementations revealed subtle differences in output clustering, highlighting the balance between precision and generalization. The implementation was also ported to TensorFlow, achieving results comparable to the previous PyTorch models.
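In PyTorch, the two tweaks mentioned above look roughly like the sketch below. The specific values are illustrative assumptions, since the article does not report the exact epsilon or scheduler settings used; `ReduceLROnPlateau` is shown as one common plateau-based scheduling strategy.

```python
import torch
import torch.nn as nn

# Batch Normalization with a larger epsilon than the 1e-5 default
# (illustrative value; the article does not state the one used).
bn = nn.BatchNorm2d(64, eps=1e-3)

model = nn.Sequential(nn.Conv2d(3, 64, kernel_size=3), bn, nn.ReLU())
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Halve the learning rate once validation error stops improving.
sched = torch.optim.lr_scheduler.ReduceLROnPlateau(
    opt, factor=0.5, patience=3)

# Simulated per-epoch validation errors: one improvement, then a plateau.
for val_err in [2.1, 2.0, 2.0, 2.0, 2.0, 2.0]:
    sched.step(val_err)

print(opt.param_groups[0]["lr"])  # 0.0005 after the plateau triggers one decay
```

The epsilon term sits in BatchNorm's variance denominator, so raising it stabilizes normalization when batch statistics are noisy, at the cost of slightly weaker normalization.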
Furthermore, the evaluation used both the MIT and Google split datasets, with varying train-test splits. SVR personalization produced substantial performance improvements in both split scenarios, though the magnitude of error reduction varied from user to user across the different train-test setups. Challenges stemming from limited data availability were also evident, pointing to ongoing work on refining performance under varying conditions. This analysis lays the foundation for improving gaze-tracking accuracy through a multifaceted approach combining model enhancements with personalized techniques.
Conclusion and Future Work
In summary, the study presents an open-source gaze-tracking solution for smartphones. It examines the model's interaction with the Google model binary, analyzes SVR behavior across different model versions, and improves performance by training with Google's normalization function. Rigorous testing on proprietary app data enabled a thorough comparison with the outputs of Google's binary model, and a comparative analysis with alternatives such as iTracker explored whether expanding the network could improve efficacy. Instances of data leakage were identified during SVR fitting on the Google split version, a concern flagged for future work. Together, these analyses provide a comprehensive evaluation of the model's performance and its potential enhancements.