In a paper published in the journal Engineering Applications of Artificial Intelligence, researchers proposed a novel feature engineering methodology for high-frequency financial data based on time series segmentation. The approach extracts variables from intraday trends and uses them to forecast response variables with artificial intelligence (AI) models. Specifically, the methodology estimates the volatility, duration, and direction of future intraday trends using extreme gradient boosting (XGBoost) for multiclass classification.
Background
The increasing importance of AI in finance has led to the widespread use of machine learning techniques for extracting knowledge from large financial datasets. However, the irregular intervals and multiple variables present in high-frequency data require AI methods that do not assume specific data distributions. The combination of high-frequency data analysis and AI-based forecasting has gained significant interest from scientific and private sectors, with a focus on accurate predictions and higher returns.
Extracting relevant features from financial market data is crucial for effective machine learning. Previous research has primarily concentrated on fixed sampling schemes and horizons, overlooking the need to group volatility values based on intraday trend movements. To address this gap, the authors define a new forecasting problem over intraday trends and develop a methodology to solve it.
Related work
Existing research on applying AI to high-frequency financial data has focused on feature engineering for data subsets of fixed composition. This study instead constructs data subsets of variable composition and proposes a methodology for forecasting intraday volatility and directional movements. Previous studies have used machine learning models such as gradient boosting, random forests, support vector machines (SVM), and artificial neural networks for volatility forecasting. Directional forecasting, on the other hand, has been approached using long short-term memory (LSTM) networks, SVM, and other types of neural networks. The methodology presented in this work differs from these studies in that it forecasts direction over intraday trends of irregular duration rather than over fixed horizons.
Methodology
The proposed methodology extracts features from intraday trends and from the limit order book states within these trends, using a multistage feature engineering approach. The first step partitions the transaction time series into segments of variable length based on the irregular durations of intraday trends. The second step synchronizes the order book states with trade times to obtain the variables associated with each order book state. Finally, multiple transformations are applied to the set of variables within each segment to derive the features that serve as input for the AI models.
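The three steps can be illustrated with a minimal Python sketch. The article does not state the segmentation rule or the exact transformations, so everything here is an assumption made for illustration: the function names (`segment_trends`, `sync_book`, `segment_features`) are hypothetical, a trend is assumed to end when the price reverses by a relative `threshold` from its running extreme, and order book states are matched to trades by taking the latest state at or before each trade time.

```python
import math
from bisect import bisect_right


def segment_trends(prices, threshold):
    """Split a price series into variable-length trend segments.

    Assumed rule (not taken from the paper): a trend ends at its running
    extreme once the price reverses from that extreme by `threshold`.
    Returns inclusive (start, end) index pairs; neighbours share a boundary.
    """
    segments = []
    start = 0       # index where the current trend began
    extreme = 0     # index of the running extreme of the current trend
    direction = 0   # +1 up, -1 down, 0 not yet determined
    for i in range(1, len(prices)):
        p = prices[i]
        if direction == 0:
            change = p / prices[start] - 1.0
            if abs(change) >= threshold:
                direction = 1 if change > 0 else -1
                extreme = i
        elif direction * (p - prices[extreme]) > 0:
            extreme = i  # the trend extends to a new extreme
        elif direction * (prices[extreme] - p) / prices[extreme] >= threshold:
            segments.append((start, extreme))
            start, extreme, direction = extreme, i, -direction
    segments.append((start, len(prices) - 1))
    return segments


def sync_book(trade_times, book_times, book_states):
    """For each trade, pick the latest order book state at or before it."""
    synced = []
    for t in trade_times:
        k = bisect_right(book_times, t) - 1
        synced.append(book_states[k] if k >= 0 else None)
    return synced


def segment_features(times, prices, segment):
    """Derive a few illustrative features for one segment."""
    a, b = segment
    log_returns = [math.log(prices[i] / prices[i - 1]) for i in range(a + 1, b + 1)]
    duration = times[b] - times[a]
    total_return = math.log(prices[b] / prices[a])
    squared = sum(r * r for r in log_returns)
    return {
        "duration": duration,
        "direction": (total_return > 0) - (total_return < 0),
        "log_return": total_return,
        "sq_log_return_per_sec": squared / duration if duration > 0 else 0.0,
    }


# Toy data: trade times in seconds and trade prices (invented values).
times = [0, 1, 2, 3, 4, 5, 6, 7, 8]
prices = [100, 101, 102, 103, 102, 101, 100, 101, 102]
segments = segment_trends(prices, threshold=0.015)
features = [segment_features(times, prices, s) for s in segments]
```

Because the segment boundaries follow the price path rather than a clock, each segment can contain a different number of trades, which is the variable composition the methodology relies on.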
Experimentation and application
The experiments use a dataset of trades and buy/sell orders for 20 assets listed on the Brazilian stock exchange (B3). The dataset spans 206 trading days from July 2, 2018, to May 6, 2019, and undergoes cleaning procedures to remove errors and inconsistencies. The developed methodology is applied to extract features for the three response variables: duration, volatility, and direction.
In the application, forecasting is framed as a set of classification problems for AI models. After the trade series is segmented and features are extracted, each segment is assigned a class label based on the response variable. Embedding is then performed to build the set of samples from the extracted feature vectors, with each sample structured from lagged variables and a specific step. The XGBoost algorithm is chosen for modeling due to its speed and efficiency. Model performance is evaluated using confusion matrices, kappa, and F1-score.
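The labeling and embedding steps can be sketched as follows. This is only an illustration under stated assumptions: the article does not give the labeling thresholds, the number of lags, or the step, so the tercile rule, the helper names `label_by_terciles` and `embed`, and the parameters `n_lags` and `step` are all hypothetical.

```python
def label_by_terciles(values):
    """Assign class 0, 1, or 2 by tercile of the response variable.

    Assumed labeling rule; the paper's actual class boundaries are not
    given in this article.
    """
    ranked = sorted(values)
    n = len(ranked)
    low, high = ranked[n // 3], ranked[(2 * n) // 3]
    return [0 if v < low else (1 if v < high else 2) for v in values]


def embed(features, labels, n_lags, step=1):
    """Build samples from lagged segment feature vectors.

    Each sample concatenates the feature vectors of the `n_lags` segments
    preceding a target segment and is paired with that segment's label.
    Consecutive samples are `step` segments apart.
    """
    X, y = [], []
    for end in range(n_lags, len(features), step):
        window = features[end - n_lags:end]
        X.append([x for vector in window for x in vector])
        y.append(labels[end])
    return X, y


# Toy example: six segments, two features each (invented values).
segment_vectors = [[i, 10 * i] for i in range(6)]
segment_labels = label_by_terciles([1, 2, 3, 4, 5, 6])
X, y = embed(segment_vectors, segment_labels, n_lags=2, step=2)
```

The resulting `X` and `y` have the shape a multiclass classifier expects; with the XGBoost Python package installed, they could be passed to `xgboost.XGBClassifier`, for example.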
Results
The performance metrics for each machine learning model used to forecast volatility, duration, and direction are analyzed. The best results are obtained in volatility estimation, followed by duration and direction. The analysis of variable importance highlights which variables matter most for each response variable. For volatility, variables related to volatility per unit of time and total squared log returns per second are the most important. Duration is influenced by variables such as interval duration and durations between trades. For direction, interval duration, return per second, squared log return per second, and value per second are the most important variables. Overall, trade series variables are more important than those from the limit order book.
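For readers unfamiliar with the metrics behind these comparisons, the sketch below computes a confusion matrix, Cohen's kappa, and an F1-score for a multiclass problem. The labels are invented for illustration and are not results from the paper, and macro averaging of F1 across classes is an assumption, since the article does not say how the per-class scores were combined.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    m = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        m[t][p] += 1
    return m


def cohen_kappa(m):
    """Agreement between truth and prediction, corrected for chance."""
    k = len(m)
    n = sum(sum(row) for row in m)
    observed = sum(m[i][i] for i in range(k)) / n
    expected = sum(
        sum(m[i]) * sum(row[i] for row in m) for i in range(k)
    ) / (n * n)
    return (observed - expected) / (1 - expected)


def macro_f1(m):
    """Unweighted mean of per-class F1 (assumed averaging scheme)."""
    k = len(m)
    scores = []
    for i in range(k):
        tp = m[i][i]
        fp = sum(m[r][i] for r in range(k)) - tp
        fn = sum(m[i]) - tp
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / k


# Invented labels for three classes (for example low, medium, high).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
matrix = confusion_matrix(y_true, y_pred, n_classes=3)
kappa = cohen_kappa(matrix)
f1 = macro_f1(matrix)
```

Kappa is useful alongside F1 here because it discounts the agreement a classifier would reach by chance, which matters when class frequencies are uneven.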
Conclusion
In conclusion, the study developed a feature engineering methodology that extracts features from high-frequency intervals and predicts three response variables: volatility, duration, and direction. The methodology combines time series segmentation with order book data. The best performance is observed in volatility forecasting, followed by duration and direction.
The analysis of variable importance reveals that trade variables have a greater impact than variables from the limit order book. The developed methodology can be applied to other high-frequency time series problems, although larger volumes of observations raise scalability concerns and require balancing dimensionality reduction against information loss.