ISSN: 2685-9572 Buletin Ilmiah Sarjana Teknik Elektro
Vol. 8, No. 2, April 2026, pp. 561-575
Trends and Gaps in Transformer-Based EEG Modeling: A Review of Recent Developments
Yuri Pamungkas 1, Abdul Karim 2, Myo Min Aung 3, Muhammad Nur Afnan Uda 4, Uda Hashim 5
1 Department of Medical Technology, Institut Teknologi Sepuluh Nopember, Indonesia
2 Department of Artificial Intelligence Convergence, Hallym University, Republic of Korea
3 Department of Mechatronics Engineering, Rajamangala University of Technology Thanyaburi, Thailand
4 Department of Electronic Engineering (Computer), Universiti Malaysia Sabah, Malaysia
5 Department of Electrical and Electronics Engineering, Universiti Malaysia Sabah, Malaysia
ARTICLE INFORMATION

Article History: Received 10 October 2025; Revised 30 December 2025; Accepted 07 May 2026

ABSTRACT

In recent years, Transformer-based deep learning architectures have emerged as a powerful paradigm for modeling EEG signals, offering superior capability in capturing spatial–temporal dependencies compared to traditional convolutional or recurrent networks. However, the diversity of model designs, limited dataset generalization, and lack of standardization have created challenges in evaluating their true potential for real-world applications. This review addresses these issues by systematically examining the evolution, performance, and methodological trends of Transformer-based EEG models published between 2022 and 2024, highlighting both achievements and research gaps. The main contribution of this study is to provide a comprehensive mapping and critical analysis of Transformer architectures applied to EEG classification, feature extraction, and signal decoding tasks. Using the Scopus database, a structured search was conducted following specific inclusion criteria (English, peer-reviewed, open-access journal papers from 2022–2024) and a well-defined query combining EEG and Transformer-related keywords. Data from 63 eligible studies were extracted and categorized according to authorship, dataset, architecture type, EEG application, and evaluation metrics. Results show that hybrid Transformer models dominate recent research, achieving accuracies above 90% in tasks such as motor imagery, emotion recognition, seizure detection, and sleep staging. Pure Transformers like ViT and BERT-like models also demonstrate competitive performance but face scalability and interpretability challenges. In conclusion, Transformer-based EEG modeling is advancing rapidly, yet future efforts must focus on model efficiency, explainability, and benchmark standardization to enable broader clinical and real-world adoption.

Keywords: Electroencephalography (EEG); Transformer Architecture; Brain-Computer Interface (BCI); Deep Learning; Attention-Based Modeling

Corresponding Author: Yuri Pamungkas, Department of Medical Technology, Institut Teknologi Sepuluh Nopember, Indonesia. Email: yuri@its.ac.id

This work is open access under a Creative Commons Attribution-Share Alike 4.0 license.
Document Citation: Y. Pamungkas, A. Karim, M. M. Aung, M. N. A. Uda, and U. Hashim, “Trends and Gaps in Transformer-Based EEG Modeling: A Review of Recent Developments,” Buletin Ilmiah Sarjana Teknik Elektro, vol. 8, no. 2, pp. 561-575, 2026, DOI: 10.12928/biste.v8i2.14933. | ||
Electroencephalography (EEG) is a non-invasive brain imaging modality that captures electrical brain activity with high temporal resolution, providing crucial insights into cognitive processes, neural disorders, and brain-computer interfaces (BCIs) [1][2]. Over the past decade, the rapid advancement of artificial intelligence and deep learning has reshaped EEG analysis from conventional feature-based methods toward data-driven modeling [3][4]. Early approaches (such as SVM, k-NN, and shallow neural networks) struggled to address the inherently non-stationary characteristics and high noise levels of EEG signals [5]-[7]. Subsequent developments in deep learning, particularly CNNs and RNNs, improved spatial and temporal feature extraction but still struggled to capture long-range dependencies and interactions across multiple EEG channels [8][9]. This challenge frames a key research problem: how to effectively model holistic spatiotemporal dependencies in EEG data for improved accuracy, generalization, and interpretability.
The rise of Transformer-based architectures has brought a fundamental shift in time-series and biomedical signal modeling [10]. Initially developed for natural language processing (NLP), Transformers employ self-attention mechanisms that enable the model to learn context-aware relationships among temporally or spatially separated elements in a sequence [11]. This capability has demonstrated substantial advantages for EEG signal analysis, where inter-channel and temporal dependencies play a vital role [12]. More recent adaptations (such as Vision Transformers (ViT), Time-Series Transformers, and hybrid CNN-Transformer frameworks) have been successfully applied to EEG tasks including affective state recognition, seizure detection, motor imagery classification, and sleep staging. These models outperform conventional deep networks by leveraging global attention mechanisms to learn spatial-temporal interactions, offering new possibilities for clinical and cognitive applications [13]-[15].
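The core self-attention operation can be illustrated with a minimal pure-Python sketch (single head, identity projections for brevity; real models add learned query/key/value projections, multiple heads, and positional encodings). Here each token might correspond to one EEG time step or channel embedding.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(tokens):
    """Scaled dot-product self-attention over a token sequence.
    `tokens` is a list of feature vectors; each output vector is a
    context-aware (convex) combination of all input vectors."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        # Score this query against every key (here, every token).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in tokens]
        weights = softmax(scores)
        # Weighted sum of value vectors yields the new representation.
        out.append([sum(w * v[j] for w, v in zip(weights, tokens))
                    for j in range(d)])
    return out
```

Because the attention weights for each query sum to one, every output lies inside the convex hull of the inputs; the mechanism mixes information globally across the sequence, which is the property the EEG models reviewed here exploit.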
However, despite the promising performance, several gaps remain in the current Transformer-based EEG research landscape. Many studies use small or single-source datasets, limiting reproducibility and generalizability [16]-[18]. Furthermore, high model complexity and limited interpretability challenge their deployment in real-world biomedical environments [19]. There is also a lack of standardized benchmarks and explainability frameworks to validate Transformer-based EEG models across datasets and applications [20]. Addressing these limitations requires a structured synthesis of existing research to understand what has been achieved and what remains underexplored.
The aim of this review is to systematically analyze and summarize recent developments in Transformer-based EEG modeling, focusing on architectures, datasets, applications, and evaluation strategies. By identifying consistent trends and critical gaps, it provides an evidence-based perspective on how attention-based deep learning is reshaping EEG analysis. The contribution of this review is a comprehensive mapping and critical evaluation of Transformer-based EEG studies, highlighting methodological innovations, performance insights, and future research directions toward more interpretable, scalable, and clinically applicable EEG models. This work not only situates current progress within the broader evolution of AI in biomedical signal processing but also establishes a roadmap for advancing Transformer-based approaches in brain signal understanding.
This review employed a rigorous, structured strategy to identify, select, and analyze recent studies on Transformer-based deep learning for EEG signal analysis. The literature search was conducted using the Scopus database, selected for its extensive coverage of peer-reviewed, high-quality publications across biomedical engineering, neuroscience, and artificial intelligence. The search was performed in October 2025 to ensure the inclusion of the most recent and relevant studies. To retrieve articles that directly addressed the integration of Transformer architectures with EEG data analysis, a precisely designed search query was used, combining specific keywords with Boolean operators. The final search query was as follows: (“transformer” OR “vision transformer” OR “temporal transformer” OR “spatio-temporal transformer” OR “time series transformer”) AND (“EEG” OR “electroencephalography” OR “brain-computer interface” OR “BCI”) AND (“classification” OR “feature extraction” OR “representation learning” OR “signal decoding” OR “mental state recognition” OR “emotion recognition”). This search query was applied to article titles, abstracts, and keywords to maximize relevance while minimizing unrelated results.
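The structure of such a query (OR within keyword groups, AND between groups, restricted to titles, abstracts, and keywords) can be assembled programmatically. The sketch below uses Scopus's `TITLE-ABS-KEY` field code; the helper function name is illustrative, not part of any Scopus tooling.

```python
def build_scopus_query(groups):
    """Combine keyword groups into a Scopus-style Boolean query:
    terms within a group are OR-ed, groups are AND-ed, and the whole
    expression is scoped to titles, abstracts, and keywords."""
    ored = ["(" + " OR ".join(f'"{t}"' for t in terms) + ")"
            for terms in groups]
    return "TITLE-ABS-KEY(" + " AND ".join(ored) + ")"

# The three keyword groups used in this review's search.
architecture = ["transformer", "vision transformer", "temporal transformer",
                "spatio-temporal transformer", "time series transformer"]
signal = ["EEG", "electroencephalography", "brain-computer interface", "BCI"]
task = ["classification", "feature extraction", "representation learning",
        "signal decoding", "mental state recognition", "emotion recognition"]

query = build_scopus_query([architecture, signal, task])
```

Encoding the query this way makes the search reproducible: the exact string can be versioned and re-run when the review is updated.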
The inclusion criteria were defined to ensure both scientific quality and topical precision. Articles were considered eligible if they were published between 2022 and 2024, written in English, and categorized as peer-reviewed journal articles. To enhance transparency and reproducibility, only open-access publications were included, allowing full access to methodological details and results. Conference proceedings, theses, preprints, and non-peer-reviewed documents were excluded. Studies that did not involve EEG data or Transformer-based architectures were also removed during the screening process. This inclusion strategy ensured that the reviewed papers represent the most current, credible, and accessible contributions in the field.
For each study that met the inclusion criteria, detailed information was extracted following a consistent classification scheme. The extracted data included the author(s), year of publication, dataset used, Transformer architecture employed (such as Vision Transformer, Time-Series Transformer, or hybrid models), EEG task or application (for example, emotion recognition, seizure detection, sleep staging, or BCI classification), and evaluation metrics (including accuracy, precision, recall, etc.). This systematic extraction enabled a clear comparison of methodologies and outcomes across studies. The compiled data were then organized into Table 1, enabling thematic categorization and identification of trends, gaps, and future directions. Through this structured methodology, the review aims to provide a comprehensive and evidence-based understanding of how Transformer architectures have been applied in EEG analysis and what methodological challenges remain for future exploration.
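The extraction scheme just described maps naturally onto a simple record type. The sketch below is illustrative (the class and field names are ours, not from the reviewed studies); the example entries reproduce figures reported later in this review.

```python
from dataclasses import dataclass, field

@dataclass
class ReviewedStudy:
    """One row of the extraction table: the fields recorded for
    every eligible paper in this review."""
    authors: str
    year: int
    dataset: str
    architecture: str   # e.g. "Vision Transformer", "CNN-Transformer"
    application: str    # e.g. "motor imagery", "seizure prediction"
    metrics: dict = field(default_factory=dict)  # metric name -> value

def group_by_application(studies):
    """Thematic categorization: bucket studies by EEG task."""
    buckets = {}
    for s in studies:
        buckets.setdefault(s.application, []).append(s)
    return buckets
```

For example, `ReviewedStudy("Xie et al.", 2022, "PhysioNet", "s-Trans", "motor imagery", {"accuracy": 83.31})` captures one entry, and `group_by_application` yields the per-task groupings used for trend analysis.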
Table 1. Summary of Recent Studies on Transformer-based EEG Modeling (2022–2024)
Pure Transformer architectures, which rely solely on self-attention mechanisms without convolutional or recurrent components, have demonstrated remarkable adaptability in EEG modeling between 2022 and 2024. These models, including Vision Transformer (ViT), Time-Series Transformer (TST), and BERT-like variants, leverage their capacity to represent long-range temporal relationships and encode contextual interactions spanning multiple EEG channels. Unlike hybrid networks that combine CNNs or RNNs, pure Transformer designs process EEG data as sequences or tokenized patches, allowing them to learn intrinsic spatial–temporal correlations more efficiently. This shift has improved generalization across subjects and datasets, particularly for tasks requiring high-resolution temporal inference such as seizure detection.
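The patch-tokenization step that turns a multi-channel EEG window into a token sequence can be sketched minimally as follows (non-overlapping temporal patches; real ViT-style models additionally apply a learned linear projection and add positional encodings).

```python
def tokenize_eeg(signal, patch_len):
    """Split a multi-channel EEG window into non-overlapping temporal
    patches and flatten each patch into one token vector, analogous to
    how ViT turns image patches into tokens.
    `signal` is a list of channels, each a list of samples."""
    n_samples = len(signal[0])
    tokens = []
    for start in range(0, n_samples - patch_len + 1, patch_len):
        # One patch = a slice of every channel over the same interval.
        patch = [ch[start:start + patch_len] for ch in signal]
        tokens.append([x for row in patch for x in row])  # flatten
    return tokens
```

A 2-channel window of 8 samples with `patch_len=4` yields 2 tokens of 8 features each; the resulting sequence is what the self-attention layers then operate on.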
One of the earliest implementations was proposed by Xie et al. [21], who introduced multiple Transformer configurations (s-Trans, t-Trans, s-CTrans, t-CTrans, and f-CTrans) applied to PhysioNet EEG motor imagery data. Their model achieved up to 83.31% accuracy in 2-class classification, proving that self-attention alone could effectively model motor-related brain activity. Similarly, Hussein et al. [22] adopted a Multi-Channel Vision Transformer (ViT) for seizure prediction using large-scale EEG datasets (CHB-MIT, AES-Kaggle, Melbourne iEEG) and achieved an outstanding AUC of 0.99 and accuracy of 99.8%, illustrating the capacity of ViT to capture inter-channel relationships in multi-electrode EEG signals. Zhao et al. [42] further validated the robustness of ViT-based models in clinical seizure detection on intracranial EEG from Juntendo Hospital, reaching over 90% accuracy and F1-score of 0.92. These results highlight how pure self-attention mechanisms can excel in pathological EEG tasks traditionally dominated by CNN-based frameworks.
Emotion recognition and mental state decoding have also benefited from pure Transformer designs. Zhou et al. [34] developed a Dual-Channel Transformer trained on DEAP and custom emotion datasets, achieving 97.3% accuracy, while Lu et al. [41] proposed Bi-ViTNet, a bidirectional ViT architecture that enhanced emotion recognition accuracy to 96.25% on SEED and SEED-IV datasets. Both studies demonstrated that global attention improves affective EEG decoding by modeling inter-channel synchronization patterns and temporal dependencies. In a similar context, Hu et al. [70] introduced STAFNet, a self-attention-based Transformer network that achieved 97.9% accuracy in emotion classification, further confirming the dominance of pure attention models in affective computing.
The generalization and scalability of Transformer-based EEG decoding were further explored by Kim et al. [58], who proposed Dfformer, a generalized Transformer architecture for multiple EEG decoding tasks, achieving accuracy up to 84% across datasets such as BCI IV-2a, IV-2b, and Sleep-EDF. Beiramvand et al. [57] also utilized a standard Transformer for mental workload estimation using Muse and Enobio headsets, reporting up to 88% accuracy, emphasizing that even compact, pure Transformer architectures can yield robust performance on low-density EEG systems. Furthermore, Peng et al. [63] proposed an MBMD Transformer for seizure subtype classification across CHSZ and TUSZ datasets, achieving 93.5% accuracy, underscoring the adaptability of self-attention to multi-class clinical EEG problems.
Hybrid Transformer architectures, which integrate self-attention mechanisms with convolutional, recurrent, or graph-based layers, have become the dominant trend in EEG analysis between 2022 and 2024. These models were developed to address the shortcomings of Transformer-only models in capturing localized dependency patterns and to improve computational efficiency on high-dimensional EEG signals. By combining the Transformers’ capacity for global contextual representation with the CNNs’ strong localized feature extraction or the RNNs’ sequential temporal modeling, hybrid frameworks achieve superior performance across diverse EEG tasks such as seizure detection, emotion recognition, sleep staging, and motor imagery classification. This architectural synergy has proven especially effective in addressing the inherent non-stationarity and noise of EEG signals while improving robustness and the transparency of model decision-making; the distribution of pure and hybrid models is summarized in Table 2.
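The division of labor in such hybrids (local feature extraction by convolution, followed by global attention over the resulting feature sequence) can be illustrated with a toy sketch. These are deliberately simplified operations, not the implementation of any specific reviewed architecture.

```python
import math

def conv1d(x, kernel):
    """Valid-mode 1-D convolution: the kind of local feature
    extractor a CNN front-end applies along an EEG channel."""
    k = len(kernel)
    return [sum(x[i + j] * kernel[j] for j in range(k))
            for i in range(len(x) - k + 1)]

def attention_pool(features):
    """Toy global-attention pooling over the conv feature sequence:
    softmax over normalized feature magnitudes weights each
    position's contribution to a single summary value."""
    m = max(abs(f) for f in features) or 1.0
    scores = [math.exp(abs(f) / m) for f in features]
    s = sum(scores)
    return sum(w / s * f for w, f in zip(scores, features))
```

Here `conv1d` with a difference kernel such as `[1, 0, -1]` emphasizes local transients, while `attention_pool` aggregates them globally; in real hybrids the attention stage is a full multi-head Transformer block rather than a scalar pooling.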
Several early studies demonstrated the effectiveness of hybrid Transformer designs. Liu et al. [23] combined a Convolutional Transformer with a ResNet–LSTM backbone to detect mask-wearing from speech-related EEG data (MASC dataset), reporting a UAR of 82.2% and AUC of 0.874. Wang et al. [25] proposed GLU-Oneformer, a hybrid Transformer incorporating gating and convolutional components for driving fatigue detection, achieving 86.97% accuracy and F1-score of 85.23%. Xu et al. [26] employed a Transformer encoder integrated with traditional features for stress classification, reaching up to 92.7% accuracy on a custom dataset. These early implementations established the foundation for combining deep feature extraction and self-attention for physiological signal understanding.
The hybridization trend became more pronounced in 2023, where researchers began systematically blending CNN and Transformer layers for specific EEG applications. Chen et al. [30] designed a Dual-Branch CNN + ViT model for Alzheimer’s disease classification using the OpenNeuro dataset, achieving 80.23% accuracy and AUC of 82.19%. Song et al. [36] introduced EEG Conformer, which fused convolutional front-ends with Transformer encoders to jointly capture temporal and spatial EEG dependencies, achieving up to 92.4% accuracy for both motor imagery and emotion tasks. Likewise, Wang et al. [40] utilized an ECA Swin Transformer for stroke rehabilitation BCI, yielding 87.67% accuracy, illustrating how combining CNN attention modules with Transformer blocks enhances clinical EEG modeling.
Further refinements appeared in 2024 with increasingly complex hybrid designs. Oh et al. [48] proposed a CNN–Transformer hybrid for sleep onset prediction, achieving a mean absolute error (MAE) of 9.8 minutes, while Wang et al. [51] used a CNN–Transformer with Diffusion for depression diagnosis, reaching 93.7% accuracy on multi-source EEG datasets (Mumtaz2016, Arizona2020). Ren et al. [59] combined CNN and Transformer modules to classify error-related potentials, achieving 78.7% accuracy, while Ding et al. [65] introduced CNN-Former for asynchronous SSVEP decoding, reporting 93.2% accuracy. These studies demonstrate how CNN layers improve low-level EEG feature encoding, allowing the Transformer component to focus on learning global contextual dependencies across channels and time segments.
Hybrid architectures have also integrated graph and multimodal learning elements. Wang et al. [28] developed CWTFFNet + Multi-Graph Convolution, combining graph-based EEG representations with Transformer attention for seizure prediction, achieving an AUC up to 0.984. Du et al. [73] proposed MES-CTNet, a multimodal EEG–stimuli Transformer for emotion recognition, reaching up to 98.3% accuracy, while Pradeepkumar et al. [54] introduced a Cross-Modal Transformer for sleep stage classification, achieving 84.7% accuracy by fusing EEG and auxiliary physiological data. These works underscore the potential of hybrid Transformers not only in enhancing feature learning but also in enabling multimodal integration for more comprehensive neural decoding.
Table 2. Distribution of Pure and Hybrid Transformer-Based EEG Models
Category | Transformer Models | References |
Pure Transformer | ViT, DeiT, Swin, Longformer, Global Adaptive Transformer, MSMAE, STAFNet, DFformer, TBEEG, DAMGCN, ADT, MBMD, Bi-ViTNet, ECA Swin, MST-Net, EEGformer, SCAM-Learning | [21],[22],[24],[27]-[29],[33],[35],[37]-[43],[55]-[58],[60]-[64],[66]-[70],[73],[75]-[78],[81],[82] |
Hybrid Transformer | CNN–Transformer, LSTM–Transformer, GRU–Transformer, GCN–Transformer, Capsule–Transformer, GAN–Transformer, Diffusion–Transformer | [23],[25],[26],[30]-[32],[34],[36],[44]-[54],[59],[65],[71]-[74],[79],[80],[83] |
Transformer-based architectures have been widely adopted across a broad spectrum of EEG applications from 2022 to 2024, spanning cognitive, affective, motor, and clinical domains. Their ability to model both spatial and temporal dependencies through self-attention mechanisms has enabled measurable gains in prediction accuracy, robustness to unseen data, and interpretability compared to traditional deep learning methods. As summarized in the reviewed studies, Transformer-based EEG applications can be broadly categorized into five major domains: motor imagery and brain–computer interface (BCI) systems, seizure and neurological disorder detection, emotion and mental state recognition, sleep stage and fatigue monitoring, and cognitive workload or neuropsychological assessment, as illustrated in Figure 1.
Figure 1. Transformer-based EEG application
A major research focus has been on improving the decoding accuracy and cross-subject generalization of EEG-based BCI systems. Xie et al. [21] pioneered the use of multi-variant Transformer models (s-Trans, t-Trans, s-CTrans, t-CTrans, f-CTrans) for motor imagery classification on the PhysioNet dataset, achieving 83.31% accuracy in two-class tasks. Subsequent works such as Chen et al. [35] and Song et al. [36] introduced CSP + Transformer and EEG Conformer architectures, pushing accuracy to 89.6% and 92.4%, respectively, across BCI Competition IV datasets. Hu et al. [39] and Yeom et al. [79] advanced this line with MSATNet and Query-Only Attention Transformer, which achieved accuracies above 84%, emphasizing the role of attention mechanisms in modeling channel-level interactions and improving transfer learning across sessions. Furthermore, Wang et al. [31] and Ding et al. [65] employed Transformer-based decoders for SSVEP BCI classification, reaching 95.4% and 93.2% accuracy, while Lee et al. [52] demonstrated practical control applications through a DeiT-based calibratable network achieving over 80% accuracy in robotic arm control tasks. These studies highlight the transformative role of attention-driven models in enhancing both precision and adaptability of EEG-based BCIs.
Another dominant application area involves seizure prediction and diagnosis of neurological disorders. Hussein et al. [22] utilized a Multi-Channel ViT for seizure prediction using datasets such as CHB-MIT and AES-Kaggle, achieving an AUC of 0.99 and accuracy of 99.8%. Tian et al. [43] demonstrated near-perfect detection (100% accuracy) using a CNN–Transformer hybrid, while Shi et al. [55] proposed B2-ViT, obtaining an AUC up to 0.923. Holguin-Garcia et al. [83] further achieved 99.76% accuracy in seizure classification using a CNN–Transformer Encoder. Beyond epilepsy, Chen et al. [30] and Khan et al. [49] leveraged Transformer-based frameworks for Alzheimer’s diagnosis, reaching 80.23% and 98% accuracy, respectively, and He et al. [27] applied a Transformer–LSTM–GRN hybrid to predict depth of anesthesia from VitalDB, with RMSE 4.7. These outcomes demonstrate how Transformers can capture complex neural signatures for both acute and chronic neurological monitoring.
Transformers have also excelled in affective and cognitive state decoding due to their capacity to learn long-range dependencies in EEG sequences. Zhou et al. [34] and Lu et al. [41] achieved high-performance emotion recognition using Dual-Channel Transformer and Bi-ViTNet, obtaining 97.3% and 96.25% accuracy, respectively. Subsequent studies, including Chen et al. [67] with DAMGCN (99.42%), Hu et al. [70] with STAFNet (97.9%), and Lu et al. [62] with Convolution Interactive Transformer (98.57%), achieved near-perfect results on SEED and DEAP datasets. These findings underscore the efficacy of attention mechanisms in modeling inter-hemispheric synchronization and affective EEG dynamics. Beyond emotion, Tigga et al. [46] used AttGRUT for depression detection (98.67%), while Wang et al. [51] achieved 93.7% in depression diagnosis using a CNN–Transformer + Diffusion model. Together, these studies affirm that Transformer-based models offer reliable decoding of complex emotional and psychiatric EEG patterns.
Transformer-based architectures have also contributed to advancements in sleep and fatigue-related EEG analysis. Yao et al. [33] introduced VSTTN, integrating Swin and Longformer Transformers for sleep stage classification, achieving 89.24% accuracy, while Zhang et al. [50] and Seraphim et al. [80] reached 88.9% and 89% accuracy using TSEDSleepNet and SPDTransNet, respectively. Oh et al. [48] applied a CNN–Transformer hybrid for sleep onset prediction, reporting a mean absolute error of 9.8 minutes, and Ye et al. [75] used CA-ACGAN for fatigue detection, achieving 90.7% accuracy. These models effectively capture temporal continuity across sleep cycles and microstate transitions, enabling fine-grained monitoring for health and cognitive performance applications.
Transformers have also been employed to assess cognitive workload and integrate multimodal neural data. Li et al. [64] proposed MST-Net for cognitive load classification, achieving 89.1% accuracy, while Peng et al. [63] applied MBMD Transformer for seizure subtype analysis, reaching 93.5% accuracy. Liu et al. [53] explored speech envelope reconstruction using ADT Network, demonstrating moderate correlation scores (~0.168), revealing the potential of Transformers in EEG-based auditory decoding. Moreover, Du et al. [73] and Pradeepkumar et al. [54] incorporated cross-modal EEG–stimuli fusion using MES-CTNet and Cross-Modal Transformer, achieving 98.3% and 84.7% accuracy, respectively. These multimodal approaches indicate the growing role of Transformers in integrating cross-sensory or behavioral information to enrich EEG interpretation.
Between 2022 and 2024, Transformer-based EEG modeling has evolved rapidly, revealing several emerging trends that shape the future of neural signal analysis. One prominent trend is the increasing architectural diversity of Transformer designs tailored for EEG signals. Researchers have transitioned from generic self-attention models toward specialized architectures that integrate spatial, temporal, and frequency-domain information. For example, Xie et al. [21] pioneered multiple Transformer variants (s-Trans, t-Trans, f-CTrans) that separately capture spatial and temporal dependencies, while Wang et al. [28] and Yao et al. [33] extended this idea through multi-graph and hybrid attention mechanisms that encode inter-channel connectivity patterns. The introduction of architectures such as VSTTN (Swin + Longformer) [33], EEG Conformer [36], and MSATNet [39] further highlights the growing emphasis on multi-scale attention fusion, allowing models to simultaneously learn local dynamics and long-range dependencies across EEG channels. This diversification of Transformer backbones signals a maturation of the field, where architectures are increasingly domain-specific rather than adapted directly from computer vision or NLP.
Another significant trend is the emergence of emotion and mental-state decoding as a dominant application area for Transformers in EEG research. Studies such as Zhou et al. [34], Lu et al. [41], and Chen et al. [67] demonstrated that self-attention mechanisms excel in recognizing subtle emotional variations by modeling inter-hemispheric synchrony and temporal correlations, achieving accuracies exceeding 97%. Later models such as STAFNet [70], Convolution Interactive Transformer [62], and MES-CTNet [73] pushed this boundary further, incorporating multimodal signals and cross-session generalization, achieving results up to 99.42% accuracy. This trend reflects a paradigm shift from simple classification tasks to more complex cognitive inference, where Transformers serve as interpretable models capable of decoding human affective and psychological states in near real-time.
The third emerging trend is the integration of multimodal and cross-domain learning within Transformer frameworks. Models such as Cross-Modal Transformer [54] and MES-CTNet [73] fused EEG with complementary data sources (such as physiological or visual stimuli) to enhance contextual understanding. Similarly, Liu et al. [53] utilized Transformers for speech envelope reconstruction, bridging auditory and neural data, while Wang et al. [77] explored multimodal consciousness assessment (MutaPT) with accuracy reaching 85.7%. These multimodal approaches signify a growing interest in leveraging Transformers for neurocognitive data fusion, enabling more holistic interpretations of brain activity that align with real-world sensory integration.
A fourth notable direction involves the trend toward lightweight, efficient, and generalizable Transformers for real-world deployment. Researchers have begun optimizing architectures for wearable EEG systems and clinical applications with limited computational resources. Beiramvand et al. [57] achieved 88% accuracy using a compact Transformer on low-density EEG headsets (Muse, Enobio), while Busia et al. [76] developed EEGformer for wearable seizure detection, maintaining competitive accuracy (73–88%) despite reduced electrode counts. Similarly, Kim et al. [58] proposed Dfformer, a generalized Transformer for multiple EEG decoding tasks, illustrating efforts toward scalability and hardware efficiency. This reflects a broader movement toward real-time EEG analytics, where Transformer models are optimized for embedded or edge devices without sacrificing accuracy.

Finally, there is a growing emphasis on explainability and interpretability in Transformer-based EEG modeling. With increasing model complexity, researchers have started integrating attention visualization and explainable AI (XAI) frameworks to interpret the neural features captured by Transformers. For instance, emotion recognition studies such as Lu et al. [62] and Chen et al. [67] incorporated attention map analyses to identify brain regions most influential for classification. This emerging practice enhances transparency and supports clinical adoption, as it aligns with the interpretive demands of medical decision-making.
Despite the impressive progress achieved between 2022 and 2024, Transformer-based EEG modeling still faces several critical challenges that hinder its broader adoption in clinical and real-world applications. One of the foremost gaps lies in the limited generalization and dataset diversity. Many studies, such as those by Xie et al. [21], Chen et al. [35], and Song et al. [36], evaluated their models on benchmark EEG datasets (e.g., PhysioNet, BCI Competition IV), which are relatively small, homogeneous, and collected under controlled laboratory conditions. As a result, models trained on these datasets often fail to generalize across subjects, sessions, or acquisition systems. Only a few works, such as Hussein et al. [22] and Busia et al. [76], attempted cross-dataset validation or wearable EEG integration, yet even these studies lack large-scale, standardized benchmarks that reflect real-world variability. Future research should thus prioritize the creation of multi-center EEG repositories and the adoption of cross-dataset transfer learning frameworks to improve model robustness and reproducibility across diverse recording environments.
Another key limitation is the computational complexity and inefficiency of current Transformer models. While large models such as Multi-Channel ViT [22] and DAMGCN [67] achieved exceptional performance (AUC up to 0.99 and accuracy near 99%), their high parameter count and memory requirements make them impractical for deployment in mobile or clinical devices. This challenge has motivated a recent shift toward lightweight architectures, such as Dfformer [58] and EEGformer [76], but these remain in early development and often sacrifice accuracy for efficiency. Therefore, future efforts should focus on model compression, pruning, and quantization techniques tailored for EEG data, as well as the exploration of efficient Transformer variants (e.g., Performer, Linformer) that preserve performance while reducing resource demand. These strategies are critical to enabling real-time EEG decoding in wearable neurotechnology and point-of-care diagnostic systems.
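The efficiency argument for variants such as Linformer can be made concrete with a back-of-the-envelope cost count: full self-attention scores scale as O(n²d) in sequence length n, while projecting keys and values to a fixed length k reduces this to O(nkd). The function below counts only the score-matrix multiply-accumulates and ignores projections and constants; it is an illustration of the scaling, not a benchmark.

```python
def attention_cost(seq_len, d, proj_len=None):
    """Approximate multiply-accumulate count of the attention score
    computation. Full self-attention compares every query with every
    key (seq_len x seq_len); a Linformer-style projection shrinks the
    key/value length to `proj_len`, making the cost linear in seq_len."""
    k = proj_len if proj_len is not None else seq_len
    return seq_len * k * d
```

For a 1000-token EEG sequence with 64-dimensional embeddings, projecting to length 64 cuts the score cost by roughly a factor of 15, which is the kind of saving that makes on-device EEG decoding plausible.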
A further gap involves the lack of interpretability and neurophysiological validation. Although attention mechanisms inherently offer a degree of explainability, most Transformer-based EEG studies still function as black boxes. Few works, such as Lu et al. [62] and Chen et al. [67], visualized attention weights to highlight task-relevant EEG regions, but systematic approaches linking attention maps to neurophysiological phenomena remain rare. Without such interpretability, clinical adoption is limited, as practitioners require transparent insights into how EEG patterns correspond to cognitive or pathological states. Future research should incorporate explainable AI (XAI) frameworks that combine saliency mapping, channel importance ranking, and temporal attribution to ensure that Transformer decisions align with established neuroscientific knowledge. Integrating these methods with clinician-in-the-loop evaluations could bridge the gap between model interpretability and medical usability.
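One simple attention-based channel attribution, in the spirit of the attention-map analyses cited above, averages the attention mass each channel token receives across all queries. This is an illustrative proxy only; linking such rankings to neurophysiology requires the systematic validation the paragraph above calls for.

```python
def channel_importance(attn, n_channels):
    """Rank channel tokens by the average attention they receive.
    `attn` is a square row-stochastic attention matrix whose rows are
    queries and whose columns index channel tokens; returns channel
    indices sorted from most- to least-attended."""
    n = len(attn)
    received = [sum(row[j] for row in attn) / n for j in range(n)]
    return sorted(range(n_channels), key=lambda j: -received[j])
```

Given a 2-channel attention matrix where most mass falls on the second column, the ranking places channel 1 first; aggregating such rankings over trials is one way to surface candidate task-relevant electrodes for clinician review.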
Another major research gap concerns the underexploration of multimodal and cross-domain learning in EEG Transformers. While some studies, such as Du et al. [73] and Pradeepkumar et al. [54], introduced cross-modal architectures by fusing EEG with other physiological or sensory modalities, most models still rely exclusively on EEG data. This limits their capacity to capture the broader neural–behavioral context underlying brain activity. Future directions should emphasize multimodal Transformers that integrate EEG with eye-tracking, facial expression, fNIRS, or physiological signals (e.g., GSR, ECG). Additionally, self-supervised and contrastive learning approaches could be applied to leverage large amounts of unlabeled EEG data, improving model pretraining and transferability across domains.

Finally, there remains a pressing need for standardized evaluation frameworks and reproducibility protocols. Current studies use heterogeneous metrics, such as accuracy [21],[36],[70], AUC [22],[28],[55], and F1-score [42],[82], making direct performance comparison challenging. The absence of consistent validation splits, hyperparameter transparency, and open-source codebases further restricts scientific reproducibility. To address these issues, future research should establish benchmarking standards for Transformer-based EEG analysis, similar to those in computer vision and NLP, including shared datasets, unified evaluation pipelines, and public repositories, as summarized in Table 3.
Table 3. Research Gaps and Future Directions in Transformer-based EEG Studies
This review highlights the rapid evolution and growing impact of Transformer-based architectures in EEG signal analysis from 2022 to 2024. The findings demonstrate that attention-driven models (both pure Transformers and hybrid variants) have achieved remarkable improvements across diverse EEG applications, including motor imagery, seizure detection, emotion recognition, sleep staging, and cognitive load estimation. The ability of Transformers to model global temporal–spatial dependencies has enabled them to outperform traditional deep learning methods while offering new perspectives for feature representation and interpretability. Furthermore, the emergence of domain-specific adaptations such as ViT, Swin Transformer, and EEG Conformer reflects the field’s shift toward architectures tailored to the intrinsic characteristics of EEG data. The integration of attention mechanisms with convolutional, recurrent, and graph-based modules has also proven highly effective in capturing multi-scale dynamics and enhancing robustness across datasets and subjects.
However, despite these advancements, several unresolved obstacles still restrict the practical deployment of Transformer-based EEG models. Current research remains constrained by small, homogeneous datasets, high computational demands, and limited interpretability, which collectively hinder real-world translation. The lack of unified benchmarking standards and well-established reproducibility protocols further impedes objective performance evaluation and reliable cross-study comparison. Future research should focus on designing Transformer architectures that balance computational efficiency with interpretability, leveraging self-supervised and multimodal learning, and establishing open EEG benchmarks that support large-scale, cross-domain training. By addressing these limitations, the next generation of Transformer-based EEG systems can progress from high-performance experimental models toward clinically reliable and scalable neurotechnology capable of transforming brain–computer interaction, cognitive monitoring, and neurological diagnostics.
DECLARATION
Author Contribution
All authors contributed equally as the main contributors to this paper. All authors read and approved the final paper.
Acknowledgement
The authors would like to acknowledge the Department of Medical Technology, Institut Teknologi Sepuluh Nopember, for the facilities and support provided for this research. The authors also gratefully acknowledge financial support from the Institut Teknologi Sepuluh Nopember under the Publication Writing and IPR Incentive Program (PPHKI) 2025 scheme.
Conflicts of Interest
The authors declare no conflict of interest.
REFERENCES