ISSN: 2685-9572 Buletin Ilmiah Sarjana Teknik Elektro
Vol. 8, No. 3, June 2026, pp. 673-683
Transformer-Based Semantic Retrieval for Cultural Heritage Question Answering
Tri Lathif Mardi Suryanto 1,2, Aji Prasetya Wibawa 1, Hariyono 3, Andrew Nafalski 4
1 Department of Electrical Engineering and Informatics, Universitas Negeri Malang, Indonesia
2 Department of Information System, Universitas Pembangunan Nasional Veteran Jawa Timur, Indonesia
3 Department of History, Universitas Negeri Malang, Indonesia
4 Department of Electrical Engineering, University of South Australia, Australia
ARTICLE INFORMATION
Article History: Received 03 January 2026; Revised 27 April 2026; Accepted 25 May 2026
Keywords: Cultural Heritage QA; Transformer-Based Retrieval; Domain-Specific Chatbot; Semantic Similarity; Epistemic Fidelity
Corresponding Author: Aji Prasetya Wibawa, Department of Electrical Engineering and Informatics, Universitas Negeri Malang, Indonesia. Email: aji.prasetya.ft@um.ac.id
This work is open access under a Creative Commons Attribution-Share Alike 4.0 license.
Document Citation: T. L. M. Suryanto, A. P. Wibawa, H. Hariyono, and A. Nafalski, “Transformer-Based Semantic Retrieval for Cultural Heritage Question Answering,” Buletin Ilmiah Sarjana Teknik Elektro, vol. 8, no. 3, pp. 673-683, 2026, DOI: 10.12928/biste.v8i3.15775.
ABSTRACT
Cultural heritage knowledge presents significant challenges for Question Answering (QA) systems due to its interpretive, context-dependent, and symbolically rich nature. While Transformer-based models have achieved strong performance in semantic representation, they remain prone to hallucination and contextual misalignment, particularly in culturally sensitive domains. This study proposes a Transformer-based cultural knowledge retrieval framework for domain-specific chatbots, combining a bi-encoder (MiniLM and MPNet) for efficient semantic retrieval and a cross-encoder (BERT-base) for fine-grained reranking. A curated dataset of 4,016 question–answer pairs in Indonesian is developed from cultural heritage sources and validated for contextual consistency. The proposed approach is evaluated using both quantitative and qualitative metrics, including accuracy, F1-score, Exact Match (EM), and semantic-based measures such as F1-BLEU, F1-EDIT, and F1-ANS. Experimental results show that while all models achieve high classification performance (accuracy up to 0.99), the BERT + MPNet configuration clearly outperforms the others in answer quality metrics, indicating superior semantic fidelity. However, qualitative analysis reveals persistent issues of hallucination and contextual misalignment, highlighting the limitations of relying solely on statistical evaluation. These findings demonstrate that high numerical performance does not guarantee meaningful understanding in cultural domains. Therefore, this study emphasizes the need for hybrid evaluation frameworks and context-aware mechanisms to ensure epistemic fidelity. The proposed approach contributes to the development of more reliable and culturally grounded QA systems.
Cultural heritage knowledge represents a complex and interpretive domain where meaning is constructed through historical context, symbolic representation, and socio-cultural perspectives. Unlike general factual knowledge, cultural information is inherently ambiguous, multi-layered, and often context-dependent, making it difficult to model using conventional Natural Language Processing (NLP) approaches [1]-[3]. In recent years, Transformer-based architectures such as BERT and its variants [4]-[8] have demonstrated remarkable capabilities in capturing contextual semantics and improving performance across a wide range of language understanding tasks [9][10]. These advances have encouraged their adoption in domain-specific Question Answering (QA) systems, including applications in education [11]-[15], healthcare [16]-[19], and tourism [20]-[24], where chatbots are increasingly becoming the preferred choice for providing fast and responsive information services.
However, despite their strong representational power, Transformer-based models, particularly large generative language models, remain prone to producing responses that are semantically plausible but factually or contextually incorrect. This phenomenon, often referred to as hallucination, becomes significantly more problematic in cultural heritage domains, where inaccuracies may distort historical meaning and cultural interpretation [25][26]. Recent studies in cultural heritage AI systems also highlight that preserving contextual fidelity is more critical than maximizing surface-level semantic similarity [27]-[29], as cultural knowledge requires alignment with interpretive context rather than statistical likelihood alone.
To address these limitations, retrieval-based QA systems (Table 1) have emerged as a more reliable paradigm for domain-specific applications. By constraining answers to a predefined knowledge base, retrieval approaches reduce the risk of hallucination and ensure that generated responses remain grounded in verified sources [30][31]. Furthermore, embedding-based retrieval using Transformer encoders enables semantic matching beyond keyword overlap, allowing the system to handle linguistic variability and partial semantic equivalence [32][33]. Nevertheless, standard retrieval methods still face challenges in capturing fine-grained semantic relationships and resolving contextual ambiguity, particularly in culturally rich texts where polysemy and interpretive variation are prevalent.
Another critical limitation lies in the use of general-purpose multilingual models such as mBERT or XLM-R, which are not specifically optimized for culturally grounded domains. These models often suffer from semantic dilution and domain mismatch, resulting in reduced accuracy when applied to localized cultural knowledge [20][6][27]. In addition, conventional keyword-based retrieval techniques are inadequate for capturing contextual nuance, especially when identical terms may carry different meanings depending on historical or regional interpretation.
Table 1. Comparison with Existing QA Systems
Study | Domain | Approach | Model | Evaluation Focus | Limitation |
 | COVID-19 QA | Retrieval-based | BERT | Accuracy, F1 | Limited semantic evaluation |
 | Cultural Chatbot | Multimodal LLM | MLLM | Context awareness | Hallucination risk |
 | Cultural Heritage QA | Knowledge Graph-based | KG-QA | Contextual relevance | Limited scalability |
 | Tourism Chatbot | Retrieval-based | BERT | Accuracy | No answer-quality evaluation |
 | Bangla QA | Domain-specific QA | BERT fine-tuning | EM, F1 | Weak contextual generalization |
 | Knowledge Graph QA | Hybrid (KG + BERT) | BERT + KG | Semantic accuracy | High complexity |
 | Industrial QA | LLM + Knowledge | LLM-based | Answer correctness | Expensive, unstable |
In this study, we propose a Transformer-based cultural knowledge retrieval framework designed specifically for domain-specific chatbot applications. The proposed system adopts a two-stage architecture: a bi-encoder model (MiniLM and MPNet) is first employed to perform efficient semantic retrieval of candidate answers, followed by a cross-encoder (BERT-base) that performs fine-grained reranking by modeling direct interactions between query and answer pairs. This hybrid approach leverages the efficiency of dense retrieval while preserving the contextual sensitivity of cross-encoder architectures, which have been shown to provide superior performance in relevance ranking tasks [32][10].
To support this study, we construct a domain-specific dataset of question–answer pairs derived from Indonesian cultural heritage texts. The dataset is curated from historical narratives, cultural artifacts, and interpretive sources, and validated to ensure contextual consistency and epistemic reliability. All data are presented in Indonesian, reflecting the linguistic and cultural specificity of the domain. Overall, this work aims to bridge the gap between high-performing Transformer models and the need for culturally grounded, context-aware QA systems by emphasizing not only retrieval accuracy but also epistemic fidelity and interpretive alignment in cultural knowledge representation.
This study utilizes a domain-specific Question Answering (QA) dataset consisting of 4,016 question–answer pairs in Indonesian, specifically curated from cultural heritage sources. The dataset is constructed using a many-question-to-one-answer scheme, where multiple semantically related questions correspond to a single grounded answer. The data sources include historical narratives, cultural artifact descriptions, inscriptions, and interpretive literature related to Indonesian heritage.
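The many-question-to-one-answer scheme can be pictured as several questions pointing at the same answer identifier. The following is a minimal sketch under stated assumptions: the field names (`question`, `answer_id`) and the toy entries are hypothetical and are not taken from the actual corpus.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class QAPair:
    question: str
    answer_id: int  # hypothetical identifier of the grounded answer


# Toy answers and questions for illustration only (not from the real dataset).
answers = {
    0: "durga dimaknai sebagai figur penjaga keseimbangan.",
    1: "prasasti adalah tulisan pada batu atau logam.",
}
pairs = [
    QAPair("apa peran durga dalam relief jawa timur?", 0),
    QAPair("bagaimana durga dimaknai dalam relief jawa timur?", 0),
    QAPair("apa itu prasasti?", 1),
]


def group_by_answer(qa_pairs):
    """Collect all paraphrased questions that share one grounded answer."""
    grouped = defaultdict(list)
    for pair in qa_pairs:
        grouped[pair.answer_id].append(pair.question)
    return dict(grouped)


questions_per_answer = group_by_answer(pairs)
```

Grouping by answer identifier makes it easy to see how many paraphrases support each grounded answer.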
To ensure epistemic validity, the dataset undergoes a manual validation process involving domain-informed curation, where each QA pair is verified for contextual consistency and historical alignment. This process is essential to minimize semantic drift and preserve cultural meaning.
Figure 1 shows the preprocessing pipeline applied to the textual corpus, including normalization, Transformer-compatible tokenization, context-aware stopword handling, and semantic deduplication to ensure data quality and consistency prior to model training.
Figure 1. Text Preprocessing Pipeline for Semantic-Aware Data Preparation
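The pipeline steps can be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation: the stopword list and protected interrogatives are hypothetical, a simple regular expression stands in for the model tokenizer, and token-set Jaccard overlap is swapped in for embedding-based semantic deduplication so that the example runs without any model.

```python
import re
import unicodedata

# Hypothetical stopword list and protected terms; the paper does not list them.
STOPWORDS = {"yang", "dan", "di", "itu", "apa"}
PROTECTED = {"apa", "bagaimana", "mengapa"}  # interrogatives kept for QA


def normalize(text):
    """Unicode-normalize, lowercase, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    return re.sub(r"\s+", " ", text).strip()


def content_tokens(text):
    """Word tokens minus stopwords, while keeping protected terms."""
    tokens = re.findall(r"\w+", normalize(text))
    return [t for t in tokens if t in PROTECTED or t not in STOPWORDS]


def jaccard(a, b):
    """Token-set overlap, a cheap stand-in for embedding similarity."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0


def deduplicate(texts, threshold=0.9):
    """Keep the first of any group of near-identical texts."""
    kept, kept_tokens = [], []
    for text in texts:
        tokens = content_tokens(text)
        if all(jaccard(tokens, seen) < threshold for seen in kept_tokens):
            kept.append(normalize(text))
            kept_tokens.append(tokens)
    return kept


cleaned = deduplicate([
    "Apa peran Dewi  Durga?",
    "apa peran dewi durga?",
    "Apa itu prasasti?",
])
```

The threshold of 0.9 is an arbitrary illustrative value; with real sentence embeddings the same loop would compare cosine similarities instead.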
The proposed system adopts a two-stage retrieval architecture designed to balance efficiency and semantic precision in domain-specific QA. Figure 2 gives an overview of the architecture, which consists of representation learning (query and document encoding), bi-encoder-based retrieval for candidate generation, semantic filtering, cross-encoder reranking for relevance refinement, and answer construction, followed by evaluation.
Figure 2. Proposed Retrieval–Reranking Architecture for Domain-Specific Chatbots
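In the first stage, the query $q$ and each candidate answer $a_i$ are independently encoded into dense vector representations using pre-trained Transformer-based sentence encoders, namely MiniLM [37] and MPNet [10]. The semantic similarity between the query and each candidate answer is computed using cosine similarity:
$$\mathrm{sim}(q, a_i) = \frac{\mathbf{e}_q \cdot \mathbf{e}_{a_i}}{\lVert \mathbf{e}_q \rVert \, \lVert \mathbf{e}_{a_i} \rVert} \tag{1}$$
where $\mathbf{e}_q$ and $\mathbf{e}_{a_i}$ denote the embedding vectors of the query and candidate answer, respectively.
The top-$k$ candidates are then selected based on similarity scores:
$$\mathcal{C}_k = \operatorname{TopK}_{a_i \in \mathcal{A}} \; \mathrm{sim}(q, a_i) \tag{2}$$
where $\mathcal{A}$ denotes the set of all candidate answers.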
This stage enables efficient large-scale retrieval while capturing semantic similarity beyond lexical overlap [32][33].
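The first stage can be illustrated with a minimal sketch of cosine similarity and top-k selection. The three-dimensional vectors below are toy values standing in for MiniLM or MPNet sentence embeddings, and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot / (norm_u * norm_v)


def top_k(query_vec, answer_vecs, k):
    """Indices and scores of the k answers most similar to the query, best first."""
    scored = [(i, cosine_similarity(query_vec, vec))
              for i, vec in enumerate(answer_vecs)]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional embeddings standing in for sentence-encoder outputs.
query = [1.0, 0.0, 1.0]
answer_embeddings = [
    [1.0, 0.0, 1.0],  # same direction as the query
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [1.0, 1.0, 1.0],  # partially aligned
]
candidates = top_k(query, answer_embeddings, k=2)
```

In a deployed system the answer embeddings would typically be computed once and cached, so that only the query needs to be encoded at inference time.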
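In the second stage, a cross-encoder model (BERT-base) is employed to rerank the retrieved candidates. Unlike the bi-encoder, the cross-encoder jointly encodes the query–answer pair, allowing it to model fine-grained interactions:
$$s(q, a_i) = \mathrm{CrossEncoder}\big(\texttt{[CLS]}\; q \;\texttt{[SEP]}\; a_i \;\texttt{[SEP]}\big) \tag{3}$$
where $q$ is the query, $a_i$ is a candidate answer retrieved in the first stage, and $s(q, a_i)$ is the resulting relevance score. This mechanism enables deeper contextual understanding and improves ranking accuracy, particularly in cases involving semantic ambiguity and cultural nuance [28].
The final answer is selected as:
$$a^{*} = \arg\max_{a_i \in \mathcal{C}_k} s(q, a_i) \tag{4}$$
where $\mathcal{C}_k$ is the set of top-$k$ candidates returned by the first stage.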
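The reranking and final selection steps can be sketched as below. The scoring function is a hypothetical word-overlap stand-in for the BERT-base cross-encoder, used only so that the example runs without a trained model; the pair formatting follows the input format listed in Table 2.

```python
def build_pair_input(query, answer):
    """Format a query-answer pair as [CLS] query [SEP] answer [SEP]."""
    return f"[CLS] {query} [SEP] {answer} [SEP]"


def toy_score(query, answer):
    """Hypothetical stand-in for the cross-encoder score: word-overlap ratio."""
    q_words = set(query.lower().split())
    a_words = set(answer.lower().split())
    return len(q_words & a_words) / len(q_words) if q_words else 0.0


def rerank(query, candidates, score_fn):
    """Score every (query, candidate) pair jointly and sort best first."""
    scored = [(answer, score_fn(query, answer)) for answer in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)


query = "pemaknaan durga dalam penafsiran lokal jawa timur"
candidates = [
    "durga dalam tradisi hindu dipuja sebagai dewi pelindung",
    "dalam penafsiran lokal jawa timur durga dimaknai sebagai simbol kekuatan yang ambigu",
]
ranked = rerank(query, candidates, toy_score)
best_answer = ranked[0][0]  # arg max over the reranked candidates
```

Because `rerank` accepts any scoring function, a real cross-encoder could be plugged in without changing the selection logic. In practice the special tokens are normally inserted by the model tokenizer rather than by string formatting.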
The models are trained using a supervised approach on the cultural heritage QA dataset (Table 2). In the first stage, the bi-encoder learns to represent queries and answers in a semantic space, so that relevant pairs are closer while irrelevant ones are farther apart. This enables efficient retrieval of candidate answers. In the second stage, the cross-encoder evaluates each query–answer pair jointly to produce more accurate relevance scores. This step helps refine the ranking by capturing deeper contextual relationships. Overall, this two-stage approach balances efficiency and accuracy, allowing the system to retrieve relevant answers while maintaining contextual precision in a domain-specific setting.
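The two training objectives named in Table 2 can be illustrated per example. This sketch assumes one common margin-based form of contrastive loss and the standard binary cross-entropy; the margin of 0.5, the factor of one half, and the use of a generic distance (for example one minus cosine similarity) are assumptions, since the paper does not state the exact formulation.

```python
import math


def contrastive_loss(distance, is_relevant, margin=0.5):
    """Margin-based contrastive loss for one query-answer pair.

    Relevant pairs are pulled together (penalize distance); irrelevant
    pairs are pushed apart until they are at least `margin` away.
    """
    if is_relevant:
        return 0.5 * distance ** 2
    return 0.5 * max(0.0, margin - distance) ** 2


def binary_cross_entropy(probability, is_relevant, eps=1e-12):
    """Cross-entropy between a predicted relevance probability and a 0/1 label."""
    p = min(max(probability, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -math.log(p) if is_relevant else -math.log(1.0 - p)


# A relevant pair that is already close contributes almost no loss,
# while an irrelevant pair inside the margin is still penalized.
close_relevant = contrastive_loss(0.05, is_relevant=True)
near_irrelevant = contrastive_loss(0.2, is_relevant=False)
uncertain_prediction = binary_cross_entropy(0.5, is_relevant=True)
```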
Given that cross-encoder models are computationally expensive, the proposed architecture is designed to balance performance and efficiency by limiting the reranking process to the top-k retrieved candidates. In this framework, the bi-encoder performs fast semantic retrieval through parallel encoding, enabling efficient candidate selection from the entire dataset. The cross-encoder, while more computationally intensive, is applied only to this reduced candidate set to perform fine-grained relevance scoring. This two-stage mechanism significantly reduces inference time compared to applying a full cross-encoder over all candidates, thereby making the system more practical for real-world chatbot deployment scenarios.
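The saving can be made explicit with a rough cost model. The symbols below are illustrative assumptions rather than measured quantities: $N$ is the number of candidate answers, $k$ the rerank depth ($k = 10$ in Table 2), $c_{\text{enc}}$ the cost of encoding the query once, $c_{\text{sim}}$ the cost of one cosine similarity, and $c_{\text{ce}}$ the cost of one cross-encoder forward pass.

```latex
\begin{align*}
C_{\text{full}} &= N \, c_{\text{ce}}, \\
C_{\text{two-stage}} &= c_{\text{enc}} + N \, c_{\text{sim}} + k \, c_{\text{ce}}, \\
\frac{\text{cross-encoder passes (full)}}{\text{cross-encoder passes (two-stage)}} &= \frac{N}{k}.
\end{align*}
```

Since a cosine similarity over cached embeddings is far cheaper than a Transformer forward pass, the two-stage cost is dominated by the $k \, c_{\text{ce}}$ term whenever $N$ is much larger than $k$.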
To evaluate retrieval effectiveness and answer quality, multiple metrics are employed to capture both statistical performance and semantic fidelity. Accuracy is used to measure the correctness of selected answers within a constrained candidate space, reflecting the system’s ability to identify the most relevant response. The F1-score is adopted to balance precision and recall in relevance prediction tasks [38][39]. Exact Match (EM) is used as a strict metric to assess whether the predicted answer exactly matches the ground truth, which is commonly applied in QA benchmarks [40].
To further evaluate answer quality beyond exact matching, F1-BLEU is utilized to measure lexical and semantic overlap between predicted and reference answers [41]. Additionally, F1-EDIT is applied to capture structural similarity based on token-level transformations, while F1-ANS is introduced to assess answer-level semantic correctness, particularly in domain-specific QA settings where contextual alignment is critical [42][43]. This combination of metrics enables a more comprehensive evaluation, bridging quantitative performance and contextual relevance in cultural knowledge retrieval systems. In addition to these metrics, retrieval-oriented evaluation such as ranking-based measures can be incorporated to further assess system effectiveness in candidate selection scenarios [30].
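As an illustration of the stricter and softer ends of this metric set, the sketch below computes Exact Match and a token-overlap F1 in the style commonly used in QA benchmarks. The normalization rule (lowercasing and keeping word characters only) is an assumption. F1-BLEU, F1-EDIT, and F1-ANS are not sketched because their exact formulas are not given in this paper.

```python
import re
from collections import Counter


def normalize_answer(text):
    """Lowercase and split into word tokens, dropping punctuation."""
    return re.findall(r"\w+", text.lower())


def exact_match(prediction, reference):
    """1 if the normalized token sequences are identical, else 0."""
    return int(normalize_answer(prediction) == normalize_answer(reference))


def token_f1(prediction, reference):
    """Harmonic mean of token-level precision and recall."""
    pred = normalize_answer(prediction)
    ref = normalize_answer(reference)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


reference = ("dalam penafsiran lokal jawa timur, dewi durga dimaknai sebagai "
             "simbol kekuatan yang bersifat ambigu dan kontekstual.")
prediction = "dewi durga dalam tradisi hindu dipuja sebagai dewi pelindung."

em_score = exact_match(prediction, reference)
f1_score = token_f1(prediction, reference)
```

For this pair the prediction shares only four tokens with the reference, so Exact Match is 0 while token F1 stays above zero, which shows how overlap-based scores can reward answers that miss the intended interpretation.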
Table 2. Training Configuration
Component | Parameter | Value |
Optimization | Optimizer | AdamW |
Optimization | Learning Rate | $2 \times \ldots$ |
Optimization | Batch Size | 16 |
Optimization | Epochs | 50 |
Bi-Encoder Training | Loss Function | Contrastive Loss |
Bi-Encoder Training | Embedding Model | MiniLM / MPNet |
Bi-Encoder Training | Top-k Retrieval | k = 10 |
Cross-Encoder Training | Model | BERT-base |
Cross-Encoder Training | Loss Function | Binary Cross-Entropy |
Cross-Encoder Training | Input Format | [CLS] query [SEP] answer [SEP] |
Regularization | Dropout | 0.1 |
Evaluation Setup | Train-Test Split | 80:20 |
Evaluation Setup | Validation Strategy | Held-out test set |
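The 80:20 held-out split in Table 2 can be sketched as a seeded shuffle over example indices. The seed value and the rounding rule are assumptions; the paper does not state either, nor whether the split is made at the question level or the answer level.

```python
import random


def train_test_split_indices(n_items, train_ratio=0.8, seed=42):
    """Shuffle indices reproducibly and cut them into train and test parts."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # local RNG, global state untouched
    cut = int(n_items * train_ratio)      # rounds down (assumed rule)
    return indices[:cut], indices[cut:]


# 4,016 QA pairs as reported for the dataset.
train_idx, test_idx = train_test_split_indices(4016)
```

With 4,016 pairs this rule yields 3,212 training and 804 test examples. Because several questions share one answer, a question-level split like this one can place paraphrases of the same answer on both sides; splitting by answer group would be a stricter alternative.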
The dataset used in this study consists of 4,016 question–answer pairs in Indonesian, specifically curated from cultural heritage sources. The statistical distribution (Table 3) shows relatively short textual units, with average lengths of 9.16 tokens for questions and 9.94 tokens for answers, indicating a concise and focused QA structure. This characteristic is consistent with domain-specific QA datasets, where questions tend to be direct and contextually grounded rather than open-ended [44][45]. Furthermore, the vocabulary size of 6,251 unique tokens reflects a moderate level of lexical diversity, sufficient to represent cultural narratives while maintaining domain consistency.
Table 3. Dataset Statistics of the QA Corpus
Statistic | Value |
Total QA pairs | 4,016 |
Average context length (tokens) | 9.94 |
Maximum context length (tokens) | 46 |
Average question length (tokens) | 9.16 |
Average answer length (tokens) | 9.94 |
Vocabulary size | 6,251 |
Percentage answerable questions | 100% |
Language | Indonesian |
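Statistics of the kind reported in Table 3 can be computed with a few lines of code. This sketch assumes whitespace tokenization and uses two toy pairs; the tokenization actually used for Table 3 is not specified, so the numbers it produces are not comparable to the reported values.

```python
def corpus_statistics(pairs):
    """Summarize a list of (question, answer) strings using whitespace tokens."""
    question_lengths = [len(q.split()) for q, _ in pairs]
    answer_lengths = [len(a.split()) for _, a in pairs]
    vocabulary = {tok for q, a in pairs for tok in (q + " " + a).split()}
    return {
        "total_pairs": len(pairs),
        "avg_question_len": sum(question_lengths) / len(pairs),
        "avg_answer_len": sum(answer_lengths) / len(pairs),
        "max_answer_len": max(answer_lengths),
        "vocab_size": len(vocabulary),
    }


# Toy pairs for illustration only (not from the real dataset).
toy_pairs = [
    ("apa itu prasasti", "prasasti adalah tulisan kuno"),
    ("siapa durga", "durga adalah dewi"),
]
stats = corpus_statistics(toy_pairs)
```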
The word frequency distribution further confirms that the dataset is highly question-driven, dominated by interrogative terms such as “what,” “how,” and “why,” which aligns with typical QA corpus structures [46][47]. More importantly, the prominence of culturally significant terms such as Durga, Dewi, and prasasti indicates that the dataset captures domain-specific semantic signals, which are critical for training context-aware retrieval systems. This observation is consistent with prior studies emphasizing the importance of domain-specific corpora in improving semantic retrieval and QA performance [20],[6],[28]. Unlike general-purpose datasets such as SQuAD, which focus on factual comprehension, this dataset embeds interpretive and symbolic knowledge, making it more challenging yet more representative of real-world cultural QA tasks (Figure 3 and Figure 4).
Figure 3. (a) Word Cloud Before Cleaning (Raw Text with Capitalization); (b) Word Cloud After Cleaning (Lowercased and Normalized)
Figure 4. Top 10 Most Frequent Words
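A frequency ranking like the one in Figure 4 can be produced with a simple counter. The sketch below assumes lowercasing and word-character tokenization, and uses toy questions rather than the actual corpus.

```python
import re
from collections import Counter


def top_words(texts, n=10):
    """Return the n most frequent lowercased word tokens with their counts."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"\w+", text.lower()))
    return counts.most_common(n)


# Toy questions for illustration only (not from the real dataset).
toy_questions = [
    "Apa peran Dewi Durga?",
    "Apa itu prasasti?",
    "Bagaimana Durga dimaknai?",
]
most_frequent = top_words(toy_questions, n=2)
```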
The comparative results (Table 4) demonstrate that all evaluated models achieve consistently high performance in classification-oriented metrics, with precision, F1-score, and accuracy values ranging between 0.96 and 0.99. This indicates that Transformer-based encoders, regardless of their configuration, are generally effective in identifying relevant answer candidates within a constrained retrieval space. Such findings align with previous studies showing that BERT-based architectures [9],[38][39] excel in text classification and relevance detection tasks due to their strong contextual representation capabilities.
Table 4. Comparative Results
Model Learning | Precision | F1-score | Accuracy | F1-BLEU | F1-EDIT | F1-ANS | EM |
BERT-base | 0.9720 | 0.9860 | 0.9750 | 0.8200 | 0.8600 | 0.8550 | 0.6804 |
BERT + MiniLM | 0.9563 | 0.9777 | 0.9600 | 0.6777 | 0.7499 | 0.7040 | 0.6321 |
BERT + Multilingual-MiniLM | 0.9887 | 0.9943 | 0.9900 | 0.5886 | 0.6761 | 0.6164 | 0.6534 |
BERT + MPNet | 0.9856 | 0.9928 | 0.9900 | 0.9382 | 0.9511 | 0.9511 | 0.7162 |
However, a more critical insight emerges when examining answer quality metrics. The BERT + MPNet configuration significantly outperforms all other models in F1-BLEU (0.9382), F1-EDIT (0.9511), and F1-ANS (0.9511), indicating superior semantic alignment and structural fidelity. This result supports prior research demonstrating that MPNet, which integrates masked and permuted language modeling, produces richer sentence embeddings compared to earlier models such as BERT and MiniLM [10]. In contrast, MiniLM-based models, while computationally efficient, exhibit noticeable degradation in answer quality, suggesting limitations in capturing fine-grained semantic relationships. Similar trade-offs between efficiency and semantic richness have been reported in lightweight Transformer variants [37],[48].
Interestingly, the multilingual MiniLM model achieves the highest classification scores but performs poorly in answer quality metrics. This discrepancy highlights the phenomenon of semantic dilution, where multilingual representations sacrifice domain-specific precision in favor of broader generalization [49],[27]. This finding is consistent with studies indicating that multilingual models often underperform in specialized domains due to insufficient contextual grounding [6],[50]. Therefore, while multilingual models are advantageous for cross-lingual applications, they may not be optimal for culturally specific knowledge retrieval tasks.
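The discrepancy can be quantified directly from Table 4 by subtracting the answer-quality score from the classification score for each model. The sketch below uses the reported accuracy and F1-ANS values; the choice of these two columns as a "gap" indicator is an illustrative assumption, not a metric defined in this study.

```python
# Accuracy and F1-ANS values copied from Table 4.
results = {
    "BERT-base": {"accuracy": 0.9750, "f1_ans": 0.8550},
    "BERT + MiniLM": {"accuracy": 0.9600, "f1_ans": 0.7040},
    "BERT + Multilingual-MiniLM": {"accuracy": 0.9900, "f1_ans": 0.6164},
    "BERT + MPNet": {"accuracy": 0.9900, "f1_ans": 0.9511},
}

# Gap between classification performance and answer quality per model.
gaps = {
    model: round(scores["accuracy"] - scores["f1_ans"], 4)
    for model, scores in results.items()
}

largest_gap_model = max(gaps, key=gaps.get)
smallest_gap_model = min(gaps, key=gaps.get)
```

On these numbers the multilingual MiniLM configuration shows the widest gap (about 0.37) and BERT + MPNet the narrowest (about 0.04), which matches the pattern described above.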
The training and validation loss curves (Figure 5) reveal clear differences in convergence behavior across models. The BERT + MPNet configuration consistently achieves the lowest loss values and demonstrates stable convergence throughout 50 epochs, indicating strong generalization capability. This observation aligns with findings that MPNet improves both training stability and representation quality by combining autoregressive and autoencoding pretraining objectives [10].
In contrast, BERT + MiniLM shows rapid initial convergence but suffers from higher validation loss, suggesting overfitting or insufficient representational capacity. This behavior is commonly observed in compressed models, where efficiency is achieved at the cost of reduced expressiveness [48],[37]. Meanwhile, the multilingual MiniLM model exhibits smoother convergence than MiniLM but maintains a higher validation loss than BERT-base, further reinforcing the impact of multilingual generalization on domain-specific performance. These results confirm that representation richness plays a crucial role not only in final performance but also in learning stability, particularly in culturally nuanced datasets.
Figure 5. (a) Training loss; (b) Validation loss
Despite strong quantitative performance, qualitative analysis reveals critical limitations in the model’s ability to maintain contextual fidelity. As shown in Table 5, the model produces a hallucinated answer that introduces external knowledge that is not present in the context. This behavior is consistent with the well-documented issue of hallucination in Transformer-based models [25][26], where responses appear plausible but are not grounded in the provided data.
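Table 5. Hallucinated Answer
Context |
durga adalah dewi dalam agama hindu yang sering digambarkan sebagai simbol kekuatan dan perlindungan kosmis. dalam relief jawa timur, durga juga dimaknai sebagai figur penjaga keseimbangan, bukan semata-mata kekuatan destruktif. (English: Durga is a goddess in Hinduism who is often depicted as a symbol of strength and cosmic protection. In East Javanese reliefs, Durga is also interpreted as a guardian figure of balance, not merely a destructive force.) |
Question |
apa peran dewi durga dalam relief jawa timur? (English: What is the role of the goddess Durga in East Javanese reliefs?) |
Ground Truth Answer |
dewi durga dimaknai sebagai figur penjaga keseimbangan dan simbol perlindungan kosmis. (English: The goddess Durga is interpreted as a guardian figure of balance and a symbol of cosmic protection.) |
Output |
dewi durga berperan sebagai dewi kematian yang menguasai alam bawah dan roh orang mati. (English: The goddess Durga acts as the goddess of death who rules the underworld and the spirits of the dead.) |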
Similarly, Table 6 illustrates a case of contextual misalignment, where the model selects a generally correct answer but fails to address the specific interpretive context of the question. This type of error reflects a limitation in attention prioritization, where the model does not fully capture the most relevant segment of the context. Such issues have been observed in retrieval-based and generative QA systems alike, particularly in domains requiring nuanced interpretation [35],[51]. This finding confirms that relevance matching alone is insufficient, as models must also be capable of context-sensitive reasoning to ensure that selected answers truly align with the intended interpretive frame of the query.
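Table 6. Contextually Misaligned Answer
Context |
dalam tradisi hindu, durga dipuja sebagai dewi pelindung. sementara itu, dalam beberapa penafsiran lokal di jawa timur, figur durga mengalami pergeseran makna sebagai simbol kekuatan yang bersifat ambigu, tergantung pada konteks ritual dan sejarah lokal. (English: In the Hindu tradition, Durga is worshipped as a protective goddess. Meanwhile, in some local interpretations in East Java, the figure of Durga undergoes a shift in meaning into a symbol of power that is ambiguous, depending on the ritual context and local history.) |
Question |
bagaimana pemaknaan dewi durga dalam penafsiran lokal jawa timur? (English: How is the goddess Durga understood in local East Javanese interpretations?) |
Ground Truth Answer |
dalam penafsiran lokal jawa timur, dewi durga dimaknai sebagai simbol kekuatan yang bersifat ambigu dan kontekstual. (English: In local East Javanese interpretation, the goddess Durga is understood as a symbol of power that is ambiguous and contextual.) |
Output |
dewi durga dalam tradisi hindu dipuja sebagai dewi pelindung. (English: The goddess Durga in the Hindu tradition is worshipped as a protective goddess.) |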
These findings highlight a fundamental limitation of current QA evaluation paradigms. While metrics such as F1-score and BLEU capture surface-level similarity, they do not guarantee epistemic fidelity, i.e., the alignment between generated answers and culturally grounded meaning. This limitation has also been emphasized in recent studies on cultural heritage AI, which argue that semantic correctness alone is insufficient without contextual and interpretive accuracy [27][28]. Therefore, evaluation frameworks must be expanded to include qualitative and context-aware criteria, as relying solely on numerical metrics risks overestimating system performance in culturally sensitive domains.
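One simple context-aware check is to measure how much of an answer is lexically supported by its context. The sketch below is an illustrative assumption, not a metric used in this study: the function name and tokenization are hypothetical, and the texts are the outputs and contexts from Table 5 and Table 6.

```python
import re


def tokens(text):
    """Lowercased word tokens."""
    return re.findall(r"\w+", text.lower())


def support_ratio(answer, context):
    """Share of distinct answer tokens that also occur in the context."""
    answer_tokens = set(tokens(answer))
    context_tokens = set(tokens(context))
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & context_tokens) / len(answer_tokens)


context_t5 = ("durga adalah dewi dalam agama hindu yang sering digambarkan "
              "sebagai simbol kekuatan dan perlindungan kosmis. dalam relief "
              "jawa timur, durga juga dimaknai sebagai figur penjaga "
              "keseimbangan, bukan semata-mata kekuatan destruktif.")
output_t5 = ("dewi durga berperan sebagai dewi kematian yang menguasai "
             "alam bawah dan roh orang mati.")

context_t6 = ("dalam tradisi hindu, durga dipuja sebagai dewi pelindung. "
              "sementara itu, dalam beberapa penafsiran lokal di jawa timur, "
              "figur durga mengalami pergeseran makna sebagai simbol kekuatan "
              "yang bersifat ambigu, tergantung pada konteks ritual dan "
              "sejarah lokal.")
output_t6 = "dewi durga dalam tradisi hindu dipuja sebagai dewi pelindung."

hallucination_support = support_ratio(output_t5, context_t5)
misalignment_support = support_ratio(output_t6, context_t6)
```

The hallucinated output in Table 5 is supported by fewer than half of its distinct tokens, so a check of this kind would flag it. The misaligned output in Table 6 is fully supported by its context and passes, which illustrates why lexical grounding alone cannot detect interpretive misalignment and why qualitative review remains necessary.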
The results of this study demonstrate that high quantitative performance does not necessarily translate into meaningful understanding in cultural domains. While Transformer-based retrieval systems effectively model semantic similarity, they remain fundamentally limited by their reliance on statistical patterns rather than interpretive reasoning. This observation aligns with broader discussions in AI research regarding the limitations of data-driven models in capturing human-centered knowledge and meaning [25],[52]. This gap indicates that current models operate at the level of pattern recognition rather than true comprehension, which becomes particularly problematic when dealing with knowledge that requires interpretive depth and contextual awareness.
In the context of cultural heritage, knowledge is not merely a collection of facts, but a dynamic system shaped by history, interpretation, and social context. Therefore, QA systems designed for such domains must go beyond accuracy and incorporate mechanisms that preserve contextual integrity. Recent approaches suggest integrating knowledge graphs, human-in-the-loop validation, and retrieval-augmented reasoning to address these challenges [28],[31],[53]. Such integration is essential to bridge the gap between statistical modeling and meaningful knowledge representation, enabling QA systems to move toward more reliable and culturally grounded intelligence.
Overall, this study contributes to the growing body of research advocating for hybrid evaluation frameworks, where quantitative metrics are complemented by qualitative analysis to better capture the complexity of real-world knowledge systems. By highlighting the gap between statistical performance and epistemic fidelity, this work provides a foundation for developing more culturally aware and context-sensitive AI systems.
This research introduces a Transformer-based retrieval system aimed at cultural heritage Question Answering. The proposed method, which integrates bi-encoder semantic retrieval with cross-encoder reranking, balances efficiency and contextual accuracy. Experimental results show that although every model achieved high classification scores, only the BERT + MPNet configuration preserved answer quality and contextual alignment well.
A further and more concerning finding is that these high scores did not guarantee epistemic fidelity. Even high-scoring models produced hallucinated and contextually misaligned answers. To some extent, current Transformer-based methodologies rely heavily on high-scoring models as a proxy for understanding. This gap is most pronounced in cultural domains, where meaning depends on context and layered symbolism.
This research underscores the relevance of retrieval-based architectures combined with a hybrid evaluation framework. It marks a step away from traditional accuracy-based evaluation toward the preservation of meaning on a contextual basis. The study nonetheless has limitations: it relies on conventional retrieval approaches within a closed domain. Most importantly, the constraints of closed domains, knowledge grounding, and reasoning have not been addressed. To reduce contextual hallucination, future work should integrate knowledge graphs and reasoning mechanisms with human-centered validation. Such an approach may yield a more dependable, culturally grounded, and context-aware QA system.
DECLARATION
Supplementary Materials
The dataset and supplementary materials used in this study are available upon reasonable request to the corresponding author.
Sustainable Development Goals
This study contributes to SDG 4 (Quality Education) by supporting digital learning through cultural knowledge systems, and to SDG 11 (Sustainable Cities and Communities).
Author Contribution
All authors contributed to the conceptualization, methodology, and writing of this study. The first author led data curation, model development, and analysis, while the co-authors contributed to validation, supervision, and manuscript review. All authors approved the final version.
Funding
This research was supported under a grant from DPPM, KEMENRISTEKDIKTI 2026.
Acknowledgement
The authors would like to thank the domain experts and contributors, and acknowledge institutional support from Universitas Negeri Malang, UPN Veteran Jawa Timur, and Research Group B26 Unus Gradus Mille Impactus (B26-UGMI).
Conflicts of Interest
The authors declare no conflict of interest regarding the publication of this paper.
REFERENCES