ABWS: The Arabic Boundary-aware Word Segmentation Benchmark for Reproducible Evaluation
Keywords: Arabic NLP, Morphological Segmentation, Benchmarking, Reproducibility, Boundary Errors, Error Taxonomy, Benchmark Traceability, Evaluation Conditions

Abstract
With the rapid adoption of natural language processing (NLP) systems for morphologically rich languages, it has become increasingly imperative to standardize a common set of measures and evaluation practices to ensure reproducibility and fair comparison. Arabic word segmentation serves as a foundational layer in the NLP software stack; however, the field remains fragmented due to inconsistent datasets and an overreliance on opaque, aggregate metrics that mask systemic architectural biases.
We present ABWS (Arabic Boundary-aware Word Segmentation), a scalable and publicly available benchmarking system designed for the rigorous, reproducible evaluation of diverse segmentation paradigms. To enable paradigm-agnostic comparison across rule-based, statistical, and neural models, ABWS introduces a canonical boundary vector abstraction that normalizes disparate system outputs into a unified evaluation interface. The benchmarking harness includes a manually verified gold-standard workload of 212,873 words across diverse genres and integrates seven widely used segmentation systems as reproducible baselines.
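To make the boundary vector abstraction concrete, the sketch below shows one way such a normalization could look in Python. It is illustrative only: the function name `to_boundary_vector` and the exact vector encoding are our assumptions, not the ABWS API.

```python
def to_boundary_vector(segments: list[str]) -> list[int]:
    """Map one segmented word to a binary boundary vector.

    A word of n characters has n - 1 candidate boundary positions;
    entry k is 1 iff a morpheme break falls after character k + 1.
    """
    word = "".join(segments)
    breaks, pos = set(), 0
    for seg in segments[:-1]:  # the final segment ends the word
        pos += len(seg)
        breaks.add(pos)
    return [int(i in breaks) for i in range(1, len(word))]

# w+ktb ("and he wrote"): one break, after the proclitic waw.
assert to_boundary_vector(["و", "كتب"]) == [1, 0, 0]
```

Because rule-based, statistical, and neural segmenters all emit some sequence of segments, normalizing every output this way reduces cross-paradigm comparison to matching binary vectors against the gold vector.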
Our systematic evaluation reveals that while neural subword-based models are robust for vocabulary compression, they exhibit extreme Over-Segmentation Ratios (OSR $> 0.58$), leading to a significant drop in word-level exact match accuracy compared to rule-based engines. We further introduce Critical Boundary Accuracy (CBA), a linguistically weighted metric that prioritizes high-impact morphological boundaries. Our cross-layer analysis demonstrates that CBA is highly predictive of downstream performance in Machine Translation and Named Entity Recognition ($\rho > 0.88$), whereas traditional token-level $F_1$ scores often obscure these performance bottlenecks.
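The abstract does not spell out the metric formulas, so the following is a minimal sketch under stated assumptions: we read OSR as the share of predicted boundaries absent from the gold standard, and CBA as weighted recall over gold boundaries, with weights encoding linguistic impact. Function names and formulations are illustrative, not the paper's definitions.

```python
def osr(pred: list[int], gold: list[int]) -> float:
    """Over-Segmentation Ratio (assumed form): fraction of
    predicted boundaries that are spurious, i.e. not in gold."""
    predicted = sum(pred)
    spurious = sum(1 for p, g in zip(pred, gold) if p and not g)
    return spurious / predicted if predicted else 0.0

def cba(pred: list[int], gold: list[int], w: list[float]) -> float:
    """Critical Boundary Accuracy (assumed form): recall over gold
    boundaries, weighted by the linguistic impact of each position."""
    hit = sum(wi for p, g, wi in zip(pred, gold, w) if g and p)
    total = sum(wi for g, wi in zip(gold, w) if g)
    return hit / total if total else 1.0
```

Under this reading, an over-segmenting subword model drives OSR up without necessarily lowering token-level $F_1$, which is consistent with the divergence the abstract reports.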
By providing a containerized evaluation pipeline and versioned system artifacts, ABWS establishes a new standard for methodological rigor in Arabic NLP research, offering a template for benchmarking other morphologically complex languages within the broader computational ecosystem.
License
Copyright (c) 2026 The Authors. Published by BenchCouncil Press on Behalf of International Open Benchmark Council

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.