
2.2. Evaluation of Arabic Segmentation
Early evaluations of Arabic segmentation typically relied on
alignment with treebank-style gold annotations and reported
boundary-level precision, recall, and F1 scores. However, treating all
boundaries as equally important obscures qualitatively differ-
ent error types, such as under-segmentation of proclitics versus
over-segmentation of stems [5]. Task-oriented studies demon-
strated that segmentation errors have asymmetric downstream
impact: over-segmentation may harm precision in information
retrieval, while under-segmentation may reduce recall or impair
translation quality [20, 21].
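Boundary-level scoring can be made concrete with a small sketch. The function names and the segment-list input convention below are our own illustration, not a specification from the cited work:

```python
def boundary_set(segments):
    """Return the character offsets at which segment boundaries occur.

    A segmentation is given as a list of segments, e.g. ['w', 'ktAb', 'hm']
    yields boundaries {1, 5}; no boundary follows the final segment.
    """
    offsets, pos = set(), 0
    for seg in segments[:-1]:
        pos += len(seg)
        offsets.add(pos)
    return offsets


def boundary_prf(gold, predicted):
    """Boundary-level precision, recall, and F1 for one gold/predicted pair."""
    g, p = boundary_set(gold), boundary_set(predicted)
    tp = len(g & p)  # boundaries placed at exactly the same offsets
    precision = tp / len(p) if p else 1.0
    recall = tp / len(g) if g else 1.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Note how this metric treats every boundary identically: an under-segmented proclitic and an over-segmented stem lower the score by the same amount, which is precisely the conflation criticized above.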
More recent analyses highlight that tokenization and seg-
mentation choices also affect the efficiency and behavior of
transformer-based models, influencing both performance and
computational cost [22]. Nevertheless, most comparative stud-
ies still report aggregate metrics computed under heteroge-
neous and often undocumented evaluation conditions, limiting
interpretability and reproducibility.
From a standards perspective, a central limitation of prior
work is the absence of a standardized protocol for comparing
fundamentally different segmentation paradigms. Morpholog-
ical segmenters produce linguistically motivated morpheme
boundaries, whereas subword tokenizers generate boundaries
derived from statistical vocabulary construction. Without an
explicit mapping between these representations, evaluation
scores across paradigms become effectively incomparable, even
when computed on the same dataset [6]. Reproducibility is
further hindered when code, data splits, normalization poli-
cies, and evaluation scripts are not fully specified or publicly
available [23].
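One way to make the two paradigms comparable is to project both onto a shared character-offset representation. The sketch below assumes a WordPiece-style continuation prefix (`##`); other tokenizers use different markers, and the function name is our own:

```python
def char_boundaries_from_tokens(tokens, marker="##"):
    """Project a subword tokenizer's output onto character offsets.

    Stripping the continuation marker recovers surface lengths, so the
    resulting boundary set lives in the same space as a morphological
    segmentation of the same word and the two can be scored directly.
    """
    offsets, pos = set(), 0
    for tok in tokens[:-1]:  # no boundary after the final token
        surface = tok[len(marker):] if tok.startswith(marker) else tok
        pos += len(surface)
        offsets.add(pos)
    return offsets
```

Under such a mapping, a morphological split ['w', 'ktAb', 'hm'] and a subword split ['wk', '##tA', '##bhm'] of the same string become directly comparable boundary sets, rather than incommensurable outputs.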
2.3. Benchmarking Practices, Standards, and Robustness
General-purpose NLP benchmarks such as GLUE and Super-
GLUE demonstrated the value of unified tasks, datasets, and
scoring protocols for accelerating progress through comparabil-
ity [24, 25]. Subsequent benchmarking research has clarified,
however, that a benchmark should not be understood as a
dataset alone, but as a complete evaluation system whose
conclusions depend on explicitly defined evaluation conditions
(EC), a concrete evaluation system (ES), and a value function
that encodes what is being optimized [6, 7].
Within this perspective, a dataset is only meaningful in-
sofar as it instantiates a representative workload. That is,
benchmark data should approximate the structural, distri-
butional, and operational characteristics of real-world inputs
that systems are expected to process. ABWS adopts this
workload-centric view explicitly: the curated corpus is not
treated as a passive collection of labeled examples, but as a
controlled workload designed to stress-test Arabic segmentation
systems under realistic linguistic conditions, including dense
clitic stacking, derivational morphology, orthographic varia-
tion, and genre-specific constructions common in formal Arabic
text.
Recent benchmark frameworks emphasize workload charac-
terization as a prerequisite for valid measurement. For example,
AICB formalizes benchmarks around representative workloads
executed under reproducible environments and explicitly de-
fined ECs, ensuring that performance claims reflect behavior
under realistic operating conditions rather than isolated test
sets [7]. Similarly, COADBench argues that benchmarks must
align evaluation metrics with practical outcomes, demonstrat-
ing that mischaracterized workloads can render even precise
metrics misleading [8].
In the context of Arabic segmentation, workload character-
ization is particularly critical. Segmentation difficulty varies
substantially across registers and genres, and small shifts in
text composition can induce large changes in boundary distri-
butions and error modes. ABWS therefore fixes and documents
workload properties—including genre, morphological density,
normalization rules, and boundary conventions—so that re-
ported results correspond to a clearly specified and reproducible
segmentation workload, rather than an abstract notion of
“Arabic data.”
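Fixing workload properties amounts to recording them as an explicit, machine-readable artifact rather than prose. The field names below are an illustrative schema of our own, not the actual ABWS specification:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkloadSpec:
    """Illustrative record of workload properties that must be fixed and
    reported for a segmentation result to be reproducible."""
    genre: str                    # e.g. "news", "classical", "social"
    morphological_density: float  # mean morphemes per token in the corpus
    normalization: tuple          # ordered normalization rules applied
    boundary_convention: str      # e.g. "clitic+stem" vs. "full morpheme"
    split_seed: int               # deterministic train/dev/test split
```

Because the record is frozen, a spec cannot drift silently between runs, and two reported results are comparable exactly when their specs are equal.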
Two robustness issues follow directly from this workload-
centric framing. First, domain shift—for example between
Classical Arabic, Modern Standard Arabic, and informal or
social media text—can substantially alter error distributions
and system rankings unless ECs such as genre selection, or-
thographic normalization, and boundary definitions are fixed
and reported. Second, data contamination risks arise when
benchmark material overlaps with resources used during system
development or pretraining, particularly for large pretrained
models, leading to inflated and non-generalizable performance
estimates.
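A minimal contamination audit can be sketched as an exact n-gram overlap check. This is a heuristic only; real audits also require normalization and fuzzy matching, and the function name and default n are our own assumptions:

```python
def contamination_rate(benchmark_sents, pretrain_sents, n=8):
    """Fraction of benchmark sentences sharing any word n-gram with the
    pretraining corpus (exact-match heuristic)."""
    def ngrams(sent):
        words = sent.split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    seen = set()
    for sent in pretrain_sents:
        seen |= ngrams(sent)
    flagged = sum(1 for sent in benchmark_sents if ngrams(sent) & seen)
    return flagged / len(benchmark_sents) if benchmark_sents else 0.0
```

Reporting such a rate alongside benchmark results makes the contamination risk inspectable instead of implicit.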
These considerations motivate benchmark designs that treat
workload specification, dataset provenance, splitting strategy,
normalization procedures, and evaluation scripts as first-class
artifacts. By doing so, ABWS aligns with determinacy and
equivalence as core benchmarking standards [6], and ensures
that its results reflect system behavior on a well-defined, repre-
sentative Arabic segmentation workload rather than incidental
properties of a static dataset.
2.4. Our Position
ABWS is designed as a standards-oriented benchmark for
Arabic word segmentation. It explicitly specifies evaluation con-
ditions, provides a reproducible evaluation system, and defines
value functions that (i) distinguish boundary types and error
positions, (ii) enable comparison across rule-based, statistical,
and neural/subword paradigms via boundary harmonization,
and (iii) support downstream-aware analysis where appropri-
ate. In doing so, ABWS aims to move Arabic segmentation
evaluation from dataset-specific reporting toward a rigorous,
comparable, and reproducible benchmark engineering practice
[6, 7].
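A value function that distinguishes error types can be sketched very simply: each error category contributes its count times a category-specific cost. The category names and weights below are illustrative, not ABWS's actual value function:

```python
def weighted_error_cost(error_counts, weights, default=1.0):
    """Toy value function over a segmentation error profile.

    Unlike a flat boundary count, qualitatively different errors
    (e.g. merged proclitics vs. split stems) carry different costs,
    so the score encodes what is being optimized.
    """
    return sum(weights.get(kind, default) * count
               for kind, count in error_counts.items())
```

For example, weighting merged proclitics at 2.0 and split stems at 0.5 expresses a preference ordering that a single aggregate F1 cannot.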
3. Formal Specification and Evaluation Conditions
This section describes the architectural design of ABWS (Ara-
bic Boundary-aware Word Segmentation), a benchmarking
framework engineered to address fundamental limitations in ex-
isting Arabic segmentation evaluation practices. Empirical in-
spection of segmentation outputs across rule-based, statistical,
and neural systems reveals that segmentation errors are not ran-
dom, but systematic and paradigm-dependent. Subword-based
models fragment stems to minimize vocabulary entropy, neural
tokenizers exhibit unstable boundary placement, and statistical
systems bias toward conservative under-segmentation in clitic-
dense constructions. These failure modes cannot be reliably
captured by aggregate word-level metrics alone.
ABWS is therefore designed not as a static dataset, but
as a unified benchmarking harness that enables reproducible,