Open Source · AI Research
Alkebulan AI
Status
Shipped
[ THE THESIS ]
Why Western AI fails 800 million African language speakers
Off-the-shelf models treat every language like English, splitting words by spaces. But in morphologically complex African languages, a single word can be an entire sentence. A word like Abanoonyiboobubudamu (refugees) gets shattered into 11 meaningless fragments by standard tokenizers, destroying the semantic structure before the neural network even sees it. I built AlkebulanAI to fix the architecture from the ground up, starting with Luganda as the benchmark for the hardest cases.
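You can watch the failure happen in two lines. A minimal sketch, assuming the Hugging Face transformers package; mBERT is one of the baselines from the validation section, and the exact fragment count varies by tokenizer:
python
# Sketch: how a stock multilingual tokenizer fragments one Luganda word.
# Requires the `transformers` package; downloads the mBERT vocab on first run.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
print(tok.tokenize("Abanoonyiboobubudamu"))  # "refugees"
# -> a run of short subword fragments, none aligned to a real morpheme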
SHIPPED · Hugging Face · Apache-2.0 · v1.0.0 validated
SECTION 01 — DESIGN DECISIONS
Four calls that made it work
Each one overrides a default from the Western NLP stack that would have quietly broken the model on Bantu data.
Fix the tokenizer before touching neural weights
Symbolic morphology first (rule-based, language-specific), statistical learner second (SentencePiece). Boundaries are inserted before a single gradient flows.
Long agglutinated tokens produce spiky gradients
Off-the-shelf optimisers (SGD, AdamW) amplify the spikes through recency bias: the newest gradients dominate the update. Training diverges. Benchmarks lie. So the optimiser had to change too.
Two hyper-parameters, not six
Sekan exposes lr and history_size, and that's it. No β₁, β₂, ε. Low-resource regimes cannot afford hyper-parameter sweeps, so the optimiser was designed to need none.
A blueprint, not a Luganda tokenizer
ABLT v1 is the first instance. v2 adds four Bantu sisters via the same two-layer pattern. v3 extends to Nilotic, Cushitic, and Afro-Asiatic families. ~800M African-language speakers addressable.
SECTION 02 — TWO-STAGE TOKENISER
Symbolic morphology feeds a statistical learner
Input
raw Luganda text
Layer 1 · symbolic morphology (rule-based)
▹ all 10 Luganda noun classes
▹ verb prefixes + extensions + negation + tense
▹ phonological compounds (mw/bw/ky/ny/…)
▹ geminates (mm/nn/jj/pp/tt/kk)
▹ loanword detection
▹ inserts morpheme boundaries ( · )
Layer 2 · statistical learner (SentencePiece)
▹ vocab = 5,795 pieces
▹ trained on 108,765 monolingual Luganda sentences
▹ respects the boundaries from layer 1
▹ outputs subword pieces with ▁ word-start mark
Downstream
tokens → IDs → Mistral-7B + QLoRA + Sekan optimiser
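The two-layer pattern in miniature. A sketch, not the ABLT source: it assumes the sentencepiece package, insert_morpheme_boundaries with its three toy prefixes stands in for the full rule layer (noun classes, verb morphology, geminates), and the file names are placeholders:
python
# Layer 1 (rule-based) pre-segments; layer 2 (SentencePiece) learns inside
# the segments. Toy stand-in for the real Luganda rules.
import re
import sentencepiece as spm

def insert_morpheme_boundaries(word: str) -> str:
    # Illustrative only: split off three common prefixes with the · marker.
    return re.sub(r"^(aba|omu|eki)", r"\1 · ", word)

with open("luganda_raw.txt") as src, open("luganda_segmented.txt", "w") as dst:
    for line in src:
        dst.write(" ".join(insert_morpheme_boundaries(w) for w in line.split()) + "\n")

# SentencePiece splits on whitespace by default, so it never merges across
# the boundaries layer 1 inserted.
spm.SentencePieceTrainer.train(
    input="luganda_segmented.txt",
    model_prefix="ablt_sketch",
    vocab_size=5795,  # matches the v1 vocab size above
)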
SECTION 03 — THE INSIGHT
From calculus to code
How a 1-4-1 weighting pattern from numerical integration became the Sekan optimiser.
// Founder's notes
The training runs kept diverging. AdamW was reading every long Luganda agglutination as a real gradient signal and amplifying it — so the model was chasing noise harder with every step.
I went back to my calculus notes on numerical integration, and Simpson's 1/3 rule jumped off the page. It's a 4th-order method for approximating a definite integral using a symmetric 1-4-1 weighting across three points — the endpoints count once, the middle counts four times.
That's exactly what a gradient window needs. Take g_{t-2}, g_{t-1}, g_t — weight them 1-4-1, divide by 6, and you get a smoothed descent direction with 4th-order accuracy. Symmetric by construction, so recent spikes can't torch the run.
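Spelled out, the divide-by-6 is just Simpson's integral estimate converted into a mean over the two-step window:
latex
% Simpson's 1/3 rule over a window of width 2h (three equally spaced samples):
\int_{a}^{a+2h} f(x)\,dx \approx \frac{h}{3}\left[ f(a) + 4f(a+h) + f(a+2h) \right]
% With step size h = 1 and f sampled at the last three gradients, the mean
% gradient over the window is the integral divided by the window width 2h = 2:
\hat{g}_t = \frac{1}{2}\cdot\frac{1}{3}\left[ g_{t-2} + 4g_{t-1} + g_t \right]
          = \frac{g_{t-2} + 4g_{t-1} + g_t}{6}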
I derived the update step on paper first, checked the error term analytically, then translated it into a torch.optim.Optimizer with two hyper-parameters — no β₁, β₂, or ε — because low-resource regimes can't afford hyper-parameter sweeps.
First real run: eval loss 4.96 vs AdamW's 5.05, zero divergence events on the same Mistral-7B + QLoRA pipeline. The math held.
[ Compute reality ]
Ran on a Google Colab T4, not an H100 cluster. What this proves: the math holds under conditions any researcher in Africa can reproduce. A 10% perplexity drop from a pure optimiser change, on free-tier compute, is a result.
[ Pull quote ]
“The optimiser didn't need new invention. It needed older math.”
— Derivation notes, Jan 2025
SECTION 04 — THE SEKAN OPTIMISER
Simpson's 1/3 rule applied to gradient history
Symmetric. Bowl-shaped. 4th-order accurate. Two hyper-parameters.
Update rule
       g_{t-2} + 4·g_{t-1} + g_t
ĝ_t = ───────────────────────────
                  6

θ_{t+1} = θ_t − η · ĝ_t

A direct application of Simpson's 1/3 rule. The middle gradient is weighted 4× while the endpoints are weighted 1× each: a symmetric 1-4-1 kernel that damps end-of-window spikes instead of amplifying them.
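A toy number check of the damping claim: put a 10× spike at the end of the window and the smoothed gradient barely moves.
python
# Spike at the window's end: ĝ = (1 + 4·1 + 10) / 6 = 2.5, a 4x attenuation
# of the raw spike. An EMA that leans on the newest gradient would track it.
import torch

g_tm2, g_tm1, g_t = torch.tensor(1.0), torch.tensor(1.0), torch.tensor(10.0)
print((g_tm2 + 4 * g_tm1 + g_t) / 6)  # tensor(2.5000)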
Why it works on Bantu
Damps spikes by construction
Adam amplifies recent gradients. Sekan averages them symmetrically. Long agglutinated tokens stop torching training.
4th-order accuracy · O(h⁴)
vs O(h²) for trapezoidal averaging. Two orders higher in the step size, so a sharper estimate of the true descent direction per step.
Drop-in torch.optim.Optimizer
Works with LoRA, QLoRA, full fine-tuning. No framework rewrite required.
SECTION 05 — VALIDATION
Every number in validation/METRICS.md · reproducible
Benchmarked against mBERT and XLM-R on the same Luganda evaluation set.
Tokeniser comparison
Head-to-head · Mistral-7B + QLoRA · 600 steps · Colab T4 · same seed, data, compute
SECTION 06 — ENGINEERING HIGHLIGHTS
What this project actually proves
REPRODUCIBLE
Every claim is a notebook cell away
100% morpheme preservation and 0% OOV come from 02_ABLT_Tokenizer_Validation.ipynb. Every benchmark number lives in validation/METRICS.md with full provenance. No marketing numbers.
→ reproducible from scratch
DEPLOYED
Live LLM on Hugging Face
The v1 Mistral-7B fine-tuned with Sekan on ABLT-tokenised Luganda is live at huggingface.co/Psalms23Wave/Alkebulan-AI. Anyone with transformers installed can load it and translate EN↔Luganda in four lines, sketched below.
→ real artifact, real inference
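The four-line load, sketched under standard assumptions: AutoModelForCausalLM is the usual interface for a Mistral-7B fine-tune, and the prompt string here is illustrative; the model card defines the real format.
python
# Assumes the `transformers` package; prompt format is illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("Psalms23Wave/Alkebulan-AI")
model = AutoModelForCausalLM.from_pretrained("Psalms23Wave/Alkebulan-AI")
ids = tok("Translate to Luganda: Where is the hospital?", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=48)[0], skip_special_tokens=True))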
GOVERNANCE
Full Apache-2.0 legal + community stack
LICENSE + NOTICE + LICENSING + TRADEMARK + ATTRIBUTION + CLA + CODE_OF_CONDUCT + GOVERNANCE + MAINTAINERS + SECURITY + RELEASE_CHECKLIST + a step-by-step ADDING_A_LANGUAGE.md playbook. Most solo OSS projects have LICENSE. This one is audit-ready.
→ enterprise-adoptable
SECTION 07 — CODE PROOF
Sekan as a drop-in torch.optim.Optimizer
Two hyper-parameters. Simpson's 1/3 applied to a 3-step gradient window.
sekan_algorithm/sekan.py
python
# sekan_algorithm/sekan.py — excerpt
# Simpson's 1/3 rule: ĝ_t = (g_{t-2} + 4·g_{t-1} + g_t) / 6
import torch


class SekanOptimizer(torch.optim.Optimizer):
    def __init__(self, params, lr=1e-3, history_size=3):
        # history_size defaults to the 3-point window Simpson's rule needs;
        # the excerpt below hardcodes that window.
        defaults = dict(lr=lr, history_size=history_size)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        for group in self.param_groups:
            lr = group["lr"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]
                if "history" not in state:
                    state["history"] = []
                state["history"].append(p.grad.detach().clone())
                if len(state["history"]) > 3:
                    state["history"].pop(0)  # keep only the last 3 gradients
                if len(state["history"]) == 3:
                    g_tm2, g_tm1, g_t = state["history"]
                    # 1-4-1 kernel: the middle gradient counts four times.
                    g_hat = (g_tm2 + 4.0 * g_tm1 + g_t) / 6.0  # Simpson 1/3
                else:
                    g_hat = p.grad  # warm-up: fewer than 3 gradients seen
                p.add_(g_hat, alpha=-lr)
        return closure() if closure else None
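And the drop-in claim, as a self-contained smoke test. A sketch assuming the repo is on the Python path; the linear model is a stand-in for any nn.Module.
python
# Hypothetical smoke test: Sekan where AdamW would normally go.
import torch
from sekan_algorithm.sekan import SekanOptimizer

model = torch.nn.Linear(16, 1)
opt = SekanOptimizer(model.parameters(), lr=1e-3)

for _ in range(10):
    loss = model(torch.randn(8, 16)).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()  # Simpson-smoothed update once three gradients have accumulated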