
Open Source · AI Research

Alkebulan AI

Status

Shipped

Stack

Mistral-7B · Hugging Face · QLoRA · SentencePiece · Python

[ THE THESIS ]

Why Western AI fails 800 million African language speakers

Off-the-shelf models treat language like English—splitting words by spaces. But in morphologically complex African languages, a single word is an entire sentence. A word like Abanoonyiboobubudamu (refugees) gets shattered into 11 meaningless fragments by standard tokenizers, destroying the semantic relationship before the neural network even sees it. I built AlkebulanAI to fix the architecture from the ground up, starting with Luganda as the benchmark for the hardest cases.

SHIPPED · Hugging Face · Apache-2.0 · v1.0.0 validated

Technical proof
100% morpheme preservation · vs 61% mBERT · 57% XLM-R
0.00% OOV rate · vs 3.8% mBERT · 2.1% XLM-R
2.6× token compression · 11.4 vs 29.8 tokens per sentence
0 divergence events · AdamW logged 2 · 600-step comparison

SECTION 01 · DESIGN DECISIONS

Four calls that made it work

Each one overrides a default from the Western NLP stack that would have quietly broken the model on Bantu data.

01

Fix the tokenizer before touching neural weights

Symbolic morphology first (rule-based, language-specific), statistical learner second (SentencePiece). Boundaries are inserted before a single gradient flows.

02

Long agglutinated tokens produce spiky gradients

Off-the-shelf optimisers (SGD, AdamW) amplify the spikes through their recency bias. Training diverges. Benchmarks lie. So the optimiser had to change too.

03

Two hyper-parameters, not six

Sekan exposes lr and history_size — that's it. No β₁, β₂, ε. Low-resource regimes cannot afford hyper-parameter sweeps, so the optimiser was designed without them.

04

A blueprint, not a Luganda tokenizer

ABLT v1 is the first instance. v2 adds four Bantu sisters via the same two-layer pattern. v3 extends to Nilotic, Cushitic, and Afro-Asiatic families. ~800M African-language speakers addressable.

SECTION 02 · TWO-STAGE TOKENISER

Symbolic morphology feeds a statistical learner

Input

raw Luganda text

Layer 1 · EnhancedLugandaTokenizer · rule-based · language-specific

▹ all 10 Luganda noun classes
▹ verb prefixes + extensions + negation + tense
▹ phonological compounds (mw/bw/ky/ny/…)
▹ geminates (mm/nn/jj/pp/tt/kk)
▹ loanword detection
▹ inserts morpheme boundaries ( · )

Layer 2 · SentencePiece Unigram · trained · language-agnostic

▹ vocab = 5,795 pieces
▹ trained on 108,765 monolingual Luganda sentences
▹ respects the boundaries from layer 1
▹ outputs subword pieces with ▁ word-start mark

Downstream

tokens → IDs → Mistral-7B + QLoRA + Sekan optimiser
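
A minimal sketch of the two-layer flow, assuming a hypothetical insert_boundaries() method on EnhancedLugandaTokenizer and a local SentencePiece model file named ablt_unigram.model; the real API and file names live in the repo, so treat this as an illustration of the pattern, not the project's exact code.

python

# Illustrative two-stage ABLT flow (method names and file paths are assumptions).
import sentencepiece as spm
from ablt.tokenizer import EnhancedLugandaTokenizer  # hypothetical import path

# Layer 1: rule-based, language-specific. Inserts " · " morpheme boundaries.
luganda = EnhancedLugandaTokenizer()
segmented = luganda.insert_boundaries("Abanoonyiboobubudamu")

# Layer 2: trained SentencePiece Unigram model (vocab = 5,795 pieces in v1).
# It respects the Layer 1 boundaries and emits ▁-marked subword pieces.
sp = spm.SentencePieceProcessor(model_file="ablt_unigram.model")
pieces = sp.encode(segmented, out_type=str)   # subword pieces
ids = sp.encode(segmented, out_type=int)      # IDs fed to Mistral-7B + QLoRA

print(pieces)
print(ids)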

SECTION 03 · THE INSIGHT

From calculus to code

How a 1-4-1 weighting pattern from numerical integration became the Sekan optimiser.

// Founder's notes

The training runs kept diverging. AdamW was reading every long Luganda agglutination as a real gradient signal and amplifying it — so the model was chasing noise harder with every step.

I went back to my calculus notes on numerical integration, and Simpson's 1/3 rule jumped off the page. It's a 4th-order method for approximating a definite integral using a symmetric 1-4-1 weighting across three points — the endpoints count once, the middle counts four times.

That's exactly what a gradient window needs. Take g_{t-2}, g_{t-1}, g_t — weight them 1-4-1, divide by 6, and you get a smoothed descent direction with 4th-order accuracy. Symmetric by construction, so recent spikes can't torch the run.

I derived the update step on paper first, checked the error term analytically, then translated it into a torch.optim.Optimizer with two hyper-parameters — no β₁, β₂, or ε — because low-resource regimes can't afford hyper-parameter sweeps.

First real run: eval loss 4.96 vs AdamW's 5.05, zero divergence events on the same Mistral-7B + QLoRA pipeline. The math held.

[ Compute reality ]

Ran on a Google Colab T4 — not H100 cluster compute. What this proves: the math holds under conditions any researcher in Africa can reproduce. A 10% perplexity drop from a pure optimiser change, on free-tier compute, is a result.

[ Pull quote ]

“The optimiser didn't need new invention. It needed older math.”

— Derivation notes, Jan 2025

SECTION 04 · THE SEKAN OPTIMISER

Simpson's 1/3 rule applied to gradient history

Symmetric. Centre-weighted. 4th-order accurate. Two hyper-parameters.

Update rule

    g_{t-2} + 4·g_{t-1} + g_t
ĝ_t = ─────────────────────────
                6

θ_{t+1} = θ_t − η · ĝ_t

A direct application of Simpson's 1/3 rule. The middle gradient g_{t-1} is weighted 4× while the window endpoints are weighted 1× each: a symmetric 1-4-1 kernel that damps end-of-window spikes instead of amplifying them.
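
One way to see where the division by 6 comes from: Simpson's 1/3 rule estimates the integral of the gradient over the three-step window, and dividing that estimate by the window length turns it into a mean. With step size h = 1 between logged gradients:

    ∫ g dτ over [t−2, t]  ≈  (h/3) · (g_{t-2} + 4·g_{t-1} + g_t)

    dividing by the window length 2h = 2:

            g_{t-2} + 4·g_{t-1} + g_t
    ĝ_t  ≈  ─────────────────────────
                        6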

Why it works on Bantu

  • Damps spikes by construction

    Adam amplifies recent gradients. Sekan averages them symmetrically. Long agglutinated tokens stop torching training.

  • 4th-order accuracy · O(h⁴)

    vs O(h²) for trapezoidal averaging: two orders higher in the window step size, so a sharper estimate of the true descent direction per step.

  • Drop-in torch.optim.Optimizer

    Works with LoRA, QLoRA, full fine-tuning. No framework rewrite required.
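
A short usage sketch of the drop-in claim, assuming the class is importable from sekan_algorithm.sekan as shown in Section 07; the loop below is generic PyTorch, not the project's actual fine-tuning script.

python

# Swapping AdamW for Sekan in an ordinary PyTorch training loop (illustrative).
import torch
from sekan_algorithm.sekan import SekanOptimizer  # import path assumed from the repo layout

model = torch.nn.Linear(16, 2)   # stand-in for Mistral-7B + QLoRA adapter parameters
# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)   # before
optimizer = SekanOptimizer(model.parameters(), lr=1e-3)         # after: identical call sites

loss_fn = torch.nn.CrossEntropyLoss()
x, y = torch.randn(8, 16), torch.randint(0, 2, (8,))

for step in range(5):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()   # applies the 1-4-1 smoothing once three gradients are banked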

SECTION 05 · VALIDATION

Every number in validation/METRICS.md · reproducible

Benchmarked against mBERT and XLM-R on the same Luganda evaluation set.

Tokeniser comparison

Metric                      ABLT (ours)   mBERT   XLM-R
Morpheme preservation       100.00%       61%     57%
OOV rate                    0.00%         3.8%    2.1%
Tokens / Luganda sentence   11.4          29.8    24.1
Compression gain vs mBERT   2.6×          1.0×    1.24×

Head-to-head · Mistral-7B + QLoRA · 600 steps · Colab T4 · same seed, data, compute

Metric                   Sekan (ours)   AdamW
Final eval loss          4.9621         5.0473
Step-to-step loss σ      0.061          0.143
Divergence events        0              2
Steps to first plateau   ~450           ~600

SECTION 06 · ENGINEERING HIGHLIGHTS

What this project actually proves

REPRODUCIBLE

Every claim is a notebook cell away

100% morpheme preservation and 0% OOV come from 02_ABLT_Tokenizer_Validation.ipynb. Every benchmark number lives in validation/METRICS.md with full provenance. No marketing numbers.

reproducible from scratch
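
For a quick sanity check before opening the notebook, here is one plausible way to measure tokens-per-sentence and OOV rate for any SentencePiece model; the definitions in validation/METRICS.md may differ in detail, so this is a sketch of the idea, not the notebook's code.

python

# Illustrative metric sketch (definitions assumed, not copied from the notebook).
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="ablt_unigram.model")  # assumed file name
UNK = sp.unk_id()

def token_stats(sentences):
    """Mean tokens per sentence and share of tokens falling back to <unk>."""
    total, unk = 0, 0
    for sent in sentences:
        ids = sp.encode(sent, out_type=int)
        total += len(ids)
        unk += sum(1 for i in ids if i == UNK)
    return total / len(sentences), unk / total

avg_len, oov = token_stats(["Abanoonyiboobubudamu"])
print(f"tokens/sentence: {avg_len:.1f}   OOV rate: {oov:.2%}")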

DEPLOYED

Live LLM on Hugging Face

The v1 Mistral-7B fine-tuned with Sekan on ABLT-tokenised Luganda is live at huggingface.co/Psalms23Wave/Alkebulan-AI. Anyone with transformers installed can load it and translate EN↔Luganda in four lines.

real artifact, real inference
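
A hedged version of the four-line claim, assuming the checkpoint loads through the standard transformers auto classes and that plain causal-LM prompting is the intended interface; the model card at huggingface.co/Psalms23Wave/Alkebulan-AI is authoritative for the exact prompt format.

python

# Loading the released checkpoint (prompt format is an assumption, see the model card).
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("Psalms23Wave/Alkebulan-AI")
model = AutoModelForCausalLM.from_pretrained("Psalms23Wave/Alkebulan-AI")

inputs = tok("Translate to Luganda: Refugees have rights.", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(output[0], skip_special_tokens=True))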

GOVERNANCE

Full Apache-2.0 legal + community stack

LICENSE + NOTICE + LICENSING + TRADEMARK + ATTRIBUTION + CLA + CODE_OF_CONDUCT + GOVERNANCE + MAINTAINERS + SECURITY + RELEASE_CHECKLIST + a step-by-step ADDING_A_LANGUAGE.md playbook. Most solo OSS projects ship a LICENSE file and stop there. This one is audit-ready.

enterprise-adoptable

SECTION 07 · CODE PROOF

Sekan as a drop-in torch.optim.Optimizer

Two hyper-parameters. Simpson's 1/3 applied to a 3-step gradient window.

sekan_algorithm/sekan.py

python

# sekan_algorithm/sekan.py — excerpt
# Simpson's 1/3 rule: ĝ_t = (g_{t-2} + 4·g_{t-1} + g_t) / 6

import torch

class SekanOptimizer(torch.optim.Optimizer):
    def __init__(self, params, lr=1e-3, history_size=3):
        defaults = dict(lr=lr, history_size=history_size)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        for group in self.param_groups:
            lr = group["lr"]
            for p in group["params"]:
                if p.grad is None:
                    continue

                state = self.state[p]
                if "history" not in state:
                    state["history"] = []

                state["history"].append(p.grad.detach().clone())
                if len(state["history"]) > 3:
                    state["history"].pop(0)

                if len(state["history"]) == 3:
                    g_tm2, g_tm1, g_t = state["history"]
                    g_hat = (g_tm2 + 4.0 * g_tm1 + g_t) / 6.0   # Simpson 1/3
                else:
                    g_hat = p.grad

                p.add_(g_hat, alpha=-lr)
        return closure() if closure else None