Multi-domain Data Mapping Tool
Streamlit dashboard that maps source spreadsheets against a master classification workbook using a five-tier match strategy with rapidfuzz fuzzy matching.
// AI capabilities
- Hierarchical classification matching across 9 domains
- rapidfuzz-based fuzzy string matching at 80% threshold
- Header auto-detection with alias fallback
- Confidence-flagging for ambiguous matches
// Architecture flow
Overview
A Streamlit dashboard used internally for supplier onboarding. Drop a vendor's product spreadsheet in, the tool maps every column and value against the internal master classification workbook (New Classifications.xlsx), flags ambiguities for review, and exports a clean enriched workbook with audit tabs.
Problem
Onboarding new suppliers required manually mapping their flat product spreadsheets onto an internal nine-domain classification scheme: Division, Category, Product Group, Family, Brand, Gender, Season, Country of Origin, and HS Code. Each supplier used different column names, different value spellings, and different levels of detail. Days per supplier was the norm.
Approach
Codify the classification logic as a reference repository with indexed lookups. Run incoming data through a five-tier match strategy that escalates from exact code match all the way down to fuzzy string match. Surface ambiguities to a human operator who can override per row. Export an enriched workbook with audit tabs so the receiving team can see exactly how each value was mapped.
Architecture
- App: Streamlit (Python). Single-process local app for the data team.
- Reference repository: Master workbook parsed once and indexed for fast lookups across 9 classification domains.
- Header detection: Auto-detection with alias-based fallback. JSON config file holds known aliases per column.
- Five-tier matching: exact code → exact description → normalized → alias → fuzzy (80% threshold via rapidfuzz).
- Hierarchical validation: Division → Category → Product Group consistency enforced; mismatches surface as warnings.
- Export: Enriched workbook plus a "Mapping Report" tab and an "Ambiguities" tab.
- Tests: pytest with sample data; ruff for lint.
Tech stack
- Runtime: Python 3.11+, Streamlit
- Libraries: rapidfuzz, openpyxl, pandas, pytest, ruff
- Config: JSON (column aliases, value aliases, app settings)
Engineering highlights
- Five-tier match priority: exact code wins, then exact description, then normalized strings, then aliases, then fuzzy. Each tier annotates the result so reviewers see how it was matched.
- Hierarchy enforcement: a Family must roll up to a Product Group that rolls up to a Category. Inconsistencies are flagged, not silently corrected.
- Override-aware exports: manual overrides survive re-runs and show up in the audit tab.
- Sample data shipped: Adidas, Erke, Scubapro, Killtec, ON sample sheets prove the matcher across very different vendor styles.
Outcome
Operational. Reduces supplier onboarding from days per supplier to minutes, with reviewer-controlled confidence on every mapped value.
Lessons
- Classification problems benefit from explicit tiered match strategies more than from ML. The data team can audit "matched via alias" but not "the model said 0.72."
- A reference repository (one Excel workbook) is often a better source of truth than a database. The team owns and edits it.
- rapidfuzz at 80% is the right floor for this domain; tighter and you miss legit matches, looser and you produce noise.
Want to dig deeper?
Ask my AI agent anything about how this was built, what tradeoffs I made, or how it could fit your team.
Ask my AI →// related projects
Product Data Enrichment Dashboard
AI-assisted product enrichment pipeline with confidence scoring, source-tracked LLM proposals, and a queue-based architecture that never silently overwrites master data.
Scheduled Order Sync Automation
GitHub Actions-driven Python pipeline that syncs e-commerce export files into team Google Sheets twice daily with deduplication and structured status columns.
E-commerce Delivery Performance Dashboard
Multi-tenant logistics analytics platform merging Shopify orders with carrier delivery data across four stores, with regional SLA tracking and weighted operations scorecards.