2 min read

Multi-domain Data Mapping Tool

Streamlit dashboard that maps source spreadsheets against a master classification workbook using a five-tier match strategy with rapidfuzz fuzzy matching.

Role

Sole engineer

Year

2025

Status

live

dataautomationPython

// AI capabilities

Hierarchical classification matching across 9 domains
rapidfuzz-based fuzzy string matching at 80% threshold
Header auto-detection with alias fallback
Confidence-flagging for ambiguous matches

// Architecture flow

Overview

A Streamlit dashboard used internally for supplier onboarding. Drop a vendor's product spreadsheet in, the tool maps every column and value against the internal master classification workbook (New Classifications.xlsx), flags ambiguities for review, and exports a clean enriched workbook with audit tabs.

Problem

Onboarding new suppliers required manually mapping their flat product spreadsheets onto an internal nine-domain classification scheme: Division, Category, Product Group, Family, Brand, Gender, Season, Country of Origin, and HS Code. Each supplier used different column names, different value spellings, and different levels of detail. Days per supplier was the norm.

Approach

Codify the classification logic as a reference repository with indexed lookups. Run incoming data through a five-tier match strategy that escalates from exact code match all the way down to fuzzy string match. Surface ambiguities to a human operator who can override per row. Export an enriched workbook with audit tabs so the receiving team can see exactly how each value was mapped.

Architecture

App: Streamlit (Python). Single-process local app for the data team.
Reference repository: Master workbook parsed once and indexed for fast lookups across 9 classification domains.
Header detection: Auto-detection with alias-based fallback. JSON config file holds known aliases per column.
Five-tier matching: exact code → exact description → normalized → alias → fuzzy (80% threshold via rapidfuzz).
Hierarchical validation: Division → Category → Product Group consistency enforced; mismatches surface as warnings.
Export: Enriched workbook plus a "Mapping Report" tab and an "Ambiguities" tab.
Tests: pytest with sample data; ruff for lint.

Tech stack

Runtime: Python 3.11+, Streamlit
Libraries: rapidfuzz, openpyxl, pandas, pytest, ruff
Config: JSON (column aliases, value aliases, app settings)

Engineering highlights

Five-tier match priority: exact code wins, then exact description, then normalized strings, then aliases, then fuzzy. Each tier annotates the result so reviewers see how it was matched.
Hierarchy enforcement: a Family must roll up to a Product Group that rolls up to a Category. Inconsistencies are flagged, not silently corrected.
Override-aware exports: manual overrides survive re-runs and show up in the audit tab.
Sample data shipped: Adidas, Erke, Scubapro, Killtec, ON sample sheets prove the matcher across very different vendor styles.

Outcome

Operational. Reduces supplier onboarding from days per supplier to minutes, with reviewer-controlled confidence on every mapped value.

Lessons

Classification problems benefit from explicit tiered match strategies more than from ML. The data team can audit "matched via alias" but not "the model said 0.72."
A reference repository (one Excel workbook) is often a better source of truth than a database. The team owns and edits it.
rapidfuzz at 80% is the right floor for this domain; tighter and you miss legit matches, looser and you produce noise.

Want to dig deeper?

Ask my AI agent anything about how this was built, what tradeoffs I made, or how it could fit your team.

Ask my AI →

AI inside

Product Data Enrichment Dashboard

AI-assisted product enrichment pipeline with confidence scoring, source-tracked LLM proposals, and a queue-based architecture that never silently overwrites master data.

AILLMautomationfull-stack

Scheduled Order Sync Automation

GitHub Actions-driven Python pipeline that syncs e-commerce export files into team Google Sheets twice daily with deduplication and structured status columns.

automationPython

E-commerce Delivery Performance Dashboard

Multi-tenant logistics analytics platform merging Shopify orders with carrier delivery data across four stores, with regional SLA tracking and weighted operations scorecards.

full-stackanalyticsautomation