feat: exercise_aliases table + lookup_exercise() alias-aware wrapper
New SQLite table `exercise_aliases (alias, canonical, source)` seeded with ~40 common gym shorthand entries (OHP, RDL, "Bench", "Squat", plural/singular drifts, slang). Lookups go through this table first, then fall through to the strict exercise_db matcher — so the strict matcher's "false negative for ambiguous single tokens" property is preserved while still resolving every-day vocabulary. Schema decision: every seed row is tagged `source='seed'` and re-seeded on every init_db (deleted-then-reinserted), so editing the seed dict in code is the one source of truth. User-inserted rows are tagged `source='user'` and never touched by re-seeding. Migration path covers existing DBs where the `source` column didn't exist (those rows tagged 'seed' on first migration, then refreshed from the current seed). New helper db.lookup_exercise(name) wraps the alias resolution + the exercise_db.lookup() call. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
parent
ebd0016a62
commit
214596e26f
4 changed files with 233 additions and 26 deletions
|
|
@ -82,7 +82,19 @@ def _slim(entry: dict) -> dict:
|
|||
|
||||
|
||||
def lookup(name: str) -> Optional[dict]:
|
||||
"""Return the slim entry for the best name match, or None."""
|
||||
"""Return the slim entry for the best name match, or None.
|
||||
|
||||
Tiers (priority order):
|
||||
1. exact (case-insensitive)
|
||||
2. compressed exact — collapses hyphens/spaces ("Pull-ups" → "Pullups")
|
||||
3. word-boundary substring (only for multi-token inputs)
|
||||
4. token overlap requiring 100% coverage of the user's tokens
|
||||
(only for multi-token inputs)
|
||||
|
||||
Single-token inputs (e.g. "Bench", "Squat", "RDL", "OHP") that don't hit
|
||||
tier 1 or 2 return None — there's no robust way to disambiguate without
|
||||
an alias table.
|
||||
"""
|
||||
if not name:
|
||||
return None
|
||||
needle = name.strip()
|
||||
|
|
@ -93,48 +105,50 @@ def lookup(name: str) -> Optional[dict]:
|
|||
if not compressed:
|
||||
return None
|
||||
|
||||
# 1. Exact (case-insensitive)
|
||||
# 1. Exact (case-insensitive).
|
||||
hit = _BY_LOWER_NAME.get(lower)
|
||||
if hit:
|
||||
return _slim(hit)
|
||||
|
||||
# 2. Compressed exact — catches "Pull-ups" → "Pullups", etc.
|
||||
# 2. Compressed exact — "Pull-ups" → "Pullups".
|
||||
hit = _BY_COMPRESSED.get(compressed)
|
||||
if hit:
|
||||
return _slim(hit)
|
||||
|
||||
# 3. Compressed substring (either direction).
|
||||
substring_candidates: list[dict] = [
|
||||
e for e, c in _COMPRESSED if compressed in c or c in compressed
|
||||
]
|
||||
needle_toks = _tokens(needle)
|
||||
# Below here, partial matches only — and only for multi-token inputs.
|
||||
# A single token is too easily ambiguous (and short acronyms accidentally
|
||||
# hit character-level substrings of unrelated names).
|
||||
if len(needle_toks) < 2:
|
||||
return None
|
||||
|
||||
# 3. Word-boundary substring (lowercase). "bench press" in "bench press
|
||||
# with chains" — yes. "rdl" in "hurdle" — no (no word break).
|
||||
substring_candidates: list[dict] = []
|
||||
for entry in ALL:
|
||||
n = entry.get("name", "")
|
||||
if not n:
|
||||
continue
|
||||
nl = n.lower()
|
||||
if lower in nl or nl in lower:
|
||||
substring_candidates.append(entry)
|
||||
if substring_candidates:
|
||||
# Single-token generics ("Bench", "Squat", "Deadlift") match too many
|
||||
# specific DB entries. Refuse rather than confidently mislead the
|
||||
# user — the planned alias table will handle these properly.
|
||||
needle_toks = _tokens(needle)
|
||||
if len(needle_toks) == 1 and len(substring_candidates) > 2:
|
||||
return None
|
||||
# Prefer the shortest DB name (most specific to the typed input).
|
||||
substring_candidates.sort(key=lambda e: len(e["name"]))
|
||||
return _slim(substring_candidates[0])
|
||||
|
||||
# 4. Token overlap (Jaccard-ish). Require ≥1 shared token AND that the
|
||||
# shared portion covers ≥50% of the user's tokens, so "row" doesn't
|
||||
# match "single arm cable row machine" via one stop-token.
|
||||
needle_toks = _tokens(needle)
|
||||
if not needle_toks:
|
||||
return None
|
||||
# 4. Token overlap, 100% coverage of the user's tokens. So "BB Row" with
|
||||
# tokens {bb, row} only matches a DB entry that contains both — it never
|
||||
# silently latches onto a "row" entry that doesn't share the "bb" cue.
|
||||
best: tuple[float, dict] | None = None
|
||||
for entry, db_toks in _TOKENS:
|
||||
if not db_toks:
|
||||
continue
|
||||
overlap = needle_toks & db_toks
|
||||
if not overlap:
|
||||
if not needle_toks <= db_toks:
|
||||
continue
|
||||
coverage = len(overlap) / len(needle_toks)
|
||||
if coverage < 0.5:
|
||||
continue
|
||||
# Score = coverage, tiebreak by DB-name length (shorter wins).
|
||||
score = coverage - 0.001 * len(entry["name"])
|
||||
# All user tokens matched; tiebreak by DB-name length (shorter wins,
|
||||
# i.e. the most specific variant).
|
||||
score = -len(entry["name"])
|
||||
if best is None or score > best[0]:
|
||||
best = (score, entry)
|
||||
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue