
If you type "わらう" (laugh) into Slack's emoji search, you get nothing. The same is true for "ぴえん" (a sad-cute slang for crying that took over Japanese Twitter around 2020), or "ばんざい" (a celebratory "hooray"). Unicode's CLDR (Common Locale Data Repository) ships official Japanese annotations for every emoji — but they're written in the register of an accessibility caption, not a search query. I curated 107 emojis by hand with the casual Japanese tags people actually type, and wrapped them in a ~200-line browser search.
🌐 Demo: https://sen.ltd/portfolio/emoji-search-jp/
📦 GitHub: https://github.com/sen-ltd/emoji-search-jp
Why CLDR's Japanese annotations don't make a search index
Unicode ships annotations for every emoji in every CLDR locale. Each entry has a "name" plus a handful of "keywords". For Japanese, the file looks like this:
| Emoji | Name | Keywords |
|---|---|---|
| 😂 | "うれしなき" | ["うれしなき", "かお", "かおえもじ"] |
| 🥺 | "もの欲しそうな顔" | ["かお", "かおえもじ", "けんめい", "もの欲しそう"] |
| 🙏 | "合掌した手" | ["お辞儀", "かたを下げる", "かんしゃ"] |
| 🎉 | "クラッカー" | ["クラッカー", "ハッピー", "パーティー", "紙ふぶき"] |
| 🐶 | "犬の顔" | ["いぬ", "おもしろい", "かお", "どうぶつ"] |
The accessibility job that produced these keywords is the right job for CLDR to do. They cover the visual content of the emoji in a formal-register, all-hiragana style that a screen reader can announce.
What they don't cover is what users actually type:
- 😂: missing わらう/lol/草
- 🥺: missing ぴえん (the slang that defines this emoji in 2020s Japanese)
- 🙏: missing お願い/ありがとう/ごめん — the contexts every user reaches for this emoji in
- 🎉: missing おめでとう
- 🐶: missing わんこ/ワン
There's an upstream proposal to expand CLDR annotations toward search use cases, but the file as it ships today is a captioning dictionary, not a search dictionary. The two have different shapes.
What this repo ships instead
107 emojis with 5-9 hand-curated tags each, totaling about 750 tag entries. Same schema as CLDR ({char, name_ja, name_en, tags, category}) so the two can be merged if you want both registers.
{
  "char": "😂",
  "name_en": "face with tears of joy",
  "name_ja": "嬉し泣きの顔",
  "tags": ["わらう", "大爆笑", "笑い泣き", "嬉し泣き", "lol"],
  "category": "face"
},
{
  "char": "🥺",
  "name_en": "pleading face",
  "name_ja": "うるうる目の顔",
  "tags": ["ぴえん", "かわいい", "うるうる", "おねがい", "切ない"],
  "category": "face"
},
{
  "char": "🙏",
  "name_en": "folded hands",
  "name_ja": "合掌",
  "tags": ["お願い", "おねがい", "ありがとう", "祈る", "感謝", "ごめん"],
  "category": "gesture"
}
Tag-selection rules I followed while curating:
- Mix kana and kanji. "ねこ" and "猫" both belong on 🐱 because the user may or may not have committed an IME conversion.
- Mix register. Both the slang ("ぴえん") and the descriptive ("切ない") for 🥺.
- Borrow from CLDR. Take the official annotations as a baseline and add the spoken-Japanese coverage on top.
- A handful of English tags. lol/ok/love/cool — common enough in mixed-language Japanese chat that they earn their slot.
What I deliberately didn't do: shoot for "all 3,700 emoji from the Unicode emoji list". The lexicon is the product. A smaller, well-tagged set is more useful than a complete, badly-tagged one.
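Since the curated entries share CLDR's shape, the "borrow from CLDR" rule can be mechanized. A minimal merge sketch — the sample rows and the `mergeLexicons` helper are illustrative, not the repo's actual loader:

```javascript
// Merge CLDR-style annotations with curated tags, keyed by emoji character.
// Both inputs use the {char, name_ja, tags} shape described above; the two
// sample rows below are abbreviated for illustration.
const cldr = [
  { char: "😂", name_ja: "うれしなき", tags: ["うれしなき", "かお", "かおえもじ"] },
];
const curated = [
  { char: "😂", name_ja: "嬉し泣きの顔", tags: ["わらう", "大爆笑", "lol"] },
];

function mergeLexicons(base, overlay) {
  const byChar = new Map(base.map((e) => [e.char, { ...e }]));
  for (const entry of overlay) {
    const existing = byChar.get(entry.char);
    if (!existing) {
      byChar.set(entry.char, { ...entry });
      continue;
    }
    // Curated fields win; tags are unioned so both registers stay searchable.
    byChar.set(entry.char, {
      ...existing,
      ...entry,
      tags: [...new Set([...existing.tags, ...entry.tags])],
    });
  }
  return [...byChar.values()];
}

const merged = mergeLexicons(cldr, curated);
console.log(merged[0].tags); // both the CLDR and the curated register
```

The union keeps the formal CLDR keywords searchable alongside the casual tags, which is exactly the "both registers" merge the schema was designed for.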
Weighted scoring, five tiers
Search is a linear scan with a five-tier scoring function. For each query token, against each emoji:
const SCORE = {
  TAG_EXACT: 10,         // token === a tag
  TAG_PREFIX: 7,         // token is a prefix of a tag
  TAG_SUBSTRING: 4,      // token is a substring of a tag (non-prefix)
  NAME_JA_SUBSTRING: 3,  // token in name_ja
  NAME_EN_SUBSTRING: 1,  // last-resort English fallback
};
export function scoreToken(emoji, token) {
  if (!token) return 0;
  let best = 0;
  for (const tag of emoji.tags) {
    const t = normalize(tag);
    if (t === token) return SCORE.TAG_EXACT; // can't beat exact
    if (t.startsWith(token)) best = Math.max(best, SCORE.TAG_PREFIX);
    else if (t.includes(token)) best = Math.max(best, SCORE.TAG_SUBSTRING);
  }
  if (best > 0) return best;
  if (normalize(emoji.name_ja).includes(token)) return SCORE.NAME_JA_SUBSTRING;
  if (normalize(emoji.name_en).includes(token)) return SCORE.NAME_EN_SUBSTRING;
  return 0;
}
The non-obvious choice is the weight gap between TAG_SUBSTRING (4) and NAME_JA_SUBSTRING (3) — a substring hit on a hand-curated tag still beats a coincidental substring hit in the descriptive name. So 顔 ranks "tags containing 顔" above "emojis whose name happens to contain 顔". Without the gap, every face emoji would tie with every cat emoji because both have 顔 in their name_ja.
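The tier gap can be shown with a worked example. The 😂 entry below is the one from the dataset; the 😢 entry is a hypothetical stand-in (its tags are illustrative) whose name_ja contains 泣 but whose tags don't:

```javascript
// Worked example of the scoring tiers: tag-substring (4) beats name-substring (3).
const SCORE = {
  TAG_EXACT: 10,
  TAG_PREFIX: 7,
  TAG_SUBSTRING: 4,
  NAME_JA_SUBSTRING: 3,
  NAME_EN_SUBSTRING: 1,
};
const normalize = (s) => String(s).normalize("NFKC").toLowerCase().trim();

function scoreToken(emoji, token) {
  if (!token) return 0;
  let best = 0;
  for (const tag of emoji.tags) {
    const t = normalize(tag);
    if (t === token) return SCORE.TAG_EXACT;
    if (t.startsWith(token)) best = Math.max(best, SCORE.TAG_PREFIX);
    else if (t.includes(token)) best = Math.max(best, SCORE.TAG_SUBSTRING);
  }
  if (best > 0) return best;
  if (normalize(emoji.name_ja).includes(token)) return SCORE.NAME_JA_SUBSTRING;
  if (normalize(emoji.name_en).includes(token)) return SCORE.NAME_EN_SUBSTRING;
  return 0;
}

const tears = { char: "😂", name_ja: "嬉し泣きの顔", name_en: "face with tears of joy",
                tags: ["わらう", "大爆笑", "笑い泣き", "嬉し泣き", "lol"] };
const cry   = { char: "😢", name_ja: "泣き顔", name_en: "crying face",
                tags: ["かなしい", "涙"] }; // hypothetical tags, for illustration

console.log(scoreToken(tears, "わらう")); // 10: exact tag hit
console.log(scoreToken(tears, "泣"));     // 4: 泣 is a non-prefix substring of tag 笑い泣き
console.log(scoreToken(cry, "泣"));       // 3: only name_ja contains 泣
```

With the gap in place, the curated-tag hit on 😂 outranks the name-only hit on 😢 for the same token.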
Multi-token AND with sum-of-scores ranking
Whitespace splits the query into tokens. The default behaviour is that every token must contribute at least one point; otherwise the emoji is dropped:
export function scoreEmoji(emoji, tokens) {
  if (tokens.length === 0) return 0;
  let sum = 0;
  for (const tok of tokens) {
    const s = scoreToken(emoji, tok);
    if (s === 0) return -1; // any miss → drop
    sum += s;
  }
  return sum;
}
So "わらう 顔" keeps 😂 (10 from tag わらう + 3 from name_ja containing 顔 = 13) and drops 🐱 (3 from 顔 but no signal for わらう).
The unit tests pin this:
test("scoreEmoji returns -1 if any token fails to match", () => {
  const s = scoreEmoji(SAMPLE_FACE_WITH_TEARS, ["わらう", "車"]);
  assert.equal(s, -1); // dropped
});

test("search drops emojis that don't satisfy ALL tokens", () => {
  const results = search(SAMPLE, "わらう 顔");
  const chars = results.map((r) => r.emoji.char);
  assert.ok(chars.includes("😂"));
  assert.ok(!chars.includes("🐱"));
});
NFKC normalization upfront
Users type half-width and full-width characters, stray IME spaces, and mixed case. One normalize call handles all of it:
export function normalize(s) {
  return String(s).normalize("NFKC").toLowerCase().trim();
}
ＬＯＬ (full-width) becomes lol; ﾜﾗｳ (half-width katakana) becomes ワラウ; Pien becomes pien. Everything downstream operates on the normalized form. The boundary is one line and gets tested:
test("search is case- and width-insensitive (NFKC)", () => {
  const results = search(SAMPLE, "LOL");
  assert.equal(results[0].emoji.char, "😂");
});
Stable sort matters more than the algorithm
Array.prototype.sort has been stable since ES2019, so equal-keyed elements keep their original order. But equal scores are common in this dataset — 顔 lands at NAME_JA_SUBSTRING (3) for every face emoji — so I made the tie-breaker explicit:
matches.sort((a, b) => {
  if (b.score !== a.score) return b.score - a.score;
  return a.idx - b.idx; // tie-break by input order
});
The effect: when many emojis tie on score, the curated order (the order I wrote them into data.json) survives. I put the most-used emoji first in every category, so ties resolve to "the more-used one".
test("search is stable: equal-scoring matches keep input order", () => {
  // "顔" → 3 emojis all at NAME_JA_SUBSTRING (3).
  const results = search(SAMPLE, "顔");
  const chars = results.map((r) => r.emoji.char);
  assert.deepEqual(chars, ["😂", "🥺", "🐱"]); // input order preserved
});
What I didn't build
- Trie / inverted index — 107 entries × 7 tags is 0.1 ms of linear scan on a phone. The index only pays off past ~10,000 entries.
- Fuzzy / Levenshtein matching — typo tolerance complicates the score function. Prefix + substring already cover ~80% of real misses, and the cost of adding fuzzy is a noticeable jump in false positives.
- Skin-tone / gender variants — exploding the entry count by 5-10× hurts search quality. Native OS emoji pickers cover this better.
- Speech-readout accessibility annotations — that's exactly what CLDR is for. This repo is the other annotation register.
Try it
The demo at https://sen.ltd/portfolio/emoji-search-jp/ has the full lexicon. Try わらう, ぴえん, ばんざい, 猫, ハート, 寿司, お願い, ありがとう. Pressing / focuses the search box. Click an emoji to copy it.
Source: https://github.com/sen-ltd/emoji-search-jp — MIT, ~200 lines of JS plus 107 entries of curated data, 18 unit tests, no build step, no runtime dependencies.
🛠 Built by SEN LLC as part of an ongoing series of small, focused developer tools. Browse the full portfolio for more.