dauntless/quartz/components/scripts
うろちょろ ec26ebcc9e
feat: improve search tokenization for CJK languages (#2231)
* feat: improve search tokenization for CJK languages

Enhance the encoder function to properly tokenize CJK (Chinese, Japanese,
Korean) characters while maintaining English word tokenization. This fixes
search issues where CJK text was not searchable due to whitespace-only
splitting.

Changes:
- Tokenize CJK characters (Hiragana, Katakana, Kanji, Hangul) individually
- Preserve whitespace-based tokenization for non-CJK text
- Support mixed CJK/English content in search queries

This addresses the CJK search issues reported in #2109 where Japanese text
like "て以来" was not searchable because the encoder only split on whitespace.

Tested with Japanese, Chinese, and Korean content to verify character-level
tokenization works correctly while maintaining English search functionality.
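
A minimal sketch of the behaviour described above, not the actual Quartz encoder: whitespace still separates non-CJK words, while every CJK character becomes its own token. The names `isCJK` and `tokenizeMixed` are hypothetical, and the code point ranges are the standard Unicode blocks for the listed scripts, assumed here for illustration.

```typescript
// Hypothetical helper: true for the scripts listed in the commit message.
// The ranges are standard Unicode blocks, assumed for this sketch.
function isCJK(code: number): boolean {
  return (
    (code >= 0x3040 && code <= 0x309f) || // Hiragana
    (code >= 0x30a0 && code <= 0x30ff) || // Katakana
    (code >= 0x4e00 && code <= 0x9fff) || // CJK Unified Ideographs (Kanji/Hanzi)
    (code >= 0xac00 && code <= 0xd7af) // Hangul Syllables
  )
}

// Illustrative tokenizer (the name is an assumption, not the Quartz API).
function tokenizeMixed(text: string): string[] {
  const tokens: string[] = []
  // Whitespace splitting is kept for non-CJK text.
  for (const chunk of text.split(/\s+/)) {
    let word = "" // pending run of non-CJK characters
    for (const ch of Array.from(chunk)) {
      if (isCJK(ch.codePointAt(0)!)) {
        // Flush the pending word, then emit the CJK character on its own.
        if (word !== "") {
          tokens.push(word)
          word = ""
        }
        tokens.push(ch)
      } else {
        word += ch
      }
    }
    if (word !== "") tokens.push(word)
  }
  return tokens
}

console.log(tokenizeMixed("quartz て以来 search"))
// → ["quartz", "て", "以", "来", "search"]
```

With character-level tokens, a query such as "て以来" can match inside a longer run of Japanese text that contains no spaces.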

* perf: optimize CJK search encoder with manual buffer tracking

Replace regex-based tokenization with index-based buffer management.
This improves performance by ~2.93x according to benchmark results.

- Use explicit buffer start/end indices instead of string concatenation
- Replace split(/\s+/) with direct whitespace code point checks
- Remove redundant filter() operations
- Add CJK Extension B support (U+20000-U+2A6DF)

Performance: ~878ms → ~300ms (100 iterations, mixed CJK/English text)
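
The index-based approach listed above could look roughly like the following. This is a sketch under stated assumptions, not the code from the commit: `isSpace`, `isCJKCodePoint`, and `encodeIndexed` are hypothetical names, the whitespace set is a minimal guess, and the pending word is tracked by a single start index with the current position serving as its end.

```typescript
// Whitespace check by code point: space, tab, LF, CR (an assumed, minimal set).
function isSpace(code: number): boolean {
  return code === 0x20 || code === 0x09 || code === 0x0a || code === 0x0d
}

// Assumed ranges; the last one is the U+20000-U+2A6DF block from the list above.
function isCJKCodePoint(code: number): boolean {
  return (
    (code >= 0x3040 && code <= 0x30ff) || // Hiragana and Katakana
    (code >= 0x4e00 && code <= 0x9fff) || // CJK Unified Ideographs
    (code >= 0xac00 && code <= 0xd7af) || // Hangul Syllables
    (code >= 0x20000 && code <= 0x2a6df) // supplementary-plane ideographs
  )
}

// Hypothetical encoder: slices tokens out of the input by index
// instead of building them up with string concatenation.
function encodeIndexed(text: string): string[] {
  const tokens: string[] = []
  let start = -1 // start index of the pending non-CJK word, -1 when none
  let i = 0
  while (i < text.length) {
    const code = text.codePointAt(i)!
    const width = code > 0xffff ? 2 : 1 // UTF-16 units used by this code point
    if (isCJKCodePoint(code)) {
      if (start !== -1) tokens.push(text.slice(start, i))
      start = -1
      tokens.push(text.slice(i, i + width)) // one token per CJK character
    } else if (isSpace(code)) {
      if (start !== -1) tokens.push(text.slice(start, i))
      start = -1
    } else if (start === -1) {
      start = i // a new non-CJK word begins here
    }
    i += width
  }
  if (start !== -1) tokens.push(text.slice(start))
  return tokens
}

console.log(encodeIndexed("検索 search"))
// → ["検", "索", "search"]
```

Because empty tokens are never produced, no trailing `filter()` pass is needed, and stepping by `width` keeps characters outside the Basic Multilingual Plane from being split into surrogate halves.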

* test: add comprehensive unit tests for CJK search encoder

Add 21 unit tests covering:
- English word tokenization
- CJK character-level tokenization (Japanese, Korean, Chinese)
- Mixed CJK/English content
- Edge cases

All tests pass, confirming the encoder correctly handles CJK text.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-12-02 10:04:38 -08:00
..
callout.inline.ts fix(callout): Grid-based callout collapsible animation (#1944) 2025-04-26 11:05:51 -07:00
checkbox.inline.ts feat: support checkbox (closes #646) (#799) 2024-02-04 22:19:25 -08:00
clipboard.inline.ts feat(mermaid): improvement navigation (#1575) 2024-11-10 18:13:12 -05:00
comments.inline.ts feat(giscus): expose language option for Comments component (#2012) 2025-06-08 11:23:01 +02:00
darkmode.inline.ts perf: incremental rebuild (--fastRebuild v2 but default) (#1841) 2025-03-16 14:17:31 -07:00
explorer.inline.ts fix(explorer): Prevent html from being scrollable when mobile explorer is open (#1895) 2025-04-26 11:13:56 -07:00
graph.inline.ts revert(graph): roll back changes due to issues with Safari (#2067) 2025-07-30 18:43:36 +02:00
mermaid.inline.ts fix(mermaid): themechange detector + expand simplification 2025-03-11 11:45:45 -07:00
popover.inline.ts fix(popover): properly clear popover on racing fetch 2025-04-21 23:55:38 -07:00
readermode.inline.ts feat: reader mode 2025-04-17 19:45:17 -07:00
search.inline.ts feat: improve search tokenization for CJK languages (#2231) 2025-12-02 10:04:38 -08:00
search.test.ts feat: improve search tokenization for CJK languages (#2231) 2025-12-02 10:04:38 -08:00
spa.inline.ts Prevent double-loading of afterDOMReady scripts (#2213) 2025-11-27 14:51:56 -08:00
toc.inline.ts fix(toc): Fixed headers in second ToC element not highlight-able 2025-03-30 17:35:20 -07:00
util.ts perf(explorer): client side explorer (#1810) 2025-03-09 14:58:26 -07:00