SUEN

High scores, good universities

It started as a way to check what Chinese (語文) scores are really worth at Gaokao level, so I set up a dedicated conversion-and-audit report on top of Google NotebookLM: partly so students can see for themselves which problems they need to solve, and partly so it can be pulled up at any time for parents who want it:

Real scores

2025-11-14

Caring about scores is of course right, but why do you care about them? Caring for caring's sake, out of pure love for scores?
Surely not. What you actually care about is the Gaokao score.

Later, at a mentor meeting, after importing every mentee's subject scores, I realized that all it takes is a copy of the 2025 Gaokao one-point-rank table (一分一段) with the relevant cut-off lines, plus Haidian's just-released senior-year midterm converted-score (賦分) table, to brute-force what each of a student's past exam results corresponds to in Gaokao terms. I say brute-force because, for one, the English listening-and-speaking score is not included, and for another, there is no official data behind the converted-score table. So I opened a new notebook:

The university that will admit you

2025-11-20

Mentor meeting today; some present, some absent. A good thing.

Google NotebookLM does support multimodal data, but notebook pages like these are hard to reverse-proxy, so without a VPN it is still unusable.
But once the pattern is worked out, everything has a solution.

The prerequisite is to turn the 2025 Beijing one-point-rank table and cut-off-line PDFs, plus the per-subject admissions data crawled from the exam authority's website, directly into JSON. I gave Antigravity two rounds at it; it is still in its infancy, and the results were neither accurate nor complete, so unusable. Instead I wrote two separate scripts, one to process the data and one to verify it, and finally got accurate JSON. After that, the whole thing can be deployed on any machine.

I figured this project would take half a day at most; in the end... the entire weekend, four half-days, went into it.
There really were quite a few pitfalls, and partway through I even had to switch VPS and redeploy.

This RAG setup is certainly not as good as using Google NotebookLM directly, so I probably won't update it any further. Not fun, not fun.

Page: https://750.bdfz.net/

If you can get past the Great Firewall, this page is still not as good as the notebook itself: Daily scores to Gaokao scores

In truth, for a school, the real validity of this approach lies in matching its own students' daily scores against their Gaokao scores over the years. I wanted to do that years ago but the technology wasn't there; now the technology is there, and, well, there is no longer any need to do it.

For most people the Gaokao is just a game they have to play and that isn't much fun, so cheat as sensibly as you can.


This document explains how to deploy 750-bjgk-chat (FastAPI + Gemini SSE streaming + JSON-based corpus retrieval), and how the project uses bjgk_corpus.json for grounded Beijing Gaokao advising.


1. What this project is

750-bjgk-chat is a lightweight RAG-style chat service for Beijing Gaokao admissions guidance.

The service answers only from the JSON corpus and explicitly replies "data missing" when retrieval fails.
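
A rough sketch of how that guard can be enforced; the function and constant names below are illustrative, not the actual identifiers in server.py:

python
# Illustrative grounding guard (names are assumptions, not server.py's real API).
NO_DATA_REPLY = "data missing"

def build_prompt(question: str, hits: list[dict]) -> str | None:
    if not hits:                  # retrieval found nothing -> refuse instead of guessing
        return None               # caller streams NO_DATA_REPLY back to the client
    context = "\n\n".join(f"[{h.get('source', '?')}] {h['text']}" for h in hits)
    return (
        "You are a Beijing Gaokao admissions assistant. Answer ONLY from <CONTEXT>. "
        f"If the answer is not in <CONTEXT>, reply exactly: {NO_DATA_REPLY}\n\n"
        f"<CONTEXT>\n{context}\n</CONTEXT>\n\n"
        f"Question: {question}"
    )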


2. Repo layout (expected)

text
750-bjgk-chat/
├─ server.py
├─ bjgk_corpus.json
├─ web/
│  ├─ index.html
│  ├─ style.css
│  └─ app.js
├─ requirements.txt (optional)
└─ systemd/
   └─ bjgk-chat-750.service (example)

On the VPS, we recommend installing to:

text
/opt/750-bjgk-chat

3. Requirements

3.1 macOS local dev

Baseline (per your usual setup): keep using your fixed venv.

Install the Python deps inside it:

bash
python -V
which python

python -m pip install -U pip
python -m pip install fastapi uvicorn requests numpy google-generativeai

3.2 Ubuntu VPS

Install deps:

bash
python3 -V
python3 -m pip install -U pip
python3 -m pip install fastapi uvicorn requests numpy google-generativeai

4. Configuration

All config is environment-variable driven.

4.1 Core env vars

Variable             Meaning                                         Default
BJGK_CORPUS_PATH     Path to the corpus JSON                         ./bjgk_corpus.json
GEMINI_API_KEY       Gemini API key                                  (required)
GEMINI_MODEL         Gemini model                                    gemini-2.5-flash
TOPK                 Retrieval top-k                                 8
CHUNK_TEXT_LIMIT     Trim each chunk to this length before sending   380
AUTO_EXPAND_QUERY    Expand the query to improve recall              true
HARD_BLOCK_ON_EMPTY  If true, stop when retrieval returns nothing    false
PORT                 Local FastAPI port                              8000
LOG_LEVEL            Logging level                                   INFO
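
For reference, the table above maps onto environment parsing roughly like this (a sketch; only the variable names and defaults come from the table, the parsing code itself is illustrative):

python
import os

CORPUS_PATH         = os.getenv("BJGK_CORPUS_PATH", "./bjgk_corpus.json")
GEMINI_API_KEY      = os.environ["GEMINI_API_KEY"]                # required, no default
GEMINI_MODEL        = os.getenv("GEMINI_MODEL", "gemini-2.5-flash")
TOPK                = int(os.getenv("TOPK", "8"))
CHUNK_TEXT_LIMIT    = int(os.getenv("CHUNK_TEXT_LIMIT", "380"))
AUTO_EXPAND_QUERY   = os.getenv("AUTO_EXPAND_QUERY", "true").lower() == "true"
HARD_BLOCK_ON_EMPTY = os.getenv("HARD_BLOCK_ON_EMPTY", "false").lower() == "true"
PORT                = int(os.getenv("PORT", "8000"))
LOG_LEVEL           = os.getenv("LOG_LEVEL", "INFO")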

4.2 Example .env for VPS

Create /opt/750-bjgk-chat/.env:

env
BJGK_CORPUS_PATH=/opt/750-bjgk-chat/bjgk_corpus.json
GEMINI_API_KEY=YOUR_KEY_HERE
GEMINI_MODEL=gemini-2.5-flash
TOPK=8
AUTO_EXPAND_QUERY=true
CHUNK_TEXT_LIMIT=380
HARD_BLOCK_ON_EMPTY=false
PORT=8000
LOG_LEVEL=INFO

5. Running locally (macOS)

From repo root:

bash
cd /Users/ylsuen/Desktop/750-bjgk-chat

# ensure your fixed venv is already active
python server.py

Test health:

bash
curl -sS http://127.0.0.1:8000/healthz | jq .

Test SSE chat:

bash
curl -N \
  -H "Content-Type: application/json" \
  -d '{"message":"北京大学 北京 2025 投档线 物理组","history":[]}' \
  http://127.0.0.1:8000/api/chat
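
If you prefer Python over curl, a minimal SSE client could look like this, assuming the endpoint emits standard "data: ..." lines (the exact payload format is an assumption, not documented here):

python
import requests

resp = requests.post(
    "http://127.0.0.1:8000/api/chat",
    json={"message": "北京大学 北京 2025 投档线 物理组", "history": []},
    stream=True,
)
for raw in resp.iter_lines(decode_unicode=True):
    # SSE frames arrive as "data: <chunk>" lines separated by blank lines.
    if raw and raw.startswith("data: "):
        print(raw[len("data: "):], end="", flush=True)
print()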

6. JSON corpus: usage, format, pros/cons

6.1 How bjgk_corpus.json is used

At startup:

  1. server.py loads BJGK_CORPUS_PATH
  2. The JSON is normalized into a list of chunks
  3. Retriever builds:
    • Chinese bigram token corpus
    • TF-IDF matrix
    • BM25 stats
  4. For each user query, the server:
    • optionally expands the query (AUTO_EXPAND_QUERY)
    • retrieves the top-k chunks
    • injects them into <CONTEXT>
  5. The LLM must answer only from <CONTEXT> (a retrieval sketch follows below)
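
For concreteness, here is a minimal sketch of the bigram + TF-IDF half of steps 3-4, assuming the corpus is already a flat list of {"text", "source"} chunks; server.py also keeps BM25 statistics, which are omitted here:

python
import json, math
from collections import Counter

def bigrams(text: str) -> list[str]:
    # Chinese has no whitespace tokens, so index overlapping character bigrams.
    chars = [c for c in text if not c.isspace()]
    return [a + b for a, b in zip(chars, chars[1:])]

chunks = json.load(open("bjgk_corpus.json", encoding="utf-8"))
docs = [Counter(bigrams(c["text"])) for c in chunks]
df = Counter(t for d in docs for t in d)                  # document frequency per bigram
idf = {t: math.log((len(docs) + 1) / (df[t] + 1)) + 1 for t in df}

def score(query_tf: Counter, doc_tf: Counter) -> float:
    return sum(q * doc_tf[t] * idf.get(t, 0.0) ** 2 for t, q in query_tf.items())

def retrieve(query: str, topk: int = 8) -> list[dict]:
    q = Counter(bigrams(query))
    ranked = sorted(((score(q, d), i) for i, d in enumerate(docs)), reverse=True)
    return [chunks[i] for s, i in ranked[:topk] if s > 0]

for hit in retrieve("北京大学 2025 投档线 物理组"):
    print(hit.get("source"), hit["text"][:40])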

6.2 Supported JSON shapes

The loader accepts:

Shape A: list of dict chunks

json
[
  {"text": "...", "source": "2025本科普通批投档线", "page": 12},
  {"text": "...", "source": "一分一段表", "page": 3}
]

Shape B: dict containing a list

json
{
  "chunks": [
    {"text": "...", "source": "..."}
  ]
}

If no valid list can be found, healthz will show:

Corpus JSON must be a list of chunks or a dict containing a chunks list.
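
A loader sketch that accepts both shapes and raises exactly that message (the function name is an assumption, not the one in server.py):

python
import json

def load_chunks(path: str) -> list[dict]:
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    if isinstance(data, list):                 # Shape A: bare list of chunk dicts
        return data
    if isinstance(data, dict):                 # Shape B: dict wrapping a "chunks" list
        candidates = data.get("chunks")
        if isinstance(candidates, list):
            return candidates
    raise ValueError(
        "Corpus JSON must be a list of chunks or a dict containing a chunks list."
    )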

6.3 What a “chunk” should contain

Minimum: a text field holding the chunk content.

Recommended: source and page for traceability, plus structured metadata such as year, school, group, and requirement.

Example:

json
{
  "text": "北京大学 2025 本科普通批 03(物理+化学) 投档线 687 ...",
  "source": "Beijing 2025本科普通批投档线",
  "page": 5,
  "year": 2025,
  "school": "北京大学",
  "group": "03",
  "requirement": "物理+化学"
}

Even if metadata is not used yet, adding it now enables phase-2 “structured lookup first”.
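
A phase-2 lookup could then filter on those fields before falling back to text retrieval; a minimal sketch using the example's field names (everything else is illustrative):

python
def structured_lookup(chunks: list[dict], school: str, year: int,
                      group: str | None = None) -> list[dict]:
    # Exact-match filter on metadata; an empty result means the caller should
    # fall back to the TF-IDF/BM25 text retrieval described in 6.1.
    return [
        c for c in chunks
        if c.get("school") == school
        and c.get("year") == year
        and (group is None or c.get("group") == group)
    ]

# e.g. structured_lookup(chunks, "北京大学", 2025, "03")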


6.4 Strengths of JSON-based corpus

  1. Grounded, auditable answers
    Every answer can be traced to your curated dataset.

  2. Fast iteration
    Updating bjgk_corpus.json → restart → new knowledge online immediately.

  3. No dependence on the live web
    No live crawling, which avoids stale or hallucinated sources.

  4. Easy to version control
    JSON is diff-friendly; you can tag releases by admission season.


6.5 Weaknesses / limitations

  1. Chunking quality dominates accuracy
    With only ~30 chunks, missing a single synonym can cause a retrieval miss.

  2. Unstructured text retrieval can be dumb
    When data is inherently tabular (score lines, percentile distributions), pure text RAG may:

    • retrieve the wrong year
    • retrieve a nearby school
    • miss the exact group id

    This is why phase-2 should add structured lookup → LLM explanation.

  3. Small corpus = recall ceiling
    If the JSON does not contain that school/year/batch, the AI must say “data missing”.

  4. No automatic validation
    The backend trusts your JSON; if a line is wrong, the AI will repeat it (a small check is sketched below).
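
A small pre-deployment check mitigates point 4; a minimal sketch that treats only text as mandatory and warns on a missing source:

python
import json, sys

def validate(path: str) -> int:
    data = json.load(open(path, encoding="utf-8"))
    chunks = data if isinstance(data, list) else data.get("chunks", [])
    problems = 0
    for i, c in enumerate(chunks):
        if not isinstance(c, dict) or not str(c.get("text", "")).strip():
            print(f"chunk {i}: missing or empty 'text'")
            problems += 1
        elif "source" not in c:
            print(f"chunk {i}: no 'source', answer cannot be traced")
            problems += 1
    print(f"{len(chunks)} chunks checked, {problems} problem(s)")
    return problems

if __name__ == "__main__":
    sys.exit(1 if validate(sys.argv[1]) else 0)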


6.6 Best practices for corpus maintenance

  1. Keep “one table per chunk”
    Avoid mixing multiple years/schools into one block.

  2. Put the key fields early
    Retriever heavily weights early tokens.

  3. Preserve original numbers verbatim
    Don’t reformat unless you must.

  4. Include aliases in text
    e.g., “北京大学(北大)”

  5. Version your corpus

    text
    bjgk_corpus_2025-11-23.json
    bjgk_corpus_latest.json -> symlink

7. Troubleshooting

7.1 Health check says chunks = 0

Check:

bash
ls -lh /opt/750-bjgk-chat/bjgk_corpus.json
python3 - <<'PY'
import json
# Expect <class 'list'> or <class 'dict'> (see 6.2); anything else means the file is malformed.
print(type(json.load(open("/opt/750-bjgk-chat/bjgk_corpus.json"))))
PY

7.2 SSE streaming garbled text

You already fixed this in Nginx by disabling buffering and using utf-8.

Double-confirm:

  1. Buffering is still disabled for the /api/chat location (proxy_buffering off;)
  2. The stream is served as UTF-8 (charset utf-8;)

7.3 AI says “data missing” but JSON contains it

Typical causes:

  1. Chunk doesn’t include the exact school/year keyword
  2. Chunk too long, key field buried
  3. Synonym mismatch
  4. Follow-up question loses context

Fix:

  1. Add the exact school/year keywords (and common aliases) to the chunk text
  2. Split long chunks and move key fields to the front (see 6.6)
  3. Keep AUTO_EXPAND_QUERY=true and/or raise TOPK (a query-expansion sketch follows)
  4. Restate the school/year in follow-up questions instead of relying on history
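
For cause 3, this is roughly what AUTO_EXPAND_QUERY is meant to help with; the alias table is illustrative and server.py's real expansion logic may differ:

python
ALIASES = {
    "北大": "北京大学",
    "清华": "清华大学",
}

def expand_query(query: str) -> str:
    # Append full names for any short aliases found in the query, improving recall.
    extra = [full for short, full in ALIASES.items()
             if short in query and full not in query]
    return query + " " + " ".join(extra) if extra else query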


8. Upgrade workflow

  1. Update JSON
  2. Restart service
  3. Run smoke tests
bash
cd /opt/750-bjgk-chat

# update corpus
cp bjgk_corpus_new.json bjgk_corpus.json

systemctl restart bjgk-chat-750
sleep 1
curl -sS http://127.0.0.1:8000/healthz | jq .

9. Roadmap

To eliminate the remaining “JSON has it but AI missed it” problem:

Phase-2: Structured first, RAG second

This will convert the system from “keyword-search RAG” to a true data-driven advisor.

NO! NO! NO!
No more tinkering. Let RAG rest in peace :)