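Monitoring the Beijing Education Examinations Authority Website
Among official sites that are almost entirely opaque, the Beijing Education Examinations Authority (BJEEA) is one of the few that offers at least some valuable data. It had been sitting on my navigation page for a long time, and lately I kept clicking into it while scraping data. So I decided to monitor the site outright and feed it directly into the forum.
Along the way I had to switch to a different VPS: I first picked a JP machine, and BJEEA dropped 100% of its packets...
This kind of thing has become almost routine in recent years. I once said that the "Internet" (互聯網) is neither "inter", nor "connected", nor a "net", and this is exactly the reason.
Switched to SG and, well, it worked.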
Pattern: Mirror important announcements from any external site into a single Discourse summary topic
This document describes the full workflow we used for the BJEEA → Discourse pipeline:
- moving the watcher to a new VPS,
- deploying the Python watcher,
- wiring it into systemd,
- fixing UTF-8 body text,
- and adding a small helper script to “free” one URL so it can be reposted.
The same pattern can be reused for any other website you want to monitor.
1. High-Level Architecture
Goal:
When a target website (e.g. BJEEA) publishes a new article, we want:
- The watcher to detect it from the index page.
- Fetch the detail page.
- Extract metadata and the full body.
- Convert the body to Markdown.
- Append a nicely formatted reply to a fixed Discourse topic (summary thread).
- Store the article URL in a state file so the same article is not posted again.
Key components:
- A small Python project under /opt/bjeea-watch (or any name).
- A virtualenv in /opt/bjeea-watch/.venv.
- A state file in /var/lib/bjeea-watch/state.json.
- A systemd service (and optionally a timer) which runs the script every few minutes.
- A helper CLI free_one_from_index.py to release one URL from the “seen list” so the watcher will repost it.
This design deliberately keeps the watcher stateless in code and isolates state in a single JSON file.
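As a rough illustration of keeping all state in one JSON file, the sketch below shows one way load_state and save_state could look. The state layout (section key mapped to a list of seen URLs) and the atomic temp-file write are assumptions made for this example, not a description of the actual watcher.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Dict, List

# Assumed layout: section key -> list of already-posted article URLs
State = Dict[str, List[str]]


def load_state(path: Path) -> State:
    """Return the saved state, or an empty dict if the file does not exist yet."""
    if not path.exists():
        return {}
    with path.open("r", encoding="utf-8") as fh:
        return json.load(fh)


def save_state(path: Path, state: State) -> None:
    """Write to a temp file in the same directory, then atomically replace the target."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            # ensure_ascii=False keeps any non-ASCII text readable in the file
            json.dump(state, fh, ensure_ascii=False, indent=2)
        os.replace(tmp_name, path)
    except BaseException:
        # Clean up the temp file if writing or replacing failed
        os.unlink(tmp_name)
        raise
```

Writing through a temp file means an interrupted run should leave either the old or the new state.json on disk, not a half-written one.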
2. Preparing a New VPS
These steps assume Ubuntu (root) and Python 3.12+ are available.
2.1. Basic packages
apt update
apt upgrade -y
apt install -y python3 python3-venv python3-pip git curl ca-certificates
2.2. Directory layout
# Code lives here
mkdir -p /opt/bjeea-watch
cd /opt/bjeea-watch
# State lives here
mkdir -p /var/lib/bjeea-watch
chmod 700 /var/lib/bjeea-watch
If you want to run as a non-root user, create that user and chown the directories accordingly. For simplicity, this manual uses root.
3. Deploying the Watcher Code
3.1. Create and activate virtualenv
cd /opt/bjeea-watch
python3 -m venv .venv
source .venv/bin/activate
3.2. Install Python dependencies
Typical stack (adjust as needed):
pip install --upgrade pip
pip install requests beautifulsoup4 markdownify python-dotenv
3.3. Core watcher script
Your actual file may already exist. The essential responsibilities are:
- Define sections (e.g. hk, gkgz) and their metadata.
- Load / save state from /var/lib/bjeea-watch/state.json.
- For each section:
  - Fetch the index URL.
  - Extract a list of article URLs.
  - Compute new_urls = index_urls − seen_urls.
  - For each new URL:
    - Fetch the detail page.
    - Parse:
      - title
      - publication date
      - body (main content container)
    - Convert body → Markdown.
    - Post to Discourse.
    - Append URL to seen_urls.
The watcher exposes a few helper functions we will reuse in the CLI script:
- SECTIONS: Dict[str, SectionConfig]
- build_session()
- fetch_html(session, url)
- extract_article_links(soup, section_cfg)
- get_state_path()
- load_state(path)
- save_state(path, state_dict)
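The "new_urls = index_urls − seen_urls" step can be sketched as below. The function name find_new_urls and the example.com URLs are hypothetical. The sketch deliberately avoids a plain set difference so that the index order is preserved and duplicate links on the index page are emitted only once.

```python
from typing import Iterable, List, Set


def find_new_urls(index_urls: Iterable[str], seen_urls: Set[str]) -> List[str]:
    """Return index URLs that are not in seen_urls, in index order, without duplicates."""
    new_urls: List[str] = []
    emitted: Set[str] = set()
    for url in index_urls:
        if url in seen_urls or url in emitted:
            continue
        emitted.add(url)
        new_urls.append(url)
    return new_urls


# Hypothetical data: 3.html is linked twice on the index, 1.html was already posted
index = [
    "https://example.com/3.html",
    "https://example.com/2.html",
    "https://example.com/3.html",
    "https://example.com/1.html",
]
seen = {"https://example.com/1.html"}
print(find_new_urls(index, seen))
```

Keeping the index order lets the watcher decide explicitly whether to post newest-first or oldest-first.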
3.4. UTF-8 body text (fixing mojibake)
The key for Chinese content is to be strict and explicit about encoding. In the watcher, prefer:
import requests
from bs4 import BeautifulSoup

def build_session() -> requests.Session:
    s = requests.Session()
    s.headers.update({
        "User-Agent": "bjeea-watch/1.0 (+your-email@example.com)",
    })
    return s

def fetch_html(session: requests.Session, url: str) -> BeautifulSoup:
    resp = session.get(url, timeout=20)
    resp.raise_for_status()
    # Force UTF-8 decoding, do NOT rely on incorrect headers
    content = resp.content.decode("utf-8", errors="strict")
    soup = BeautifulSoup(content, "html.parser")
    return soup

This avoids the “é«ä¸…” style mojibake you saw when the server sends UTF-8 but the client decodes as Latin-1.
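To see why the explicit decode matters, the following standalone sketch reproduces the failure mode without any network access. The sample string is an assumption chosen for illustration. For context, requests falls back to ISO-8859-1 for text/* responses whose Content-Type header carries no charset, which is one common way to end up on the wrong path.

```python
# Sample text (an assumption for this demo) and its UTF-8 bytes as a server would send them
text = "高中"
raw = text.encode("utf-8")

# Wrong: decoding UTF-8 bytes as Latin-1 turns each byte into its own character
wrong = raw.decode("latin-1")

# Right: strict UTF-8 decoding, which raises UnicodeDecodeError on invalid bytes
right = raw.decode("utf-8", errors="strict")

print(repr(wrong))
print(right)

# Mojibake produced this way is reversible as long as no bytes were dropped
assert wrong.encode("latin-1").decode("utf-8") == text
```

With errors="strict", a page that is not actually UTF-8 fails loudly instead of silently posting garbled text to the forum.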
4. Discourse Integration
4.1. Environment variables
Configure Discourse access via environment variables (or a .env file):
export DISCOURSE_BASE_URL="https://forum.rdfzer.com"
export DISCOURSE_API_KEY="<YOUR_API_KEY>"
export DISCOURSE_API_USERNAME="<YOUR_USERNAME>" # usually 'system' or an admin user
Each section (hk, gkgz, etc.) can have its own summary topic ID, e.g.:
- hk → topic_id 9657
- gkgz → another topic ID
In code, the SectionConfig can carry topic_id.
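One possible shape for such a config is sketched below. The field names (key, index_url, topic_id, link_selector) and the default selector are assumptions for illustration. The index URLs are the ones that appear in the watcher's log output, and the gkgz topic ID is a placeholder because the real value is not given here.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class SectionConfig:
    key: str
    index_url: str
    topic_id: int                   # Discourse summary topic that receives replies
    link_selector: str = "a[href]"  # hypothetical default CSS selector for article links


SECTIONS: Dict[str, SectionConfig] = {
    "hk": SectionConfig(
        key="hk",
        index_url="https://www.bjeea.cn/html/hk/index.html",
        topic_id=9657,
    ),
    "gkgz": SectionConfig(
        key="gkgz",
        index_url="https://www.bjeea.cn/html/gkgz/index.html",
        topic_id=0,  # placeholder: replace with the real gkgz topic ID
    ),
}
```

Making the dataclass frozen keeps section metadata read-only at runtime, so all mutable data stays in the state file.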
4.2. Posting a reply
A minimal example using Discourse’s /posts.json API:
import os
import requests

DISCOURSE_BASE_URL = os.environ["DISCOURSE_BASE_URL"]
DISCOURSE_API_KEY = os.environ["DISCOURSE_API_KEY"]
DISCOURSE_API_USERNAME = os.environ["DISCOURSE_API_USERNAME"]

def post_reply(topic_id: int, markdown_body: str) -> int:
    url = f"{DISCOURSE_BASE_URL}/posts.json"
    payload = {
        "topic_id": topic_id,
        "raw": markdown_body,
    }
    headers = {
        "Api-Key": DISCOURSE_API_KEY,
        "Api-Username": DISCOURSE_API_USERNAME,
    }
    r = requests.post(url, json=payload, headers=headers, timeout=20)
    r.raise_for_status()
    return r.status_code

On success, the watcher logs:
[INFO] bjeea_watch: Created reply in topic_id=9657 (status=200)
5. Systemd Unit and Timer
5.1. Service unit
Create /etc/systemd/system/bjeea-watch.service:
[Unit]
Description=BJEEA page watcher - append updates into Discourse summary topic
After=network-online.target
Wants=network-online.target
[Service]
Type=oneshot
WorkingDirectory=/opt/bjeea-watch
ExecStart=/opt/bjeea-watch/.venv/bin/python /opt/bjeea-watch/bjeea_watch.py
User=root
Group=root
# Or a dedicated user if you prefer
# Environment (or use EnvironmentFile=/etc/bjeea-watch.env)
Environment=DISCOURSE_BASE_URL=https://forum.rdfzer.com
Environment=DISCOURSE_API_KEY=<YOUR_API_KEY>
Environment=DISCOURSE_API_USERNAME=system
[Install]
WantedBy=multi-user.target
Reload:
systemctl daemon-reload
5.2. Timer (optional but recommended)
Create /etc/systemd/system/bjeea-watch.timer:
[Unit]
Description=Run BJEEA watcher every 5 minutes
[Timer]
OnBootSec=2min
OnUnitActiveSec=5min
Unit=bjeea-watch.service
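# Note: Persistent= only has an effect on OnCalendar= timers; it is a no-op for the monotonic timers used here
Persistent=true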
[Install]
WantedBy=timers.target
Enable and start:
systemctl enable --now bjeea-watch.timer
You can always trigger a manual run:
systemctl start bjeea-watch.service
journalctl -u bjeea-watch.service -n 50 --no-pager
Sample healthy log:
Nov 30 13:06:16 sg-gc-bf python[146434]: 2025-11-30 13:06:16,004 [INFO] bjeea_watch: bjeea_watch.py starting (version 1.3.1)
Nov 30 13:06:28 sg-gc-bf python[146434]: 2025-11-30 13:06:28,789 [INFO] bjeea_watch: Extracted 20 article links from https://www.bjeea.cn/html/hk/index.html
Nov 30 13:06:28 sg-gc-bf python[146434]: 2025-11-30 13:06:28,789 [INFO] bjeea_watch: No new items for section hk.
Nov 30 13:06:29 sg-gc-bf python[146434]: 2025-11-30 13:06:29,204 [INFO] bjeea_watch: Extracted 34 article links from https://www.bjeea.cn/html/gkgz/index.html
Nov 30 13:06:29 sg-gc-bf python[146434]: 2025-11-30 13:06:29,205 [INFO] bjeea_watch: No new items for section gkgz.
6. Helper Script
Sometimes you want to force the watcher to repost an article which is already in the state file. Typical use‑cases:
- You changed the formatting logic and want to regenerate a post.
- A Discourse error happened on the first attempt.
- You want to “replay” a specific URL to test changes.
6.1. CLI behavior
The helper script:
- Loads the same state file as the watcher.
- Optionally refetches the index page to print current URLs.
- Removes either:
  - one URL that appears in the index (or an older one you choose), or
  - a specific --url given on the command line.
- Writes a backup state.json.bak before modifying the state.
- Saves the updated state.json.
- The next watcher run will see that URL as new and post it again.
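The core of that behavior can be sketched as follows. This is not the actual free_one_from_index.py: the function name free_url is hypothetical, the state layout (section key mapped to a list of URLs) is assumed, and when no URL is given the sketch frees the most recently recorded one instead of consulting the index page.

```python
import json
import shutil
from pathlib import Path
from typing import Optional


def free_url(state_path: Path, section: str, url: Optional[str] = None) -> Optional[str]:
    """Remove one URL from state[section]; return it, or None if nothing was removed."""
    state = json.loads(state_path.read_text(encoding="utf-8"))
    seen = state.get(section, [])
    if url is None:
        if not seen:
            return None
        url = seen[-1]  # assumption: free the most recently recorded URL
    elif url not in seen:
        return None

    # Back up the untouched file before modifying anything
    backup = state_path.with_name(state_path.name + ".bak")
    shutil.copy2(state_path, backup)

    seen.remove(url)
    state[section] = seen
    state_path.write_text(
        json.dumps(state, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return url
```

Returning None without writing a backup when there is nothing to remove avoids overwriting a useful state.json.bak with an identical copy.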
6.2. Example usage
cd /opt/bjeea-watch
source .venv/bin/activate
# Generic: free one URL from the hk section (taken from index)
python free_one_from_index.py --section hk
# Targeted: free a specific URL from the hk section
python free_one_from_index.py --section hk --url https://www.bjeea.cn/html/hk/faq/2025/1017/87375.html
Sample output:
[INFO] state file: /var/lib/bjeea-watch/state.json
[INFO] Section 'hk' currently has 67 seen URLs.
[INFO] Index page https://www.bjeea.cn/html/hk/index.html currently has 20 article URLs.
[INFO] Backup written to: /var/lib/bjeea-watch/state.json.bak
[INFO] Removed URL from section 'hk':
[INFO] https://www.bjeea.cn/html/hk/qxkb/2021/0108/77811.html
After that, run the watcher again:
systemctl start bjeea-watch.service
journalctl -u bjeea-watch.service -n 30 --no-pager
You should see a “Found 1 new items” log and a new reply in the Discourse topic.
7. Adapting to Other Sites
The architecture is reusable for any site with a reasonably stable HTML structure.
To adapt it:
- Add a new SectionConfig
  - A unique key (e.g. "shanghai-gaokao").
  - Index URL.
  - CSS selectors for article links.
  - CSS selectors for article title, date, and body.
  - Discourse topic ID (summary thread).
- Extend the extractor
  - Adjust extract_article_links to handle the new site’s index layout.
  - Implement a new parse_article_<site_key> function if needed.
- Update state schema (if you add fields)
  - Keep it backwards‑compatible when possible.
- Restart the service + timer
The rest (systemd, Discourse posting, UTF-8 handling, helper scripts) stays the same.
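For the "keep it backwards-compatible" point, one approach is to normalize whatever is on disk at load time. Both layouts in the sketch below are hypothetical: an older one where each section maps directly to a list of URLs, and a newer one where each section maps to a dict with a seen_urls field plus any extra fields.

```python
from typing import Any, Dict


def normalize_state(raw: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Accept the old layout (section -> [urls]) and the new one (section -> {"seen_urls": [...]})."""
    normalized: Dict[str, Dict[str, Any]] = {}
    for section, value in raw.items():
        if isinstance(value, list):
            # Old layout: wrap the bare list in the new structure
            normalized[section] = {"seen_urls": list(value)}
        else:
            # New layout: copy it and make sure the required field exists
            entry = dict(value)
            entry.setdefault("seen_urls", [])
            normalized[section] = entry
    return normalized


old_style = {"hk": ["https://example.com/a.html"]}
new_style = {"hk": {"seen_urls": ["https://example.com/a.html"], "last_run": "2025-11-30"}}
print(normalize_state(old_style))
print(normalize_state(new_style))
```

Normalizing on load means an existing state.json keeps working after an upgrade, and the file is rewritten in the new layout the next time the watcher saves.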
8. Safety & Operations Notes
- Backups:
  - The state file is small but important. Regularly back up /var/lib/bjeea-watch/state.json.
  - The helper script already writes state.json.bak before changes.
- Rate limits:
  - Use a reasonable polling interval (5–10 minutes) to avoid hammering the source site.
  - You can also add sleep / jitter inside the watcher if you poll multiple index pages.
- Secrets:
  - Never commit DISCOURSE_API_KEY or any other secrets into Git.
  - Use an environment file (e.g. /etc/bjeea-watch/env) and set EnvironmentFile= in the systemd unit.
- Monitoring:
  - Use journalctl to inspect logs.
  - Consider adding simple alerting (e.g. if the service starts failing repeatedly).
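The sleep / jitter suggestion from the rate-limit notes could look roughly like this. The helper names, the default delays, and the injectable sleep parameter are all assumptions for illustration. The idea is simply to pause a randomized amount between sections so requests to the source site are not sent back to back.

```python
import random
import time
from typing import Callable, Iterable


def jittered_delay(base_seconds: float, jitter_seconds: float) -> float:
    """Return base_seconds plus a random extra in [0, jitter_seconds]."""
    return base_seconds + random.uniform(0.0, jitter_seconds)


def poll_sections(
    section_keys: Iterable[str],
    process: Callable[[str], None],
    base_seconds: float = 2.0,
    jitter_seconds: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Call process(key) for each section, pausing between sections but not before the first."""
    first = True
    for key in section_keys:
        if not first:
            sleep(jittered_delay(base_seconds, jitter_seconds))
        first = False
        process(key)
```

Passing sleep as a parameter keeps the loop easy to test, because a test can substitute a function that records the delays instead of waiting.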
9. Minimal “Any-Site” Checklist
When you build a new watcher for another website, walk through this list:
- New VPS or existing node is ready (Python 3, git, systemd).
- /opt/<project> and /var/lib/<project> created with correct permissions.
- Virtualenv created, dependencies installed.
- Section configs defined with:
  - index URL
  - CSS for links
  - CSS for title/date/body
  - Discourse topic ID
- State file path wired in and verified.
- UTF-8 decoding forced for Chinese content.
- Discourse API info in environment variables or env file.
- systemd service + (optional) timer installed and enabled.
- Manual test run via systemctl start ... & journalctl.
- Helper script (free_one_from_index.py) tested at least once.
- Repository pushed to GitHub for version control.
Once this is in place, you can treat the watcher as a small, reusable “page → Discourse” bridge for any site you care about.