SUEN

Monitoring the Beijing Education Examinations Authority Website

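Among official websites, which are almost entirely opaque, the Beijing Education Examinations Authority (BJEEA) is one that just barely offers some valuable data. It has long been on my navigation page, and lately I have been clicking through to it often while scraping data. So I decided to put the site under monitoring and hook it straight into the forum.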

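Along the way I had to switch to a different VPS. I first picked a JP machine, only to find that BJEEA dropped 100% of its packets...
This kind of thing has practically become the norm in recent years. I have said before that the "Internet" (互聯網) is neither mutual (互), nor connected (聯), nor a network (網), and this is exactly why.
Switched to SG and, well, it worked.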


Pattern: Mirror important announcements from any external site into a single Discourse summary topic

This document describes the full workflow we used for the BJEEA → Discourse pipeline. The same pattern can be reused for any other website you want to monitor.


1. High-Level Architecture

Goal:
When a target website (e.g. BJEEA) publishes a new article, the watcher should:

  1. Detect it from the index page.
  2. Fetch the detail page.
  3. Extract metadata and the full body.
  4. Convert the body to Markdown.
  5. Append a nicely formatted reply to a fixed Discourse topic (summary thread).
  6. Store the article URL in a state file so the same article is not posted again.

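Key components: the watcher script (bjeea_watch.py), a JSON state file (/var/lib/bjeea-watch/state.json), a systemd service and timer, the Discourse API, and a helper script (free_one_from_index.py).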

This design deliberately keeps the watcher stateless in code and isolates state in a single JSON file.
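A minimal sketch of that state handling, assuming the JSON file maps each section key to a list of seen URLs. The function names (load_state, save_state, mark_seen) and the exact layout are illustrative assumptions, not taken from the actual watcher:

```python
import json
import os
import tempfile
from pathlib import Path


def load_state(path: Path) -> dict[str, list[str]]:
    """Return {section_key: [seen_url, ...]}; a missing file means nothing seen yet."""
    if not path.exists():
        return {}
    with path.open("r", encoding="utf-8") as fh:
        return json.load(fh)


def save_state(path: Path, state: dict[str, list[str]]) -> None:
    """Write the state atomically so a crash cannot leave a half-written file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(state, fh, ensure_ascii=False, indent=2)
    os.replace(tmp_name, path)  # atomic rename within the same directory


def mark_seen(state: dict[str, list[str]], section: str, url: str) -> bool:
    """Record url under section; return True only if it was not seen before."""
    seen = state.setdefault(section, [])
    if url in seen:
        return False
    seen.append(url)
    return True
```

Writing to a temporary file and renaming it keeps the state file readable even if the run is interrupted halfway through a save.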


2. Preparing a New VPS

These steps assume Ubuntu (root) and Python 3.12+ are available.

2.1. Basic packages

bash
apt update
apt upgrade -y

apt install -y python3 python3-venv python3-pip git curl ca-certificates

2.2. Directory layout

bash
# Code lives here
mkdir -p /opt/bjeea-watch
cd /opt/bjeea-watch

# State lives here
mkdir -p /var/lib/bjeea-watch
chmod 700 /var/lib/bjeea-watch

If you want to run as a non-root user, create that user and chown the directories accordingly. For simplicity, this manual uses root.


3. Deploying the Watcher Code

3.1. Create and activate virtualenv

bash
cd /opt/bjeea-watch
python3 -m venv .venv
source .venv/bin/activate

3.2. Install Python dependencies

Typical stack (adjust as needed):

bash
pip install --upgrade pip
pip install requests beautifulsoup4 markdownify python-dotenv

3.3. Core watcher script

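If the watcher script (bjeea_watch.py) already exists, keep it. Its essential responsibilities are the six steps from Section 1: detect new links on the index page, fetch each detail page, extract metadata and body, convert the body to Markdown, post the reply to Discourse, and record the URL in the state file.

The watcher also exposes a few helper functions that the CLI script in Section 6 can reuse, for instance for fetching the index page (build_session, fetch_html) and for extracting article links (extract_article_links).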
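As a sketch of the detection step, the two functions below turn raw href values into absolute URLs and pick out the ones that are not yet in the state. The names normalize_links and find_new_items are hypothetical, and the hrefs are assumed to come from something like a BeautifulSoup CSS selection on the index page:

```python
from urllib.parse import urldefrag, urljoin


def normalize_links(index_url: str, hrefs: list[str]) -> list[str]:
    """Resolve hrefs against the index URL; drop fragments, non-HTTP links and duplicates."""
    result: list[str] = []
    for href in hrefs:
        absolute, _fragment = urldefrag(urljoin(index_url, href.strip()))
        if absolute.startswith(("http://", "https://")) and absolute not in result:
            result.append(absolute)
    return result


def find_new_items(links: list[str], seen: list[str]) -> list[str]:
    """Return links that are not in the seen list, in reverse index order.

    Assumption: the index lists the newest article first, so reversing
    the order posts older articles before newer ones.
    """
    seen_set = set(seen)
    return [url for url in reversed(links) if url not in seen_set]
```

Keeping these as pure functions makes them easy to test without touching the network or the state file.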

3.4. UTF-8 body text (fixing mojibake)

The key for Chinese content is to be strict and explicit about encoding. In the watcher, prefer:

python
import requests
from bs4 import BeautifulSoup

def build_session() -> requests.Session:
    s = requests.Session()
    s.headers.update({
        "User-Agent": "bjeea-watch/1.0 (+your-email@example.com)",
    })
    return s

def fetch_html(session: requests.Session, url: str) -> BeautifulSoup:
    resp = session.get(url, timeout=20)
    resp.raise_for_status()

    # Force UTF-8 decoding, do NOT rely on incorrect headers
    content = resp.content.decode("utf-8", errors="strict")
    soup = BeautifulSoup(content, "html.parser")
    return soup

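This avoids mojibake such as “高中…”, which appears when the server sends UTF-8 but the client decodes it as Latin-1. With requests this typically happens when a text/* response declares no charset and resp.text falls back to ISO-8859-1.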
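The snippet below reproduces the failure in isolation and shows one possible decode helper. The GB18030 fallback is an assumption added for illustration (legacy pages on Chinese sites are often GBK-encoded); the watcher described here decodes strictly as UTF-8:

```python
def decode_html(raw: bytes) -> str:
    """Decode page bytes as UTF-8, falling back to GB18030 for legacy pages."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: anything that is not valid UTF-8 is GBK/GB18030.
        return raw.decode("gb18030", errors="replace")


# How the mojibake arises: UTF-8 bytes read back with the wrong codec.
utf8_bytes = "高中".encode("utf-8")                     # six bytes
garbled = utf8_bytes.decode("latin-1")                  # six junk characters
repaired = garbled.encode("latin-1").decode("utf-8")    # round-trips back to the original
```

Because Latin-1 maps every byte to exactly one character, the damage is reversible as long as the garbled string has not been modified further.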


4. Discourse Integration

4.1. Environment variables

Configure Discourse access via environment variables (or a .env file):

bash
export DISCOURSE_BASE_URL="https://forum.rdfzer.com"
export DISCOURSE_API_KEY="<YOUR_API_KEY>"
export DISCOURSE_API_USERNAME="<YOUR_USERNAME>"  # usually 'system' or an admin user

Each section (hk, gkgz, etc.) can have its own summary topic ID. In code, the SectionConfig can carry topic_id.
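One way this could look is sketched below. Only the SectionConfig name and its topic_id field come from the text; the other fields, the helper functions, and the BJEEA_TOPIC_ID_<KEY> variable naming are assumptions made for illustration:

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class SectionConfig:
    key: str             # e.g. "hk" or "gkgz"
    index_url: str       # index page that lists the articles
    link_selector: str   # CSS selector for article links on the index page
    topic_id: int        # Discourse summary topic for this section


def topic_id_from_env(key: str, env=os.environ) -> int:
    """Read the topic ID from a hypothetical BJEEA_TOPIC_ID_<KEY> variable."""
    name = "BJEEA_TOPIC_ID_" + key.upper().replace("-", "_")
    value = env.get(name)
    if value is None:
        raise RuntimeError(f"missing environment variable {name}")
    return int(value)


def build_section(key: str, index_url: str, link_selector: str, env=os.environ) -> SectionConfig:
    """Combine static settings with the topic ID taken from the environment."""
    return SectionConfig(key, index_url, link_selector, topic_id_from_env(key, env))
```

Failing early on a missing variable is a deliberate choice here: it is better for the service to stop at startup than to post into the wrong topic.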

4.2. Posting a reply

A minimal example using Discourse’s /posts.json API:

python
import os
import requests

DISCOURSE_BASE_URL = os.environ["DISCOURSE_BASE_URL"]
DISCOURSE_API_KEY = os.environ["DISCOURSE_API_KEY"]
DISCOURSE_API_USERNAME = os.environ["DISCOURSE_API_USERNAME"]

def post_reply(topic_id: int, markdown_body: str) -> int:
    url = f"{DISCOURSE_BASE_URL}/posts.json"
    payload = {
        "topic_id": topic_id,
        "raw": markdown_body,
    }
    headers = {
        "Api-Key": DISCOURSE_API_KEY,
        "Api-Username": DISCOURSE_API_USERNAME,
    }
    r = requests.post(url, json=payload, headers=headers, timeout=20)
    r.raise_for_status()
    return r.status_code

On success, the watcher logs:

text
[INFO] bjeea_watch: Created reply in topic_id=9657 (status=200)
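The markdown_body passed to post_reply still has to be assembled from the extracted metadata. A minimal sketch of such a formatter follows; the function name format_reply and the layout it produces are assumptions, not the format the actual watcher uses:

```python
def format_reply(title: str, date: str, url: str, body_md: str) -> str:
    """Build the Markdown for one reply in the summary thread."""
    header = f"## [{title.strip()}]({url})"                 # linked title as a heading
    meta = f"Published: {date.strip()} | Source: {url}"     # one line of metadata
    # "---" on its own paragraph renders as a horizontal rule before the body.
    return "\n\n".join([header, meta, "---", body_md.strip()]) + "\n"
```

The result can be passed straight to post_reply(topic_id, ...) as the raw post body.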

5. Systemd Unit and Timer

5.1. Service unit

Create /etc/systemd/system/bjeea-watch.service:

ini
[Unit]
Description=BJEEA page watcher - append updates into Discourse summary topic
After=network-online.target
Wants=network-online.target

[Service]
Type=oneshot
WorkingDirectory=/opt/bjeea-watch
ExecStart=/opt/bjeea-watch/.venv/bin/python /opt/bjeea-watch/bjeea_watch.py
User=root
Group=root
# Or a dedicated user if you prefer

# Environment (or use EnvironmentFile=/etc/bjeea-watch.env)
Environment=DISCOURSE_BASE_URL=https://forum.rdfzer.com
Environment=DISCOURSE_API_KEY=<YOUR_API_KEY>
Environment=DISCOURSE_API_USERNAME=system

[Install]
WantedBy=multi-user.target

Reload:

bash
systemctl daemon-reload

5.2. Timer unit

Create /etc/systemd/system/bjeea-watch.timer:

ini
[Unit]
Description=Run BJEEA watcher every 5 minutes

[Timer]
OnBootSec=2min
OnUnitActiveSec=5min
Unit=bjeea-watch.service
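# Note: Persistent= only affects OnCalendar= timers; with the monotonic triggers above it is a no-op.
Persistent=true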

[Install]
WantedBy=timers.target

Enable and start:

bash
systemctl enable --now bjeea-watch.timer

You can always trigger a manual run:

bash
systemctl start bjeea-watch.service
journalctl -u bjeea-watch.service -n 50 --no-pager

Sample healthy log:

text
Nov 30 13:06:16 sg-gc-bf python[146434]: 2025-11-30 13:06:16,004 [INFO] bjeea_watch: bjeea_watch.py starting (version 1.3.1)
Nov 30 13:06:28 sg-gc-bf python[146434]: 2025-11-30 13:06:28,789 [INFO] bjeea_watch: Extracted 20 article links from https://www.bjeea.cn/html/hk/index.html
Nov 30 13:06:28 sg-gc-bf python[146434]: 2025-11-30 13:06:28,789 [INFO] bjeea_watch: No new items for section hk.
Nov 30 13:06:29 sg-gc-bf python[146434]: 2025-11-30 13:06:29,204 [INFO] bjeea_watch: Extracted 34 article links from https://www.bjeea.cn/html/gkgz/index.html
Nov 30 13:06:29 sg-gc-bf python[146434]: 2025-11-30 13:06:29,205 [INFO] bjeea_watch: No new items for section gkgz.
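The part of each line after the journald prefix matches what Python's standard logging module produces with the format string below. This is a sketch that reproduces the layout, assuming the watcher uses the stdlib logger; configure_logging is a hypothetical name:

```python
import logging

# Yields lines like: 2025-11-30 13:06:16,004 [INFO] bjeea_watch: message
LOG_FORMAT = "%(asctime)s [%(levelname)s] %(name)s: %(message)s"


def configure_logging(level: int = logging.INFO) -> logging.Logger:
    """Log to stderr, which journald captures for systemd services."""
    logging.basicConfig(level=level, format=LOG_FORMAT)
    return logging.getLogger("bjeea_watch")
```

Nothing else is needed for journalctl -u bjeea-watch.service to show the output, since systemd forwards the service's stderr to the journal by default.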

6. Helper Script

Sometimes you want to force the watcher to repost an article that is already in the state file. The helper script free_one_from_index.py does this by removing the article's URL from the state.

6.1. CLI behavior

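The helper script reads the state file, fetches the section's index page, writes a backup (state.json.bak), and then removes one URL from that section's seen list: either the one passed via --url or one taken from the index page.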
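The core of such a helper can be sketched as follows, assuming the state file maps each section key to a list of seen URLs. The function name free_url and its signature are hypothetical; the real free_one_from_index.py may be structured differently:

```python
import json
import shutil
from pathlib import Path
from typing import Optional


def free_url(state_path: Path, section: str, url: Optional[str] = None,
             index_urls: Optional[list[str]] = None) -> Optional[str]:
    """Remove one seen URL from a section and return it (None if nothing matched).

    With url=None, the first index URL that is already marked as seen is freed.
    """
    state = json.loads(state_path.read_text(encoding="utf-8"))
    seen = state.get(section, [])
    if url is None:
        url = next((u for u in (index_urls or []) if u in seen), None)
    if url is None or url not in seen:
        return None
    # Keep a backup next to the state file before modifying it.
    shutil.copy2(state_path, state_path.with_name(state_path.name + ".bak"))
    seen.remove(url)
    state_path.write_text(json.dumps(state, ensure_ascii=False, indent=2), encoding="utf-8")
    return url
```

The backup is written only when a URL is actually going to be removed, so a no-op call leaves both files untouched.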

6.2. Example usage

bash
cd /opt/bjeea-watch
source .venv/bin/activate

# Generic: free one URL from the hk section (taken from index)
python free_one_from_index.py --section hk

# Targeted: free a specific URL from the hk section
python free_one_from_index.py --section hk --url https://www.bjeea.cn/html/hk/faq/2025/1017/87375.html

Sample output:

text
[INFO] state file: /var/lib/bjeea-watch/state.json
[INFO] Section 'hk' currently has 67 seen URLs.
[INFO] Index page https://www.bjeea.cn/html/hk/index.html currently has 20 article URLs.
[INFO] Backup written to: /var/lib/bjeea-watch/state.json.bak
[INFO] Removed URL from section 'hk':
[INFO]        https://www.bjeea.cn/html/hk/qxkb/2021/0108/77811.html

After that, run the watcher again:

bash
systemctl start bjeea-watch.service
journalctl -u bjeea-watch.service -n 30 --no-pager

You should see a “Found 1 new items” log and a new reply in the Discourse topic.


7. Adapting to Other Sites

The architecture is reusable for any site with a reasonably stable HTML structure.

To adapt it:

  1. Add a new SectionConfig
    • A unique key (e.g. "shanghai-gaokao").
    • Index URL.
    • CSS selectors for article links.
    • CSS selectors for article title, date, and body.
    • Discourse topic ID (summary thread).
  2. Extend the extractor
    • Adjust extract_article_links to handle the new site’s index layout.
    • Implement a new parse_article_<site_key> function if needed.
  3. Update state schema (if you add fields)
    • Keep it backwards‑compatible when possible.
  4. Restart the service + timer

The rest (systemd, Discourse posting, UTF-8 handling, helper scripts) stays the same.
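For step 3, one way to stay backwards-compatible is to normalise the state on load, so that files written by an older version keep working. The sketch below assumes an old layout where each section maps directly to a list of seen URLs and a hypothetical newer layout where each section is an object with a "seen" list plus extra fields; both layouts and the name upgrade_state are illustrative:

```python
from typing import Any


def upgrade_state(raw: dict[str, Any]) -> dict[str, dict[str, Any]]:
    """Normalise old and new state layouts to {section: {"seen": [...], ...}}."""
    upgraded: dict[str, dict[str, Any]] = {}
    for section, value in raw.items():
        if isinstance(value, list):
            # Old layout: the section maps directly to its list of seen URLs.
            upgraded[section] = {"seen": list(value)}
        else:
            # New layout: keep extra fields, but make sure "seen" exists.
            entry = dict(value)
            entry.setdefault("seen", [])
            upgraded[section] = entry
    return upgraded
```

Running this once at load time means the rest of the watcher only ever has to deal with a single layout.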


8. Safety & Operations Notes


9. Minimal “Any-Site” Checklist

When you build a new watcher for another website, walk through this list:

  1. New VPS or existing node is ready (Python 3, git, systemd).
  2. /opt/<project> and /var/lib/<project> created with correct permissions.
  3. Virtualenv created, dependencies installed.
  4. Section configs defined with:
    • index URL
    • CSS for links
    • CSS for title/date/body
    • Discourse topic ID
  5. State file path wired in and verified.
  6. UTF-8 decoding forced for Chinese content.
  7. Discourse API info in environment variables or env file.
  8. systemd service + (optional) timer installed and enabled.
  9. Manual test run via systemctl start ... and journalctl.
  10. Helper script (free_one_from_index.py) tested at least once.
  11. Repository pushed to GitHub for version control.

Once this is in place, you can treat the watcher as a small, reusable “page → Discourse” bridge for any site you care about.