Monitoring the Beijing Education Examinations Authority (BJEEA) Website

2025-11-30 13:30

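Among official bodies that are almost entirely opaque, the Beijing Education Examinations Authority runs one of the few websites that publishes at least some data of value. I had long kept it on my navigation page, and lately I have been clicking through to it often while scraping data. So I decided to simply put the site under monitoring and pipe its updates straight into the forum.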

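Along the way I had to switch to a different VPS: I first picked a JP machine, only to find that BJEEA dropped 100% of its packets...
This sort of thing has practically become the norm in recent years. When I said earlier that the "internet" is neither "inter", nor connected, nor a "net", this is exactly what I meant.
Switched to SG and, well, it works.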

Pattern: Mirror important announcements from any external site into a single Discourse summary topic

This document describes the full workflow we used for the BJEEA → Discourse pipeline:

moving the watcher to a new VPS,
deploying the Python watcher,
wiring it into systemd,
fixing UTF-8 body text,
and adding a small helper script to “free” one URL so it can be reposted.

The same pattern can be reused for any other website you want to monitor.

1. High-Level Architecture

Goal:
When a target website (e.g. BJEEA) publishes a new article, the watcher should:

Detect it from the index page.
Fetch the detail page.
Extract metadata and the full body.
Convert the body to Markdown.
Append a nicely formatted reply to a fixed Discourse topic (the summary thread).
Store the article URL in a state file so the same article is not posted again.

Key components:

A small Python project under /opt/bjeea-watch (or any name).
A virtualenv in /opt/bjeea-watch/.venv.
A state file in /var/lib/bjeea-watch/state.json.
A systemd service (and optionally a timer) which runs the script every few minutes.
A helper CLI free_one_from_index.py to release one URL from the “seen list” so the watcher will repost it.

This design deliberately keeps the watcher stateless in code and isolates state in a single JSON file.
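To make the single-file state concrete, here is a minimal sketch of loading and saving it. The JSON layout (a mapping from section key to a list of seen URLs) and the write-then-rename step are assumptions made for illustration; the actual watcher may store its state differently.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Dict, List

# Assumed layout: {"hk": ["https://...", ...], "gkgz": [...]}
State = Dict[str, List[str]]


def load_state(path: Path) -> State:
    """Return the saved state, or an empty dict if the file does not exist yet."""
    if not path.exists():
        return {}
    with path.open("r", encoding="utf-8") as fh:
        return json.load(fh)


def save_state(path: Path, state: State) -> None:
    """Write to a temp file in the same directory, then rename it over the
    target, so a crash mid-write cannot leave a truncated state file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=str(path.parent), suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(state, fh, ensure_ascii=False, indent=2)
    os.replace(tmp_name, path)
```

Keeping the temp file in the same directory as the target matters because a rename is only atomic within one filesystem.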

2. Preparing a New VPS

These steps assume Ubuntu (root) and Python 3.12+ are available.

2.1. Basic packages

bash

apt update
apt upgrade -y

apt install -y python3 python3-venv python3-pip git curl ca-certificates

2.2. Directory layout

bash

# Code lives here
mkdir -p /opt/bjeea-watch
cd /opt/bjeea-watch

# State lives here
mkdir -p /var/lib/bjeea-watch
chmod 700 /var/lib/bjeea-watch

If you want to run as a non-root user, create that user and chown the directories accordingly. For simplicity, this manual uses root.

3. Deploying the Watcher Code

3.1. Create and activate virtualenv

bash

cd /opt/bjeea-watch
python3 -m venv .venv
source .venv/bin/activate

3.2. Install Python dependencies

Typical stack (adjust as needed):

bash

pip install --upgrade pip
pip install requests beautifulsoup4 markdownify python-dotenv

3.3. Core watcher script

Your actual file may already exist. The essential responsibilities are:

Define sections (e.g. hk, gkgz) and their metadata.
Load / save state from /var/lib/bjeea-watch/state.json.
For each section:
- Fetch the index URL.
- Extract a list of article URLs.
- Compute new_urls = index_urls − seen_urls.
- For each new URL:
  - Fetch the detail page.
  - Parse:
    - title
    - publication date
    - body (main content container)
  - Convert body → Markdown.
  - Post to Discourse.
  - Append URL to seen_urls.
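The per-section loop above can be sketched as follows. This is an illustration under stated assumptions, not the watcher's actual code: `handle_article` is a hypothetical callable standing in for "fetch, parse, convert, post", and the seen list is passed in as a plain Python list.

```python
from typing import Callable, Iterable, List, Set


def compute_new_urls(index_urls: Iterable[str], seen_urls: Set[str]) -> List[str]:
    """Return index URLs that are not yet seen, de-duplicated, in index order."""
    new_urls: List[str] = []
    emitted: Set[str] = set()
    for url in index_urls:
        if url not in seen_urls and url not in emitted:
            emitted.add(url)
            new_urls.append(url)
    return new_urls


def process_section(
    index_urls: Iterable[str],
    seen_urls: List[str],
    handle_article: Callable[[str], None],
) -> int:
    """Handle each new URL and mark it seen only after the handler succeeds."""
    posted = 0
    for url in compute_new_urls(index_urls, set(seen_urls)):
        handle_article(url)    # hypothetical: fetch, parse, convert, post
        seen_urls.append(url)  # reached only if handle_article did not raise
        posted += 1
    return posted
```

Appending to the seen list only after a successful post means a failed article is retried on the next run instead of being silently skipped. If the index lists newest articles first, you may want to reverse the new URLs before posting so replies appear in chronological order.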

The watcher exposes a few helper functions we will reuse in the CLI script:

SECTIONS: Dict[str, SectionConfig]
build_session()
fetch_html(session, url)
extract_article_links(soup, section_cfg)
get_state_path()
load_state(path)
save_state(path, state_dict)

3.4. UTF-8 body text (fixing mojibake)

The key for Chinese content is to be strict and explicit about encoding. In the watcher, prefer:

python

import requests
from bs4 import BeautifulSoup

def build_session() -> requests.Session:
    s = requests.Session()
    s.headers.update({
        "User-Agent": "bjeea-watch/1.0 (+your-email@example.com)",
    })
    return s

def fetch_html(session: requests.Session, url: str) -> BeautifulSoup:
    resp = session.get(url, timeout=20)
    resp.raise_for_status()

    # Force UTF-8 decoding, do NOT rely on incorrect headers
    content = resp.content.decode("utf-8", errors="strict")
    soup = BeautifulSoup(content, "html.parser")
    return soup

This avoids the “é«ä¸…” style mojibake you saw when the server sends UTF-8 but the client decodes as Latin-1.
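For reference, the mechanism behind that mojibake can be reproduced with the standard library alone. With requests it typically shows up when you read `resp.text` and the response headers carry no charset, because requests then falls back to ISO-8859-1 for text content. The sample string below is chosen only for illustration.

```python
original = "高考"  # sample text, chosen only for illustration

raw = original.encode("utf-8")    # the bytes a UTF-8 server would send
garbled = raw.decode("latin-1")   # wrong codec: every byte becomes one character

# Undo the wrong step: turn the characters back into bytes, then decode properly.
repaired = garbled.encode("latin-1").decode("utf-8")

print(len(original), len(garbled))  # 2 6 (each character is 3 bytes in UTF-8)
print(repaired == original)         # True
```

The repair only works while the garbled string is still intact, so decoding correctly at fetch time, as `fetch_html` does, is the more robust fix.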

4. Discourse Integration

4.1. Environment variables

Configure Discourse access via environment variables (or a .env file):

bash

export DISCOURSE_BASE_URL="https://forum.rdfzer.com"
export DISCOURSE_API_KEY="<YOUR_API_KEY>"
export DISCOURSE_API_USERNAME="<YOUR_USERNAME>"  # usually 'system' or an admin user

Each section (hk, gkgz, etc.) can have its own summary topic ID, e.g.:

hk → topic_id 9657
gkgz → another topic ID

In code, the SectionConfig can carry topic_id.
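A minimal sketch of what such a `SectionConfig` could look like is shown here. The field names and the `link_selector` value are assumptions for illustration; only the `hk` index URL and topic ID come from this document, and the real class may carry more fields.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class SectionConfig:
    key: str            # short section name, e.g. "hk"
    index_url: str      # listing page to poll
    topic_id: int       # Discourse summary topic that receives the replies
    link_selector: str  # CSS selector for article links (hypothetical field)


SECTIONS: Dict[str, SectionConfig] = {
    "hk": SectionConfig(
        key="hk",
        index_url="https://www.bjeea.cn/html/hk/index.html",
        topic_id=9657,
        link_selector="a[href$='.html']",  # placeholder, not the real selector
    ),
    # Add "gkgz" and other sections here with their own topic IDs.
}
```

Marking the dataclass as frozen keeps the configuration read-only at runtime, so all mutable data stays in the state file.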

4.2. Posting a reply

A minimal example using Discourse’s /posts.json API:

python

import os
import requests

DISCOURSE_BASE_URL = os.environ["DISCOURSE_BASE_URL"]
DISCOURSE_API_KEY = os.environ["DISCOURSE_API_KEY"]
DISCOURSE_API_USERNAME = os.environ["DISCOURSE_API_USERNAME"]

def post_reply(topic_id: int, markdown_body: str) -> int:
    url = f"{DISCOURSE_BASE_URL}/posts.json"
    payload = {
        "topic_id": topic_id,
        "raw": markdown_body,
    }
    headers = {
        "Api-Key": DISCOURSE_API_KEY,
        "Api-Username": DISCOURSE_API_USERNAME,
    }
    r = requests.post(url, json=payload, headers=headers, timeout=20)
    r.raise_for_status()
    return r.status_code

On success, the watcher logs:

text

[INFO] bjeea_watch: Created reply in topic_id=9657 (status=200)
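The `markdown_body` passed to `post_reply` has to be assembled somewhere. One possible shape is sketched below; the layout (heading, metadata line, body) and the `max_chars` default are assumptions, so check your forum's maximum post length setting before relying on the number.

```python
def format_reply(
    title: str,
    published: str,
    url: str,
    body_md: str,
    max_chars: int = 30000,  # assumed limit, adjust to your forum's setting
) -> str:
    """Build the Markdown for one reply: heading, metadata line, then the body."""
    header = (
        f"## {title.strip()}\n\n"
        f"Published: {published.strip()} | Source: <{url}>\n\n"
    )
    body = body_md.strip()
    room = max_chars - len(header)
    if len(body) > room:
        # Cut the body so that header + body + notice fits within max_chars.
        notice = "\n\n(Truncated; see the source link for the full text.)"
        body = body[: max(0, room - len(notice))].rstrip() + notice
    return header + body
```

Truncating locally avoids a rejected request when an announcement is unusually long; the source link in the header still leads to the full text.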

5. Systemd Unit and Timer

5.1. Service unit

Create /etc/systemd/system/bjeea-watch.service:

ini

[Unit]
Description=BJEEA page watcher - append updates into Discourse summary topic
After=network-online.target
Wants=network-online.target

[Service]
Type=oneshot
WorkingDirectory=/opt/bjeea-watch
ExecStart=/opt/bjeea-watch/.venv/bin/python /opt/bjeea-watch/bjeea_watch.py
User=root
Group=root
# Or a dedicated user if you prefer

# Environment (or use EnvironmentFile=/etc/bjeea-watch.env)
Environment=DISCOURSE_BASE_URL=https://forum.rdfzer.com
Environment=DISCOURSE_API_KEY=<YOUR_API_KEY>
Environment=DISCOURSE_API_USERNAME=system

[Install]
WantedBy=multi-user.target

Reload:

bash

systemctl daemon-reload

5.2. Timer (optional but recommended)

Create /etc/systemd/system/bjeea-watch.timer:

ini

[Unit]
Description=Run BJEEA watcher every 5 minutes

[Timer]
OnBootSec=2min
OnUnitActiveSec=5min
Unit=bjeea-watch.service
# Persistent= is omitted: it only affects OnCalendar= timers, not OnBootSec=/OnUnitActiveSec=

[Install]
WantedBy=timers.target

Enable and start:

bash

systemctl enable --now bjeea-watch.timer

You can always trigger a manual run:

bash

systemctl start bjeea-watch.service
journalctl -u bjeea-watch.service -n 50 --no-pager

Sample healthy log:

text

Nov 30 13:06:16 sg-gc-bf python[146434]: 2025-11-30 13:06:16,004 [INFO] bjeea_watch: bjeea_watch.py starting (version 1.3.1)
Nov 30 13:06:28 sg-gc-bf python[146434]: 2025-11-30 13:06:28,789 [INFO] bjeea_watch: Extracted 20 article links from https://www.bjeea.cn/html/hk/index.html
Nov 30 13:06:28 sg-gc-bf python[146434]: 2025-11-30 13:06:28,789 [INFO] bjeea_watch: No new items for section hk.
Nov 30 13:06:29 sg-gc-bf python[146434]: 2025-11-30 13:06:29,204 [INFO] bjeea_watch: Extracted 34 article links from https://www.bjeea.cn/html/gkgz/index.html
Nov 30 13:06:29 sg-gc-bf python[146434]: 2025-11-30 13:06:29,205 [INFO] bjeea_watch: No new items for section gkgz.

6. Helper Script

Sometimes you want to force the watcher to repost an article which is already in the state file. Typical use‑cases:

You changed the formatting logic and want to regenerate a post.
A Discourse error happened on the first attempt.
You want to “replay” a specific URL to test changes.

6.1. CLI behavior

The helper script:

Loads the same state file as the watcher.
Optionally refetches the index page to print current URLs.
Removes either:
- one already-seen URL picked from those on the current index page (or an older one you choose), or
- a specific --url given on the command line.
Writes a backup state.json.bak before modifying the state.
Saves the updated state.json.
The next watcher run will see that URL as new and post it again.
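The core of that behavior can be sketched as below. This is not the actual `free_one_from_index.py`: the function name `free_url`, the state layout (section key mapped to a list of URLs), and the default of removing the most recently appended URL are all assumptions for illustration.

```python
import json
import shutil
from pathlib import Path
from typing import Optional


def free_url(state_path: Path, section: str, url: Optional[str] = None) -> Optional[str]:
    """Remove one URL from a section's seen list and return it.

    Returns None if nothing was removed. With url=None the most recently
    appended URL is chosen (an assumed default).
    """
    state = json.loads(state_path.read_text(encoding="utf-8"))
    seen = state.get(section, [])
    if not seen:
        return None
    target = url if url is not None else seen[-1]
    if target not in seen:
        return None

    # Back up first, so a wrong removal can be undone by copying the .bak back.
    backup = state_path.with_name(state_path.name + ".bak")
    shutil.copy2(state_path, backup)

    seen.remove(target)
    state[section] = seen
    state_path.write_text(
        json.dumps(state, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return target
```

The backup is written only when a removal will actually happen, so a mistyped `--url` does not overwrite a useful `state.json.bak`.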

6.2. Example usage

bash

cd /opt/bjeea-watch
source .venv/bin/activate

# Generic: free one URL from the hk section (taken from index)
python free_one_from_index.py --section hk

# Targeted: free a specific URL from the hk section
python free_one_from_index.py --section hk --url https://www.bjeea.cn/html/hk/faq/2025/1017/87375.html

Sample output:

text

[INFO] state file: /var/lib/bjeea-watch/state.json
[INFO] Section 'hk' currently has 67 seen URLs.
[INFO] Index page https://www.bjeea.cn/html/hk/index.html currently has 20 article URLs.
[INFO] Backup written to: /var/lib/bjeea-watch/state.json.bak
[INFO] Removed URL from section 'hk':
[INFO]        https://www.bjeea.cn/html/hk/qxkb/2021/0108/77811.html

After that, run the watcher again:

bash

systemctl start bjeea-watch.service
journalctl -u bjeea-watch.service -n 30 --no-pager

You should see a “Found 1 new items” log and a new reply in the Discourse topic.

7. Adapting to Other Sites

The architecture is reusable for any site with a reasonably stable HTML structure.

To adapt it:

Add a new SectionConfig
- A unique key (e.g. "shanghai-gaokao").
- Index URL.
- CSS selectors for article links.
- CSS selectors for article title, date, and body.
- Discourse topic ID (summary thread).
Extend the extractor
- Adjust extract_article_links to handle the new site’s index layout.
- Implement a new parse_article_<site_key> function if needed.
Update state schema (if you add fields)
- Keep it backwards‑compatible when possible.
Restart the service + timer

The rest (systemd, Discourse posting, UTF-8 handling, helper scripts) stays the same.

8. Safety & Operations Notes

Backups:
- The state file is small but important. Regularly back up /var/lib/bjeea-watch/state.json.
- The helper script already writes state.json.bak before changes.
Rate limits:
- Use a reasonable polling interval (5–10 minutes) to avoid hammering the source site.
- You can also add sleep / jitter inside the watcher if you poll multiple index pages.
Secrets:
- Never commit DISCOURSE_API_KEY or any other secrets into Git.
- Use an environment file readable only by root (e.g. /etc/bjeea-watch.env, as in section 5.1) and set EnvironmentFile= in the systemd unit.
Monitoring:
- Use journalctl to inspect logs.
- Consider adding simple alerting (e.g. if the service starts failing repeatedly).
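For the sleep / jitter suggestion under rate limits, a minimal sketch could look like this. The function names and the default intervals are assumptions; tune them to how many index pages you poll per run.

```python
import random
import time


def polite_delay(base_seconds: float = 2.0, jitter_seconds: float = 1.5) -> float:
    """Pick a delay of base_seconds plus a random extra in [0, jitter_seconds]."""
    return base_seconds + random.uniform(0.0, jitter_seconds)


def pause_between_requests(base_seconds: float = 2.0, jitter_seconds: float = 1.5) -> None:
    """Sleep for a jittered interval; call this between index-page fetches."""
    time.sleep(polite_delay(base_seconds, jitter_seconds))
```

The random component keeps requests from arriving at perfectly regular intervals, which is gentler on the source site than a fixed sleep.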

9. Minimal “Any-Site” Checklist

When you build a new watcher for another website, walk through this list:

New VPS or existing node is ready (Python 3, git, systemd).
/opt/<project> and /var/lib/<project> created with correct permissions.
Virtualenv created, dependencies installed.
Section configs defined with:
- index URL
- CSS for links
- CSS for title/date/body
- Discourse topic ID
State file path wired in and verified.
UTF-8 decoding forced for Chinese content.
Discourse API info in environment variables or env file.
systemd service + (optional) timer installed and enabled.
Manual test run via systemctl start ... & journalctl.
Helper script (free_one_from_index.py) tested at least once.
Repository pushed to GitHub for version control.

Once this is in place, you can treat the watcher as a small, reusable “page → Discourse” bridge for any site you care about.