Walls, Within and Without, Above and Below
A student on the forum posted the WeChat Official Account article behind the Wuhan University "文科生女友" ("liberal-arts girlfriend") affair, wanting an AI to analyze why the piece disgusted so many readers.
Which raised the old problem: copy-pasting the link directly trips WeChat's domain restrictions, and the images won't display.
Pasting as plain text does work, but if you're going to paste at all, better to keep the whole picture.
It was the same earlier when I copied those old-textbook pieces from a WeChat Official Account; it has been this way for years.
The world is covered in walls, and behind every wall is a human heart.
Annoying. For that old-textbook piece I right-clicked the images and copied them over one by one. The thought of doing it again is more annoying still.
On reflection, an article worth pasting over is usually worth archiving too. Might as well do it properly in one pass.
A script downloads the HTML text and the images; afterwards I drag the images into the forum one at a time, and the folder gets pushed to cloud storage automatically at intervals.
I could have added automatic image re-hosting so that a straight copy-paste would just work, but given how few Official Account articles ever need converting, I let it go.
Taming the Dragon: Copying WeChat Articles with Images Intact
WeChat Official Accounts are a massive source of information, news, and articles, especially in the Chinese-speaking world. However, anyone who’s tried to simply copy an article’s content and paste it elsewhere (like a personal knowledge base, a document, or another platform) has likely run into a frustrating issue: the text copies over, but all the images appear broken, often replaced by a placeholder saying “This image is from the WeChat Official Account platform and cannot be cited without permission.”
Why does this happen? It's WeChat's hotlink protection (`Referer` checking) kicking in. The image URLs point to WeChat's servers (`mmbiz.qpic.cn`), and those servers refuse to serve an image if the request doesn't originate from an approved source (like WeChat itself).
This blog post chronicles the journey of building a robust solution to grab WeChat articles, including their images and formatting, for local use and potential sharing, focusing on a macOS environment.
The Goal: A Clean, Portable Copy
Our objective was clear: create a way to take a WeChat article URL and generate a local copy that:
- Includes all text content.
- Displays all images correctly.
- Preserves the original formatting as much as possible.
- Is easy to use via the command line on macOS.
Exploring Initial Solutions
We briefly considered several approaches:
- Manual Download & Replace: Copy text, manually save each image, then manually insert them back. Effective but incredibly tedious for image-heavy articles.
- Browser Extensions: Tools like SingleFile or note-taking web clippers can often grab full pages. Good option, but relies on third-party extensions working correctly and might not offer fine-grained control.
- Online Tools: Various web services claim to download WeChat articles. Concerns include reliability, privacy, potential costs, and ads.
- Self-Hosted Script: Building our own script offers the most control, flexibility, and avoids third-party dependencies (beyond standard libraries). This was the path we chose.
The Local Script Approach: `getwc` on macOS
We decided to build a command-line tool, `getwc`, invoked like this:
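Something along these lines, with a placeholder standing in for a real article URL:

```bash
getwc "https://mp.weixin.qq.com/s/PLACEHOLDER"
```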
This tool would wrap a Python script responsible for the heavy lifting.
Core Logic (Python):
The Python script (`wechat_article_parser.py`) would perform the following steps:
- Fetch HTML: Use the `requests` library to download the article's full HTML source, mimicking a standard browser `User-Agent`.
- Parse HTML: Use `BeautifulSoup4` to parse the HTML structure.
- Find Content: Locate the main article content container (usually `<div id="js_content">`).
- Find Images: Identify all `<img>` tags within the content container. Prioritize the `data-src` attribute (often used for lazy loading) over `src`.
- Fetch Images: For each image URL pointing to WeChat's servers (`mmbiz.qpic.cn`), use `requests` again to download the actual image bytes. Crucially, omit the `Referer` header in this request to bypass hotlink protection (see the sketch after this list). Determine the image's `Content-Type` (MIME type).
- Process Images & Update HTML: This is where our approach evolved.
- Save Final HTML: Write the modified HTML content to a local file.
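The Fetch Images step is the crux. Here is a minimal sketch of it, with a hypothetical `img_url`; `requests` sends no `Referer` header unless you add one, so simply leaving it out is enough:

```python
import requests

img_url = "https://mmbiz.qpic.cn/example_image_path"  # hypothetical URL

# A browser-like User-Agent, but deliberately no Referer header:
# per the behavior described above, WeChat's hotlink check rejects
# requests carrying a foreign Referer, not requests with none at all.
headers = {"User-Agent": "Mozilla/5.0"}
resp = requests.get(img_url, headers=headers, timeout=30)
resp.raise_for_status()
image_bytes = resp.content
mime = resp.headers.get("Content-Type", "image/jpeg")  # later used for the extension
```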
The Bash Wrapper (`getwc`):
A simple Bash script was created at `/usr/local/bin/getwc` to:
- Take the URL as a command-line argument.
- Validate the URL format.
- Call the Python script using a specific Python interpreter (more on this later).
- Capture the output (the path to the saved HTML file) from the Python script.
- Report success or failure to the user.
- Optionally, automatically open the generated HTML file.
First Attempt & Troubleshooting Round 1: Base64 and the Bloat Issue
Our initial Python script implementation aimed for a single, self-contained HTML file. The “Process Images” step involved:
- Converting the downloaded image bytes into a Base64 Data URI string (`data:image/png;base64,...`).
- Replacing the `<img>` tag's `src` attribute with this huge Base64 string (see the sketch after this list).
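For reference, a sketch of that step, reusing `image_bytes` and `mime` from the download and `img` for the BeautifulSoup `<img>` tag (names assumed, not from the original script):

```python
import base64

# Inline the image as a Data URI: fully self-contained HTML, but Base64
# inflates each image by roughly a third, so large articles balloon fast.
encoded = base64.b64encode(image_bytes).decode("ascii")
img["src"] = f"data:{mime};base64,{encoded}"
```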
The Problem: While this worked perfectly for small articles, processing an article with many high-resolution images resulted in a massive HTML file (e.g., 42.2 MB!). Browsers struggled immensely to open such a large file containing embedded resources. Loading would hang, pages remained blank, or the browser tab would crash due to excessive memory consumption. Base64 encoding also increases data size by ~33%, exacerbating the issue.
The Fix: We abandoned the Base64 approach for large articles.
The Refined Solution: Local Image Files and Relative Paths
We modified the Python script’s image processing logic:
- Create Assets Directory: For each processed article (`Article_Title_Timestamp.html`), create a corresponding folder named `Article_Title_Timestamp_files/`.
- Save Images Locally: Instead of Base64 encoding, save the raw downloaded image bytes directly into the `_files` directory, using a sequential filename (e.g., `image_0001.jpg`, `image_0002.png`). The correct file extension is determined from the `Content-Type` or URL.
- Update `src` with Relative Path: Modify the `<img>` tag's `src` attribute to use a relative path pointing to the saved image file (e.g., `src="Article_Title_Timestamp_files/image_0001.jpg"`). A sketch of these steps follows the list.
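A sketch of those three steps together, again with assumed names (`index`, `mime`, `image_bytes`, and `img` carried over from the download loop):

```python
from pathlib import Path

# Map common MIME types to extensions, defaulting to .jpg.
EXT = {"image/png": ".png", "image/gif": ".gif", "image/webp": ".webp"}

# One assets folder per article, created next to the output HTML.
assets = Path("Article_Title_Timestamp_files")
assets.mkdir(exist_ok=True)

# Sequential filename with an extension derived from the Content-Type.
filename = f"image_{index:04d}{EXT.get(mime, '.jpg')}"
(assets / filename).write_bytes(image_bytes)

# Point the tag at the saved file; a relative path keeps the copy portable.
img["src"] = f"{assets.name}/{filename}"
```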
This resulted in a much smaller HTML file (containing only text and tags) and a separate folder with the image assets. Browsers could now open the HTML file instantly.
Troubleshooting Round 2: The Blank Page Mystery
Success! The HTML file opened quickly… but it was completely blank!
Diagnosis: Using the browser’s Developer Tools (Inspect Element) was key. We examined the HTML structure (`Elements` tab) and found that the main content `div` (`#js_content`) had inherited inline CSS styles from the original WeChat page: `style="visibility: hidden; opacity: 0;"`
. These styles were likely used by WeChat for loading animations but remained active in our static copy because the accompanying JavaScript was missing.
The Fix: We added a few lines to the Python script, right after finding the `content_div`, to explicitly remove the `style` attribute from that container before processing its contents:
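A sketch of that fix, with `content_div` as the BeautifulSoup tag for `#js_content`:

```python
# Drop the inline "visibility: hidden; opacity: 0;" that WeChat's
# own JavaScript would normally clear after the page loads.
if content_div.has_attr("style"):
    del content_div["style"]
```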
With this change, the generated HTML finally rendered correctly in the browser, showing both text and locally referenced images.
Handling Python Dependencies: Virtual Environments (PEP 668)
When trying to install the required Python libraries (`requests`, `beautifulsoup4`) using `pip3 install ...`, we encountered the error: `externally-managed-environment`. Modern Python distributions (especially on macOS managed by Homebrew) discourage installing packages globally with `pip` to avoid conflicts.
The Solution: Use a Python virtual environment.
- Create a dedicated directory for the script (e.g., `~/scripts`).
- Inside that directory, create a virtual environment: `python3 -m venv venv`
- Activate it: `source venv/bin/activate`
- Install dependencies within the activated environment: `pip install requests beautifulsoup4`
- Modify the `getwc` Bash script to explicitly call the Python interpreter from the virtual environment: `$HOME/scripts/venv/bin/python3` instead of just `python3`. This ensures the script always uses the environment with the correct dependencies installed, without needing manual activation each time.
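Concretely, the wrapper pins the interpreter path instead of relying on whatever `python3` happens to be on `PATH` (paths here follow the setup above):

```bash
# No activation needed: calling the venv's python3 uses its site-packages.
PYTHON="$HOME/scripts/venv/bin/python3"
"$PYTHON" "$HOME/scripts/wechat_article_parser.py" "$URL"
```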
The Final Hurdle: Pasting into Discourse (or other Web Platforms)
We now had a working local HTML copy. The final goal was to easily copy this content into another platform, like the Discourse forum software. Simply selecting all (Cmd+A) in the browser and pasting (Cmd+V) into the Discourse editor led to a familiar problem: the text appeared, but the images were broken, rendered as dead Markdown image links.
The Reason: The pasted HTML contained relative image paths (`_files/...`). These paths are only meaningful relative to the HTML file on the local machine. The Discourse server has no access to these local files.
The Recommended Solution (for Discourse): Manual Upload
While less automated, the most reliable way to get the content into Discourse is:
- Open the local HTML file in your browser.
- Open the corresponding `_files` folder in Finder.
- Copy the text portions from the browser and paste them into the Discourse editor.
- For each image, drag the image file from the Finder (`_files` folder) directly into the Discourse editor at the desired location.
- Discourse’s editor will typically handle the drag-and-drop, uploading the image to its own storage and inserting the correct image tag referencing the uploaded file.
This leverages Discourse’s built-in upload mechanism, ensuring images are properly hosted and displayed within the platform. While manual, it’s the most robust workflow for integrating local content with server-based platforms. (An alternative involving modifying the script to upload images to a public image host first was deemed too complex for this user’s primary need).
The Final Code
Here are the final, working versions of the scripts:
1. Python Script (`~/scripts/wechat_article_parser.py`)
(Saves images locally, removes hiding styles)
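A minimal sketch consistent with the steps described above; identifiers, the title-sanitizing rule, and error handling are illustrative choices, not necessarily the original:

```python
#!/usr/bin/env python3
"""Sketch of wechat_article_parser.py: fetch a WeChat article, save its
images locally, rewrite <img> tags to relative paths, strip hiding styles."""
import re
import sys
import time
from pathlib import Path

import requests
from bs4 import BeautifulSoup

HEADERS = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)"}
EXT = {"image/png": ".png", "image/gif": ".gif", "image/webp": ".webp"}


def main(url: str) -> None:
    html = requests.get(url, headers=HEADERS, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")

    # Derive a filesystem-safe title and timestamped output names.
    title_tag = soup.find("h1") or soup.find("title")
    raw_title = title_tag.get_text(strip=True) if title_tag else "article"
    title = re.sub(r"[^\w\-]+", "_", raw_title)[:50] or "article"
    stamp = time.strftime("%Y%m%d_%H%M%S")
    out_html = Path(f"{title}_{stamp}.html")
    assets = Path(f"{title}_{stamp}_files")
    assets.mkdir(exist_ok=True)

    content = soup.find("div", id="js_content")
    if content is None:
        sys.exit("Could not find #js_content in the page")

    # Remove WeChat's hiding styles (visibility: hidden; opacity: 0).
    if content.has_attr("style"):
        del content["style"]

    for i, img in enumerate(content.find_all("img"), start=1):
        src = img.get("data-src") or img.get("src")  # data-src first (lazy loading)
        if not src or "mmbiz.qpic.cn" not in src:
            continue
        # No Referer header here: that is what bypasses the hotlink check.
        resp = requests.get(src, headers=HEADERS, timeout=30)
        resp.raise_for_status()
        mime = resp.headers.get("Content-Type", "image/jpeg").split(";")[0]
        name = f"image_{i:04d}{EXT.get(mime, '.jpg')}"
        (assets / name).write_bytes(resp.content)
        img["src"] = f"{assets.name}/{name}"  # relative path to the saved file
        img.attrs.pop("data-src", None)

    out_html.write_text(str(soup), encoding="utf-8")
    print(out_html.resolve())  # the Bash wrapper captures this path


if __name__ == "__main__":
    main(sys.argv[1])
```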
2. Bash Script (`/usr/local/bin/getwc`)
(Uses the venv Python, captures output correctly)
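Again a sketch matching the described behavior (URL validation, venv interpreter, output capture, optional open); paths assume the `~/scripts` layout from the setup instructions:

```bash
#!/bin/bash
# getwc: archive a WeChat article locally via wechat_article_parser.py.
set -u

PYTHON="$HOME/scripts/venv/bin/python3"
PARSER="$HOME/scripts/wechat_article_parser.py"

URL="${1:-}"
if [[ ! "$URL" =~ ^https://mp\.weixin\.qq\.com/ ]]; then
    echo "Usage: getwc <WeChat article URL>" >&2
    exit 1
fi

# The parser prints the path of the saved HTML file on success.
if OUT="$("$PYTHON" "$PARSER" "$URL")"; then
    echo "Saved: $OUT"
    open "$OUT"   # optional: view the archived copy immediately
else
    echo "getwc: failed to fetch or parse the article." >&2
    exit 1
fi
```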
Setup Instructions:
- Create directory: `mkdir -p ~/scripts && cd ~/scripts`
- Create venv: `python3 -m venv venv`
- Activate venv: `source venv/bin/activate`
- Install packages: `pip install requests beautifulsoup4`
- Deactivate (optional): `deactivate`
- Save the Python code above to `~/scripts/wechat_article_parser.py`.
- Save the Bash code above to `/usr/local/bin/getwc` (using `sudo nano ...`).
- Make the Bash script executable: `sudo chmod +x /usr/local/bin/getwc`.
Conclusion
Solving the WeChat copy-paste problem required a multi-step approach, evolving from simple ideas to a more complex but robust local script. We tackled Base64 bloat, CSS visibility issues, and Python packaging best practices. While the final step of integrating the local copy into a web platform like Discourse still requires a manual touch (dragging images), the core `getwc` tool now reliably archives WeChat articles locally, preserving content, images, and basic formatting for offline reading or further processing.
Remember to always respect copyright when handling content created by others. Happy archiving!
After posting it, I summoned Grok 3 and got no response. I suspect that, having only just been released, it isn't yet compatible with the forum's data format.
I pulled in Gemini 2.5 for a deep analysis of exactly what in the writing had caused the backlash; the answer was mediocre.
In truth, the reason I followed that thread and raised the question at all comes back to textbooks, or to certain works of literature.
This week we are reading 荷花淀 (Lotus Creek); the same unit also contains that piece so contrary to human nature, 黨費 (Party Dues). On elegance of prose alone, Sun Li's pen is certainly sufficient; but when a battlefield is written in wuxia style, it is hard to claim it has nothing to do with today's anti-Japanese fantasy dramas.
That is why I keep telling students they should watch 鬼子來了 (Devils on the Doorstep). For as long as that film, and films like it, cannot be screened, the next time this nation meets a foreign invasion it will, in all likelihood, once again be trampled for years.
Fantasy dramas cannot make true patriots, but they can certainly cultivate plenty of collaborators.
Reflective drama or literature stands a better chance; perhaps.
Really, the point has always been this: lotus blossoms are fine, and wuxia is fine, but neither can be all there is; in textbooks least of all.
Sun Li once gritted his teeth to keep from cursing the textbook editors who had mangled his prose; read today, that is all the more perverse.
Indeed, at too many moments even the manner of 荷花淀 has left someone in authority dissatisfied. And those creatures who run things for the authorities: come national calamity, how long would they need to transform into the scum running things for the next master?
Rest assured: not even a few seconds.
Remember the walls from the beginning? This sort has been straddling them all along.
Mm.