After reading DIYgod's new article, "Gracefully Using Cloudflare WARP to Tackle RSSHub's Anti-Scraping Challenges", I was excited enough to start experimenting and applied WARP to my own instance. My method is slightly different, though: the host machine runs a SOCKS proxy that the Docker containers connect to.
This article applies to RSSHub instances deployed with docker run or docker compose; the examples use docker compose.
From DIYgod:
Why use WARP?
During the years of developing RSSHub, I found that there are very few sites providing public APIs, and many sites implement strict anti-scraping controls to limit access to their platform content. Some sites block excessive requests from the same IP, while others comprehensively block IP addresses from common cloud service providers. Therefore, it has become very difficult to obtain the latest content updates.
This situation requires the use of proxies, but dedicated scraping proxies are usually expensive and have low cost-effectiveness. It would be great if Cloudflare WARP's unlimited bandwidth and rich IP resources could be utilized by RSSHub. RSSHub already supports general proxy protocols, as long as WARP can be wrapped as a general proxy.
My understanding of proxies is quite shallow, and most of the content in this article is based on trial and error. If there are areas for improvement, please comment and correct me.
I previously wrote an article on deploying RSSHub (and Miniflux): Self-Built RSSHub and Miniflux | Yunfi's Blog
I also wrote an overview about RSS: RSS: What is it? Why? How to use it? | Yunfi's Blog
If you are not familiar with using Docker to set up applications, you can check this article, which demonstrates the entire process from installing Docker to reverse proxy using Nginx Proxy Manager: Docker Series Prerequisite Skills: Visual Management of Nginx with Nginx Proxy Manager
Interested readers can take a look.
2023/08/26 Update: Solving Security Issues
Method Differences#
The official solution uses Docker deployment.
I used a one-click script from the project fscarmen/warp, which actually calls pufferffish/wireproxy to generate a SOCKS proxy.
Comparison | Docker Method | Script Method |
---|---|---|
Advantages | Simple and convenient, just update the compose file | More flexible, easier to use WARP+ and team version; can prioritize endpoints |
Disadvantages | Using custom configurations is more troublesome (need to write two files) | Directly operating on the host machine is slightly cumbersome |
In general, the Docker solution is convenient, while the script method is flexible. Moreover, the script can also be used for other tasks, such as refreshing WARP+ traffic.
If you use the docker-compose.yml from the official repository as is, Docker applications outside that compose file will not be able to use the WARP proxy provided by that container. Some modifications are needed.
Since the speed difference between the free version of WARP and other versions is not significant, using Docker deployment may be a better method, but I still used the one-click script solution because I have always used this 😂 and the firewall settings are relatively easier.
Docker Deployment#
Quick Deployment#
DIYgod has already placed the docker-compose.yml with WARP added in the main repository of RSSHub. If you want to use it directly, just follow the instructions in the documentation Deployment | RSSHub.
However, this approach has two shortcomings, both of which are addressed in the Advanced Settings section below:
- Only services in the same compose file can access the proxy.
- Only the default free WARP account can be used.
It is worth noting that the provided docker-compose.yml uses a method that takes up more space and memory. You can follow the comments in the file or directly copy the following yaml:
version: "3.9"
services:
  rsshub:
    image: diygod/rsshub:chromium-bundled
    restart: always
    ports:
      - "1200:1200"
    environment:
      NODE_ENV: production
      CACHE_TYPE: redis
      REDIS_URL: "redis://redis:6379/"
      PROXY_URI: "socks5h://warp-socks:9091"
      # add other environment variables below
    depends_on:
      - redis
  redis:
    image: redis:alpine
    restart: always
    volumes:
      - redis-data:/data
  warp-socks:
    image: monius/docker-warp-socks:latest
    privileged: true
    volumes:
      - /lib/modules:/lib/modules
    cap_add:
      - NET_ADMIN
      - SYS_ADMIN
    sysctls:
      net.ipv6.conf.all.disable_ipv6: 0
      net.ipv4.conf.all.src_valid_mark: 1
    healthcheck:
      test: ["CMD", "curl", "-f", "https://www.cloudflare.com/cdn-cgi/trace"]
      interval: 30s
      timeout: 10s
      retries: 5
volumes:
  redis-data:
Advanced Settings#
For issue 1, you can publish the proxy port. Publishing it on all interfaces is a security risk (see the Security Improvements section at the end of the article), so bind it only to 172.17.0.1 and 127.0.0.1: local applications and other Docker containers can then reach the proxy, while other source IPs cannot.
For issue 2, you can mount your own configuration file, which is somewhat more involved.
Modify the warp-socks section as follows (mind the indentation):
  warp-socks:
    image: monius/docker-warp-socks:latest
    privileged: true
    ports:
      - "172.17.0.1:9091:9091"
      - "127.0.0.1:9091:9091" # solve problem 1
    volumes:
      - /lib/modules:/lib/modules
      - ./wireguard:/opt:ro # solve problem 2
    cap_add:
      - NET_ADMIN
      - SYS_ADMIN
    sysctls:
      net.ipv6.conf.all.disable_ipv6: 0
      net.ipv4.conf.all.src_valid_mark: 1
    healthcheck:
      test: ["CMD", "curl", "-f", "https://www.cloudflare.com/cdn-cgi/trace"]
      interval: 30s
      timeout: 10s
      retries: 5
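To confirm that the published port is reachable only on the intended addresses, one option is to inspect the host's listening sockets. The following is only a sketch and not part of the original setup: wildcard_listeners is a hypothetical helper name, and it assumes that ss -ltnH prints the local address in the fourth column and that ports published by Docker show up as listeners on the host.

```shell
#!/bin/sh
# Print any listener on the given port that is bound to a wildcard address.
# Reads `ss -ltnH` style output on stdin; column 4 is "Local Address:Port".
wildcard_listeners() {
    port="$1"
    awk -v port="$port" '
        {
            addr = $4
            # Split at colons and take the last piece as the port
            # (IPv6 addresses contain colons themselves).
            n = split(addr, parts, ":")
            p = parts[n]
            host = substr(addr, 1, length(addr) - length(p) - 1)
            if (p == port && (host == "0.0.0.0" || host == "*" || host == "[::]"))
                print addr
        }'
}

# Usage on the Docker host (9091 is the port published in the compose file):
#   ss -ltnH | wildcard_listeners 9091
# No output means the port is not listening on all interfaces.
```

If the helper prints 0.0.0.0:9091 or [::]:9091, the port mapping is probably missing its 172.17.0.1 or 127.0.0.1 prefix.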
You need to create a wireguard folder in the same directory as the compose file, containing wgcf-profile.conf and, optionally, danted.conf.
Here is an example of the wgcf-profile.conf file. If you already have these values (for example, because you have experimented with WARP in Surge/Loon), fill them in accordingly; if not, you can try generating a profile with ViRb3/wgcf. If you have neither WARP+ nor a Teams account, skip this step: the container generates a free account automatically, and you can comment out the line - ./wireguard:/opt:ro # solve problem 2.
[Interface]
PrivateKey = SNbsrC3W7PAcIvcdUUgqdRKBjOuUby1VPtDurefGJns=
DNS = 1.1.1.1
DNS = 1.0.0.1
Address = 172.16.0.2
Address = fd01:5ca1:ab1e:815a:634d:1529:f7a0:8f32
[Peer]
PublicKey = bmXOC+F1FxEMF9dyiK2H5/1SUtzH0JuVo51h2wPfgyo=
AllowedIPs = 0.0.0.0/0
AllowedIPs = ::/0
Endpoint = engage.cloudflareclient.com:2408
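Before mounting a hand-written profile, it may help to check that none of the keys from the example above is missing. The following is only a sketch: check_wg_profile is a hypothetical helper, and the list of required keys is an assumption taken from the example file, not from the image's documentation.

```shell
#!/bin/sh
# Check that a WireGuard profile contains the keys used in the example above.
# Prints each missing key and returns non-zero if any is absent.
check_wg_profile() {
    profile="$1"
    missing=0
    for key in PrivateKey Address PublicKey AllowedIPs Endpoint; do
        # Match "Key = value" at the start of a line, allowing flexible spacing.
        if ! grep -Eq "^[[:space:]]*${key}[[:space:]]*=[[:space:]]*[^[:space:]]" "$profile"; then
            echo "missing: $key"
            missing=1
        fi
    done
    return "$missing"
}

# Usage, from the directory that holds the compose file:
#   check_wg_profile ./wireguard/wgcf-profile.conf && docker compose up -d
```

The check only looks at the shape of the file; it cannot tell whether the keys are valid for your WARP account.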
Note: The documentation mounts these two files to /opt/wireguard/, but I found that this doesn't work. DIYgod mentioned in the comments that the source code reads these two files from /opt/, so I wrote it this way. I haven't tested it yet, so if there are issues, please comment and let me know.
For convenience, here is a complete docker-compose.yml with ports exposed, no custom configuration, and container names set. You can copy it from below or wget the copy I placed in a gist:
wget https://gist.githubusercontent.com/yy4382/4f78b860fef29a7878e03a8a886a7367/raw/docker-compose.yml
version: "3.9"
# https://gist.githubusercontent.com/yy4382/4f78b860fef29a7878e03a8a886a7367/raw/docker-compose.yml
services:
  rsshub:
    image: diygod/rsshub:chromium-bundled
    restart: always
    container_name: rsshub-app
    ports:
      - "1200:1200"
    environment:
      NODE_ENV: production
      CACHE_TYPE: redis
      REDIS_URL: "redis://redis:6379/"
      PROXY_URI: "socks5h://warp-socks:9091"
      # add other environment variables below
    depends_on:
      - redis
  redis:
    image: redis:alpine
    container_name: rsshub-redis
    restart: always
    volumes:
      - redis-data:/data
  warp-socks:
    image: monius/docker-warp-socks:latest
    container_name: rsshub-warp
    privileged: true
    ports:
      - "172.17.0.1:9091:9091"
      - "127.0.0.1:9091:9091"
    volumes:
      - /lib/modules:/lib/modules
    cap_add:
      - NET_ADMIN
      - SYS_ADMIN
    sysctls:
      net.ipv6.conf.all.disable_ipv6: 0
      net.ipv4.conf.all.src_valid_mark: 1
    healthcheck:
      test: ["CMD", "curl", "-f", "https://www.cloudflare.com/cdn-cgi/trace"]
      interval: 30s
      timeout: 10s
      retries: 5
volumes:
  redis-data:
Using One-Click Script Deployment#
The idea is to run a SOCKS proxy on the host machine and then point RSSHub at it.
Using the project fscarmen/warp, you can generate a proxy through the WARP Linux Client or the third-party project pufferffish/wireproxy.
Feature | WARP Linux Client | wireproxy |
---|---|---|
Free Version | Supported ✅ | Supported ✅ |
WARP+ | Supported ✅ | Supported ✅ |
WARP Teams | Not Supported ❌ | Supported ✅ |
SOCKS proxy listening on 0.0.0.0 | Not Supported ❌ | Supported ✅, requires configuration file modification |
WARP+ requires a license, and Teams has four authentication schemes. For details, refer to the original project README.md.
Generating Proxy#
Currently, only wireproxy can serve a SOCKS proxy on 0.0.0.0:40000, so this article uses it. (For the problems with the Client, you can refer to this issue.)
If using the WARP Client, the network mode of RSSHub needs to be changed to host, which will affect the connection with Redis, requiring additional setup, so it is not recommended.
wget -N https://raw.githubusercontent.com/fscarmen/warp/main/menu.sh && bash menu.sh w
Follow the process and note the port settings. This article uses the default port 40000 as an example.
After the first run, the warp and warp h commands show two different help lists, both of which are worth checking.
Modifying Proxy Access Permissions#
The generated proxy listens only on localhost, and connections from Docker containers do not reach the host machine via localhost, so the bind address needs to be changed.
Open /etc/wireguard/proxy.conf and change
[Socks5]
BindAddress = 127.0.0.1:40000
to:
[Socks5]
BindAddress = 0.0.0.0:40000
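The same change can be scripted instead of edited by hand. The following is only a sketch: set_bind_address is a hypothetical helper, and it assumes the file contains a single BindAddress line, as in the snippet above.

```shell
#!/bin/sh
# Rewrite the BindAddress line of a wireproxy config in place.
# The original file is kept next to it with a .bak suffix.
set_bind_address() {
    conf="$1"
    addr="$2"
    sed -i.bak -E "s|^([[:space:]]*BindAddress[[:space:]]*=[[:space:]]*).*|\1${addr}|" "$conf"
}

# Usage (as root, since the file lives under /etc):
#   set_bind_address /etc/wireguard/proxy.conf 0.0.0.0:40000
```

wireproxy has to be restarted for the new address to take effect; how to do that depends on how the script installed it.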
Letting RSSHub Use the Proxy#
Open the original docker-compose.yml (the version without the warp-socks service) and add this line under environment:
PROXY_URI: 'socks://172.17.0.1:40000'
Here 172.17.0.1 is the IP of docker0, which is usually this value. If you are unsure, run ip addr show docker0 and look for a line such as inet 172.17.0.1/16.
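If you would rather not read the address off by eye, the lookup can be scripted. The following is only a sketch: iface_ipv4 and proxy_uri_for are hypothetical helpers, and they assume the ip command prints the address on a line starting with inet, as shown above.

```shell
#!/bin/sh
# Extract the first IPv4 address from `ip -4 addr show <iface>` output on stdin.
iface_ipv4() {
    awk '$1 == "inet" { sub(/\/.*/, "", $2); print $2; exit }'
}

# Build a PROXY_URI value from the same input and a port number.
# Returns non-zero if no IPv4 address was found.
proxy_uri_for() {
    port="$1"
    ip="$(iface_ipv4)"
    [ -n "$ip" ] || return 1
    printf 'socks://%s:%s\n' "$ip" "$port"
}

# Usage:
#   ip -4 addr show docker0 | proxy_uri_for 40000
```

The printed value can be pasted into the PROXY_URI line of the compose file.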
The complete file is as follows:
version: "3.9"
services:
  rsshub:
    image: diygod/rsshub:chromium-bundled
    restart: always
    ports:
      - "1200:1200"
    environment:
      NODE_ENV: production
      CACHE_TYPE: redis
      REDIS_URL: "redis://redis:6379/"
      PROXY_URI: "socks://172.17.0.1:40000"
      # add other environment variables below
    depends_on:
      - redis
  redis:
    image: redis:alpine
    restart: always
    volumes:
      - redis-data:/data
volumes:
  redis-data:
Then run docker compose up -d.
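To check that traffic through the proxy really goes out via WARP, you can request Cloudflare's trace page through it and look at the warp field. The following is only a sketch: warp_state is a hypothetical helper, and the address and port in the usage line assume the defaults used in this article.

```shell
#!/bin/sh
# Read Cloudflare trace output (key=value lines) on stdin and print the warp field.
# Returns non-zero if the output contains no warp field at all.
warp_state() {
    awk -F= '$1 == "warp" { print $2; found = 1 } END { if (!found) exit 1 }'
}

# Usage on the host, going through the proxy on the docker0 address:
#   curl -s --socks5-hostname 172.17.0.1:40000 \
#       https://www.cloudflare.com/cdn-cgi/trace | warp_state
```

A value of on or plus indicates that the request left through WARP; off means it went out directly.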
Security Improvements#
Because the proxy listens on 0.0.0.0, it poses a significant security risk: anyone who knows the server's IP can use it.
If your provider offers an external firewall layer (Tencent Cloud, Alibaba Cloud, AWS, etc.), that is ideal: just make sure ports 9091 and 40000 are closed in its panel.
With smaller VPS providers it is more complicated: you need to set up firewall rules on the server itself so that only Docker can reach port 40000. For example, using ufw:
sudo ufw allow from 172.17.0.0/16
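The rule above allows all traffic from the Docker subnet and relies on ufw's default policy to reject everyone else. A narrower variant limits the allow rule to the proxy port and adds an explicit deny. The following is only a sketch: ufw_rules_for is a hypothetical helper that prints the commands for review instead of running them, and the subnet and port are the defaults assumed in this article.

```shell
#!/bin/sh
# Print ufw commands that let one subnet reach a TCP port and block everyone else.
# ufw evaluates rules in order, so the allow rule must be added before the deny rule.
ufw_rules_for() {
    subnet="$1"
    port="$2"
    echo "ufw allow from ${subnet} to any port ${port} proto tcp"
    echo "ufw deny ${port}/tcp"
}

# Usage: review the output first, then run it as root.
#   ufw_rules_for 172.17.0.0/16 40000
#   ufw_rules_for 172.17.0.0/16 40000 | sudo sh
```

Make sure your SSH port is allowed before enabling ufw, otherwise you may lock yourself out of the server.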
Conclusion#
Actually, my VPS's IP is quite good, and I haven't encountered any websites blocking scraping yet; I mainly did this to experiment.
I encountered several issues along the way and took some detours, but I eventually figured it out.
Thanks to the authors of these projects and the guidance from RSSHub community friends.
Side Note:#
I first encountered WARP earlier this year, when I set it up in Surge to unlock ChatGPT. If you are interested in using WARP through Surge or registering for a team version of WARP, you can check out the two tutorials I followed at that time:
This article is licensed under CC BY-NC-SA 4.0