Yunfi

Two ways to use WARP with RSSHub

After reading DIYgod's new article, Gracefully Using Cloudflare WARP to Tackle RSSHub's Anti-Scraping Challenges - DIYgod, I got excited and started applying WARP to my own instance. I took a slightly different route, though: the host machine runs a SOCKS proxy, and the Docker containers connect to it.

This article applies to RSSHub deployed with docker run or docker compose; the examples use docker compose.

From DIYgod:

Why use WARP?

During the years of developing RSSHub, I found that there are very few sites providing public APIs, and many sites implement strict anti-scraping controls to limit access to their platform content. Some sites block excessive requests from the same IP, while others comprehensively block IP addresses from common cloud service providers. Therefore, it has become very difficult to obtain the latest content updates.

This situation requires the use of proxies, but dedicated scraping proxies are usually expensive and have low cost-effectiveness. It would be great if Cloudflare WARP's unlimited bandwidth and rich IP resources could be utilized by RSSHub. RSSHub already supports general proxy protocols, as long as WARP can be wrapped as a general proxy.

My understanding of proxies is quite shallow, and most of the content in this article is based on trial and error. If there are areas for improvement, please comment and correct me.

I previously wrote an article on deploying RSSHub (and Miniflux): Self-Built RSSHub and Miniflux | Yunfi's Blog

I also wrote an overview about RSS: RSS: What is it? Why? How to use it? | Yunfi's Blog

If you are not familiar with using Docker to set up applications, you can check this article, which demonstrates the entire process from installing Docker to reverse proxy using Nginx Proxy Manager: Docker Series Prerequisite Skills: Visual Management of Nginx with Nginx Proxy Manager

Interested readers can take a look.

2023/08/26 Update: Solving Security Issues

Method Differences#

The official solution runs WARP in a Docker container.

I used a one-click script from the project fscarmen/warp, which actually calls pufferffish/wireproxy to generate a SOCKS proxy.

| Comparison | Docker Method | Script Method |
| --- | --- | --- |
| Advantages | Simple and convenient, just update the compose file | More flexible, easier to use WARP+ and team version; can prioritize endpoints |
| Disadvantages | Using custom configurations is more troublesome (need to write two files) | Directly operating on the host machine is slightly cumbersome |

In general, the Docker solution is convenient, while the script method is flexible. Moreover, the script can also be used for other tasks, such as refreshing WARP+ traffic.

If you use the docker-compose.yml from the official repository as-is, Docker applications outside that compose file will not be able to use the WARP proxy provided by that container. Some modifications are needed.

Since the speed difference between the free version of WARP and the other versions is not significant, Docker deployment may be the better method. I still went with the one-click script because it is what I have always used 😂 and because the firewall settings are relatively easier.

Docker Deployment#

Quick Deployment#

DIYgod has already placed the docker-compose.yml with WARP added in the main repository of RSSHub. If you want to use it directly, just follow the instructions in the documentation Deployment | RSSHub.

However, this approach has two shortcomings. The advanced settings section below shows how to work around them:

  1. Only services in the same compose file can access the proxy.
  2. Only the default free WARP account can be used.

Note that the provided docker-compose.yml uses a setup that takes up more disk space and memory. You can follow the comments in that file to change it, or copy the following YAML directly:

version: "3.9"

services:
  rsshub:
    image: diygod/rsshub:chromium-bundled
    restart: always
    ports:
      - "1200:1200"
    environment:
      NODE_ENV: production
      CACHE_TYPE: redis
      REDIS_URL: "redis://redis:6379/"
      PROXY_URI: "socks5h://warp-socks:9091"
      # add other environment variables below
    depends_on:
      - redis

  redis:
    image: redis:alpine
    restart: always
    volumes:
      - redis-data:/data

  warp-socks:
    image: monius/docker-warp-socks:latest
    privileged: true
    volumes:
      - /lib/modules:/lib/modules
    cap_add:
      - NET_ADMIN
      - SYS_ADMIN
    sysctls:
      net.ipv6.conf.all.disable_ipv6: 0
      net.ipv4.conf.all.src_valid_mark: 1
    healthcheck:
      test: ["CMD", "curl", "-f", "https://www.cloudflare.com/cdn-cgi/trace"]
      interval: 30s
      timeout: 10s
      retries: 5

volumes:
  redis-data:

Advanced Settings#

For issue 1 above, you can publish the proxy port. Publishing it on every interface would be a security risk (see the end of the article), so bind it to 172.17.0.1 and 127.0.0.1 only. Local applications and other Docker containers can then reach it, while other source IPs cannot.

For issue 2, you can mount your own conf file, which is relatively complex.

You can modify the warp-socks section to: (note the indentation)

  warp-socks:
    image: monius/docker-warp-socks:latest
    privileged: true
    ports:
      - "172.17.0.1:9091:9091"
      - "127.0.0.1:9091:9091" #solve problem 1
    volumes:
      - /lib/modules:/lib/modules
      - ./wireguard:/opt:ro # solve problem 2
    cap_add:
      - NET_ADMIN
      - SYS_ADMIN
    sysctls:
      net.ipv6.conf.all.disable_ipv6: 0
      net.ipv4.conf.all.src_valid_mark: 1
    healthcheck:
      test: ["CMD", "curl", "-f", "https://www.cloudflare.com/cdn-cgi/trace"]
      interval: 30s
      timeout: 10s
      retries: 5

You need to create a wireguard folder in the same directory as the compose file, which contains wgcf-profile.conf and danted.conf (optional).

Here is an example of the wgcf-profile.conf file. If you already have these values (for example, from experimenting with WARP in Surge/Loon), fill them in accordingly; if not, you can try generating a profile with ViRb3/wgcf. If you don't have WARP+ or the team version, you can skip this step: the free version is generated automatically, and you can comment out the line - ./wireguard:/opt:ro # solve problem 2.

[Interface]
PrivateKey = SNbsrC3W7PAcIvcdUUgqdRKBjOuUby1VPtDurefGJns=
DNS = 1.1.1.1
DNS = 1.0.0.1
Address = 172.16.0.2
Address = fd01:5ca1:ab1e:815a:634d:1529:f7a0:8f32

[Peer]
PublicKey = bmXOC+F1FxEMF9dyiK2H5/1SUtzH0JuVo51h2wPfgyo=
AllowedIPs = 0.0.0.0/0
AllowedIPs = ::/0
Endpoint = engage.cloudflareclient.com:2408
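
Before mounting a hand-edited profile, it can help to confirm that none of the keys shown above went missing. The following is only a minimal sketch: check_wgcf_profile is a made-up helper name, the default path assumes the wireguard folder described above, and it checks that the keys are present, not that their values are valid.

```shell
#!/usr/bin/env bash
# Check that a WireGuard profile contains the keys used in the example above.
# $1: path to the profile (assumed default: ./wireguard/wgcf-profile.conf).
# Returns 0 if every key is present, 1 otherwise.
check_wgcf_profile() {
  local file="${1:-./wireguard/wgcf-profile.conf}"
  local key missing=0
  for key in PrivateKey Address PublicKey AllowedIPs Endpoint; do
    # Match lines such as "Endpoint = host:port", ignoring leading spaces.
    if ! grep -Eq "^[[:space:]]*${key}[[:space:]]*=" "$file"; then
      echo "missing: ${key}" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Calling check_wgcf_profile with no argument checks ./wireguard/wgcf-profile.conf and lists any missing key on stderr. If you generate the profile with wgcf, running wgcf register and then wgcf generate should write wgcf-profile.conf into the current directory.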

Note: The documentation mounts these two files to /opt/wireguard/, but I found that this doesn't work. DIYgod mentioned in the comments that the source code reads these two files from /opt/, so I wrote it this way. I haven't tested it yet, so if there are issues, please comment and let me know.

For convenience, here is a complete docker-compose.yml (ports exposed, no custom configuration, container names set) that you can copy. You can also wget the copy I put in a gist:

wget https://gist.githubusercontent.com/yy4382/4f78b860fef29a7878e03a8a886a7367/raw/docker-compose.yml

version: "3.9"
# https://gist.githubusercontent.com/yy4382/4f78b860fef29a7878e03a8a886a7367/raw/docker-compose.yml
services:
  rsshub:
    image: diygod/rsshub:chromium-bundled
    restart: always
    container_name: rsshub-app
    ports:
      - "1200:1200"
    environment:
      NODE_ENV: production
      CACHE_TYPE: redis
      REDIS_URL: "redis://redis:6379/"
      PROXY_URI: "socks5h://warp-socks:9091"
      # add other environment variables below
    depends_on:
      - redis

  redis:
    image: redis:alpine
    container_name: rsshub-redis
    restart: always
    volumes:
      - redis-data:/data

  warp-socks:
    image: monius/docker-warp-socks:latest
    container_name: rsshub-warp
    privileged: true
    ports:
      - "172.17.0.1:9091:9091"
      - "127.0.0.1:9091:9091"
    volumes:
      - /lib/modules:/lib/modules
    cap_add:
      - NET_ADMIN
      - SYS_ADMIN
    sysctls:
      net.ipv6.conf.all.disable_ipv6: 0
      net.ipv4.conf.all.src_valid_mark: 1
    healthcheck:
      test: ["CMD", "curl", "-f", "https://www.cloudflare.com/cdn-cgi/trace"]
      interval: 30s
      timeout: 10s
      retries: 5

volumes:
  redis-data:
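
To check that the published port really routes through WARP, one option is to fetch Cloudflare's trace page through the proxy and read its warp= line. This is a rough sketch, assuming curl is installed on the host and the port is published on 127.0.0.1:9091 as in the file above; warp_status and check_warp_proxy are names invented for this example.

```shell
#!/usr/bin/env bash
# Print the value of the "warp=" line from cdn-cgi/trace output read on stdin.
warp_status() {
  sed -n 's/^warp=//p'
}

# Fetch the trace page through a SOCKS proxy and print the warp status.
# $1: proxy URI (assumed default matches the compose file above).
check_warp_proxy() {
  local proxy="${1:-socks5h://127.0.0.1:9091}"
  curl -fsS --max-time 15 --proxy "$proxy" \
    https://www.cloudflare.com/cdn-cgi/trace | warp_status
}
```

As far as I know, check_warp_proxy prints on when traffic goes through WARP (plus for WARP+) and off otherwise. For the script method described later in this article, pass the other address, for example check_warp_proxy socks5h://127.0.0.1:40000.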

Using One-Click Script Deployment#

The principle is to generate a SOCKS proxy and then let RSSHub use this proxy.

Using the project fscarmen/warp, you can generate a proxy through the WARP Linux Client or the third-party project pufferffish/wireproxy.

| | WARP Linux Client | wireproxy |
| --- | --- | --- |
| Free Version | Supported ✅ | Supported ✅ |
| WARP+ | Supported ✅ | Supported ✅ |
| WARP Teams | Not Supported ❌ | Supported ✅ |
| Generate SOCKS 0.0.0.0 | Not Supported ❌ | Supported ✅, requires configuration file modification |

WARP+ requires a license, and Teams has four authentication schemes. For details, refer to the original project README.md.

Generating Proxy#

Currently, only wireproxy can expose a SOCKS proxy on 0.0.0.0:40000, so this article uses it. (For the limitation in the Client, you can refer to this issue.)

With the WARP Client, RSSHub's network mode would have to be changed to host, which affects the connection to Redis and requires additional setup, so it is not recommended.

wget -N https://raw.githubusercontent.com/fscarmen/warp/main/menu.sh && bash menu.sh w

Follow the process and note the port settings. This article uses the default port 40000 as an example.

After the first run, the commands warp and warp h each show a different help list; both are worth checking.

Modifying Proxy Access Permissions#

The generated proxy listens only on localhost, and connections from Docker containers do not come from localhost as far as the host is concerned, so the bind address has to change.

Open /etc/wireguard/proxy.conf and change

[Socks5]
BindAddress = 127.0.0.1:40000

to:

[Socks5]
BindAddress = 0.0.0.0:40000
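
The same edit can be scripted. This is a minimal sketch, assuming the config lives at /etc/wireguard/proxy.conf as above; open_socks_bind is a made-up helper name. It keeps whatever port is configured and leaves a .bak copy of the original file next to it.

```shell
#!/usr/bin/env bash
# Rewrite "BindAddress = 127.0.0.1:PORT" to "BindAddress = 0.0.0.0:PORT".
# $1: path to the wireproxy config (default taken from the article).
# The untouched original is kept as <file>.bak.
open_socks_bind() {
  local conf="${1:-/etc/wireguard/proxy.conf}"
  sed -i.bak -E \
    's/^BindAddress[[:space:]]*=[[:space:]]*127\.0\.0\.1:([0-9]+)/BindAddress = 0.0.0.0:\1/' \
    "$conf"
}
```

Run it as root (the file sits under /etc), then restart wireproxy so the new address takes effect. I am assuming here that the script registered a systemd unit called wireproxy, in which case systemctl restart wireproxy would do it; check the script's own menu if that unit name does not exist on your machine.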

Letting RSSHub Use the Proxy#

Open the original docker-compose.yml (the version without the warp-socks service) and add this line under environment:

PROXY_URI: 'socks://172.17.0.1:40000'

Here 172.17.0.1 is the IP address of the docker0 bridge, which is the usual default. If you are unsure, run ip addr show docker0 and look for a line like inet 172.17.0.1/16.
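
If you would rather not read the address by eye, it can be extracted from the ip output. This is a small sketch, assuming the iproute2 ip command is available; first_ipv4 and bridge_ip are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Print the first IPv4 address (without the /prefix) found in `ip addr`
# output read on stdin.
first_ipv4() {
  awk '$1 == "inet" { sub(/\/.*/, "", $2); print $2; exit }'
}

# Print the IPv4 address of a network interface; $1 defaults to docker0.
bridge_ip() {
  ip -4 addr show "${1:-docker0}" | first_ipv4
}
```

Running bridge_ip prints the address to put into PROXY_URI. It prints nothing if the docker0 interface does not exist.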

The complete file is as follows:

version: "3.9"

services:
  rsshub:
    image: diygod/rsshub:chromium-bundled
    restart: always
    ports:
      - "1200:1200"
    environment:
      NODE_ENV: production
      CACHE_TYPE: redis
      REDIS_URL: "redis://redis:6379/"
      PROXY_URI: "socks://172.17.0.1:40000"
      # add other environment variables below
    depends_on:
      - redis

  redis:
    image: redis:alpine
    restart: always
    volumes:
      - redis-data:/data

volumes:
  redis-data:

Then run docker compose up -d.

Security Improvements#

Because the proxy listens on 0.0.0.0, there is a significant security risk: anyone who knows the server's IP can use this proxy.

If your service provider offers an additional firewall layer (Tencent Cloud, Alibaba Cloud, AWS, etc.), that is ideal: just keep ports 9091 and 40000 closed in its panel.

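With smaller VPS providers it is more complicated: you need to set up firewall rules on the host so that only Docker can reach port 40000. For example, with ufw (this relies on ufw's default policy of denying other incoming traffic; containers started by docker compose may sit on a bridge subnet other than 172.17.0.0/16, so check the range with docker network inspect):

sudo ufw allow from 172.17.0.0/16 to any port 40000 proto tcp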
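
Because the subnet differs from one Docker network to another, one way to avoid guessing is to list the subnets Docker actually uses and print a candidate rule for each. This is only a sketch: docker_subnets and ufw_rules_for_port are made-up helper names, and the rules are printed for review, not executed.

```shell
#!/usr/bin/env bash
# Print the subnets of all Docker networks, one per line
# (networks without IPAM config, such as host and none, print nothing).
docker_subnets() {
  # Word splitting of the network ID list is intentional here.
  docker network inspect \
    -f '{{range .IPAM.Config}}{{println .Subnet}}{{end}}' \
    $(docker network ls -q)
}

# Read subnets on stdin and print (not run) one ufw rule per subnet.
# $1: TCP port to open for those subnets.
ufw_rules_for_port() {
  local port="$1" subnet
  while IFS= read -r subnet; do
    if [ -n "$subnet" ]; then
      echo "ufw allow from ${subnet} to any port ${port} proto tcp"
    fi
  done
}
```

docker_subnets | ufw_rules_for_port 40000 prints one line per Docker network. Pick the line for the network RSSHub runs on and run it with sudo.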

Conclusion#

Actually, my VPS's IP is quite good, and I haven't encountered any websites blocking scraping yet; I mainly did this to experiment.

I encountered several issues along the way and took some detours, but I eventually figured it out.

Thanks to the authors of these projects and the guidance from RSSHub community friends.

Side Note:#

I first encountered WARP earlier this year, when I extracted its configuration into Surge to unlock ChatGPT. If you are interested in using WARP through Surge or registering for a team version of WARP, you can check out the two tutorials I followed at the time:


This article is licensed under CC BY-NC-SA 4.0
You can read this article on my Hexo blog
You can also check this Page to choose to follow my updates in various aspects.
