Reducing Forgejo Scraping

Around 26/2 I noticed unusually high resource usage on my Forgejo instance. On investigation, it appeared that the instance had been crawled since 19/2, with request rates as high as 50 requests per minute. The crawler had started fetching the web page for each individual commit, which massively increased memory usage.
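As a rough illustration of how a per-minute request rate like this can be measured, here is a minimal sketch that counts requests per minute from access log lines. It assumes the log uses Common Log Format timestamps (for example from a reverse proxy in front of Forgejo), which may not match your setup; the function names are hypothetical.

```python
import re
from collections import Counter
from datetime import datetime

# Assumption: timestamps look like [dd/Mon/yyyy:HH:MM:SS +zzzz] (Common Log Format).
# The capture group stops before the seconds so that requests group by minute.
TIMESTAMP = re.compile(r"\[(\d{2}/[A-Za-z]{3}/\d{4}:\d{2}:\d{2}):\d{2} [+-]\d{4}\]")


def requests_per_minute(lines):
    """Count how many log lines fall into each one-minute bucket."""
    counts = Counter()
    for line in lines:
        match = TIMESTAMP.search(line)
        if match:
            minute = datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M")
            counts[minute] += 1
    return counts


def peak(counts):
    """Return the (minute, count) pair with the most requests, or None if empty."""
    if not counts:
        return None
    return max(counts.items(), key=lambda item: item[1])
```

Passing an open log file to requests_per_minute and then calling peak on the result gives the busiest minute, which is enough to spot a sustained crawl.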

To prevent this, I added a robots.txt file based on the one used by codeberg.org. I then limited access to the repository being crawled, to reduce the immediate impact. This seems to have worked, as I haven’t seen the crawler since 28/2.

For reference, the robots.txt was added to forgejo_data/gitea/public/robots.txt and is as follows:

User-agent: *
Disallow: /api/*
Disallow: /avatars
Disallow: /user/*
Disallow: /*/*/src/commit/*
Disallow: /*/*/commit/*
Disallow: /*/*/*/refs/*
Disallow: /*/*/*/star
Disallow: /*/*/*/watch
Disallow: /*/*/labels
Disallow: /*/*/activity/*
Disallow: /vendor/*
Disallow: /swagger.*.json

Disallow: /explore/*?*

Disallow: /repo/create
Disallow: /repo/migrate
Disallow: /org/create
Disallow: /*/*/fork

Disallow: /*/*/watchers
Disallow: /*/*/stargazers
Disallow: /*/*/forks

Disallow: /*/*/activity
Disallow: /*/*/projects
Disallow: /*/*/commits/
Disallow: /*/*/branches
Disallow: /*/*/tags
Disallow: /*/*/compare
Disallow: /*/*/lastcommit/*

Disallow: /*/*/issues/new
Disallow: /*/*/issues/?*
Disallow: /*/*/issues?*
Disallow: /*/*/pulls/?*
Disallow: /*/*/pulls?*
Disallow: /*/*/pulls/*/files

Disallow: /*/tree/
Disallow: /*/download
Disallow: /*/revisions
Disallow: /*/commits/*?author
Disallow: /*/commits/*?path
Disallow: /*/comments
Disallow: /*/blame/
Disallow: /*/raw/
Disallow: /*/cache/
Disallow: /.git/
Disallow: */.git/
Disallow: /*.git
# /*.atom
# /*.rss

Disallow: /*/*/archive/
Disallow: *.bundle
Disallow: */commit/*.patch
Disallow: */commit/*.diff

Disallow: /*lang=*
Disallow: /*source=*
Disallow: /*ref_cta=*
Disallow: /*plan=*
Disallow: /*return_to=*
Disallow: /*ref_loc=*
Disallow: /*setup_organization=*
Disallow: /*source_repo=*
Disallow: /*ref_page=*
Disallow: /*referrer=*
Disallow: /*report=*
Disallow: /*author=*
Disallow: /*since=*
Disallow: /*until=*
Disallow: /*commits?author=*
Disallow: /*tab=*
Disallow: /*q=*
Disallow: /*repo-search-archived=*

Crawl-delay: 2
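
To sanity-check which URLs rules like these would block, here is a minimal sketch of the wildcard matching that many crawlers apply, where `*` matches any run of characters and a trailing `$` anchors the end of the URL. It only handles Disallow patterns and ignores Allow rules and rule precedence, which is enough here because the file above contains no Allow lines. The function names and the example paths (`/alice/project/...`) are made up for illustration. Note that Python's built-in urllib.robotparser does not interpret `*` inside paths, so it is not a good fit for testing this file.

```python
import re


def rule_to_regex(rule):
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches any run of characters and a trailing '$' anchors the end;
    everything else is literal and is matched as a prefix of the URL path.
    """
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_disallowed(path, disallow_rules):
    """Return True if any Disallow pattern matches from the start of the path."""
    return any(rule_to_regex(rule).match(path) for rule in disallow_rules)


# A few patterns copied from the robots.txt above.
RULES = ["/*/*/commit/*", "/*/*/issues?*", "/*lang=*", "*.bundle"]

# Hypothetical paths: a single-commit page and a repository home page.
commit_blocked = is_disallowed("/alice/project/commit/abc123", RULES)
home_blocked = is_disallowed("/alice/project", RULES)
```

With these rules the per-commit page is matched by the `/*/*/commit/*` pattern, while the repository home page is not matched by any of them, which is the behaviour wanted for a well-behaved crawler.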