Automatically rename arXiv PDFs
If you read a lot of arXiv papers you have probably noticed that their PDFs are named in the YYMM.number.pdf format, where YYMM is the year and month and is followed by a zero-padded sequence number. I use fd (a faster alternative to find) to search for files, so I’d much prefer to have the paper’s title or authors as part of the filename.
In this post I show how to automate this process so that as soon as an arXiv PDF is saved to a directory it is renamed to include the title in the filename.
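To make the filename format concrete, here is a minimal sketch that splits such a name into its parts. The pattern, the `ArxivName` type, and the `parse_arxiv_filename` helper are my own illustration rather than anything arXiv provides, and the sketch assumes two-digit years map to 2000 + YY and that an optional version suffix such as v5 may follow the sequence number.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical pattern for names such as "1706.03762.pdf" or "1706.03762v5.pdf".
ARXIV_FILENAME = re.compile(
    r"(?P<yy>\d{2})(?P<mm>\d{2})\.(?P<number>\d{4,5})(?P<version>v\d+)?\.pdf"
)


class ArxivName(NamedTuple):
    year: int
    month: int
    number: str             # kept as a string to preserve the zero padding
    version: Optional[str]  # e.g. "v5", or None when the name has no version


def parse_arxiv_filename(name: str) -> Optional[ArxivName]:
    """Return the parts of an arXiv-style filename, or None if it does not match."""
    m = ARXIV_FILENAME.fullmatch(name)
    if m is None:
        return None
    # Assumption: YY is a year in the 2000s.
    return ArxivName(2000 + int(m["yy"]), int(m["mm"]), m["number"], m["version"])


print(parse_arxiv_filename("1706.03762v5.pdf"))
```

Names that do not follow the pattern, such as notes.pdf, simply yield None.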
Rename script
At the heart of this automation is a Python script that takes the file path as its argument, uses arXiv’s REST API to get the paper title, and renames the file to the title_YYMM.number.pdf format:
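#!/usr/bin/env python3
import re
import sys
from pathlib import Path

import feedparser


def slugify(s):
    """
    Adapted from Django:
    https://github.com/django/django/blob/main/django/utils/text.py
    """
    # Spaces become underscores; anything that is not a word character,
    # hyphen, or dot is dropped.
    s = str(s).strip().replace(' ', '_')
    return re.sub(r'(?u)[^-\w.]', '', s)


p = Path(sys.argv[1])
arxiv_id = p.stem  # e.g. "2101.00001" or "2101.00001v2"

feed = feedparser.parse(f"http://export.arxiv.org/api/query?id_list={arxiv_id}&max_results=1")
if not feed.entries:
    # Nothing came back for this ID, so leave the file untouched.
    sys.exit(f"No arXiv entry found for {arxiv_id}")
paper = feed.entries[0]
title = paper["title"]
title_slug = slugify(title)

# Path.with_stem requires Python 3.9 or newer.
new_p = p.with_stem(f"{title_slug}_{arxiv_id}")
if not new_p.exists():
    p.rename(new_p)
else:
    print("😥 File already exists")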
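To see what the renaming step produces without touching the network or the filesystem, the naming logic can be pulled into a pure function. This is only a sketch: `slugify` is copied from the script above, while `renamed_path` is a hypothetical helper that is not part of the script. It uses `with_name` rather than `with_stem`, so it also runs on Python versions older than 3.9.

```python
import re
from pathlib import Path


def slugify(s: str) -> str:
    # Same rule as in the rename script: spaces become underscores, then
    # everything except word characters, hyphens, and dots is dropped.
    s = str(s).strip().replace(" ", "_")
    return re.sub(r"(?u)[^-\w.]", "", s)


def renamed_path(path: Path, title: str) -> Path:
    """Hypothetical helper: build the title_YYMM.number.pdf path for a paper."""
    return path.with_name(f"{slugify(title)}_{path.stem}{path.suffix}")


print(renamed_path(Path("1706.03762.pdf"), "Attention Is All You Need"))
# prints Attention_Is_All_You_Need_1706.03762.pdf
```

Punctuation such as colons is removed while hyphens survive, so a title like "BERT: Pre-training of Deep Bidirectional Transformers" becomes BERT_Pre-training_of_Deep_Bidirectional_Transformers.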
Automatically invoking the script
The incron daemon can be used to automatically invoke the rename script when a new file is added to a directory. incron is similar to cron, but instead of running commands based on time, it runs commands based on filesystem events. It is typically not installed by default, but it’s usually available in package managers. On Arch:
$ pacman -S incron
Note that installing this package does not necessarily start the daemon so make sure the service is enabled and running:
$ sudo systemctl enable --now incrond.service
Similar to cron, incron is driven by a table where each line has the following format:
watched_path event_type command
The table can be edited with incrontab:
$ incrontab -e
Add the following line, replacing the paths to match the directory where you store your PDFs and the location of the Python rename script:
/PATH/TO/DIR IN_CREATE [[ $# =~ [0-9]+\.[0-9]+(v[0-9]+)*\.pdf ]] && /PATH/TO/rename_arxiv $@/$#
The IN_CREATE event fires when a new file or directory is created in the watched directory. The regular expression test in the first part of the command ensures we only call the script for files that match the arXiv filename pattern. incron provides some wildcards that can be used in the command section. The above command uses the following:
$@ expands to the watched directory path
$# expands to the event-related filename
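The filter and the wildcard expansion can be tried out in Python before editing the table. The sketch below is an approximation, not how incron itself works: `command_for_event` is a hypothetical function, the regular expression is copied from the table entry, and `re.search` stands in for the shell’s `=~` operator, which also matches anywhere in the string.

```python
import re
from typing import List, Optional

# Same expression as in the incrontab entry above.
ARXIV_PATTERN = re.compile(r"[0-9]+\.[0-9]+(v[0-9]+)*\.pdf")


def command_for_event(
    watched_dir: str,
    filename: str,
    script: str = "/PATH/TO/rename_arxiv",
) -> Optional[List[str]]:
    """Return the command that would run for an event, or None if filtered out.

    watched_dir plays the role of $@ and filename the role of $#.
    """
    # Unanchored search, mirroring the behaviour of [[ ... =~ ... ]].
    if ARXIV_PATTERN.search(filename) is None:
        return None
    return [script, f"{watched_dir}/{filename}"]


print(command_for_event("/PATH/TO/DIR", "1706.03762.pdf"))
print(command_for_event("/PATH/TO/DIR", "notes.pdf"))
```

Because the match is unanchored, a name that merely contains the pattern, such as an already renamed Attention_Is_All_You_Need_1706.03762.pdf, also passes the test. If that matters for your setup, anchoring the expression with ^ and $ is one way to tighten it.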