Automatically rename arXiv PDFs

If you read a lot of arXiv papers you have probably noticed that their PDFs are named in the YYMM.number.pdf format, where YYMM is the two-digit year and month, followed by a zero-padded sequence number. I use fd (a faster alternative to find) to search for files, so I’d much prefer to have the paper’s title or authors as part of the filename.

In this post I show how to automate this process so that as soon as an arXiv PDF is saved to a directory it is renamed to include the title in the filename.

Rename script

At the heart of this automation is a Python script that takes the file path as its argument, uses arXiv’s REST API to get the paper title, and renames the file to title_YYMM.number.pdf:

#!/usr/bin/env python3

import re
import sys
from pathlib import Path

import feedparser

def slugify(s):
    """
    Adapted from Django:
    https://github.com/django/django/blob/main/django/utils/text.py
    """
    s = str(s).strip().replace(' ', '_')
    return re.sub(r'(?u)[^-\w.]', '', s)

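# Path of the downloaded PDF, e.g. /PATH/TO/DIR/YYMM.number.pdf
p = Path(sys.argv[1])
arxiv_id = p.stem

# Ask the arXiv API for the metadata of this one paper
feed = feedparser.parse(f"http://export.arxiv.org/api/query?id_list={arxiv_id}&max_results=1")
if not feed.entries:
    sys.exit(f"No arXiv entry found for {arxiv_id}")
paper = feed.entries[0]
# Collapse line breaks and repeated spaces before slugifying
title = " ".join(paper["title"].split())
title_slug = slugify(title)

# Path.with_stem requires Python 3.9+
new_p = p.with_stem(f"{title_slug}_{arxiv_id}")
if not new_p.exists():
    p.rename(new_p)
else:
    print("😥 File already exists")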
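To check what filename the script would produce without calling the API or touching any files, the naming logic can be tried on its own. The following is only a sketch: slugify is copied from the script, while build_name is a hypothetical helper and the path is a made-up example.

```python
import re
from pathlib import Path


def slugify(s):
    # Same logic as in the rename script: spaces become underscores,
    # then everything except word characters, '-' and '.' is dropped.
    s = str(s).strip().replace(' ', '_')
    return re.sub(r'(?u)[^-\w.]', '', s)


def build_name(path, title):
    # Hypothetical helper: computes the new path without renaming anything.
    p = Path(path)
    return p.with_name(f"{slugify(title)}_{p.stem}{p.suffix}")


new_path = build_name("/tmp/papers/1706.03762.pdf", "Attention Is All You Need")
print(new_path.name)  # Attention_Is_All_You_Need_1706.03762.pdf
```

Characters that are awkward in filenames, such as ':' or '/', are simply dropped, so a title like "A: B/C?" becomes A_BC.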

Automatically invoking the script

The incron daemon can be used to automatically invoke the rename script when a new file is added to a directory. incron is similar to cron, but instead of running commands based on time, it runs commands based on filesystem events. It is typically not installed by default, but it’s usually available through the system package manager. On Arch:

$ sudo pacman -S incron

Note that installing this package does not necessarily start the daemon, so make sure the service is enabled and running:

$ sudo systemctl enable --now incrond.service

Similar to cron, incron is driven by a table where each line has the following format:

watched_path event_type command

The table can be edited with incrontab:

$ incrontab -e

Add the following line, replacing the paths to match the directory where you store your PDFs and the location of the Python rename script:

/PATH/TO/DIR    IN_CREATE    [[ $# =~ [0-9]+\.[0-9]+(v[0-9]+)*\.pdf ]] && /PATH/TO/rename_arxiv $@/$#

The IN_CREATE event fires when a new file or directory is created in the watched directory. The regular expression test in the first part of the command ensures we only call the script for files that match the arXiv filename pattern. incron provides some wildcards that can be used in the command section. The above command uses the following:

$@ expands to the watched directory path
$# expands to the event-related filename
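To get a feel for which filenames pass the regular expression test, the same pattern can be tried in Python. This is a rough sketch rather than what incron itself runs: would_trigger and script_argument are hypothetical helpers, and it assumes that Python's re.search behaves like the shell's =~ for this particular pattern (both look for a match anywhere in the name).

```python
import re

# Same pattern as in the incrontab line. It is not anchored, so it only
# needs to match somewhere inside the filename.
ARXIV_PDF = re.compile(r"[0-9]+\.[0-9]+(v[0-9]+)*\.pdf")


def would_trigger(filename):
    # Hypothetical helper: True if the rename script would be called.
    return ARXIV_PDF.search(filename) is not None


def script_argument(watched_dir, filename):
    # Mirrors the $@/$# expansion: watched directory + "/" + filename.
    return f"{watched_dir}/{filename}"


for name in ["2101.00001.pdf", "2101.00001v2.pdf", "notes.pdf", "2101.00001.pdf.part"]:
    print(name, would_trigger(name))

print(script_argument("/PATH/TO/DIR", "2101.00001.pdf"))
```

Because the pattern is unanchored, a name such as 2101.00001.pdf.part also passes the test, while notes.pdf does not. Adding ^ and $ to the pattern would be one way to restrict it to exact matches.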