If you read a lot of arXiv articles you have probably noticed that arXiv PDFs are named in the YYMM.number.pdf format, where YYMM specifies year/month, and is followed by a zero-padded sequence number. This makes searching for stored articles difficult. I would much prefer to have the article’s title or author names as part of the filename.

In this post I show how to automate this process so that as soon as an arXiv PDF is saved to a directory, it is renamed to include the title in the filename.

## Python rename script

At the heart of this automation is a Python script that accepts the file path as its argument, uses the arxiv Python API package to look up the article title, and then renames the file to title_YYMM.number.pdf. The required arxiv package can be installed via pip.

#!/usr/bin/env python3

import sys
import re
from pathlib import Path
import arxiv

def get_valid_filename(s):
    """
    Return the given string converted to a string that can be used for a clean
    filename. Remove leading and trailing spaces; convert other spaces to
    underscores; and remove anything that is not an alphanumeric, dash,
    underscore, or dot.
    >>> get_valid_filename("john's portrait in 2004.jpg")
    'johns_portrait_in_2004.jpg'
    """
    s = str(s).strip().replace(' ', '_')
    return re.sub(r'(?u)[^-\w.]', '', s)

# Path of the downloaded PDF, passed as the first argument
p = Path(sys.argv[1])

# The file stem (YYMM.number, optionally with a vN suffix) is the arXiv ID
arxiv_id = p.stem
# arxiv.query comes from the 0.x releases of the arxiv package;
# later releases changed the API
entry = arxiv.query(id_list=[arxiv_id])
title = entry[0]['title']
title_slug = get_valid_filename(title)

# Build title_YYMM.number.pdf (Path.with_stem requires Python 3.9+)
new_p = p.with_stem(title_slug + '_' + arxiv_id)
if not new_p.exists():
    p.rename(new_p)

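To make the naming step easier to check without touching the network or the filesystem, here is a minimal sketch of how the new filename is put together. The helper `build_new_path` and the sample path are hypothetical and only for illustration; `get_valid_filename` is the same cleanup function used in the script.

```python
import re
from pathlib import Path


def get_valid_filename(s):
    """Same cleanup as the rename script: spaces become underscores,
    anything that is not alphanumeric, dash, underscore, or dot is dropped."""
    s = str(s).strip().replace(' ', '_')
    return re.sub(r'(?u)[^-\w.]', '', s)


def build_new_path(pdf_path, title):
    """Return the title_YYMM.number.pdf path for a downloaded PDF (no I/O).

    Hypothetical helper, not part of the original script.
    """
    p = Path(pdf_path)
    # with_name is used instead of with_stem so this also runs on Python < 3.9.
    return p.with_name(f"{get_valid_filename(title)}_{p.stem}{p.suffix}")


# Sample input for illustration only.
new_path = build_new_path("/tmp/papers/1706.03762.pdf", "Attention Is All You Need")
print(new_path.name)
```

Keeping the name construction in a pure function like this makes it easy to try out titles with punctuation or slashes before letting the script rename real files.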

## Automatically invoking the rename script

On Linux we can use incron to automatically invoke the rename script when a new file is added to a directory. incron is similar to cron, but instead of running commands based on time, it runs commands based on filesystem events. It is typically not installed by default on Linux distros but is available in most package managers. For example, on Debian-based distros it can be installed via:

$ sudo apt-get install incron

Note that installing this package does not necessarily start the daemon, so make sure the service is enabled and running:

$ sudo systemctl enable incrond.service --now


Similar to cron, incron is driven by a table where each line has the following format:

watched_path event_type command


The table can be edited with incrontab:

$ incrontab -e

Add the following line, replacing the paths to match the directory where you store your PDFs and the location of the Python rename script:

/PATH/TO/DIR IN_CREATE [[ $# =~ [0-9]+\.[0-9]+(v[0-9]+)*\.pdf ]] && /PATH/TO/rename_arxiv $@/$#

The IN_CREATE event fires when a new file or directory is created in the watched directory. The regular expression test in the first part of the command ensures we only call the script for files that match the arXiv filename pattern. incron provides some wildcards that can be used in the command section. The above command uses the following:
• $@ expands to the watched directory path
• $# expands to the event-related filename
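To see which filenames the regular expression in the incrontab line lets through, the same pattern can be tried out in Python. This is only a sketch for experimenting: `is_arxiv_pdf` is a hypothetical helper, and `re.search` is used because, like the `=~` test, it matches anywhere in the string rather than requiring a full match.

```python
import re

# Same pattern as in the incrontab entry; re.search is unanchored.
ARXIV_PDF = re.compile(r'[0-9]+\.[0-9]+(v[0-9]+)*\.pdf')


def is_arxiv_pdf(filename):
    """Return True if the filename would pass the regex test (hypothetical helper)."""
    return ARXIV_PDF.search(filename) is not None


# Sample filenames for illustration only.
for name in ["2020.1234.pdf", "2020.1234v4.pdf", "notes.pdf", "2020.1234.pdf.part"]:
    print(name, is_arxiv_pdf(name))
```

Because the pattern is unanchored, a name such as 2020.1234.pdf.part also matches. If that matters for how your browser writes downloads, anchoring the pattern with ^ and $ is one way to tighten it.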

The rename script works on macOS as well. The equivalent of incron's functionality can be achieved via BSD’s wait_on or with apps like Hazel.
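If neither tool is at hand, a plain polling loop is a rough, portable stand-in. The sketch below is not what this post uses; it assumes the rename script is executable, and the names `find_arxiv_pdfs`, `watch`, and the 5-second interval are all hypothetical choices for illustration.

```python
import re
import subprocess
import sys
import time
from pathlib import Path

# Anchored version of the arXiv filename pattern: YYMM.number with optional vN.
ARXIV_PDF = re.compile(r'^[0-9]+\.[0-9]+(v[0-9]+)*\.pdf$')


def find_arxiv_pdfs(directory):
    """Return files in `directory` whose names still look like raw arXiv downloads."""
    return sorted(p for p in Path(directory).iterdir()
                  if p.is_file() and ARXIV_PDF.match(p.name))


def watch(directory, rename_script, interval=5.0):
    """Poll `directory` forever and run the rename script once per new match."""
    seen = set()
    while True:
        for pdf in find_arxiv_pdfs(directory):
            if pdf not in seen:
                seen.add(pdf)
                # check=False: a failed lookup should not stop the watcher.
                subprocess.run([rename_script, str(pdf)], check=False)
        time.sleep(interval)


if __name__ == "__main__":
    # Expects two arguments: the watched directory and the rename script path.
    if len(sys.argv) == 3:
        watch(sys.argv[1], sys.argv[2])
```

Polling is less efficient than event-based tools, but it needs nothing beyond the standard library. The `seen` set keeps the loop from calling the script again for a file that could not be renamed.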
• Updated the regex in the incron entry to match arXiv PDFs that have been revised, e.g. 2020.1234v4.