bioRxiv hosts an S3 bucket containing all deposited preprint articles. This bucket is “requester pays”, which unfortunately means you’ll be paying $0.09 per GB downloaded. Since the entire bucket is ~6TB, downloading it all directly would cost $500+. To avoid this, we will first download the files to a machine hosted in EC2 (no egress charges within the same region), filter the articles down to plain text there, and only then download the result. By keeping only the plain-text data we massively reduce the amount we have to transfer out, making the download affordable.
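As a quick sanity check on those numbers (both the ~6TB bucket size and the $0.09/GB rate are rough figures; check current AWS pricing for your region):

```shell
# back-of-the-envelope cost estimate; size and rate are assumptions
SIZE_GB=6000
RATE_PER_GB=0.09
awk -v s="$SIZE_GB" -v r="$RATE_PER_GB" \
  'BEGIN { printf "full-bucket download: $%.2f\n", s * r }'
# prints: full-bucket download: $540.00
```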
To download the data I used an i7ie.3xlarge ($1.56/hour), which offers sufficient network and storage bandwidth to download at 2.5GB/s and comes with 7.5TB of disk space.
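One setup detail worth noting: on these storage-optimized instances the NVMe instance store typically arrives unformatted, so it needs a filesystem and a mount point before the download starts. A minimal sketch, assuming the device shows up as /dev/nvme1n1 (verify with lsblk first; the device name and mount path here are assumptions):

```shell
lsblk                          # identify the 7.5TB instance-store device
sudo mkfs.ext4 /dev/nvme1n1    # assumed device name; double-check before formatting
sudo mkdir -p /mnt/data
sudo mount /dev/nvme1n1 /mnt/data
sudo chown "$USER" /mnt/data
cd /mnt/data                   # download everything onto the instance store
```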
To efficiently download the thousands of compressed article files in parallel I used s5cmd:
s5cmd --request-payer requester cp --sp 's3://biorxiv-src-monthly/Current_Content/*' .
Now we want to extract plain text from the thousands of compressed (.meca) files, each corresponding to a different article. Thankfully the bioRxiv team has already preprocessed the PDFs into XML files containing the plain text! All we need to do is extract the XML file from each archive and delete the remaining contents.
First we can create a function to process each MECA archive file:
process_meca() {
  meca="$1"
  id=$(basename "$meca" .meca)            # article ID taken from the archive name
  mkdir -p "xml/$id" 2>/dev/null
  # MECA archives are zips; extract only the XML, flattening paths (-j)
  unzip -qq -j "$meca" "content/*.xml" -d "xml/$id" < /dev/null 2>/dev/null
}
export -f process_meca
And then we can break up the processing to operate on one subdirectory at a time, to avoid hitting system limits (e.g. too many open file descriptors or an overlong argument list):
for dir in */; do
  find "$dir" -name "*.meca" -print0 | parallel -0 --progress --jobs "$(($(nproc) * 2))" process_meca
done
Finally, let’s flatten our directory structure so all XML files sit in one top-level directory, prefixing each filename with its article ID to avoid collisions:
for d in xml/*/; do
  prefix=$(basename "$d")
  for f in "$d"*.xml; do
    if [ -f "$f" ]; then
      mv "$f" "xml/$prefix-$(basename "$f")"
    fi
  done
  rmdir "$d"
done
To reduce the download cost we can compress the directory of XMLs, ready to pull down to our local machine with scp:
tar cf - xml/ | zstd -T0 > biorxiv_text.tar.zst
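On the local machine, after fetching the archive with scp, the inverse pipeline restores the tree. A quick round trip on a toy directory (filenames here are illustrative, not the real dump):

```shell
# build a tiny stand-in for the real archive
mkdir -p xml
echo '<article/>' > xml/demo.xml
tar cf - xml/ | zstd -q -T0 > biorxiv_text.tar.zst

# on the local machine: decompress and unpack
mkdir -p restored
zstd -q -d -c biorxiv_text.tar.zst | tar xf - -C restored
ls restored/xml/   # demo.xml
```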
This final file, biorxiv_text.tar.zst, is only 7GB, which costs just $0.63 to download. Conveniently, this entire process can be repeated for medRxiv - just change the bucket name: “s3://biorxiv-src-monthly” -> “s3://medrxiv-src-monthly”.
If you don’t want to go through all of this yourself, I’ve uploaded the plain-text content of both bioRxiv and medRxiv up to Dec 2024.