How to Extract All PDF Links in Python (With Examples)

This article walks you through extracting links and URLs from PDF files in Python using the pikepdf and PyMuPDF libraries.

Do you want to extract the URLs from a PDF file? If so, you’ve come to the right place. In this tutorial, we will use the pikepdf and PyMuPDF libraries in Python to extract all the links from a PDF file.

How do you extract all PDF links in Python? We’re going to use two methods to get links from a PDF file: the first is to extract the annotations, i.e., the clickable markup that a regular PDF reader will open in your browser, and the second is to extract all the raw text and parse the URLs with a regular expression.

First, let’s install these libraries:

pip3 install pikepdf PyMuPDF

Method 1: Use annotations to extract URLs

In this technique, we open the PDF file with the pikepdf library, iterate over the annotations on each page, and check whether any of them contains a URL:

import pikepdf # pip3 install pikepdf

file = "1810.04805.pdf"
# file = "1710.05006.pdf"
pdf_file = pikepdf.Pdf.open(file)
urls = []
# iterate over PDF pages
for page in pdf_file.pages:
    # some pages have no annotations at all, so fall back to an empty list
    for annot in page.get("/Annots") or []:
        # only link annotations carry an action dictionary ("/A") with a "/URI"
        action = annot.get("/A")
        uri = action.get("/URI") if action is not None else None
        if uri is not None:
            print("[+] URL Found:", uri)
            urls.append(uri)

print("[*] Total URLs extracted:", len(urls))

I’m testing with the PDF file above, but feel free to use any PDF file of your choice; just make sure it has some clickable links.

After running that code, I get the following output:

[+] URL Found: https://github.com/google-research/bert
[+] URL Found: https://github.com/google-research/bert
[+] URL Found: https://gluebenchmark.com/faq
[+] URL Found: https://gluebenchmark.com/leaderboard
...<SNIPPED>...
[+] URL Found: https://gluebenchmark.com/faq
[*] Total URLs extracted: 30

Awesome, we’ve managed to extract 30 URLs from that PDF paper.
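As the output shows, the same link annotation can appear on several pages, so the list contains duplicates. A small, self-contained sketch of deduplicating while preserving order — the sample `urls` list below is illustrative, standing in for the list built above:

```python
# sample data standing in for the `urls` list built above (illustrative only)
urls = [
    "https://github.com/google-research/bert",
    "https://github.com/google-research/bert",
    "https://gluebenchmark.com/faq",
]

# dict.fromkeys() keeps the first occurrence of each URL and preserves insertion order
unique_urls = list(dict.fromkeys(urls))
print(unique_urls)
# ['https://github.com/google-research/bert', 'https://gluebenchmark.com/faq']
```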

Related: How to extract all website links in Python.

Method 2: Use regular expressions to extract URLs

In this section, we’ll extract all the raw text from the PDF file and then parse URLs out of it with a regular expression. First, let’s get the text version of the PDF:

import fitz # pip install PyMuPDF
import re

# a regular expression that matches URLs (note: \n is allowed inside a match,
# since URLs in PDFs are often wrapped across lines)
url_regex = r"https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=\n]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)"
# the PDF file to extract raw text from
file = "1710.05006.pdf"
# file = "1810.04805.pdf"
# open the PDF file
with fitz.open(file) as pdf:
    text = ""
    for page in pdf:
        # extract the text of each PDF page
        text += page.get_text()

Now that we have the whole document in the text variable, let’s use the re module to parse the URLs out of it:

urls = []
# extract all urls using the regular expression
for match in re.finditer(url_regex, text):
    url = match.group()
    print("[+] URL Found:", url)
    urls.append(url)
print("[*] Total URLs extracted:", len(urls))

Output:

[+] URL Found: https://github.com/
[+] URL Found: https://github.com/tensor
[+] URL Found: http://nlp.seas.harvard.edu/2018/04/03/attention.html
[+] URL Found: https://gluebenchmark.com/faq.
[+] URL Found: https://gluebenchmark.com/leaderboard).
[+] URL Found: https://gluebenchmark.com/leaderboard
[+] URL Found: https://cloudplatform.googleblog.com/2018/06/Cloud-
[+] URL Found: https://gluebenchmark.com/
[+] URL Found: https://gluebenchmark.com/faq
[*] Total URLs extracted: 9

Conclusion

How do you extract all PDF links in Python? This time we extracted only 9 URLs, but that doesn’t mean the second method is less accurate: it simply parses URLs that appear in text form, not as clickable annotations.

However, there’s a caveat with this method: a URL might contain a newline character (\n) because of line wrapping in the PDF, which is why \n is allowed inside url_regex; you’ll want to remove those newlines from the matched URLs afterward.
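Besides the newlines, sentence punctuation often sticks to the end of a match (as in the faq. and leaderboard). results above). A minimal cleanup sketch — the helper name clean_url is my own, not part of either library:

```python
def clean_url(url: str) -> str:
    # drop newlines introduced by PDF line wrapping,
    # then strip punctuation the regex swallowed from the surrounding sentence
    return url.replace("\n", "").rstrip(".,);")

print(clean_url("https://gluebenchmark.com/faq."))
# https://gluebenchmark.com/faq
print(clean_url("https://gluebenchmark.com/leader\nboard"))
# https://gluebenchmark.com/leaderboard
```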

All in all, of the two methods above: if you want the clickable URLs, the first method is preferable. But if you want the URLs that appear in plain text form, the second one can help you do just that!
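In practice you may want both. Here is a sketch combining the two methods into one function; the function names are my own, and the URL pattern is deliberately simplified for illustration (unlike url_regex above, it does not span line breaks), so adapt it to your needs:

```python
import re

# simplified URL pattern for illustration; swap in the stricter url_regex
# from above if you need its extra validation
URL_PATTERN = re.compile(r"https?://\S+")

def urls_from_annotations(path):
    """Clickable URLs from link annotations (method 1)."""
    import pikepdf  # pip3 install pikepdf
    found = []
    with pikepdf.Pdf.open(path) as pdf:
        for page in pdf.pages:
            for annot in page.get("/Annots") or []:
                action = annot.get("/A")
                uri = action.get("/URI") if action is not None else None
                if uri is not None:
                    found.append(str(uri))
    return found

def urls_from_text(text):
    """Plain-text URLs parsed out of already-extracted text (method 2)."""
    # strip trailing sentence punctuation that the pattern swallows
    return [m.group().rstrip(".,);") for m in URL_PATTERN.finditer(text)]

def all_pdf_urls(path):
    """Union of both methods, order-preserving and deduplicated."""
    import fitz  # pip install PyMuPDF
    with fitz.open(path) as pdf:
        text = "".join(page.get_text() for page in pdf)
    return list(dict.fromkeys(urls_from_annotations(path) + urls_from_text(text)))
```

For example, all_pdf_urls("1810.04805.pdf") would return one list covering both the clickable and the text-only links.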

If you want to extract tables from PDFs instead, there is a tutorial for that too:

  • How to extract PDF tables in Python