How to Extract the Coordinates of Text in a PDF

PDFs are an interesting format because they often contain “selectable” text, so the format must somehow distinguish text from the rest of the file. While trying to identify and locate each room in a floor plan, I decided to convert my SVG files, which also distinguish text, to PDFs, which have better support from Python packages. Below is the solution I arrived at by compiling various internet sources and reading quite a lot of documentation. It requires the packages pdfminer, PyPDF2, and pandas. pandas is optional: it is only used to convert the extracted data into a DataFrame (the rest of my application works with DataFrames to manage large amounts of data, so it was a matter of convenience).

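import os
import re

import pandas as pd
import pdfminer
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage, PDFTextExtractionNotAllowed
from pdfminer.pdfparser import PDFParser
from PyPDF2 import PdfFileReader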


def get_text_and_coordinates(pdf_path):
    # Derive the room prefix from the level in pdf_path: take the first
    # character after the last '-' in the file name and subtract one
    room_prefix = int(pdf_path.split(os.sep)[-1].split('-')[-1][:1]) - 1

    # Open a PDF file.
    fp = open(pdf_path, 'rb')

    # Create a PDF parser object associated with the file object.
    parser = PDFParser(fp)

    # Create a PDF document object that stores the document structure.
    # A password, if the file needs one, can be passed as the 2nd parameter
    document = PDFDocument(parser)

    # Check if the document allows text extraction. If not, abort.
    if not document.is_extractable:
        raise PDFTextExtractionNotAllowed

    # Create a PDF resource manager object that stores shared resources.
    resource_manager = PDFResourceManager()

    # BEGIN LAYOUT ANALYSIS
    # Set parameters for analysis.
    la_params = LAParams()

    # Create a PDF page aggregator object.
    device = PDFPageAggregator(resource_manager, laparams=la_params)

    # Create a PDF interpreter object.
    interpreter = PDFPageInterpreter(resource_manager, device)

    def parse_obj(lt_objects):

        # (x0, y0) = Bottom left corner, (x1, y1) = Top right corner
        df_dictionary = {
            'x0': [],
            'y0': [],
            'x1': [],
            'y1': [],
            'width': [],
            'height': [],
            'text': []
        }

        # loop over the object list
        for obj in lt_objects:

            # if it's a horizontal text box, record its text and location
            if isinstance(obj, pdfminer.layout.LTTextBoxHorizontal):
                # Use some basic filtering: keep only the digits here; a dot is
                # inserted and problematic (e.g. combined) rooms are skipped below
                text = re.sub('[^0-9]', '', obj.get_text())

                if not text.startswith(str(room_prefix)):
                    continue  # Ignore noise that gives room numbers that cannot possibly belong to the floor

                text_len = len(text)

                if text_len > 0:
                    bbox = obj.bbox
                    width = bbox[2] - bbox[0]
                    height = bbox[3] - bbox[1]

                    if text_len == 5:
                        text = text[:3] + '.' + text[3:]
                    elif text_len > 5 or text_len < 3:
                        continue  # Currently just ignoring those few rooms which are problematic

                    df_dictionary['x0'].append(bbox[0])
                    df_dictionary['y0'].append(bbox[1])
                    df_dictionary['x1'].append(bbox[2])
                    df_dictionary['y1'].append(bbox[3])
                    df_dictionary['width'].append(width)
                    df_dictionary['height'].append(height)
                    df_dictionary['text'].append(text)

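            # if it's a container, recurse and merge the nested results
            # (the recursive call builds its own dictionary, so its rows
            # would otherwise be lost)
            elif isinstance(obj, pdfminer.layout.LTFigure):
                nested = parse_obj(obj._objs)
                for key in df_dictionary:
                    df_dictionary[key].extend(nested[key].tolist())

        return pd.DataFrame.from_dict(df_dictionary)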

    # loop over the pages in the document; the return below exits after the first page
    for page in PDFPage.create_pages(document):
        # read the page into a layout object
        interpreter.process_page(page)
        layout = device.get_result()
        # extract text from this object
        df = parse_obj(layout._objs)
        return df


def get_media_box(pdf_path):
    # PdfFileReader, getPage and mediaBox are the pre-3.0 PyPDF2 names
    with open(pdf_path, 'rb') as fp:
        return PdfFileReader(fp).getPage(0).mediaBox
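
The digit filtering inside parse_obj can also be tried without a PDF. The following is a minimal sketch that repeats the same rules in a standalone helper. The name normalize_room_number and the sample strings are made up for illustration and are not part of the code above.

```python
import re


def normalize_room_number(raw_text, room_prefix):
    """Return a cleaned room number, or None if the text should be skipped.

    Hypothetical helper that mirrors the filtering rules used in parse_obj.
    """
    # Keep digits only
    digits = re.sub('[^0-9]', '', raw_text)

    # Skip text that cannot belong to this floor (this also skips empty strings)
    if not digits.startswith(str(room_prefix)):
        return None

    if len(digits) == 5:
        # Five digits: insert a dot after the third digit
        return digits[:3] + '.' + digits[3:]
    if len(digits) < 3 or len(digits) > 5:
        # Too short or too long to be handled, so skip it
        return None
    return digits


print(normalize_room_number('Room 2105\n', 2))  # 2105
print(normalize_room_number('21503', 2))        # 215.03
print(normalize_room_number('3105', 2))         # None
```

Keeping the rules in one small function like this makes it easier to adjust them for a different room-numbering scheme.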

The comments in the code explain what each field represents: (x0, y0) are the coordinates of the bottom-left corner of the text box, and (x1, y1) are the coordinates of its top-right corner. I have found that this method works quite well, but unfortunately it is only a step toward the final solution I want. I hope somebody else finds this useful and saves themselves all the trouble I went through!
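
Because the extracted coordinates are measured from the bottom-left corner of the page, they usually need to be flipped before they can be drawn onto an image, where the origin is at the top left. Here is a rough sketch of that conversion. It assumes the page height is the height of the media box; the function names and the sample numbers (a 612 x 792 point page) are hypothetical.

```python
def to_top_left_origin(x0, y0, x1, y1, page_height):
    """Convert a bottom-left-origin box to (left, top, width, height)
    measured from the top-left corner of the page."""
    left = x0
    top = page_height - y1  # the top edge of the box is y1 in PDF coordinates
    return left, top, x1 - x0, y1 - y0


def bbox_center(x0, y0, x1, y1):
    """Return the center point of a box, in the same coordinate system."""
    return (x0 + x1) / 2, (y0 + y1) / 2


# Hypothetical text box on a page that is 792 points tall
print(to_top_left_origin(100, 700, 150, 712, page_height=792))  # (100, 80, 50, 12)
print(bbox_center(100, 700, 150, 712))                          # (125.0, 706.0)
```

The center point is often the handier value when the goal is to place a single marker on each room rather than draw the whole text box.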