Suppose we have a safety report with performance data stored on the second page in a table format.
We could highlight the relevant data in the PDF and copy and paste it into Excel, but the result is unlikely to be as clean as we hoped.
I wouldn’t call this a rousing success.
We could spend time rearranging and fixing the pasted data, but we don’t have that kind of time, and doing so would defeat the second requirement: an easily updatable Excel report.
Using Power Query to Import PDF Tables
Power Query has a built-in connector for extracting information from PDF files. To use the connector, perform the following steps (the example uses Excel 365):
- Start Excel to create a blank workbook.
- Select Data (tab) -> Get & Transform Data (group) -> Get Data -> From File -> From PDF.
- Browse to the folder that contains the PDF, select the PDF, and click Import.
The Navigator window displays a list of every “proper” table in the PDF as well as every page.
If the needed data is in a table, you can select either the table or the page that holds the table. In most cases, it is best to select the table, since that avoids having to strip unwanted content from the page later.
If you are unsure about which listed table contains the needed information, you can single-click any item in the left-hand list to display a preview of the item’s contents.
- Select the table or page and click Transform Data.
The data will be brought into the Power Query Editor where it can be cleaned and/or modified to fit your output needs.
- Select Close & Load to send the extracted data to an Excel Table.
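For reference, the steps above produce a query that can be viewed in the Power Query Editor under Home -> Advanced Editor. The following is a minimal sketch of what that query typically looks like, not the exact output for this report: the file path and the Id "Table001" are hypothetical placeholders, and the step names may differ from the ones Excel generates.

```powerquery
let
    // Hypothetical path; replace with the location of your own PDF
    Source = Pdf.Tables(File.Contents("C:\Reports\SafetyReport.pdf"), [Implementation = "1.3"]),
    // "Table001" is an assumed Id; use whichever Id the Navigator listed for your table
    PerformanceTable = Source{[Id = "Table001"]}[Data],
    // Treat the first row of the extracted table as column headers
    PromotedHeaders = Table.PromoteHeaders(PerformanceTable, [PromoteAllScalars = true])
in
    PromotedHeaders
```

The Source step returns one row per item shown in the Navigator, with a Kind column that distinguishes tables from pages. Selecting a page instead of a table changes only the Id used in the second step; page entries typically carry Ids such as "Page002".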
The new table of PDF information can be used to drive other Excel objects, like charts and PivotTables.
If the PDF is later updated with additional years of statistics, simply right-click the extracted table and select Refresh to pull in the updated information.