Extract Images from PDF in Python

Images are commonly used in PDF documents alongside text to make the content more appealing and easier to understand. While processing and analyzing PDF documents, you may also need to extract those images. This article demonstrates how to process PDF files and extract images programmatically in Python, with a step-by-step guide and a code sample that cover the whole extraction process.

Python Library to Extract Images from PDF

To extract images from a PDF file, we will use Aspose.Words for Python. It is a powerful and feature-rich library to create and manipulate text documents including PDF and DOCX. You can install the library from PyPI using the following pip command.

> pip install aspose-words

Steps to Extract Images from PDF

Aspose.Words for Python lets you extract the images from a PDF file in a few simple steps. The workflow is as follows.

  • Load the PDF file from the desired location.
  • Convert PDF to DOCX format.
  • Process the DOCX version of the PDF and extract the images.
  • Save each image as a file to the desired location.

The following section demonstrates how to transform the above-mentioned steps into Python code and extract images from a PDF.

Extract Images from PDF in Python

To extract the images, we first convert the PDF file to DOCX format. In a DOCX file, images are represented by shape nodes, so we process each shape and extract the image from it.

The following are the steps to extract images from a PDF in Python.

  • First, load the PDF file using the Document class.
  • Then, save the PDF in DOCX format and load the DOCX version of the PDF file.
  • Retrieve all the shapes into an object using the Document.get_child_nodes(NodeType.SHAPE, True) method.
  • Loop through the shapes and perform the following operations for each shape node:
    • Cast the node to the Shape type using the as_shape() method.
    • Use the Shape.has_image property to check whether the shape contains an image.
    • Extract the image from the shape and save it using the Shape.image_data.save(string) method.

The following code sample demonstrates image extraction from a PDF document in Python.
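A minimal sketch of these steps might look like the following. The function names extract_images and image_file_name, the intermediate file converted.docx, and the image_N naming scheme are placeholders chosen for illustration rather than part of the library. The sketch also assumes that FileFormatUtil.image_type_to_extension returns the extension with a leading dot.

```python
from pathlib import Path


def image_file_name(index, extension):
    """Build an output name such as 'image_0.png' (illustrative naming scheme)."""
    return f"image_{index}{extension}"


def extract_images(pdf_path, out_dir):
    """Convert a PDF to DOCX, then save every image found in its shape nodes."""
    # Imported here so the helper above stays usable without the library installed.
    import aspose.words as aw

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Load the PDF file and save it in DOCX format.
    docx_path = out / "converted.docx"
    aw.Document(pdf_path).save(str(docx_path))

    # Load the DOCX version of the PDF.
    doc = aw.Document(str(docx_path))

    # Retrieve all shape nodes, including nested ones.
    shapes = doc.get_child_nodes(aw.NodeType.SHAPE, True)

    saved = []
    for node in shapes:
        shape = node.as_shape()
        if shape.has_image:
            # Derive the file extension from the image type (assumed to include the dot).
            ext = aw.FileFormatUtil.image_type_to_extension(shape.image_data.image_type)
            target = out / image_file_name(len(saved), ext)
            shape.image_data.save(str(target))
            saved.append(str(target))
    return saved
```

Calling extract_images("input.pdf", "extracted_images") with a PDF of your own should write one file per image into the output folder and return the list of saved paths.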

Python PDF Image Extraction Library - Get a Free License

You can get a free temporary license to extract images from PDF without evaluation limitations.
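Once you have a license file, it is typically applied before any document is loaded. The snippet below is a sketch under assumptions: the function name apply_license and the license path are hypothetical, and the up-front file check is an addition for clearer error reporting, not something the library requires.

```python
from pathlib import Path


def apply_license(license_path):
    """Apply an Aspose.Words license file before processing documents."""
    path = Path(license_path)
    # Fail early with a clear message if the file is missing (illustrative check).
    if not path.is_file():
        raise FileNotFoundError(f"License file not found: {path}")

    # Imported here so the path check works even without the library installed.
    import aspose.words as aw

    # Create a License object and point it at the license file.
    aw.License().set_license(str(path))
```

Call apply_license once at program start, before the extraction code runs.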

Conclusion

When analyzing PDF documents, you often need to extract the images along with the text. In this article, you have learned how to extract images from a PDF in Python. You can simply install Aspose.Words for Python and integrate image extraction into your applications.

Explore Aspose’s PDF Image Extraction Library

Aspose.Words for Python offers a range of other features for manipulating text documents. You can visit the documentation to learn more about the library. If you have any questions, feel free to let us know via our forum.
