1
votes

How can we convert a PDF to DOCX, with or without Python? I want to automate the conversion of a large number of files, so I need an API.

I have used online websites like: https://pdf2docx.com/

https://online2pdf.com/pdf2docx

https://www.zamzar.com/convert/pdf-to-docx/

I was unable to get access to their APIs directly.

5
What have you tried so far (code-wise)? - h4z3
I was unable to find any code that directly converts PDF to DOCX. - Yash Sharma
"to get any code" - uh? Have YOU tried writing anything? SO isn't for writing code for you, it's for helping with your code. - h4z3

5 Answers

1
votes

pdf2docx

  1. Install the pdf2docx package

Installation

  • Install from PyPI, or clone or download pdf2docx and install it from source

     pip install pdf2docx
         or
     # download the package, then install it into your environment
     python setup.py install
    
  • Option 1

    from pdf2docx import Converter
    
    pdf_file  = r'C:\Users\ABCD\Desktop\XYZ\Document1.pdf'  # source file
    docx_file = r'C:\Users\ABCD\Desktop\XYZ\sample.docx'    # destination file
    
    # convert pdf to docx (start=0, end=None converts every page)
    cv = Converter(pdf_file)
    cv.convert(docx_file, start=0, end=None)
    cv.close()
    
    # Output
    
    Parsing Page 53: 53/53...
    Creating Page 53: 53/53...
    --------------------------------------------------
    Terminated in 6.258919400000195s.
    
  • Option 2

    from pdf2docx import parse
    
    pdf_file  = r'C:\Users\ABCD\Desktop\XYZ\Document2.pdf'  # source file
    docx_file = r'C:\Users\ABCD\Desktop\XYZ\sample_2.docx'  # destination file
    
    # convert pdf to docx (start=0, end=None converts every page)
    parse(pdf_file, docx_file, start=0, end=None)
    
    # Output
    Parsing Page 53: 53/53...
    Creating Page 53: 53/53...
    --------------------------------------------------
    Terminated in 5.883666100000482s.
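
Since the question is about converting a large number of files, here is a minimal sketch of a batch loop built on the same `parse` call as Option 2. The function name `convert_folder`, the folder layout, and the choice to collect failures instead of stopping are my own assumptions, not part of pdf2docx.

```python
from pathlib import Path
from typing import Callable, List, Tuple


def _pdf2docx_convert(pdf_path: str, docx_path: str) -> None:
    # Imported lazily so the loop below can be tried without pdf2docx installed.
    from pdf2docx import parse
    parse(pdf_path, docx_path, start=0, end=None)


def convert_folder(
    src_dir,
    dst_dir,
    convert: Callable[[str, str], None] = _pdf2docx_convert,
) -> Tuple[List[Path], List[Tuple[Path, str]]]:
    """Convert every *.pdf in src_dir into a .docx of the same name in dst_dir.

    Returns (converted_paths, failures); each failure is (pdf_path, error_message).
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)

    converted: List[Path] = []
    failed: List[Tuple[Path, str]] = []
    for pdf in sorted(src.glob("*.pdf")):
        target = dst / (pdf.stem + ".docx")
        try:
            convert(str(pdf), str(target))
            converted.append(target)
        except Exception as exc:
            # One bad PDF should not abort the whole batch; record it and move on.
            failed.append((pdf, str(exc)))
    return converted, failed
```

Something like `done, failed = convert_folder(r'C:\Users\ABCD\Desktop\XYZ', r'C:\Users\ABCD\Desktop\XYZ\docx')` would then process the whole folder, and `failed` tells you which files to look at by hand.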
    
0
votes

You can try pdftohtml, then use Pandoc to convert HTML to docx.

Actually, PDF is not really a document format but rather a page layout format, so conversion can be problematic.
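
A rough sketch of that two-step pipeline driven from Python, assuming both `pdftohtml` (from Poppler) and `pandoc` are installed and on your PATH. The helper names are hypothetical, and I am assuming `pdftohtml -noframes` writes `<name>.html` when given `<name>` as the output base, so check that on your system. The `-i` flag drops images to keep the HTML self-contained.

```python
import subprocess
from pathlib import Path
from typing import List


def build_commands(pdf_path: str) -> List[List[str]]:
    """Return the two command lines: PDF -> HTML, then HTML -> DOCX."""
    pdf = Path(pdf_path)
    base = pdf.with_suffix("")        # report.pdf -> report (output base name)
    html = pdf.with_suffix(".html")   # file pdftohtml is assumed to produce
    docx = pdf.with_suffix(".docx")
    return [
        # -noframes: one plain HTML file; -i: ignore images
        ["pdftohtml", "-noframes", "-i", str(pdf), str(base)],
        ["pandoc", str(html), "-o", str(docx)],
    ]


def pdf_to_docx_via_html(pdf_path: str) -> Path:
    """Run both steps; raises CalledProcessError if either tool fails."""
    for cmd in build_commands(pdf_path):
        subprocess.run(cmd, check=True)
    return Path(pdf_path).with_suffix(".docx")
```

Expect the layout to suffer on anything with columns or tables, for the reason given above.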

0
votes

I'm the CTO at Zamzar and we have an API to do just this available at https://developers.zamzar.com/

We have a Test account you can use for free to try out the service, and code samples for Python in our docs which would enable you to convert a PDF file to DOCX quite simply:

import requests
from requests.auth import HTTPBasicAuth

api_key = 'YOUR_API_KEY'
endpoint = "https://sandbox.zamzar.com/v1/jobs"
source_file = "/tmp/my.pdf"
target_format = "docx"

data_content = {'target_format': target_format}

# open the source file in a context manager so the handle is closed after the upload
with open(source_file, 'rb') as f:
    res = requests.post(endpoint, data=data_content, files={'source_file': f}, auth=HTTPBasicAuth(api_key, ''))

print(res.json())

You can then poll the job to see when it has finished before downloading your converted file.
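
The polling step could look roughly like the sketch below. It is deliberately generic: `wait_for_job` is a hypothetical helper, it takes any callable that returns the job as a dict, and the status strings (`"successful"`, `"failed"`, `"cancelled"`) are assumptions you should check against the API docs.

```python
import time
from typing import Callable, Dict


def wait_for_job(
    fetch_job: Callable[[], Dict],
    interval: float = 2.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict:
    """Call fetch_job() repeatedly until the job succeeds, fails, or times out."""
    deadline = time.monotonic() + timeout
    while True:
        job = fetch_job()
        status = job.get("status")
        if status == "successful":          # assumed success marker
            return job
        if status in ("failed", "cancelled"):  # assumed terminal failure markers
            raise RuntimeError(f"conversion job ended with status {status!r}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {status!r} after {timeout} seconds")
        sleep(interval)
```

With the snippet above, `fetch_job` would be a small function that does a `requests.get` on the job's URL with the same `HTTPBasicAuth(api_key, '')` and returns `.json()`. The exact URL and the field that holds the converted file's id are not shown here, so take them from the docs.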

0
votes

Try PDF.to. It has a PDF API with cURL, PHP, Python, and Node.js examples, and good documentation.

-2
votes

Converting PDFs to documents can be problematic; going the other way is much easier.

One possible solution is to use "Save As" to save the PDF file in a desired location with a ".docx" extension. This might work if the PDF was originally saved from a docx file, and vice versa.