0
votes

I have thousands of HTML files and need to save each of them as a .txt file using Firefox.

If I did this manually, I would open each HTML file in Firefox, click the File menu, click 'Save Page As', select 'TEXT' as the format, and save the file to the local disk.

How can I automate this?

Is there a script or tool that can help?

Thanks.

2
Do you know any scripting languages? What system are you on? - jdi
Any scripting language is OK. I assume not much scripting is needed here. I'm on Windows. - Hardbone
Another option is to use a text browser such as Lynx: en.wikipedia.org/wiki/Lynx_%28web_browser%29 - Hardbone

2 Answers

2
votes

If your goal is to get Firefox to strip the HTML out of each page and save just the text, there are plenty of options. I'm not aware of any Firefox add-on that can loop over every file in a directory and run a macro on each one, so here are some alternatives:

  1. Refer to this SO question on how to use Python to strip the HTML from each file. It provides examples for both the built-in HTMLParser module and BeautifulSoup.

  2. Use Selenium to automate your web browser: http://seleniumhq.org/

  3. If you know JavaScript, you can use PhantomJS (http://www.phantomjs.org/), a headless web browser that you drive with JavaScript scripts.
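
As a rough illustration of option 1, here is a minimal sketch that uses only the standard-library HTMLParser. The names TextExtractor, html_to_text and convert_directory are made up for this example, and it assumes the files are UTF-8 encoded (undecodable bytes are replaced). The output will not match Firefox's "save as text" result exactly: it keeps each run of text on its own line and drops the contents of script and style tags.

```python
from html.parser import HTMLParser
from pathlib import Path


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped tags.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def html_to_text(html):
    """Return the text content of an HTML string, one text run per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def convert_directory(src_dir):
    """Write a .txt file next to every .html/.htm file under src_dir.

    Returns the number of files converted.
    """
    converted = 0
    # Materialize the listing first so newly written .txt files are not revisited.
    for path in sorted(Path(src_dir).rglob("*")):
        if path.is_file() and path.suffix.lower() in {".html", ".htm"}:
            html = path.read_text(encoding="utf-8", errors="replace")
            path.with_suffix(".txt").write_text(html_to_text(html), encoding="utf-8")
            converted += 1
    return converted
```

Calling convert_directory with the folder that holds the files (the path is up to you) writes page.txt next to each page.html. If the pages rely on JavaScript to produce their text, a real browser driven by Selenium or PhantomJS would be the better fit.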

1
votes

I have thousands of html files...

Do you actually have these files on-hand, or are they online?

...and need to save each of them as txt...

Any text editor can save the file contents as-is (so why use Firefox?), and I think simply renaming .htm or .html to .txt will work (at least on any Windows system). Or do you mean saving just the displayed text of the HTML file?


EDIT:

First, start with this link, which explains how to get started with SHDocVw, the library you will need for this. Once you have the reference set up, use the functions

Function GetNewIE() As SHDocVw.InternetExplorer

and

Function LoadWebPage(i_IE As SHDocVw.InternetExplorer, i_URL As String) As Boolean

from the link (just copy them into your project as described there) to load your individual HTML files, looping through each file. (Excel would be good for this, because you can put your list of files into cells and cycle through each cell to retrieve the path.) I have never done this with so many files, so unfortunately I cannot guarantee it will work.

Dim IE As SHDocVw.InternetExplorer
Dim lRow As Long ' Long in case you have a LOT of files
Dim iFNum As Integer
Dim sFilePath As String

Set IE = GetNewIE
For lRow = 1 To 5000 ' Assuming you have 5,000 HTML files, so 5,000 rows with the path to each
    sFilePath = ActiveSheet.Range("A" & lRow).Value ' This should include the full file path, i.e. "C:\dir\..."
    If LoadWebPage(IE, sFilePath) Then
        iFNum = FreeFile ' Next free file number; FreeFile only accepts 0 or 1 as an argument
        Open sFilePath & ".txt" For Output As #iFNum
        Print #iFNum, IE.Document.body.innerText ' Print writes the raw text; Write would wrap it in quotes
        Close #iFNum
    End If
Next lRow
IE.Quit ' Close the Internet Explorer instance when done