I recently moved from Excel VBA automation to AutoHotkey, following the tutorial at http://the-automator.com/web-scraping-intro-with-autohotkey/, but I don't fully understand the code. Could someone please point me in the right direction?

I am trying to make my F1 key scrape some data from the currently active window.

F1::

pwb := ComObjCreate("InternetExplorer.Application") ;create IE Object
pwb.visible:=true  ; Set the IE object to visible

pwb := WBGet()

;************Pointer to Open IE Window******************
WBGet(WinTitle="ahk_class IEFrame", Svr#=1) {               ;// based on ComObjQuery docs
   static msg := DllCall("RegisterWindowMessage", "str", "WM_HTML_GETOBJECT")
        , IID := "{0002DF05-0000-0000-C000-000000000046}"   ;// IID_IWebBrowserApp
;//     , IID := "{332C4427-26CB-11D0-B483-00C04FD90119}"   ;// IID_IHTMLWindow2
   SendMessage msg, 0, 0, Internet Explorer_Server%Svr#%, %WinTitle%

   if (ErrorLevel != "FAIL") {
      lResult:=ErrorLevel, VarSetCapacity(GUID,16,0)
      if DllCall("ole32\CLSIDFromString", "wstr","{332C4425-26CB-11D0-B483-00C04FD90119}", "ptr",&GUID) >= 0 {
         DllCall("oleacc\ObjectFromLresult", "ptr",lResult, "ptr",&GUID, "ptr",0, "ptr*",pdoc)
         return ComObj(9,ComObjQuery(pdoc,IID,IID),1), ObjRelease(pdoc)
      }
   }
}

I understand that this code creates a new IE application, but what if I don't want to create one and just want to use the currently active window? I have seen a few snippets that get the active browser's URL, but I can't seem to get at the active browser's page elements.

So far I have tried the following. Can someone tell me how to point it at the active page and get some of its data?

F1::

wb := WBGet()
if !instr(wb.LocationURL, "https://www.google.com/")
{
   wb := ""
   return
}
doc := wb.document
h2name    := rows[0].getElementsByTagName("h2")


FileAppend, %h2name%, Somefile.txt
Run Somefile.txt
return




WBGet(WinTitle="ahk_class IEFrame", Svr#=1) {               ;// based on ComObjQuery docs
   static msg := DllCall("RegisterWindowMessage", "str", "WM_HTML_GETOBJECT")
        , IID := "{0002DF05-0000-0000-C000-000000000046}"   ;// IID_IWebBrowserApp
;//     , IID := "{332C4427-26CB-11D0-B483-00C04FD90119}"   ;// IID_IHTMLWindow2
   SendMessage msg, 0, 0, Internet Explorer_Server%Svr#%, %WinTitle%
   if (ErrorLevel != "FAIL") {
      lResult:=ErrorLevel, VarSetCapacity(GUID,16,0)
      if DllCall("ole32\CLSIDFromString", "wstr","{332C4425-26CB-11D0-B483-00C04FD90119}", "ptr",&GUID) >= 0 {
         DllCall("oleacc\ObjectFromLresult", "ptr",lResult, "ptr",&GUID, "ptr",0, "ptr*",pdoc)
         return ComObj(9,ComObjQuery(pdoc,IID,IID),1), ObjRelease(pdoc)
      }
   }
}

I tried to test whether the variable would be written to Somefile.txt, since I am not sure how to test it with MsgBox. It kept writing the whole script instead of showing the result.

You can use the command UrlDownloadToFile, http://www.example.com, sourcecode.html to save the page's whole source to your PC. You could then parse out the text outside the tags <> (excluding the text between <style></style> and <script></script>) to get the page's innerText. Le____

1 Answer

To work on the active window's active tab (if it's an Internet Explorer window):

q::
WinGet, hWnd, ID, A
WinGetClass, vWinClass, ahk_id %hWnd%
if !(vWinClass = "IEFrame")
    Return
wb := WBGet("ahk_id " hWnd)
MsgBox % wb.document.activeElement.tagName "`r`n" wb.document.activeElement.innerText
wb := ""
Return

To work on the first found Internet Explorer window's active tab:

w::
WinGet, hWnd, ID, ahk_class IEFrame
wb := WBGet()
;wb := WBGet("ahk_class IEFrame") ;this line is equivalent to the one above
MsgBox % wb.document.activeElement.tagName "`r`n" wb.document.activeElement.innerText
wb := ""
Return

Regarding h2name, I don't believe this line will do anything, because 'rows' is not defined anywhere in the script:

h2name    := rows[0].getElementsByTagName("h2")

The following might work:

h2name := ""
try h2name := wb.document.getElementsByTagName("h2").item[0].name
MsgBox % h2name

MsgBox % wb.document.getElementsByTagName("h2").item[0].tagName
MsgBox % wb.document.getElementsByTagName("h2").item[0].innerText
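
If the goal is to write the headings to a file, as in your script, something along these lines might work. It is only a sketch: it assumes the WBGet() function from your question is defined in the same script, and it reuses the F1 hotkey and Somefile.txt from your code. The variable names h2s and out are just for illustration.

```autohotkey
; Sketch only: assumes WBGet() from the question is defined elsewhere in this script.
F1::
wb := WBGet()                       ; pointer to the first found IE window's active tab
if !IsObject(wb)                    ; WBGet() returns nothing if no IE window was found
    return
h2s := wb.document.getElementsByTagName("h2")
out := ""
Loop % h2s.length                   ; the collection is 0-based, A_Index starts at 1
    out .= h2s.item[A_Index - 1].innerText "`r`n"
FileDelete, Somefile.txt            ; start from an empty file on each run
FileAppend, %out%, Somefile.txt
Run, Somefile.txt
wb := ""
return
```

The point is that getElementsByTagName returns a collection object, not text, so the text has to be read from each item (for example via innerText) before it is passed to FileAppend or MsgBox.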

In your link I think by 'name' they are referring to LocationName (the tab's title):

MsgBox % wb.LocationName
MsgBox % wb.document.title ;more reliable

For the entire page's innerText:

MsgBox % wb.document.documentElement.innerText

HTH