
Cross posted here: https://www.mrexcel.com/board/threads/storefront-web-scraping.1120494/#post-5403849

Hello everyone. I'm having trouble creating a web scraper in VBA that can handle the task below. Basically, I need to scrape my web store https://www.ebay.com/str/customwheelandperformancedepot?_pgn=1 into Excel.

I need to go through all available pages (found at the bottom) and open each listing. Once a listing has been opened, we need to determine whether it's a wheel or a wheel & tire package. To do this, we can look in the "Item Specifics" table: if any of the item specifics contains the phrases "tire", "section width", or "aspect ratio", it's a wheel and tire package.

Example of a wheel: https://www.ebay.com/itm/Set-of-4-16x8-Mo970-Black-Machine-8x165-1-Wheels-Rims-SILVERADO-2500/283545274424?epid=1540162229&hash=item42049d8838:g:dZgAAOSw5wVdJ2~0

Example of a wheel & tire package: https://www.ebay.com/itm/HELO-HE878-17x9-Wheels-Rims-33-FUEL-AT-Tires-Package-5x5-Jeep-Wrangler-JK-JL/372571036378?hash=item56bef6dada:g:AhkAAOSw2~NcQO35

For wheel & tire packages, I only need:

1. Title [#itemTitle]
2. Price [#mm-saleOrgPrc], or [#prcIsum] if unavailable
3. eBay Item Number [#descItemNumber]
4. HTML inner of the description [#ds_div]

For wheel-only listings, I need:

1. Title [#itemTitle]
2. Price [#mm-saleOrgPrc], or [#prcIsum] if unavailable
3. eBay Item Number [#descItemNumber]
4. Item specifics table [.section > table:nth-child(2) > tbody:nth-child(1)]
5. HTML inner of the description [#container]

**Note that the item specifics table may not be in order and may be missing some of the values (such as Bolt Pattern 2). The header labels are in columns 1 and 3 (Condition, Backspacing, Offset, etc.) and the actual values to put into the Excel sheet are in columns 2 and 4 (New, 4.5, 0, etc.).

Here is the result I'm going for, with 3 examples of wheel-only listings followed by 3 examples of wheel and tire packages.

Excel Scrape, columns don't have to be in this order


This is what I have so far. As I do not know how to access multiple pages, I attempted to scrape just the title from one listing, and it seems I am struggling with even that.

Option Explicit

Const sSiteName = "https://www.ebay.com/itm/1-New-20x8-5-Kmc-District-ET-35-Bronze-5x114-3-5X4-5-Wheel-Rim/372780750649?epid=24031177590&hash=item56cb76d739:g:yDYAAOSwE91diN8Q"

Private Sub GetHTMLContents()
    ' Create Internet Explorer object.
    Dim IE As Object
    Set IE = CreateObject("InternetExplorer.Application")
    IE.Visible = False ' Keep this hidden.

    IE.Navigate sSiteName

    ' Wait till IE is fully loaded.
    While IE.ReadyState <> 4
        DoEvents
    Wend

    ' Create document object.
    Dim oHDoc As HTMLDocument
    Set oHDoc = IE.Document

    ' Get the element reference using its ID.
    Dim oHEle As HTMLDivElement
    Set oHEle = oHDoc.getElementById(".vi-swc-lsp")

    ' Loop through <h1> elements inside the element, which have the text we want.
    Dim iCnt As Integer
    With oHEle
        For iCnt = 0 To .getElementsByTagName("h1").Length - 1
            Debug.Print .getElementsByTagName("h1").Item(iCnt).getElementsByTagName("a").Item(0).innerHTML
        Next iCnt
    End With

    ' Clean up.
    IE.Quit
    Set IE = Nothing
    Set oHEle = Nothing
    Set oHDoc = Nothing

End Sub

I get "Object variable or With block variable not set" on the .getElementsByTagName line.

I was using this article as a reference: https://www.encodedna.com/excel/extract-contents-from-html-element-of-a-webpage-in-excel-using-vba.htm

I'm confused that you lost your data. But let's see. In the other forum you wrote about 70K listings, but in your shop I can only see 25,967 listings. Step by step, though. I advise you not to use the shop pages; use the search pages instead. In the store you can only grab 48 offer links per page load. In the search list you can get up to 200 per page via a URL parameter. That's 541 page loads versus 130. After that you must actually access each offer. Do you have any code? What have you tried so far? You need the HTML code for the description, not only the text? That is not trivial. This is a bunch of work, I think. – Zwenn
I show "1-48 of 73,320 Results" when I am in the store using category "all". – BuhBlake
And yes, this is true; I didn't think of using the search function in "my listings", but wouldn't that require authorizing a login? I've never scraped anything before, but what I have so far I will edit into the OP. – BuhBlake
Oh right, that's because of geoblocking. I have experience with eBay in this respect and I did not take into account that you are not in Germany like me. So it's 367 page loads instead of 1,528. – Zwenn
I also have experience with web scraping from eBay. Here is a link to a project in another forum that scrapes data from the first search page. The problem is, it's all in German. You can download an Excel file at the end of the first posting: herber.de/forum/cgi-bin/callthread.pl?index=1678466 Actually, I don't have time for your project, but since I can use it for my work, I will think about it without going into details. But that's not made in half an hour (for me). – Zwenn

1 Answer


I realise you asked this two weeks ago, but maybe you're still looking for an answer.

At the time of writing, the code below is working for me. I say "at the time of writing" because I get the impression that some of the ids in the HTML received from the server change periodically -- which breaks the code.

This is what I've currently got:

Output sheet

The code is a bit of a mess; feel free to refactor. The entry point is ScrapeAllItemsFromEbayShop. You'll need references to "Microsoft WinHTTP Services, version 5.1" and "Microsoft HTML Object Library" (Tools → References) for the WinHttp.WinHttpRequest and MSHTML types to compile.

Option Explicit

Private Function GetUrlForShopPageN(ByVal N As Long) As String
    ' Should return the store URL for page N,
    ' where N is some 1-based page index present in the query string.
    GetUrlForShopPageN = "https://www.ebay.com/str/customwheelandperformancedepot?_pgn=" & N
End Function

Private Function GetHtmlForShopPageN(ByVal webClient As WinHttp.WinHttpRequest, ByVal N As Long) As MSHTML.HTMLDocument
    ' Should return an HTML document representing the server's response for page N,
    ' where N is some 1-based page index present in the query string.

    Dim targetUrl As String
    targetUrl = GetUrlForShopPageN(N)

    With webClient
        .Open "GET", targetUrl, False
        .setRequestHeader "user-agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36"
        .send
        Set GetHtmlForShopPageN = New MSHTML.HTMLDocument
        GetHtmlForShopPageN.body.innerHTML = .responseText
    End With
End Function

Private Function DoesShopPageNotContainResults(ByVal htmlResponse As MSHTML.HTMLDocument) As Boolean
    ' Should return a boolean representing whether the htmlResponse contains zero results.
    DoesShopPageNotContainResults = (htmlResponse.getElementsByClassName("srp-controls").Length = 0)
End Function

Private Function GetUrlsOfItemsToScrape() As Collection
    ' Should return a collection of strings representing the URLs of items.
    Set GetUrlsOfItemsToScrape = New Collection

    Dim webClient As WinHttp.WinHttpRequest
    Set webClient = New WinHttp.WinHttpRequest

    Do While True
        Dim pageIndex As Long
        pageIndex = pageIndex + 1

        Dim htmlResponse As MSHTML.HTMLDocument
        Set htmlResponse = GetHtmlForShopPageN(webClient, pageIndex)

        If DoesShopPageNotContainResults(htmlResponse) Then Exit Do

        Dim anchor As MSHTML.IHTMLElement
        For Each anchor In htmlResponse.getElementsByClassName("s-item__link")
            Debug.Assert StrComp(LCase$(Left$(anchor.getAttribute("href"), 25)), "https://www.ebay.com/itm/", vbBinaryCompare) = 0
            GetUrlsOfItemsToScrape.Add anchor.getAttribute("href")
            If GetUrlsOfItemsToScrape.Count > 10 Then Exit Do ' Delete this line completely once you think everything is working.
        Next anchor

        If (0 = (pageIndex Mod 10)) Then DoEvents
    Loop
End Function

Private Function DoesTextContainAnyOf(ByVal textToCheck As String, stringsToCheck As Variant) As Boolean
    ' Should return a boolean representing whether any of "stringsToCheck"
    ' can be found within "textToCheck". Performs a case-sensitive search.
    Dim i As Long
    For i = LBound(stringsToCheck) To UBound(stringsToCheck)
        If InStr(1, textToCheck, stringsToCheck(i), vbBinaryCompare) Then
            DoesTextContainAnyOf = True
            Exit For
        End If
    Next i
End Function

Private Function IsItemAWheelOnly(ByVal htmlResponse As MSHTML.HTMLDocument) As Boolean
    ' Should return True if, based on the HTML, the item is inferred to be a "wheel".
    Dim itemSpecifics As MSHTML.IHTMLTableSection
    Set itemSpecifics = htmlResponse.querySelector(".itemAttr tbody")
    Debug.Assert Not (itemSpecifics Is Nothing)

    Dim tireAndPackageIdentifiers As Variant
    tireAndPackageIdentifiers = Array("tire", "section width", "aspect ratio")

    Dim tableRow As MSHTML.IHTMLTableRow
    For Each tableRow In itemSpecifics.Rows
        Debug.Assert 0 = (tableRow.Cells.Length Mod 2)
        Dim columnIndex As Long
        For columnIndex = 0 To (tableRow.Cells.Length - 1) Step 2
            Debug.Assert InStr(1, tableRow.Cells(columnIndex).className, "attrLabels", vbBinaryCompare)
            If DoesTextContainAnyOf(LCase$(tableRow.Cells(columnIndex).innerText), tireAndPackageIdentifiers) Then Exit Function
        Next columnIndex
    Next tableRow

    IsItemAWheelOnly = True
End Function

Private Function GetHtmlForItem(ByVal webClient As WinHttp.WinHttpRequest, ByVal urlForItem As String) As MSHTML.HTMLDocument
    ' Should return an HTML document representing the server's response for a given item.
    With webClient
        .Open "GET", urlForItem, False
        .setRequestHeader "user-agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36"
        .send
        Set GetHtmlForItem = New MSHTML.HTMLDocument
        GetHtmlForItem.body.innerHTML = .responseText
    End With
End Function

Private Sub ScrapeAllItemsFromEbayShop()

    Dim webClient As WinHttp.WinHttpRequest
    Set webClient = New WinHttp.WinHttpRequest

    Dim urlsOfItemsToScrape As Collection
    Set urlsOfItemsToScrape = GetUrlsOfItemsToScrape()

    Dim rowWriteIndex As Long
    rowWriteIndex = 1 ' Skip row 1/headers

    Dim destinationSheet As Worksheet
    Set destinationSheet = ThisWorkbook.Worksheets("Sheet1")

    destinationSheet.Cells.ClearContents

    Dim columnIndexes As Collection
    Set columnIndexes = New Collection

    Dim urlOfItem As Variant
    For Each urlOfItem In urlsOfItemsToScrape
        Debug.Print urlOfItem

        Dim htmlOfItemPage As MSHTML.HTMLDocument
        Set htmlOfItemPage = GetHtmlForItem(webClient, urlOfItem)

        Dim nameValuePairs As Collection
        If IsItemAWheelOnly(htmlOfItemPage) Then
            Set nameValuePairs = CreateNameValuePairsForWheelOnly(htmlOfItemPage)
        Else
            Set nameValuePairs = CreateNameValuePairsForWheelAndTirePackage(htmlOfItemPage)
        End If

        rowWriteIndex = rowWriteIndex + 1

        Dim nameValuePair As Variant
        For Each nameValuePair In nameValuePairs
            Dim columnWriteIndex As Long
            columnWriteIndex = GetColumnIndexOfHeader(columnIndexes, nameValuePair(0))

            If columnWriteIndex = 0 Then
                columnWriteIndex = columnIndexes.Count + 1
                columnIndexes.Add columnWriteIndex, Key:=nameValuePair(0)
                destinationSheet.Cells(1, columnWriteIndex).Value = nameValuePair(0)
            End If
            destinationSheet.Cells(rowWriteIndex, columnWriteIndex).Value = nameValuePair(1)
        Next nameValuePair
        DoEvents
    Next urlOfItem
End Sub

Private Function CreateNameValuePairsForWheelAndTirePackage(ByVal htmlOfItemPage As MSHTML.HTMLDocument) As Collection
    ' Should return a collection of 2-element arrays (where each 2-element array
    ' represents a name-value pair).
    Dim outputCollection As Collection
    Set outputCollection = New Collection

    Dim targetElement As MSHTML.IHTMLElement

    Set targetElement = htmlOfItemPage.getElementById("itemTitle")
    Debug.Assert Not (targetElement Is Nothing)
    outputCollection.Add CreateNameValuePair("Title", targetElement.innerText)

    Set targetElement = htmlOfItemPage.getElementById("mm-saleOrgPrc")
    If targetElement Is Nothing Then
        Set targetElement = htmlOfItemPage.getElementById("prcIsum")
        Debug.Assert Not (targetElement Is Nothing)
    End If
    outputCollection.Add CreateNameValuePair("Price", targetElement.innerText)

    Set targetElement = htmlOfItemPage.getElementById("descItemNumber")
    Debug.Assert Not (targetElement Is Nothing)
    outputCollection.Add CreateNameValuePair("eBay Item Number", targetElement.innerText)

    Set targetElement = htmlOfItemPage.getElementById("desc_div")
    Debug.Assert Not (targetElement Is Nothing)
    outputCollection.Add CreateNameValuePair("Description HTML", targetElement.innerHTML)

    Set CreateNameValuePairsForWheelAndTirePackage = outputCollection
End Function

Private Function CreateNameValuePairsForWheelOnly(ByVal htmlOfItemPage As MSHTML.HTMLDocument) As Collection
    ' Should return a collection of 2-element arrays (where each 2-element array
    ' represents a name-value pair).
    Dim outputCollection As Collection
    Set outputCollection = New Collection

    Dim targetElement As MSHTML.IHTMLElement

    Set targetElement = htmlOfItemPage.getElementById("itemTitle")
    Debug.Assert Not (targetElement Is Nothing)
    outputCollection.Add CreateNameValuePair("Title", targetElement.innerText)

    Set targetElement = htmlOfItemPage.getElementById("mm-saleOrgPrc")
    If targetElement Is Nothing Then
        Set targetElement = htmlOfItemPage.getElementById("prcIsum")
        Debug.Assert Not (targetElement Is Nothing)
    End If
    outputCollection.Add CreateNameValuePair("Price", targetElement.innerText)

    Set targetElement = htmlOfItemPage.getElementById("descItemNumber")
    Debug.Assert Not (targetElement Is Nothing)
    outputCollection.Add CreateNameValuePair("eBay Item Number", targetElement.innerText)

    Set targetElement = htmlOfItemPage.getElementById("desc_wrapper_ctr")
    Debug.Assert Not (targetElement Is Nothing)
    outputCollection.Add CreateNameValuePair("Description HTML", targetElement.innerHTML)

    Dim itemSpecifics As MSHTML.IHTMLTableSection
    Set itemSpecifics = htmlOfItemPage.querySelector(".itemAttr tbody")
    Debug.Assert Not (itemSpecifics Is Nothing)

    Dim tableRow As MSHTML.IHTMLTableRow
    For Each tableRow In itemSpecifics.Rows
        Debug.Assert 0 = (tableRow.Cells.Length Mod 2)
        Dim columnIndex As Long
        For columnIndex = 0 To (tableRow.Cells.Length - 1) Step 2
            Debug.Assert InStr(1, tableRow.Cells(columnIndex).className, "attrLabels", vbBinaryCompare)
            outputCollection.Add CreateNameValuePair(tableRow.Cells(columnIndex).innerText, tableRow.Cells(columnIndex + 1).innerText)
        Next columnIndex
    Next tableRow

    Set CreateNameValuePairsForWheelOnly = outputCollection
End Function

Private Function CreateNameValuePair(ByVal someName As String, ByVal someValue As String) As String()
    Dim outputArray(0 To 1) As String
    outputArray(0) = someName
    outputArray(1) = someValue
    CreateNameValuePair = outputArray
End Function

Private Function GetColumnIndexOfHeader(ByVal columnIndexes As Collection, ByVal header As String) As Long
    ' Should return a 1-based column index associated with "header".
    ' If "header" does not exist within collection, 0 is returned.
    On Error Resume Next
    GetColumnIndexOfHeader = columnIndexes(header)
    On Error GoTo 0
End Function

This code is slow for a number of reasons:

  • A lot of time is spent waiting for a response from the server.
  • Each item is scraped in a serial and synchronous manner.
  • Output is written to the sheet one cell at a time (instead of using arrays and reducing the number of read/write operations involving the sheet).
  • No toggling of Application.Calculation or Application.ScreenUpdating.
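On the third point, a minimal sketch of the batch-write idea: accumulate values in a 2D Variant array and assign it to a Range in one operation. This assumes a fixed, known column layout (unlike the dynamic headers above, which would need the array sized up front or resized), and WriteRowsInOneOperation is a hypothetical helper, not part of the code above:

```vba
' Hypothetical sketch: write all name/value pairs with a single Range assignment
' instead of one write per cell. Assumes nameValueRows is a Collection of
' 2-element arrays, like the ones produced by CreateNameValuePair.
Private Sub WriteRowsInOneOperation(ByVal destinationSheet As Worksheet, ByVal nameValueRows As Collection)
    Dim outputValues() As Variant
    ReDim outputValues(1 To nameValueRows.Count, 1 To 2)

    ' Copy each pair into the staging array.
    Dim rowIndex As Long
    Dim pair As Variant
    For Each pair In nameValueRows
        rowIndex = rowIndex + 1
        outputValues(rowIndex, 1) = pair(0)
        outputValues(rowIndex, 2) = pair(1)
    Next pair

    ' One write operation for the whole block.
    destinationSheet.Range("A1").Resize(nameValueRows.Count, 2).Value = outputValues
End Sub
```

Combined with toggling Application.ScreenUpdating and Application.Calculation off around the write, this usually removes the sheet I/O from the profile entirely, leaving the network requests as the dominant cost.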

Once you think the code is working, you'll want to remove the line If GetUrlsOfItemsToScrape.Count > 10 Then Exit Do in the GetUrlsOfItemsToScrape function; otherwise you won't scrape all items.

I've left DoEvents inside the Do loops to keep things responsive (at the cost of some performance, perhaps).