
I'm working on a scraper that runs as a Chrome extension. It grabs all of the HTML on the page(s) and sends it to a Python script that filters and saves the data. I'm scraping this way because the website uses Distil Networks, which blocks a 'traditional' scraper.

The connection between the two programs succeeds, but whenever I try to send 'Test.' to the Python server, it just prints the browser's request headers:

b'GET / HTTP/1.1 Host: localhost:18364 Connection: Upgrade Pragma: no-cache Cache-Control: no-cache User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36 Upgrade: websocket Origin: chrome-extension://ocplnbpkkcpcomkjioockgnlohhkdeic Sec-WebSocket-Version: 13 Accept-Encoding: gzip, deflate, br Accept-Language: nl-NL,nl;q=0.9,en-US;q=0.8,en;q=0.7 Sec-WebSocket-Key: SDC7zPgHK/eV+QRSJy0DZQ== Sec-WebSocket-Extensions: permessage-deflate; client_max_window_bits'

The JavaScript Code (Client):

chrome.runtime.onMessage.addListener(function(request, sender) {
  if (request.action == "getSource") {
    var pageAmount = parseInt(request.source, 10)

    var allHTML = ""
    var BaseURL = "https://www.funda.nl/huur/rotterdam/p"

    function encode_utf8(s) {
      return unescape(encodeURIComponent(s));
    }

    var websocket = new WebSocket('ws://localhost:18364');

    websocket.onopen = function () {
      data = encode_utf8('Test.')
      websocket.send('Test.');
    };
    message.innerText = request.source;
  }
});

function onWindowLoad() {

  var message = document.querySelector('#message');

  chrome.tabs.executeScript(null, {
    file: "getPageContent.js"
  }, function() {
    // If you try and inject into an extensions page or the webstore/NTP you'll get an error
    if (chrome.runtime.lastError) {
      message.innerText = 'There was an error injecting script : \n' + chrome.runtime.lastError.message;
    }
  });
}

window.onload = onWindowLoad;

The Python code (Server):

import socket

LocalSocket = socket.socket()
allHTML = ''

try:  # Connecting the Socket
    LocalSocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    LocalSocket.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    LocalSocket.bind(('localhost', 18364))
    print("Connected.")
except socket.error as err:
    print("ConnectionError: %s" % err)


def main():
    LocalSocket.listen(1)

    c, addr = LocalSocket.accept()
    print('Got connection from', addr)
    print(c.recv(1024))

    c.close()

if __name__ == "__main__":
    main()

1 Answer


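WebSockets are layered over HTTP, so this is expected behaviour: what your server printed is the browser's opening handshake request, not your message. You need a web server (or something that speaks HTTP) to handle the Connection: Upgrade and Upgrade: websocket headers and complete the rest of the handshake. Only then do you have a valid connection that supports bi-directional communication, and 'Test.' arrives afterwards as a WebSocket frame rather than as plain bytes.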
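To make the handshake concrete, here is a minimal sketch of the server's side of it as described in RFC 6455, using only the standard library. The function names (`accept_key`, `handshake_response`, `decode_short_frame`) are my own, not part of any library, and the frame decoder is deliberately limited to a single masked frame with a payload under 126 bytes, which is enough for 'Test.' but not for a page of HTML.

```python
import base64
import hashlib

# Fixed GUID that RFC 6455 tells the server to append to the client's key.
WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"


def accept_key(sec_websocket_key: str) -> str:
    """Derive the Sec-WebSocket-Accept value from the Sec-WebSocket-Key header."""
    digest = hashlib.sha1((sec_websocket_key + WS_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")


def handshake_response(request: bytes) -> bytes:
    """Build the '101 Switching Protocols' reply for a raw upgrade request."""
    headers = {}
    # Skip the request line, then collect 'Name: value' header lines.
    for line in request.decode("latin-1").split("\r\n")[1:]:
        if ":" in line:
            name, value = line.split(":", 1)
            headers[name.strip().lower()] = value.strip()
    key = headers["sec-websocket-key"]
    return (
        "HTTP/1.1 101 Switching Protocols\r\n"
        "Upgrade: websocket\r\n"
        "Connection: Upgrade\r\n"
        f"Sec-WebSocket-Accept: {accept_key(key)}\r\n"
        "\r\n"
    ).encode("ascii")


def decode_short_frame(frame: bytes) -> bytes:
    """Unmask one client frame whose payload is shorter than 126 bytes."""
    masked = frame[1] & 0x80
    length = frame[1] & 0x7F
    if not masked or length >= 126:
        raise ValueError("only short, masked client frames are handled here")
    mask = frame[2:6]
    payload = frame[6:6 + length]
    # Browsers XOR every payload byte with a 4-byte masking key.
    return bytes(b ^ mask[i % 4] for i, b in enumerate(payload))
```

In the `main()` from the question, this would roughly mean sending `handshake_response(c.recv(1024))` back with `c.sendall(...)` and then passing the next `c.recv(1024)` to `decode_short_frame`. That assumes each `recv` returns exactly one complete message, which TCP does not guarantee, so treat it as an illustration of the protocol and not as a production server.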

You could look at the websockets package, which wraps this up nicely.
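As a rough sketch of what that could look like for the server in the question: this assumes the third-party package is installed (`pip install websockets`) and reuses port 18364 from the question. The `handler` and `received` names are placeholders of mine, and the optional `path` argument is only there because older releases of the package pass it as a second argument.

```python
import asyncio

# Messages collected from the extension; stands in for the filter-and-save step.
received = []


async def handler(websocket, path=None):
    """Called once per client; iterating yields each complete message."""
    async for message in websocket:
        received.append(message)
        print("Received:", message)


async def main():
    # Third-party package; imported here so the module loads without it.
    import websockets

    # The package performs the HTTP upgrade handshake and frame decoding.
    async with websockets.serve(handler, "localhost", 18364):
        await asyncio.Future()  # keep serving until the process is stopped
```

Start it with `asyncio.run(main())`. The extension code can stay as it is: once the handshake is handled, `websocket.send('Test.')` should show up in `handler` as the string 'Test.'.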