I am trying to get the status code from millions of different sites, I am using asyncio and aiohttp, I run the below code with a different number of connections (yet same timeout on the request) but get very different results specifically much higher number of the following exception.
'concurrent.futures._base.TimeoutError'
The code
import pandas as pd
import asyncio
import aiohttp
out = []
CONNECTIONS = 1000
TIMEOUT = 10
async def fetch(url, session, loop):
try:
async with session.get(url,timeout=TIMEOUT) as response:
res = response.status
out.append(res)
return res
except Exception as e:
_exception = 'Error: '+str(type(e))
out.append(_exception)
return _exception
async def bound_fetch(sem, url, session, loop):
async with sem:
await fetch(url, session, loop)
async def run(urls, loop):
tasks = []
sem = asyncio.Semaphore(value=CONNECTIONS,loop=loop)
_connector = aiohttp.TCPConnector(limit=CONNECTIONS, loop=loop)
async with aiohttp.ClientSession(connector=_connector,loop=loop) as session:
for url in urls:
task = asyncio.ensure_future(bound_fetch(sem, url, session, loop))
tasks.append(task)
responses = await asyncio.gather(*tasks,return_exceptions=True)
return responses
## BEGIN ##
tlds = open('data/sample_1k.txt').read().splitlines()
urls = ['http://{}'.format(x) for x in tlds[1:]]
loop = asyncio.get_event_loop()
future = asyncio.ensure_future(run(urls,loop))
ans = loop.run_until_complete(future)
print(str(pd.Series(out).value_counts()))
Results
CONNECTIONS=1000
CONNECTIONS=100
Is this a bug? These sites do response with a status code and run sequentially or with lower connections there is no timeout error so why is this happening? The other exceptions seem stable as you change number of connections. The ClientOSErrors are from sites that actually timeout or respond, honestly don't really know where the concurrent.futures._base.TimeoutError errors are coming from.

