Has anyone tried to extract the individual risk factors from the Risk Factors section (Item 1A) of a company's 10-K filings on EDGAR, using BeautifulSoup or another web scraping library together with regular expressions?
It would be very helpful if you could point me to a GitHub repository, some pseudocode, or at least a head start so that I can move forward.
EDIT: Some example 10-Ks:
- https://www.sec.gov/Archives/edgar/data/1350653/000156459018005156/atec-10k_20171231.htm
- https://www.sec.gov/Archives/edgar/data/1591890/000149315218003887/form10-k.htm
- https://www.sec.gov/Archives/edgar/data/750574/000119312518080325/d472492d10k.htm
- https://www.sec.gov/Archives/edgar/data/773840/000093041318000292/c89913_10k.htm
- https://www.sec.gov/Archives/edgar/data/12927/000001292718000007/a201712dec3110k.htm
I have given more than one example because the HTML markup differs so much from filing to filing that a single regular expression is unlikely to work for all of them.
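To make the problem concrete, here is a minimal sketch of one possible starting point, not a tested solution for the filings above. It assumes the filing HTML has already been downloaded into a string, flattens it to plain text first so the regular expressions do not depend on each filing's markup, and then keeps the longest span between an "Item 1A" line and the next "Item 1B" or "Item 2" line (the shorter matches are usually the table of contents). The names `html_to_text` and `extract_item_1a` are hypothetical, and the sketch uses the standard library's `HTMLParser` so it runs without dependencies; BeautifulSoup's `get_text()` could take over the flattening step.

```python
import re
from html.parser import HTMLParser

# Tags treated as line breaks when flattening; this set is an assumption
# and may need extending for a particular filing.
BLOCK_TAGS = {"p", "div", "br", "tr", "li", "table", "h1", "h2", "h3", "h4"}


class _TextExtractor(HTMLParser):
    """Collects visible text, inserting newlines around block-level tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    """Flatten HTML to text with one block per line (hypothetical helper)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts).replace("\xa0", " ")  # non-breaking spaces
    text = re.sub(r"[ \t\r\f\v]+", " ", text)          # collapse spaces
    text = re.sub(r"\s*\n\s*", "\n", text)             # collapse blank lines
    return text.strip()


# Headings are matched at the start of a line, case-insensitively.
ITEM_1A = re.compile(r"^[ \t]*item\s*1a\b", re.I | re.M)
ITEM_NEXT = re.compile(r"^[ \t]*item\s*(?:1b|2)\b", re.I | re.M)


def extract_item_1a(text):
    """Return the longest 'Item 1A' ... 'Item 1B/2' span, or '' if none."""
    best = ""
    for start in ITEM_1A.finditer(text):
        end = ITEM_NEXT.search(text, start.start() + 1)
        stop = end.start() if end else len(text)
        candidate = text[start.start():stop]
        if len(candidate) > len(best):
            best = candidate
    return best.strip()
```

Calling `extract_item_1a(html_to_text(html))` would return the section as text with one paragraph per line. Splitting that into individual risk factors is the part that still varies by filing: the factor headings are typically set in bold or italics, so that step probably needs the original tags (for example, via BeautifulSoup) rather than the flattened text.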