3 votes

Has anyone tried to extract the individual risk factors from the Risk Factors section (Item 1A) of a company's EDGAR 10-K filings using BeautifulSoup or another web-scraping library, together with regular expressions?

It would be very helpful if you could provide a GitHub link, pseudocode, or at least some head start so that I can move forward.

EDIT: Some examples of 10-Ks

  1. https://www.sec.gov/Archives/edgar/data/1350653/000156459018005156/atec-10k_20171231.htm
  2. https://www.sec.gov/Archives/edgar/data/1591890/000149315218003887/form10-k.htm
  3. https://www.sec.gov/Archives/edgar/data/750574/000119312518080325/d472492d10k.htm
  4. https://www.sec.gov/Archives/edgar/data/773840/000093041318000292/c89913_10k.htm
  5. https://www.sec.gov/Archives/edgar/data/12927/000001292718000007/a201712dec3110k.htm

I have given more than one example because the HTML varies so much across filings that writing a single regex to handle all of them is tough.

Can you share the URL of a sample 10-K filing which contains Section 1A? - Andrej Kesely
I have edited the question, please check. - Forscher
What information do you need to extract? The whole 1A section as text? - Andrej Kesely
Yes, the text between Item 1A. Risk Factors and Item 1B. The text in between is divided into parts by headings (possibly bold or italic), each covering a different theme. I need to extract each of those parts into a separate text file. - Forscher

1 Answer

1 vote

I spent a lot of time trying to develop a methodology using regex and had some limited success. The problem is that the underlying HTML submitted to the SEC does not strictly adhere to a standard, and many reports deviate from the usual format. Some use all caps, some use title case, and some use different combinations of letters and numbers to delineate sections. Some include introductory paragraphs that provide additional context for the risks they are about to list. So many random factors interfere with establishing any kind of pattern in the document structure that it is currently more efficient to parse these filings by human than by machine.

But there are thousands upon thousands of documents, which makes manual parsing a tedious, expensive, and drawn-out process. One approach that might be useful is Amazon's Mechanical Turk, but that would still likely require a lot of upfront development time and could run into cost constraints unless the project is well funded.
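That said, as a starting point, here is a minimal sketch of the BeautifulSoup-plus-regex approach the question describes, run against the first example filing from the question. The regex patterns, the last-match/first-match heuristic, and the User-Agent value are all assumptions of mine and will need per-filing adjustment; that fragility is exactly the problem described above.

```python
import re

import requests
from bs4 import BeautifulSoup

# One of the example filings from the question.
URL = ("https://www.sec.gov/Archives/edgar/data/1350653/"
       "000156459018005156/atec-10k_20171231.htm")

# SEC.gov expects a descriptive User-Agent; this value is a placeholder.
headers = {"User-Agent": "research-script example@example.com"}
html = requests.get(URL, headers=headers).text

# Flatten the document to plain text; the separator keeps headings on
# their own lines instead of running them into the surrounding prose.
text = BeautifulSoup(html, "html.parser").get_text(separator="\n")

# Tolerate whitespace and punctuation variants around the headings.
# "Item 1A" also appears in the table of contents, so take the last
# match for 1A and the first "Item 1B" after it -- a heuristic, not a rule.
starts = list(re.finditer(r"Item\s+1A\.?\s*Risk\s+Factors", text, re.IGNORECASE))
ends = list(re.finditer(r"Item\s+1B\.?", text, re.IGNORECASE))

if starts and ends:
    start = starts[-1].end()
    end = next((m.start() for m in ends if m.start() > start), len(text))
    risk_factors = text[start:end].strip()
    print(risk_factors[:500])  # preview of the extracted section
else:
    print("Could not locate the Item 1A / Item 1B markers in this filing.")
```

Splitting the extracted section into the individual risk factors (the bold or italic headings mentioned in the comments) is harder once the HTML is flattened to text; for that you would iterate over the parsed tree's `<b>`, `<strong>`, and `<i>` elements instead, and the tags used again vary from filing to filing.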