Background: I have a large dataset (>100000 entries) containing fiscal information of major institutions/companies. There are several columns containing information such as fiscal year (column 1), company name (column 5), name of director (column 6), annual revenue (column 11), etc.. Ideally, each company should have included their fiscal information from 1996 to 2006 into this dataset. However, many companies are missing information for one or more years and should therefore be excluded from any further analysis.
Here is a screenshot of my dataset: BoardCharacteristics
As you can see, many of the companies included in this snapshot do not have provided complete information for fiscal years 1996 to 2006.
Objective: The first step is to filter this dataset in such a way that only companies which have provided information for the entire timespan (i.e. from 1996 to 2006) are to be included in a subsequent analysis. Since the provided information should ideally equal to 11 rows (i.e. 1996 - 2006) per company at minimum, with many companies having included more than one director's name per fiscal year, my initial idea was to specify a filter that would only select specific sequences of rows (starting with 1996 and ending in 2006) and do this sequentially for the entire dataset whilst omitting sequences that are not complete (e.g. years 2001 to 2006 of company A missing) or anything in between. However, due to the inconsistency of sequence length and variability of sequence composition, a simple/rigid filter which would select any sequence which starts with 1996 and ends in 2006 would not suffice.
There are several complications of which I am aware:
- not every complete sequence consists of exactly 11 rows, since many companies included more than one director's name per fiscal year (e.g. 3 director's names for fiscal year 2001 --> 13 rows in total for this company)
- potential sequence overlap between companies [e.g. company A provided years 1996 to 1999; company B (which comes immediately after company A) 2000 to 2006] --> this would result in a sequence from 1996 to 2006 including companies A and B (which is not desirable)
I have already tried several functions, including the filter by range function:
RANGE(exp,low,high) --> RANGE(year,1996,2006)
As expected, this did not work. I also tried to filter this dataset in Excel, but to no avail.
While I doubt that there is a single function to solve this matter, I have still not come by any useful syntax to resolve this problem. Therefore, I would greatly appreciate some input. If there is any lack of clarity regarding my problem statement, feel free to ask.