2
votes

Background: I have a large dataset (>100000 entries) containing fiscal information of major institutions/companies. There are several columns containing information such as fiscal year (column 1), company name (column 5), name of director (column 6), annual revenue (column 11), etc.. Ideally, each company should have included their fiscal information from 1996 to 2006 into this dataset. However, many companies are missing information for one or more years and should therefore be excluded from any further analysis.

Here is a screenshot of my dataset: BoardCharacteristics

As you can see, many of the companies included in this snapshot do not have provided complete information for fiscal years 1996 to 2006.

Objective: The first step is to filter this dataset in such a way that only companies which have provided information for the entire timespan (i.e. from 1996 to 2006) are to be included in a subsequent analysis. Since the provided information should ideally equal to 11 rows (i.e. 1996 - 2006) per company at minimum, with many companies having included more than one director's name per fiscal year, my initial idea was to specify a filter that would only select specific sequences of rows (starting with 1996 and ending in 2006) and do this sequentially for the entire dataset whilst omitting sequences that are not complete (e.g. years 2001 to 2006 of company A missing) or anything in between. However, due to the inconsistency of sequence length and variability of sequence composition, a simple/rigid filter which would select any sequence which starts with 1996 and ends in 2006 would not suffice.

There are several complications of which I am aware:

  • not every complete sequence consists of exactly 11 rows, since many companies included more than one director's name per fiscal year (e.g. 3 director's names for fiscal year 2001 --> 13 rows in total for this company)
  • potential sequence overlap between companies [e.g. company A provided years 1996 to 1999; company B (which comes immediately after company A) 2000 to 2006] --> this would result in a sequence from 1996 to 2006 including companies A and B (which is not desirable)

I have already tried several functions, including the filter by range function:

RANGE(exp,low,high) --> RANGE(year,1996,2006)

As expected, this did not work. I also tried to filter this dataset in Excel, but to no avail.

While I doubt that there is a single function to solve this matter, I have still not come by any useful syntax to resolve this problem. Therefore, I would greatly appreciate some input. If there is any lack of clarity regarding my problem statement, feel free to ask.

1

1 Answers

1
votes

Use of AGGREGATE is what is requried here. Here is an example to get you off the blocks:

DATA LIST FREE / Company Year.
BEGIN DATA
1, 1995
1, 1996
1, 1997
1, 1998
1, 1999
1, 2000
1, 2001 
1, 2002
1, 2002
1, 2002
1, 2003
1, 2004 
1, 2005 
1, 2006
2, 1996
2, 1997
2, 2005 
2, 2006
END DATA.
DATASET NAME DS0.

/* Filter data to retain only years of interest*/.
SELECT IF RANGE(Year, 1996,2006).

/* Aggregate to remove multiple year entries */.
DATASET DECLARE DSBREAK01.
AGGREGATE OUTFILE=DSBREAK01 /BREAK=Company Year /Count1=N.

/* Aggregate to find first and last year present in data and a count of number of other distinct years in-between*/.
DATASET ACTIVATE DSBREAK01.
DATASET DECLARE DSBREAK02.
AGGREGATE OUTFILE=DSBREAK02 /BREAK=Company /Count2=N /FirstYear=MIN(Year) / LastYear=MAX(Year).
DATASET ACTIVATE DSBREAK02.

/* Compute flag for companies which match desired conditions*/.
COMPUTE Flag=(Count2=11 AND FirstYear=1996 AND LastYear=2006).

/* Match flag variable to original dataset for further processing*/.
DATASET ACTIVATE DS0.
MATCH FILES FILE=* /TABLE=DSBREAK02 /BY Company.
ADD FILES FILE=* /DROP=Count2 FirstYear LastYear.