I've downloaded data from a local metro-list and I want to analyze different market trends in the area. I have the tools to analyze the trends on the metro-list site, but I want to get more familiar with plotting using Pandas and matplotlib.
Currently, this what I have.
train_20901 = pd.read_csv('C:/Users/Damanjit/Documents/marketTrends/data/area20901.csv')
train_20901.head()
#small snippet of the full data, to see full data see images.
Address - Zip Code Listing Date Listing Price
0 95242 10/15/2012 15000
1 95240 2/14/2011 67900
2 95240 12/18/2008 31900
3 95240 8/24/2011 22000
4 95240 7/15/2012 25000
#Need to create Days on Market for training data
train_20901['Listing Date'] = pd.to_datetime(train_20901['Listing Date'])
train_20901['Pending Date'] = pd.to_datetime(train_20901['Pending Date'])
train_20901['Selling Date'] = pd.to_datetime(train_20901['Selling Date'])
train_20901['Days on Market'] = train_20901['Pending Date'] - train_20901['Listing Date']
train_20901['Days on Market'] = (train_20901['Days on Market'] / np.timedelta64(1, 'D')).astype(int)
train_20901.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6185 entries, 0 to 6184
Data columns (total 9 columns):
Address 6185 non-null object
Address - Zip Code 6185 non-null int64
Listing Date 6185 non-null datetime64[ns]
Listing Price 6185 non-null int64
Pending Date 6185 non-null datetime64[ns]
Price Per Sq Ft 6185 non-null float64
Selling Date 6185 non-null datetime64[ns]
Selling Price 6185 non-null int64
Days on Market 6185 non-null int32
dtypes: datetime64[ns](3), float64(1), int32(1), int64(3), object(1)
memory usage: 410.8+ KB
train_20901.groupby(train_20901['Address - Zip Code']).median()
Address - Zip Code Listing Price Price Per Sq Ft Selling Price Days on Market
95240 169900 122.820 168000 35.0
95242 267750 150.945 265000 35.5
95258 265000 148.620 258050 41.0
Opening the Data and Preparing it
Grouping the Data by Zip Codes
What I want to be able to do is use the listing date to plot how the listing price's median change over time. The dates start from 10/22/2008 and end 12/29/2016. I want to bin the dates so that only the days and years are showing. For example, I want to plot the listing price median in 12/09 for the three different zip codes of 95242, 95240, and 95258. So there should be three different points for the bin in the time period of 12/09. I want to do this for all the different time periods between 10/08 and 12/16. There should be 99 bins in total for the graph. What I'm confused on is how to structure my data in the DataFrame to accomplish this and plot it. I've tried pandas.resample and pandas.pivot, but didn't get it to work. Any small hints in the right direction would be extremely appreciated.
What I guess my table to should look like at the end is something like:
Listing Date Zip Code Median Home Prices
10/08 95240 168000
10/08 95242 216500
10/08 95258 210000
11/08 95240 171400
11/08 95242 219000
11/08 95258 212100
...and so on, where each zip codes has its own median price for the certain time period. I can hard code using a dictionary for each month and year and find the median home prices for a certain zip code that way, but I feel like pandas should have a simpler way of doing this for large data set.
EDIT
------------------------------------------------------------------------------
So when I use resample, I do create the bins for the time series data, but I want to individually obtain the different zip codes. I have this when I resample
train_20901.resample('M', on = 'Selling Date').median()
Selling Date Zip Code Days on Market Listing Price Price Per Sq Ft Selling Price
2008-10-31 95240 21.0 160500 122.575 150000
2008-11-30 95240 36.0 179400 136.710 180000
2008-12-31 95240 32.0 165400 126.665 165025
2009-01-31 95240 37.5 165200 122.075 165000
2009-02-28 95240 52.0 174700 116.925 172500
2009-03-31 95240 48.5 129900 99.300 124950
2009-04-30 95240 35.5 164950 114.900 163250
2009-05-31 95240 21.0 159000 122.860 165000
2009-06-30 95240 36.0 134900 110.470 126000
2009-07-31 95240 14.0 174900 121.800 177500
Do I use the groupby function, but if so how do I keep the index as the timeseries, but get the different zip codes, not the median?
SOLVED
I figured it out by doing this.
resampled = train_20901.set_index('Selling Date').groupby('Address - Zip Code').resample('M').median()
This is exactly what I wanted, now I just have to plot.