1
votes

I've downloaded data from a local metro-list and I want to analyze different market trends in the area. I have the tools to analyze the trends on the metro-list site, but I want to get more familiar with plotting using Pandas and matplotlib.

Currently, this what I have.

train_20901 = pd.read_csv('C:/Users/Damanjit/Documents/marketTrends/data/area20901.csv')
train_20901.head()
#small snippet of the full data, to see full data see images. 
     Address - Zip Code     Listing Date      Listing Price
0    95242                  10/15/2012        15000 
1    95240                  2/14/2011         67900 
2    95240                  12/18/2008        31900 
3    95240                  8/24/2011         22000 
4    95240                  7/15/2012         25000

#Need to create Days on Market for training data
train_20901['Listing Date'] = pd.to_datetime(train_20901['Listing Date'])
train_20901['Pending Date'] = pd.to_datetime(train_20901['Pending Date'])
train_20901['Selling Date'] = pd.to_datetime(train_20901['Selling Date'])

train_20901['Days on Market'] = train_20901['Pending Date'] - train_20901['Listing Date']
train_20901['Days on Market'] = (train_20901['Days on Market'] / np.timedelta64(1, 'D')).astype(int)

train_20901.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6185 entries, 0 to 6184
Data columns (total 9 columns):
Address               6185 non-null object
Address - Zip Code    6185 non-null int64
Listing Date          6185 non-null datetime64[ns]
Listing Price         6185 non-null int64
Pending Date          6185 non-null datetime64[ns]
Price Per Sq Ft       6185 non-null float64
Selling Date          6185 non-null datetime64[ns]
Selling Price         6185 non-null int64
Days on Market        6185 non-null int32
dtypes: datetime64[ns](3), float64(1), int32(1), int64(3), object(1)
memory usage: 410.8+ KB

train_20901.groupby(train_20901['Address - Zip Code']).median()

Address - Zip Code  Listing Price   Price Per Sq Ft Selling Price   Days on Market              
95240               169900          122.820          168000         35.0
95242               267750          150.945          265000         35.5
95258               265000          148.620          258050         41.0

Opening the Data and Preparing it

enter image description here

Grouping the Data by Zip Codes

enter image description here

What I want to be able to do is use the listing date to plot how the listing price's median change over time. The dates start from 10/22/2008 and end 12/29/2016. I want to bin the dates so that only the days and years are showing. For example, I want to plot the listing price median in 12/09 for the three different zip codes of 95242, 95240, and 95258. So there should be three different points for the bin in the time period of 12/09. I want to do this for all the different time periods between 10/08 and 12/16. There should be 99 bins in total for the graph. What I'm confused on is how to structure my data in the DataFrame to accomplish this and plot it. I've tried pandas.resample and pandas.pivot, but didn't get it to work. Any small hints in the right direction would be extremely appreciated.

What I guess my table to should look like at the end is something like:

Listing Date    Zip Code     Median Home Prices
10/08           95240        168000
10/08           95242        216500
10/08           95258        210000
11/08           95240        171400
11/08           95242        219000
11/08           95258        212100

...and so on, where each zip codes has its own median price for the certain time period. I can hard code using a dictionary for each month and year and find the median home prices for a certain zip code that way, but I feel like pandas should have a simpler way of doing this for large data set.

EDIT

------------------------------------------------------------------------------

So when I use resample, I do create the bins for the time series data, but I want to individually obtain the different zip codes. I have this when I resample

train_20901.resample('M', on = 'Selling Date').median()




Selling Date    Zip Code    Days on Market  Listing Price   Price Per Sq Ft Selling Price               
2008-10-31        95240         21.0           160500         122.575        150000
2008-11-30        95240         36.0           179400         136.710        180000
2008-12-31        95240         32.0           165400         126.665        165025
2009-01-31        95240         37.5           165200         122.075        165000
2009-02-28        95240         52.0           174700         116.925        172500
2009-03-31        95240         48.5           129900         99.300         124950
2009-04-30        95240         35.5           164950         114.900        163250
2009-05-31        95240         21.0           159000         122.860        165000
2009-06-30        95240         36.0           134900         110.470        126000
2009-07-31        95240         14.0           174900         121.800        177500

Do I use the groupby function, but if so how do I keep the index as the timeseries, but get the different zip codes, not the median?


SOLVED

I figured it out by doing this.

resampled = train_20901.set_index('Selling Date').groupby('Address - Zip Code').resample('M').median()

This is exactly what I wanted, now I just have to plot.

1
Include the code as textMohammad Yusuf
I've added the code as text.Damanjit Hundal

1 Answers

0
votes

As shown above, this was solved with:

resampled = train_20901.set_index('Selling Date').groupby('Address - Zip Code').resample('M').median()