This is a simple machine learning tutorial in Python. I am new to machine learning and hence wanted to keep it extremely simple and short. I loaded a data frame using Quandl, which provides free financial data. For this tutorial I followed along with a YouTube series of Python tutorials by sentdex. Broadly, the project involves taking stock price data, doing simple feature engineering to obtain meaningful features, defining a label, and finally running a linear regression. This tutorial is structured as follows: Loading and Understanding Data, Feature Engineering, Machine Learning (creating features and a label, then regression), and Prediction.

Loading and Understanding Data

Before loading the data, I imported the relevant Python modules, as you can see in the first snippet of my code. For instance, to load the data frame using Quandl, I imported the 'quandl' module. Once I had the data loaded, I tried to understand it by printing the first few lines of the data frame with the df.head() command.

# import relevant modules
import pandas as pd
import numpy as np
import quandl, math
import datetime

# Machine Learning
# (model_selection replaces the deprecated sklearn.cross_validation module)
from sklearn import preprocessing, model_selection, svm
from sklearn.linear_model import LinearRegression

#Visualization
import matplotlib 
import matplotlib.pyplot as plt
%matplotlib inline
matplotlib.style.use('ggplot')

# Get a unique Quandl API key by creating a free account with Quandl,
# then directly load GOOGL financial data

quandl.ApiConfig.api_key = 'q-UWpMLYsWKFejy5y-4a'
df = quandl.get('WIKI/GOOGL')

# Getting a peek into data 
# I am using the round function to show only one decimal digit
print(df.head(2).round(1))
print('\n')

# Also print columns and index
print(df.columns)
print(df.index)
             Open   High    Low  Close      Volume  Ex-Dividend  Split Ratio  \
Date                                                                           
2004-08-19  100.0  104.1   96.0  100.3  44659000.0          0.0          1.0   
2004-08-20  101.0  109.1  100.5  108.3  22834300.0          0.0          1.0   

            Adj. Open  Adj. High  Adj. Low  Adj. Close  Adj. Volume  
Date                                                                 
2004-08-19       50.2       52.2      48.1        50.3   44659000.0  
2004-08-20       50.7       54.7      50.4        54.3   22834300.0  


Index(['Open', 'High', 'Low', 'Close', 'Volume', 'Ex-Dividend', 'Split Ratio',
       'Adj. Open', 'Adj. High', 'Adj. Low', 'Adj. Close', 'Adj. Volume'],
      dtype='object')
DatetimeIndex(['2004-08-19', '2004-08-20', '2004-08-23', '2004-08-24',
               '2004-08-25', '2004-08-26', '2004-08-27', '2004-08-30',
               '2004-08-31', '2004-09-01',
               ...
               '2017-03-14', '2017-03-15', '2017-03-16', '2017-03-17',
               '2017-03-20', '2017-03-21', '2017-03-22', '2017-03-23',
               '2017-03-24', '2017-03-27'],
              dtype='datetime64[ns]', name='Date', length=3173, freq=None)

Feature Engineering

Before I even got into feature engineering, I noticed that the data has pairs of very similar features, such as Open and Adj. Open. These only differ when a stock split or merger happens. In this tutorial I work only with the adjusted quantities, as they are largely self-contained. I also discarded the other columns that I thought weren't that important (Ex-Dividend and Split Ratio).
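As a quick sanity check (a small sketch, assuming df is still the full frame loaded above), you can list the rows where a split was recorded; the adjusted columns fold these splits (and dividends) back into earlier prices, which is why the 2004 rows printed earlier show Open around 100 but Adj. Open around 50.

# Show the rows where a stock split was recorded in the data
print(df.loc[df['Split Ratio'] != 1, ['Close', 'Adj. Close', 'Split Ratio']])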

Then I refined the features further based on a general understanding of financial data. For instance, instead of dealing with High and Low separately, I created a high-low volatility feature, HL_PCT = (Adj. High - Adj. Low) / Adj. Low, and similarly a daily change feature, PCT_CHNG = (Adj. Close - Adj. Open) / Adj. Open, as shown below. (Note that the code divides these ratios by 100 rather than multiplying by 100, so the stored values are not true percentages; this makes no difference to the model because the features are standardized before regression.)

# Discarding features that aren't useful
df = df[['Adj. Open','Adj. High', 'Adj. Low', 'Adj. Close', 'Adj. Volume']]

# define a new feature, HL_PCT
df['HL_PCT'] = (df['Adj. High'] - df['Adj. Low'])/(df['Adj. Low']*100)

# define a new feature percentage change
df['PCT_CHNG'] = (df['Adj. Close'] - df['Adj. Open'])/(df['Adj. Open']*100)

df = df[['Adj. Close', 'HL_PCT', 'PCT_CHNG', 'Adj. Volume']]

print(df.head(1))
            Adj. Close    HL_PCT  PCT_CHNG  Adj. Volume
Date                                                   
2004-08-19   50.322842  0.000844  0.000032   44659000.0

Then I plotted my features as a function of date, which is stored in the index of the data frame. Since the share price rises almost linearly with time, linear regression should give a good prediction!

# Visualization

df['Adj. Close'].plot(figsize=(15,6), color="green")
plt.legend(loc=4)
plt.xlabel('Date')
plt.ylabel('Price')
plt.show()

df['HL_PCT'].plot(figsize=(15,6), color="red")
plt.xlabel('Date')
plt.ylabel('High Low Percentage')
plt.show()

df['PCT_CHNG'].plot(figsize=(15,6), color="blue")
plt.xlabel('Date')
plt.ylabel('Percent Change')
plt.show()

[Figures: Adj. Close, HL_PCT, and PCT_CHNG plotted against date]

Machine Learning

Creating Features and Label

I chose the adjusted close 30 days into the future, forecast_out, as my label (the quantity I want to predict). This choice is completely flexible: the smaller the value of forecast_out, the more accurate the model tends to be. An important thing to note is that once we shift the data by the number of forecast days (say n) to create the 'label' column, the last n rows of that column will contain NaNs.
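To make the shift behaviour concrete, here is a minimal sketch on a toy Series (the values are illustrative only):

# shift(-2) pulls each value 2 rows forward, leaving NaNs in the last 2 rows
s = pd.Series([10, 11, 12, 13, 14])
print(s.shift(-2))
# 0    12.0
# 1    13.0
# 2    14.0
# 3     NaN
# 4     NaN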

# pick a forecast column
forecast_col = 'Adj. Close'

# Choosing 30 days as the number of forecast days
forecast_out = 30
print('length =',len(df), "and forecast_out =", forecast_out)
length = 3173 and forecast_out = 30
# Creating label by shifting 'Adj. Close' according to 'forecast_out'
df['label'] = df[forecast_col].shift(-forecast_out)
print(df.head(2))
print('\n')
# The tail consists of n (= forecast_out) rows with NaN in the 'label' column
print(df.tail(2))
            Adj. Close    HL_PCT  PCT_CHNG  Adj. Volume      label
Date                                                              
2004-08-19   50.322842  0.000844  0.000032   44659000.0  66.495265
2004-08-20   54.322689  0.000854  0.000723   22834300.0  67.739104


            Adj. Close    HL_PCT  PCT_CHNG  Adj. Volume  label
Date                                                          
2017-03-24      835.14  0.000180 -0.000081    2080936.0    NaN
2017-03-27      838.51  0.000207  0.000126    1922073.0    NaN
# Define the feature matrix X by excluding the 'label' column we just created
X = np.array(df.drop(['label'], axis=1))

# Use sklearn's preprocessing module to standardize the features (zero mean, unit variance)
X = preprocessing.scale(X)
print(X[1,:])
[-1.51873027  4.29658969  4.73498142  1.73495807]
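As a quick check (a small sketch, assuming X was just scaled as above), each column should now have roughly zero mean and unit standard deviation:

# After preprocessing.scale, every column has mean ~0 and standard deviation ~1
print(X.mean(axis=0).round(6))
print(X.std(axis=0).round(6))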
# X still contains the last n (= forecast_out) rows, for which we have no label data
# Put those rows in a separate matrix X_forecast_out and keep the rest in X

X_forecast_out = X[-forecast_out:]
X = X[:-forecast_out]
print ("Length of X_forecast_out:", len(X_forecast_out), "& Length of X :", len(X))

Length of X_forecast_out: 30 & Length of X : 3143
# Similarly, define the label vector y for the rows where the label is known
# A good check is to make sure the lengths of X and y are identical
y = np.array(df['label'])
y = y[:-forecast_out]
print('Length of y: ',len(y))
Length of y:  3143
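The same check can be made explicit with an assertion (a one-line sketch):

# Fail loudly if the feature matrix and label vector ever get out of sync
assert len(X) == len(y), "X and y must have the same number of rows"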

Regression

Finally, I try out linear regression on the data set after dividing it into training and test data.

Creating Training and Test Sets

The train_test_split helper shuffles the data and, according to the test_size argument, splits it into training and test sets. (Strictly speaking this is a single hold-out split rather than full cross-validation.)

# Split into training and test data
# test_size = 0.2 ==> 20% of the data is test data
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size = 0.2)

print('length of X_train and x_test: ', len(X_train), len(X_test))
length of X_train and x_test:  2514 629
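By default the split is random, so the exact rows in each set vary from run to run. For a reproducible split (a sketch using the same helper), a fixed random_state can be passed:

# Fixing random_state makes the shuffle, and hence the split, reproducible
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.2, random_state=42)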

Training and testing

Now it's time to use linear regression. Having split the data into 80% training and 20% test data, I trained a linear regression model on the training set and then evaluated it on the test set. Note that for a regressor, clf.score returns the coefficient of determination R², so the 'accuracy' below is an R² score rather than a classification accuracy.

# Train
clf = LinearRegression()
clf.fit(X_train,y_train)
# Test
accuracy = clf.score(X_test, y_test)
print("Accuracy of Linear Regression: ", accuracy)
Accuracy of Linear Regression:  0.97469687946
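As a complementary check (a sketch, not part of the original tutorial), the test error can also be expressed in price units using sklearn.metrics:

from sklearn.metrics import mean_absolute_error

# Mean absolute error of the predictions on the held-out test set, in price units
y_pred = clf.predict(X_test)
print("Test MAE:", mean_absolute_error(y_test, y_pred))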

Prediction

It seems like this linear regression model did fairly well on the test set! Now I can go ahead and use it to predict share prices for the next 30 days.

# Predict using our Model
forecast_prediction = clf.predict(X_forecast_out)
print(forecast_prediction)
[ 846.77765824  847.91972606  845.22013582  849.99170986  854.03470016
  857.04579418  858.979216    858.53564003  855.59422798  857.30445545
  852.70919094  863.95054971  857.77724182  856.95606759  855.05463666
  858.79589268  861.49151795  865.31415252  869.11320698  872.18038495
  873.54875643  875.9179302   877.79452922  879.84986158  875.53405708
  857.17235423  857.42685204  846.64424381  842.49199376  845.18269079]

I then plot the predicted prices as a function of date. The piece of code below simply generates dates for the predicted days and attaches the forecast values to them.

# Plotting data
df.dropna(inplace=True)
df['forecast'] = np.nan
last_date = df.iloc[-1].name
last_unix = last_date.timestamp()
one_day = 86400
next_unix = last_unix + one_day

# Attach each predicted value to the next calendar day
# (calendar days, so weekends are included even though markets are closed)
for i in forecast_prediction:
    next_date = datetime.datetime.fromtimestamp(next_unix)
    next_unix += one_day
    df.loc[next_date] = [np.nan for _ in range(len(df.columns)-1)]+[i]
df['Adj. Close'].plot(figsize=(15,6), color="green")
df['forecast'].plot(figsize=(15,6), color="orange")
plt.legend(loc=4)
plt.xlabel('Date')
plt.ylabel('Price')
plt.show()

[Figure: Adj. Close history (green) with the 30-day forecast (orange)]
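One caveat with the loop above: stepping forward in whole calendar days places forecast points on weekends, when the market is closed. A minimal alternative (a sketch to run instead of that loop, assuming forecast_prediction and df as defined above) steps over business days instead:

# Generate the next 30 business days after the last known date and attach the forecasts
last_date = df.iloc[-1].name
future_dates = pd.bdate_range(start=last_date + pd.Timedelta(days=1), periods=forecast_out)
for next_date, value in zip(future_dates, forecast_prediction):
    df.loc[next_date] = [np.nan for _ in range(len(df.columns) - 1)] + [value]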

# Zoomed in (from April 2015 onward)
df['Adj. Close'].plot(figsize=(15,6), color="green")
df['forecast'].plot(figsize=(15,6), color="orange")
plt.xlim(xmin=datetime.date(2015, 4, 26))
plt.ylim(ymin=500)
plt.legend(loc=4)
plt.xlabel('Date')
plt.ylabel('Price')
plt.show()

[Figure: zoomed-in view of Adj. Close and the 30-day forecast, from April 2015 onward]

There it is: a prediction of stock prices for the next 30 days using linear regression.