Analyzing Data with Python


Alfred Adjei-Darko


In [1]:
# # widen notebook cells
# from IPython.core.display import display, HTML
# display(HTML("<style>.container { width:95% !important; }</style>"))

Data pre-processing (also called data wrangling) is the process of converting data from its initial format to a format that may be better for analysis.

In [2]:
# import libraries
import pandas as pd

# data source
filename = "https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DA0101EN/auto.csv"

# headers
headers = ["symboling","normalized-losses","make","fuel-type","aspiration", "num-of-doors","body-style","drive-wheels","engine-location","wheel-base", 
           "length","width","height","curb-weight","engine-type","num-of-cylinders", "engine-size","fuel-system","bore","stroke","compression-ratio",
           "horsepower","peak-rpm","city-mpg","highway-mpg","price"]

# read in the data set
df = pd.read_csv(filename, names = headers)

# see what the data set looks like
df.head()
Out[2]:
symboling normalized-losses make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base ... engine-size fuel-system bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price
0 3 ? alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000 21 27 13495
1 3 ? alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000 21 27 16500
2 1 ? alfa-romero gas std two hatchback rwd front 94.5 ... 152 mpfi 2.68 3.47 9.0 154 5000 19 26 16500
3 2 164 audi gas std four sedan fwd front 99.8 ... 109 mpfi 3.19 3.40 10.0 102 5500 24 30 13950
4 2 164 audi gas std four sedan 4wd front 99.4 ... 136 mpfi 3.19 3.40 8.0 115 5500 18 22 17450

5 rows × 26 columns

i. Identify Missing Values

In [3]:
# import libraries
import numpy as np

# replace "?" with NaN
df.replace("?", np.nan, inplace = True)

# Evaluating for Missing Data using either .isnull() or .notnull()
missing_data = df.isnull()
missing_data.head()
Out[3]:
symboling normalized-losses make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base ... engine-size fuel-system bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price
0 False True False False False False False False False False ... False False False False False False False False False False
1 False True False False False False False False False False ... False False False False False False False False False False
2 False True False False False False False False False False ... False False False False False False False False False False
3 False False False False False False False False False False ... False False False False False False False False False False
4 False False False False False False False False False False ... False False False False False False False False False False

5 rows × 26 columns

ii. Deal with Missing Values

Options

1. Drop data

  • drop the whole row (observation) [recommended]
    Example: "price" -- Any data entry without price data is essentially useless as it cannot be used for prediction

  • In [4]:
    # drop all rows that do not have price data
    df.dropna(subset=["price"], axis=0, inplace=True)
    
    # reset the index, because we dropped the rows with missing price data
    df.reset_index(drop=True, inplace=True)
    df.head()
    
    Out[4]:
    symboling normalized-losses make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base ... engine-size fuel-system bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price
    0 3 NaN alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000 21 27 13495
    1 3 NaN alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000 21 27 16500
    2 1 NaN alfa-romero gas std two hatchback rwd front 94.5 ... 152 mpfi 2.68 3.47 9.0 154 5000 19 26 16500
    3 2 164 audi gas std four sedan fwd front 99.8 ... 109 mpfi 3.19 3.40 10.0 102 5500 24 30 13950
    4 2 164 audi gas std four sedan 4wd front 99.4 ... 136 mpfi 3.19 3.40 8.0 115 5500 18 22 17450

    5 rows × 26 columns

  • drop the whole column/variable -- whole columns should be dropped only if most entries in the column are empty (a minimal sketch is shown below)
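  • A minimal sketch of this option (illustrative only; the 50% threshold is an assumption, and no column in this dataset is actually dropped this way):

    # drop any column in which more than half of the entries are missing
    threshold = 0.5
    for col in df.columns:
        if df[col].isnull().mean() > threshold:
            df.drop(col, axis=1, inplace=True)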

2. Replace data -- less accurate, since we replace missing entries with a guess

  • replace it by mean
    Example: "normalized-losses", "stroke", "bore", "horsepower", "peak-rpm"

  • In [5]:
    # Calculate the average of the column
    avg_norm_loss = df["normalized-losses"].astype("float").mean(axis=0)
    avg_stroke = df["stroke"].astype("float").mean(axis = 0)
    avg_bore = df['bore'].astype('float').mean(axis=0)
    avg_horsepower = df['horsepower'].astype('float').mean(axis=0)
    avg_peakrpm = df['peak-rpm'].astype('float').mean(axis=0)
    
    In [6]:
    # Replace NaN with the column mean in each of these columns
    df["normalized-losses"].replace(np.nan, avg_norm_loss, inplace=True)
    df["stroke"].replace(np.nan, avg_stroke, inplace = True)
    df["bore"].replace(np.nan, avg_bore, inplace=True)
    df['horsepower'].replace(np.nan, avg_horsepower, inplace=True)
    df['peak-rpm'].replace(np.nan, avg_peakrpm, inplace=True)
    

  • replace it by frequency
    Example: "num-of-doors" -- As with any categorical variable, there is no meaningful average (there is no average fuel type, for example). 84% of sedans have four doors; since four doors is the most frequent value, it is the most likely to occur.

  • In [7]:
    # calculate the most common value in a particular column
    mcv = df['num-of-doors'].value_counts().idxmax()
    
    # replace the missing 'num-of-doors' values with the most frequent 
    df["num-of-doors"].replace(np.nan,mcv, inplace=True)
    

    iii. Correct Data Format

    Use:
    • .dtypes to check the data types
    • .astype() to change the data type
    In [8]:
    # df.describe()
    # df.describe(include = "all")
    # df.info()
    # df.dtypes
    # Convert data types to proper format
    df[["bore", "stroke"]] = df[["bore", "stroke"]].astype("float")
    df[["normalized-losses"]] = df[["normalized-losses"]].astype("int")
    df[["price"]] = df[["price"]].astype("float")
    df[["peak-rpm"]] = df[["peak-rpm"]].astype("float")
    
    df.dtypes
    
    Out[8]:
    symboling              int64
    normalized-losses      int32
    make                  object
    fuel-type             object
    aspiration            object
    num-of-doors          object
    body-style            object
    drive-wheels          object
    engine-location       object
    wheel-base           float64
    length               float64
    width                float64
    height               float64
    curb-weight            int64
    engine-type           object
    num-of-cylinders      object
    engine-size            int64
    fuel-system           object
    bore                 float64
    stroke               float64
    compression-ratio    float64
    horsepower            object
    peak-rpm             float64
    city-mpg               int64
    highway-mpg            int64
    price                float64
    dtype: object

    iv. Data Standardization

    Standardization is the process of transforming data into a common format which allows the researcher to make meaningful comparisons.
    Example: In our dataset, the fuel consumption columns "city-mpg" and "highway-mpg" are given in mpg (miles per gallon). Assume we are developing an application for a country that uses the L/100km standard; we would need to transform mpg to L/100km:
    L/100km = 235 / mpg
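    For example, a city-mpg of 21 converts to 235 / 21 ≈ 11.19 L/100km, which matches the city-L/100km value of the first rows in the output below.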

    In [9]:
    # Convert mpg to L/100km
    df['city-L/100km'] = 235/df["city-mpg"]
    
    # transform mpg to L/100km in the column of "highway-mpg" 
    df["highway-mpg"] = 235/df["highway-mpg"]
    
    # change the name of the column from "highway-mpg" to "highway-L/100km"
    df.rename(columns={'highway-mpg':'highway-L/100km'}, inplace=True)
    
    # check your transformed data 
    df.head()
    
    Out[9]:
    symboling normalized-losses make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base ... fuel-system bore stroke compression-ratio horsepower peak-rpm city-mpg highway-L/100km price city-L/100km
    0 3 122 alfa-romero gas std two convertible rwd front 88.6 ... mpfi 3.47 2.68 9.0 111 5000.0 21 8.703704 13495.0 11.190476
    1 3 122 alfa-romero gas std two convertible rwd front 88.6 ... mpfi 3.47 2.68 9.0 111 5000.0 21 8.703704 16500.0 11.190476
    2 1 122 alfa-romero gas std two hatchback rwd front 94.5 ... mpfi 2.68 3.47 9.0 154 5000.0 19 9.038462 16500.0 12.368421
    3 2 164 audi gas std four sedan fwd front 99.8 ... mpfi 3.19 3.40 10.0 102 5500.0 24 7.833333 13950.0 9.791667
    4 2 164 audi gas std four sedan 4wd front 99.4 ... mpfi 3.19 3.40 8.0 115 5500.0 18 10.681818 17450.0 13.055556

    5 rows × 27 columns

    v. Data Normalization

    Different columns of numerical data may have very different ranges, so direct comparison is often not meaningful. Normalization is a way to bring all data into a similar range for more useful comparison. For example, in the used car data set the feature "length" ranges from 150 to 250, while the features "width" and "height" range from 50 to 100. We could normalize these variables so that the range of the values is consistent (a sketch applying the first method appears after the list below).

    Methods of Normalizing Data

    • Simple Feature Scaling
    • $$ x_{new} = \frac{x_{old}}{x_{max}} $$ Divide each value by the maximum value for that feature. New values range between zero and one.
    df["length"] = df["length"]/df["length"].max()
    

    • Min-Max
    • $$ x_{new} = \frac{x_{old} - x_{min}}{x_{max} - x_{min}} $$ Min-Max subtracts the minimum value of the feature from each value, then divides by the range (maximum minus minimum) of that feature. Again, new values range between 0 and 1.
    df["length"] = (df["length"] - df["length"].min())/ (df["length"].max() - df["length"].min())
    

    • Z-Score or Standard Score
    • $$x_{new} = \frac{x_{old} - \mu }{\sigma}$$ For each value you subtract mu (the average of the feature) and then divide by the standard deviation sigma. New values hover around zero, and typically range between negative three and positive three but can be higher or lower.
    df["length"] = (df["length"] - df["length"].mean())/df["length"].std()
    
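    A minimal sketch, assuming we settle on simple feature scaling for these columns (note that the three one-line snippets above are alternative methods for the same column and would not be run in sequence):

    # scale "length", "width" and "height" into the 0-1 range by dividing by each column's maximum
    df["length"] = df["length"] / df["length"].max()
    df["width"] = df["width"] / df["width"].max()
    df["height"] = df["height"] / df["height"].max()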

    vi. Binning

    Binning is a process of transforming continuous numerical variables into discrete categorical 'bins', for grouped analysis. For example, you can bin “age” into [0 to 5], [6 to 10], [11 to 15] and so on.
    In our dataset, "horsepower" is a real-valued variable ranging from 48 to 288 with 57 unique values. What if we only care about the price difference between cars with high, medium, and low horsepower (3 types)? Can we rearrange them into three 'bins' to simplify the analysis?

    In [10]:
    # Convert data to correct format
    df["horsepower"]= df["horsepower"].astype(int, copy=True)
    
    # plot the histogram of horsepower to see what its distribution looks like
    from matplotlib import pyplot
    pyplot.hist(df["horsepower"])
    
    # set x/y labels and plot title
    pyplot.xlabel("horsepower")
    pyplot.ylabel("count")
    pyplot.title("horsepower bins")
    
    Out[10]:
    Text(0.5, 1.0, 'horsepower bins')
    In [11]:
    # use numpy's linspace() function to get 3 bins of equal width
    bins = np.linspace(min(df["horsepower"]), max(df["horsepower"]), 4)
    
    # set group names
    group_names = ['Low', 'Medium', 'High']
    
    # apply the function "cut" to determine which bin each value of df['horsepower'] belongs to
    df['horsepower-binned'] = pd.cut(df['horsepower'], bins, labels=group_names, include_lowest=True )
    df[['horsepower','horsepower-binned']].head(5)
    
    Out[11]:
    horsepower horsepower-binned
    0 111 Low
    1 111 Low
    2 154 Medium
    3 102 Low
    4 115 Low
    In [12]:
    # draw a histogram of the attribute "horsepower" with bins = 3
    pyplot.hist(df["horsepower"], bins = 3)
    
    # set x/y labels and plot title
    pyplot.xlabel("horsepower")
    pyplot.ylabel("count")
    pyplot.title("horsepower bins")
    
    Out[12]:
    Text(0.5, 1.0, 'horsepower bins')

    vii. Indicator variable

    An indicator variable (or dummy variable) is a numerical variable used to label categories. They are called 'dummies' because the numbers themselves don't have inherent meaning. Indicator variables let us use categorical variables in the regression analysis of the later modules. Example: We see the column "fuel-type" has two unique values, "gas" or "diesel". Regression doesn't understand words, only numbers. To use this attribute in regression analysis, we convert "fuel-type" into indicator variables.

    In [13]:
    # get indicator variables and assign it to data frame "dummy_variable_1"
    dummy_variable_1 = pd.get_dummies(df["fuel-type"])
    
    # get_dummies already names the new columns after the category values ("diesel" and "gas"),
    # so no renaming is needed here
    
    # merge data frame "df" and "dummy_variable_1" 
    df = pd.concat([df, dummy_variable_1], axis=1)
    
    # drop original column "fuel-type" from "df"
    df.drop("fuel-type", axis = 1, inplace=True)
    df.head()
    
    Out[13]:
    symboling normalized-losses make aspiration num-of-doors body-style drive-wheels engine-location wheel-base length ... compression-ratio horsepower peak-rpm city-mpg highway-L/100km price city-L/100km horsepower-binned diesel gas
    0 3 122 alfa-romero std two convertible rwd front 88.6 168.8 ... 9.0 111 5000.0 21 8.703704 13495.0 11.190476 Low 0 1
    1 3 122 alfa-romero std two convertible rwd front 88.6 168.8 ... 9.0 111 5000.0 21 8.703704 16500.0 11.190476 Low 0 1
    2 1 122 alfa-romero std two hatchback rwd front 94.5 171.2 ... 9.0 154 5000.0 19 9.038462 16500.0 12.368421 Medium 0 1
    3 2 164 audi std four sedan fwd front 99.8 176.6 ... 10.0 102 5500.0 24 7.833333 13950.0 9.791667 Low 0 1
    4 2 164 audi std four sedan 4wd front 99.4 176.6 ... 8.0 115 5500.0 18 10.681818 17450.0 13.055556 Low 0 1

    5 rows × 29 columns

    Exploratory Data Analysis (EDA) is an approach to analyzing data in order to summarize its main characteristics, gain a better understanding of the data set, uncover relationships between different variables, and extract important variables for the problem we are trying to solve.

    In [14]:
    # ***load data and store in dataframe df***
    # *final dataframe product after all the above treatment -- ready for analysis
    path='https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DA0101EN/automobileEDA.csv'
    df = pd.read_csv(path)
    

    i. Analyzing Individual Feature Patterns using Visualization

    a. Continuous Numerical Variables

    Continuous numerical variables may contain any value within some range. A great way to visualize these variables is by using scatterplots with fitted lines:

    In [15]:
    # import libraries
    import seaborn as sns
    import matplotlib.pylab as plt
    
    # correlation between: bore, stroke,compression-ratio, and horsepower
    # df[["bore", "stroke", "compression-ratio", "horsepower"]].corr() 
    
    # Engine size as potential predictor variable of price
    sns.regplot(x="engine-size", y="price", data=df)
    plt.ylim(0,)
    
    Out[15]:
    (0.0, 53442.49589622432)
    In [16]:
    # correlation between 'engine-size' and 'price' is approximately 0.87:
    df[["engine-size", "price"]].corr()
    
    Out[16]:
    engine-size price
    engine-size 1.000000 0.872335
    price 0.872335 1.000000

    A positive linear relationship is observed between engine size and price. The correlation between 'engine-size' and 'price' is approximately 0.87, so engine size seems like a fairly good predictor of price: the regression line is close to a perfect diagonal.

    b. Categorical Variables

    The categorical variables can have the type "object" or "int64". A good way to visualize categorical variables is by using boxplots.

    • The distributions of price between the different body-style categories have a significant overlap, and so body-style would not be a good predictor of price:
    In [17]:
    # the relationship between "body-style" and "price"
    sns.boxplot(x="body-style", y="price", data=df)
    
    Out[17]:
    <AxesSubplot:xlabel='body-style', ylabel='price'>
    • The distributions of price for the two engine-location categories, front and rear, are distinct enough to take engine-location as a potentially good predictor of price
    In [18]:
    sns.boxplot(x="engine-location", y="price", data=df)
    
    Out[18]:
    <AxesSubplot:xlabel='engine-location', ylabel='price'>
    • The distribution of price between the different drive-wheels categories differs; as such drive-wheels could potentially be a predictor of price.
    In [19]:
    # drive-wheels
    sns.boxplot(x="drive-wheels", y="price", data=df)
    
    Out[19]:
    <AxesSubplot:xlabel='drive-wheels', ylabel='price'>

    ii. Descriptive Statistical Analysis

    Value Counts

    Note: the method value_counts only works on Pandas series, not Pandas Dataframes.

    In [20]:
    drive_wheels_counts = df['drive-wheels'].value_counts().to_frame()
    drive_wheels_counts.rename(columns={'drive-wheels': 'value_counts'}, inplace=True)
    
    # rename the index to 'drive-wheels'
    drive_wheels_counts.index.name = 'drive-wheels'
    drive_wheels_counts
    
    Out[20]:
    value_counts
    drive-wheels
    fwd 118
    rwd 75
    4wd 8

    iii. Basics of Grouping

    Say we want to know which type of drive wheel is, on average, most valuable. We can group by "drive-wheels" and then average the price within each group.

    In [21]:
    # group by the variable "drive-wheels"
    # df['drive-wheels'].unique()
    # select columns
    df_group_one = df[['drive-wheels','body-style','price']]
    
    # calculate the average price for each of the different categories of data
    df_group_one = df_group_one.groupby(['drive-wheels'],as_index=False).mean()
    df_group_one
    
    Out[21]:
    drive-wheels price
    0 4wd 10241.000000
    1 fwd 9244.779661
    2 rwd 19757.613333
    In [22]:
    # You can also group with multiple variables
    # group dataframe by the unique combinations 'drive-wheels' and 'body-style'
    df_gptest = df[['drive-wheels','body-style','price']]
    grouped_test1 = df_gptest.groupby(['drive-wheels','body-style'],as_index=False).mean()
    grouped_test1
    
    Out[22]:
    drive-wheels body-style price
    0 4wd hatchback 7603.000000
    1 4wd sedan 12647.333333
    2 4wd wagon 9095.750000
    3 fwd convertible 11595.000000
    4 fwd hardtop 8249.000000
    5 fwd hatchback 8396.387755
    6 fwd sedan 9811.800000
    7 fwd wagon 9997.333333
    8 rwd convertible 23949.600000
    9 rwd hardtop 24202.714286
    10 rwd hatchback 14337.777778
    11 rwd sedan 21711.833333
    12 rwd wagon 16994.222222
    In [23]:
    # grouped data is much easier to visualize when it is made into a pivot table
    # A pivot table is like an Excel spreadsheet, with one variable along the column and another along the row
    grouped_pivot = grouped_test1.pivot(index='drive-wheels',columns='body-style')
    grouped_pivot
    
    Out[23]:
    price
    body-style convertible hardtop hatchback sedan wagon
    drive-wheels
    4wd NaN NaN 7603.000000 12647.333333 9095.750000
    fwd 11595.0 8249.000000 8396.387755 9811.800000 9997.333333
    rwd 23949.6 24202.714286 14337.777778 21711.833333 16994.222222
    In [24]:
    # fill missing values with 0
    grouped_pivot = grouped_pivot.fillna(0) 
    grouped_pivot
    
    Out[24]:
    price
    body-style convertible hardtop hatchback sedan wagon
    drive-wheels
    4wd 0.0 0.000000 7603.000000 12647.333333 9095.750000
    fwd 11595.0 8249.000000 8396.387755 9811.800000 9997.333333
    rwd 23949.6 24202.714286 14337.777778 21711.833333 16994.222222
    In [25]:
    # Let's use a heat map to visualize the relationship between drive-wheels, body-style and price
    # The heatmap represents the target variable (price) as colour, with 'drive-wheels' on the
    # vertical axis and 'body-style' on the horizontal axis
    # This allows us to visualize how the price is related to 'drive-wheels' and 'body-style'.
    
    fig, ax = plt.subplots()
    im = ax.pcolor(grouped_pivot, cmap='RdBu')
    
    #label names
    row_labels = grouped_pivot.columns.levels[1]
    col_labels = grouped_pivot.index
    
    #move ticks and labels to the center
    ax.set_xticks(np.arange(grouped_pivot.shape[1]) + 0.5, minor=False)
    ax.set_yticks(np.arange(grouped_pivot.shape[0]) + 0.5, minor=False)
    
    #insert labels
    ax.set_xticklabels(row_labels, minor=False)
    ax.set_yticklabels(col_labels, minor=False)
    
    #rotate label if too long
    plt.xticks(rotation=90)
    
    fig.colorbar(im)
    plt.show()
    

    iv. Correlation and Causation

    Pearson Correlation

    Pearson Correlation is the default method of the function corr. We can calculate the Pearson Correlation of 'int64' or 'float64' variables. The Pearson Correlation coefficient is a value between -1 and 1 inclusive, where:

    • 1: Total positive linear correlation.
    • 0: No linear correlation, the two variables most likely do not affect each other.
    • -1: Total negative linear correlation.

    P-value

    The P-value tells us how certain we can be that the correlation we calculated is statistically significant.
    Normally, we choose a significance level of 0.05, which means that we are 95% confident that the correlation between the variables is significant. By convention, when the:

    • p-value is < 0.001: there is strong evidence that the correlation is significant.
    • p-value is < 0.05: there is moderate evidence that the correlation is significant.
    • p-value is < 0.1: there is weak evidence that the correlation is significant.
    • p-value is > 0.1: there is no evidence that the correlation is significant.

    In [26]:
    # import libraries
    from scipy import stats
    
    # Horsepower vs Price
    pearson_coef, p_value = stats.pearsonr(df['horsepower'], df['price'])
    print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P = ", p_value) 
    
    The Pearson Correlation Coefficient is 0.8095745670036559  with a P-value of P =  6.369057428260101e-48
    

    Conclusion: Since the p-value is < 0.001, the correlation between horsepower and price is statistically significant, and the linear relationship is quite strong (~0.809, close to 1)
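    The same check can be repeated for other candidate predictors. A minimal sketch (not part of the original run; the feature list is illustrative) looping stats.pearsonr over a few numeric features:

    # Pearson coefficient and p-value of each feature against price
    for feature in ['wheel-base', 'curb-weight', 'engine-size', 'highway-mpg']:
        coef, p = stats.pearsonr(df[feature], df['price'])
        print(feature, "-> r =", round(coef, 3), ", p =", p)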

    v. Analysis of Variance (ANOVA)

    The ANOVA is a statistical method used to:

    • test whether there are significant differences between the means of two or more groups.
    • [It can be used to] find the correlation between different groups of a categorical variable.
    ANOVA returns two parameters:
    • F-test score: variation between sample group means divided by variation within sample group.
      ANOVA starts from the null hypothesis that the means of all groups are the same, calculates how much the actual means deviate from that assumption, and reports it as the F-test score.
      A larger score means there is a larger difference between the means.
    • p-value: tells how statistically significant our calculated F-test score is

    Drive Wheels

    Since ANOVA analyzes the difference between different groups of the same variable, the groupby function will come in handy. Let's see if different types of 'drive-wheels' impact 'price'.

    In [27]:
    # we group the data
    # grouped_test2=df_gptest[['drive-wheels', 'price']].groupby(['drive-wheels'])
    grouped_test2=df[['drive-wheels', 'price']].groupby(['drive-wheels'])
    
    # obtain the values of the method group using the method "get_group".
    grouped_test2.get_group('4wd')['price']
    
    # use the function 'f_oneway' in the module 'stats' to obtain the F-test score and P-value.
    f_val, p_val = stats.f_oneway(grouped_test2.get_group('fwd')['price'], grouped_test2.get_group('rwd')['price'], grouped_test2.get_group('4wd')['price'])  
     
    print( "ANOVA results: F=", f_val, ", P =", p_val)  
    
    ANOVA results: F= 67.95406500780399 , P = 3.3945443577151245e-23
    

    This is a great result: the large F-test score shows a strong difference in price between the drive-wheels groups, and a P-value of almost 0 implies near-certain statistical significance.

    But does this mean all three tested groups differ from each other this strongly?
    In [28]:
    # Separately: fwd and rwd
    f_val, p_val = stats.f_oneway(grouped_test2.get_group('fwd')['price'], grouped_test2.get_group('rwd')['price'])  
     
    print( "ANOVA results: F=", f_val, ", P =", p_val )
    
    ANOVA results: F= 130.5533160959111 , P = 2.2355306355677845e-23
    
    In [29]:
    # 4wd and rwd
    f_val, p_val = stats.f_oneway(grouped_test2.get_group('4wd')['price'], grouped_test2.get_group('rwd')['price'])  
       
    print( "ANOVA results: F=", f_val, ", P =", p_val) 
    
    ANOVA results: F= 8.580681368924756 , P = 0.004411492211225333
    
    In [30]:
    # 4wd and fwd
    f_val, p_val = stats.f_oneway(grouped_test2.get_group('4wd')['price'], grouped_test2.get_group('fwd')['price'])  
     
    print("ANOVA results: F=", f_val, ", P =", p_val)  
    
    ANOVA results: F= 0.665465750252303 , P = 0.41620116697845666
    

    Conclusion: Important Variables

    We now have a better idea of what our data looks like and which variables are important to take into account when predicting the car price. We have narrowed it down to the following variables:

    1. Continuous numerical variables: Length, Width, Curb-weight, Engine-size, Horsepower, City-mpg, Highway-mpg, Wheel-base, Bore
    2. Categorical variables: Drive-wheels

    As we now move into building machine learning models to automate our analysis, feeding the model with variables that meaningfully affect our target variable will improve our model's prediction performance.

    In [31]:
    # load data and store in dataframe df
    path = 'https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DA0101EN/automobileEDA.csv'
    
    df = pd.read_csv(path)
    df.head()
    
    Out[31]:
    symboling normalized-losses make aspiration num-of-doors body-style drive-wheels engine-location wheel-base length ... compression-ratio horsepower peak-rpm city-mpg highway-mpg price city-L/100km horsepower-binned diesel gas
    0 3 122 alfa-romero std two convertible rwd front 88.6 0.811148 ... 9.0 111.0 5000.0 21 27 13495.0 11.190476 Medium 0 1
    1 3 122 alfa-romero std two convertible rwd front 88.6 0.811148 ... 9.0 111.0 5000.0 21 27 16500.0 11.190476 Medium 0 1
    2 1 122 alfa-romero std two hatchback rwd front 94.5 0.822681 ... 9.0 154.0 5000.0 19 26 16500.0 12.368421 Medium 0 1
    3 2 164 audi std four sedan fwd front 99.8 0.848630 ... 10.0 102.0 5500.0 24 30 13950.0 9.791667 Medium 0 1
    4 2 164 audi std four sedan 4wd front 99.4 0.848630 ... 8.0 115.0 5500.0 18 22 17450.0 13.055556 Medium 0 1

    5 rows × 29 columns

    i. (Simple) Linear Regression

    SLR helps us understand the relationship between 2 variables:

    • The predictor/independent variable(x)
    • The target/dependent variable (y)

    The result of Linear Regression is a linear function that predicts the response/target/dependent variable as a function of the predictor/independent variable.

    $$Linear function: Yhat = a + b X $$
    • a -- the intercept of the regression line; in other words: the value of Y when X is 0
    • b -- the slope of the regression line, in other words: the value with which Y changes when X increases by 1 unit

    How could Highway-mpg help us predict car price?

    To find out, we will use simple linear regression to create a linear function with "highway-mpg" as the predictor variable and the "price" as the response variable.

    In [32]:
    # import libraries
    # load the modules for linear regression
    from sklearn.linear_model import LinearRegression
    
    # define the variables
    X = df[['highway-mpg']]
    Y = df['price']
    
    # Create the linear regression object
    lm = LinearRegression()
    
    # Fit the linear model using highway-mpg
    lm.fit(X,Y)
    
    Out[32]:
    LinearRegression()
    In [33]:
    # What is the value of the intercept (a)?
    lm.intercept_
    
    Out[33]:
    38423.3058581574
    In [34]:
    # What is the value of the Slope (b)?
    lm.coef_
    
    Out[34]:
    array([-821.73337832])
    $$ price = 38423.31 - 821.73(highway-mpg) $$
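    For example, a car with a highway-mpg of 30 has a predicted price of roughly 38423.31 - 821.73 × 30 ≈ 13,771 dollars.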

    ii. Multiple Linear Regression

    Multiple Linear Regression is used to explain the relationship between one continuous dependent variable and two or more independent variables. Most real-world regression models involve multiple predictor/independent variables. We will illustrate the structure using four predictor variables, but these results generalize to any number of predictors:

    $$ Y: Response \ Variable\\ X_1 :Predictor\ Variable \ 1\\ X_2: Predictor\ Variable \ 2\\ X_3: Predictor\ Variable \ 3\\ X_4: Predictor\ Variable \ 4\\ $$$$ a: intercept\\ b_1 :coefficients \ of\ Variable \ 1\\ b_2: coefficients \ of\ Variable \ 2\\ b_3: coefficients \ of\ Variable \ 3\\ b_4: coefficients \ of\ Variable \ 4\\ $$$$ Yhat = a + b_1 X_1 + b_2 X_2 + b_3 X_3 + b_4 X_4 $$

    From the previous section we know that other good predictors of price could be:

    • Horsepower
    • Curb-weight
    • Engine-size
    • Highway-mpg
    In [35]:
    # develop a model using these variables as the predictor variables
    Z = df[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']]
    
    # Fit the linear model using the four above-mentioned variables.
    # lm.fit(Z, df['price'])
    lm.fit(Z, Y)
    
    Out[35]:
    LinearRegression()
    In [36]:
    # What is the value of the intercept(a)?
    lm.intercept_
    
    Out[36]:
    -15806.62462632923
    In [37]:
    # What are the values of the coefficients (b1, b2, b3, b4)?
    lm.coef_
    
    Out[37]:
    array([53.49574423,  4.70770099, 81.53026382, 36.05748882])

    What is the final estimated linear model that we get?

    $$ price = -15806.62462632923 + 53.49574423(horsepower) + 4.70770099(curb-weight) + 81.53026382(engine-size) + 36.05748882(highway-mpg) $$
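    As a usage sketch, the fitted model can price a hypothetical car; the feature values below are illustrative, not taken from the dataset:

    # predict the price of a hypothetical car with the fitted multiple linear regression model
    sample = pd.DataFrame({'horsepower': [100], 'curb-weight': [2500],
                           'engine-size': [120], 'highway-mpg': [30]})
    print(lm.predict(sample))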

    iii. Model Evaluation using Visualization

    Now that we've developed some models, how do we evaluate our models and how do we choose the best one? One way to do this is by using visualization.

    Regression Plot

    This plot combines a scatter plot with the fitted linear regression line going through the data. It gives us a reasonable estimate of:

    • The relationship between the two variables
    • the strength of the correlation
    • the direction (positive or negative correlation)
    In [38]:
    # Highway-mpg as potential predictor variable of price
    width = 12
    height = 10
    plt.figure(figsize=(width, height))
    sns.regplot(x="highway-mpg", y="price", data=df)
    plt.ylim(0,)
    
    Out[38]:
    (0.0, 48193.04207837188)

    Pay attention to how scattered the data points are around the regression line. This will give you a good indication of the variance of the data, and whether a linear model would be the best fit or not. If the data is too far off from the line, this linear model might not be the best model for the data.

    In [39]:
    # "peak-rpm" as potential predictor variable of price
    plt.figure(figsize=(width, height))
    sns.regplot(x="peak-rpm", y="price", data=df)
    plt.ylim(0,)
    
    Out[39]:
    (0.0, 47414.1)

    Comparing the regression plots of "peak-rpm" and "highway-mpg", we see that the points for "highway-mpg" are much closer to the fitted line and, on average, clearly decrease. The points for "peak-rpm" are more spread out around the predicted line, and it is much harder to tell whether they are increasing or decreasing.

    In [40]:
    df[["peak-rpm","highway-mpg","price"]].corr()
    
    Out[40]:
    peak-rpm highway-mpg price
    peak-rpm 1.000000 -0.058598 -0.101616
    highway-mpg -0.058598 1.000000 -0.704692
    price -0.101616 -0.704692 1.000000

    a. Residual Plot

    The residual (e) is the difference between the observed value (y) and the predicted value (Yhat). On a regression plot, the residual is the distance from a data point to the fitted regression line. A residual plot shows the residuals on the vertical y-axis and the independent variable on the horizontal x-axis. When looking at a residual plot, pay attention to the spread of the residuals:

    • If the points are randomly spread out around the x-axis, then a linear model is appropriate for the data. Randomly spread out residuals means that the variance is constant, and thus the linear model is a good fit for this data.

    In the residual plot below, the residuals are not randomly spread around the x-axis, which leads us to believe that maybe a non-linear model is more appropriate for this data.

    In [41]:
    plt.figure(figsize=(width, height))
    sns.residplot(x=df['highway-mpg'], y=df['price'])
    plt.show()
    

    b. Distribution Plot

    Visualizing a model for Multiple Linear Regression is a bit more complicated because you can't visualize it with a regression or residual plot.
    One way to look at the fit of the model is by looking at the distribution plot.
    We can look at the distribution of the fitted values that result from the model and compare it to the distribution of the actual values.

    In [42]:
    # First lets make a prediction
    Y_hat = lm.predict(Z)
    
    In [43]:
    plt.figure(figsize=(width, height))
    
    #kind{“hist”, “kde”, “ecdf”}
    ax1 = sns.kdeplot(df['price'], color="r", label="Actual Value")
    sns.kdeplot(Y_hat, color="b", label="Fitted Values" , ax=ax1)
    
    plt.title('Actual vs Fitted Values for Price')
    plt.xlabel('Price (in dollars)')
    plt.ylabel('Proportion of Cars')
    
    plt.show()
    plt.close()
    

    We can see that the fitted values are reasonably close to the actual values, since the two distributions overlap a bit. However, there is definitely some room for improvement.

    iv. Polynomial Regression and Pipelines

    Polynomial regression is a particular case of the simple linear regression model or multiple linear regression models. We get non-linear relationships by squaring or setting higher-order terms of the predictor variables.
    There are different orders of polynomial regression:

    Quadratic - 2nd order
    $$ Yhat = a + b_1 X + b_2 X^2 $$

    Cubic - 3rd order
    $$ Yhat = a + b_1 X + b_2 X^2 + b_3 X^3 $$

    Higher order:
    $$ Yhat = a + b_1 X + b_2 X^2 + b_3 X^3 + \ldots $$

    We saw earlier that a linear model did not provide the best fit while using highway-mpg as the predictor variable. Let's see if we can try fitting a polynomial model to the data instead. We will use the following function to plot the data:

    In [44]:
    def PlotPolly(model, independent_variable, dependent_variable, Name):
        x_new = np.linspace(15, 55, 100)
        y_new = model(x_new)
        
        plt.rcParams["figure.figsize"] = (12,12)
    
        plt.plot(independent_variable, dependent_variable, '.', x_new, y_new, '-')
        plt.title('Polynomial Fit with Matplotlib for Price ~ Length')
        ax = plt.gca()
        ax.set_facecolor((0.898, 0.898, 0.898))
        fig = plt.gcf()
        plt.xlabel(Name)
        plt.ylabel('Price of Cars')
    
        plt.show()
        plt.close()
    
    In [45]:
    # get the variables
    x = df['highway-mpg']
    y = df['price']
    
    # fit the polynomial using the function polyfit, then use the function poly1d to display the polynomial function.
    # Here we use a polynomial of the 3rd order (cubic) 
    f = np.polyfit(x, y, 3)
    p = np.poly1d(f)
    print(p)
    
            3         2
    -1.557 x + 204.8 x - 8965 x + 1.379e+05
    
    In [46]:
    # plot the function
    PlotPolly(p, x, y, 'highway-mpg')
    

    We can already see from plotting that this polynomial model performs better than the linear model. This is because the generated polynomial function "hits" more of the data points.

    The analytical expression for Multivariate Polynomial function gets complicated. For example, the expression for a second-order (degree=2) polynomial with two variables is given by: $$ Yhat = a + b_1 X_1 +b_2 X_2 +b_3 X_1 X_2+b_4 X_1^2+b_5 X_2^2 $$

    We can perform a polynomial transform on multiple features (variables):

    In [47]:
    # import libraries
    from sklearn.preprocessing import PolynomialFeatures
    
    # create a PolynomialFeatures object of degree 2
    pr=PolynomialFeatures(degree=2)
    
    Z_pr=pr.fit_transform(Z)
    

    In the original data, there are 201 samples and 4 features:

    In [48]:
    Z.shape
    
    Out[48]:
    (201, 4)

    After the transformation, there are 201 samples and 15 features:

    In [49]:
    Z_pr.shape
    
    Out[49]:
    (201, 15)

    Pipeline

    Data Pipelines simplify the steps of processing the data.

    In [50]:
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    

  • We create the pipeline by creating a list of tuples including the name of the model or estimator and its corresponding constructor.

  • In [51]:
    Input=[('scale',StandardScaler()), ('polynomial', PolynomialFeatures(include_bias=False)), ('model',LinearRegression())]
    

  • We input the list as an argument to the pipeline constructor:

  • In [52]:
    pipe=Pipeline(Input)
    pipe
    
    Out[52]:
    Pipeline(steps=[('scale', StandardScaler()),
                    ('polynomial', PolynomialFeatures(include_bias=False)),
                    ('model', LinearRegression())])

  • We convert the data in Z to type float to avoid the conversion warnings that may appear because StandardScaler expects float inputs.

  • In [53]:
    Z = Z.astype(float)
    

  • Then, we can normalize the data, perform a transform and fit the model simultaneously.

  • In [54]:
    pipe.fit(Z,y)
    
    Out[54]:
    Pipeline(steps=[('scale', StandardScaler()),
                    ('polynomial', PolynomialFeatures(include_bias=False)),
                    ('model', LinearRegression())])

  • Similarly, we can normalize the data, perform a transform and produce a prediction simultaneously.

  • In [55]:
    ypipe=pipe.predict(Z)
    ypipe[0:3]
    
    Out[55]:
    array([13102.74784201, 13102.74784201, 18225.54572197])

    v. Measures for In-Sample Evaluation

    When evaluating our models, not only do we want to visualize the results, we also want a quantitative measure of how well they fit.
    Two very important measures often used in Statistics are:


  • R^2 / R-squared / Coefficient of Determination
    A measure to indicate how close the data is to the fitted regression line.
    The value of the R-squared is the percentage of variation of the response variable (y) that is explained by a linear model

  • Mean Squared Error (MSE)
    MSE measures the average of the squares of errors -- that is, the difference between actual value (y) and the estimated value (ŷ).
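    In formulas (standard definitions, where y_i are the actual values, yhat_i the predicted values, ybar the mean of the actual values, and n the number of samples):
    $$ MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 \qquad R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2} $$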
    Model 1: Simple Linear Regression

    R^2:

    In [56]:
    # highway_mpg_fit
    lm.fit(X, Y)
    print('The R-square is: ', lm.score(X, Y))
    
    The R-square is:  0.4965911884339176
    

    We can say that ~49.66% of the variation in price is explained by this simple linear model fitted on "highway-mpg".

    MSE:

    In [57]:
    from sklearn.metrics import mean_squared_error
    
    # predict the output
    Yhat=lm.predict(X)
    # print('The output of the first four predicted value is: ', Yhat[0:4])
    
    # compare the predicted results with the actual results
    mse = mean_squared_error(df['price'], Yhat)
    print('The mean square error of price and predicted value is: ', mse)
    
    The mean square error of price and predicted value is:  31635042.944639888
    

    Model 2: Multiple Linear Regression

    R^2:

    In [58]:
    # fit the model 
    lm.fit(Z, df['price'])
    # Find the R^2
    print('The R-square is: ', lm.score(Z, df['price']))
    
    The R-square is:  0.8093562806577457
    

    We can say that ~80.94% of the variation in price is explained by this multiple linear regression "multi_fit".

    MSE:

    In [59]:
    # produce a prediction
    Y_predict_multifit = lm.predict(Z)
    
    # compare the predicted results with the actual results
    print('The mean square error of price and predicted value using multifit is: ', \
          mean_squared_error(df['price'], Y_predict_multifit))
    
    The mean square error of price and predicted value using multifit is:  11980366.87072649
    

    Model 3: Polynomial Fit

    R^2:

    In [60]:
    from sklearn.metrics import r2_score
    
    r_squared = r2_score(y, p(x))
    print('The R-square value is: ', r_squared)
    
    The R-square value is:  0.674194666390652
    

    We can say that ~67.419 % of the variation of price is explained by this polynomial fit.

    MSE:

    In [61]:
    mean_squared_error(df['price'], p(x))
    
    Out[61]:
    20474146.426361218

    vi. Prediction and Decision Making

    a. Prediction

    In the previous section, we trained the model using the method fit.
    Now we will use the method predict to produce a prediction.

    In [62]:
    new_input=np.arange(1, 100, 1).reshape(-1, 1)
    
    In [63]:
    # Fit the model:
    lm.fit(X.values, Y)
    
    # Produce a prediction
    yhat=lm.predict(new_input)
    yhat[0:3]
    
    Out[63]:
    array([37601.57247984, 36779.83910151, 35958.10572319])
    In [64]:
    # plot the data
    plt.plot(new_input, yhat)
    plt.show()
    

    b. Decision Making: Determining a Good Model Fit

    Now that we have visualized the different models, and generated the R-squared and MSE values for the fits, how do we determine a good model fit?

    • What is a good R-squared value? -- Models with higher R-squared values are better fits for data
    • What is a good MSE? -- Models with smaller MSE values are better fits for data
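    A small sketch collecting the three models' in-sample scores side by side (values copied from the outputs above):

    # in-sample R^2 and MSE reported earlier in this notebook
    comparison = pd.DataFrame({'R^2': [0.4966, 0.8094, 0.6742],
                               'MSE': [31635042.94, 11980366.87, 20474146.43]},
                              index=['SLR (highway-mpg)', 'MLR (4 features)', 'Polynomial (cubic)'])
    print(comparison)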

    Conclusion

    Comparing the SLR, MLR, and Polynomial regression models, we conclude that the MLR model is the best model for predicting price from our dataset. This result makes sense, since the dataset contains many variables and we know that more than one of them is a potential predictor of the final car price.

    We have built models and made predictions (of vehicle prices). Now we will determine how accurate these predictions are

    In [65]:
    # Import clean data 
    path = 'https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DA0101EN/module_5_auto.csv'
    df = pd.read_csv(path)
    
    # First lets only use numeric data
    df=df._get_numeric_data()
    df.head()
    
    Out[65]:
    Unnamed: 0 Unnamed: 0.1 symboling normalized-losses wheel-base length width height curb-weight engine-size ... stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price city-L/100km diesel gas
    0 0 0 3 122 88.6 0.811148 0.890278 48.8 2548 130 ... 2.68 9.0 111.0 5000.0 21 27 13495.0 11.190476 0 1
    1 1 1 3 122 88.6 0.811148 0.890278 48.8 2548 130 ... 2.68 9.0 111.0 5000.0 21 27 16500.0 11.190476 0 1
    2 2 2 1 122 94.5 0.822681 0.909722 52.4 2823 152 ... 3.47 9.0 154.0 5000.0 19 26 16500.0 12.368421 0 1
    3 3 3 2 164 99.8 0.848630 0.919444 54.3 2337 109 ... 3.40 10.0 102.0 5500.0 24 30 13950.0 9.791667 0 1
    4 4 4 2 164 99.4 0.848630 0.922222 54.3 2824 136 ... 3.40 8.0 115.0 5500.0 18 22 17450.0 13.055556 0 1

    5 rows × 21 columns

    In [66]:
    # Libraries for plotting
    from IPython.display import display
    from ipywidgets import interact, interactive, fixed, interact_manual, widgets
    
    # Functions for plotting
    def DistributionPlot(RedFunction, BlueFunction, RedName, BlueName, Title):
        width = 12
        height = 10
        plt.figure(figsize=(width, height))
    
    #     ax1 = sns.distplot(RedFunction, hist=False, color="r", label=RedName)
    #     ax2 = sns.distplot(BlueFunction, hist=False, color="b", label=BlueName, ax=ax1)
        ax1 = sns.kdeplot(RedFunction, color="r", label=RedName)
        ax2 = sns.kdeplot(BlueFunction, color="b", label=BlueName, ax=ax1)
    
        plt.title(Title)
        plt.xlabel('Price (in dollars)')
        plt.ylabel('Proportion of Cars')
    
        plt.show()
        plt.close()
    
        
    def PollyPlot(xtrain, xtest, y_train, y_test, lr,poly_transform):
        width = 12
        height = 10
        plt.figure(figsize=(width, height))
     
        xmax=max([xtrain.values.max(), xtest.values.max()])
        xmin=min([xtrain.values.min(), xtest.values.min()])
    
        x=np.arange(xmin, xmax, 0.1)
    
        plt.plot(xtrain, y_train, 'ro', label='Training Data')
        plt.plot(xtest, y_test, 'go', label='Test Data')
        plt.plot(x, lr.predict(poly_transform.fit_transform(x.reshape(-1, 1))), label='Predicted Function')
        plt.ylim([-10000, 60000])
        plt.ylabel('Price')
        plt.legend()
    

    i. Training and Testing

    An important step in testing your model is to split your data into training and testing data.

    R^2

    In [67]:
    # place the target data price in a separate dataframe y:
    y_data = df['price']
    
    # drop price data in x data
    x_data=df.drop('price',axis=1)
    # x_data.head()
    
    # Now we randomly split our data into training and testing data using the function train_test_split
    from sklearn.model_selection import train_test_split
    
    # split up the data set such that 15% of the data samples will be used for testing
    x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.15, random_state=1)
    
    #y_train = a + b(x_train)
    #y_hat   = a + b(x_test)
    #lre.score below predicts y_hat and compares it with y_test to give a score
    
    
    print("number of test samples :", x_test.shape[0])
    print("number of training samples:",x_train.shape[0])
    
    number of test samples : 31
    number of training samples: 170
    
    In [68]:
    # import LinearRegression
    from sklearn.linear_model import LinearRegression
    
    # create a Linear Regression object
    lre=LinearRegression()
    
    # fit the model using the feature horsepower
    lre.fit(x_train[['horsepower']], y_train)
    
    # Calculate the R^2 on the test data
    lre.score(x_test[['horsepower']], y_test)
    
    Out[68]:
    0.707688374146705
    In [69]:
    lre.score(x_train[['horsepower']], y_train)
    
    Out[69]:
    0.6449517437659684

    Cross-validation Score

    In [70]:
    from sklearn.model_selection import cross_val_score
    
    # Calculate the average R^2 using four folds, and 'horsepower' as a feature
    # We input the object, the feature in this case ' horsepower', the target data (y_data). 
    # The parameter 'cv' determines the number of folds; in this case 4.
    # The default scoring is R^2; each element in the array has the average R^2 value in the fold
    Rcross = cross_val_score(lre, x_data[['horsepower']], y_data, cv=4)
    
    In [71]:
    Rcross
    
    Out[71]:
    array([0.7746232 , 0.51716687, 0.74785353, 0.04839605])
    In [72]:
    # calculate the average and standard deviation of our estimate(Rcross):
    print("The mean of the folds are", Rcross.mean(), "and the standard deviation is" , Rcross.std())
    
    The mean of the folds are 0.522009915042119 and the standard deviation is 0.2911839444756029
    

    Negative Squared Error

    In [73]:
    -1 * cross_val_score(lre,x_data[['horsepower']], y_data,cv=4,scoring='neg_mean_squared_error')
    
    Out[73]:
    array([20254142.84026702, 43745493.2650517 , 12539630.34014931,
           17561927.7224759 ])

    You can also use the function 'cross_val_predict' to predict the output. The function splits the data into the specified number of folds and, for each fold, produces predictions for that fold using a model trained on the remaining folds.

    In [74]:
    from sklearn.model_selection import cross_val_predict
    # We input the object, the feature in this case 'horsepower' , the target data y_data. 
    # The parameter 'cv' determines the number of folds; in this case 4. We can produce an output:
    yhat = cross_val_predict(lre,x_data[['horsepower']], y_data,cv=4)
    yhat[0:5]
    
    Out[74]:
    array([14141.63807508, 14141.63807508, 20814.29423473, 12745.03562306,
           14762.35027598])

    ii. Overfitting

    • Overfitting occurs when the model fits the noise, not the underlying process.
    • Note: The test data -- sometimes referred to as the out of sample data -- is a much better measure of how well your model performs in the real world:
    In [75]:
    # create Multiple linear regression object
    lr = LinearRegression()
    
    # train the model using 'horsepower', 'curb-weight', 'engine-size' and 'highway-mpg' as features.
    lr.fit(x_train[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']], y_train)
    
    # Prediction using training data:
    yhat_train = lr.predict(x_train[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']])
    yhat_train[0:5]
    
    Out[75]:
    array([11927.70699817, 11236.71672034,  6436.91775515, 21890.22064982,
           16667.18254832])
    In [76]:
    # Prediction using test data:
    yhat_test = lr.predict(x_test[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']])
    yhat_test[0:5]
    
    Out[76]:
    array([11349.16502418,  5914.48335385, 11243.76325987,  6662.03197043,
           15555.76936275])

    Let's perform some model evaluation using our training and testing data separately

    In [77]:
    # import the seaborn and matplotlib
    import matplotlib.pyplot as plt
    import seaborn as sns
    
    # examine the distribution of the predicted values of the training data.
    plt.figure(figsize=(width, height))
    ax1 = sns.kdeplot(y_train, color="r", label="Actual Values (Train)")
    sns.kdeplot(yhat_train, color="b", label = "Predicted Values (Train)", legend=True, ax=ax1)
    
    plt.title('Distribution  Plot of  Predicted Value Using Training Data vs Training Data Distribution')
    plt.xlabel('Price (in dollars)')
    plt.ylabel('Proportion of Cars')
    
    plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
    
    plt.show()
    plt.close()
    

    Figure 1: Plot of predicted values using the training data compared to the training data.


    So far the model seems to be doing well in learning from the training dataset. But what happens when the model encounters new data from the testing dataset? When the model generates new values from the test data, we see the distribution of the predicted values is much different from the actual target values.

    In [78]:
    # import the seaborn and matplotlib
    import matplotlib.pyplot as plt
    import seaborn as sns
    
    # examine the distribution of the predicted values of the test data.
    plt.figure(figsize=(width, height))
    ax1 = sns.kdeplot(y_test, color="r", label="Actual Values (Test)")
    sns.kdeplot(yhat_test, color="b", label = "Predicted Values (Test)", legend=True, ax=ax1)
    
    plt.title('Distribution  Plot of  Predicted Value Using Test Data vs Data Distribution of Test Data')
    plt.xlabel('Price (in dollars)')
    plt.ylabel('Proportion of Cars')
    
    plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
    
    plt.show()
    plt.close()
    

    Figure 2: Plot of predicted value using the test data compared to the test data.


    Comparing Figure 1 and Figure 2, it is evident that the predicted values in Figure 1 fit the training data much better than the predicted values in Figure 2 fit the test data. The difference in Figure 2 is most apparent in the range from 5,000 to 15,000, where the distribution shapes are exceptionally different. Let's see whether polynomial regression also exhibits a drop in prediction accuracy when analysing the test dataset.

    In [79]:
    from sklearn.preprocessing import PolynomialFeatures
    x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.45, random_state=0)
    
    # We will perform a degree 5 polynomial transformation on the feature 'horsepower'
    pr = PolynomialFeatures(degree=5)
    x_train_pr = pr.fit_transform(x_train[['horsepower']])
    x_test_pr = pr.fit_transform(x_test[['horsepower']])
    pr
    
    Out[79]:
    PolynomialFeatures(degree=5)
    In [80]:
    # create a linear regression model "poly" and train it
    poly = LinearRegression()
    poly.fit(x_train_pr, y_train)
    
    Out[80]:
    LinearRegression()
    In [81]:
    # We can see the output of our model using the method "predict." then assign the values to "yhat"
    yhat = poly.predict(x_test_pr)
    
    # Let's take the first five predicted values and compare it to the actual targets.
    print("Predicted values:", yhat[0:4])
    print("True values:", y_test[0:4].values)
    
    Predicted values: [ 6728.65566037  7307.9878638  12213.7877412  18893.24796457]
    True values: [ 6295. 10698. 13860. 13499.]
    
    In [82]:
    # use the function "PollyPlot" that we defined at the beginning of the lab 
    # to display the training data, testing data, and the predicted function.
    PollyPlot(x_train[['horsepower']], x_test[['horsepower']], y_train, y_test, poly,pr)
    

    Figure 4 A polynomial regression model. Red dots represent training data, green dots represent test data, and the blue line represents the model prediction.


    We see that the estimated function appears to track the data but around 200 horsepower, the function begins to diverge from the data points.

    R^2 of the training data:

    In [83]:
    poly.score(x_train_pr, y_train)
    
    Out[83]:
    0.5567716902121724

    R^2 of the test data:

    In [84]:
    poly.score(x_test_pr, y_test)
    
    Out[84]:
    -29.871340540625788

    We see the R^2 for the training data is 0.5568, while the R^2 on the test data is -29.87. The lower the R^2, the worse the model; a negative R^2 is a sign of overfitting. Let's see how the R^2 on the test data changes for polynomials of different order and plot the results:

    In [85]:
    Rsqu_test = []
    
    order = [1, 2, 3, 4]
    for n in order:
        pr = PolynomialFeatures(degree=n)
        
        x_train_pr = pr.fit_transform(x_train[['horsepower']])
        
        x_test_pr = pr.fit_transform(x_test[['horsepower']])    
        
        lr.fit(x_train_pr, y_train)
        
        Rsqu_test.append(lr.score(x_test_pr, y_test))
    
    plt.plot(order, Rsqu_test)
    plt.xlabel('order')
    plt.ylabel('R^2')
    plt.title('R^2 Using Test Data')
    plt.text(3, 0.75, 'Maximum R^2 ') 
    
    Out[85]:
    Text(3, 0.75, 'Maximum R^2 ')

    We see the R^2 gradually increases until an order three polynomial is used. Then the R^2 dramatically decreases at four.

    In [86]:
    def f(order, test_data):
        x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=test_data, random_state=0)
        pr = PolynomialFeatures(degree=order)
        x_train_pr = pr.fit_transform(x_train[['horsepower']])
        x_test_pr = pr.fit_transform(x_test[['horsepower']])
        poly = LinearRegression()
        poly.fit(x_train_pr,y_train)
        PollyPlot(x_train[['horsepower']], x_test[['horsepower']], y_train,y_test, poly, pr)
    

    The following interface allows you to experiment with different polynomial orders and different amounts of data.

    In [87]:
    interact(f, order=(0, 6, 1), test_data=(0.05, 0.95, 0.05))
    
    Out[87]:
    <function __main__.f(order, test_data)>

    Polynomial transformations with more than one feature

    In [88]:
    # Create a "PolynomialFeatures" object "pr1" of degree two
    pr1=PolynomialFeatures(degree=2)
    
    # Transform the training and testing samples for the features 'horsepower', 'curb-weight', 'engine-size' and 'highway-mpg'
    x_train_pr1=pr.fit_transform(x_train[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']])
    
    x_test_pr1=pr.fit_transform(x_test[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']])
    

    How many dimensions does the new feature have? Hint: use the attribute "shape"

    In [89]:
    x_train_pr1.shape
    # Note: the transform above reused "pr" (left at degree 4 by the earlier loop) instead of "pr1",
    # so there are 70 polynomial features here; a degree-2 transform of these four features would give 15
    
    Out[89]:
    (110, 70)
    In [90]:
    # Create a linear regression model "poly1" and train the object using the method "fit" using the polynomial features
    poly1=LinearRegression().fit(x_train_pr1,y_train)
    
    In [91]:
    # predict an output on the polynomial features
    yhat_test1=poly1.predict(x_test_pr1)
    
    # display the distribution of the predicted output vs the test data
    Title='Distribution  Plot of  Predicted Value Using Test Data vs Data Distribution of Test Data'
    DistributionPlot(y_test, yhat_test1, "Actual Values (Test)", "Predicted Values (Test)", Title)
    

    iii. Ridge Regression

    In [92]:
    # perform a degree two polynomial transformation on our data
    # NB: our test data will be used as validation data
    pr=PolynomialFeatures(degree=2)
    x_train_pr=pr.fit_transform(x_train[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg','normalized-losses','symboling']])
    x_test_pr=pr.fit_transform(x_test[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg','normalized-losses','symboling']])
    
    # import Ridge
    from sklearn.linear_model import Ridge
    
    # create a Ridge regression object, setting the regularization parameter (alpha) to 0.2
    RigeModel=Ridge(alpha=0.2)
    
    #  fit the model
    RigeModel.fit(x_train_pr, y_train)
    
    Out[92]:
    Ridge(alpha=0.2)
    In [93]:
    # obtain a prediction
    yhat = RigeModel.predict(x_test_pr)
    
    # compare the first five predicted samples to our test set
    print('predicted:', yhat[0:4])
    print('test set :', y_test[0:4].values)
    
    predicted: [ 6567.97882844  9610.53802227 20866.24455926 19356.52523473]
    test set : [ 6295. 10698. 13860. 13499.]
    
    In [94]:
    # We select the value of alpha that minimizes the test error; for example, we can use a for loop.
    Rsqu_test = []
    Rsqu_train = []
    dummy1 = []
    ALFA = 10 * np.array(range(0,1000))
    for alfa in ALFA:
        RigeModel = Ridge(alpha=alfa) 
        RigeModel.fit(x_train_pr, y_train)
        Rsqu_test.append(RigeModel.score(x_test_pr, y_test))
        Rsqu_train.append(RigeModel.score(x_train_pr, y_train))
        
        
        
    # plot out the value of R^2 for different Alphas
    width = 12
    height = 10
    plt.figure(figsize=(width, height))
    
    plt.plot(ALFA,Rsqu_test, label='validation data  ')
    plt.plot(ALFA,Rsqu_train, 'r', label='training Data ')
    plt.xlabel('alpha')
    plt.ylabel('R^2')
    plt.legend()
    
    Out[94]:
    <matplotlib.legend.Legend at 0x22bd4742c20>

    Figure 6: The blue line represents the R^2 on the validation (test) data and the red line the R^2 on the training data; the x-axis shows the different values of alpha. As alpha increases, the R^2 on the training data (red) decreases, and the R^2 on the validation data (blue) also declines as alpha becomes large, so very large values of alpha make the model perform worse here.
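    A small follow-up sketch (assuming the ALFA array and Rsqu_test list from the loop above are still in memory) to pick the alpha with the highest validation R^2:

    # index of the alpha value with the best validation R^2
    best_idx = np.argmax(Rsqu_test)
    print("best alpha:", ALFA[best_idx], "validation R^2:", Rsqu_test[best_idx])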

    Task 4):

    Perform Ridge regression and calculate the R^2 using the polynomial features, use the training data to train the model and test data to test the model. The parameter alpha should be set to 10.
    In [95]:
    RigeModel = Ridge(alpha=0)  # note: alpha is set to 0 here (no regularization), not the 10 stated in the task above
    RigeModel.fit(x_train_pr, y_train)
    RigeModel.score(x_test_pr, y_test)
    
    Out[95]:
    0.47098332869277415

    iv. Grid Search

    The term alpha is a hyperparameter. sklearn has the class GridSearchCV to make the process of finding the best hyperparameter simpler.

    In [96]:
    # import GridSearchCV
    from sklearn.model_selection import GridSearchCV
    
    # We create a dictionary of parameter values
    parameters1= [{'alpha': [0.001,0.1,1, 10, 100, 1000, 10000, 100000, 100000]}]
    parameters1
    
    Out[96]:
    [{'alpha': [0.001, 0.1, 1, 10, 100, 1000, 10000, 100000, 100000]}]
    In [97]:
    # Create a Ridge regression object
    RR=Ridge()
    RR
    
    Out[97]:
    Ridge()
    In [98]:
    # Create a ridge grid search object
    Grid1 = GridSearchCV(RR, parameters1,cv=4)
    
    # Fit the model
    Grid1.fit(x_data[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']], y_data)
    
    Out[98]:
    GridSearchCV(cv=4, estimator=Ridge(),
                 param_grid=[{'alpha': [0.001, 0.1, 1, 10, 100, 1000, 10000, 100000,
                                        100000]}])

    The object finds the best parameter values on the validation data. We can obtain the estimator with the best parameters and assign it to the variable BestRR as follows:

    In [99]:
    BestRR=Grid1.best_estimator_
    BestRR
    
    Out[99]:
    Ridge(alpha=10000)
    In [100]:
    # test our model on the test data
    BestRR.score(x_test[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']], y_test)
    
    Out[100]:
    0.8411649831036152
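    To see how each candidate alpha scored during cross-validation, the fitted grid search object also exposes its results; a minimal sketch:

    # mean cross-validated R^2 for each alpha in the grid
    scores = Grid1.cv_results_['mean_test_score']
    for alpha, score in zip(parameters1[0]['alpha'], scores):
        print("alpha =", alpha, "-> mean R^2 =", score)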
    In [ ]: