This research tests whether the variables last_oil_price, mid_USD_rate, and inflation influence oil_export volume in Indonesia. The test uses regression calculations on sample data from 2021 to 2022, downloaded from the BPS, BI, and Investing websites.
1. Preparation of Data
Import all of the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import seaborn as sns
from math import log
from scipy import stats
import statsmodels.api as sm
from statsmodels.stats.stattools import jarque_bera
%matplotlib inline
#Read in the export_oil csv file as a DataFrame called oil_indonesia
oil_indonesia = pd.read_csv("/content/export_oil.csv")
#Check the head of oil_indonesia
oil_indonesia.head()
last_oil_price mid_USD_rate oil_export inflation
0 51.56 13662.0 815.3 0.0268
1 44.76 14234.0 805.2 0.0298
2 20.48 16367.0 617.4 0.0296
3 18.84 15157.0 562.1 0.0267
4 35.49 14733.0 560.9 0.0219
#Check the oil_indonesia table's shape
oil_indonesia.shape
(48, 5)
#Check the oil_indonesia info
oil_indonesia.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48 entries, 0 to 47
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 end_month 48 non-null object
1 last_oil_price 48 non-null float64
2 mid_USD_rate 48 non-null int64
3 oil_export 48 non-null float64
4 inflation 48 non-null float64
dtypes: float64(3), int64(1), object(1)
memory usage: 2.0+ KB
Change the mid_USD_rate data type to float64
oil_indonesia['mid_USD_rate'] = oil_indonesia['mid_USD_rate'].astype('float64')
oil_indonesia.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48 entries, 0 to 47
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 end_month 48 non-null object
1 last_oil_price 48 non-null float64
2 mid_USD_rate 48 non-null float64
3 oil_export 48 non-null float64
4 inflation 48 non-null float64
dtypes: float64(4), object(1)
memory usage: 2.0+ KB
#Check null value in oil_indonesia
print(oil_indonesia.isnull().sum())
end_month 0
last_oil_price 0
mid_USD_rate 0
oil_export 0
inflation 0
dtype: int64
Check for possible duplicated values in oil_indonesia
oil_indonesia.duplicated().sum()
#Drop "end_month" column that it is useless for regression calculation
oil_indonesia.drop(['end_month'],axis=1,inplace=True)
Check oil_indonesia with the describe method
oil_indonesia.describe()
last_oil_price mid_USD_rate oil_export inflation
count 48.000000 48.000000 48.000000 48.000000
mean 69.608542 14776.416667 1061.262500 0.028981
std 22.174758 554.931463 342.029537 0.014455
min 18.840000 13662.000000 0.000000 0.013200
25% 52.040000 14346.750000 849.275000 0.015975
50% 74.580000 14671.500000 1066.150000 0.026050
75% 82.242500 15101.500000 1311.150000 0.038500
max 114.670000 16367.000000 1662.900000 0.059500
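Note that the describe output shows a minimum oil_export of 0.0, which may indicate a missing or anomalous month. A quick sanity check (a minimal sketch reusing the oil_indonesia DataFrame above, with a hypothetical variable name zero_export) lists any affected rows:
#Inspect rows where oil_export is recorded as zero (possible data issue)
zero_export = oil_indonesia[oil_indonesia['oil_export'] == 0]
print(zero_export)
print('Rows with zero oil_export:', len(zero_export))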
2. Exploratory Data Analysis
Use a heatmap to visualize the correlation matrix of oil_indonesia
plt.figure(figsize=(10,5))
c = oil_indonesia.corr()
sns.heatmap(c,cmap="BrBG",annot=True)
c
last_oil_price mid_USD_rate oil_export inflation
last_oil_price 1.000000 0.073903 0.755607 0.458852
mid_USD_rate 0.073903 1.000000 0.156464 0.545627
oil_export 0.755607 0.156464 1.000000 0.541510
inflation 0.458852 0.545627 0.541510 1.000000
Use a pairplot to visualize the pairwise relationships in oil_indonesia
sns.pairplot(oil_indonesia)
<seaborn.axisgrid.PairGrid at 0x78b559296e90>
Create a linear model plot (using seaborn's lmplot) of inflation vs oil_export
sns.lmplot(x='inflation', y="oil_export", data=oil_indonesia)
<seaborn.axisgrid.FacetGrid at 0x78b557c3aa10>
3. Training and Testing Data
Set a variable X equal to the numerical features of oil_indonesia and a variable y equal to the oil_export column
X = oil_indonesia.drop(['oil_export'],axis=1)
y = oil_indonesia['oil_export']
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X)
X = scaler.transform(X)
Import train_test_split from sklearn.model_selection
from sklearn.model_selection import train_test_split
Split the data into training and testing sets. Set test_size=0.2 and random_state=0
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
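Note that the scaler above was fitted on the full dataset before splitting, so the test rows influence the scaling statistics. A common alternative (a minimal sketch with hypothetical variable names X_raw, scaler_alt, etc.) is to split first and fit the scaler on the training portion only:
#Alternative order: split first, then fit the scaler on the training rows only
X_raw = oil_indonesia.drop(['oil_export'], axis=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_raw, y, test_size=0.2, random_state=0)
scaler_alt = StandardScaler().fit(X_tr)      #statistics learned from training rows only
X_tr_scaled = scaler_alt.transform(X_tr)
X_te_scaled = scaler_alt.transform(X_te)     #same scaling applied to the held-out rows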
4. Training the Model
Import LinearRegression from sklearn.linear_model
from sklearn.linear_model import LinearRegression
Create an instance of a LinearRegression model named lm
lm = LinearRegression()
Train/fit lm on the training data
lm.fit(X_train, y_train)
LinearRegression()
Print out the coefficients of the model
lm.coef_
array([200.44050601, -10.80755483, 121.11087484])
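Because X was converted to a plain array by the scaler, the coefficient order follows the original column order (last_oil_price, mid_USD_rate, inflation). A small sketch pairs each coefficient (on the scaled features) with its name and the intercept:
#Pair each coefficient with its feature name; values apply to the scaled features
feature_names = ['last_oil_price', 'mid_USD_rate', 'inflation']
coef_table = pd.DataFrame({'feature': feature_names, 'coefficient': lm.coef_})
print(coef_table)
print('intercept:', lm.intercept_)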
Predict on X_test and print out the predictions
prediction = lm.predict(X_test)
prediction
array([1528.45452188, 705.9208487 , 1343.42451644, 1511.02483507,
1416.10915511, 1357.42519896, 1372.09041617, 1159.72902241,
700.81715592, 757.00143871])
5. Predicting the Test Data
Create a scatterplot of the real test values vs
the predicted values
plt.scatter(y_test, prediction)
plt.xlabel("Y_test")
plt.ylabel('predicted values')
Text(0, 0.5, 'predicted values')
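A residual check complements the scatter plot; a minimal sketch reusing the y_test and prediction arrays above:
#Plot the distribution of residuals (actual minus predicted)
residuals = y_test - prediction
plt.hist(residuals, bins=10)
plt.xlabel('residual (y_test - prediction)')
plt.ylabel('count')
plt.show()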
6. Evaluating the Model
Calculate the mean absolute error (MAE), mean squared error (MSE), and root mean squared error (RMSE) to evaluate the model
from sklearn import metrics
print('MAE',metrics.mean_absolute_error(y_test, prediction))
print('MSE',metrics.mean_squared_error(y_test,prediction))
print('RMSE',np.sqrt(metrics.mean_squared_error(y_test,prediction)))
MAE 139.51781104780872
MSE 25329.345231343865
RMSE 159.15195641695348
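As a sanity check, the same metrics can be computed directly from their definitions with numpy (a minimal sketch reusing y_test and prediction):
#Manual computation of MAE, MSE, and RMSE
errors = np.asarray(y_test) - prediction
mae = np.mean(np.abs(errors))      #mean absolute error
mse = np.mean(errors ** 2)         #mean squared error
rmse = np.sqrt(mse)                #root mean squared error
print(mae, mse, rmse)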
Print out the explained variance score to determine how accurately the model predicts
metrics.explained_variance_score(y_test,prediction)
0.8234756635987014
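Note that explained_variance_score is not exactly R-squared; sklearn's r2_score gives the coefficient of determination directly (a minimal sketch for comparison):
#Compare explained variance with the coefficient of determination (R-squared)
from sklearn.metrics import r2_score
print('R-squared on the test set:', r2_score(y_test, prediction))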
Print the head of the independent variables (X) from the oil_indonesia DataFrame
X = oil_indonesia.drop(["oil_export"],axis=1)
X.head()
last_oil_price mid_USD_rate inflation
0 51.56 13662.0 0.0268
1 44.76 14234.0 0.0298
2 20.48 16367.0 0.0296
3 18.84 15157.0 0.0267
4 35.49 14733.0 0.0219
Print the head of the dependent variable (y) from the oil_indonesia DataFrame
y = oil_indonesia['oil_export']
y.head()
0 815.3
1 805.2
2 617.4
3 562.1
4 560.9
Name: oil_export, dtype: float64
Add a constant to the X variable
X = sm.add_constant(X)
X.head()
const last_oil_price mid_USD_rate inflation
0 1.0 51.56 13662.0 0.0268
1 1.0 44.76 14234.0 0.0298
2 1.0 20.48 16367.0 0.0296
3 1.0 18.84 15157.0 0.0267
4 1.0 35.49 14733.0 0.0219
Print the OLS regression results to examine the performance of the regression model
OLS = sm.OLS(endog = y, exog = X).fit()
OLS.summary()
<class 'statsmodels.iolib.summary.Summary'>
"""
                            OLS Regression Results
==============================================================================
Dep. Variable:             oil_export   R-squared:                       0.620
Model:                            OLS   Adj. R-squared:                  0.594
Method:                 Least Squares   F-statistic:                     23.93
Date:                Wed, 17 Jan 2024   Prob (F-statistic):           2.44e-09
Time:                        10:50:40   Log-Likelihood:                -324.46
No. Observations:                  48   AIC:                             656.9
Df Residuals:                      44   BIC:                             664.4
Df Model:                           3
Covariance Type:            nonrobust
==================================================================================
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
const            546.7744   1021.789      0.535      0.595   -1512.506    2606.054
last_oil_price     9.7749      1.661      5.887      0.000       6.428      13.122
mid_USD_rate      -0.0238      0.070     -0.339      0.736      -0.166       0.118
inflation       6432.0732   3031.540      2.122      0.040     322.405    1.25e+04
==============================================================================
Omnibus:                       48.862   Durbin-Watson:                   1.555
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              286.338
Skew:                          -2.453   Prob(JB):                     6.65e-63
Kurtosis:                      13.913   Cond. No.                     1.45e+06
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.45e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
"""
7. Conclusion
Compare the predictions against the prepared testing data. Create a new DataFrame (new_df) holding the actual and predicted values
new_df = pd.DataFrame({"Actual":y_test,"Predicted":prediction})
new_df
Actual Predicted
29 1551.8 1528.454522
4 560.9 705.920849
26 1493.3 1343.424516
30 1287.6 1511.024835
32 1259.0 1416.109155
37 1186.5 1357.425199
34 1101.9 1372.090416
40 1308.6 1159.729022
7 599.6 700.817156
10 762.2 757.001439
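To make the comparison easier to read, an absolute error column can be appended (a minimal sketch extending new_df, with a hypothetical column name Abs_Error):
#Add the absolute prediction error for each test observation
new_df['Abs_Error'] = (new_df['Actual'] - new_df['Predicted']).abs()
new_df.sort_values('Abs_Error')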