Sampling / A/B Testing / Gradient Method
Google Coding Summary
import numpy as np
import scipy.stats as st
import seaborn as sns
import matplotlib.pyplot as plt
Linear Regression
Linear regression (sklearn vs. the normal equation): X_b = np.c_[np.ones((100, 1)), X] and theta_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)
####################linear regression sklearn#########################
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X, y)
lin_reg.intercept_, lin_reg.coef_
# least-squares solution via SVD (np.linalg.lstsq is more stable than inverting X_b.T.dot(X_b))
theta_best_svd, residuals, rank, s = np.linalg.lstsq(X_b, y, rcond=1e-6)
theta_best_svd
# equivalent: the Moore-Penrose pseudoinverse
np.linalg.pinv(X_b).dot(y)
################### Least Squares ##########################
import numpy as np
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
plt.plot(X, y, "b.")
plt.xlabel("$x_1$", fontsize=18)
plt.ylabel("$y$", rotation=0, fontsize=18)
plt.axis([0, 2, 0, 15])
# save_fig is a custom notebook helper, not a standard matplotlib function
save_fig("generated_data_plot")
plt.show()
#least square
X_b = np.c_[np.ones((100, 1)), X] # add x0 = 1 to each instance
# np.r_ concatenates along rows (stacks matrices vertically; column counts must match)
# np.c_ concatenates along columns (stacks matrices side by side; row counts must match)
theta_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)
#prediction
X_new = np.array([[0], [2]])
X_new_b = np.c_[np.ones((2, 1)), X_new] # add x0 = 1 to each instance
y_predict = X_new_b.dot(theta_best)
y_predict
#plot
plt.plot(X_new, y_predict, "r-")
plt.plot(X, y, "b.")
plt.axis([0, 2, 0, 15])
plt.show()
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
################## Statistics Summary ###########################
model_fit = smf.ols('mpg ~ cylinders + displacement + horsepower + weight + acceleration + year + origin', df).fit()
model_fit.summary()
model_fit.params  # fitted coefficients (statsmodels results have no coef_ attribute)
# fitted values (need a constant term for intercept)
model_fitted_y = model_fit.fittedvalues
# model residuals
model_residuals = model_fit.resid
# normalized residuals
model_norm_residuals = model_fit.get_influence().resid_studentized_internal
# absolute squared normalized residuals
model_norm_residuals_abs_sqrt = np.sqrt(np.abs(model_norm_residuals))
# absolute residuals
model_abs_resid = np.abs(model_residuals)
# leverage, from statsmodels internals
model_leverage = model_fit.get_influence().hat_matrix_diag
# cook's distance, from statsmodels internals
model_cooks = model_fit.get_influence().cooks_distance[0]
# plot a fit over one feature ('model' here is a fitted single-feature sklearn-style regressor, not the statsmodels fit above)
xfit = np.linspace(df.x.min(), df.x.max(), 100)
yfit = model.predict(xfit[:, np.newaxis])
Regularization model
CV + grid search (see the sketch below)
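A minimal sketch of both lines above, assuming the X, y arrays from the least-squares section; the alpha grid is illustrative, not from the original notes.
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
# Ridge regularization with the penalty strength picked by 5-fold CV grid search
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}  # illustrative grid
grid = GridSearchCV(Ridge(), param_grid, cv=5, scoring="neg_mean_squared_error")
grid.fit(X, y.ravel())  # X, y as generated in the least-squares section
print(grid.best_params_, grid.best_estimator_.coef_)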
Sampling / Bootstrap
Sample Normal: np.random.normal and plt.hist (see the sketch below)
Bootstrap / Median / Standard Error / CI
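A minimal sketch covering both lines: draw a normal sample, plot its histogram, then bootstrap the median's standard error and 95% CI; the sample size and B are illustrative.
import numpy as np
import matplotlib.pyplot as plt
rng = np.random.default_rng(42)
sample = rng.normal(loc=0.0, scale=1.0, size=1000)  # same idea as np.random.normal
plt.hist(sample, bins=30)
plt.show()
B = 5000  # number of bootstrap resamples
medians = np.array([np.median(rng.choice(sample, size=sample.size, replace=True))
                    for _ in range(B)])
se = medians.std(ddof=1)  # bootstrap standard error of the median
ci = np.percentile(medians, [2.5, 97.5])  # 95% percentile CI
print(se, ci)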
Importance/Rejection Sampling (the steps here implement rejection sampling): step 1, define the target p(x) and the proposal q(x) with st.norm.pdf; step 2, choose a constant k so that k·q(x) ≥ p(x) everywhere; step 3, draw z from q and u from Uniform(0, k·q(z)); step 4, accept z when u ≤ p(z)
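A minimal sketch of those four steps, assuming a bimodal target p and a wide normal proposal q; the densities and k are illustrative.
import numpy as np
import scipy.stats as st
rng = np.random.default_rng(0)
p = lambda x: 0.5 * st.norm.pdf(x, -2, 1) + 0.5 * st.norm.pdf(x, 2, 1)  # target density
q = lambda x: st.norm.pdf(x, 0, 3)  # proposal density
k = 3.0  # chosen so that k * q(x) >= p(x) over the support
samples = []
while len(samples) < 10000:
    z = rng.normal(0, 3)          # step 3: z ~ q
    u = rng.uniform(0, k * q(z))  # step 3: u ~ Uniform(0, k*q(z))
    if u <= p(z):                 # step 4: accept
        samples.append(z)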
Inverse Transform Sampling
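A minimal sketch of inverse transform sampling, assuming an Exponential(lam) target whose inverse CDF has the closed form -ln(1-u)/lam; lam is illustrative.
import numpy as np
rng = np.random.default_rng(0)
lam = 2.0  # illustrative rate parameter
u = rng.uniform(0, 1, size=10000)  # u ~ Uniform(0, 1)
x = -np.log(1 - u) / lam           # apply the inverse CDF F^{-1}(u)
# x now follows Exponential(lam); x.mean() should be close to 1/lam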

Shuffling algorithm (Fisher-Yates: walk the array from the back; for index i, draw j = random.randint(0, i) and swap array[i] ↔ array[j])
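A minimal sketch of that swap loop; the function name is illustrative.
import random
def shuffle_in_place(array):
    # Fisher-Yates: walk from the last index down, swap with a random earlier slot
    for i in range(len(array) - 1, 0, -1):
        j = random.randint(0, i)  # inclusive on both ends
        array[i], array[j] = array[j], array[i]
    return array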
Reservoir Sampling (for the i-th element seen, keep it when random.randint(0, self.count - 1) == 0, i.e. with probability 1/i)
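A minimal sketch of size-1 reservoir sampling over a stream of unknown length, matching the randint condition above; the function shape is illustrative.
import random
def reservoir_sample(stream):
    # returns one element chosen uniformly from an iterable of unknown length
    chosen, count = None, 0
    for item in stream:
        count += 1
        # replace the current choice with probability 1/count
        if random.randint(0, count - 1) == 0:
            chosen = item
    return chosen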
Randomization (rand7 from rand5: form rand25 = 5 * (rand5() - 1) + rand5(); if the value is greater than 21, redraw; otherwise take it mod 7)
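A minimal sketch assuming a 1-to-5 rand5(); the names are illustrative.
import random
def rand5():
    return random.randint(1, 5)  # stand-in for the given generator
def rand7():
    while True:
        v = 5 * (rand5() - 1) + rand5()  # uniform over 1..25
        if v <= 21:                      # reject 22..25 to keep the result uniform
            return (v - 1) % 7 + 1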
Gradient Method
Batch Gradient Method
Stochastic Gradient Method
Mini-Batch Gradient Method
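A minimal sketch of all three variants on the linear-regression data above, assuming the X_b, y arrays from the least-squares section; learning rates, epoch counts, and batch size are illustrative.
import numpy as np
rng = np.random.default_rng(0)
m = len(X_b)  # X_b, y as built in the least-squares section

# batch: one gradient over the full dataset per step
theta = rng.standard_normal((2, 1))
for step in range(500):
    gradients = 2 / m * X_b.T.dot(X_b.dot(theta) - y)
    theta = theta - 0.1 * gradients

# stochastic: one gradient per randomly picked instance
theta = rng.standard_normal((2, 1))
for epoch in range(50):
    for i in rng.permutation(m):
        xi, yi = X_b[i:i + 1], y[i:i + 1]
        theta = theta - 0.1 * 2 * xi.T.dot(xi.dot(theta) - yi)

# mini-batch: one gradient per small random subset
theta = rng.standard_normal((2, 1))
for step in range(500):
    idx = rng.choice(m, size=20, replace=False)
    xb, yb = X_b[idx], y[idx]
    theta = theta - 0.1 * 2 / 20 * xb.T.dot(xb.dot(theta) - yb)
print(theta)  # each variant should land near [[4], [3]]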
A/B Testing
Generate 100,000 dice-roll results
CLT (Central Limit Theorem)
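A minimal sketch covering both lines: simulate 100,000 fair-die rolls, then show that group means look approximately normal; the group size is illustrative.
import numpy as np
import matplotlib.pyplot as plt
rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=100_000)  # 100,000 fair-die rolls
plt.hist(rolls, bins=6)  # roughly uniform over 1..6
plt.show()
# CLT: means of groups of 50 rolls concentrate into a bell shape
means = rolls.reshape(-1, 50).mean(axis=1)
plt.hist(means, bins=40)
plt.show()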
How to compute a hypothesis test as an algorithm (norm.cdf)
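A minimal sketch of a two-proportion z-test finished off with norm.cdf; the conversion counts are illustrative.
import numpy as np
from scipy.stats import norm
x_c, n_c = 120, 1000  # control conversions / size (illustrative)
x_t, n_t = 150, 1000  # treatment conversions / size (illustrative)
p_c, p_t = x_c / n_c, x_t / n_t
p_pool = (x_c + x_t) / (n_c + n_t)  # pooled rate under H0
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
z = (p_t - p_c) / se
p_value = 2 * (1 - norm.cdf(abs(z)))  # two-sided p-value
print(z, p_value)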
Calculate the sample size (zt_ind_solve_power, proportion_effectsize)
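A minimal sketch with statsmodels, assuming an illustrative lift from a 12% baseline rate to 15%, alpha = 0.05, power = 0.8.
from statsmodels.stats.power import zt_ind_solve_power
from statsmodels.stats.proportion import proportion_effectsize
effect = proportion_effectsize(0.15, 0.12)  # effect size for the assumed lift
n = zt_ind_solve_power(effect_size=effect, alpha=0.05, power=0.8,
                       alternative='two-sided')
print(n)  # required sample size per group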
Test for Independence (U, p, dof, expected = chi2_contingency(table, correction=False))
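A minimal sketch of the chi-square test of independence on an illustrative 2x2 conversion table.
import numpy as np
from scipy.stats import chi2_contingency
# rows: control / treatment; columns: converted / not converted (illustrative counts)
table = np.array([[120, 880],
                  [150, 850]])
U, p, dof, expected = chi2_contingency(table, correction=False)
print(U, p, dof)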
p-value: the probability, assuming H0 is true, of a result at least as extreme as the observed one.
statistical significance: the p-value falls below the significance level, so the result is unlikely under H0 and we reject H0.
hypothesis: H0 (null: no effect) vs. H1 (alternative: the effect under test).
significance level: α, the type I error rate we are willing to accept (commonly 0.05).
type I error (a wrongful conviction): rejecting H0 when it is actually true.
type II error (the guilty slipping the net): failing to reject H0 when it is actually false.
power (a piercing eye for real effects): 1 − β, the probability of rejecting H0 when it is false.