Sampling/ABtesting/GradientMethod
Google Coding Summary
import numpy as np
import scipy.stats as st
import seaborn as sns
import matplotlib.pyplot as plt
Linear Regression
Linear regression with sklearn vs. the normal equation: X_b = np.c_[np.ones((100, 1)), X] and theta_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)
https://colab.research.google.com/drive/1MnNENQS5j2otC7quOGuqWGHg7DqCXG7g
####################linear regression sklearn#########################
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X, y)
lin_reg.intercept_, lin_reg.coef_
theta_best_svd, residuals, rank, s = np.linalg.lstsq(X_b, y, rcond=1e-6)  # least squares via SVD
theta_best_svd
np.linalg.pinv(X_b).dot(y)  # same solution via the Moore-Penrose pseudoinverse
###################Least Squares##########################
import numpy as np
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
plt.plot(X, y, "b.")
plt.xlabel("$x_1$", fontsize=18)
plt.ylabel("$y$", rotation=0, fontsize=18)
plt.axis([0, 2, 0, 15])
save_fig("generated_data_plot")  # save_fig is a plotting helper defined in the HandsOnML notebook
plt.show()
#least square
X_b = np.c_[np.ones((100, 1)), X] # add x0 = 1 to each instance
# np.r_ stacks arrays vertically (one on top of the other); the arrays must have the same number of columns.
# np.c_ stacks arrays horizontally (side by side); the arrays must have the same number of rows.
theta_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)
#prediction
X_new = np.array([[0], [2]])
X_new_b = np.c_[np.ones((2, 1)), X_new] # add x0 = 1 to each instance
y_predict = X_new_b.dot(theta_best)
y_predict
#plot
plt.plot(X_new, y_predict, "r-")
plt.plot(X, y, "b.")
plt.axis([0, 2, 0, 15])
plt.show()
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
##################Statistics Summary###########################
model = smf.ols('mpg ~ cylinders + displacement + horsepower + weight + acceleration + year + origin', df)
model_fit = model.fit()
model_fit.summary()
model_fit.params  # statsmodels exposes coefficients as .params, not sklearn's .coef_
# fitted values (need a constant term for intercept)
model_fitted_y = model_fit.fittedvalues
# model residuals
model_residuals = model_fit.resid
# normalized residuals
model_norm_residuals = model_fit.get_influence().resid_studentized_internal
# absolute squared normalized residuals
model_norm_residuals_abs_sqrt = np.sqrt(np.abs(model_norm_residuals))
# absolute residuals
model_abs_resid = np.abs(model_residuals)
# leverage, from statsmodels internals
model_leverage = model_fit.get_influence().hat_matrix_diag
# cook's distance, from statsmodels internals
model_cooks = model_fit.get_influence().cooks_distance[0]
xfit = np.linspace(df.x.min(), df.x.max(), 100)
yfit = model.predict(xfit[:, np.newaxis])
Regularization model
https://colab.research.google.com/drive/1MnNENQS5j2otC7quOGuqWGHg7DqCXG7g
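The linked notebook covers regularized linear models; below is a minimal sklearn sketch reusing the X, y generated in the least-squares section (Ridge for L2, Lasso for L1; the alpha values are illustrative, not taken from the notebook).
from sklearn.linear_model import Ridge, Lasso
ridge_reg = Ridge(alpha=1.0)       # L2 penalty shrinks coefficients toward zero
ridge_reg.fit(X, y)
lasso_reg = Lasso(alpha=0.1)       # L1 penalty can drive coefficients exactly to zero
lasso_reg.fit(X, y)
ridge_reg.predict([[1.5]]), lasso_reg.predict([[1.5]])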
CV + grid search
https://colab.research.google.com/drive/1gxFuyhHKM-HXF0uiqVXfKrwGCgRDmJLN
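A minimal sketch of cross-validated grid search, again on the X, y from above; the estimator, parameter grid, and scoring choice are assumptions for illustration, not from the linked notebook.
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0]}          # candidate regularization strengths
grid = GridSearchCV(Ridge(), param_grid, cv=5, scoring="neg_mean_squared_error")
grid.fit(X, y)
grid.best_params_, grid.best_estimator_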
Sampling/Bootstrap
https://www.notion.so/yangnyc/Google-interview-solution-fe08edcd81d94e78804b5e6518ddd291
Sample Normal: np.random.normal and plt.hist
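A minimal sketch of the two calls named above (the sample size and bin count are arbitrary):
samples = np.random.normal(loc=0, scale=1, size=10000)  # draw from N(0, 1)
plt.hist(samples, bins=50, density=True)                 # histogram of the draws
plt.show()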
Bootstrap/Median/Standard Error/CI
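A minimal bootstrap sketch for the median's standard error and a 95% percentile CI, on an assumed placeholder sample (the data and the number of resamples are illustrative):
data = np.random.normal(loc=0, scale=1, size=500)        # placeholder sample
n_boot = 10000
boot_medians = np.array([np.median(np.random.choice(data, size=len(data), replace=True))
                         for _ in range(n_boot)])        # resample with replacement, take the median
se = boot_medians.std(ddof=1)                            # bootstrap standard error of the median
ci = np.percentile(boot_medians, [2.5, 97.5])            # percentile 95% confidence interval
print(se, ci)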
Importance Sampling: Step 1: evaluate the densities p(x) and q(x) with st.norm.pdf; Step 2: choose the envelope constant k; Step 3: draw z from the proposal normal q; Step 4: draw u from uniform(0, k*q(z)); Step 5: accept z if u ≤ p(z)
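A sketch that follows the steps above, i.e. the accept/reject scheme with a normal proposal; the target density p, the proposal q, and the envelope constant k are assumptions chosen for illustration:
p = lambda x: 0.5 * st.norm.pdf(x, -1, 0.5) + 0.5 * st.norm.pdf(x, 2, 1.0)  # target density (assumed)
q = lambda x: st.norm.pdf(x, 0, 2.0)                                         # normal proposal density
k = 3.0                                       # envelope constant so that k*q(x) >= p(x) everywhere
samples = []
while len(samples) < 1000:
    z = np.random.normal(0, 2.0)              # Step 3: draw z from the proposal q
    u = np.random.uniform(0, k * q(z))        # Step 4: draw u from uniform(0, k*q(z))
    if u <= p(z):                             # Step 5: accept z when u <= p(z)
        samples.append(z)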
Inverse Sampling
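A minimal inverse-transform sketch: draw U ~ Uniform(0,1) and push it through the inverse CDF of the target; the exponential target with rate 2 is an assumed example.
lam = 2.0
u = np.random.uniform(0, 1, size=10000)
exp_samples = -np.log(1 - u) / lam            # inverse CDF of Exp(lam): F^{-1}(u) = -ln(1-u)/lam
normal_samples = st.norm.ppf(u)               # for a normal target, the inverse CDF is norm.ppf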

Shuffling algorithm (walk from the back: draw j = random.randint(0, i), then swap array[i] ↔ array[j])
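A minimal Fisher-Yates sketch matching the description above:
import random
def shuffle(array):
    for i in range(len(array) - 1, 0, -1):    # walk from the back of the array
        j = random.randint(0, i)              # random index in [0, i], inclusive
        array[i], array[j] = array[j], array[i]
    return array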
Reservoir Sampling (as each element streams in, draw random.randint(0, self.count-1) and keep the new element when the draw is 0)
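A minimal size-1 reservoir sketch matching the note above; the function name is an assumption for illustration.
import random
def reservoir_pick(stream):
    result, count = None, 0
    for x in stream:
        count += 1
        if random.randint(0, count - 1) == 0:  # keep the current element with probability 1/count
            result = x
    return result                              # every element is returned with probability 1/n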
Randomization (combine two rand5 draws, 5*(random5()-1)+random5(), to get a uniform value in 1..25; if it is greater than 21, redraw)
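A sketch of that construction, assuming random5() returns a uniform integer in 1..5 (the helper names are illustrative):
import random
def random5():
    return random.randint(1, 5)
def random7():
    while True:
        idx = 5 * (random5() - 1) + random5()  # uniform on 1..25
        if idx <= 21:                          # values 22..25 are thrown away and redrawn
            return (idx - 1) % 7 + 1           # 1..21 maps evenly onto 1..7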
Gradient Method
Put simply: you have n samples and spread them over several update steps; every step still moves downhill from the current point, and the gradient is unchanged in expectation, only its realization differs from step to step. (A minimal sketch of the three variants follows the list below.)
https://www.notion.so/yangnyc/HandsOnML-Chapter4-e5d0e68590e149b9af58dbfc0e4cfaaf
Batch Gradient Method
Stochastic Gradient Method
MiniBatch Gradient Method
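A minimal sketch of the three variants, reusing the X_b and y built in the least-squares section above; the learning rate, epoch counts, and batch size are illustrative choices, not taken from the linked notebook.
eta = 0.1
m = len(X_b)

# Batch: every step uses all m samples.
theta = np.random.randn(2, 1)
for iteration in range(1000):
    gradients = 2 / m * X_b.T.dot(X_b.dot(theta) - y)
    theta = theta - eta * gradients

# Stochastic: every step uses one randomly chosen sample.
theta = np.random.randn(2, 1)
for epoch in range(50):
    for i in range(m):
        idx = np.random.randint(m)
        xi, yi = X_b[idx:idx + 1], y[idx:idx + 1]
        gradients = 2 * xi.T.dot(xi.dot(theta) - yi)
        theta = theta - eta * gradients

# Mini-batch: every step uses a small random batch.
theta = np.random.randn(2, 1)
batch_size = 20
for epoch in range(50):
    order = np.random.permutation(m)
    for start in range(0, m, batch_size):
        batch = order[start:start + batch_size]
        gradients = 2 / len(batch) * X_b[batch].T.dot(X_b[batch].dot(theta) - y[batch])
        theta = theta - eta * gradients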
AB testing
Generate 100000 dice rolling results
CLT Central Limit Theorem
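A sketch combining the two lines above: simulate the dice rolls, then look at the distribution of sample means, which the CLT says is approximately normal (the per-sample size of 50 is an arbitrary choice).
rolls = np.random.randint(1, 7, size=100000)          # 100000 fair dice rolls (integers 1..6)
sample_means = rolls.reshape(-1, 50).mean(axis=1)     # 2000 means of samples of size 50
plt.hist(sample_means, bins=40, density=True)         # roughly bell-shaped by the CLT
plt.show()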
How to compute a hypothesis test (HT) as an algorithm (norm.cdf)
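A sketch of computing the test by hand with norm.cdf, assuming a two-proportion z-test on made-up A/B counts:
from scipy.stats import norm
x_a, n_a = 200, 5000                                   # conversions / visitors in control (made up)
x_b, n_b = 250, 5000                                   # conversions / visitors in treatment (made up)
p_a, p_b = x_a / n_a, x_b / n_b
p_pool = (x_a + x_b) / (n_a + n_b)                     # pooled conversion rate under H_0
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_b - p_a) / se                                   # z statistic
p_value = 2 * (1 - norm.cdf(abs(z)))                   # two-sided p-value from norm.cdf
print(z, p_value)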
Calculate the sample size(zt_ind_solve_power, proportion_effectsize)
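A sketch with the two statsmodels functions named above; the baseline rate, target rate, alpha, and power are assumed numbers.
from statsmodels.stats.power import zt_ind_solve_power
from statsmodels.stats.proportion import proportion_effectsize
effect_size = proportion_effectsize(0.04, 0.05)        # baseline 4% vs. hoped-for 5%
n_per_group = zt_ind_solve_power(effect_size=effect_size, alpha=0.05, power=0.8,
                                 alternative='two-sided')
print(n_per_group)                                      # required sample size per group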
Test for Independence(U, p, dof, expected = chi2_contingency(table, correction = False))
print(U)  # chi-squared statistic
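A runnable version of the two lines above, with a made-up 2x2 contingency table:
from scipy.stats import chi2_contingency
table = np.array([[200, 4800],                          # control: converted / not converted (made up)
                  [250, 4750]])                         # treatment: converted / not converted (made up)
U, p, dof, expected = chi2_contingency(table, correction=False)
print(U, p)                                             # chi-squared statistic and p-value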
Two discrete variables: Simpson's paradox
Two continuous variables: look at their correlation
One continuous variable and one discrete: look at the test statistic
p-value:
a number; the smaller it is, the more extreme the observed result is under the null hypothesis, given the current observations
Analogy: the court presumes the defendant innocent (the null hypothesis); the knife found at the crime scene is the evidence (the observation).
statistical significance:
a decision concerning the value stated in the null hypothesis
hypothesis:
a claim about a population parameter
the null hypothesis H_0 reflects the default claim, e.g. no relation / no difference
the alternative hypothesis is the claim that H_0 is false
significance level :
the probability of rejecting H_0 when H_0 is true
type I error (convicting the innocent):
rejecting H_0 when H_0 is true
{p-value | H_0 is true} ~ uniform(0,1)
type I error rate = Pr(reject H_0| H_0 is true) = Pr(p-value < significance level| H_0 is true)= significance level
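A small simulation of the two lines above (a one-sample t-test on data generated under H_0 is an assumed example): the p-values come out roughly Uniform(0,1), so the rejection rate matches the significance level.
pvals = np.array([st.ttest_1samp(np.random.normal(0, 1, 30), 0).pvalue for _ in range(5000)])
print(np.mean(pvals < 0.05))                            # close to the significance level 0.05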
type II error (letting the guilty go free):
failing to reject H_0 when H_1 is true
power (sharp eyes that catch a real effect):
the probability of rejecting H_0 when H_1 is true
$power = 1-\beta$