ML (Andrew Ng) Study Notes (2): Cost Function & Gradient Descent

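I am only a first-year undergraduate. My coursework does cover ML, but I don't think I should stop there, so I am studying some extra material on my own. Please bear with me where my understanding falls short.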

These notes discuss the cost function and gradient descent.

A. Cost function: a tool for measuring how accurate the hypothesis function h(x) is

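Given a data set, many different straight lines can be fitted to it, each with its own pair θ_0, θ_1. To predict more accurately we need to pick the one pair for which the data points lie on the line or as close to it as possible, and the cost function is what lets us compare candidates.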

**Example:** univariate linear regression
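
For reference, the squared error cost for this example can be written out as below. This is a reconstruction from the code later in these notes, assuming the hypothesis h(x) = θ_0 + θ_1*x and m training examples:

```latex
J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \bigl( h(x_i) - y_i \bigr)^2,
\qquad h(x) = \theta_0 + \theta_1 x
```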

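x_i, y_i is the i-th training example.
The factor 1/(2m) is there only so that, when the whole expression is differentiated, the exponent 2 comes down and cancels the 1/2. It does not change which θ minimizes the cost.
"To make the derivations mathematically more convenient"

cost function (or squared error function)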

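**Geometric interpretation:**

For the simplified hypothesis h(x) = θ_1*x,

the cost is essentially the sum, over all examples, of the squared vertical distance between the prediction h(x_i) = θ_1*x_i and the true value y_i, multiplied by 1/(2m), where m is the number of examples.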
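
A quick numeric check of this picture, as a minimal sketch: the three data points are made up for illustration (they lie exactly on y = x), and `cost_theta1` is a hypothetical helper name, not something from the course.

```python
def cost_theta1(theta1, xs, ys):
    """J(theta1) for the simplified hypothesis h(x) = theta1 * x."""
    m = len(xs)
    # sum of squared vertical gaps between prediction and true value
    squared_gaps = sum((theta1 * x - y) ** 2 for x, y in zip(xs, ys))
    return squared_gaps / (2 * m)

xs = [1.0, 2.0, 3.0]   # made-up inputs
ys = [1.0, 2.0, 3.0]   # made-up targets, exactly y = x

for theta1 in (0.0, 0.5, 1.0):
    print(theta1, cost_theta1(theta1, xs, ys))
```

The cost is 0 at θ_1 = 1, where the line passes through every point, and it grows as the line tilts away from the data.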

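**Visualization:**

For h(x) = θ_0 + θ_1*x,

the squared error cost of linear regression is always a convex, bowl-shaped function like this one, so it has only a global optimum and no other local optima.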

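Points on the same contour line in the right-hand plot have different θ_1, θ_0 but the same value of J.
The contour lines can be seen as horizontal cross-sections of the convex surface above.
The point at the very center of the ellipses is where the cost function is smallest.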
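
To make the contour picture concrete, here is a small sketch that evaluates J on a grid of (θ_0, θ_1) values and looks up the lowest cell. The data (points on y = 1 + 2x) and the grid ranges are assumptions chosen for illustration.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])   # made-up inputs
y = 1.0 + 2.0 * x                    # made-up targets on the line y = 1 + 2x

theta0_grid = np.linspace(-1.0, 3.0, 41)   # step 0.1, includes 1.0
theta1_grid = np.linspace(0.0, 4.0, 41)    # step 0.1, includes 2.0

# J[i, j] is the cost at (theta0_grid[i], theta1_grid[j])
T0, T1 = np.meshgrid(theta0_grid, theta1_grid, indexing="ij")
pred = T0[..., None] + T1[..., None] * x           # shape (41, 41, 4)
J = np.sum((pred - y) ** 2, axis=-1) / (2 * len(x))

i, j = np.unravel_index(np.argmin(J), J.shape)
best_theta0, best_theta1 = theta0_grid[i], theta1_grid[j]
print(best_theta0, best_theta1, J[i, j])
```

Each contour line corresponds to the set of grid cells sharing one value of J, and the lowest cell sits at the parameters that generated the data.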

Computing the cost (gradient descent, step 1):

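import numpy as np

def compute_cost(x, y, w, b):    # equation 2 (for univariate)
    m = x.shape[0]               # number of training examples
    cost = 0

    for i in range(m):
        f_wb = w * x[i] + b      # prediction for example i (one feature)
        cost = cost + (f_wb - y[i]) ** 2
    total_cost = 1 / (2 * m) * cost

    return total_cost

# Multi-variable version. It reuses the name compute_cost, so if both
# definitions live in one file this one replaces the univariate version.
def compute_cost(X, y, w, b):    # equation 2 (for multi-var)
    m = X.shape[0]               # number of training examples
    cost = 0.0

    for i in range(m):
        f_wb_i = np.dot(X[i], w) + b   # prediction for example i (multiple features)
        cost = cost + (f_wb_i - y[i]) ** 2
    cost = cost / (2 * m)
    return cost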
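
The loop version can be cross-checked against a vectorized form. The sketch below is one way to do that; `compute_cost_loop`, `compute_cost_vec` and the small data set are hypothetical names and values introduced here, not part of the course code.

```python
import numpy as np

def compute_cost_loop(X, y, w, b):
    """Loop form of the squared error cost, one example at a time."""
    m = X.shape[0]
    cost = 0.0
    for i in range(m):
        cost += (np.dot(X[i], w) + b - y[i]) ** 2
    return cost / (2 * m)

def compute_cost_vec(X, y, w, b):
    """Same cost computed with one matrix-vector product."""
    err = X @ w + b - y          # shape (m,)
    return np.dot(err, err) / (2 * X.shape[0])

# made-up data: 3 examples, 2 features
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -0.5])
b = 1.0

loop_cost = compute_cost_loop(X, y, w, b)
vec_cost = compute_cost_vec(X, y, w, b)
print(loop_cost, vec_cost)
```

Both forms compute the same number; the vectorized one simply lets NumPy do the summation.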

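These functions only compute the total cost over the data set for a given θ_1, θ_0. To actually find the θ_1, θ_0 that minimize J, we need gradient descent.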

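B. Gradient descent: a method for reaching the target by approaching it step by step, ending with a hypothesis function very close to the best fit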

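**Visualization 1:**

Imagine wanting to get from the top of a hill to the bottom as quickly as possible: at every step you take the steepest way down. For the cost function, the steepness of the path is the slope k, that is, the derivative of J at the current point.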

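There is a problem, though: starting from a slightly different point can lead to a completely different local optimum. This concerns non-convex cost functions; as noted above, the squared error cost of linear regression is convex and has only the global optimum.
"In practice, evaluating every possible parameter setting usually cannot find the global optimum. Instead, an optimization algorithm is used to find a solution, which may not be globally optimal but is good enough for the purpose. Another option is to use more advanced techniques such as SGD or Adam."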
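
The sensitivity to the starting point is easy to reproduce on a non-convex function. The sketch below uses f(x) = x^4 - 3x^2 + x, a function chosen here purely for illustration (it is not a regression cost), and runs plain gradient descent from two starting points. The names `f_prime` and `descend` and the values of alpha and num_iters are assumptions.

```python
def f_prime(x):
    """Derivative of f(x) = x**4 - 3*x**2 + x."""
    return 4 * x**3 - 6 * x + 1

def descend(x0, alpha=0.01, num_iters=2000):
    """Plain gradient descent on f starting from x0."""
    x = x0
    for _ in range(num_iters):
        x = x - alpha * f_prime(x)
    return x

left = descend(-2.0)    # settles in the minimum on the negative side
right = descend(2.0)    # settles in a different minimum on the positive side
print(left, right)
```

Both runs stop where the derivative is zero, but at different points. Only the left one is the global minimum of this f; the right one is a local minimum.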

**The algorithm:**
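
Written out, the update rule takes the following form. This is a reconstruction from the code later in these notes, using θ_0 = b and θ_1 = w; the partial derivatives follow from differentiating the squared error cost J with its 1/(2m) factor.

```latex
\begin{aligned}
&\text{repeat until convergence:} \\
&\quad \theta_0 := \theta_0 - \alpha \, \frac{\partial J(\theta_0, \theta_1)}{\partial \theta_0},
\qquad \frac{\partial J}{\partial \theta_0} = \frac{1}{m} \sum_{i=1}^{m} \bigl( h(x_i) - y_i \bigr) \\
&\quad \theta_1 := \theta_1 - \alpha \, \frac{\partial J(\theta_0, \theta_1)}{\partial \theta_1},
\qquad \frac{\partial J}{\partial \theta_1} = \frac{1}{m} \sum_{i=1}^{m} \bigl( h(x_i) - y_i \bigr) \, x_i
\end{aligned}
```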

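α is the learning rate; changing it changes the size of each step downhill.
θ_1, θ_0 (w, b) are updated *simultaneously* (simultaneous update)
- Personal understanding: θ_1 and θ_0 can be thought of as the left and right directions while walking downhill. Both updates share the same α, but each parameter subtracts α times its own partial derivative, so the two steps generally differ in size. "Simultaneous" means both partial derivatives are evaluated at the old θ_0, θ_1 before either parameter is overwritten.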
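
The difference between a simultaneous and a sequential update shows up in a single step. This is a sketch with one made-up data point; `grads` is a hypothetical helper, and w, b stand for θ_1, θ_0.

```python
def grads(w, b, x, y):
    """Partial derivatives of J for a single example (m = 1)."""
    err = w * x + b - y
    return err * x, err          # dJ/dw, dJ/db

x, y = 2.0, 3.0      # one made-up example
alpha = 0.1
w0, b0 = 0.0, 0.0    # starting parameters

# Simultaneous update: both derivatives use the old (w0, b0)
dj_dw, dj_db = grads(w0, b0, x, y)
w_sim = w0 - alpha * dj_dw
b_sim = b0 - alpha * dj_db

# Sequential update (not what the algorithm specifies): b sees the new w
w_seq = w0 - alpha * grads(w0, b0, x, y)[0]
b_seq = b0 - alpha * grads(w_seq, b0, x, y)[1]

print(w_sim, b_sim)
print(w_seq, b_seq)
```

The two variants agree on w but not on b, because the sequential version computes the b-derivative from an already updated w. Note also that the subtracted terms for w and b differ (α times -6 versus α times -3): only α is shared.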

Computing the gradient (gradient descent, step 2):

def compute_gradient(x, y, w, b):    # equation 4,5 (for univariate)
    m = x.shape[0]
    dj_dw = 0
    dj_db = 0
    
    for i in range(m):
        f_wb = w * x[i] + b
        dj_dw_i = (f_wb - y[i]) * x[i]
        dj_db_i = f_wb - y[i]
        dj_db += dj_db_i
        dj_dw += dj_dw_i
    dj_dw = dj_dw /m
    dj_db = dj_db /m
    
    return dj_dw , dj_db

def compute_gradient(X, y, w, b):  #equation 4,5 (for multi-var)   
    m,n = X.shape           #(# of examples, # of features)
    dj_dw = np.zeros((n,))
    dj_db = 0.

    for i in range(m):                             
        err = (np.dot(X[i], w) + b) - y[i]   
        for j in range(n):                         
            dj_dw[j] = dj_dw[j] + err * X[i, j]    
        dj_db = dj_db + err                        
    dj_dw = dj_dw / m                                
    dj_db = dj_db / m                                
        
    return dj_db, dj_dw    # order is (dj_db, dj_dw), the reverse of the univariate version above
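
One way to gain confidence in the gradient formulas is to compare them with a finite-difference estimate of the cost. This is a sketch, not part of the course code; `cost`, `gradient`, the step size `eps` and the data are assumptions introduced here.

```python
import numpy as np

def cost(X, y, w, b):
    """Squared error cost with the 1/(2m) factor."""
    err = X @ w + b - y
    return np.dot(err, err) / (2 * X.shape[0])

def gradient(X, y, w, b):
    """Analytic gradient, returned as (dj_db, dj_dw)."""
    m = X.shape[0]
    err = X @ w + b - y
    return np.sum(err) / m, X.T @ err / m

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])   # made-up data
y = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -0.5])
b = 1.0
eps = 1e-6   # finite-difference step, chosen for illustration

dj_db, dj_dw = gradient(X, y, w, b)

# Central differences: (J(p + eps) - J(p - eps)) / (2 * eps) for each parameter
num_db = (cost(X, y, w, b + eps) - cost(X, y, w, b - eps)) / (2 * eps)
num_dw = np.zeros_like(w)
for j in range(w.size):
    step = np.zeros_like(w)
    step[j] = eps
    num_dw[j] = (cost(X, y, w + step, b) - cost(X, y, w - step, b)) / (2 * eps)

print(dj_db, num_db)
print(dj_dw, num_dw)
```

If the analytic and numeric values disagree by more than rounding error, the gradient code or the cost code has a bug.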

Now that we have functions for computing the cost and the gradient, we can run the descent itself.

Gradient descent, step 3:


import copy
import math

# equation 3
def gradient_descent(x, y, w_in, b_in, alpha, num_iters, cost_function, gradient_function):  
    """ 
    Args:
      x (ndarray (m,))  : Data, m examples 
      y (ndarray (m,))  : target values
      w_in,b_in (scalar): initial values of model parameters           
         
    Returns:
      w (scalar): Updated value 
      b (scalar): Updated value 
      J_history (List): History of cost values
      p_history (list): History of parameters [w,b] 
      """
    
    # Deep-copy w_in so the caller's value is not affected: the copy is fully
    # independent, which avoids problems from objects shared with the original.
    w = copy.deepcopy(w_in)
    b = b_in

    J_history = []
    p_history = []
    
    for i in range(num_iters):
        
        dj_dw, dj_db = gradient_function(x, y, w , b)     

        # Update Parameters using equation (3) above
        b = b - alpha * dj_db                            
        w = w - alpha * dj_dw                            

        # Save cost J at each iteration
        if i < 100000:      # prevent resource exhaustion
            J_history.append(cost_function(x, y, w, b))  # cost_function returns the total cost
            p_history.append([w, b])

        # Limit how often we print: only every ceil(num_iters/10) iterations, about 10 lines in total.
        if i % math.ceil(num_iters/10) == 0: 
            print(f"Iteration {i:4}: Cost {J_history[-1]:0.2e} ",
                  f"dj_dw: {dj_dw: 0.3e}, dj_db: {dj_db: 0.3e} ",
                  f"w: {w: 0.3e}, b:{b: 0.5e}")
 
    return w, b, J_history, p_history #return for graphing

def gradient_descent(X, y, w_in, b_in, cost_function, gradient_function, alpha, num_iters): 
    """ 
    Args:
      X (ndarray (m,n))   : Data, m examples with n features
      y (ndarray (m,))    : target values
      w_in (ndarray (n,)), b_in (scalar) : initial parameters  
   
    Returns:
      w (ndarray (n,)) : Updated values 
      b (scalar)       : Updated value 
      """
    
    # An array to store cost J and w's at each iteration primarily for graphing later
    J_history = []
    w = copy.deepcopy(w_in)  #avoid modifying global w within function
    b = b_in
    
    for i in range(num_iters):
  
        dj_db,dj_dw = gradient_function(X, y, w, b)   

        w = w - alpha * dj_dw              
        b = b - alpha * dj_db               
      
        # Save cost J at each iteration
        if i<100000:      # prevent resource exhaustion 
            J_history.append( cost_function(X, y, w, b))

        # Limit how often we print: only every ceil(num_iters/10) iterations, about 10 lines in total.
        if i% math.ceil(num_iters / 10) == 0:
            print(f"Iteration {i:4d}: Cost {J_history[-1]:8.2f}  ")
        
    return w, b, J_history #return for graphing
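
To see the three steps working together, here is a compact, self-contained run on made-up data generated from y = 2x + 1. It is a sketch that mirrors the structure above in vectorized form; the function name `run_gradient_descent`, the data, and the values of alpha and num_iters are assumptions for this example.

```python
import numpy as np

def run_gradient_descent(x, y, w, b, alpha, num_iters):
    """Univariate gradient descent for f(x) = w*x + b with squared error cost."""
    m = x.shape[0]
    J_history = []
    for _ in range(num_iters):
        err = w * x + b - y                 # prediction error per example
        dj_dw = np.dot(err, x) / m          # dJ/dw
        dj_db = np.sum(err) / m             # dJ/db
        # simultaneous update: both gradients were computed from the old w, b
        w = w - alpha * dj_dw
        b = b - alpha * dj_db
        new_err = w * x + b - y
        J_history.append(np.dot(new_err, new_err) / (2 * m))
    return w, b, J_history

x_train = np.array([1.0, 2.0, 3.0, 4.0])   # made-up inputs
y_train = 2.0 * x_train + 1.0              # made-up targets on y = 2x + 1

w_final, b_final, J_hist = run_gradient_descent(
    x_train, y_train, 0.0, 0.0, alpha=0.05, num_iters=5000
)
print(w_final, b_final, J_hist[0], J_hist[-1])
```

With this learning rate the cost shrinks at every iteration and the parameters approach w = 2, b = 1, the values used to generate the data. A much larger alpha would make the updates overshoot and diverge.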