PyTorch Tutorial - Autograd
These are some notes I took while learning PyTorch; I hope they're helpful to you 😊
Autograd
This is where autograd comes in: it tracks the history of every computation. Every computed tensor in a PyTorch model carries the history of its input tensors and of the function used to create it. Combined with the fact that every PyTorch function meant to act on tensors has a built-in implementation for computing its own derivative, this greatly speeds up the computation of the local derivatives needed for learning.
Example
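A minimal sketch of such an example, assuming an illustrative computation (a sine followed by scaling and a sum); the point is that `a` is the only leaf tensor, while `b`, `c`, and `d` are intermediates:

```python
import math
import torch

# a is created directly with requires_grad=True, so it is a leaf tensor
a = torch.linspace(0., 2. * math.pi, steps=25, requires_grad=True)

# every operation below is recorded in the computation history
b = torch.sin(a)   # non-leaf: produced by an operation
c = 2 * b          # non-leaf
d = c + 1          # non-leaf
out = d.sum()      # scalar output

out.backward()     # autograd replays the recorded history backwards
print(a.grad)      # d(out)/da = 2 * cos(a), computed automatically
```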
Be aware that only leaf nodes of the computation have their gradients computed. If you tried, for example, `print(c.grad)`, you'd get back `None`. In this simple example, only the input is a leaf node, so only it has gradients computed.
Jacobian
If you have a function with an n-dimensional input and m-dimensional output, $\vec{y}=f\left(\vec{x}\right)$, the complete gradient is a matrix of the derivative of every output with respect to every input, called the Jacobian:
$$
\begin{align}J=\left(\begin{array}{ccc}
\frac{\partial y_{1}}{\partial x_{1}} & \cdots & \frac{\partial y_{1}}{\partial x_{n}}\\
\vdots & \ddots & \vdots\\
\frac{\partial y_{m}}{\partial x_{1}} & \cdots & \frac{\partial y_{m}}{\partial x_{n}}
\end{array}\right)\end{align}
$$
If you have a second function, $l=g\left(\vec{y}\right)$, that takes m-dimensional input (that is, the same dimensionality as the output above) and returns a scalar output, you can express its gradients with respect to $\vec{y}$ as a column vector, $v=\left(\begin{array}{ccc}\frac{\partial l}{\partial y_{1}} & \cdots & \frac{\partial l}{\partial y_{m}}\end{array}\right)^{T}$ - which is really just a one-column Jacobian.
More concretely, imagine the first function as your PyTorch model
(with potentially many inputs and many outputs) and the second function as a loss function
(with the model’s output as input, and the loss value as the scalar output).
If we multiply the first function’s Jacobian by the gradient of the second function, and apply the chain rule, we get:
$$
\begin{align}J^{T}\cdot v=\left(\begin{array}{ccc}
\frac{\partial y_{1}}{\partial x_{1}} & \cdots & \frac{\partial y_{m}}{\partial x_{1}}\\
\vdots & \ddots & \vdots\\
\frac{\partial y_{1}}{\partial x_{n}} & \cdots & \frac{\partial y_{m}}{\partial x_{n}}
\end{array}\right)\left(\begin{array}{c}
\frac{\partial l}{\partial y_{1}}\\
\vdots\\
\frac{\partial l}{\partial y_{m}}
\end{array}\right)=\left(\begin{array}{c}
\frac{\partial l}{\partial x_{1}}\\
\vdots\\
\frac{\partial l}{\partial x_{n}}
\end{array}\right)\end{align}
$$
Note: You could also use the equivalent operation $v^{T}\cdot J$, and get back a row vector.
The resulting column vector is the gradient of the second function with respect to the inputs of the first - or in the case of our model and loss function, the gradient of the loss with respect to the model inputs.
`torch.autograd` is an engine for computing these products. This is how we accumulate the gradients over the learning weights during the backward pass.

For this reason, the `backward()` call can also take an optional vector input. This vector represents a set of gradients over the tensor, which are multiplied by the Jacobian of the autograd-traced tensor that precedes it. Let's try a specific example with a small vector:
Hmm, this is also exactly what Example 2 further down demonstrates; you can jump ahead to it.
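Here is a minimal sketch of such an example (the tensor values and the doubling loop are illustrative assumptions, not from the original post):

```python
import torch

x = torch.randn(3, requires_grad=True)
y = x * 2
while y.data.norm() < 1000:   # keep doubling; y stays a 3-element vector
    y = y * 2

# y is non-scalar, so backward() needs a gradient vector v;
# autograd then computes the vector-Jacobian product J^T . v
v = torch.tensor([0.1, 1.0, 0.0001])
y.backward(v)

print(x.grad)   # gradient of v . y with respect to x
```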
High-Level API
There is an API on autograd that gives you direct access to important differential matrix and vector operations. In particular, it allows you to calculate the Jacobian and the Hessian matrices of a particular function for particular inputs. (The Hessian is like the Jacobian, but expresses all partial second derivatives.) It also provides methods for taking vector products with these matrices.
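A minimal sketch of these calls (the functions `exp_adder` and `cubic_sum` and their random inputs are illustrative assumptions):

```python
import torch
from torch.autograd import functional

def exp_adder(x, y):
    # simple two-input function whose Jacobian is easy to verify by hand
    return 2 * x.exp() + 3 * y

inputs = (torch.rand(3), torch.rand(3))
# Jacobian of exp_adder at `inputs`: one block per input argument
print(functional.jacobian(exp_adder, inputs))

def cubic_sum(x):
    # hessian() expects a function with a single scalar output
    return (x ** 3).sum()

print(functional.hessian(cubic_sum, torch.rand(3)))
```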
- There are also functions for computing vector-Jacobian products directly, such as `torch.autograd.functional.vjp()`; the `torch.autograd.functional.jvp()` method performs the same matrix multiplication as `vjp()` with the operands reversed (see the sketch after this list).
- I felt the key points weren't emphasized clearly enough throughout the post, so I asked an AI to produce a set of notes.
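A minimal sketch contrasting `vjp()` and `jvp()` (the elementwise-square function and the vectors below are illustrative assumptions):

```python
import torch
from torch.autograd import functional

def square(x):
    return x ** 2            # elementwise; its Jacobian is diag(2x)

x = torch.tensor([1.0, 2.0, 3.0])
v = torch.tensor([1.0, 1.0, 1.0])

# vjp: v^T . J  (vector multiplies the Jacobian from the left)
_, vjp_out = functional.vjp(square, x, v)

# jvp: J . v  (vector multiplies the Jacobian from the right)
_, jvp_out = functional.jvp(square, x, v)

print(vjp_out, jvp_out)      # both tensor([2., 4., 6.]) here, since J is diagonal
```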
AI Summary
Summary of PyTorch Autograd Features
PyTorch's autograd (automatic differentiation) is one of its core features: it automatically computes tensor gradients, supports dynamic computation graphs, and is the foundation of neural-network training. The key points and examples are summarized below:
1. Core Mechanisms of Autograd

| Feature | Description |
|---|---|
| Dynamic computation graph | Built at runtime, so control flow (loops, conditionals) is supported flexibly. |
| Automatic gradient computation | Gradients are computed via `backward()` and stored in the `.grad` attribute. |
| Leaf tensors | Tensors created directly (e.g. via `torch.tensor()`) are leaf nodes and can accumulate gradients. |
| Non-leaf tensors | Tensors produced by operations; gradients are not retained by default (unless `retain_grad()` is set explicitly). |
2. Key Functions and Methods

| Method / Class | Purpose | Example |
|---|---|---|
| `requires_grad=True` | Enable gradient tracking | `x = torch.tensor(1.0, requires_grad=True)` |
| `backward()` | Backpropagate to compute gradients | `y.backward()` |
| `grad` attribute | Stores the gradient value | `x.grad` |
| `torch.no_grad()` | Temporarily disable gradient computation | `with torch.no_grad():` |
| `detach()` | Detach a tensor to block gradient flow | `y_detached = y.detach()` |
| `torch.autograd.grad()` | Compute gradients directly (without touching `.grad`) | `grad = torch.autograd.grad(y, x)` |
3. Examples in Detail
Example 1: Basic Gradient Computation
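A minimal sketch (the function y = x² + 3x and the input value are illustrative assumptions):

```python
import torch

x = torch.tensor(2.0, requires_grad=True)   # leaf tensor with gradient tracking
y = x ** 2 + 3 * x

y.backward()        # compute dy/dx and store it in x.grad
print(x.grad)       # dy/dx = 2x + 3 = 7 at x = 2 -> tensor(7.)
```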
Example 2: Gradients of a Non-Scalar Output (the `gradient` argument is required)
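A minimal sketch, reusing the gradient vector from the summary table at the end of this post (the input values are illustrative assumptions):

```python
import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)
y = x ** 2                                   # non-scalar output: tensor([1., 4.])

# a non-scalar output needs a gradient vector v; autograd computes J^T . v
y.backward(gradient=torch.tensor([0.1, 0.01]))
print(x.grad)                                # tensor([0.2000, 0.0400])
```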
Example 3: Freezing Some Parameters (`detach()` or `requires_grad=False`)
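A minimal sketch (the tiny two-layer model is an illustrative assumption):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))

# freeze the first layer: its parameters will receive no gradients
for param in model[0].parameters():
    param.requires_grad = False

loss = model(torch.randn(2, 4)).sum()
loss.backward()

print(model[0].weight.grad)              # None: the layer is frozen
print(model[2].weight.grad is not None)  # True: still trainable

# alternatively, detach() an activation to cut the graph at that point
h = model[0](torch.randn(2, 4)).detach()  # gradients will not flow back through h
```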
Example 4: Higher-Order Derivatives (`create_graph=True`)
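A minimal sketch (y = x³ at x = 2 is an illustrative assumption):

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 3

# first derivative; create_graph=True keeps a graph of the gradient itself
(grad1,) = torch.autograd.grad(y, x, create_graph=True)
print(grad1)    # dy/dx = 3x^2 = 12

# second derivative, obtained by differentiating grad1
(grad2,) = torch.autograd.grad(grad1, x)
print(grad2)    # d2y/dx2 = 6x = 12
```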
Example 5: Zeroing and Accumulating Gradients
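A minimal sketch showing that gradients accumulate across `backward()` calls until they are cleared (in a real training loop this is what `optimizer.zero_grad()` does):

```python
import torch

x = torch.tensor(1.0, requires_grad=True)

for step in range(3):
    y = x * 2
    y.backward()
    print(x.grad)    # accumulates: 2., 4., 6.

x.grad.zero_()       # clear the accumulated gradient in place
(x * 2).backward()
print(x.grad)        # tensor(2.)
```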
4. Things to Watch Out For with Autograd

Memory management:

- By default, `backward()` frees the computation graph. If you need to backpropagate more than once through the same graph, set `retain_graph=True`. Example: `loss.backward(retain_graph=True)`.

Performance:

- During inference, use `torch.no_grad()` to disable gradient tracking and save memory:

```python
with torch.no_grad():
    output = model(input)
```

Gradient clipping:

- To guard against exploding gradients:

```python
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```

Gradients of non-leaf tensors:

- Not retained by default; call `retain_grad()` if you need them:

```python
y = x * 2
y.retain_grad()  # retain y's gradient
```
5. Advantages of the Dynamic Computation Graph

Control-flow support: the computation graph is built dynamically at runtime, so it can contain `if`, `for`, and other statements.

```python
x = torch.tensor(1.0, requires_grad=True)
if x > 0:
    y = x ** 2
else:
    y = x ** 3
y.backward()  # the branch that was taken is handled automatically
```

Flexibility: well suited to variable-length inputs (e.g. RNNs) and complex architectures (e.g. dynamic networks).
Summary

| Scenario | Operation | Code Example |
|---|---|---|
| Basic gradient computation | `backward()` | `y.backward()` |
| Non-scalar output | `backward(gradient=...)` | `y.backward(torch.tensor([0.1, 0.01]))` |
| Freezing parameters | `detach()` or `requires_grad=False` | `param.detach()` |
| Higher-order derivatives | `create_graph=True` | `torch.autograd.grad(..., create_graph=True)` |
| Avoiding gradient accumulation | `zero_grad()` | `optimizer.zero_grad()` |

Through its dynamic computation graph and automatic differentiation, PyTorch's `autograd` greatly simplifies gradient computation and is a core tool for training deep learning models.