PyTorch Tutorial: Building Models

These are some notes I took while learning PyTorch. I hope they are helpful to you 😊

torch.nn.Module & torch.nn.Parameter

In this section, we’ll be discussing some of the tools PyTorch makes available for building deep learning networks.

Except for Parameter, the classes we discuss in this section are all subclasses of torch.nn.Module. This is the PyTorch base class meant to encapsulate behaviors specific to PyTorch models and their components.

One important behavior of torch.nn.Module is registering parameters. If a particular Module subclass has learnable weights, these weights are expressed as instances of torch.nn.Parameter. The Parameter class is a subclass of torch.Tensor, with the special behavior that when a Parameter is assigned as an attribute of a Module, it is added to the list of that module’s parameters. These parameters may be accessed through the parameters() method on the Module class.

As an example, here’s a very simple model with two linear layers, an activation function, and a softmax output. We’ll create an instance of it and ask it to report on its parameters:

import torch

class TinyModel(torch.nn.Module):
    
    def __init__(self):
        super(TinyModel, self).__init__()
        
        self.linear1 = torch.nn.Linear(100, 200)
        self.activation = torch.nn.ReLU()
        self.linear2 = torch.nn.Linear(200, 10)
        self.softmax = torch.nn.Softmax()  # no dim given: PyTorch warns about this; prefer an explicit dim such as dim=-1
    
    def forward(self, x):
        x = self.linear1(x)
        x = self.activation(x)
        x = self.linear2(x)
        x = self.softmax(x)
        return x

tinymodel = TinyModel()

print('The model:')
print(tinymodel)

print('\n\nJust one layer:')
print(tinymodel.linear2)

print('\n\nModel params:')
for param in tinymodel.parameters():
    print(param)

print('\n\nLayer params:')
for param in tinymodel.linear2.parameters():
    print(param)

The model:
TinyModel(
  (linear1): Linear(in_features=100, out_features=200, bias=True)
  (activation): ReLU()
  (linear2): Linear(in_features=200, out_features=10, bias=True)
  (softmax): Softmax(dim=None)
)

Just one layer:
Linear(in_features=200, out_features=10, bias=True)

Model params:
Parameter containing:
tensor([[-0.0897,  0.0595,  0.0172,  ...,  0.0947, -0.0384, -0.0024],
        [ 0.0102, -0.0393, -0.0865,  ...,  0.0961,  0.0694,  0.0555],
        [-0.0251, -0.0372,  0.0264,  ...,  0.0535, -0.0535,  0.0745],
        ...,
        [-0.0554, -0.0434, -0.0032,  ..., -0.0441, -0.0671,  0.0100],
        [ 0.0469, -0.0174,  0.0883,  ..., -0.0825, -0.0478,  0.0232],
        [ 0.0877, -0.0416, -0.0567,  ..., -0.0455, -0.0185, -0.0559]],
       requires_grad=True)
Parameter containing:
tensor([ 0.0692, -0.0213,  0.0033,  0.0528,  0.0394, -0.0518, -0.0535, -0.0269,
         0.0172, -0.0897,  0.0809,  0.0125, -0.0566, -0.0490, -0.0566,  0.0478,
        -0.0488,  0.0989, -0.0641, -0.0068,  0.0420,  0.0358,  0.0186,  0.0748,
        -0.0308,  0.0472,  0.0568,  0.0026, -0.0920, -0.0553,  0.0737,  0.0881,
        -0.0992,  0.0300, -0.0234, -0.0443,  0.0221, -0.0552, -0.0067,  0.0612,
         0.0281, -0.0199, -0.0818,  0.0608,  0.0975, -0.0069,  0.0923, -0.0741,
         0.0516, -0.0787, -0.0593, -0.0303,  0.0115,  0.0701, -0.0171,  0.0291,
         0.0152,  0.0424, -0.0106, -0.0568,  0.0689,  0.0308,  0.0863, -0.0436,
         0.0061,  0.0822, -0.0556, -0.0668,  0.0828,  0.0758,  0.0888, -0.0535,
         0.0648,  0.0160, -0.0932,  0.0787,  0.0546, -0.0973,  0.0973,  0.0908,
         0.0108, -0.0090,  0.0644,  0.0990,  0.0384,  0.0852,  0.0864,  0.0565,
        -0.0974,  0.0768,  0.0337,  0.0590, -0.0362,  0.0914,  0.0038,  0.0516,
        -0.0632, -0.0569, -0.0475, -0.0564, -0.0192,  0.0279, -0.0243, -0.0621,
        -0.0559,  0.0921, -0.0583, -0.0508,  0.0401,  0.0414, -0.0770, -0.0378,
        -0.0786, -0.0110, -0.0289, -0.0778,  0.0427, -0.0105,  0.0680,  0.0146,
        -0.0859,  0.0440, -0.0420,  0.0613,  0.0321,  0.0289,  0.0668, -0.0028,
        -0.0421, -0.0372,  0.0391,  0.0479, -0.0232, -0.0610, -0.0355, -0.0896,
         0.0864,  0.0345, -0.0252, -0.0385,  0.0832,  0.0868, -0.0514,  0.0178,
         0.0716,  0.0796, -0.0794, -0.0538, -0.0163, -0.0929, -0.0643,  0.0782,
        -0.0047,  0.0024, -0.0610, -0.0259,  0.0719,  0.0840,  0.0946, -0.0291,
         0.0131, -0.0157,  0.0309, -0.0375, -0.0800, -0.0594, -0.0233, -0.0928,
        -0.0028, -0.0729,  0.0889, -0.0377, -0.0685,  0.0974, -0.0860, -0.0819,
        -0.0918, -0.0750, -0.0327, -0.0245, -0.0058, -0.0875, -0.0667, -0.0569,
         0.0075,  0.0986,  0.0977, -0.0291,  0.0081,  0.0127,  0.0544,  0.0711,
         0.0910,  0.0522, -0.0874, -0.0217,  0.0454, -0.0726,  0.0791, -0.0459],
       requires_grad=True)
Parameter containing:
tensor([[-0.0508,  0.0529,  0.0234,  ..., -0.0385,  0.0078, -0.0030],
        [ 0.0281,  0.0437, -0.0461,  ..., -0.0655, -0.0253, -0.0222],
        [ 0.0243,  0.0178, -0.0009,  ...,  0.0383, -0.0507, -0.0083],
        ...,
        [-0.0700, -0.0090,  0.0153,  ...,  0.0161,  0.0610,  0.0687],
        [-0.0509, -0.0291, -0.0591,  ...,  0.0173, -0.0191, -0.0705],
        [-0.0090,  0.0428, -0.0528,  ...,  0.0278, -0.0153, -0.0266]],
       requires_grad=True)
Parameter containing:
tensor([-0.0357, -0.0617,  0.0027, -0.0098, -0.0083, -0.0461, -0.0076,  0.0510,
        -0.0564,  0.0298], requires_grad=True)

Layer params:
Parameter containing:
tensor([[-0.0508,  0.0529,  0.0234,  ..., -0.0385,  0.0078, -0.0030],
        [ 0.0281,  0.0437, -0.0461,  ..., -0.0655, -0.0253, -0.0222],
        [ 0.0243,  0.0178, -0.0009,  ...,  0.0383, -0.0507, -0.0083],
        ...,
        [-0.0700, -0.0090,  0.0153,  ...,  0.0161,  0.0610,  0.0687],
        [-0.0509, -0.0291, -0.0591,  ...,  0.0173, -0.0191, -0.0705],
        [-0.0090,  0.0428, -0.0528,  ...,  0.0278, -0.0153, -0.0266]],
       requires_grad=True)
Parameter containing:
tensor([-0.0357, -0.0617,  0.0027, -0.0098, -0.0083, -0.0461, -0.0076,  0.0510,
        -0.0564,  0.0298], requires_grad=True)

This shows the fundamental structure of a PyTorch model: there is an __init__() method that defines the layers and other components of a model, and a forward() method where the computation gets done. Note that we can print the model, or any of its submodules, to learn about its structure.
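To make the parameter registration described earlier concrete, here is a minimal sketch. The ScaleShift class and its attribute names are hypothetical and exist only for this illustration; the point is that attributes wrapped in torch.nn.Parameter show up in named_parameters(), while a plain tensor attribute does not.

```python
import torch


class ScaleShift(torch.nn.Module):
    """Hypothetical module used only to illustrate parameter registration."""

    def __init__(self, size):
        super().__init__()
        # Wrapped in Parameter: registered with the module automatically.
        self.scale = torch.nn.Parameter(torch.ones(size))
        self.shift = torch.nn.Parameter(torch.zeros(size))
        # A plain tensor attribute is NOT registered as a parameter.
        self.not_a_param = torch.zeros(size)

    def forward(self, x):
        return x * self.scale + self.shift


module = ScaleShift(3)
param_names = [name for name, _ in module.named_parameters()]
print(param_names)
print(all(p.requires_grad for p in module.parameters()))
```

Only scale and shift are listed, so an optimizer built from module.parameters() would update those two tensors and leave not_a_param alone.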

Common Layer Types

Linear Layers

The most basic type of neural network layer is a linear or fully connected layer. This is a layer where every input influences every output of the layer to a degree specified by the layer’s weights. If a layer has m inputs and n outputs, PyTorch stores the weights as an n x m matrix (out_features x in_features), plus a bias vector of length n. For example:

lin = torch.nn.Linear(3, 2)
x = torch.rand(1, 3)
print('Input:')
print(x)

print('\n\nWeight and Bias parameters:')
for param in lin.parameters():
    print(param)

y = lin(x)
print('\n\nOutput:')
print(y)

Input:
tensor([[0.2807, 0.5842, 0.7967]])

Weight and Bias parameters:
Parameter containing:
tensor([[-0.1719,  0.4691, -0.0654],
        [-0.2522,  0.5453, -0.5438]], requires_grad=True)
Parameter containing:
tensor([0.2956, 0.2001], requires_grad=True)

Output:
tensor([[0.4693, 0.0146]], grad_fn=<AddmmBackward>)

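Note that a Parameter has autograd enabled by default (requires_grad=True), which is why the weights and bias above report that they are tracking gradients.

Linear layers are used widely in deep learning models. One of the most common places you’ll see them is in classifier models.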
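As a quick check of what a linear layer computes, the sketch below rebuilds the output by hand as x W^T + b and compares it with the layer's own result. The seed is only there to make the run repeatable; the variable name manual is an assumption of this sketch.

```python
import torch

torch.manual_seed(0)  # only to make the sketch repeatable

lin = torch.nn.Linear(3, 2)
x = torch.rand(1, 3)

# The weight is stored as (out_features, in_features), i.e. 2 x 3 here.
print(lin.weight.shape)

# Recompute the layer output by hand: y = x W^T + b
manual = x @ lin.weight.T + lin.bias
y = lin(x)
print(torch.allclose(y, manual, atol=1e-6))

# Parameters track gradients by default; a plain tensor does not.
print(lin.weight.requires_grad, x.requires_grad)
```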

Convolutional Layers

Convolutional layers are built to handle data with a high degree of spatial correlation. They are very commonly used in computer vision, where they detect close groupings of features which they compose into higher-level features. They pop up in other contexts too - for example, in NLP applications, where a word’s immediate context (that is, the other words nearby in the sequence) can affect the meaning of a sentence.

import torch.nn.functional as F

class LeNet(torch.nn.Module):

    def __init__(self):
        super(LeNet, self).__init__()
        # 1 input image channel (black & white), 6 output channels, 5x5 square convolution
        # kernel
        self.conv1 = torch.nn.Conv2d(1, 6, 5)
        self.conv2 = torch.nn.Conv2d(6, 16, 3)
        # an affine operation: y = Wx + b
        self.fc1 = torch.nn.Linear(16 * 6 * 6, 120)  # 6*6 from image dimension
        self.fc2 = torch.nn.Linear(120, 84)
        self.fc3 = torch.nn.Linear(84, 10)

    def forward(self, x):
        # Max pooling over a (2, 2) window
        x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
        # If the window is square you can specify just a single number
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = x.view(-1, self.num_flat_features(x))
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

    def num_flat_features(self, x):
        size = x.size()[1:]  # all dimensions except the batch dimension
        num_features = 1
        for s in size:
            num_features *= s
        return num_features

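The first argument to the convolutional layer constructor is the number of input channels, the second is the number of output features (output channels), and the third is the window or kernel size.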

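The notes below walk through how the convolutional layer, the ReLU activation function, and the max pooling layer of a convolutional neural network (CNN) process their input, and what each one is for. Step by step: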

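1. Output of the convolutional layer

Input assumption: suppose the input is a single-channel (grayscale) 32x32 image, passed through the first convolutional layer:
self.conv1 = torch.nn.Conv2d(1, 6, 5) # 1 input channel, 6 output channels, 5x5 kernel
Output size:
The output size after a convolution is given by:
$$
\text{output size} = \left\lfloor \frac{\text{input size} - \text{kernel size} + 2 \times \text{padding}}{\text{stride}} \right\rfloor + 1
$$
By default padding=0 (no padding) and stride=1, so: 32 - 5 + 1 = 28
- Output tensor shape: [batch_size, 6, 28, 28]
  (6 channels, each with a 28x28 activation map).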
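The output-size formula above can be sketched as a small helper and compared against what Conv2d actually produces. The function name conv_output_size is hypothetical, and the sketch assumes a dilation of 1, as the formula does.

```python
import torch


def conv_output_size(input_size, kernel_size, padding=0, stride=1):
    """Output length along one spatial dimension (dilation assumed to be 1)."""
    return (input_size - kernel_size + 2 * padding) // stride + 1


conv1 = torch.nn.Conv2d(1, 6, 5)
x = torch.rand(1, 1, 32, 32)  # one grayscale 32x32 image
out = conv1(x)

print(conv_output_size(32, 5))  # value predicted by the formula
print(out.shape)                # shape the layer actually produces
```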

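2. What the ReLU activation function does

`F.relu(self.conv1(x))`

ReLU (Rectified Linear Unit) is defined as $\text{ReLU}(x) = \max(0, x)$.
What it does:
1. Introduces non-linearity: lets the model learn complex non-linear relationships.
2. Sparse activation: sets negative values to zero and keeps positive values, which makes the activations sparser.
3. Mitigates vanishing gradients: unlike Sigmoid/Tanh, the gradient of ReLU is exactly 1 for positive inputs, which helps avoid vanishing gradients.
Output shape: same as the input, still [batch_size, 6, 28, 28].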

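3. Details of the max pooling layer

`F.max_pool2d(..., (2, 2)) # 2x2 pooling window`

Purpose: reduce the spatial dimensions (downsampling), which cuts computation and improves translation invariance.
How it works:
- Split the input activation map into non-overlapping 2x2 regions.
- Take the maximum of each region as the output.
- The stride defaults to the pooling window size (here stride=2), so the output size is halved.
Worked example:
- Input size: [batch_size, 6, 28, 28].
- Output size:
  $\left\lfloor \frac{28 - 2}{2} \right\rfloor + 1 = 14$
- Output tensor shape: [batch_size, 6, 14, 14].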

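Why take the maximum?

Keeps the most salient feature: the maximum is the strongest activation in the region, which helps preserve important features (such as edges and textures).
Suppresses noise: non-maximal values are ignored, which reduces the influence of noise.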

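4. An intuitive view of the shape changes

| Operation | Input shape | Output shape | Key role |
| --- | --- | --- | --- |
| Convolution (conv1) | `[1, 1, 32, 32]` | `[1, 6, 28, 28]` | Extracts local features, increases the channel count |
| ReLU | `[1, 6, 28, 28]` | `[1, 6, 28, 28]` | Introduces non-linearity, zeroes out negative values |
| Max pooling | `[1, 6, 28, 28]` | `[1, 6, 14, 14]` | Lowers resolution, improves robustness |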
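The shape changes listed above can be reproduced directly. This is a minimal sketch that runs one random 32x32 grayscale image through a standalone copy of the first convolution, ReLU, and 2x2 max pooling; the intermediate variable names are assumptions of the sketch.

```python
import torch
import torch.nn.functional as F

conv1 = torch.nn.Conv2d(1, 6, 5)
x = torch.rand(1, 1, 32, 32)

after_conv = conv1(x)
after_relu = F.relu(after_conv)
after_pool = F.max_pool2d(after_relu, (2, 2))

for name, t in [("input", x), ("conv1", after_conv),
                ("relu", after_relu), ("max_pool", after_pool)]:
    print(f"{name:>8}: {tuple(t.shape)}")
```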

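5. Why are these steps needed?

Convolutional layer:
- Extracts spatial features (such as edges and corners) through local receptive fields.
- Uses multiple kernels (channels) to capture different feature patterns.
ReLU:
- Overcomes the limitations of a purely linear model, so the network can fit complex functions.
Pooling layer:
- Shrinks the feature maps, so later layers need fewer parameters, which helps limit overfitting.
- Makes the model more robust to small translations or deformations of the input ("approximate invariance").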

There are convolutional layers for addressing 1D, 2D, and 3D tensors. There are also many more optional arguments for a conv layer constructor, including stride length (e.g., only scanning every second or every third position in the input), padding (so you can scan out to the edges of the input), and more. See the documentation for more information.
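As a rough illustration of the optional arguments mentioned above, the sketch below shows how padding and stride change the output size, and what a 1D convolution looks like. The layer sizes and variable names are arbitrary choices for this example.

```python
import torch

x = torch.rand(1, 1, 32, 32)

same_size = torch.nn.Conv2d(1, 6, 5, padding=2)  # padding of 2 keeps 32x32 with a 5x5 kernel
strided = torch.nn.Conv2d(1, 6, 5, stride=2)     # stride 2 scans every second position
conv1d = torch.nn.Conv1d(4, 8, kernel_size=3)    # 1D variant, e.g. for sequences

same_out = same_size(x)
strided_out = strided(x)
seq_out = conv1d(torch.rand(1, 4, 10))           # (batch, channels, length)

print(same_out.shape)
print(strided_out.shape)
print(seq_out.shape)
```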

Recurrent Layers

Recurrent neural networks (or RNNs) are used for sequential data - anything from time-series measurements from a scientific instrument to natural language sentences to DNA nucleotides. An RNN does this by maintaining a hidden state that acts as a sort of memory for what it has seen in the sequence so far.
The internal structure of an RNN layer - or its variants, the LSTM (long short-term memory) and GRU (gated recurrent unit) - is moderately complex and beyond the scope of these notes, but we’ll show you what one looks like in action with an LSTM-based part-of-speech tagger (a type of classifier that tells you if a word is a noun, verb, etc.):

class LSTMTagger(torch.nn.Module):

    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size):
        super(LSTMTagger, self).__init__()
        self.hidden_dim = hidden_dim

        self.word_embeddings = torch.nn.Embedding(vocab_size, embedding_dim)

        # The LSTM takes word embeddings as inputs, and outputs hidden states
        # with dimensionality hidden_dim.
        self.lstm = torch.nn.LSTM(embedding_dim, hidden_dim)

        # The linear layer that maps from hidden state space to tag space
        self.hidden2tag = torch.nn.Linear(hidden_dim, tagset_size)

    def forward(self, sentence):
        embeds = self.word_embeddings(sentence)
        lstm_out, _ = self.lstm(embeds.view(len(sentence), 1, -1))
        tag_space = self.hidden2tag(lstm_out.view(len(sentence), -1))
        tag_scores = F.log_softmax(tag_space, dim=1)
        return tag_scores

The constructor has four arguments:
- vocab_size is the number of words in the input vocabulary. Each word is a one-hot vector (or unit vector) in a vocab_size-dimensional space.
- tagset_size is the number of tags in the output set.
- embedding_dim is the size of the embedding space for the vocabulary. An embedding maps a vocabulary onto a low-dimensional space, where words with similar meanings are close together in the space.
- hidden_dim is the size of the LSTM’s memory.

The input will be a sentence with the words represented as indices of one-hot vectors. The embedding layer will then map these down to an embedding_dim-dimensional space. The LSTM takes this sequence of embeddings and iterates over it, producing an output vector of length hidden_dim for each word. The final linear layer acts as a classifier; applying log_softmax() to its output converts the scores into normalized log-probabilities that a given word maps to a given tag.
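To see the data flow described above without training anything, here is a sketch that pushes one made-up sentence through the same three layers as standalone objects. All sizes and the word indices are hypothetical, chosen only to keep the shapes easy to read.

```python
import torch
import torch.nn.functional as F

# Hypothetical sizes chosen only for this sketch.
vocab_size, embedding_dim, hidden_dim, tagset_size = 10, 6, 8, 3

word_embeddings = torch.nn.Embedding(vocab_size, embedding_dim)
lstm = torch.nn.LSTM(embedding_dim, hidden_dim)
hidden2tag = torch.nn.Linear(hidden_dim, tagset_size)

sentence = torch.tensor([1, 4, 2, 7, 3])  # five made-up word indices

embeds = word_embeddings(sentence)                        # (5, embedding_dim)
lstm_out, _ = lstm(embeds.view(len(sentence), 1, -1))     # (5, 1, hidden_dim)
tag_space = hidden2tag(lstm_out.view(len(sentence), -1))  # (5, tagset_size)
tag_scores = F.log_softmax(tag_space, dim=1)              # log-probabilities per word

print(embeds.shape, lstm_out.shape, tag_scores.shape)
# exp() of the log-probabilities sums to 1 for each word
print(tag_scores.exp().sum(dim=1))
```

Taking tag_scores.argmax(dim=1) would give one predicted tag index per word; with untrained weights those predictions are of course meaningless.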

If you’d like to see this network in action, check out the Sequence Models and LSTM Networks tutorial on pytorch.org.

Transformers

Transformers are multi-purpose networks that have taken over the state of the art in NLP with models like BERT. A discussion of transformer architecture is beyond the scope of these notes, but PyTorch has a Transformer class that allows you to define the overall parameters of a transformer model - the number of attention heads, the number of encoder & decoder layers, dropout and activation functions, etc. (You can even build the BERT model from this single class, with the right parameters!) The torch.nn module also has classes to encapsulate the individual components (TransformerEncoder, TransformerDecoder) and subcomponents (TransformerEncoderLayer, TransformerDecoderLayer). For details, check out the documentation on transformer classes, and the relevant tutorial on pytorch.org.
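As a small illustration of the constructor parameters mentioned above, the sketch below builds a deliberately tiny torch.nn.Transformer and runs random tensors through it. The sizes are arbitrary assumptions, far smaller than any real model, and the inputs are random rather than real embeddings.

```python
import torch

# Small hypothetical sizes so the sketch runs quickly.
model = torch.nn.Transformer(
    d_model=32,            # embedding size of each token
    nhead=4,               # number of attention heads
    num_encoder_layers=2,
    num_decoder_layers=2,
    dim_feedforward=64,
    dropout=0.1,
)

# Default layout is (sequence length, batch size, d_model).
src = torch.rand(10, 2, 32)
tgt = torch.rand(7, 2, 32)
out = model(src, tgt)
print(out.shape)  # same shape as tgt
```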

Other Layers and Functions

Data Manipulation Layers

There are other layer types that perform important functions in models, but don’t participate in the learning process themselves.

Max pooling

Max pooling reduces a tensor by combining cells and assigning the maximum value of the input cells to the output cell (its twin, min pooling, assigns the minimum instead). We saw this in the LeNet model above. For example:

my_tensor = torch.rand(1, 6, 6)
print(my_tensor)
maxpool_layer = torch.nn.MaxPool2d(3)
print(maxpool_layer(my_tensor))

tensor([[[0.8160, 0.1406, 0.5950, 0.0883, 0.5464, 0.3993],
         [0.0623, 0.6626, 0.3991, 0.4878, 0.7548, 0.2426],
         [0.9081, 0.4207, 0.8590, 0.3784, 0.6931, 0.5609],
         [0.6182, 0.8588, 0.3766, 0.9734, 0.9662, 0.9880],
         [0.0599, 0.8338, 0.6750, 0.0829, 0.3554, 0.3998],
         [0.6159, 0.7129, 0.8945, 0.8717, 0.9930, 0.9059]]])
tensor([[[0.9081, 0.7548],
         [0.8945, 0.9930]]])

If you look closely at the values above, you’ll see that each of the values in the maxpooled output is the maximum value of each quadrant of the 6x6 input.
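The quadrant claim can be checked by hand. This sketch splits the 6x6 input into a 2x2 grid of 3x3 blocks and takes the maximum of each block, then compares the result with the pooling layer; the names blocks and manual are assumptions of the sketch.

```python
import torch

my_tensor = torch.rand(1, 6, 6)
pooled = torch.nn.MaxPool2d(3)(my_tensor)

# Split each 6x6 map into a 2x2 grid of 3x3 blocks, then take each block's max.
blocks = my_tensor.reshape(1, 2, 3, 2, 3)
manual = blocks.amax(dim=(2, 4))

print(pooled.shape)
print(torch.equal(pooled, manual))
```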

Normalization layers

Normalization layers re-center and normalize the output of one layer before feeding it to another. Centering and scaling the intermediate tensors has a number of beneficial effects, such as letting you use higher learning rates without exploding/vanishing gradients.

my_tensor = torch.rand(1, 4, 4) * 20 + 5
print(my_tensor)
print(my_tensor.mean())
norm_layer = torch.nn.BatchNorm1d(4)
normed_tensor = norm_layer(my_tensor)
print(normed_tensor)
print(normed_tensor.mean())

tensor([[[18.0634,  5.6720,  5.7805, 12.3243],
         [ 9.3712, 19.7366,  6.4853, 22.8629],
         [14.6223, 21.5803, 17.8267, 20.3997],
         [21.7664,  5.0936, 19.5952, 11.8554]]])
tensor(14.5647)
tensor([[[ 1.4762, -0.9296, -0.9086,  0.3619],
         [-0.7650,  0.7475, -1.1862,  1.2037],
         [-1.4918,  1.1130, -0.2922,  0.6710],
         [ 1.0893, -1.4371,  0.7603, -0.4125]]],
       grad_fn=<NativeBatchNormBackward>)
tensor(1.3039e-08, grad_fn=<MeanBackward0>)

In the cell above, we’ve added a large scaling factor and offset to an input tensor; you should see the input tensor’s mean() somewhere in the neighborhood of 15. After running it through the normalization layer, you can see that the values are smaller and grouped around zero - in fact, the mean should be very close to zero (on the order of 1e-8 in the run shown above).

This is beneficial because many activation functions (discussed below) have their strongest gradients near 0, but sometimes suffer from vanishing or exploding gradients for inputs that drive them far away from zero. Keeping the data centered around the area of steepest gradient will tend to mean faster, better learning and higher feasible learning rates.
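To show what the normalization layer computes, here is a sketch that reproduces the BatchNorm1d output by hand. It assumes a freshly constructed layer in training mode, so the learnable scale is 1 and the shift is 0, and the statistics are taken per channel over the batch and length dimensions.

```python
import torch

my_tensor = torch.rand(1, 4, 4) * 20 + 5  # shape (batch, channels, length)
norm_layer = torch.nn.BatchNorm1d(4)      # freshly built: training mode, weight=1, bias=0
normed = norm_layer(my_tensor)

# Per-channel mean and (biased) variance over the batch and length dimensions.
mean = my_tensor.mean(dim=(0, 2), keepdim=True)
var = ((my_tensor - mean) ** 2).mean(dim=(0, 2), keepdim=True)
manual = (my_tensor - mean) / torch.sqrt(var + norm_layer.eps)

print(torch.allclose(normed, manual, atol=1e-4))
print(normed.mean(dim=(0, 2)))  # each channel is centered near zero
```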

Dropout layers

Dropout layers are a tool for encouraging sparse representations in your model - that is, pushing it to do inference with less data.

Dropout layers work by randomly setting parts of the input tensor to zero during training - dropout layers are always turned off for inference (that is, when the module is in eval() mode). This forces the model to learn against this masked or reduced dataset. For example:

my_tensor = torch.rand(1, 4, 4)
dropout = torch.nn.Dropout(p=0.4)
print(dropout(my_tensor))
print(dropout(my_tensor))

tensor([[[0.0000, 1.1702, 0.5911, 0.0000],
         [0.1932, 1.4928, 1.2912, 0.0000],
         [0.1236, 1.3672, 0.0000, 0.0000],
         [0.0000, 0.0000, 0.0000, 0.0000]]])
tensor([[[1.5033, 1.1702, 0.5911, 0.9341],
         [0.0000, 0.0000, 0.0000, 1.5020],
         [0.1236, 1.3672, 0.0000, 0.0000],
         [0.4993, 0.9576, 0.0000, 1.6664]]])

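Above, you can see the effect of dropout on a sample tensor. The surviving values are scaled up by 1 / (1 - p), which is why some outputs exceed 1 even though the input came from torch.rand(). You can use the optional p argument to set the probability of an individual element being zeroed; if you don’t, it defaults to 0.5.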
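The difference between training and inference behavior is easier to see on a tensor of ones. In this sketch, every output element in training mode is either 0 or 1 / (1 - p), and in eval mode the layer passes the input through unchanged; the variable names are assumptions of the sketch.

```python
import torch

p = 0.4
dropout = torch.nn.Dropout(p=p)
ones = torch.ones(1, 4, 4)

dropout.train()            # training mode: elements are zeroed at random
train_out = dropout(ones)
print(train_out)           # entries are either 0 or 1 / (1 - p)

dropout.eval()             # evaluation mode: dropout is a no-op
eval_out = dropout(ones)
print(torch.equal(eval_out, ones))
```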

Activation Functions

Activation functions make deep learning possible. A neural network is really a program - with many parameters - that simulates a mathematical function. If all we did was multiply tensors by layer weights repeatedly, we could only simulate linear functions; further, there would be no point to having many layers, as the whole network could be reduced to a single matrix multiplication. Inserting non-linear activation functions between layers is what allows a deep learning model to simulate a far wider class of functions, rather than just linear ones.
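The claim that stacked linear layers collapse into one can be checked numerically. In this sketch, two Linear layers with no activation between them are compared with a single combined weight and bias; the layer sizes are arbitrary, and the seed is only there to make the run repeatable.

```python
import torch

torch.manual_seed(0)  # only to make the sketch repeatable

lin1 = torch.nn.Linear(4, 5)
lin2 = torch.nn.Linear(5, 3)
x = torch.rand(2, 4)

# Two stacked linear layers with no activation in between...
stacked = lin2(lin1(x))

# ...equal one linear map with a combined weight and bias.
W = lin2.weight @ lin1.weight            # shape (3, 4)
b = lin2.weight @ lin1.bias + lin2.bias  # shape (3,)
collapsed = x @ W.T + b
print(torch.allclose(stacked, collapsed, atol=1e-5))

# With a ReLU in between, the collapsed map generally no longer matches.
with_relu = lin2(torch.relu(lin1(x)))
print(torch.allclose(with_relu, collapsed, atol=1e-5))
```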

The torch.nn module has Module subclasses encapsulating all of the major activation functions including ReLU and its many variants, Tanh, Hardtanh, sigmoid, and more. It also includes other functions, such as Softmax, that are most useful at the output stage of a model.

Loss Functions

Loss functions tell us how far a model’s prediction is from the correct answer. PyTorch contains a variety of loss functions, including the common MSE (mean squared error, i.e. the mean of the squared differences), Cross Entropy Loss and Negative Log Likelihood Loss (useful for classifiers), and others.
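A short sketch of the loss functions named above, using made-up predictions and labels. It also shows one relationship worth knowing: CrossEntropyLoss on raw scores gives the same value as NLLLoss applied to the log_softmax of those scores.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)  # only to make the sketch repeatable

# Regression-style loss: mean of the squared differences.
pred = torch.tensor([1.0, 2.0, 4.0])
target = torch.tensor([1.0, 3.0, 2.0])
mse = torch.nn.MSELoss()(pred, target)
print(mse)  # (0 + 1 + 4) / 3

# Classification: CrossEntropyLoss takes raw scores (logits) and class indices.
logits = torch.randn(4, 3)  # 4 samples, 3 classes
labels = torch.tensor([0, 2, 1, 2])
ce = torch.nn.CrossEntropyLoss()(logits, labels)

# It matches NLLLoss applied to log_softmax of the same logits.
nll = torch.nn.NLLLoss()(F.log_softmax(logits, dim=1), labels)
print(torch.allclose(ce, nll, atol=1e-6))
```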