The steps for doing natural language processing with PyTorch on Linux are as follows:
Install the base environment
Install Python and pip (Ubuntu/Debian shown; use the equivalent command for your distribution):
sudo apt-get install python3 python3-pip
Create a virtual environment and activate it:
python3 -m venv myenv
source myenv/bin/activate
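Once activated, pip installs packages into the virtual environment rather than the system Python. To confirm which interpreter is active:
which python
python --version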
Install PyTorch and NLP libraries
For a CPU-only build:
pip install torch torchvision torchaudio
Or for a CUDA build (replace cu118 with your actual CUDA version):
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
Then install the NLP libraries and download the spaCy English model:
pip install transformers torchtext spacy
python -m spacy download en_core_web_sm
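Before going further, a quick sanity check (not part of the original steps) prints the installed PyTorch version and whether CUDA is visible:
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"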
Data preprocessing
Use a tokenizer from the transformers library (here BERT) to process text:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
encoded = tokenizer("Hello, world!", return_tensors='pt')  # convert to PyTorch tensors
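The returned encoded object behaves like a dict of tensors; for a BERT tokenizer it includes input_ids and attention_mask:
print(encoded['input_ids'])       # token-id tensor, shape (1, sequence_length)
print(encoded['attention_mask'])  # 1 for real tokens, 0 for padding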
Use torchtext to load the dataset and prepare batches (IMDB as an example; building the vocabulary itself is sketched below):
from torchtext.datasets import IMDB
from torchtext.data.utils import get_tokenizer

tokenizer = get_tokenizer("spacy", language="en_core_web_sm")  # the spaCy model downloaded earlier
train_iter, test_iter = IMDB(split=("train", "test"))
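The code above does not yet build the vocabulary it mentions. A minimal sketch with torchtext's build_vocab_from_iterator, reusing the spaCy tokenizer defined above (the yield_tokens helper, the <unk>/<pad> specials, and text_pipeline are illustrative choices, not part of the original):
from torchtext.vocab import build_vocab_from_iterator

def yield_tokens(data_iter):
    for _, text in data_iter:  # IMDB yields (label, text) pairs
        yield tokenizer(text)

vocab = build_vocab_from_iterator(yield_tokens(IMDB(split="train")),
                                  specials=["<unk>", "<pad>"])
vocab.set_default_index(vocab["<unk>"])  # map out-of-vocabulary words to <unk>

def text_pipeline(text):  # raw text -> list of token ids
    return vocab(tokenizer(text))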
Build the model
Option 1: a simple LSTM classifier:
import torch.nn as nn

class TextClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_class):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_class)

    def forward(self, text):
        embedded = self.embedding(text)       # (batch, seq_len, embed_dim)
        _, (hidden, _) = self.lstm(embedded)  # hidden: (1, batch, hidden_dim)
        return self.fc(hidden.squeeze(0))     # logits: (batch, num_class)
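Instantiating it with the vocabulary built earlier (the dimensions are arbitrary example values):
model = TextClassifier(vocab_size=len(vocab), embed_dim=100,
                       hidden_dim=128, num_class=2)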
Option 2: fine-tune a pretrained BERT model instead:
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
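This variant consumes the tokenizer output from the preprocessing step and returns an output object whose logits field holds the class scores, for example:
outputs = model(**encoded)   # encoded from the tokenizer example above
print(outputs.logits.shape)  # torch.Size([1, 2])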
Training and evaluation
Define the loss and optimizer, then run a standard training loop. The loop below is written for the LSTM classifier; for BERT, pass the tokenized batch and read outputs.logits, and prefer a much smaller learning rate (e.g. 2e-5) for fine-tuning:
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

for epoch in range(5):
    model.train()
    for texts, labels in train_loader:
        optimizer.zero_grad()
        outputs = model(texts)             # logits: (batch, num_class)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
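The loop assumes a train_loader yielding padded batches, which the original does not define. A minimal sketch using the vocab and text_pipeline from the vocabulary sketch above (the collate function, batch size, and mapping of IMDB's 1/2 labels to 0/1 are illustrative assumptions), followed by a simple accuracy evaluation for the LSTM classifier:
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

def collate_batch(batch):
    labels, texts = [], []
    for label, text in batch:
        labels.append(label - 1)  # IMDB labels are 1/2 -> 0/1
        texts.append(torch.tensor(text_pipeline(text), dtype=torch.long))
    texts = pad_sequence(texts, batch_first=True, padding_value=vocab["<pad>"])
    return texts, torch.tensor(labels, dtype=torch.long)

train_loader = DataLoader(list(IMDB(split="train")), batch_size=32,
                          shuffle=True, collate_fn=collate_batch)
test_loader = DataLoader(list(IMDB(split="test")), batch_size=32,
                         collate_fn=collate_batch)

model.eval()
correct = total = 0
with torch.no_grad():
    for texts, labels in test_loader:
        preds = model(texts).argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.size(0)
print(f"test accuracy: {correct / total:.3f}")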
Save and load the model
For the Hugging Face model and tokenizer:
# save
model.save_pretrained('./my_model')
tokenizer.save_pretrained('./my_model')

# load
from transformers import BertForSequenceClassification, BertTokenizer
model = BertForSequenceClassification.from_pretrained('./my_model')
tokenizer = BertTokenizer.from_pretrained('./my_model')
Note that save_pretrained/from_pretrained exist only on Hugging Face classes; a plain PyTorch module such as the LSTM classifier is saved with torch.save(model.state_dict(), 'model.pt') and restored with model.load_state_dict(torch.load('model.pt')).
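A quick inference check after reloading (illustrative; assumes the binary sentiment labels used above):
import torch

inputs = tokenizer("This movie was great!", return_tensors='pt')
model.eval()
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.argmax(dim=1).item())  # predicted class index, 0 or 1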