LSTM in PyTorch: Sentiment Classification (IMDB)
Binary classification of movie reviews with a bidirectional LSTM in PyTorch: tokenization, embedding, training and evaluation.
LSTM in PyTorch for sentiment classification (IMDB)
In this notebook we will build, end to end, a sentiment classifier for movie reviews using an LSTM architecture in PyTorch. The goal is for you to understand both the practical side (code) and the mathematical intuition behind it, in line with the theory from the LSTM submodule:
- why simple RNNs struggle with long-range dependencies,
- how the LSTM memory cell helps mitigate the vanishing gradient,
- and how to turn all of this into a real NLP pipeline.
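Before the theory, a tiny numeric sketch of the first bullet (the numbers are illustrative, not from the IMDB model). In a scalar tanh RNN, the gradient with respect to a hidden state seen \(T\) steps earlier is a product of \(T\) factors \(w \cdot (1 - h_t^2)\), each of magnitude below 1, so it collapses exponentially:

```python
import numpy as np

# Illustrative scalar tanh RNN: h_t = tanh(w * h_{t-1} + x_t).
# The gradient dh_T/dh_0 is a product of T factors w * (1 - h_t^2),
# each <= |w| < 1, so it vanishes exponentially with depth.
rng = np.random.default_rng(0)
T, w = 50, 0.9
h, grad = 0.0, 1.0
for _ in range(T):
    h = np.tanh(w * h + rng.normal())
    grad *= w * (1 - h**2)  # one Jacobian factor per time step
print(f"|dh_T/dh_0| after {T} steps: {abs(grad):.2e}")
```

The LSTM cell state replaces this multiplicative chain with an additive, gated one, which is exactly what the equations in the next section formalize.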
1) Notebook objective
We want to solve a binary text classification task:
- Input: a movie review.
- Output: label 0 (negative) or 1 (positive).
We will use the IMDB dataset (Hugging Face Datasets) and train a model with this structure:
- Tokenization + vocabulary (text → integer indices).
- Embedding (nn.Embedding) to map tokens to dense vectors.
- Bidirectional LSTM (nn.LSTM) to capture context left-to-right and right-to-left.
- Linear layer to produce class logits.
2) Minimal mathematical background
In an LSTM, at each time step \(t\), gates are computed that control the flow of information:
\[
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
g_t &= \tanh(W_g x_t + U_g h_{t-1} + b_g) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o)
\end{aligned}
\]
\[
\begin{aligned}
C_t &= f_t \odot C_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \tanh(C_t)
\end{aligned}
\]
The additive path through the cell state \(C_t\) preserves information across many steps when \(f_t \approx 1\), which makes long-range dependencies easier to learn.
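To connect these equations with PyTorch, here is a minimal check (with hypothetical toy sizes) that applying them by hand to the weights of an nn.LSTMCell reproduces its output. PyTorch stacks the gate parameters in [i, f, g, o] order along the first dimension:

```python
import torch

torch.manual_seed(0)
input_size, hidden_size = 3, 4
cell = torch.nn.LSTMCell(input_size, hidden_size)
x = torch.randn(1, input_size)
h0 = torch.zeros(1, hidden_size)
c0 = torch.zeros(1, hidden_size)

# One step by hand: gate pre-activations stacked as [i, f, g, o].
gates = x @ cell.weight_ih.T + h0 @ cell.weight_hh.T + cell.bias_ih + cell.bias_hh
i, f, g, o = gates.chunk(4, dim=1)
i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
g = torch.tanh(g)
c1 = f * c0 + i * g       # C_t = f ⊙ C_{t-1} + i ⊙ g
h1 = o * torch.tanh(c1)   # h_t = o ⊙ tanh(C_t)

h_ref, c_ref = cell(x, (h0, c0))
print(torch.allclose(h1, h_ref, atol=1e-6), torch.allclose(c1, c_ref, atol=1e-6))
```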
3) What we will use
- Framework: PyTorch
- NLP dataset: IMDB (stanfordnlp/imdb)
- Metrics: loss, accuracy, F1, confusion matrix
- Visualizations: EDA, train/val curves, confusion matrix
Note: if you are running on CPU, reduce the number of epochs or the sample size for a faster run.
# Optional installation (uncomment if a library is missing)
# !pip install datasets scikit-learn seaborn -q
# General imports
import random
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from datasets import load_dataset
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, classification_report
# Reproducibility + device
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Available device: {device}")
Available device: cuda
Dataset loading and initial EDA
Before training, we explore:
- split sizes,
- class balance,
- review lengths.
We also create subsets to keep the exercise manageable.
# Load IMDB from Hugging Face Datasets
# train: 25k, test: 25k
dataset = load_dataset("stanfordnlp/imdb")
print(dataset)
print("Train size:", len(dataset["train"]))
print("Test size:", len(dataset["test"]))
DatasetDict({
train: Dataset({
features: ['text', 'label'],
num_rows: 25000
})
test: Dataset({
features: ['text', 'label'],
num_rows: 25000
})
unsupervised: Dataset({
features: ['text', 'label'],
num_rows: 50000
})
})
Train size: 25000
Test size: 25000
# Sample for EDA
eda_df = pd.DataFrame(dataset["train"][:3000])
eda_df["num_chars"] = eda_df["text"].str.len()
eda_df["num_words"] = eda_df["text"].str.split().apply(len)
eda_df.head()
| | text | label | num_chars | num_words |
|---|---|---|---|---|
| 0 | I rented I AM CURIOUS-YELLOW from my video sto... | 0 | 1640 | 288 |
| 1 | "I Am Curious: Yellow" is a risible and preten... | 0 | 1294 | 214 |
| 2 | If only to avoid making this type of film in t... | 0 | 528 | 93 |
| 3 | This film was probably inspired by Godard's Ma... | 0 | 706 | 118 |
| 4 | Oh, brother...after hearing about this ridicul... | 0 | 1814 | 311 |
# Class balance
plt.figure(figsize=(5, 3))
sns.countplot(x="label", data=eda_df)
plt.title("Label distribution (0=neg, 1=pos)")
plt.xlabel("Label")
plt.ylabel("Frequency")
plt.show()
# Review length distribution
plt.figure(figsize=(7, 4))
sns.histplot(eda_df["num_words"], bins=50, kde=True)
plt.title("Review length (number of words)")
plt.xlabel("# words")
plt.ylabel("Frequency")
plt.xlim(0, np.percentile(eda_df["num_words"], 99))
plt.show()
print("Length percentiles (#words):")
print(eda_df["num_words"].quantile([0.5, 0.75, 0.9, 0.95, 0.99]))
Length percentiles (#words):
0.50    171.00
0.75    273.00
0.90    432.10
0.95    558.05
0.99    906.02
Name: num_words, dtype: float64
# Real examples
for i in [0, 1, 2]:
    print(f"\nExample {i} | label={eda_df.iloc[i]['label']}")
    print(eda_df.iloc[i]["text"][:400], "...")
Example 0 | label=0
I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student name ...
Example 1 | label=0
"I Am Curious: Yellow" is a risible and pretentious steaming pile. It doesn't matter what one's political views are because this film can hardly be taken seriously on any level. As for the claim that frontal male nudity is an automatic NC-17, that isn't true. I've seen R-rated films with male nudity. Granted, they only offer some fleeting views, but where are the R-rated films with gaping vulvas a ...
Example 2 | label=0
If only to avoid making this type of film in the future. This film is interesting as an experiment but tells no cogent story.<br /><br />One might feel virtuous for sitting thru it because it touches on so many IMPORTANT issues but it does so without any discernable motive. The viewer comes away with no new perspectives (unless one comes up with one while one's mind wanders, as it will invariably ...
Text preprocessing
A didactic pipeline:
- Basic cleaning.
- Whitespace tokenization.
- Vocabulary with a minimum frequency.
- Conversion to indices (<PAD>, <UNK>).
- Padding/truncation to a fixed length.
SPECIAL_TOKENS = {"<PAD>": 0, "<UNK>": 1}
def clean_text(text: str) -> str:
    # Lowercase + light cleanup
text = text.lower()
text = re.sub(r"<br\s*/?>", " ", text)
text = re.sub(r"[^a-z0-9'\s]", " ", text)
text = re.sub(r"\s+", " ", text).strip()
return text
def tokenize(text: str):
    # Simple whitespace tokenization
return text.split()
# Subsets for fast training
train_size = 12000
val_size = 3000
test_size = 5000
train_raw = dataset["train"].shuffle(seed=SEED).select(range(train_size + val_size))
val_raw = train_raw.select(range(train_size, train_size + val_size))
train_raw = train_raw.select(range(train_size))
test_raw = dataset["test"].shuffle(seed=SEED).select(range(test_size))
print(f"Train: {len(train_raw)}, Val: {len(val_raw)}, Test: {len(test_raw)}")
Train: 12000, Val: 3000, Test: 5000
# Build the vocabulary using only the train split
min_freq = 3
counter = Counter()
for sample in train_raw:
tokens = tokenize(clean_text(sample["text"]))
counter.update(tokens)
vocab = dict(SPECIAL_TOKENS)
for token, freq in counter.items():
if freq >= min_freq:
vocab[token] = len(vocab)
id2token = {idx: tok for tok, idx in vocab.items()}
print(f"Vocabulary size: {len(vocab):,}")
print("Most frequent tokens:", counter.most_common(15))
Vocabulary size: 28,809
Most frequent tokens: [('the', 160794), ('and', 78217), ('a', 77596), ('of', 69456), ('to', 64896), ('is', 51252), ('in', 45066), ('it', 37770), ('i', 36873), ('this', 35980), ('that', 33592), ('was', 23170), ('as', 22566), ('for', 21168), ('with', 21154)]
# Maximum length suggested by the 95th percentile
train_lengths = [len(tokenize(clean_text(s["text"]))) for s in train_raw]
max_len = int(np.percentile(train_lengths, 95))
max_len = min(max_len, 300)
print(f"max_len chosen: {max_len}")
max_len chosen: 300
PAD_IDX = vocab["<PAD>"]
UNK_IDX = vocab["<UNK>"]
def encode_text(text: str, vocab: dict, max_len: int):
    # Text -> ids + truncation + padding
ids = [vocab.get(tok, UNK_IDX) for tok in tokenize(clean_text(text))]
ids = ids[:max_len]
if len(ids) < max_len:
ids += [PAD_IDX] * (max_len - len(ids))
return ids
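A quick self-contained illustration of the truncation/padding behavior, using a toy vocabulary rather than the one built above (the token ids here are made up for the example):

```python
import re

PAD_IDX, UNK_IDX = 0, 1
toy_vocab = {"<PAD>": 0, "<UNK>": 1, "good": 2, "movie": 3}

def toy_encode(text, vocab, max_len):
    # Same logic as encode_text: tokens -> ids -> truncate -> pad with <PAD>
    tokens = re.sub(r"[^a-z0-9'\s]", " ", text.lower()).split()
    ids = [vocab.get(tok, UNK_IDX) for tok in tokens][:max_len]
    return ids + [PAD_IDX] * (max_len - len(ids))

print(toy_encode("A GOOD movie!", toy_vocab, 6))        # short text gets padded
print(toy_encode("good good good good", toy_vocab, 2))  # long text gets truncated
```

Note how "A" falls outside the toy vocabulary and maps to <UNK>, just as rare train tokens below min_freq will at encoding time.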
PyTorch Dataset and DataLoader
We create a Dataset class that returns (x, y) ready for training.
class IMDBDataset(Dataset):
def __init__(self, hf_split, vocab, max_len):
self.samples = hf_split
self.vocab = vocab
self.max_len = max_len
def __len__(self):
return len(self.samples)
def __getitem__(self, idx):
sample = self.samples[idx]
x = encode_text(sample["text"], self.vocab, self.max_len)
y = sample["label"]
return torch.tensor(x, dtype=torch.long), torch.tensor(y, dtype=torch.long)
train_ds = IMDBDataset(train_raw, vocab, max_len)
val_ds = IMDBDataset(val_raw, vocab, max_len)
test_ds = IMDBDataset(test_raw, vocab, max_len)
batch_size = 64
train_loader = DataLoader(train_ds, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_ds, batch_size=batch_size, shuffle=False)
test_loader = DataLoader(test_ds, batch_size=batch_size, shuffle=False)
xb, yb = next(iter(train_loader))
print("Batch X:", xb.shape)
print("Batch y:", yb.shape)
Batch X: torch.Size([64, 300])
Batch y: torch.Size([64])
LSTM model in PyTorch
Chosen architecture:
- Embedding
- Bidirectional LSTM (2 layers)
- Dropout
- Final linear layer
We also initialize the forget gate bias to 1.0.
class SentimentLSTM(nn.Module):
def __init__(self, vocab_size, embed_dim=128, hidden_dim=128, num_layers=2,
dropout=0.3, bidirectional=True, num_classes=2, pad_idx=0):
super().__init__()
self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_idx)
self.lstm = nn.LSTM(
input_size=embed_dim,
hidden_size=hidden_dim,
num_layers=num_layers,
batch_first=True,
dropout=dropout if num_layers > 1 else 0.0,
bidirectional=bidirectional,
)
self.dropout = nn.Dropout(dropout)
factor = 2 if bidirectional else 1
self.fc = nn.Linear(hidden_dim * factor, num_classes)
self._init_forget_gate_bias(1.0)
def _init_forget_gate_bias(self, value=1.0):
        # bias_ih and bias_hh concatenate [i, f, g, o]
for name, param in self.lstm.named_parameters():
if "bias" in name:
n = param.size(0)
start, end = n // 4, n // 2
with torch.no_grad():
param[start:end].fill_(value)
def forward(self, x):
emb = self.dropout(self.embedding(x))
_, (h_n, _) = self.lstm(emb)
if self.lstm.bidirectional:
h_final = torch.cat([h_n[-2], h_n[-1]], dim=1)
else:
h_final = h_n[-1]
logits = self.fc(self.dropout(h_final))
return logits
model = SentimentLSTM(vocab_size=len(vocab), pad_idx=PAD_IDX).to(device)
print(model)
SentimentLSTM( (embedding): Embedding(28809, 128, padding_idx=0) (lstm): LSTM(128, 128, num_layers=2, batch_first=True, dropout=0.3, bidirectional=True) (dropout): Dropout(p=0.3, inplace=False) (fc): Linear(in_features=256, out_features=2, bias=True) )
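The _init_forget_gate_bias trick relies on PyTorch stacking each LSTM bias vector as [i, f, g, o], so the forget-gate slice is the second quarter, indices [H, 2H). A standalone sanity check of that slicing on a small throwaway nn.LSTM (the sizes here are arbitrary):

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=8, hidden_size=16, num_layers=2, bidirectional=True)

# Set the forget-gate slice of every bias vector to 1.0, as in the model above.
for name, param in lstm.named_parameters():
    if "bias" in name:
        n = param.size(0)  # n = 4 * hidden_size, stacked as [i, f, g, o]
        with torch.no_grad():
            param[n // 4 : n // 2].fill_(1.0)

# Every forget-gate bias entry is now exactly 1.0; other gates keep random init.
ok = all(
    bool(torch.all(p[p.size(0) // 4 : p.size(0) // 2] == 1.0))
    for name, p in lstm.named_parameters() if "bias" in name
)
print("forget-gate bias check:", ok)
```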
Training
We will use CrossEntropyLoss, Adam and gradient clipping. We store per-epoch metrics to plot train/val curves.
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
num_epochs = 6
max_grad_norm = 1.0
def run_epoch(model, loader, criterion, optimizer=None):
is_train = optimizer is not None
model.train() if is_train else model.eval()
losses = []
all_preds, all_targets = [], []
for xb, yb in loader:
xb, yb = xb.to(device), yb.to(device)
if is_train:
optimizer.zero_grad()
logits = model(xb)
loss = criterion(logits, yb)
if is_train:
loss.backward()
nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
optimizer.step()
losses.append(loss.item())
preds = logits.argmax(dim=1)
all_preds.extend(preds.detach().cpu().numpy().tolist())
all_targets.extend(yb.detach().cpu().numpy().tolist())
acc = accuracy_score(all_targets, all_preds)
f1 = f1_score(all_targets, all_preds)
return float(np.mean(losses)), acc, f1
history = {"train_loss": [], "val_loss": [], "train_acc": [], "val_acc": [], "train_f1": [], "val_f1": []}
for epoch in range(1, num_epochs + 1):
tr_loss, tr_acc, tr_f1 = run_epoch(model, train_loader, criterion, optimizer)
va_loss, va_acc, va_f1 = run_epoch(model, val_loader, criterion)
history["train_loss"].append(tr_loss)
history["val_loss"].append(va_loss)
history["train_acc"].append(tr_acc)
history["val_acc"].append(va_acc)
history["train_f1"].append(tr_f1)
history["val_f1"].append(va_f1)
print(f"Epoch {epoch:02d}/{num_epochs} | train_loss={tr_loss:.4f} val_loss={va_loss:.4f} | train_acc={tr_acc:.4f} val_acc={va_acc:.4f}")
Epoch 01/6 | train_loss=0.6525 val_loss=0.6549 | train_acc=0.6159 val_acc=0.6620
Epoch 02/6 | train_loss=0.5514 val_loss=0.4986 | train_acc=0.7359 val_acc=0.7783
Epoch 03/6 | train_loss=0.4386 val_loss=0.4708 | train_acc=0.8063 val_acc=0.7747
Epoch 04/6 | train_loss=0.3723 val_loss=0.4866 | train_acc=0.8420 val_acc=0.8213
Epoch 05/6 | train_loss=0.3144 val_loss=0.3781 | train_acc=0.8735 val_acc=0.8360
Epoch 06/6 | train_loss=0.2686 val_loss=0.3824 | train_acc=0.8948 val_acc=0.8517
# Train/val loss and accuracy curves
epochs = np.arange(1, num_epochs + 1)
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].plot(epochs, history["train_loss"], marker="o", label="Train")
axes[0].plot(epochs, history["val_loss"], marker="o", label="Val")
axes[0].set_title("Loss vs Epoch")
axes[0].set_xlabel("Epoch")
axes[0].set_ylabel("Loss")
axes[0].legend()
axes[0].grid(alpha=0.3)
axes[1].plot(epochs, history["train_acc"], marker="o", label="Train")
axes[1].plot(epochs, history["val_acc"], marker="o", label="Val")
axes[1].set_title("Accuracy vs Epoch")
axes[1].set_xlabel("Epoch")
axes[1].set_ylabel("Accuracy")
axes[1].legend()
axes[1].grid(alpha=0.3)
plt.tight_layout()
plt.show()
Final evaluation and metrics
We evaluate on the test set: accuracy, F1, a per-class report and the confusion matrix.
model.eval()
all_preds, all_targets = [], []
with torch.no_grad():
for xb, yb in test_loader:
logits = model(xb.to(device))
preds = logits.argmax(dim=1).cpu().numpy()
all_preds.extend(preds.tolist())
all_targets.extend(yb.numpy().tolist())
test_acc = accuracy_score(all_targets, all_preds)
test_f1 = f1_score(all_targets, all_preds)
print(f"Test Accuracy: {test_acc:.4f}")
print(f"Test F1-score: {test_f1:.4f}")
print("\nClassification report:")
print(classification_report(all_targets, all_preds, digits=4))
Test Accuracy: 0.8448
Test F1-score: 0.8477
Classification report:
precision recall f1-score support
0 0.8564 0.8276 0.8418 2494
1 0.8340 0.8619 0.8477 2506
accuracy 0.8448 5000
macro avg 0.8452 0.8448 0.8447 5000
weighted avg 0.8452 0.8448 0.8447 5000
cm = confusion_matrix(all_targets, all_preds)
plt.figure(figsize=(5, 4))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", cbar=False)
plt.title("Confusion matrix (test)")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()
Inference on new examples
def predict_sentiment(text, model, vocab, max_len):
model.eval()
x = encode_text(text, vocab, max_len)
x = torch.tensor(x, dtype=torch.long).unsqueeze(0).to(device)
with torch.no_grad():
logits = model(x)
probs = torch.softmax(logits, dim=1).cpu().numpy()[0]
pred = int(np.argmax(probs))
return pred, probs
samples = [
"This movie was absolutely fantastic, the acting was great and the story was touching.",
"I regret watching this film. It was boring, predictable and too long.",
"The plot starts slow but the second half is surprisingly good and emotional.",
]
for text in samples:
    pred, probs = predict_sentiment(text, model, vocab, max_len)
    label = "Positive" if pred == 1 else "Negative"
    print(f"Text: {text}")
    print(f"Prediction: {label} | P(neg)={probs[0]:.3f}, P(pos)={probs[1]:.3f}\n")
Text: This movie was absolutely fantastic, the acting was great and the story was touching.
Prediction: Positive | P(neg)=0.011, P(pos)=0.989
Text: I regret watching this film. It was boring, predictable and too long.
Prediction: Negative | P(neg)=0.932, P(pos)=0.068
Text: The plot starts slow but the second half is surprisingly good and emotional.
Prediction: Negative | P(neg)=0.679, P(pos)=0.321
Quick sanity tests
# Test 1: encode returns a fixed length
sample_ids = encode_text("A simple test sentence", vocab, max_len)
assert len(sample_ids) == max_len
# Test 2: forward returns logits [batch, 2]
xb, _ = next(iter(train_loader))
with torch.no_grad():
logits = model(xb[:8].to(device))
assert logits.shape == (8, 2)
# Test 3: softmax rows sum to 1
probs = torch.softmax(logits, dim=1)
assert np.allclose(probs.sum(dim=1).cpu().numpy(), 1.0, atol=1e-6)
print("✅ Sanity tests passed.")
✅ Sanity tests passed.
Conclusions and next steps
In this notebook we walked through a complete NLP workflow with LSTMs:
- EDA and dataset understanding.
- A text-to-tensors pipeline.
- A bidirectional LSTM architecture in PyTorch.
- Training with train/val curves.
- Quantitative and qualitative evaluation.
What to try next
- Compare with a GRU.
- Add pretrained embeddings.
- Use subword tokenization (BPE/WordPiece).
- Apply early stopping and a learning-rate scheduler.
- Analyze errors by review length/negations.
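As a starting point for the early-stopping idea, here is a hedged, self-contained sketch. The validation step is abstracted into a callable so the snippet runs on its own; in this notebook that callable would wrap run_epoch on val_loader (names like fit_with_early_stopping and the patience value are illustrative):

```python
import copy
import torch.nn as nn

def fit_with_early_stopping(model, val_loss_fn, max_epochs=20, patience=3):
    # Keep the weights with the best validation loss; stop after
    # `patience` consecutive epochs without improvement.
    best_val, best_state, bad = float("inf"), None, 0
    for epoch in range(max_epochs):
        val_loss = val_loss_fn(epoch)  # in the notebook: run_epoch(model, val_loader, criterion)
        if val_loss < best_val:
            best_val, bad = val_loss, 0
            best_state = copy.deepcopy(model.state_dict())
        else:
            bad += 1
            if bad >= patience:
                break
    if best_state is not None:
        model.load_state_dict(best_state)  # restore the best checkpoint
    return best_val, epoch + 1

# Toy demo: the loss improves, then plateaus; training stops before epoch 7.
losses = [0.9, 0.7, 0.6, 0.65, 0.66, 0.67, 0.5]
best, ran = fit_with_early_stopping(nn.Linear(4, 2), lambda e: losses[e])
print(best, ran)  # best=0.6, stopped after 6 epochs
```

Restoring the best state_dict (rather than keeping the last) is what makes the early stop act as model selection, not just a time saver.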