🏭 Use Case

PyTorch LSTM: Sentiment Classification (IMDB)

Binary classification of movie reviews with a bidirectional LSTM in PyTorch: tokenization, embedding, training, and evaluation.

🐍 Python 📓 Jupyter Notebook

LSTM in PyTorch for sentiment classification (IMDB)

In this notebook we build, end to end, a sentiment classifier for movie reviews using an LSTM architecture in PyTorch. The goal is for you to understand both the practical side (code) and the mathematical intuition behind it, in line with the theory from the LSTM submodule:

  • why simple RNNs struggle with long-range dependencies,
  • how the LSTM memory cell helps mitigate the vanishing gradient,
  • and how to take this to a real NLP pipeline.
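
The first point can be made concrete with a toy calculation. In a plain RNN, the gradient through time is a product of per-step factors; when those factors are below 1 in magnitude, the product shrinks exponentially. This is a scalar sketch of that effect, not a full RNN backward pass:

```python
import numpy as np

# Toy scalar "RNN": h_t = tanh(w * h_{t-1}).
# The gradient of h_T w.r.t. h_0 is the product of the per-step
# derivatives w * (1 - h_t^2), which shrinks whenever |w| < 1.
w, h, grad = 0.9, 0.5, 1.0
for _ in range(50):
    h = np.tanh(w * h)
    grad *= w * (1 - h ** 2)

print(f"gradient after 50 steps: {grad:.2e}")
```

After 50 steps the gradient is on the order of 10⁻³, so the earliest inputs barely influence learning; the additive cell path of the LSTM (next section) is designed to avoid exactly this.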

1) Notebook objective

We want to solve a binary text classification task:

  • Input: a movie review.
  • Output: label 0 (negative) or 1 (positive).

We will use the IMDB dataset (Hugging Face Datasets) and train a model with this structure:

  1. Tokenization + vocabulary (text → integer indices).
  2. Embedding (nn.Embedding) to map tokens to dense vectors.
  3. Bidirectional LSTM (nn.LSTM) to capture left-to-right and right-to-left context.
  4. Linear layer to produce class logits.

2) Minimal mathematical background

In an LSTM, at each time step $t$ we compute gates that control the flow of information:

$$
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
g_t &= \tanh(W_g x_t + U_g h_{t-1} + b_g) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o)
\end{aligned}
$$

$$
\begin{aligned}
C_t &= f_t \odot C_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \tanh(C_t)
\end{aligned}
$$

The additive path through the cell state $C_t$ preserves information across many steps when $f_t \approx 1$, which makes long-range dependencies easier to learn.
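
These equations map directly onto PyTorch's weight layout. As a quick sketch (using `nn.LSTMCell`, whose `weight_ih`/`weight_hh` concatenate the gates in the order i, f, g, o), we can reproduce one step by hand and compare it against the library's own computation:

```python
import torch

torch.manual_seed(0)
cell = torch.nn.LSTMCell(input_size=3, hidden_size=2)
x = torch.randn(1, 3)
h0, c0 = torch.zeros(1, 2), torch.zeros(1, 2)
h1, c1 = cell(x, (h0, c0))  # PyTorch's own step

# Manual step: weights and biases concatenate the gates as [i, f, g, o]
gates = x @ cell.weight_ih.T + h0 @ cell.weight_hh.T + cell.bias_ih + cell.bias_hh
i, f, g, o = gates.chunk(4, dim=1)
i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
g = torch.tanh(g)
c_manual = f * c0 + i * g            # C_t = f ⊙ C_{t-1} + i ⊙ g
h_manual = o * torch.tanh(c_manual)  # h_t = o ⊙ tanh(C_t)

print(torch.allclose(h1, h_manual, atol=1e-6))
```

Both paths agree to numerical precision, confirming that the formulas above are exactly what `nn.LSTM` applies at every time step.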


3) What we will use

  • Framework: PyTorch
  • NLP dataset: IMDB (stanfordnlp/imdb)
  • Metrics: loss, accuracy, F1, confusion matrix
  • Visualizations: EDA, train/val curves, confusion matrix

Note: on CPU, reduce the number of epochs or the sample size for faster runs.

[1]
# Optional installation (uncomment if a library is missing)
# !pip install datasets scikit-learn seaborn -q
[2]
# General imports
import random
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from collections import Counter

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

from datasets import load_dataset
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, classification_report
[3]
# Reproducibility + device
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Available device: {device}")
Available device: cuda

Dataset loading and initial EDA

Before training, we explore:

  • split sizes,
  • class balance,
  • review length.

We also create subsets to keep the exercise manageable.

[4]
# Load IMDB from Hugging Face Datasets
# train: 25k, test: 25k
dataset = load_dataset("stanfordnlp/imdb")

print(dataset)
print("Train size:", len(dataset["train"]))
print("Test size:", len(dataset["test"]))
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})
Train size: 25000
Test size: 25000
[5]
# Sample for EDA
eda_df = pd.DataFrame(dataset["train"][:3000])
eda_df["num_chars"] = eda_df["text"].str.len()
eda_df["num_words"] = eda_df["text"].str.split().apply(len)
eda_df.head()
text label num_chars num_words
0 I rented I AM CURIOUS-YELLOW from my video sto... 0 1640 288
1 "I Am Curious: Yellow" is a risible and preten... 0 1294 214
2 If only to avoid making this type of film in t... 0 528 93
3 This film was probably inspired by Godard's Ma... 0 706 118
4 Oh, brother...after hearing about this ridicul... 0 1814 311
[6]
# Class balance
plt.figure(figsize=(5, 3))
sns.countplot(x="label", data=eda_df)
plt.title("Label distribution (0=neg, 1=pos)")
plt.xlabel("Label")
plt.ylabel("Frequency")
plt.show()
Output
[7]
# Review length distribution
plt.figure(figsize=(7, 4))
sns.histplot(eda_df["num_words"], bins=50, kde=True)
plt.title("Review length (number of words)")
plt.xlabel("# words")
plt.ylabel("Frequency")
plt.xlim(0, np.percentile(eda_df["num_words"], 99))
plt.show()

print("Length percentiles (#words):")
print(eda_df["num_words"].quantile([0.5, 0.75, 0.9, 0.95, 0.99]))
Output
Length percentiles (#words):
0.50    171.00
0.75    273.00
0.90    432.10
0.95    558.05
0.99    906.02
Name: num_words, dtype: float64
[8]
# Real examples
for i in [0, 1, 2]:
    print(f"\nExample {i} | label={eda_df.iloc[i]['label']}")
    print(eda_df.iloc[i]["text"][:400], "...")
Example 0 | label=0
I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student name ...

Example 1 | label=0
"I Am Curious: Yellow" is a risible and pretentious steaming pile. It doesn't matter what one's political views are because this film can hardly be taken seriously on any level. As for the claim that frontal male nudity is an automatic NC-17, that isn't true. I've seen R-rated films with male nudity. Granted, they only offer some fleeting views, but where are the R-rated films with gaping vulvas a ...

Example 2 | label=0
If only to avoid making this type of film in the future. This film is interesting as an experiment but tells no cogent story.<br /><br />One might feel virtuous for sitting thru it because it touches on so many IMPORTANT issues but it does so without any discernable motive. The viewer comes away with no new perspectives (unless one comes up with one while one's mind wanders, as it will invariably  ...

Text preprocessing

A didactic pipeline:

  1. Basic cleanup.
  2. Whitespace tokenization.
  3. Vocabulary with a minimum frequency.
  4. Conversion to indices (<PAD>, <UNK>).
  5. Padding/truncation to a fixed length.
[9]
SPECIAL_TOKENS = {"<PAD>": 0, "<UNK>": 1}


def clean_text(text: str) -> str:
    # Lowercase + light cleanup
    text = text.lower()
    text = re.sub(r"<br\s*/?>", " ", text)
    text = re.sub(r"[^a-z0-9'\s]", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text


def tokenize(text: str):
    # Simple whitespace tokenization
    return text.split()
[10]
# Subsets for fast training
train_size = 12000
val_size = 3000
test_size = 5000

train_raw = dataset["train"].shuffle(seed=SEED).select(range(train_size + val_size))
val_raw = train_raw.select(range(train_size, train_size + val_size))
train_raw = train_raw.select(range(train_size))

test_raw = dataset["test"].shuffle(seed=SEED).select(range(test_size))

print(f"Train: {len(train_raw)}, Val: {len(val_raw)}, Test: {len(test_raw)}")
Train: 12000, Val: 3000, Test: 5000
[11]
# Build the vocabulary using only the train split
min_freq = 3
counter = Counter()
for sample in train_raw:
    tokens = tokenize(clean_text(sample["text"]))
    counter.update(tokens)

vocab = dict(SPECIAL_TOKENS)
for token, freq in counter.items():
    if freq >= min_freq:
        vocab[token] = len(vocab)

id2token = {idx: tok for tok, idx in vocab.items()}
print(f"Vocabulary size: {len(vocab):,}")
print("Most frequent tokens:", counter.most_common(15))
Vocabulary size: 28,809
Most frequent tokens: [('the', 160794), ('and', 78217), ('a', 77596), ('of', 69456), ('to', 64896), ('is', 51252), ('in', 45066), ('it', 37770), ('i', 36873), ('this', 35980), ('that', 33592), ('was', 23170), ('as', 22566), ('for', 21168), ('with', 21154)]
[12]
# Max length suggested by the 95th percentile
train_lengths = [len(tokenize(clean_text(s["text"]))) for s in train_raw]
max_len = int(np.percentile(train_lengths, 95))
max_len = min(max_len, 300)
print(f"Chosen max_len: {max_len}")
Chosen max_len: 300
[13]
PAD_IDX = vocab["<PAD>"]
UNK_IDX = vocab["<UNK>"]


def encode_text(text: str, vocab: dict, max_len: int):
    # Text -> ids + truncation + padding
    ids = [vocab.get(tok, UNK_IDX) for tok in tokenize(clean_text(text))]
    ids = ids[:max_len]
    if len(ids) < max_len:
        ids += [PAD_IDX] * (max_len - len(ids))
    return ids

PyTorch Dataset and DataLoader

We create a Dataset class that returns (x, y) ready for training.

[14]
class IMDBDataset(Dataset):
    def __init__(self, hf_split, vocab, max_len):
        self.samples = hf_split
        self.vocab = vocab
        self.max_len = max_len

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        sample = self.samples[idx]
        x = encode_text(sample["text"], self.vocab, self.max_len)
        y = sample["label"]
        return torch.tensor(x, dtype=torch.long), torch.tensor(y, dtype=torch.long)


train_ds = IMDBDataset(train_raw, vocab, max_len)
val_ds = IMDBDataset(val_raw, vocab, max_len)
test_ds = IMDBDataset(test_raw, vocab, max_len)

batch_size = 64
train_loader = DataLoader(train_ds, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_ds, batch_size=batch_size, shuffle=False)
test_loader = DataLoader(test_ds, batch_size=batch_size, shuffle=False)

xb, yb = next(iter(train_loader))
print("Batch X:", xb.shape)
print("Batch y:", yb.shape)
Batch X: torch.Size([64, 300])
Batch y: torch.Size([64])

LSTM model in PyTorch

Chosen architecture:

  • Embedding
  • Bidirectional LSTM (2 layers)
  • Dropout
  • Final linear layer

We also initialize the forget gate bias to 1.0, a common trick that biases the cell toward remembering early in training.

[15]
class SentimentLSTM(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=128, num_layers=2,
                 dropout=0.3, bidirectional=True, num_classes=2, pad_idx=0):
        super().__init__()

        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_idx)
        self.lstm = nn.LSTM(
            input_size=embed_dim,
            hidden_size=hidden_dim,
            num_layers=num_layers,
            batch_first=True,
            dropout=dropout if num_layers > 1 else 0.0,
            bidirectional=bidirectional,
        )

        self.dropout = nn.Dropout(dropout)
        factor = 2 if bidirectional else 1
        self.fc = nn.Linear(hidden_dim * factor, num_classes)

        self._init_forget_gate_bias(1.0)

    def _init_forget_gate_bias(self, value=1.0):
        # bias_ih and bias_hh concatenate [i, f, g, o]; the forget gate is the second quarter
        for name, param in self.lstm.named_parameters():
            if "bias" in name:
                n = param.size(0)
                start, end = n // 4, n // 2
                with torch.no_grad():
                    param[start:end].fill_(value)

    def forward(self, x):
        emb = self.dropout(self.embedding(x))
        _, (h_n, _) = self.lstm(emb)

        if self.lstm.bidirectional:
            # h_n: [num_layers * 2, batch, hidden]; the last two entries are
            # the final layer's forward and backward final hidden states
            h_final = torch.cat([h_n[-2], h_n[-1]], dim=1)
        else:
            h_final = h_n[-1]

        logits = self.fc(self.dropout(h_final))
        return logits


model = SentimentLSTM(vocab_size=len(vocab), pad_idx=PAD_IDX).to(device)
print(model)
SentimentLSTM(
  (embedding): Embedding(28809, 128, padding_idx=0)
  (lstm): LSTM(128, 128, num_layers=2, batch_first=True, dropout=0.3, bidirectional=True)
  (dropout): Dropout(p=0.3, inplace=False)
  (fc): Linear(in_features=256, out_features=2, bias=True)
)

Training

We use CrossEntropyLoss, Adam, and gradient clipping, and store per-epoch metrics to plot train/val curves.

[16]
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
num_epochs = 6
max_grad_norm = 1.0


def run_epoch(model, loader, criterion, optimizer=None):
    is_train = optimizer is not None
    model.train() if is_train else model.eval()

    losses = []
    all_preds, all_targets = [], []

    for xb, yb in loader:
        xb, yb = xb.to(device), yb.to(device)
        if is_train:
            optimizer.zero_grad()

        logits = model(xb)
        loss = criterion(logits, yb)

        if is_train:
            loss.backward()
            nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
            optimizer.step()

        losses.append(loss.item())
        preds = logits.argmax(dim=1)
        all_preds.extend(preds.detach().cpu().numpy().tolist())
        all_targets.extend(yb.detach().cpu().numpy().tolist())

    acc = accuracy_score(all_targets, all_preds)
    f1 = f1_score(all_targets, all_preds)
    return float(np.mean(losses)), acc, f1


history = {"train_loss": [], "val_loss": [], "train_acc": [], "val_acc": [], "train_f1": [], "val_f1": []}

for epoch in range(1, num_epochs + 1):
    tr_loss, tr_acc, tr_f1 = run_epoch(model, train_loader, criterion, optimizer)
    va_loss, va_acc, va_f1 = run_epoch(model, val_loader, criterion)

    history["train_loss"].append(tr_loss)
    history["val_loss"].append(va_loss)
    history["train_acc"].append(tr_acc)
    history["val_acc"].append(va_acc)
    history["train_f1"].append(tr_f1)
    history["val_f1"].append(va_f1)

    print(f"Epoch {epoch:02d}/{num_epochs} | train_loss={tr_loss:.4f} val_loss={va_loss:.4f} | train_acc={tr_acc:.4f} val_acc={va_acc:.4f}")
Epoch 01/6 | train_loss=0.6525 val_loss=0.6549 | train_acc=0.6159 val_acc=0.6620
Epoch 02/6 | train_loss=0.5514 val_loss=0.4986 | train_acc=0.7359 val_acc=0.7783
Epoch 03/6 | train_loss=0.4386 val_loss=0.4708 | train_acc=0.8063 val_acc=0.7747
Epoch 04/6 | train_loss=0.3723 val_loss=0.4866 | train_acc=0.8420 val_acc=0.8213
Epoch 05/6 | train_loss=0.3144 val_loss=0.3781 | train_acc=0.8735 val_acc=0.8360
Epoch 06/6 | train_loss=0.2686 val_loss=0.3824 | train_acc=0.8948 val_acc=0.8517
[17]
# Train/val loss and accuracy curves
epochs = np.arange(1, num_epochs + 1)
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

axes[0].plot(epochs, history["train_loss"], marker="o", label="Train")
axes[0].plot(epochs, history["val_loss"], marker="o", label="Val")
axes[0].set_title("Loss vs Epoch")
axes[0].set_xlabel("Epoch")
axes[0].set_ylabel("Loss")
axes[0].legend()
axes[0].grid(alpha=0.3)

axes[1].plot(epochs, history["train_acc"], marker="o", label="Train")
axes[1].plot(epochs, history["val_acc"], marker="o", label="Val")
axes[1].set_title("Accuracy vs Epoch")
axes[1].set_xlabel("Epoch")
axes[1].set_ylabel("Accuracy")
axes[1].legend()
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()
Output

Final evaluation and metrics

We evaluate on the test set: accuracy, F1, per-class report, and confusion matrix.

[18]
model.eval()
all_preds, all_targets = [], []

with torch.no_grad():
    for xb, yb in test_loader:
        logits = model(xb.to(device))
        preds = logits.argmax(dim=1).cpu().numpy()
        all_preds.extend(preds.tolist())
        all_targets.extend(yb.numpy().tolist())

test_acc = accuracy_score(all_targets, all_preds)
test_f1 = f1_score(all_targets, all_preds)

print(f"Test Accuracy: {test_acc:.4f}")
print(f"Test F1-score: {test_f1:.4f}")
print("\nClassification report:")
print(classification_report(all_targets, all_preds, digits=4))
Test Accuracy: 0.8448
Test F1-score: 0.8477

Classification report:
              precision    recall  f1-score   support

           0     0.8564    0.8276    0.8418      2494
           1     0.8340    0.8619    0.8477      2506

    accuracy                         0.8448      5000
   macro avg     0.8452    0.8448    0.8447      5000
weighted avg     0.8452    0.8448    0.8447      5000

[19]
cm = confusion_matrix(all_targets, all_preds)
plt.figure(figsize=(5, 4))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", cbar=False)
plt.title("Confusion matrix (test)")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()
Output

Inference on new examples

[20]
def predict_sentiment(text, model, vocab, max_len):
    model.eval()
    x = encode_text(text, vocab, max_len)
    x = torch.tensor(x, dtype=torch.long).unsqueeze(0).to(device)

    with torch.no_grad():
        logits = model(x)
        probs = torch.softmax(logits, dim=1).cpu().numpy()[0]

    pred = int(np.argmax(probs))
    return pred, probs


samples = [
    "This movie was absolutely fantastic, the acting was great and the story was touching.",
    "I regret watching this film. It was boring, predictable and too long.",
    "The plot starts slow but the second half is surprisingly good and emotional.",
]

for text in samples:
    pred, probs = predict_sentiment(text, model, vocab, max_len)
    label = "Positive" if pred == 1 else "Negative"
    print(f"Text: {text}")
    print(f"Prediction: {label} | P(neg)={probs[0]:.3f}, P(pos)={probs[1]:.3f}\n")
Text: This movie was absolutely fantastic, the acting was great and the story was touching.
Prediction: Positive | P(neg)=0.011, P(pos)=0.989

Text: I regret watching this film. It was boring, predictable and too long.
Prediction: Negative | P(neg)=0.932, P(pos)=0.068

Text: The plot starts slow but the second half is surprisingly good and emotional.
Prediction: Negative | P(neg)=0.679, P(pos)=0.321

Quick sanity checks

[21]
# Test 1: encode returns a fixed length
sample_ids = encode_text("A simple test sentence", vocab, max_len)
assert len(sample_ids) == max_len

# Test 2: forward returns logits [batch, 2]
xb, _ = next(iter(train_loader))
with torch.no_grad():
    logits = model(xb[:8].to(device))
assert logits.shape == (8, 2)

# Test 3: softmax rows sum to 1
probs = torch.softmax(logits, dim=1)
assert np.allclose(probs.sum(dim=1).cpu().numpy(), 1.0, atol=1e-6)

print("✅ Sanity checks passed.")
✅ Sanity checks passed.

Conclusions and next steps

In this notebook we walked through a complete NLP workflow with an LSTM:

  1. EDA and dataset understanding.
  2. Text-to-tensor pipeline.
  3. Bidirectional LSTM architecture in PyTorch.
  4. Training with train/val curves.
  5. Quantitative and qualitative evaluation.

What to try next

  • Compare against a GRU.
  • Add pretrained embeddings.
  • Use subword tokenization (BPE/WordPiece).
  • Apply early stopping and a learning-rate scheduler.
  • Analyze errors by review length/negations.
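
As a starting point for the early-stopping idea, a minimal patience-based helper could look like this. This is a sketch with a hypothetical `EarlyStopping` class (not part of PyTorch); it would wrap the validation loss computed in each epoch of the training loop above:

```python
class EarlyStopping:
    """Stop when val loss hasn't improved for `patience` epochs (hypothetical helper)."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        # Returns True when training should stop
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Toy usage with a made-up val-loss sequence:
stopper = EarlyStopping(patience=2)
for val_loss in [0.50, 0.42, 0.44, 0.45, 0.40]:
    if stopper.step(val_loss):
        print("early stop")
        break
```

In the real loop you would also checkpoint the model whenever `best` improves, so that you keep the weights from the best validation epoch rather than the last one.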