🏭 Use Case

LSTM TensorFlow/Keras — Sentiment classification (IMDB)

Binary classification of IMDB reviews with an LSTM in TensorFlow/Keras: EDA, preprocessing, training, and error analysis.

🐍 Python 📓 Jupyter Notebook

LSTM with TensorFlow/Keras for sentiment analysis (IMDB)

Notebook objective

In this notebook we will build, train, and evaluate an LSTM (Long Short-Term Memory) network with TensorFlow/Keras to solve a classic NLP task: binary sentiment classification (positive/negative reviews) on the IMDB dataset.

The pedagogical goal is twofold:

  1. Understand why an LSTM improves on a vanilla RNN when the relevant information appears far back in the sequence.
  2. Implement a complete workflow: data loading, basic EDA, preprocessing, architecture design, training, evaluation, and error analysis.

Mathematical and computational background (practical view)

In a vanilla RNN, the hidden state is updated with a recurrent transformation of the form: [ h_t = \phi(W_x x_t + W_h h_{t-1} + b) ]

This forces the gradient to pass through many multiplications during BPTT (backpropagation through time). If the factors are small, the gradient vanishes (vanishing gradient); if they are large, it explodes (exploding gradient).
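A toy calculation makes this concrete. If each backward step through the recurrence scales the gradient by roughly a constant factor a, then after T steps the magnitude behaves like a**T, so anything slightly below 1 collapses and anything slightly above 1 blows up (this is a deliberately simplified scalar sketch, not a full Jacobian analysis):

```python
# Toy illustration of vanishing/exploding gradients in BPTT:
# the gradient through T steps picks up a product of T recurrent factors.
T = 50
for a in (0.9, 1.1):
    grad_norm = 1.0
    for _ in range(T):
        grad_norm *= a  # one multiplicative factor per time step
    print(f'a={a}: |grad| after {T} steps ~ {grad_norm:.3e}')
    # a=0.9 vanishes (~5e-03); a=1.1 explodes (~1e+02)
```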

The LSTM introduces an explicit memory C_t (the cell state) and gates to control the flow of information:

[ f_t = \sigma(W_f [h_{t-1}, x_t] + b_f) \quad \text{(forget gate)} ] [ i_t = \sigma(W_i [h_{t-1}, x_t] + b_i) \quad \text{(input gate)} ] [ \tilde{C}_t = \tanh(W_c [h_{t-1}, x_t] + b_c) \quad \text{(candidate)} ] [ C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t ] [ o_t = \sigma(W_o [h_{t-1}, x_t] + b_o) \quad \text{(output gate)} ] [ h_t = o_t \odot \tanh(C_t) ]

The key is the additive update of C_t, which makes it easier for the gradient to flow across time. Intuitively:

  • f_t decides what to forget.
  • i_t decides what to write into memory.
  • o_t decides what to expose as output.
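The six equations above fit in a few lines of NumPy. The sketch below (toy sizes, random weights; Keras fuses the four gate matrices the same way into one weight tensor) implements a single LSTM step and unrolls it over a short random sequence:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM step following the gate equations above.

    W maps the concatenated [h_{t-1}, x_t] (size H + D) to the four
    gate pre-activations stacked along the output axis (size 4H).
    """
    H = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b   # all pre-activations at once
    f = sigmoid(z[0:H])                 # forget gate
    i = sigmoid(z[H:2 * H])             # input gate
    C_tilde = np.tanh(z[2 * H:3 * H])   # candidate memory
    o = sigmoid(z[3 * H:4 * H])         # output gate
    C = f * C_prev + i * C_tilde        # additive cell-state update
    h = o * np.tanh(C)                  # exposed hidden state
    return h, C

D, H = 5, 3                                         # toy input/hidden sizes
W = rng.normal(scale=0.1, size=(4 * H, H + D))
b = np.zeros(4 * H)
h, C = np.zeros(H), np.zeros(H)
for _ in range(10):                                 # unroll over a toy sequence
    h, C = lstm_step(rng.normal(size=D), h, C, W, b)
print(h.shape, C.shape)  # (3,) (3,)
```

Note that h is bounded in (-1, 1) because of the final tanh, while C can grow freely thanks to the additive update.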

Model and dataset

  • Dataset: IMDB (included in keras.datasets), with reviews already encoded as integer sequences.
  • Task: binary classification (0 = negative, 1 = positive).
  • Base architecture: Embedding -> LSTM -> Dense(sigmoid).
  • Metrics: accuracy, precision, recall, AUC + confusion matrix and classification report.

Note: we use a truncated vocabulary (num_words) and a maximum length (maxlen) to control computational cost and simplify batching.

[1]
import os
# force CPU execution (avoids XLA/Triton autotuner errors on GPU)
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'

# Main imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

from sklearn.metrics import (
    confusion_matrix,
    classification_report,
    roc_curve,
    auc,
)

# Basic reproducibility
SEED = 42
np.random.seed(SEED)
tf.random.set_seed(SEED)

print('TensorFlow version:', tf.__version__)
TensorFlow version: 2.21.0

1) Data loading

We load IMDB, limiting the vocabulary to the most frequent words. This reduces noise and memory usage.

[2]

# Preprocessing parameters
NUM_WORDS = 20000  # vocabulary size
MAXLEN = 200       # maximum review length (padding/truncation)

# Load the IMDB dataset from Keras
(x_train_raw, y_train), (x_test_raw, y_test) = keras.datasets.imdb.load_data(num_words=NUM_WORDS)

print('Number of training samples:', len(x_train_raw))
print('Number of test samples:', len(x_test_raw))
print('Example encoded sequence (first 20 tokens):', x_train_raw[0][:20])
print('Label for that example:', y_train[0])
Number of training samples: 25000
Number of test samples: 25000
Example encoded sequence (first 20 tokens): [1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25]
Label for that example: 1

2) Quick EDA of the dataset (before pad/truncate)

[3]

# Class distribution
train_counts = pd.Series(y_train).value_counts().sort_index()
test_counts = pd.Series(y_test).value_counts().sort_index()

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

sns.barplot(x=train_counts.index, y=train_counts.values, ax=axes[0])
axes[0].set_title('Class distribution (train)')
axes[0].set_xlabel('Class')
axes[0].set_ylabel('Count')

sns.barplot(x=test_counts.index, y=test_counts.values, ax=axes[1])
axes[1].set_title('Class distribution (test)')
axes[1].set_xlabel('Class')
axes[1].set_ylabel('Count')

plt.tight_layout()
plt.show()
Output
[4]

# Sequence lengths before length normalization
train_lengths = np.array([len(seq) for seq in x_train_raw])

print('Mean length:', train_lengths.mean())
print('90th percentile:', np.percentile(train_lengths, 90))
print('95th percentile:', np.percentile(train_lengths, 95))
print('Maximum length:', train_lengths.max())

plt.figure(figsize=(9, 4))
sns.histplot(train_lengths, bins=50, kde=True)
plt.axvline(MAXLEN, color='red', linestyle='--', label=f'MAXLEN={MAXLEN}')
plt.title('Distribution of review lengths (train)')
plt.xlabel('Number of tokens')
plt.ylabel('Count')
plt.legend()
plt.show()
Mean length: 238.71364
90th percentile: 467.0
95th percentile: 610.0
Maximum length: 2494
Output

There is large variability in review lengths. We choose MAXLEN=200 as a compromise between:

  • Retaining enough semantic context.
  • Keeping training time and memory reasonable.
[6]

# Inverse dictionary to decode tokens back into words (for inspection only)
word_index = keras.datasets.imdb.get_word_index()
inverted_word_index = {idx + 3: word for word, idx in word_index.items()}
inverted_word_index[0] = '<PAD>'
inverted_word_index[1] = '<START>'
inverted_word_index[2] = '<UNK>'
inverted_word_index[3] = '<UNUSED>'

def decode_review(encoded_review):
    return ' '.join(inverted_word_index.get(i, '?') for i in encoded_review)

print('Decoded review example (truncated):')
print(decode_review(x_train_raw[1][:80]))
Decoded review example (truncated):
<START> big hair big boobs bad music and a giant safety pin these are the words to best describe this terrible movie i love cheesy horror movies and i've seen hundreds but this had got to be on of the worst ever made the plot is paper thin and ridiculous the acting is an abomination the script is completely laughable the best is the end showdown with the cop and how he worked out who the killer is it's just

3) Preprocessing: padding/truncation and validation split

[7]

# Normalize sequence lengths for batched training
x_train = keras.preprocessing.sequence.pad_sequences(
    x_train_raw,
    maxlen=MAXLEN,
    padding='post',
    truncating='post'
)

x_test = keras.preprocessing.sequence.pad_sequences(
    x_test_raw,
    maxlen=MAXLEN,
    padding='post',
    truncating='post'
)

print('Shape x_train:', x_train.shape)
print('Shape x_test:', x_test.shape)

# Quick "test"-style checks to verify the preprocessing
assert x_train.shape[1] == MAXLEN
assert x_test.shape[1] == MAXLEN
assert set(np.unique(y_train)).issubset({0, 1})
print('Preprocessing checks OK ✅')
Shape x_train: (25000, 200)
Shape x_test: (25000, 200)
Preprocessing checks OK ✅
[8]

# Build a validation set from the training data
VAL_SIZE = 5000

x_val = x_train[:VAL_SIZE]
y_val = y_train[:VAL_SIZE]

x_train_final = x_train[VAL_SIZE:]
y_train_final = y_train[VAL_SIZE:]

print('Final train:', x_train_final.shape, y_train_final.shape)
print('Validation:', x_val.shape, y_val.shape)
Final train: (20000, 200) (20000,)
Validation: (5000, 200) (5000,)

4) LSTM architecture in Keras

Chosen design:

  • Embedding(NUM_WORDS, 128): maps each token to a trainable dense vector.
  • LSTM(64, dropout=0.2, recurrent_dropout=0.2): models temporal dependencies.
  • Dense(1, activation='sigmoid'): binary output.

We use binary_crossentropy and the Adam optimizer.

[9]

EMBED_DIM = 128
LSTM_UNITS = 64

model = keras.Sequential([
    layers.Embedding(input_dim=NUM_WORDS, output_dim=EMBED_DIM),  # input_length is deprecated in Keras 3
    layers.LSTM(LSTM_UNITS, dropout=0.2, recurrent_dropout=0.2),
    layers.Dense(1, activation='sigmoid')
])

model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-3),
    loss='binary_crossentropy',
    metrics=['accuracy', keras.metrics.Precision(name='precision'), keras.metrics.Recall(name='recall'), keras.metrics.AUC(name='auc')]
)

model.summary()
Model: "sequential"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                     Output Shape                  Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ embedding (Embedding)           │ ?                      │   0 (unbuilt) │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ lstm (LSTM)                     │ ?                      │   0 (unbuilt) │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense (Dense)                   │ ?                      │   0 (unbuilt) │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 0 (0.00 B)
 Trainable params: 0 (0.00 B)
 Non-trainable params: 0 (0.00 B)
[10]

# Early stopping callback to curb overfitting
early_stopping = keras.callbacks.EarlyStopping(
    monitor='val_loss',
    patience=2,
    restore_best_weights=True
)

EPOCHS = 8
BATCH_SIZE = 64

history = model.fit(
    x_train_final,
    y_train_final,
    validation_data=(x_val, y_val),
    epochs=EPOCHS,
    batch_size=BATCH_SIZE,
    callbacks=[early_stopping],
    verbose=1
)
Epoch 1/8
313/313 ━━━━━━━━━━━━━━━━━━━━ 15s 43ms/step - accuracy: 0.5287 - auc: 0.5406 - loss: 0.6890 - precision: 0.5250 - recall: 0.5570 - val_accuracy: 0.5144 - val_auc: 0.6081 - val_loss: 0.6797 - val_precision: 0.7269 - val_recall: 0.0742
Epoch 2/8
313/313 ━━━━━━━━━━━━━━━━━━━━ 13s 42ms/step - accuracy: 0.6227 - auc: 0.6847 - loss: 0.6372 - precision: 0.6216 - recall: 0.6182 - val_accuracy: 0.7366 - val_auc: 0.7765 - val_loss: 0.5558 - val_precision: 0.8740 - val_recall: 0.5640
Epoch 3/8
313/313 ━━━━━━━━━━━━━━━━━━━━ 13s 42ms/step - accuracy: 0.7773 - auc: 0.8262 - loss: 0.5032 - precision: 0.8006 - recall: 0.7358 - val_accuracy: 0.7586 - val_auc: 0.8022 - val_loss: 0.5515 - val_precision: 0.8220 - val_recall: 0.6712
Epoch 4/8
313/313 ━━━━━━━━━━━━━━━━━━━━ 13s 42ms/step - accuracy: 0.7210 - auc: 0.7852 - loss: 0.5584 - precision: 0.7049 - recall: 0.7557 - val_accuracy: 0.7146 - val_auc: 0.7521 - val_loss: 0.6248 - val_precision: 0.7987 - val_recall: 0.5876
Epoch 5/8
313/313 ━━━━━━━━━━━━━━━━━━━━ 13s 42ms/step - accuracy: 0.7973 - auc: 0.8525 - loss: 0.4668 - precision: 0.8426 - recall: 0.7290 - val_accuracy: 0.7648 - val_auc: 0.7945 - val_loss: 0.5745 - val_precision: 0.7541 - val_recall: 0.7985

5) Training curves (loss and metrics)

[11]

# Helper function to plot train/val curves

def plot_history(history, metric):
    plt.figure(figsize=(8, 4))
    plt.plot(history.history[metric], label=f'train_{metric}')
    plt.plot(history.history[f'val_{metric}'], label=f'val_{metric}')
    plt.title(f'{metric} per epoch')
    plt.xlabel('Epoch')
    plt.ylabel(metric)
    plt.legend()
    plt.grid(alpha=0.3)
    plt.show()

plot_history(history, 'loss')
plot_history(history, 'accuracy')
plot_history(history, 'auc')
Output
Output
Output

6) Final evaluation on the test set

[12]

# Quantitative evaluation on the test set
results = model.evaluate(x_test, y_test, verbose=0)
metric_names = model.metrics_names  # Keras 3 groups compiled metrics under 'compile_metrics'

print('Test results:')
for name, value in zip(metric_names, results):
    print(f'- {name}: {value:.4f}')
Test results:
- loss: 0.5655
- compile_metrics: 0.7498
[13]

# Predicted probabilities and labels
y_proba = model.predict(x_test, verbose=0).ravel()
y_pred = (y_proba >= 0.5).astype(int)

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(5, 4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion matrix (test)')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

print('Classification report:')
print(classification_report(y_test, y_pred, digits=4))
Output
Classification report:
              precision    recall  f1-score   support

           0     0.7098    0.8452    0.7716     12500
           1     0.8087    0.6545    0.7235     12500

    accuracy                         0.7498     25000
   macro avg     0.7593    0.7498    0.7475     25000
weighted avg     0.7593    0.7498    0.7475     25000

[14]

# ROC curve and AUC
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
roc_auc = auc(fpr, tpr)

plt.figure(figsize=(6, 5))
plt.plot(fpr, tpr, label=f'AUC = {roc_auc:.4f}')
plt.plot([0, 1], [0, 1], 'k--', alpha=0.7)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC curve (test)')
plt.legend(loc='lower right')
plt.grid(alpha=0.3)
plt.show()
Output

7) Qualitative inspection of examples

[15]

# Show a few examples together with the model's confidence
indices = np.random.choice(len(x_test_raw), size=5, replace=False)

for idx in indices:
    raw_text = decode_review(x_test_raw[idx][:120])
    prob = y_proba[idx]
    pred = int(prob >= 0.5)
    real = int(y_test[idx])

    print('=' * 90)
    print(f'Index: {idx} | Actual: {real} | Predicted: {pred} | Prob(positive): {prob:.3f}')
    print('Text (truncated):')
    print(raw_text)
==========================================================================================
Index: 6868 | Actual: 1 | Predicted: 1 | Prob(positive): 0.778
Text (truncated):
<START> to tell you the truth i do not speak tamil and i did not understand the film my good tamil friend wow what a long name explained every thing to me what a great movie after watching this movie i felt i should have watched many more movies from <UNK> tamil film industry the war scenes were amazing camera work excellent and plot beautiful the actress what a beauty give her an award for best looking someone ding ding ding come on i smell a oscar winner i didnt understand the songs but they were excellent <UNK> is a great director and i hope his next film was a success
==========================================================================================
Index: 24016 | Actual: 1 | Predicted: 1 | Prob(positive): 0.913
Text (truncated):
<START> the royal rumble has traditionally been one of my favourite events and i've been a wrestling fan for a good few years now the other shows may have better matches but i've always found the actual rumble match to be full of excitement br br i'm not going to reveal the winners of any match as i don't see it as fair to ruin the results on a review i will comment on the quality of them though br br we have the standard 4 matches and then the big rumble event two from <UNK> and two from raw br br shawn michaels and edge open up for raw this proves to be a good match from two talented
==========================================================================================
Index: 9668 | Actual: 1 | Predicted: 0 | Prob(positive): 0.417
Text (truncated):
<START> i really wanted to be able to give this film a 10 i've long thought it was my favorite of the four modern live action batman films to date and maybe it still will be i have yet to watch the schumacher films again i'm also starting to become concerned about whether i'm somehow <UNK> being you see i always liked the schumacher films as far as i can remember they were either <UNK> or <UNK> to me but the conventional wisdom is that the two tim burton directed films are far superior i had serious problems with the first burton batman this time around i ended up giving it a 7 and apologize as i might i just
==========================================================================================
Index: 13640 | Actual: 1 | Predicted: 1 | Prob(positive): 0.778
Text (truncated):
<START> nothing dull about this movie which is held together by fully realized characters with some depth to them even the <UNK> <UNK> have body language <UNK> performance is brilliant all will want and need a henry <UNK> as he must have been <UNK> is maybe <UNK> and <UNK> than anne <UNK> but she plays the part as written a victim caught in the jaws of a big huge baby br br cinematography is gorgeous in the restoration the <UNK> sensuous lubitsch lets these characters breathe and reveal their corruption down to the tiniest of he takes his time which can try the patience of an audience accustomed to being carried away by action but the time is worth spending
==========================================================================================
Index: 14018 | Actual: 0 | Predicted: 1 | Prob(positive): 0.526
Text (truncated):
<START> this is a movie about a black man buying a airline company and turning the company into a african <UNK> over the top <UNK> they even portray the owner as not only being in control of the airline but also controlling part of the air terminal at the airport one day this guy wins 100 million dollars a the next time you see him he is walking all over the airport acting like the owner of the airport everyone calls this movie a parody but nothing about this movie shouts parody this movie is a flop and will forever be in the 4 95 bin at wal mart br br i can't even come to terms to why mgm

8) Mini-experiment: effect of maximum length (MAXLEN)

A good practice with sequences is to check whether shorter or longer lengths significantly change performance/cost. The following (optional) block allows a quick comparison of two configurations.

[16]

# Optional, lightweight experiment: compare two maximum lengths
# Note: it may take several minutes depending on hardware.

def train_quick_variant(maxlen_variant, epochs=2):
    # Re-pad with the new length
    x_tr = keras.preprocessing.sequence.pad_sequences(x_train_raw, maxlen=maxlen_variant, padding='post', truncating='post')
    x_te = keras.preprocessing.sequence.pad_sequences(x_test_raw, maxlen=maxlen_variant, padding='post', truncating='post')

    x_v = x_tr[:VAL_SIZE]
    y_v = y_train[:VAL_SIZE]
    x_trf = x_tr[VAL_SIZE:]
    y_trf = y_train[VAL_SIZE:]

    # Small model for a quick comparison
    quick_model = keras.Sequential([
        layers.Embedding(NUM_WORDS, 64),  # input_length is deprecated in Keras 3
        layers.LSTM(32),
        layers.Dense(1, activation='sigmoid')
    ])

    quick_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

    quick_model.fit(
        x_trf, y_trf,
        validation_data=(x_v, y_v),
        epochs=epochs,
        batch_size=128,
        verbose=0
    )

    loss, acc = quick_model.evaluate(x_te, y_test, verbose=0)
    return {'maxlen': maxlen_variant, 'test_loss': loss, 'test_acc': acc}

# Uncomment to run if you want to compare in class:
# res_120 = train_quick_variant(120)
# res_240 = train_quick_variant(240)
# print(res_120)
# print(res_240)

Conclusions

  • The LSTM offers a key conceptual improvement over a vanilla RNN: an explicit memory (cell state) regulated by gates.
  • On IMDB, a simple Embedding + LSTM architecture already achieves competitive results for binary classification.
  • Train/val curves help diagnose overfitting and decide on early stopping.
  • A single metric is not enough: it is worth inspecting precision/recall, AUC, and the confusion matrix.

Suggestions for further exploration

  1. Try a bidirectional LSTM (Bidirectional(LSTM(...))).
  2. Stack LSTMs with return_sequences=True in intermediate layers.
  3. Swap in a GRU and compare parameter count, speed, and performance.
  4. Add regularization (SpatialDropout1D, L2) and/or tune MAXLEN.
  5. Use pretrained embeddings (GloVe/FastText) to improve initial semantics.
  6. Compare this approach against a small Transformer on the same task.
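As a starting point for suggestions 1 and 2, the following sketch combines them: a bidirectional first layer that returns the full sequence, feeding a second (unidirectional) LSTM. The unit counts here are illustrative, not tuned; swap in your own NUM_WORDS/EMBED_DIM/MAXLEN values.

```python
from tensorflow import keras
from tensorflow.keras import layers

NUM_WORDS, EMBED_DIM, MAXLEN = 20000, 128, 200  # same values used above

# Stacked bidirectional variant: the first recurrent layer must emit
# the whole sequence (return_sequences=True) so the second LSTM has a
# time axis to consume.
bi_model = keras.Sequential([
    layers.Embedding(input_dim=NUM_WORDS, output_dim=EMBED_DIM),
    layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
    layers.LSTM(32),
    layers.Dense(1, activation='sigmoid'),
])

bi_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
bi_model.build(input_shape=(None, MAXLEN))
bi_model.summary()
```

Training works exactly as before (`bi_model.fit(x_train_final, y_train_final, ...)`); expect roughly twice the recurrent parameters in the first layer, since forward and backward LSTMs each keep their own weights.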