Training stability in an MLP: effect of learning rate and batch size on FashionMNIST

This notebook is a hands-on study of how two fundamental training hyperparameters affect the stability and performance of a neural network:

Learning rate ($\eta$)
Batch size ($B$)

We train an MLP on FashionMNIST with SGD (stochastic gradient descent without momentum) and compare the training and validation curves for different values.

Why SGD and not Adam? Adaptive optimizers such as Adam adjust the learning rate internally per parameter, which dampens the differences between configurations. With plain SGD, the effect of the learning rate and the batch size is much more pronounced and instructive, so phenomena such as divergence, stagnation, and instability are easy to observe.

Learning goal

We want to answer the following questions with reproducible experiments:

What happens if the learning rate is too small or too large?
How does training change as the batch size varies?
Which combinations offer the best balance between convergence, stability, and generalization?

Mathematical background (applied view)

The generic parameter update in gradient descent is:

$$ \theta_{t+1} = \theta_t - \eta \, \nabla_{\theta} \mathcal{L}(\theta_t) $$

where $\eta$ is the learning rate.

Effect of the learning rate

Very small $\eta$ (~1e-4): extremely slow learning; no convergence within a few epochs.
Very large $\eta$ (~1.0): violent oscillations; the loss may grow or stay at chance level.
Suitable $\eta$ (~0.01–0.1): stable, fast descent.
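
To make the three regimes concrete, here is a minimal sketch on a toy problem rather than on the MLP: plain gradient descent on the one-dimensional quadratic $\mathcal{L}(\theta) = \tfrac{1}{2} a \theta^2$ with curvature $a = 3$ (an illustrative assumption). There the update reduces to $\theta_{t+1} = (1 - \eta a)\,\theta_t$, so the iterates shrink only when $|1 - \eta a| < 1$. The function name `gradient_descent_1d` and the default values are hypothetical and not part of the notebook.

```python
def gradient_descent_1d(eta, a=3.0, theta0=1.0, steps=20):
    """Run plain gradient descent on L(theta) = 0.5 * a * theta**2."""
    theta = theta0
    for _ in range(steps):
        grad = a * theta            # dL/dtheta for the quadratic
        theta = theta - eta * grad  # same update rule as above
    return theta

# Same orders of magnitude as the regimes listed above.
for eta in (1e-4, 0.1, 1.0):
    final = gradient_descent_1d(eta)
    print(f"eta={eta:<6} |theta| after 20 steps = {abs(final):.3e}")
```

With the smallest step the parameter barely moves from its starting value, with the middle one it contracts towards the minimum, and with $\eta = 1.0$ the factor $|1 - 3| = 2$ makes it grow at every step. The thresholds for a real network depend on its curvature, so the numbers here are only qualitative.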

Effect of the batch size

The gradient is estimated from mini-batches of size $B$:

$$ \hat{g} = \frac{1}{B}\sum_{i=1}^{B} \nabla_{\theta} \ell_i $$

Very small batch (8): very noisy gradient, oscillations in the loss, possible regularizing effect.
Very large batch (2048+): smooth gradient but few updates per epoch, so slower convergence.
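
The noise argument can be checked numerically. The sketch below uses synthetic per-example "gradients" (a true value of 1.0 plus unit-variance Gaussian noise, purely an assumption for illustration) and measures how much the mini-batch mean $\hat{g}$ fluctuates for several batch sizes. For independent samples the standard deviation should fall roughly as $1/\sqrt{B}$. The helper `minibatch_std` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic per-example gradients: true gradient 1.0 plus unit-variance noise.
per_example = 1.0 + rng.standard_normal(200_000)

def minibatch_std(grads, batch_size, n_batches=2000):
    """Empirical std of the mini-batch mean over many random mini-batches."""
    idx = rng.integers(0, len(grads), size=(n_batches, batch_size))
    return grads[idx].mean(axis=1).std()

noise = {B: minibatch_std(per_example, B) for B in (8, 128, 2048)}
for B, s in noise.items():
    print(f"B={B:<5} std of g_hat ~ {s:.3f}   1/sqrt(B) = {B ** -0.5:.3f}")
```

Going from a batch of 8 to a batch of 2048 multiplies the batch size by 256 and so reduces the noise by a factor of about 16, at the cost of 256 times fewer updates per epoch.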

Dataset and model

Dataset: FashionMNIST (10 clothing classes, 28x28 grayscale images).
Model: simple MLP (flatten + 2 ReLU hidden layers + softmax output).
Optimizer: plain SGD (no momentum) to maximize sensitivity to the hyperparameters.
Experiments:
- 5 learning rate values over a wide range: 1e-4 to 1.0 (fixed batch size).
- 5 batch size values over a wide range: 8 to 2048 (fixed LR).

In every run we track:

Train/val loss
Train/val accuracy
Final metrics on the test set
Training time

[1]

# Libraries and configuration

import os
# Force CPU execution (avoids XLA/Triton autotuner errors on GPU)
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'

import time
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

sns.set_theme(style='whitegrid', context='notebook')
plt.rcParams['figure.figsize'] = (9, 5)

SEED = 42
np.random.seed(SEED)
tf.random.set_seed(SEED)

1) Data loading and basic EDA

[2]

# Load FashionMNIST
(X_train_full, y_train_full), (X_test, y_test) = keras.datasets.fashion_mnist.load_data()

# Train/val split: first 54,000 images for training, last 6,000 for validation
X_train, X_val = X_train_full[:54000], X_train_full[54000:]
y_train, y_val = y_train_full[:54000], y_train_full[54000:]

class_names = [
    'T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
    'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot'
]

print('Train:', X_train.shape, 'Val:', X_val.shape, 'Test:', X_test.shape)
print('Pixel range:', X_train.min(), 'to', X_train.max())

Train: (54000, 28, 28) Val: (6000, 28, 28) Test: (10000, 28, 28)
Pixel range: 0 to 255

[3]

# EDA: class distribution
counts = pd.Series(y_train).value_counts().sort_index()

plt.figure(figsize=(10, 4))
sns.barplot(x=[class_names[i] for i in counts.index], y=counts.values, palette='viridis')
plt.xticks(rotation=30, ha='right')
plt.title('Class distribution in the training set')
plt.ylabel('Number of samples')
plt.tight_layout()
plt.show()

[4]

# EDA: sample images
plt.figure(figsize=(12, 6))
for i in range(20):
    ax = plt.subplot(4, 5, i+1)
    plt.imshow(X_train[i], cmap='gray')
    plt.title(class_names[y_train[i]], fontsize=9)
    plt.axis('off')
plt.suptitle('FashionMNIST examples')
plt.tight_layout()
plt.show()

[5]

# Preprocessing: scale pixels to [0, 1]; flattening is done inside the model
X_train = X_train.astype('float32') / 255.0
X_val = X_val.astype('float32') / 255.0
X_test = X_test.astype('float32') / 255.0

2) MLP definition and training utilities

[6]

def build_mlp(input_shape=(28, 28), n_classes=10):
    """Simple MLP for multiclass classification on FashionMNIST."""
    model = keras.Sequential([
        layers.Input(shape=input_shape),
        layers.Flatten(),
        layers.Dense(256, activation='relu'),
        layers.Dense(128, activation='relu'),
        layers.Dense(n_classes, activation='softmax')
    ])
    return model


def compile_model(model, learning_rate):
    """Compile the model with plain SGD (no momentum) and sparse categorical cross-entropy.

    We use SGD instead of Adam so that the effect of the learning rate
    and batch size is clearly visible in the training curves.
    """
    model.compile(
        optimizer=keras.optimizers.SGD(learning_rate=learning_rate, momentum=0.0),
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )
    return model


def train_once(learning_rate, batch_size, epochs=15, verbose=0):
    """Train one model and return (model, history dict, test metrics including timings)."""
    model = build_mlp()
    model = compile_model(model, learning_rate)

    t0 = time.perf_counter()
    history = model.fit(
        X_train, y_train,
        validation_data=(X_val, y_val),
        epochs=epochs,
        batch_size=batch_size,
        verbose=verbose
    )
    train_time = time.perf_counter() - t0

    # Inference on the test set
    t1 = time.perf_counter()
    y_prob = model.predict(X_test, verbose=0)
    infer_time = time.perf_counter() - t1

    y_pred = np.argmax(y_prob, axis=1)

    metrics = {
        'Accuracy': accuracy_score(y_test, y_pred),
        'Precision_macro': precision_score(y_test, y_pred, average='macro', zero_division=0),
        'Recall_macro': recall_score(y_test, y_pred, average='macro', zero_division=0),
        'F1_macro': f1_score(y_test, y_pred, average='macro', zero_division=0),
        'Train_time_s': train_time,
        'Infer_time_s': infer_time,
    }

    return model, history.history, metrics


def plot_train_val(history, title_prefix='Model'):
    """Plot train/val loss and accuracy from a Keras history dict."""
    # Loss
    plt.figure(figsize=(8, 4.5))
    plt.plot(history['loss'], label='Train')
    plt.plot(history['val_loss'], label='Validation')
    plt.title(f'{title_prefix} - Loss')
    plt.xlabel('Epoch')
    plt.ylabel('Loss')
    plt.legend()
    plt.tight_layout()
    plt.show()

    # Accuracy
    plt.figure(figsize=(8, 4.5))
    plt.plot(history['accuracy'], label='Train')
    plt.plot(history['val_accuracy'], label='Validation')
    plt.title(f'{title_prefix} - Accuracy')
    plt.xlabel('Epoch')
    plt.ylabel('Accuracy')
    plt.legend()
    plt.tight_layout()
    plt.show()

3) Experiment A: learning rate sweep (5 values)

We keep the batch size fixed to isolate the effect of the learning rate.

[7]

# Wide range of learning rates: from very slow to divergent
learning_rates = [1e-4, 1e-3, 1e-2, 1e-1, 1.0]
fixed_batch = 128
EPOCHS = 15

lr_results = []
lr_histories = {}

for lr in learning_rates:
    print(f'\nTraining with learning_rate={lr} and batch_size={fixed_batch}')
    _, hist, metrics = train_once(learning_rate=lr, batch_size=fixed_batch, epochs=EPOCHS, verbose=0)
    lr_histories[lr] = hist

    row = {'learning_rate': lr, 'batch_size': fixed_batch, **metrics}
    lr_results.append(row)

    # Curves for this run
    plot_train_val(hist, title_prefix=f'LR={lr}')

# Results table, best test accuracy first
lr_df = pd.DataFrame(lr_results).sort_values('Accuracy', ascending=False).reset_index(drop=True)
lr_df

Training with learning_rate=0.0001 and batch_size=128

Training with learning_rate=0.001 and batch_size=128

Training with learning_rate=0.01 and batch_size=128

Training with learning_rate=0.1 and batch_size=128

Training with learning_rate=1.0 and batch_size=128

	learning_rate	batch_size	Accuracy	Precision_macro	Recall_macro	F1_macro	Train_time_s	Infer_time_s
0	0.1000	128	0.8810	0.882316	0.8810	0.880841	10.540850	0.217580
1	0.0100	128	0.8477	0.846038	0.8477	0.845823	10.524911	0.218404
2	0.0010	128	0.7850	0.782579	0.7850	0.780162	10.549443	0.223091
3	0.0001	128	0.6250	0.659627	0.6250	0.576452	10.717692	0.228575
4	1.0000	128	0.1000	0.010000	0.1000	0.018182	10.504773	0.215487
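
The last row of the table has a recognizable signature. An accuracy of 0.1000, a macro precision of 0.0100 and a macro F1 of 0.0182 are exactly what a classifier that always predicts the same class scores on a balanced 10-class test set. This suggests, without proving it, that the LR=1.0 run collapsed to a constant prediction. The sketch below reproduces those numbers with NumPy on synthetic balanced labels. The arrays and the helper `macro_scores` are illustrative assumptions rather than notebook code, and undefined per-class ratios are counted as 0, mirroring `zero_division=0`.

```python
import numpy as np

def macro_scores(y_true, y_pred, n_classes=10):
    """Macro precision, recall and F1, with undefined ratios counted as 0."""
    precisions, recalls, f1s = [], [], []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        p = tp / (tp + fp) if tp + fp > 0 else 0.0
        r = tp / (tp + fn) if tp + fn > 0 else 0.0
        f = 2 * p * r / (p + r) if p + r > 0 else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    return np.mean(precisions), np.mean(recalls), np.mean(f1s)

y_true = np.repeat(np.arange(10), 1000)  # balanced labels, 1000 per class
y_pred = np.zeros_like(y_true)           # always predict class 0

accuracy = np.mean(y_true == y_pred)
precision, recall, f1 = macro_scores(y_true, y_pred)
chance_loss = np.log(10)                 # cross-entropy of a uniform prediction

print(f"accuracy={accuracy:.4f} precision={precision:.6f} "
      f"recall={recall:.4f} f1={f1:.6f} chance_loss={chance_loss:.4f}")
# accuracy=0.1000 precision=0.010000 recall=0.1000 f1=0.018182 chance_loss=2.3026
```

Only the predicted class contributes: its precision is 0.1 and its recall is 1.0, giving a per-class F1 of about 0.182, and averaging with nine zeros yields the values in the table.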

[8]

# Visual comparisons for the learning rate sweep
fig, axes = plt.subplots(1, 3, figsize=(18, 4.5))

# barplot treats learning_rate as a categorical axis (one evenly spaced bar per value),
# so a log x-scale must not be applied on top of it.
sns.barplot(data=lr_df, x='learning_rate', y='Accuracy', ax=axes[0], palette='Blues')
axes[0].set_title('Test accuracy vs learning rate')

sns.barplot(data=lr_df, x='learning_rate', y='F1_macro', ax=axes[1], palette='Greens')
axes[1].set_title('Macro F1 vs learning rate')

sns.barplot(data=lr_df, x='learning_rate', y='Train_time_s', ax=axes[2], palette='Reds')
axes[2].set_title('Training time vs learning rate')

plt.tight_layout()
plt.show()

[9]

# Overlaid curves to compare stability across learning rates
plt.figure(figsize=(9, 5))
for lr in learning_rates:
    plt.plot(lr_histories[lr]['val_loss'], label=f'lr={lr}')
plt.title('Validation loss vs epoch for different learning rates')
plt.xlabel('Epoch')
plt.ylabel('Val loss')
plt.legend()
plt.tight_layout()
plt.show()

plt.figure(figsize=(9, 5))
for lr in learning_rates:
    plt.plot(lr_histories[lr]['val_accuracy'], label=f'lr={lr}')
plt.title('Validation accuracy vs epoch for different learning rates')
plt.xlabel('Epoch')
plt.ylabel('Val accuracy')
plt.legend()
plt.tight_layout()
plt.show()

4) Experiment B: batch size sweep (5 values)

We keep the learning rate fixed to isolate the effect of the batch size.

[10]

# Wide range of batch sizes: from very noisy to very smooth
batch_sizes = [8, 32, 128, 512, 2048]
fixed_lr = 0.01

bs_results = []
bs_histories = {}

for bs in batch_sizes:
    print(f'\nTraining with learning_rate={fixed_lr} and batch_size={bs}')
    _, hist, metrics = train_once(learning_rate=fixed_lr, batch_size=bs, epochs=EPOCHS, verbose=0)
    bs_histories[bs] = hist

    row = {'batch_size': bs, 'learning_rate': fixed_lr, **metrics}
    bs_results.append(row)

    # Curves for this run
    plot_train_val(hist, title_prefix=f'Batch={bs}')

# Results table, best test accuracy first
bs_df = pd.DataFrame(bs_results).sort_values('Accuracy', ascending=False).reset_index(drop=True)
bs_df

Training with learning_rate=0.01 and batch_size=8

Training with learning_rate=0.01 and batch_size=32

Training with learning_rate=0.01 and batch_size=128

Training with learning_rate=0.01 and batch_size=512

Training with learning_rate=0.01 and batch_size=2048

	batch_size	learning_rate	Accuracy	Precision_macro	Recall_macro	F1_macro	Train_time_s	Infer_time_s
0	8	0.01	0.8849	0.885307	0.8849	0.884264	50.971556	0.221928
1	32	0.01	0.8721	0.871614	0.8721	0.870262	19.872815	0.232873
2	128	0.01	0.8512	0.850108	0.8512	0.849512	10.610145	0.212729
3	512	0.01	0.8163	0.812702	0.8163	0.812289	5.557927	0.219793
4	2048	0.01	0.7592	0.760068	0.7592	0.753110	2.717725	0.212947
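
One factor behind this ranking is simply the number of parameter updates. With 54,000 training images and 15 epochs, each run performs ceil(54000 / B) SGD steps per epoch, so at a fixed learning rate the small-batch runs take many more steps. The following sketch counts them. It assumes the final partial batch is kept, which is the default behavior of `model.fit` with NumPy arrays; the helper `sgd_updates` is hypothetical.

```python
import math

N_TRAIN = 54_000  # size of the training split used in this notebook
EPOCHS = 15

def sgd_updates(batch_size, n_samples=N_TRAIN, epochs=EPOCHS):
    """Return (steps per epoch, total steps), keeping the last partial batch."""
    steps_per_epoch = math.ceil(n_samples / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs

updates = {B: sgd_updates(B) for B in (8, 32, 128, 512, 2048)}
for B, (per_epoch, total) in updates.items():
    print(f"batch_size={B:<5} steps/epoch={per_epoch:<5} total steps={total}")
# batch_size=8     steps/epoch=6750  total steps=101250
# batch_size=32    steps/epoch=1688  total steps=25320
# batch_size=128   steps/epoch=422   total steps=6330
# batch_size=512   steps/epoch=106   total steps=1590
# batch_size=2048  steps/epoch=27    total steps=405
```

Batch size 8 performs 250 times as many updates as batch size 2048 (101,250 versus 405), which is consistent with both its higher accuracy and its longer training time. The count alone does not separate the effect of gradient noise from the effect of the number of steps.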

[11]

# Visual comparisons for the batch size sweep
fig, axes = plt.subplots(1, 3, figsize=(18, 4.5))

sns.barplot(data=bs_df, x='batch_size', y='Accuracy', ax=axes[0], palette='Blues')
axes[0].set_title('Test accuracy vs batch size')

sns.barplot(data=bs_df, x='batch_size', y='F1_macro', ax=axes[1], palette='Greens')
axes[1].set_title('Macro F1 vs batch size')

sns.barplot(data=bs_df, x='batch_size', y='Train_time_s', ax=axes[2], palette='Reds')
axes[2].set_title('Training time vs batch size')

plt.tight_layout()
plt.show()

[12]

# Overlaid curves to compare stability across batch sizes
plt.figure(figsize=(9, 5))
for bs in batch_sizes:
    plt.plot(bs_histories[bs]['val_loss'], label=f'batch={bs}')
plt.title('Validation loss vs epoch for different batch sizes')
plt.xlabel('Epoch')
plt.ylabel('Val loss')
plt.legend()
plt.tight_layout()
plt.show()

plt.figure(figsize=(9, 5))
for bs in batch_sizes:
    plt.plot(bs_histories[bs]['val_accuracy'], label=f'batch={bs}')
plt.title('Validation accuracy vs epoch for different batch sizes')
plt.xlabel('Epoch')
plt.ylabel('Val accuracy')
plt.legend()
plt.tight_layout()
plt.show()

5) Final comparison table and summary

[13]

lr_df['Experimento'] = 'learning_rate_sweep'
bs_df['Experimento'] = 'batch_size_sweep'

final_df = pd.concat([lr_df, bs_df], ignore_index=True, sort=False)
final_df

	learning_rate	batch_size	Accuracy	Precision_macro	Recall_macro	F1_macro	Train_time_s	Infer_time_s	Experimento
0	0.1000	128	0.8810	0.882316	0.8810	0.880841	10.540850	0.217580	learning_rate_sweep
1	0.0100	128	0.8477	0.846038	0.8477	0.845823	10.524911	0.218404	learning_rate_sweep
2	0.0010	128	0.7850	0.782579	0.7850	0.780162	10.549443	0.223091	learning_rate_sweep
3	0.0001	128	0.6250	0.659627	0.6250	0.576452	10.717692	0.228575	learning_rate_sweep
4	1.0000	128	0.1000	0.010000	0.1000	0.018182	10.504773	0.215487	learning_rate_sweep
5	0.0100	8	0.8849	0.885307	0.8849	0.884264	50.971556	0.221928	batch_size_sweep
6	0.0100	32	0.8721	0.871614	0.8721	0.870262	19.872815	0.232873	batch_size_sweep
7	0.0100	128	0.8512	0.850108	0.8512	0.849512	10.610145	0.212729	batch_size_sweep
8	0.0100	512	0.8163	0.812702	0.8163	0.812289	5.557927	0.219793	batch_size_sweep
9	0.0100	2048	0.7592	0.760068	0.7592	0.753110	2.717725	0.212947	batch_size_sweep

[14]

# Best configuration from each sweep
best_lr = lr_df.sort_values('Accuracy', ascending=False).iloc[0]
best_bs = bs_df.sort_values('Accuracy', ascending=False).iloc[0]

# Print the fixed values from the variables so the labels cannot drift from the experiment
print(f'Best learning rate (fixed batch={fixed_batch}):')
print(best_lr)
print(f'\nBest batch size (fixed lr={fixed_lr}):')
print(best_bs)

Best learning rate (fixed batch=128):
learning_rate                      0.1
batch_size                         128
Accuracy                         0.881
Precision_macro               0.882316
Recall_macro                     0.881
F1_macro                      0.880841
Train_time_s                  10.54085
Infer_time_s                   0.21758
Experimento        learning_rate_sweep
Name: 0, dtype: object

Best batch size (fixed lr=0.01):
batch_size                        8
learning_rate                  0.01
Accuracy                     0.8849
Precision_macro            0.885307
Recall_macro                 0.8849
F1_macro                   0.884264
Train_time_s              50.971556
Infer_time_s               0.221928
Experimento        batch_size_sweep
Name: 0, dtype: object

6) Quick tests (sanity checks)

Basic checks to confirm that the results are consistent.

[15]

assert len(lr_df) == 5, '5 learning rates must be evaluated'
assert len(bs_df) == 5, '5 batch sizes must be evaluated'

for df_name, df in [('lr_df', lr_df), ('bs_df', bs_df)]:
    for col in ['Accuracy', 'Precision_macro', 'Recall_macro', 'F1_macro']:
        assert np.isfinite(df[col]).all(), f'{df_name}:{col} contains non-finite values'
        assert ((df[col] >= 0) & (df[col] <= 1)).all(), f'{df_name}:{col} outside [0,1]'

    assert (df['Train_time_s'] > 0).all(), f'{df_name}:Train_time_s must be > 0'
    assert (df['Infer_time_s'] > 0).all(), f'{df_name}:Infer_time_s must be > 0'

# Check that every run recorded one loss value per epoch
for lr in learning_rates:
    assert len(lr_histories[lr]['loss']) == EPOCHS
for bs in batch_sizes:
    assert len(bs_histories[bs]['loss']) == EPOCHS

print('✅ Sanity checks completed successfully')

✅ Sanity checks completed successfully

Conclusions and next steps

Main conclusions

SGD is very sensitive to the learning rate: with LR=1e-4 it barely learns in 15 epochs (test accuracy 0.625); with LR=0.01–0.1 it converges well (0.848 and 0.881); with LR=1.0 accuracy collapses to chance level (0.100), with the loss diverging or stalling near the chance value (~2.3 for 10 classes). This contrast is expected to be much starker than with Adam, which dampens the differences.
The batch size shapes the curves: small batches (8) produce noisy, oscillating curves; large batches (2048) produce smooth curves but slower convergence (fewer updates per epoch). At the fixed LR of 0.01, test accuracy fell from 0.885 at batch size 8 to 0.759 at batch size 2048, while training time dropped from about 51 s to about 2.7 s.
There is no universally optimal value: the best combination depends on the problem, the architecture, and the compute budget.
Watching the train/val curves is key to detecting divergence, stagnation, or overfitting early.

Note on the choice of optimizer

We deliberately use plain SGD (no momentum) for this study. Adaptive optimizers (Adam, RMSprop) adjust the learning rate per parameter, which makes the effect of the global LR much less visible. In practice, Adam with its default LR (1e-3) works reasonably well in most cases, but understanding how SGD behaves is fundamental to understanding training dynamics.

What you could try next

Repeat with SGD + momentum and compare the improvement in convergence.
Repeat with Adam to check whether the differences shrink drastically.
Use a learning rate scheduler (cosine, step decay, warmup).
Try batch normalization and regularization (dropout, weight decay).
Run the same sweep with a CNN to contrast sensitivity.
Repeat each configuration with several seeds and report the mean and standard deviation.
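
The notebook already gives a hint of how much run-to-run variation to expect: the configuration LR=0.01 with batch size 128 was trained once in each sweep and reached test accuracies of 0.8477 and 0.8512. The seed is set only once at the top of the notebook, so the two runs presumably started from different random states. Here is a minimal sketch of the suggested reporting, using only those two values (far too few for a reliable estimate) and the standard library:

```python
from statistics import mean, stdev

# Test accuracies of the two runs with LR=0.01 and batch size 128,
# one from each sweep in this notebook.
accuracies = [0.8477, 0.8512]

acc_mean = mean(accuracies)
acc_std = stdev(accuracies)  # sample standard deviation (divides by n - 1)

print(f"accuracy = {acc_mean:.4f} +/- {acc_std:.4f} (n={len(accuracies)})")
```

A gap of about 0.35 percentage points between two identical configurations means that differences of that size between settings should not be over-interpreted without more repetitions.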

Key message: tuning the learning rate and batch size is not a minor detail; it is central to training stability in deep learning. With plain SGD, the difference between converging, stalling, and diverging can be a single order of magnitude in the learning rate.