We start from a database obtained through web scraping by Gonzalo Pérez. It contains information on house sales in Uruguay, collected from Mercado Libre and El Gallito in 2023 and 2025.
Our goal is to build a model that can predict house prices. This is relevant for several reasons:
If someone were trying to sell their house, they could use this model as a guide to set the price.
It lets us see which houses are over- or under-valued, identifying buying opportunities or needed price corrections.
It lets us see which variables matter most when determining a property's price.
If the model does not produce good results, that could be a sign that relevant variables are not being captured.
Initial Data
# If any of these libraries is missing, uncomment and run the following lines:
!pip install unidecode
!pip install catboost
!pip install optuna
!pip install optuna-integration[catboost]
(pip install output omitted: all requirements already satisfied)
# Libraries for data loading and manipulation
import pandas as pd
import numpy as np
# Libraries for visualizations
import matplotlib.pyplot as plt
import seaborn as sns
# Remove the display limits on columns and rows
pd.set_option("display.max_columns", None)
pd.set_option('display.max_rows', 500)
# To standardize the data
from sklearn.preprocessing import StandardScaler
# Linear regression model
import statsmodels.api as sm
from sklearn.model_selection import train_test_split, KFold
from sklearn.linear_model import RidgeCV, LassoCV
from sklearn.metrics import mean_squared_error, r2_score
# CatBoost model (pandas, numpy, train_test_split, mean_squared_error and r2_score are already imported above)
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error
from catboost import CatBoostRegressor
# Optuna optimization
import optuna
from optuna.pruners import MedianPruner
from optuna.samplers import TPESampler
from optuna.integration import CatBoostPruningCallback
from optuna.trial import TrialState
# Suppress warnings
import warnings
warnings.filterwarnings("ignore")
# To use Google Colab
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
# Load the data
#data = pd.read_excel("/content/drive/MyDrive/ventas_casas.xlsx")
data = pd.read_excel("/content/drive/MyDrive/Casas/ventas_casas.xlsx")
df = data.copy(deep=True)
# General overview of the dataset
df.head()
| NEIGHBORHOOD | COVERED_AREA | LISTING_TYPE_ID | ADDRESS_CITY_NAME | ROOMS | WITH_VIRTUAL_TOUR | HAS_AIR_CONDITIONING | PROCESS_DATE | TOTAL_AREA | DESCRIPTION | ITEM_CONDITION | PROCESS_DATE.1 | ADDRESS_STATE | ADDRESS | CONDITION | TITLE | PRICE | ADDRESS_LINE | CATEGORY_ID | ORIGEN | SITE_ID | HAS_TELEPHONE_LINE | ID | DESCRIPTION.1 | FULL_BATHROOMS | OPERATION | BEDROOMS | CURRENCY_ID | PROPERTY_TYPE | POSITION | ARTICLE_ID | GARAGE | CONSTRUCTION_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | 151.0 | gold_premium | Atahualpa | 6 | NaN | Sí | NaN | 244.0 | NaN | Usado | 08/03/2023 | Montevideo | NaN | used | Venta Casa 4 Dormitorios 2 Baños Garage Atahualpa | 218000 | Avenida Burgues 3000 - 3300, Montevideo Depart... | MLU1468 | meli | MLU | No | MLU629463390 | NaN | 2 | Venta | 4.0 | USD | Casa | NaN | NaN | 0 | NaN |
| 1 | NaN | 162.0 | gold_premium | Malvin | 4 | NaN | No | NaN | 341.0 | NaN | Usado | 08/03/2023 | Montevideo | NaN | used | Vendo Casa De 3 Dormitorios En Malvin Cw193381 | 400000 | Amazonas | MLU1468 | meli | MLU | NaN | MLU629462486 | NaN | 2 | Venta | 3.0 | USD | Casa | NaN | NaN | 0 | NaN |
| 2 | NaN | 80.0 | gold_premium | Tres Cruces | 0 | NaN | Sí | NaN | 114.0 | NaN | Nuevo | 08/03/2023 | Montevideo | NaN | new | Casa En Venta, 2 Dormitorios, Montevideo, La C... | 160000 | Cagancha 2322, 11800 Montevideo, Departamento ... | MLU1468 | meli | MLU | Sí | MLU629462150 | NaN | 2 | Venta | 2.0 | USD | Casa | NaN | NaN | 0 | NaN |
| 3 | NaN | 100.0 | gold_premium | Malvin | 4 | NaN | Sí | NaN | 227.0 | NaN | Usado | 08/03/2023 | Montevideo | NaN | used | Casa - Buceo. 3 Dormitorios, Garaje Y Fondo. | 250000 | Sobre Lalleman, Padrón Único, Garaje, Admite B... | MLU1468 | meli | MLU | No | MLU629462032 | NaN | 1 | Venta | 3.0 | USD | Casa | NaN | NaN | 0 | NaN |
| 4 | NaN | 28.0 | gold_premium | Piriápolis | 3 | NaN | NaN | NaN | 169.0 | NaN | Usado | 08/03/2023 | Maldonado | NaN | used | ¡¡¡ Oportunidad En Piriápolis !!! Ubicadísima,... | 85000 | 4PWF+5CM, Monte Caseros, 20200 Piriápolis, Dep... | MLU1468 | meli | MLU | NaN | MLU629705446 | NaN | 1 | Venta | 1.0 | USD | Casa | NaN | NaN | 0 | NaN |
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 177394 entries, 0 to 177393 Data columns (total 33 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 NEIGHBORHOOD 167917 non-null object 1 COVERED_AREA 176378 non-null float64 2 LISTING_TYPE_ID 9477 non-null object 3 ADDRESS_CITY_NAME 9477 non-null object 4 ROOMS 177394 non-null int64 5 WITH_VIRTUAL_TOUR 7465 non-null object 6 HAS_AIR_CONDITIONING 5092 non-null object 7 PROCESS_DATE 167917 non-null object 8 TOTAL_AREA 9477 non-null float64 9 DESCRIPTION 7599 non-null object 10 ITEM_CONDITION 79900 non-null object 11 PROCESS_DATE.1 9477 non-null object 12 ADDRESS_STATE 9477 non-null object 13 ADDRESS 167238 non-null object 14 CONDITION 9477 non-null object 15 TITLE 177394 non-null object 16 PRICE 177394 non-null object 17 ADDRESS_LINE 9185 non-null object 18 CATEGORY_ID 9477 non-null object 19 ORIGEN 177394 non-null object 20 SITE_ID 9477 non-null object 21 HAS_TELEPHONE_LINE 4044 non-null object 22 ID 9477 non-null object 23 DESCRIPTION.1 163803 non-null object 24 FULL_BATHROOMS 177394 non-null int64 25 OPERATION 177394 non-null object 26 BEDROOMS 175392 non-null float64 27 CURRENCY_ID 177394 non-null object 28 PROPERTY_TYPE 177394 non-null object 29 POSITION 589 non-null object 30 ARTICLE_ID 167917 non-null float64 31 GARAGE 177394 non-null int64 32 CONSTRUCTION_YEAR 47724 non-null float64 dtypes: float64(5), int64(3), object(25) memory usage: 44.7+ MB
# Rough proportion of duplicated listings (see the discussion below)
(136074/177394)
0.7670721670405989
We can see there are duplicates by non-null ARTICLE_ID, so we will keep the most recently entered records after converting the dates to datetime format.
Since many variables are of no use for the model we want to build, we drop them before the exploratory analysis.
# Drop variables that are not relevant to our analysis
new_df = df.copy()
new_df = new_df.drop(columns=['DESCRIPTION', 'TITLE', 'CATEGORY_ID', 'SITE_ID', 'ID', 'DESCRIPTION.1', 'OPERATION', 'PROPERTY_TYPE', 'ADDRESS', 'ADDRESS_LINE', 'ARTICLE_ID'])
new_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 177394 entries, 0 to 177393 Data columns (total 22 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 NEIGHBORHOOD 167917 non-null object 1 COVERED_AREA 176378 non-null float64 2 LISTING_TYPE_ID 9477 non-null object 3 ADDRESS_CITY_NAME 9477 non-null object 4 ROOMS 177394 non-null int64 5 WITH_VIRTUAL_TOUR 7465 non-null object 6 HAS_AIR_CONDITIONING 5092 non-null object 7 PROCESS_DATE 167917 non-null object 8 TOTAL_AREA 9477 non-null float64 9 ITEM_CONDITION 79900 non-null object 10 PROCESS_DATE.1 9477 non-null object 11 ADDRESS_STATE 9477 non-null object 12 CONDITION 9477 non-null object 13 PRICE 177394 non-null object 14 ORIGEN 177394 non-null object 15 HAS_TELEPHONE_LINE 4044 non-null object 16 FULL_BATHROOMS 177394 non-null int64 17 BEDROOMS 175392 non-null float64 18 CURRENCY_ID 177394 non-null object 19 POSITION 589 non-null object 20 GARAGE 177394 non-null int64 21 CONSTRUCTION_YEAR 47724 non-null float64 dtypes: float64(4), int64(3), object(15) memory usage: 29.8+ MB
We are left with 22 variables. Several pairs of columns carry the same information coming from the two sources: NEIGHBORHOOD / ADDRESS_CITY_NAME, PROCESS_DATE / PROCESS_DATE.1, and CONDITION / ITEM_CONDITION.
We merge that information into the first column of each pair and drop the second columns.
new_df['NEIGHBORHOOD'] = new_df['NEIGHBORHOOD'].combine_first(new_df['ADDRESS_CITY_NAME'])
new_df['PROCESS_DATE'] = new_df['PROCESS_DATE'].combine_first(new_df['PROCESS_DATE.1'])
new_df['CONDITION'] = new_df['CONDITION'].combine_first(new_df['ITEM_CONDITION'])
new_df = new_df.drop(columns=['ADDRESS_CITY_NAME', 'PROCESS_DATE.1', 'ITEM_CONDITION'])
Now that the dates are in PROCESS_DATE, we can keep the most recent record among duplicated listings.
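A minimal, hedged sketch of that deduplication idea (the notebook itself later relies on a plain drop_duplicates; this version assumes ARTICLE_ID had been kept):
import pandas as pd

# Hedged sketch (not the exact code used in the notebook): keep only the most
# recent record per listing, restricted to rows with a non-null ARTICLE_ID.
tmp = df[df['ARTICLE_ID'].notna()].copy()
tmp['PROCESS_DATE'] = pd.to_datetime(tmp['PROCESS_DATE'], dayfirst=True, format='mixed')
latest = (
    tmp.sort_values('PROCESS_DATE')                          # oldest first
       .drop_duplicates(subset=['ARTICLE_ID'], keep='last')  # newest per ARTICLE_ID
)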
# Print the null counts together with their percentages
nulls = new_df.isnull().sum()
nulls_percent = (new_df.isnull().sum() / len(new_df)) * 100
nulls_df = pd.DataFrame({'Nulls': nulls, 'Nulls %': nulls_percent})
nulls_df
| Nulls | Nulls % | |
|---|---|---|
| NEIGHBORHOOD | 0 | 0.000000 |
| COVERED_AREA | 1016 | 0.572736 |
| LISTING_TYPE_ID | 167917 | 94.657655 |
| ROOMS | 0 | 0.000000 |
| WITH_VIRTUAL_TOUR | 169929 | 95.791853 |
| HAS_AIR_CONDITIONING | 172302 | 97.129553 |
| PROCESS_DATE | 0 | 0.000000 |
| TOTAL_AREA | 167917 | 94.657655 |
| ADDRESS_STATE | 167917 | 94.657655 |
| CONDITION | 96617 | 54.464638 |
| PRICE | 0 | 0.000000 |
| ORIGEN | 0 | 0.000000 |
| HAS_TELEPHONE_LINE | 173350 | 97.720329 |
| FULL_BATHROOMS | 0 | 0.000000 |
| BEDROOMS | 2002 | 1.128561 |
| CURRENCY_ID | 0 | 0.000000 |
| POSITION | 176805 | 99.667971 |
| GARAGE | 0 | 0.000000 |
| CONSTRUCTION_YEAR | 129670 | 73.097174 |
LISTING_TYPE_ID has 167,917 null values. This field only comes from MELI, so we fill those nulls with the word "gallito".
The WITH_VIRTUAL_TOUR column has 169,929 nulls. We impute them with "No", and we do the same with HAS_AIR_CONDITIONING.
TOTAL_AREA has 167,917 nulls, about 94% of the rows, so we drop it. Our analysis of this variable also found inconsistencies that reinforced this decision: it makes no conceptual sense that COVERED_AREA is larger than TOTAL_AREA in some cases.
COVERED_AREA has 1,016 nulls (less than 1% of the rows); since this is an extremely low number, we drop those rows. The following code blocks carry out what is described here.
HAS_TELEPHONE_LINE also has more than 97% nulls, so we drop it. Besides, nowadays having a landline is not as relevant as it perhaps was before fibre optics arrived.
BEDROOMS has just over 1% of null values. We apply the same strategy as with COVERED_AREA and drop those rows.
POSITION and CONSTRUCTION_YEAR also have a huge percentage of missing data, so we remove them from the analysis.
# Explore TOTAL_AREA and its relationship with COVERED_AREA
aux = new_df[new_df['TOTAL_AREA'].isna() & new_df['COVERED_AREA'].isna()]
aux['LISTING_TYPE_ID'].value_counts()
new_df['TOTAL_AREA'].isna().sum()
new_df[['COVERED_AREA','TOTAL_AREA']]
# Keep only the rows where TOTAL_AREA < COVERED_AREA
aux = new_df[new_df['TOTAL_AREA'] < new_df['COVERED_AREA']]
aux
print("Hay", len(aux), "filas con TOTAL_AREA < COVERED_AREA")
Hay 417 filas con TOTAL_AREA < COVERED_AREA
# Drop the variables TOTAL_AREA, HAS_TELEPHONE_LINE, POSITION and CONSTRUCTION_YEAR
new_df = new_df.drop(columns=['TOTAL_AREA', 'HAS_TELEPHONE_LINE', 'POSITION', 'CONSTRUCTION_YEAR'])
# Drop rows with null COVERED_AREA
new_df = new_df.dropna(subset=['COVERED_AREA'])
# Drop rows with null BEDROOMS
new_df = new_df.dropna(subset=['BEDROOMS'])
# Impute null LISTING_TYPE_ID values with 'gallito'
new_df['LISTING_TYPE_ID'] = new_df['LISTING_TYPE_ID'].fillna('gallito')
# Impute WITH_VIRTUAL_TOUR and HAS_AIR_CONDITIONING
new_df['WITH_VIRTUAL_TOUR'] = new_df['WITH_VIRTUAL_TOUR'].fillna('No')
new_df['HAS_AIR_CONDITIONING'] = new_df['HAS_AIR_CONDITIONING'].fillna('No')
# Impute and recode the values of CONDITION, as mentioned above.
new_df['CONDITION'] = new_df['CONDITION'].fillna('not_specified')
def map_condition(condition):
if condition in ['new', ' Estrenar']:
return 'nuevo'
elif condition in [' Buen estado', ' Impecable', 'used', ' Para reciclar', ' Reciclado']:
return 'usado'
elif condition in [' En construcción', ' En pozo']:
return 'en_construccion'
elif condition in ['not_specified', 'NaN']:
return 'sin_especificar'
else:
return condition
new_df['CONDITION'] = new_df['CONDITION'].apply(map_condition)
# Example of the ADDRESS_STATE issue: the department is missing for many listings even when NEIGHBORHOOD is known (e.g. Carrasco):
df_carrasco = new_df[new_df['NEIGHBORHOOD'] == 'Carrasco']
df_carrasco[['NEIGHBORHOOD', 'ADDRESS_STATE']]
| NEIGHBORHOOD | ADDRESS_STATE | |
|---|---|---|
| 37 | Carrasco | Montevideo |
| 99 | Carrasco | Montevideo |
| 102 | Carrasco | Montevideo |
| 158 | Carrasco | Montevideo |
| 235 | Carrasco | Montevideo |
| ... | ... | ... |
| 177283 | Carrasco | NaN |
| 177288 | Carrasco | NaN |
| 177300 | Carrasco | NaN |
| 177347 | Carrasco | NaN |
| 177353 | Carrasco | NaN |
4494 rows × 2 columns
import unidecode

# Normalize NEIGHBORHOOD before building the mapping:
# 1. Lowercase everything
new_df['NEIGHBORHOOD'] = new_df['NEIGHBORHOOD'].str.lower()
# 2. Remove accents using the 'unidecode' library (e.g. 'í' -> 'i', 'á' -> 'a').
#    If it is not installed, run: pip install unidecode
new_df['NEIGHBORHOOD'] = new_df['NEIGHBORHOOD'].apply(
    lambda x: unidecode.unidecode(x) if isinstance(x, str) else x
)
# 3. Strip leading/trailing whitespace
new_df['NEIGHBORHOOD'] = new_df['NEIGHBORHOOD'].str.strip()
# 4. Collapse repeated internal spaces, so that variants of the same name match
new_df['NEIGHBORHOOD'] = new_df['NEIGHBORHOOD'].str.replace(r'\s+', ' ', regex=True)

# Build a neighborhood -> state/department mapping: group by NEIGHBORHOOD and take
# the mode of ADDRESS_STATE within each group (mode() ignores NaNs; .iloc[0] picks
# the first value when there are ties).
mapeo_estado = new_df.groupby('NEIGHBORHOOD')['ADDRESS_STATE'].agg(lambda x: x.mode().iloc[0] if not x.mode().empty else np.nan)
# Convert the Series to a dictionary for convenience
mapeo_estado = mapeo_estado.to_dict()
mapeo_estado
# 2. Impute the null values using the mapping.
# .fillna() on ADDRESS_STATE only fills the NaN values.
new_df['ADDRESS_STATE'] = new_df['ADDRESS_STATE'].fillna(
    new_df['NEIGHBORHOOD'].map(mapeo_estado)
)
# Check, for example, that the issue is now fixed for carrasco:
df_carrasco = new_df[new_df['NEIGHBORHOOD'] == 'carrasco']
df_carrasco[['NEIGHBORHOOD', 'ADDRESS_STATE']]
| NEIGHBORHOOD | ADDRESS_STATE | |
|---|---|---|
| 37 | carrasco | Montevideo |
| 99 | carrasco | Montevideo |
| 102 | carrasco | Montevideo |
| 158 | carrasco | Montevideo |
| 235 | carrasco | Montevideo |
| ... | ... | ... |
| 177283 | carrasco | Montevideo |
| 177288 | carrasco | Montevideo |
| 177300 | carrasco | Montevideo |
| 177347 | carrasco | Montevideo |
| 177353 | carrasco | Montevideo |
4494 rows × 2 columns
Let's now see whether we can deduce even more of the remaining nulls.
# Print all NEIGHBORHOOD values for which ADDRESS_STATE is null
new_df[new_df['ADDRESS_STATE'].isna()]['NEIGHBORHOOD'].unique()
array(['carrasco norte', 'b. de carrasco', 'otros', 'piedras del chileno',
'golf', 'otra', 'los cerrillos', 'roosevelt', 'colonia nicolich',
'prado norte', 'punta colorada', 'playa mansa', 'la esmeralda',
'maronas curva', 'guazu vira', 'colonia suiza', 'cantegril',
'oceania del polonio', 'colonia', 'bikini', 'la arbolada',
'playa brava', 'parque burnett', 'peninsula', 'montoya',
'la residence', 'tio tom', 'fray marcos', 'san jose',
'abra de perdomo', 'paso carrasco', 'j. hipodromo', 'marly',
'barrio cordoba', 'pueblomio', 'solanas', 'cno. maldonado',
'altos de laguna', 'lugano', 'cno. carrasco', 'fray bentos',
'jardines de cordoba', 'las delicias', 'proa del mar',
'laguna garzon', 'valdense', 'pueblo eden', 'club del lago',
'lago merin', 'arroyo de la virgen', 'lausana', 'barrio sur',
'tarariras', 'beverly hills', 'lapataia', 'nuevo centro',
'villa aeroparque', 'san bautista', 'laguna del diario',
'lago de los cisnes', 'las grutas', 'cerro pelado',
'sierra del mar', 'artigas', 'lussich', 'verdemora', 'vichadero',
'shopping', 'lomo de la ballena', 'pinar del faro', 'eden rock',
'club del mar', 'portales', 'haras del lago', 'la palma',
'mata siete', 'camino eguzquiza', 'barra santa lucia',
'cerro san antonio', 'biarritz', 'cardona', 'barrio country',
'santa catalina'], dtype=object)
new_df[(new_df['NEIGHBORHOOD'] == 'otros') | (new_df['NEIGHBORHOOD'] == 'otra')]
| NEIGHBORHOOD | COVERED_AREA | LISTING_TYPE_ID | ROOMS | WITH_VIRTUAL_TOUR | HAS_AIR_CONDITIONING | PROCESS_DATE | ADDRESS_STATE | CONDITION | PRICE | ORIGEN | FULL_BATHROOMS | BEDROOMS | CURRENCY_ID | GARAGE |
|---|
Using an AI we can deduce the department for each of the remaining cases. We used prompt engineering, explaining the role and the objective, and then checked the cases to verify they were correct.
nuevo_mapeo = {
"carrasco norte": "Montevideo",
"b. de carrasco": "Canelones",
"otros": "No Aplica",
"piedras del chileno": "Maldonado",
"golf": "Maldonado",
"otra": "No Aplica",
"los cerrillos": "Canelones",
"roosevelt": "Maldonado",
"colonia nicolich": "Canelones",
"prado norte": "Montevideo",
"punta colorada": "Maldonado",
"playa mansa": "Maldonado",
"la esmeralda": "Rocha",
"maronas curva": "Montevideo",
"guazu vira": "Canelones",
"colonia suiza": "Colonia",
"cantegril": "Maldonado",
"oceania del polonio": "Rocha",
"colonia": "Colonia",
"bikini": "Maldonado",
"la arbolada": "Maldonado",
"playa brava": "Maldonado",
"parque burnett": "Canelones",
"peninsula": "Maldonado",
"montoya": "Maldonado",
"la residence": "Maldonado",
"tio tom": "Maldonado",
"fray marcos": "Florida",
"san jose": "San José",
"abra de perdomo": "Maldonado",
"paso carrasco": "Canelones",
"j. hipodromo": "Montevideo",
"marly": "Maldonado",
"barrio cordoba": "Montevideo",
"pueblomio": "Canelones",
"solanas": "Maldonado",
"cno. maldonado": "Montevideo",
"altos de laguna": "Maldonado",
"lugano": "Maldonado",
"cno. carrasco": "Montevideo",
"fray bentos": "Río Negro",
"jardines de cordoba": "Montevideo",
"las delicias": "Maldonado",
"proa del mar": "Maldonado",
"laguna garzon": "Maldonado",
"valdense": "Colonia",
"pueblo eden": "Maldonado",
"club del lago": "Maldonado",
"lago merin": "Cerro Largo",
"arroyo de la virgen": "Maldonado",
"lausana": "Maldonado",
"barrio sur": "Montevideo",
"tarariras": "Colonia",
"beverly hills": "Maldonado",
"lapataia": "Maldonado",
"nuevo centro": "Montevideo",
"villa aeroparque": "Canelones",
"san bautista": "Canelones",
"laguna del diario": "Maldonado",
"lago de los cisnes": "Canelones",
"las grutas": "Canelones",
"cerro pelado": "Maldonado",
"sierra del mar": "Maldonado",
"artigas": "Artigas",
"lussich": "Maldonado",
"verdemora": "Canelones",
"vichadero": "Rivera",
"shopping": "Maldonado",
"lomo de la ballena": "Maldonado",
"pinar del faro": "Maldonado",
"eden rock": "Maldonado",
"club del mar": "Maldonado",
"portales": "Canelones",
"haras del lago": "Montevideo",
"la palma": "Canelones",
"mata siete": "Canelones",
"camino eguzquiza": "Montevideo",
"barra santa lucia": "Canelones",
"cerro san antonio": "Maldonado",
"biarritz": "Canelones",
"cardona": "Soriano",
"barrio country": "Maldonado",
"santa catalina": "Montevideo"
}
# Now impute those values
new_df['ADDRESS_STATE'] = new_df['ADDRESS_STATE'].fillna(new_df['NEIGHBORHOOD'].map(nuevo_mapeo))
# Print all ADDRESS_STATE values
new_df['ADDRESS_STATE'].unique()
array(['Montevideo', 'Maldonado', 'Canelones', 'San José', 'Colonia',
'Rocha', 'Salto', 'Rivera', 'Lavalleja', 'Paysandú', 'Florida',
'Flores', 'Treinta y Tres', 'Soriano', 'Durazno', 'Cerro Largo',
'Tacuarembó', 'Río Negro', 'No Aplica', 'Artigas'], dtype=object)
Good: we were able to impute every ADDRESS_STATE value, going from 94% missing to 0%. With the null values resolved, we are ready to move on to the next steps.
new_df.info()
<class 'pandas.core.frame.DataFrame'> Index: 174376 entries, 0 to 177393 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 NEIGHBORHOOD 174376 non-null object 1 COVERED_AREA 174376 non-null float64 2 LISTING_TYPE_ID 174376 non-null object 3 ROOMS 174376 non-null int64 4 WITH_VIRTUAL_TOUR 174376 non-null object 5 HAS_AIR_CONDITIONING 174376 non-null object 6 PROCESS_DATE 174376 non-null object 7 ADDRESS_STATE 174376 non-null object 8 CONDITION 174376 non-null object 9 PRICE 174376 non-null object 10 ORIGEN 174376 non-null object 11 FULL_BATHROOMS 174376 non-null int64 12 BEDROOMS 174376 non-null float64 13 CURRENCY_ID 174376 non-null object 14 GARAGE 174376 non-null int64 dtypes: float64(2), int64(3), object(10) memory usage: 21.3+ MB
Among the remaining variables we still need to convert PROCESS_DATE to a date format and PRICE to float. Some prices contain "." thousands separators, so we must remove those "." before converting.
new_df['PROCESS_DATE'] = pd.to_datetime(new_df['PROCESS_DATE'], dayfirst=True, format='mixed')
new_df['PRICE'] = new_df['PRICE'].astype(str).str.replace('.', '', regex=False).astype(float)
new_df.info()
<class 'pandas.core.frame.DataFrame'> Index: 174376 entries, 0 to 177393 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 NEIGHBORHOOD 174376 non-null object 1 COVERED_AREA 174376 non-null float64 2 LISTING_TYPE_ID 174376 non-null object 3 ROOMS 174376 non-null int64 4 WITH_VIRTUAL_TOUR 174376 non-null object 5 HAS_AIR_CONDITIONING 174376 non-null object 6 PROCESS_DATE 174376 non-null datetime64[ns] 7 ADDRESS_STATE 174376 non-null object 8 CONDITION 174376 non-null object 9 PRICE 174376 non-null float64 10 ORIGEN 174376 non-null object 11 FULL_BATHROOMS 174376 non-null int64 12 BEDROOMS 174376 non-null float64 13 CURRENCY_ID 174376 non-null object 14 GARAGE 174376 non-null int64 dtypes: datetime64[ns](1), float64(3), int64(3), object(8) memory usage: 21.3+ MB
Ahora cada variable tiene el tipo que esperamos que tenga. Sin embargo, hay un tema que aún tenemos que resolver: El tipo de moneda utilizado. Para ello, vamos a definir un objeto que mapee cuál era el precio del dólar en el momento que se procesó dicha información y vamos a convertir todo a esa moneda. Luego, podremos eliminar la columna CURRENCY_ID.
# Print all distinct PROCESS_DATE values, sorted from earliest to latest
new_df['PROCESS_DATE'].sort_values().unique()
<DatetimeArray> ['2023-03-08 00:00:00', '2023-04-28 00:00:00', '2023-05-01 00:00:00', '2023-05-04 00:00:00', '2023-05-08 00:00:00', '2023-05-12 00:00:00', '2023-05-15 00:00:00', '2023-05-21 00:00:00', '2023-05-22 00:00:00', '2023-05-27 00:00:00', '2023-05-28 00:00:00', '2023-05-30 00:00:00', '2023-06-02 00:00:00', '2023-06-04 00:00:00', '2023-06-05 00:00:00', '2023-06-07 00:00:00', '2023-06-11 00:00:00', '2023-06-12 00:00:00', '2023-06-15 00:00:00', '2023-06-18 00:00:00', '2023-06-20 00:00:00', '2023-06-22 00:00:00', '2023-06-25 00:00:00', '2023-06-28 00:00:00', '2023-07-02 00:00:00', '2023-07-04 00:00:00', '2023-07-06 00:00:00', '2023-07-10 00:00:00', '2023-07-16 00:00:00', '2023-07-19 00:00:00', '2023-07-23 00:00:00', '2023-07-24 00:00:00', '2023-07-27 00:00:00', '2023-07-30 00:00:00', '2023-08-02 00:00:00', '2023-08-04 00:00:00', '2023-08-06 00:00:00', '2023-08-10 00:00:00', '2023-08-13 00:00:00', '2023-08-16 00:00:00', '2023-08-20 00:00:00', '2023-08-27 00:00:00', '2023-08-28 00:00:00', '2023-08-31 00:00:00', '2023-09-01 00:00:00', '2023-09-03 00:00:00', '2023-09-11 00:00:00', '2023-09-13 00:00:00', '2023-09-18 00:00:00', '2023-09-25 00:00:00', '2023-09-28 00:00:00', '2023-10-01 00:00:00', '2023-10-05 00:00:00', '2023-10-08 00:00:00', '2023-10-12 00:00:00', '2023-10-16 00:00:00', '2023-10-23 00:00:00', '2023-10-26 00:00:00', '2023-10-30 00:00:00', '2023-11-05 00:00:00', '2023-11-13 00:00:00', '2023-11-17 00:00:00', '2023-11-20 00:00:00', '2023-11-23 00:00:00', '2023-11-27 00:00:00', '2023-12-02 00:00:00', '2023-12-11 00:00:00', '2023-12-18 00:00:00', '2023-12-25 00:00:00', '2025-02-26 00:00:00', '2025-02-27 00:00:00', '2025-02-28 00:00:00', '2025-03-02 00:00:00', '2025-03-03 00:00:00', '2025-03-04 00:00:00', '2025-03-08 00:00:00', '2025-03-09 00:00:00', '2025-03-10 00:00:00', '2025-03-12 00:00:00', '2025-03-13 00:00:00', '2025-03-21 00:00:00', '2025-03-24 00:00:00', '2025-03-31 00:00:00'] Length: 83, dtype: datetime64[ns]
The dates range from March 2023 to March 2025, with more than 80 distinct values. We will use a monthly average to perform the conversion.
Since we have the unique months and years, we can define a dictionary with the approximate UYU-to-USD exchange rates for each month and year present in the data.
Warning: these exchange rates are approximate.
# Approximate monthly average exchange rates from $U to USD
exchange_rates_averages = {
(1, 2025): 43.687,
(2, 2025): 43.117,
(3, 2025): 42.271,
(4, 2025): 42.304,
(5, 2025): 41.682,
(6, 2025): 40.854,
(7, 2025): 40.246,
(8, 2025): 40.043,
(1, 2023): 39.386,
(2, 2023): 39.028,
(3, 2023): 39.112,
(4, 2023): 38.782,
(5, 2023): 38.861,
(6, 2023): 38.2,
(7, 2023): 37.888,
(8, 2023): 37.851,
(9, 2023): 38.146,
(10, 2023): 39.745,
(11, 2023): 39.554,
(12, 2023): 39.303
}
def convert_to_usd(row):
if row['CURRENCY_ID'] == 'USD':
return row['PRICE']
elif row['CURRENCY_ID'] in ['$U', 'UYU']:
# build a (month, year) tuple
month_year = (row['PROCESS_DATE'].month, row['PROCESS_DATE'].year)
rate = exchange_rates_averages.get(month_year)
if rate:
return row['PRICE'] / rate
else:
return np.nan
else:
return np.nan
new_df['PRICE_USD'] = new_df.apply(convert_to_usd, axis=1)
# Check the PRICE variable for observations with CURRENCY in Uruguayan pesos; we want to see its distribution
new_df[(new_df['CURRENCY_ID'] == 'UYU')|(new_df['CURRENCY_ID'] == '$U')][['PRICE','CURRENCY_ID']].sort_values(by='PRICE',ascending=True)
| PRICE | CURRENCY_ID | |
|---|---|---|
| 121108 | 16500.0 | $U |
| 62513 | 17000.0 | $U |
| 115007 | 17000.0 | $U |
| 119440 | 17000.0 | $U |
| 98394 | 17000.0 | $U |
| 38920 | 17000.0 | $U |
| 78158 | 17000.0 | $U |
| 112050 | 18000.0 | $U |
| 109760 | 18000.0 | $U |
| 98670 | 19500.0 | $U |
| 87624 | 20000.0 | $U |
| 106788 | 20000.0 | $U |
| 96966 | 20000.0 | $U |
| 68622 | 22000.0 | $U |
| 48754 | 22000.0 | $U |
| 98414 | 24998.0 | $U |
| 112183 | 24998.0 | $U |
| 43009 | 25000.0 | $U |
| 26509 | 25000.0 | $U |
| 63532 | 25000.0 | $U |
| 101995 | 25500.0 | $U |
| 173410 | 27000.0 | $U |
| 15015 | 28000.0 | $U |
| 85287 | 28000.0 | $U |
| 28280 | 28000.0 | $U |
| 66751 | 28000.0 | $U |
| 46723 | 28000.0 | $U |
| 160851 | 29500.0 | $U |
| 50805 | 30000.0 | $U |
| 43977 | 45000.0 | $U |
| 8619 | 45000.0 | UYU |
| 19600 | 45000.0 | $U |
| 82613 | 45000.0 | $U |
| 151058 | 45000.0 | $U |
| 166763 | 45000.0 | $U |
| 99611 | 45000.0 | $U |
| 61818 | 45000.0 | $U |
| 115529 | 45000.0 | $U |
| 135112 | 45000.0 | $U |
| 5610 | 50000.0 | UYU |
| 15630 | 54000.0 | $U |
| 60280 | 54000.0 | $U |
| 65519 | 54000.0 | $U |
| 45592 | 54000.0 | $U |
| 28512 | 54000.0 | $U |
| 19671 | 55000.0 | $U |
| 27116 | 55000.0 | $U |
| 136 | 72000.0 | UYU |
| 390 | 85000.0 | UYU |
| 80558 | 89000.0 | $U |
| 52 | 89000.0 | UYU |
| 50146 | 90000.0 | $U |
| 73264 | 90000.0 | $U |
| 102861 | 90000.0 | $U |
| 372 | 90000.0 | UYU |
| 66543 | 90000.0 | $U |
| 85518 | 90000.0 | $U |
| 52332 | 90000.0 | $U |
| 126435 | 90000.0 | $U |
| 5811 | 93900.0 | UYU |
| 5778 | 100000.0 | UYU |
| 5471 | 105000.0 | UYU |
| 1168 | 109000.0 | UYU |
| 25814 | 110000.0 | $U |
| 7429 | 111111.0 | UYU |
| 7570 | 111111.0 | UYU |
| 5044 | 111111.0 | UYU |
| 9218 | 111111.0 | UYU |
| 141 | 111111.0 | UYU |
| 2828 | 111111.0 | UYU |
| 1414 | 115000.0 | UYU |
| 5218 | 116000.0 | UYU |
| 8604 | 120000.0 | UYU |
| 7792 | 125000.0 | UYU |
| 961 | 129000.0 | UYU |
| 7706 | 129000.0 | UYU |
| 4534 | 129900.0 | UYU |
| 4827 | 130000.0 | UYU |
| 135313 | 137000.0 | $U |
| 113692 | 137000.0 | $U |
| 9431 | 139000.0 | UYU |
| 138032 | 144990.0 | $U |
| 4803 | 149500.0 | UYU |
| 98302 | 150000.0 | $U |
| 96267 | 165000.0 | $U |
| 166792 | 165000.0 | $U |
| 40141 | 165000.0 | $U |
| 83709 | 165000.0 | $U |
| 112756 | 165000.0 | $U |
| 42520 | 165000.0 | $U |
| 151556 | 165000.0 | $U |
| 59688 | 165000.0 | $U |
| 137667 | 165000.0 | $U |
| 9254 | 195000.0 | UYU |
| 4508 | 220000.0 | UYU |
| 2555 | 220000.0 | UYU |
| 135277 | 250000.0 | $U |
| 8549 | 260000.0 | UYU |
| 6649 | 288000.0 | UYU |
| 151842 | 290000.0 | $U |
| 164260 | 290000.0 | $U |
| 9412 | 290000.0 | UYU |
| 38431 | 290000.0 | $U |
| 76188 | 290000.0 | $U |
| 112757 | 290000.0 | $U |
| 8368 | 310000.0 | UYU |
| 8596 | 385000.0 | UYU |
| 8447 | 485000.0 | UYU |
| 8588 | 630000.0 | UYU |
| 30230 | 1111111.0 | $U |
| 160545 | 1111111.0 | $U |
| 68086 | 1111111.0 | $U |
| 11214 | 1111111.0 | $U |
| 4354 | 1111111.0 | UYU |
| 8221 | 1111111.0 | UYU |
| 109015 | 1111111.0 | $U |
| 126844 | 1111111.0 | $U |
| 50690 | 1111111.0 | $U |
| 87833 | 1111111.0 | $U |
| 144254 | 1111111.0 | $U |
| 5051 | 11111111.0 | UYU |
| 5251 | 11111111.0 | UYU |
As we can see, even the highest value in Uruguayan pesos looks unrealistic. What is probably happening is that users pick the wrong currency (or leave the default) when listing. We will keep analysing these cases.
It is impossible for houses to cost around 500 dollars. Our assumption is that users select the currency type incorrectly; for example, the cheapest house observed (16,500 \$U) most likely corresponds to 16,500 USD. What we will do is assume that all values are actually in USD.
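A minimal sketch of that assumption (hedged: the notebook effectively achieves the same thing by keeping PRICE and later dropping PRICE_USD and CURRENCY_ID; this hypothetical line is just an equivalent shortcut, not an actual cell):
# Hedged sketch: treat every listed PRICE as USD regardless of CURRENCY_ID,
# overriding the converted PRICE_USD column computed above.
new_df['PRICE_USD'] = new_df['PRICE']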
new_df.info()
<class 'pandas.core.frame.DataFrame'> Index: 174376 entries, 0 to 177393 Data columns (total 16 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 NEIGHBORHOOD 174376 non-null object 1 COVERED_AREA 174376 non-null float64 2 LISTING_TYPE_ID 174376 non-null object 3 ROOMS 174376 non-null int64 4 WITH_VIRTUAL_TOUR 174376 non-null object 5 HAS_AIR_CONDITIONING 174376 non-null object 6 PROCESS_DATE 174376 non-null datetime64[ns] 7 ADDRESS_STATE 174376 non-null object 8 CONDITION 174376 non-null object 9 PRICE 174376 non-null float64 10 ORIGEN 174376 non-null object 11 FULL_BATHROOMS 174376 non-null int64 12 BEDROOMS 174376 non-null float64 13 CURRENCY_ID 174376 non-null object 14 GARAGE 174376 non-null int64 15 PRICE_USD 174376 non-null float64 dtypes: datetime64[ns](1), float64(4), int64(3), object(8) memory usage: 22.6+ MB
pd.set_option('display.float_format', lambda x: '%.2f' % x)
new_df.describe()
| COVERED_AREA | ROOMS | PROCESS_DATE | PRICE | FULL_BATHROOMS | BEDROOMS | GARAGE | PRICE_USD | |
|---|---|---|---|---|---|---|---|---|
| count | 174376.00 | 174376.00 | 174376 | 174376.00 | 174376.00 | 174376.00 | 174376.00 | 174376.00 |
| mean | 4220.47 | 0.21 | 2023-09-06 21:37:36.916089600 | 485343.83 | 2.22 | 3.17 | 0.92 | 485083.40 |
| min | 0.00 | 0.00 | 2023-03-08 00:00:00 | 1.00 | 0.00 | 0.00 | 0.00 | 1.00 |
| 25% | 144.00 | 0.00 | 2023-06-07 00:00:00 | 175000.00 | 2.00 | 3.00 | 0.00 | 175000.00 |
| 50% | 400.00 | 0.00 | 2023-08-13 00:00:00 | 320000.00 | 2.00 | 3.00 | 1.00 | 320000.00 |
| 75% | 791.00 | 0.00 | 2023-10-26 00:00:00 | 550000.00 | 3.00 | 4.00 | 1.00 | 550000.00 |
| max | 111111111.00 | 1111.00 | 2025-03-31 00:00:00 | 25000000.00 | 22.00 | 30.00 | 160.00 | 25000000.00 |
| std | 447886.12 | 2.88 | NaN | 633547.11 | 0.84 | 0.88 | 1.35 | 632556.15 |
A few things catch our attention: several numeric columns contain placeholder values made up of repeated ones (111, 1111, ..., 1111111111), and the maxima of COVERED_AREA, ROOMS and GARAGE are implausibly large. We start by removing the rows containing those placeholder values.
# Drop rows containing obvious placeholder values (repeated 1s) in any numeric column
numerical_cols = new_df.select_dtypes(include=np.number).columns
values_to_remove = [1111111111, 111111111, 11111111, 1111111, 111111, 11111, 1111, 111]
mask = new_df[numerical_cols].isin(values_to_remove).any(axis=1)
new_df = new_df[~mask]
new_df.info()
<class 'pandas.core.frame.DataFrame'> Index: 173879 entries, 0 to 177393 Data columns (total 16 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 NEIGHBORHOOD 173879 non-null object 1 COVERED_AREA 173879 non-null float64 2 LISTING_TYPE_ID 173879 non-null object 3 ROOMS 173879 non-null int64 4 WITH_VIRTUAL_TOUR 173879 non-null object 5 HAS_AIR_CONDITIONING 173879 non-null object 6 PROCESS_DATE 173879 non-null datetime64[ns] 7 ADDRESS_STATE 173879 non-null object 8 CONDITION 173879 non-null object 9 PRICE 173879 non-null float64 10 ORIGEN 173879 non-null object 11 FULL_BATHROOMS 173879 non-null int64 12 BEDROOMS 173879 non-null float64 13 CURRENCY_ID 173879 non-null object 14 GARAGE 173879 non-null int64 15 PRICE_USD 173879 non-null float64 dtypes: datetime64[ns](1), float64(4), int64(3), object(8) memory usage: 22.6+ MB
# We consider that houses priced below USD 1000 were entered with errors
new_df[new_df['PRICE'] <= 1000].head()
| NEIGHBORHOOD | COVERED_AREA | LISTING_TYPE_ID | ROOMS | WITH_VIRTUAL_TOUR | HAS_AIR_CONDITIONING | PROCESS_DATE | ADDRESS_STATE | CONDITION | PRICE | ORIGEN | FULL_BATHROOMS | BEDROOMS | CURRENCY_ID | GARAGE | PRICE_USD | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5232 | santa bernardina | 135.00 | gold | 0 | No | No | 2025-02-26 | Durazno | sin_especificar | 999.00 | meli | 3 | 4.00 | USD | 0 | 999.00 |
| 5593 | punta del este | 1200.00 | gold | 0 | No | No | 2025-02-26 | Maldonado | sin_especificar | 1000.00 | meli | 3 | 4.00 | USD | 0 | 1000.00 |
| 9966 | aigua | 1.00 | gallito | 0 | No | No | 2023-04-28 | Maldonado | sin_especificar | 150.00 | gallito | 2 | 4.00 | USD | 1 | 150.00 |
| 10067 | paso molino | 560.00 | gallito | 0 | No | No | 2023-04-28 | Montevideo | sin_especificar | 350.00 | gallito | 3 | 4.00 | USD | 0 | 350.00 |
| 10706 | punta del este | 701.00 | gallito | 0 | No | No | 2023-04-28 | Maldonado | sin_especificar | 1.00 | gallito | 3 | 4.00 | USD | 1 | 1.00 |
# Remove the cases with prices at or below 1000
new_df = new_df[new_df['PRICE'] > 1000]
new_df.info()
<class 'pandas.core.frame.DataFrame'> Index: 173581 entries, 0 to 177393 Data columns (total 16 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 NEIGHBORHOOD 173581 non-null object 1 COVERED_AREA 173581 non-null float64 2 LISTING_TYPE_ID 173581 non-null object 3 ROOMS 173581 non-null int64 4 WITH_VIRTUAL_TOUR 173581 non-null object 5 HAS_AIR_CONDITIONING 173581 non-null object 6 PROCESS_DATE 173581 non-null datetime64[ns] 7 ADDRESS_STATE 173581 non-null object 8 CONDITION 173581 non-null object 9 PRICE 173581 non-null float64 10 ORIGEN 173581 non-null object 11 FULL_BATHROOMS 173581 non-null int64 12 BEDROOMS 173581 non-null float64 13 CURRENCY_ID 173581 non-null object 14 GARAGE 173581 non-null int64 15 PRICE_USD 173581 non-null float64 dtypes: datetime64[ns](1), float64(4), int64(3), object(8) memory usage: 22.5+ MB
After this major preliminary cleaning of the dataset, let's check whether we have duplicated records.
# Now we can drop PROCESS_DATE
new_df = new_df.drop(columns=['PROCESS_DATE'])
# Count duplicated rows
new_df.duplicated().sum()
136560
# Drop the duplicates
new_df = new_df.drop_duplicates()
# Build a table containing only the duplicates (it should now be empty)
duplicated_rows = new_df[new_df.duplicated()]
duplicated_rows
| NEIGHBORHOOD | COVERED_AREA | LISTING_TYPE_ID | ROOMS | WITH_VIRTUAL_TOUR | HAS_AIR_CONDITIONING | ADDRESS_STATE | CONDITION | PRICE | ORIGEN | FULL_BATHROOMS | BEDROOMS | CURRENCY_ID | GARAGE | PRICE_USD |
|---|
new_df.info()
<class 'pandas.core.frame.DataFrame'> Index: 37021 entries, 0 to 177385 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 NEIGHBORHOOD 37021 non-null object 1 COVERED_AREA 37021 non-null float64 2 LISTING_TYPE_ID 37021 non-null object 3 ROOMS 37021 non-null int64 4 WITH_VIRTUAL_TOUR 37021 non-null object 5 HAS_AIR_CONDITIONING 37021 non-null object 6 ADDRESS_STATE 37021 non-null object 7 CONDITION 37021 non-null object 8 PRICE 37021 non-null float64 9 ORIGEN 37021 non-null object 10 FULL_BATHROOMS 37021 non-null int64 11 BEDROOMS 37021 non-null float64 12 CURRENCY_ID 37021 non-null object 13 GARAGE 37021 non-null int64 14 PRICE_USD 37021 non-null float64 dtypes: float64(4), int64(3), object(8) memory usage: 4.5+ MB
# Drop duplicated rows (redundant at this point; they were already removed above)
new_df = new_df.drop_duplicates(keep='first')
new_df.info()
<class 'pandas.core.frame.DataFrame'> Index: 37021 entries, 0 to 177385 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 NEIGHBORHOOD 37021 non-null object 1 COVERED_AREA 37021 non-null float64 2 LISTING_TYPE_ID 37021 non-null object 3 ROOMS 37021 non-null int64 4 WITH_VIRTUAL_TOUR 37021 non-null object 5 HAS_AIR_CONDITIONING 37021 non-null object 6 ADDRESS_STATE 37021 non-null object 7 CONDITION 37021 non-null object 8 PRICE 37021 non-null float64 9 ORIGEN 37021 non-null object 10 FULL_BATHROOMS 37021 non-null int64 11 BEDROOMS 37021 non-null float64 12 CURRENCY_ID 37021 non-null object 13 GARAGE 37021 non-null int64 14 PRICE_USD 37021 non-null float64 dtypes: float64(4), int64(3), object(8) memory usage: 4.5+ MB
# Let's see how many observations were found in 2025
#new_df['PROCESS_DATE'].dt.year.value_counts()
Although some observations are from 2025, we decided to keep them, since there have been no large changes in the exchange rate. To be more precise, though, we should analyse the exchange rate and inflation.
# Drop PRICE_USD and CURRENCY_ID (PROCESS_DATE was already dropped above)
new_df = new_df.drop(columns=['PRICE_USD', 'CURRENCY_ID'])
new_df.info()
<class 'pandas.core.frame.DataFrame'> Index: 37021 entries, 0 to 177385 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 NEIGHBORHOOD 37021 non-null object 1 COVERED_AREA 37021 non-null float64 2 LISTING_TYPE_ID 37021 non-null object 3 ROOMS 37021 non-null int64 4 WITH_VIRTUAL_TOUR 37021 non-null object 5 HAS_AIR_CONDITIONING 37021 non-null object 6 ADDRESS_STATE 37021 non-null object 7 CONDITION 37021 non-null object 8 PRICE 37021 non-null float64 9 ORIGEN 37021 non-null object 10 FULL_BATHROOMS 37021 non-null int64 11 BEDROOMS 37021 non-null float64 12 GARAGE 37021 non-null int64 dtypes: float64(3), int64(3), object(7) memory usage: 4.0+ MB
# Dropping the currency columns creates new exact duplicates, so we deduplicate again
new_df = new_df.drop_duplicates(keep='first')
new_df.info()
<class 'pandas.core.frame.DataFrame'> Index: 36982 entries, 0 to 177385 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 NEIGHBORHOOD 36982 non-null object 1 COVERED_AREA 36982 non-null float64 2 LISTING_TYPE_ID 36982 non-null object 3 ROOMS 36982 non-null int64 4 WITH_VIRTUAL_TOUR 36982 non-null object 5 HAS_AIR_CONDITIONING 36982 non-null object 6 ADDRESS_STATE 36982 non-null object 7 CONDITION 36982 non-null object 8 PRICE 36982 non-null float64 9 ORIGEN 36982 non-null object 10 FULL_BATHROOMS 36982 non-null int64 11 BEDROOMS 36982 non-null float64 12 GARAGE 36982 non-null int64 dtypes: float64(3), int64(3), object(7) memory usage: 4.0+ MB
The data distribution now looks as follows:
new_df.describe()
| COVERED_AREA | ROOMS | PRICE | FULL_BATHROOMS | BEDROOMS | GARAGE | |
|---|---|---|---|---|---|---|
| count | 36982.00 | 36982.00 | 36982.00 | 36982.00 | 36982.00 | 36982.00 |
| mean | 2112.63 | 0.86 | 427402.36 | 2.14 | 3.13 | 0.74 |
| std | 260267.65 | 2.15 | 597756.27 | 1.01 | 1.04 | 1.51 |
| min | 0.00 | 0.00 | 1150.00 | 0.00 | 0.00 | 0.00 |
| 25% | 110.00 | 0.00 | 150000.00 | 1.00 | 2.00 | 0.00 |
| 50% | 246.00 | 0.00 | 275000.00 | 2.00 | 3.00 | 0.00 |
| 75% | 560.00 | 0.00 | 480000.00 | 3.00 | 4.00 | 1.00 |
| max | 50000000.00 | 38.00 | 25000000.00 | 22.00 | 30.00 | 160.00 |
At this point we create a copy of the current dataset, since some of the changes we will apply next are not suitable for every model. For linear models it is convenient for categorical variables to have few categories, because each category is read as a new variable by the model, and categories with very few observations increase multicollinearity and make the model harder to interpret. On the other hand, for models such as random forest or CatBoost these modifications imply a loss of information that hurts predictive power. For this reason we will work with two datasets with different treatments (a sketch of the category grouping idea appears after the next cell).
df_sin_outliers_para_catboost = new_df.copy()
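As a hedged illustration of the treatment intended for the linear-model dataset (an assumed approach, not code taken from the notebook): low-frequency categories of a variable such as NEIGHBORHOOD could be collapsed into a single 'otros' level to keep the number of dummy variables manageable.
# Hedged sketch: group rare NEIGHBORHOOD categories into 'otros' for the
# linear-model dataset; the threshold of 100 listings is an arbitrary example.
counts = new_df['NEIGHBORHOOD'].value_counts()
rare = counts[counts < 100].index
new_df['NEIGHBORHOOD'] = new_df['NEIGHBORHOOD'].where(
    ~new_df['NEIGHBORHOOD'].isin(rare), 'otros'
)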
# Current contents of the CONDITION variable
new_df['CONDITION'].value_counts(dropna=False)
| count | |
|---|---|
| CONDITION | |
| usado | 19837 |
| sin_especificar | 13777 |
| nuevo | 3119 |
| en_construccion | 249 |
# Impute and recode CONDITION again (same mapping as above; at this point the cell leaves the values unchanged)
new_df['CONDITION'] = new_df['CONDITION'].fillna('not_specified')
def map_condition(condition):
if condition in ['new', ' Estrenar']:
return 'nuevo'
elif condition in [' Buen estado', ' Impecable', 'used', ' Para reciclar', ' Reciclado']:
return 'usado'
elif condition in [' En construcción', ' En pozo']:
return 'en_construccion'
elif condition in ['not_specified', 'NaN']:
return 'sin_especificar'
else:
return condition
new_df['CONDITION'] = new_df['CONDITION'].apply(map_condition)
new_df['CONDITION'].value_counts(dropna=False)
| count | |
|---|---|
| CONDITION | |
| usado | 19837 |
| sin_especificar | 13777 |
| nuevo | 3119 |
| en_construccion | 249 |
numerical_columns = new_df.select_dtypes(include=np.number).columns
for col in numerical_columns:
print(col)
print('Skew :',round(new_df[col].skew(),2))
plt.figure(figsize=(15,4))
plt.subplot(1,2,1)
new_df[col].hist(bins=10, grid=False)
plt.ylabel('count')
plt.subplot(1,2,2)
sns.boxplot(x=new_df[col])
plt.show()
COVERED_AREA Skew : 191.71
ROOMS Skew : 3.22
PRICE Skew : 7.96
FULL_BATHROOMS Skew : 1.73
BEDROOMS Skew : 2.77
GARAGE Skew : 41.92
In the following visualizations we decided to trim the top 10% of the data for the variables COVERED_AREA, PRICE, ROOMS and GARAGE, since we observed a huge number of outliers that made the plots hard to read.
# Plots
data_cont=new_df[['COVERED_AREA','PRICE','ROOMS','FULL_BATHROOMS','BEDROOMS','GARAGE']]
pd.set_option('display.float_format', lambda x: '%.2f' % x)
needs_90_per_cent = ['COVERED_AREA', 'PRICE' , 'ROOMS', 'GARAGE']
for col in data_cont.columns:
print(col)
# Calculate the 90th percentile
limit_90th_percentile = data_cont[col].quantile(0.90 if col in needs_90_per_cent else 1)
# Filter out the top 10%
data_filtered = data_cont[data_cont[col] <= limit_90th_percentile][col]
print('Skew :',round(data_filtered.skew(),2))
pd.set_option('display.float_format', lambda x: '%.2f' % x)
plt.figure(figsize=(15,4))
plt.subplot(1,2,1)
data_filtered.hist(bins=10, grid=False)
plt.ylabel('count')
plt.subplot(1,2,2)
sns.boxplot(x=data_filtered)
plt.show()
COVERED_AREA Skew : 1.1
PRICE Skew : 0.98
ROOMS Skew : 3.04
FULL_BATHROOMS Skew : 1.73
BEDROOMS Skew : 2.77
GARAGE Skew : 0.88
We proceed to remove some outliers from these variables:
COVERED_AREA: most of the data is heavily concentrated between roughly 0 and 200 square metres. The central half of the data lies between 110 and 560 square metres. It has positive skew (a long right tail), and the skew is extreme: although the median is only 246, the maximum reaches 50,000,000. A large number of outliers is visible.
PRICE: most of the data is concentrated around 100,000. The central half of the data lies between 150,000 and 480,000. It has positive skew (a long right tail), and the skew is very large: although the median is 275,000, the actual maximum in the dataset reaches 25,000,000. A large number of outliers is visible.
ROOMS: most of the data is concentrated at zero rooms. It has a very large positive skew: the median is 0, but the maximum reaches 38.
FULL_BATHROOMS: most of the data is concentrated between 0 and 2 bathrooms, with frequencies dropping from 3 bathrooms onwards. The central half of the data lies between 1 and 3. It has positive skew: although the median is only 2, the maximum reaches 22. The boxplot shows many outliers stretching up to the maximum.
BEDROOMS: most of the data is concentrated between 3 and 5 bedrooms. The central half of the data lies between 2 and 4. It has positive skew: although the median is 3, the maximum reaches 30.
GARAGE: most houses have zero or one garage, with the clear peak at 0 garages. The median is 0, but the maximum reaches 160.
That said, some of these outliers really catch our attention (a sketch of how rows like these could be filtered out follows the table below):
df[(df['ROOMS'] >= 30) | (df['BEDROOMS'] >= 30) | (df['FULL_BATHROOMS'] > 20) | (df['GARAGE'] > 20)]
| NEIGHBORHOOD | COVERED_AREA | LISTING_TYPE_ID | ADDRESS_CITY_NAME | ROOMS | WITH_VIRTUAL_TOUR | HAS_AIR_CONDITIONING | PROCESS_DATE | TOTAL_AREA | DESCRIPTION | ITEM_CONDITION | PROCESS_DATE.1 | ADDRESS_STATE | ADDRESS | CONDITION | TITLE | PRICE | ADDRESS_LINE | CATEGORY_ID | ORIGEN | SITE_ID | HAS_TELEPHONE_LINE | ID | DESCRIPTION.1 | FULL_BATHROOMS | OPERATION | BEDROOMS | CURRENCY_ID | PROPERTY_TYPE | POSITION | ARTICLE_ID | GARAGE | CONSTRUCTION_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 658 | NaN | 1535.00 | gold_premium | Ciudad Vieja | 30 | NaN | No | NaN | 1535.00 | NaN | Usado | 08/03/2023 | Montevideo | NaN | used | Edificio Palacio Amézaga, Oportunidad De Inver... | 1000000 | BUENOS AIRES 495 | MLU1468 | meli | MLU | No | MLU474107663 | NaN | 6 | Venta | 10.00 | USD | Casa | NaN | NaN | 0 | NaN |
| 1411 | NaN | 738.00 | gold_premium | Punta Carretas | 1111 | NaN | NaN | NaN | 738.00 | NaN | Nuevo | 08/03/2023 | Montevideo | NaN | new | Oportunidad De Inversión, Edificio Con 12 Apar... | 2000000 | Doctor José María Montero 2700 - 3000, Montevi... | MLU1468 | meli | MLU | Sí | MLU624736480 | NaN | 12 | Venta | 22.00 | USD | Casa | NaN | NaN | 0 | NaN |
| 1769 | NaN | 457.00 | gold_premium | Ciudad Vieja | 38 | NaN | No | NaN | 457.00 | NaN | Usado | 08/03/2023 | Montevideo | NaN | used | Hospedaje Hotel Pensión Con Renta, 30 Habitaci... | 480000 | guarani 1470 | MLU1468 | meli | MLU | Sí | MLU613059371 | NaN | 5 | Venta | 30.00 | USD | Casa | NaN | NaN | 0 | NaN |
| 4235 | NaN | 457.00 | gold_premium | Ciudad Vieja | 30 | No | No | NaN | 457.00 | DescripciónBarrio Ciudad ViejaGuaraní esq. 25 ... | Usado | 26/02/2025 | Montevideo | NaN | used | Oportunidad De Inversión. Casa Hospedaje En V... | 479000 | Guaraní Esq. 25 de Mayo | MLU1468 | meli | MLU | No | MLU692643450 | NaN | 5 | Venta | 30.00 | USD | Casa | NaN | NaN | 0 | NaN |
| 7038 | NaN | 457.00 | silver | Ciudad Vieja | 38 | No | No | NaN | 457.00 | DescripciónVISION Vende Hospedaje de 30 habita... | Usado | 03/03/2025 | Montevideo | NaN | used | Oportunidad De Inversion: Venta De Hospedaje H... | 479000 | guarani 1470 | MLU1468 | meli | MLU | Sí | MLU676629270 | NaN | 5 | Venta | 30.00 | USD | Casa | NaN | NaN | 0 | NaN |
| 7051 | NaN | 1535.00 | gold_premium | Ciudad Vieja | 30 | No | No | NaN | 1535.00 | Sin descripcion | Usado | 03/03/2025 | Montevideo | NaN | used | Edificio Palacio Amézaga, Oportunidad De Inver... | 1000000 | BUENOS AIRES 495 | MLU1468 | meli | MLU | No | MLU714533788 | NaN | 6 | Venta | 10.00 | USD | Casa | NaN | NaN | 0 | NaN |
| 8918 | NaN | 215.00 | gold_premium | Carrasco | 11 | No | No | NaN | 496.00 | DescripciónGran Oportunidad de Inversión!! ven... | Usado | 31/03/2025 | Montevideo | NaN | used | Imperdible!! Oportunidad, Venta De 3 Casas Sól... | 90000 | Cno. Oncativo 2775, Montevideo | MLU1468 | meli | MLU | Sí | MLU649937183 | NaN | 22 | Venta | 4.00 | USD | Casa | NaN | NaN | 0 | NaN |
| 17188 | Ciudad Vieja | 400.00 | NaN | NaN | 0 | NaN | NaN | 2023-04-28 | NaN | NaN | NaN | NaN | NaN | Bueno Aires y Misiones | NaN | Venta ciudad vieja | 350.000 | NaN | NaN | gallito | NaN | NaN | NaN | Excelente Padrón en ciudad vieja. Calle Buenos... | 3 | Venta | 4.00 | USD | Casa | NaN | 18419839.00 | 50 | 1930.00 |
| 24874 | Brazo Oriental | 1434.00 | NaN | NaN | 0 | NaN | NaN | 2023-04-28 | NaN | NaN | Para reciclar | NaN | NaN | Av. SAN MARTIN 3606 y R. KOCK | NaN | 2023. SAN MARTÍN 3606: CASA TERRENO 1.434 m2 | 250.000 | NaN | NaN | gallito | NaN | NaN | NaN | NaN | 2 | Venta | 4.00 | USD | Casa | NaN | 16031422.00 | 80 | 1940.00 |
| 26548 | Ciudad Vieja | 400.00 | NaN | NaN | 0 | NaN | NaN | 2023-05-01 | NaN | NaN | NaN | NaN | NaN | Bueno Aires y Misiones | NaN | Venta ciudad vieja | 350.000 | NaN | NaN | gallito | NaN | NaN | NaN | Excelente Padrón en ciudad vieja. Calle Buenos... | 3 | Venta | 4.00 | USD | Casa | NaN | 18419839.00 | 50 | 1930.00 |
| 28586 | Brazo Oriental | 1434.00 | NaN | NaN | 0 | NaN | NaN | 2023-05-04 | NaN | NaN | Para reciclar | NaN | NaN | Av. SAN MARTIN 3606 y R. KOCK | NaN | 2023. SAN MARTÍN 3606: CASA TERRENO 1.434 m2 | 250.000 | NaN | NaN | gallito | NaN | NaN | NaN | NaN | 2 | Venta | 4.00 | USD | Casa | NaN | 16031422.00 | 80 | 1940.00 |
| 34791 | Brazo Oriental | 1434.00 | NaN | NaN | 0 | NaN | NaN | 2023-05-15 | NaN | NaN | Para reciclar | NaN | NaN | Av. SAN MARTIN 3606 y R. KOCK | NaN | SAN MARTÍN 3606: CASA TERRENO 1.434 m2 | 250.000 | NaN | NaN | gallito | NaN | NaN | NaN | NaN | 2 | Venta | 4.00 | USD | Casa | NaN | 16031422.00 | 80 | 1940.00 |
| 46585 | Brazo Oriental | 1434.00 | NaN | NaN | 0 | NaN | NaN | 2023-06-02 | NaN | NaN | Para reciclar | NaN | NaN | Av. SAN MARTIN 3606 y R. KOCK | NaN | SAN MARTÍN 3606: CASA TERRENO 1.434 m2 | 250.000 | NaN | NaN | gallito | NaN | NaN | NaN | NaN | 2 | Venta | 4.00 | USD | Casa | NaN | 16031422.00 | 80 | 1940.00 |
| 71632 | Brazo Oriental | 1434.00 | NaN | NaN | 0 | NaN | NaN | 2023-07-06 | NaN | NaN | Para reciclar | NaN | NaN | Av. SAN MARTIN 3606 y R. KOCK | NaN | SAN MARTÍN 3606: CASA TERRENO 1.434 m2 | 250.000 | NaN | NaN | gallito | NaN | NaN | NaN | NaN | 2 | Venta | 4.00 | USD | Casa | NaN | 16031422.00 | 80 | 1940.00 |
| 84229 | Brazo Oriental | 1434.00 | NaN | NaN | 0 | NaN | NaN | 2023-07-27 | NaN | NaN | Para reciclar | NaN | NaN | Av. SAN MARTIN 3606 y R. KOCK | NaN | SAN MARTÍN 3606: CASA TERRENO 1.434 m2 | 150.000 | NaN | NaN | gallito | NaN | NaN | NaN | NaN | 2 | Venta | 4.00 | USD | Casa | NaN | 16031422.00 | 80 | 1940.00 |
| 84283 | Parque Batlle | 1055.00 | NaN | NaN | 0 | NaN | NaN | 2023-07-27 | NaN | NaN | Para reciclar | NaN | NaN | Luis Alberto Herrera y Monte Caseros | NaN | VENDO SINERGIA DOS CASAS / Mansiones para edif... | 1.500.000 | NaN | NaN | gallito | NaN | NaN | NaN | SINERGIA INVERSIONES VENDE DOS CASAS PARA EMPR... | 3 | Venta | 4.00 | USD | Casa | NaN | 24095041.00 | 25 | 1950.00 |
| 104455 | Brazo Oriental | 1434.00 | NaN | NaN | 0 | NaN | NaN | 2023-08-28 | NaN | NaN | Para reciclar | NaN | NaN | Av. SAN MARTIN 3606 y R. KOCK | NaN | SAN MARTÍN 3606: CASA TERRENO 1.434 m2 | 150.000 | NaN | NaN | gallito | NaN | NaN | NaN | NaN | 2 | Venta | 4.00 | USD | Casa | NaN | 16031422.00 | 80 | 1940.00 |
| 104826 | Punta Del Este | 711.00 | NaN | NaN | 0 | NaN | NaN | 2023-08-28 | NaN | NaN | Buen estado | NaN | NaN | Santiago de Chile y No Especificado | NaN | Gran chalet de época en Cantegril | 410.000 | NaN | NaN | gallito | NaN | NaN | NaN | Gran chalet de excelente construcción y muy bi... | 3 | Venta | 4.00 | USD | Casa | NaN | 24263149.00 | 160 | NaN |
| 110516 | Parque Batlle | 1055.00 | NaN | NaN | 0 | NaN | NaN | 2023-09-11 | NaN | NaN | Para reciclar | NaN | NaN | Luis Alberto Herrera y Monte Caseros | NaN | VENDO SINERGIA DOS CASAS / Mansiones para edif... | 1.500.000 | NaN | NaN | gallito | NaN | NaN | NaN | SINERGIA INVERSIONES VENDE DOS CASAS PARA EMPR... | 3 | Venta | 4.00 | USD | Casa | NaN | 24095041.00 | 25 | 1950.00 |
| 119544 | Parque Batlle | 1055.00 | NaN | NaN | 0 | NaN | NaN | 2023-09-25 | NaN | NaN | Para reciclar | NaN | NaN | Luis Alberto Herrera y Monte Caseros | NaN | VENDO SINERGIA DOS CASAS / Mansiones para edif... | 1.500.000 | NaN | NaN | gallito | NaN | NaN | NaN | SINERGIA INVERSIONES VENDE DOS CASAS PARA EMPR... | 3 | Venta | 4.00 | USD | Casa | NaN | 24095041.00 | 25 | 1950.00 |
| 123865 | Punta Del Este | 711.00 | NaN | NaN | 0 | NaN | NaN | 2023-09-28 | NaN | NaN | Buen estado | NaN | NaN | Santiago de Chile y No Especificado | NaN | Gran chalet de época en Cantegril | 410.000 | NaN | NaN | gallito | NaN | NaN | NaN | Gran chalet de excelente construcción y muy bi... | 3 | Venta | 4.00 | USD | Casa | NaN | 24263149.00 | 160 | NaN |
| 139270 | Parque Batlle | 1055.00 | NaN | NaN | 0 | NaN | NaN | 2023-10-23 | NaN | NaN | Para reciclar | NaN | NaN | Luis Alberto Herrera y Monte Caseros | NaN | VENDO SINERGIA DOS CASAS / Mansiones para edif... | 1.500.000 | NaN | NaN | gallito | NaN | NaN | NaN | SINERGIA INVERSIONES VENDE DOS CASAS PARA EMPR... | 3 | Venta | 4.00 | USD | Casa | NaN | 24095041.00 | 25 | 1950.00 |
| 153964 | Parque Batlle | 1055.00 | NaN | NaN | 0 | NaN | NaN | 2023-11-17 | NaN | NaN | Para reciclar | NaN | NaN | Luis Alberto Herrera y Monte Caseros | NaN | VENDO SINERGIA DOS CASAS / Mansiones para edif... | 1.500.000 | NaN | NaN | gallito | NaN | NaN | NaN | SINERGIA INVERSIONES VENDE DOS CASAS PARA EMPR... | 3 | Venta | 4.00 | USD | Casa | NaN | 24095041.00 | 25 | 1950.00 |
| 170500 | Parque Batlle | 1055.00 | NaN | NaN | 0 | NaN | NaN | 2023-12-18 | NaN | NaN | Para reciclar | NaN | NaN | Luis Alberto Herrera y Monte Caseros | NaN | VENDO SINERGIA DOS CASAS / Mansiones para edif... | 1.500.000 | NaN | NaN | gallito | NaN | NaN | NaN | SINERGIA INVERSIONES VENDE DOS CASAS PARA EMPR... | 3 | Venta | 4.00 | USD | Casa | NaN | 24095041.00 | 25 | 1950.00 |
# Drop the ROOMS variable (dominated by zeros, as seen above).
new_df = new_df.drop(columns=['ROOMS'])
df_sin_outliers_para_catboost = df_sin_outliers_para_catboost.drop(columns=['ROOMS'])
# Remove PRICE outliers above the IQR upper bound (Q3 + 1.5*IQR)
Q1 = new_df['PRICE'].quantile(0.25)
Q3 = new_df['PRICE'].quantile(0.75)
IQR = Q3 - Q1
upper_bound = Q3 + 1.5 * IQR
df_sin_outliers = new_df[new_df['PRICE'] <= upper_bound]
# Remove COVERED_AREA outliers: keep only values in (0, 200]
df_sin_outliers = df_sin_outliers[(df_sin_outliers['COVERED_AREA'] <= 200) & (df_sin_outliers['COVERED_AREA'] > 0)]
df_sin_outliers_para_catboost = df_sin_outliers_para_catboost[(df_sin_outliers_para_catboost['COVERED_AREA'] <= 200) & (df_sin_outliers_para_catboost['COVERED_AREA'] > 0)]
# Remove listings with the clearly spurious price of 1234
df_sin_outliers = df_sin_outliers[df_sin_outliers['PRICE'] != 1234]
df_sin_outliers_para_catboost = df_sin_outliers_para_catboost[df_sin_outliers_para_catboost['PRICE'] != 1234]
# Remove listings with COVERED_AREA equal to 1
df_sin_outliers = df_sin_outliers[df_sin_outliers['COVERED_AREA'] != 1]
df_sin_outliers_para_catboost = df_sin_outliers_para_catboost[df_sin_outliers_para_catboost['COVERED_AREA'] != 1]
# Impute COVERED_AREA values below 10 with the mean of listings having the same number of bathrooms and bedrooms
df_sin_outliers['COVERED_AREA'] = df_sin_outliers.apply(
    lambda row: df_sin_outliers[(df_sin_outliers['FULL_BATHROOMS'] == row['FULL_BATHROOMS']) & (df_sin_outliers['BEDROOMS'] == row['BEDROOMS'])]['COVERED_AREA'].mean() if row['COVERED_AREA'] < 10 else row['COVERED_AREA'],
    axis=1
)
df_sin_outliers_para_catboost['COVERED_AREA'] = df_sin_outliers_para_catboost.apply(
    lambda row: df_sin_outliers_para_catboost[(df_sin_outliers_para_catboost['FULL_BATHROOMS'] == row['FULL_BATHROOMS']) & (df_sin_outliers_para_catboost['BEDROOMS'] == row['BEDROOMS'])]['COVERED_AREA'].mean() if row['COVERED_AREA'] < 10 else row['COVERED_AREA'],
    axis=1
)
# Drop any remaining rows with COVERED_AREA equal to 1 (a safeguard; such rows were already removed above)
df_sin_outliers = df_sin_outliers[df_sin_outliers['COVERED_AREA'] != 1]
df_sin_outliers_para_catboost = df_sin_outliers_para_catboost[df_sin_outliers_para_catboost['COVERED_AREA'] != 1]
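The row-wise `apply` above works, but it re-filters the whole frame for every row. A minimal, equivalent sketch using a vectorized groupby transform (assuming the same `df_sin_outliers` frame) would be:
# Vectorized equivalent (sketch): mean of COVERED_AREA per (FULL_BATHROOMS, BEDROOMS) group,
# assigned only to the rows whose covered area is below 10
group_mean = df_sin_outliers.groupby(['FULL_BATHROOMS', 'BEDROOMS'])['COVERED_AREA'].transform('mean')
low_area = df_sin_outliers['COVERED_AREA'] < 10
df_sin_outliers.loc[low_area, 'COVERED_AREA'] = group_mean[low_area]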
numerical_columns = df_sin_outliers.select_dtypes(include=np.number).columns
for col in numerical_columns:
    print(col)
    print('Skew :', round(df_sin_outliers[col].skew(), 2))
    plt.figure(figsize=(15,4))
    plt.subplot(1,2,1)
    df_sin_outliers[col].hist(bins=10, grid=False)
    plt.ylabel('count')
    plt.subplot(1,2,2)
    sns.boxplot(x=df_sin_outliers[col])
    plt.show()
COVERED_AREA Skew : 0.18
PRICE Skew : 1.63
FULL_BATHROOMS Skew : 0.9
BEDROOMS Skew : 0.56
GARAGE Skew : 3.16
#new_df = new_df[new_df['COVERED_AREA'] < 1300.00]
df_sin_outliers.info()
<class 'pandas.core.frame.DataFrame'>
Index: 14337 entries, 0 to 177373
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   NEIGHBORHOOD          14337 non-null  object
 1   COVERED_AREA          14337 non-null  float64
 2   LISTING_TYPE_ID       14337 non-null  object
 3   WITH_VIRTUAL_TOUR     14337 non-null  object
 4   HAS_AIR_CONDITIONING  14337 non-null  object
 5   ADDRESS_STATE         14337 non-null  object
 6   CONDITION             14337 non-null  object
 7   PRICE                 14337 non-null  float64
 8   ORIGEN                14337 non-null  object
 9   FULL_BATHROOMS        14337 non-null  int64
 10  BEDROOMS              14337 non-null  float64
 11  GARAGE                14337 non-null  int64
dtypes: float64(3), int64(2), object(7)
memory usage: 1.4+ MB
df_sin_outliers.describe()
| | COVERED_AREA | PRICE | FULL_BATHROOMS | BEDROOMS | GARAGE |
|---|---|---|---|---|---|
| count | 14337.00 | 14337.00 | 14337.00 | 14337.00 | 14337.00 |
| mean | 116.90 | 207224.29 | 1.67 | 2.71 | 0.40 |
| std | 44.50 | 132914.24 | 0.75 | 0.88 | 0.72 |
| min | 10.00 | 1350.00 | 0.00 | 0.00 | 0.00 |
| 25% | 81.00 | 115000.00 | 1.00 | 2.00 | 0.00 |
| 50% | 112.00 | 170000.00 | 2.00 | 3.00 | 0.00 |
| 75% | 150.00 | 270000.00 | 2.00 | 3.00 | 1.00 |
| max | 200.00 | 970000.00 | 10.00 | 12.00 | 15.00 |
categorical_variables = df_sin_outliers.select_dtypes(include=['object'])
categorical_variables_sin_neighborhood = categorical_variables.drop(columns=['NEIGHBORHOOD'])
for col in categorical_variables_sin_neighborhood.columns:
    plt.figure(figsize=(5, 3))
    sns.countplot(data=df_sin_outliers, x=col)
    plt.title(f'Distribution of {col}')
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.show()
# Count listings per ADDRESS_STATE (note the 'No Aplica' category and the very small departments)
df_sin_outliers['ADDRESS_STATE'].value_counts()
| ADDRESS_STATE | count |
|---|---|
| Montevideo | 8189 |
| Canelones | 2574 |
| Maldonado | 2559 |
| No Aplica | 302 |
| Rocha | 191 |
| Colonia | 160 |
| San José | 160 |
| Flores | 57 |
| Lavalleja | 36 |
| Durazno | 19 |
| Treinta y Tres | 14 |
| Paysandú | 13 |
| Florida | 13 |
| Rivera | 11 |
| Salto | 10 |
| Soriano | 9 |
| Tacuarembó | 9 |
| Cerro Largo | 4 |
| Artigas | 4 |
| Río Negro | 3 |
# Keep a copy with the original (ungrouped) departments for the CatBoost models
df_sin_outliers_para_catboost = df_sin_outliers.copy()
departments_to_group = [
'No Aplica', 'Rocha', 'San José', 'Colonia', 'Flores', 'Lavalleja',
'Treinta y Tres', 'Florida', 'Paysandú', 'Durazno', 'Rivera', 'Salto',
'Soriano', 'Tacuarembó', 'Cerro Largo', 'Río Negro', 'Artigas'
]
df_sin_outliers['ADDRESS_STATE'] = df_sin_outliers['ADDRESS_STATE'].replace(departments_to_group, 'Otros')
print(df_sin_outliers['ADDRESS_STATE'].value_counts())
ADDRESS_STATE
Montevideo    8189
Canelones     2574
Maldonado     2559
Otros         1015
Name: count, dtype: int64
plt.figure(figsize=(10, 6))
sns.countplot(data=df_sin_outliers, x='ADDRESS_STATE')
plt.title('Distribution of ADDRESS_STATE')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
numerical_columns = df_sin_outliers.select_dtypes(include=np.number).columns
plt.figure(figsize=(16,12))
sns.heatmap(df_sin_outliers[numerical_columns].corr(), annot=True, fmt='0.2f', cmap="crest")
plt.show()
categorical_variables = df_sin_outliers.select_dtypes(include=['object']).columns
categorical_variables_to_plot = [col for col in categorical_variables if col != 'NEIGHBORHOOD']
for col in categorical_variables_to_plot:
    plt.figure(figsize=(10, 6))
    sns.boxplot(data=df_sin_outliers, x=col, y='PRICE')
    plt.title(f'Price Distribution by {col}')
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.show()
# Get the count of listings per neighborhood
neighborhood_counts = df_sin_outliers['NEIGHBORHOOD'].value_counts()
# Identify neighborhoods with more than 100 listings
neighborhoods_to_plot = neighborhood_counts[neighborhood_counts > 100].index
# Filter the DataFrame to include only these neighborhoods
df_filtered_neighborhoods = df_sin_outliers[df_sin_outliers['NEIGHBORHOOD'].isin(neighborhoods_to_plot)]
# Create the boxplot
plt.figure(figsize=(15, 8))
sns.boxplot(data=df_filtered_neighborhoods, x='NEIGHBORHOOD', y='PRICE')
plt.title('Price Distribution by Neighborhood (More than 100 Listings)')
plt.xticks(rotation=90, ha='right')
plt.tight_layout()
plt.show()
df_sin_outliers.info()
<class 'pandas.core.frame.DataFrame'>
Index: 14337 entries, 0 to 177373
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   NEIGHBORHOOD          14337 non-null  object
 1   COVERED_AREA          14337 non-null  float64
 2   LISTING_TYPE_ID       14337 non-null  object
 3   WITH_VIRTUAL_TOUR     14337 non-null  object
 4   HAS_AIR_CONDITIONING  14337 non-null  object
 5   ADDRESS_STATE         14337 non-null  object
 6   CONDITION             14337 non-null  object
 7   PRICE                 14337 non-null  float64
 8   ORIGEN                14337 non-null  object
 9   FULL_BATHROOMS        14337 non-null  int64
 10  BEDROOMS              14337 non-null  float64
 11  GARAGE                14337 non-null  int64
dtypes: float64(3), int64(2), object(7)
memory usage: 1.4+ MB
# Recompute the IQR upper bound on PRICE; note that the two filter expressions below are not
# assigned back, so they only display the filtered frame without modifying it.
Q1 = new_df['PRICE'].quantile(0.25)
Q3 = new_df['PRICE'].quantile(0.75)
IQR = Q3 - Q1
upper_bound = Q3 + 1.5 * IQR
new_df[new_df['PRICE'] <= upper_bound]
df_sin_outliers_para_catboost[df_sin_outliers_para_catboost['PRICE'] <= upper_bound]
| | NEIGHBORHOOD | COVERED_AREA | LISTING_TYPE_ID | WITH_VIRTUAL_TOUR | HAS_AIR_CONDITIONING | ADDRESS_STATE | CONDITION | PRICE | ORIGEN | FULL_BATHROOMS | BEDROOMS | GARAGE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | atahualpa | 151.00 | gold_premium | No | Sí | Montevideo | usado | 218000.00 | meli | 2 | 4.00 | 0 |
| 1 | malvin | 162.00 | gold_premium | No | No | Montevideo | usado | 400000.00 | meli | 2 | 3.00 | 0 |
| 2 | tres cruces | 80.00 | gold_premium | No | Sí | Montevideo | nuevo | 160000.00 | meli | 2 | 2.00 | 0 |
| 3 | malvin | 100.00 | gold_premium | No | Sí | Montevideo | usado | 250000.00 | meli | 1 | 3.00 | 0 |
| 4 | piriapolis | 28.00 | gold_premium | No | No | Maldonado | usado | 85000.00 | meli | 1 | 1.00 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 177280 | villa munoz | 106.00 | gallito | No | No | Montevideo | usado | 129000.00 | gallito | 2 | 2.00 | 0 |
| 177355 | la blanqueada | 80.00 | gallito | No | No | Montevideo | usado | 167000.00 | gallito | 1 | 2.00 | 1 |
| 177362 | parque batlle | 125.00 | gallito | No | No | Montevideo | nuevo | 290000.00 | gallito | 2 | 3.00 | 1 |
| 177366 | buceo | 90.00 | gallito | No | No | Montevideo | usado | 220000.00 | gallito | 2 | 3.00 | 1 |
| 177373 | barra del chuy | 30.00 | gallito | No | No | Rocha | usado | 50000.00 | gallito | 1 | 3.00 | 1 |
14337 rows × 12 columns
# 1. Definir variables
y_var = 'PRICE'
x_vars = ['NEIGHBORHOOD', 'COVERED_AREA', 'LISTING_TYPE_ID', 'WITH_VIRTUAL_TOUR',
'HAS_AIR_CONDITIONING', 'ADDRESS_STATE', 'CONDITION',
'FULL_BATHROOMS', 'BEDROOMS', 'GARAGE']
df = df_sin_outliers.copy()
# Eliminar origen porque contiene información muy similar a listing type
if 'ORIGEN' in df.columns:
    df = df.drop(columns=['ORIGEN'])
    df_sin_outliers_para_catboost = df_sin_outliers_para_catboost.drop(columns=['ORIGEN'])
# 2. Agrupar barrios poco frecuentes
min_obs = 100
neigh_counts = df['NEIGHBORHOOD'].value_counts()
rare_neigh = neigh_counts[neigh_counts < min_obs].index
df['NEIGHBORHOOD'] = df['NEIGHBORHOOD'].replace(rare_neigh, 'Otros')
# 3. Codificación de variables categóricas
# Seleccionar variables categóricas
categorical_cols = ['NEIGHBORHOOD', 'LISTING_TYPE_ID', 'ADDRESS_STATE', 'CONDITION',
'WITH_VIRTUAL_TOUR', 'HAS_AIR_CONDITIONING']
# Convertir a tipo 'category'
df[categorical_cols] = df[categorical_cols].astype('category')
# Crear variables dummies (drop_first=True para evitar multicolinealidad)
df_dummies = pd.get_dummies(df, columns=categorical_cols, drop_first=True, dtype=np.uint8)
# 4. Definir X e y
X = df_dummies.drop(columns=[y_var])
y = df_dummies[y_var]
# Agregar constante (intercepto)
X = sm.add_constant(X)
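As a quick illustration of what `drop_first=True` does (a toy example on a hypothetical mini-frame, not part of the dataset): one level per categorical column is dropped and becomes the implicit baseline, avoiding the perfect multicollinearity that a full dummy set would create alongside the constant.
# Toy example (hypothetical values): the alphabetically first level ('nuevo') is dropped and acts as baseline
demo = pd.DataFrame({'CONDITION': ['nuevo', 'usado', 'usado', 'sin_especificar']})
print(pd.get_dummies(demo, columns=['CONDITION'], drop_first=True, dtype=np.uint8))
# Only CONDITION_sin_especificar and CONDITION_usado remain as columns.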
# Usar X sin la constante para sklearn, asegurando que la columna 'const' se elimina si existe.
X_no_const = X.drop(columns=['const']) if 'const' in X.columns else X
# Split: 80% para entrenamiento, 20% para prueba.
X_tr, X_te, y_tr, y_te = train_test_split(X_no_const, y, test_size=0.2, random_state=42)
print("="*60)
print(" MÉTRICAS DE RENDIMIENTO DE REGRESIÓN ")
print("="*60)
## ----------------------------------------------------------------
## 1. OLS con statsmodels
## ----------------------------------------------------------------
print("\n--- REGRESIÓN OLS (Ordinary Least Squares) ---")
# Entrenamiento (statsmodels requiere añadir explícitamente la constante)
ols_model = sm.OLS(y_tr, sm.add_constant(X_tr)).fit()
print(ols_model.summary())
# Predicción en el conjunto de prueba
y_pred_ols = ols_model.predict(sm.add_constant(X_te))
# Evaluación de métricas
r2_ols = r2_score(y_te, y_pred_ols)
rmse_ols = np.sqrt(mean_squared_error(y_te, y_pred_ols))
mae_ols = mean_absolute_error(y_te, y_pred_ols)
# Note: sklearn's mean_absolute_percentage_error returns a fraction, not a percentage
# (the 0.46 shown in the output below therefore corresponds to roughly 46%).
mape_ols = mean_absolute_percentage_error(y_te, y_pred_ols)
print(f"\n[EVALUACIÓN OLS (TEST)]")
print(f" R² = {r2_ols:.4f}")
print(f" RMSE = {rmse_ols:,.2f}")
print(f" MAE = {mae_ols:,.2f}")
print(f" MAPE = {mape_ols:.2f}%")
## ----------------------------------------------------------------
## 2. Ridge y Lasso con CV (Cross-Validation)
## ----------------------------------------------------------------
alphas = np.logspace(-2, 3, 100)
print("\n\n--- REGRESIÓN RIDGE (L2) ---")
# Ridge con CV para encontrar el alpha óptimo (cv=5)
ridge = RidgeCV(alphas=alphas, cv=5).fit(X_tr, y_tr)
y_pred_ridge = ridge.predict(X_te)
# Evaluación de métricas
r2_ridge = r2_score(y_te, y_pred_ridge)
rmse_ridge = np.sqrt(mean_squared_error(y_te, y_pred_ridge))
mae_ridge = mean_absolute_error(y_te, y_pred_ridge)
mape_ridge = mean_absolute_percentage_error(y_te, y_pred_ridge)
print(f"Alpha óptimo Ridge: {ridge.alpha_:.4f}")
print(f"[EVALUACIÓN RIDGE (TEST)]")
print(f" R² = {r2_ridge:.4f}")
print(f" RMSE = {rmse_ridge:,.2f}")
print(f" MAE = {mae_ridge:,.2f}")
print(f" MAPE = {mape_ridge:.2f}")
print("\n\n--- REGRESIÓN LASSO (L1) ---")
# Lasso con CV para encontrar el alpha óptimo (cv=5)
lasso = LassoCV(alphas=None, cv=5, random_state=42, max_iter=10000).fit(X_tr, y_tr)
y_pred_lasso = lasso.predict(X_te)
# Evaluación de métricas
r2_lasso = r2_score(y_te, y_pred_lasso)
rmse_lasso = np.sqrt(mean_squared_error(y_te, y_pred_lasso))
mae_lasso = mean_absolute_error(y_te, y_pred_lasso)
mape_lasso = mean_absolute_percentage_error(y_te, y_pred_lasso)
print(f"Alpha óptimo Lasso: {lasso.alpha_:.4f}")
print(f"[EVALUACIÓN LASSO (TEST)]")
print(f" R² = {r2_lasso:.4f}")
print(f" RMSE = {rmse_lasso:,.2f}")
print(f" MAE = {mae_lasso:,.2f}")
print(f" MAPE = {mape_lasso:.2f}")
============================================================
MÉTRICAS DE RENDIMIENTO DE REGRESIÓN
============================================================
--- REGRESIÓN OLS (Ordinary Least Squares) ---
OLS Regression Results
==============================================================================
Dep. Variable: PRICE R-squared: 0.583
Model: OLS Adj. R-squared: 0.581
Method: Least Squares F-statistic: 257.5
Date: Sat, 13 Dec 2025 Prob (F-statistic): 0.00
Time: 17:23:59 Log-Likelihood: -1.4653e+05
No. Observations: 11469 AIC: 2.932e+05
Df Residuals: 11406 BIC: 2.936e+05
Df Model: 62
Covariance Type: nonrobust
=====================================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------------------
const -3028.4862 8340.867 -0.363 0.717 -1.94e+04 1.33e+04
COVERED_AREA 663.0929 22.261 29.787 0.000 619.457 706.729
FULL_BATHROOMS 5.108e+04 1387.788 36.807 0.000 4.84e+04 5.38e+04
BEDROOMS 9093.6147 1159.710 7.841 0.000 6820.384 1.14e+04
GARAGE 9940.3038 1291.841 7.695 0.000 7408.074 1.25e+04
NEIGHBORHOOD_aguada -1828.3372 7382.817 -0.248 0.804 -1.63e+04 1.26e+04
NEIGHBORHOOD_atahualpa 5.053e+04 8097.174 6.241 0.000 3.47e+04 6.64e+04
NEIGHBORHOOD_atlantida 5745.0854 8296.593 0.692 0.489 -1.05e+04 2.2e+04
NEIGHBORHOOD_barra de carrasco 1.299e+05 9633.188 13.489 0.000 1.11e+05 1.49e+05
NEIGHBORHOOD_belvedere -1.562e+04 9291.574 -1.681 0.093 -3.38e+04 2597.404
NEIGHBORHOOD_brazo oriental 2.965e+04 6291.103 4.713 0.000 1.73e+04 4.2e+04
NEIGHBORHOOD_buceo 7.061e+04 5153.186 13.702 0.000 6.05e+04 8.07e+04
NEIGHBORHOOD_carrasco 1.682e+05 6429.897 26.153 0.000 1.56e+05 1.81e+05
NEIGHBORHOOD_carrasco norte 8.257e+04 8759.186 9.427 0.000 6.54e+04 9.97e+04
NEIGHBORHOOD_centro 2.169e+04 9823.912 2.208 0.027 2436.038 4.09e+04
NEIGHBORHOOD_cerrito -1.897e+04 8099.964 -2.341 0.019 -3.48e+04 -3087.692
NEIGHBORHOOD_ciudad de la costa 4.059e+04 7702.711 5.270 0.000 2.55e+04 5.57e+04
NEIGHBORHOOD_colon -2.017e+04 8515.722 -2.369 0.018 -3.69e+04 -3480.536
NEIGHBORHOOD_cordon 3.314e+04 5908.628 5.608 0.000 2.16e+04 4.47e+04
NEIGHBORHOOD_el pinar 1.295e+04 7389.974 1.753 0.080 -1532.843 2.74e+04
NEIGHBORHOOD_goes 4280.0618 8560.345 0.500 0.617 -1.25e+04 2.11e+04
NEIGHBORHOOD_jacinto vera 3.262e+04 7527.246 4.334 0.000 1.79e+04 4.74e+04
NEIGHBORHOOD_la barra 1.017e+05 7385.789 13.769 0.000 8.72e+04 1.16e+05
NEIGHBORHOOD_la blanqueada 4.846e+04 5413.512 8.952 0.000 3.79e+04 5.91e+04
NEIGHBORHOOD_la comercial 1.162e+04 6175.500 1.882 0.060 -481.824 2.37e+04
NEIGHBORHOOD_la teja -3.625e+04 8980.782 -4.036 0.000 -5.39e+04 -1.86e+04
NEIGHBORHOOD_lagomar 2.794e+04 8655.432 3.228 0.001 1.1e+04 4.49e+04
NEIGHBORHOOD_maldonado -5.976e+04 6991.633 -8.547 0.000 -7.35e+04 -4.61e+04
NEIGHBORHOOD_malvin 8.905e+04 5049.034 17.637 0.000 7.92e+04 9.89e+04
NEIGHBORHOOD_malvin norte -7501.0150 8851.838 -0.847 0.397 -2.49e+04 9850.111
NEIGHBORHOOD_manantiales 1.578e+05 6554.204 24.084 0.000 1.45e+05 1.71e+05
NEIGHBORHOOD_maronas -3.45e+04 9741.491 -3.542 0.000 -5.36e+04 -1.54e+04
NEIGHBORHOOD_otra 1.016e+05 8243.599 12.329 0.000 8.55e+04 1.18e+05
NEIGHBORHOOD_otros 4.706e+04 9190.720 5.121 0.000 2.9e+04 6.51e+04
NEIGHBORHOOD_parque batlle 8.146e+04 6342.683 12.843 0.000 6.9e+04 9.39e+04
NEIGHBORHOOD_parque del plata -3.215e+04 9826.251 -3.272 0.001 -5.14e+04 -1.29e+04
NEIGHBORHOOD_parque miramar 1.401e+05 8940.407 15.668 0.000 1.23e+05 1.58e+05
NEIGHBORHOOD_parque rodo 9.97e+04 9037.621 11.032 0.000 8.2e+04 1.17e+05
NEIGHBORHOOD_paso molino -9803.0578 9432.849 -1.039 0.299 -2.83e+04 8686.948
NEIGHBORHOOD_pinares 1.217e+04 8237.755 1.478 0.139 -3972.427 2.83e+04
NEIGHBORHOOD_piriapolis -3.459e+04 6906.565 -5.009 0.000 -4.81e+04 -2.11e+04
NEIGHBORHOOD_pocitos 1.26e+05 5231.994 24.084 0.000 1.16e+05 1.36e+05
NEIGHBORHOOD_prado 5.452e+04 5125.258 10.637 0.000 4.45e+04 6.46e+04
NEIGHBORHOOD_punta carretas 1.868e+05 7909.026 23.613 0.000 1.71e+05 2.02e+05
NEIGHBORHOOD_punta del este 5.931e+04 5533.504 10.718 0.000 4.85e+04 7.02e+04
NEIGHBORHOOD_punta gorda 1.464e+05 7820.247 18.721 0.000 1.31e+05 1.62e+05
NEIGHBORHOOD_reducto 5068.2169 7594.332 0.667 0.505 -9817.980 2e+04
NEIGHBORHOOD_san jose de carrasco 3.067e+04 9557.320 3.210 0.001 1.19e+04 4.94e+04
NEIGHBORHOOD_sayago -1.014e+04 7288.702 -1.391 0.164 -2.44e+04 4147.379
NEIGHBORHOOD_solymar 1.711e+04 6409.490 2.670 0.008 4550.797 2.97e+04
NEIGHBORHOOD_tres cruces 2.176e+04 7602.968 2.862 0.004 6857.175 3.67e+04
NEIGHBORHOOD_union -1.379e+04 6584.103 -2.095 0.036 -2.67e+04 -888.277
LISTING_TYPE_ID_gold -1229.0474 4293.486 -0.286 0.775 -9645.018 7186.923
LISTING_TYPE_ID_gold_premium 9646.2787 2314.060 4.169 0.000 5110.322 1.42e+04
LISTING_TYPE_ID_silver 6849.3322 4109.753 1.667 0.096 -1206.490 1.49e+04
ADDRESS_STATE_Maldonado 6.283e+04 4271.726 14.708 0.000 5.45e+04 7.12e+04
ADDRESS_STATE_Montevideo -2.689e+04 3797.233 -7.081 0.000 -3.43e+04 -1.94e+04
ADDRESS_STATE_Otros -3.371e+04 4562.984 -7.388 0.000 -4.27e+04 -2.48e+04
CONDITION_nuevo -7774.7221 7879.295 -0.987 0.324 -2.32e+04 7670.052
CONDITION_sin_especificar -1.228e+04 7754.663 -1.583 0.113 -2.75e+04 2920.980
CONDITION_usado -1.601e+04 7646.553 -2.093 0.036 -3.1e+04 -1017.534
WITH_VIRTUAL_TOUR_Sí 6768.0397 8393.905 0.806 0.420 -9685.459 2.32e+04
HAS_AIR_CONDITIONING_Sí 2.314e+04 2987.733 7.746 0.000 1.73e+04 2.9e+04
==============================================================================
Omnibus: 4054.043 Durbin-Watson: 2.004
Prob(Omnibus): 0.000 Jarque-Bera (JB): 29175.930
Skew: 1.512 Prob(JB): 0.00
Kurtosis: 10.205 Cond. No. 2.44e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.44e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
[EVALUACIÓN OLS (TEST)]
R² = 0.5728
RMSE = 87,826.96
MAE = 59,003.08
MAPE = 0.46%
--- REGRESIÓN RIDGE (L2) ---
Alpha óptimo Ridge: 0.7391
[EVALUACIÓN RIDGE (TEST)]
R² = 0.5727
RMSE = 87,833.38
MAE = 59,013.75
MAPE = 0.46
--- REGRESIÓN LASSO (L1) ---
Alpha óptimo Lasso: 2746.2569
[EVALUACIÓN LASSO (TEST)]
R² = 0.3984
RMSE = 104,221.91
MAE = 73,210.82
MAPE = 0.57
# Transformar PRICE y COVERED_AREA a logaritmos
df_dummies['LOG_PRICE'] = np.log(df_dummies['PRICE'])
df_dummies['LOG_COVERED_AREA'] = np.log(df_dummies['COVERED_AREA'])
# 4. Definir X e y (versión logarítmica)
y = df_dummies['LOG_PRICE']
X = df_dummies.drop(columns=['PRICE', 'LOG_PRICE']) # quitamos precio original
# Reemplazar COVERED_AREA por su logaritmo
X = X.drop(columns=['COVERED_AREA'])
X['LOG_COVERED_AREA'] = df_dummies['LOG_COVERED_AREA']
# Agregar constante
X = sm.add_constant(X)
# =================================================================
# 5. Train/Test split y modelos
# =================================================================
import numpy as np
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error, mean_absolute_percentage_error
from sklearn.linear_model import RidgeCV, LassoCV
# =================================================================
# ASUMIMOS QUE LA VARIABLE Y FUE TRANSFORMADA CON LOGARITMO NATURAL
# Y QUE ESTAMOS USANDO LA FUNCIÓN MAPE DE SKLEARN (devuelve fracción)
# =================================================================
# Usar X sin la constante para sklearn, asegurando que la columna 'const' se elimina si existe.
X_no_const = X.drop(columns=['const']) if 'const' in X.columns else X
# Split: 80% para entrenamiento, 20% para prueba.
# Ojo: y_tr y y_te están todavía en la escala logarítmica (ln(y))
X_tr, X_te, y_tr, y_te = train_test_split(X_no_const, y, test_size=0.2, random_state=42)
# Deshacemos el logaritmo de y_te AHORA para usarlo en el cálculo de métricas
y_te_original = np.exp(y_te)
print("="*70)
print(" EVALUACIÓN DE MODELOS (ANTILOGARITMO APLICADO) ")
print("="*70)
## ----------------------------------------------------------------
## 1. OLS con statsmodels
## ----------------------------------------------------------------
print("\n--- REGRESIÓN OLS (Ordinary Least Squares) ---")
ols_model = sm.OLS(y_tr, sm.add_constant(X_tr)).fit()
print(ols_model.summary())
# Predicción en escala logarítmica
y_pred_ols_log = ols_model.predict(sm.add_constant(X_te))
# Antilogaritmo: Pasamos la predicción a la escala original
y_pred_ols_original = np.exp(y_pred_ols_log)
# Evaluación de métricas en escala original
r2_ols = r2_score(y_te_original, y_pred_ols_original)
rmse_ols = np.sqrt(mean_squared_error(y_te_original, y_pred_ols_original))
mae_ols = mean_absolute_error(y_te_original, y_pred_ols_original)
mape_ols = mean_absolute_percentage_error(y_te_original, y_pred_ols_original)
print(f"\n[EVALUACIÓN OLS (TEST - Escala Original)]")
print(f" R² = {r2_ols:.4f}")
print(f" RMSE = {rmse_ols:,.2f}")
print(f" MAE = {mae_ols:,.2f}")
print(f" MAPE = {mape_ols * 100:.2f}%") # Multiplicamos por 100 para el porcentaje real
## ----------------------------------------------------------------
## 2. Ridge y Lasso con CV
## ----------------------------------------------------------------
alphas = np.logspace(-2, 3, 100)
# --- RIDGE ---
print("\n\n--- REGRESIÓN RIDGE (L2) ---")
ridge = RidgeCV(alphas=alphas, cv=5).fit(X_tr, y_tr)
y_pred_ridge_log = ridge.predict(X_te)
y_pred_ridge_original = np.exp(y_pred_ridge_log) # Antilogaritmo
# Evaluación de métricas en escala original
r2_ridge = r2_score(y_te_original, y_pred_ridge_original)
rmse_ridge = np.sqrt(mean_squared_error(y_te_original, y_pred_ridge_original))
mae_ridge = mean_absolute_error(y_te_original, y_pred_ridge_original)
mape_ridge = mean_absolute_percentage_error(y_te_original, y_pred_ridge_original)
print(f"Alpha óptimo Ridge: {ridge.alpha_:.4f}")
print(f"[EVALUACIÓN RIDGE (TEST - Escala Original)]")
print(f" R² = {r2_ridge:.4f}")
print(f" RMSE = {rmse_ridge:,.2f}")
print(f" MAE = {mae_ridge:,.2f}")
print(f" MAPE = {mape_ridge * 100:.2f}%")
# --- LASSO ---
print("\n\n--- REGRESIÓN LASSO (L1) ---")
lasso = LassoCV(alphas=None, cv=5, random_state=42, max_iter=10000).fit(X_tr, y_tr)
y_pred_lasso_log = lasso.predict(X_te)
y_pred_lasso_original = np.exp(y_pred_lasso_log) # Antilogaritmo
# Evaluación de métricas en escala original
r2_lasso = r2_score(y_te_original, y_pred_lasso_original)
rmse_lasso = np.sqrt(mean_squared_error(y_te_original, y_pred_lasso_original))
mae_lasso = mean_absolute_error(y_te_original, y_pred_lasso_original)
mape_lasso = mean_absolute_percentage_error(y_te_original, y_pred_lasso_original)
print(f"Alpha óptimo Lasso: {lasso.alpha_:.4f}")
print(f"[EVALUACIÓN LASSO (TEST - Escala Original)]")
print(f" R² = {r2_lasso:.4f}")
print(f" RMSE = {rmse_lasso:,.2f}")
print(f" MAE = {mae_lasso:,.2f}")
print(f" MAPE = {mape_lasso * 100:.2f}%")
======================================================================
EVALUACIÓN DE MODELOS (ANTILOGARITMO APLICADO)
======================================================================
--- REGRESIÓN OLS (Ordinary Least Squares) ---
OLS Regression Results
==============================================================================
Dep. Variable: LOG_PRICE R-squared: 0.603
Model: OLS Adj. R-squared: 0.601
Method: Least Squares F-statistic: 279.6
Date: Sat, 13 Dec 2025 Prob (F-statistic): 0.00
Time: 17:24:11 Log-Likelihood: -5908.4
No. Observations: 11469 AIC: 1.194e+04
Df Residuals: 11406 BIC: 1.241e+04
Df Model: 62
Covariance Type: nonrobust
=====================================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------------------
const 9.6209 0.056 173.264 0.000 9.512 9.730
FULL_BATHROOMS 0.2182 0.007 33.525 0.000 0.205 0.231
BEDROOMS 0.0722 0.005 13.144 0.000 0.061 0.083
GARAGE 0.0515 0.006 8.439 0.000 0.040 0.063
NEIGHBORHOOD_aguada 0.0859 0.035 2.457 0.014 0.017 0.154
NEIGHBORHOOD_atahualpa 0.4126 0.038 10.763 0.000 0.337 0.488
NEIGHBORHOOD_atlantida 0.1118 0.039 2.847 0.004 0.035 0.189
NEIGHBORHOOD_barra de carrasco 0.6579 0.046 14.425 0.000 0.569 0.747
NEIGHBORHOOD_belvedere -0.1151 0.044 -2.616 0.009 -0.201 -0.029
NEIGHBORHOOD_brazo oriental 0.2806 0.030 9.422 0.000 0.222 0.339
NEIGHBORHOOD_buceo 0.5180 0.024 21.232 0.000 0.470 0.566
NEIGHBORHOOD_carrasco 0.7517 0.030 24.693 0.000 0.692 0.811
NEIGHBORHOOD_carrasco norte 0.5362 0.041 12.932 0.000 0.455 0.618
NEIGHBORHOOD_centro 0.2255 0.047 4.849 0.000 0.134 0.317
NEIGHBORHOOD_cerrito -0.1357 0.038 -3.539 0.000 -0.211 -0.061
NEIGHBORHOOD_ciudad de la costa 0.2854 0.036 7.828 0.000 0.214 0.357
NEIGHBORHOOD_colon -0.1162 0.040 -2.883 0.004 -0.195 -0.037
NEIGHBORHOOD_cordon 0.3297 0.028 11.788 0.000 0.275 0.385
NEIGHBORHOOD_el pinar 0.1236 0.035 3.532 0.000 0.055 0.192
NEIGHBORHOOD_goes 0.1385 0.041 3.417 0.001 0.059 0.218
NEIGHBORHOOD_jacinto vera 0.3000 0.036 8.419 0.000 0.230 0.370
NEIGHBORHOOD_la barra 0.3023 0.035 8.647 0.000 0.234 0.371
NEIGHBORHOOD_la blanqueada 0.4117 0.026 16.063 0.000 0.361 0.462
NEIGHBORHOOD_la comercial 0.1892 0.029 6.472 0.000 0.132 0.247
NEIGHBORHOOD_la teja -0.2754 0.043 -6.477 0.000 -0.359 -0.192
NEIGHBORHOOD_lagomar 0.3034 0.041 7.405 0.000 0.223 0.384
NEIGHBORHOOD_maldonado -0.2379 0.033 -7.186 0.000 -0.303 -0.173
NEIGHBORHOOD_malvin 0.5961 0.024 24.943 0.000 0.549 0.643
NEIGHBORHOOD_malvin norte -0.0135 0.042 -0.321 0.748 -0.096 0.069
NEIGHBORHOOD_manantiales 0.5166 0.031 16.649 0.000 0.456 0.577
NEIGHBORHOOD_maronas -0.2665 0.046 -5.779 0.000 -0.357 -0.176
NEIGHBORHOOD_otra 0.5475 0.039 14.030 0.000 0.471 0.624
NEIGHBORHOOD_otros 0.3120 0.044 7.171 0.000 0.227 0.397
NEIGHBORHOOD_parque batlle 0.5704 0.030 18.997 0.000 0.512 0.629
NEIGHBORHOOD_parque del plata -0.0947 0.047 -2.037 0.042 -0.186 -0.004
NEIGHBORHOOD_parque miramar 0.6606 0.042 15.616 0.000 0.578 0.744
NEIGHBORHOOD_parque rodo 0.6238 0.043 14.578 0.000 0.540 0.708
NEIGHBORHOOD_paso molino -0.0140 0.045 -0.313 0.754 -0.102 0.074
NEIGHBORHOOD_pinares 0.1356 0.039 3.477 0.001 0.059 0.212
NEIGHBORHOOD_piriapolis -0.0656 0.033 -2.007 0.045 -0.130 -0.002
NEIGHBORHOOD_pocitos 0.7181 0.025 28.994 0.000 0.670 0.767
NEIGHBORHOOD_prado 0.4127 0.024 17.009 0.000 0.365 0.460
NEIGHBORHOOD_punta carretas 0.8613 0.037 23.003 0.000 0.788 0.935
NEIGHBORHOOD_punta del este 0.2165 0.026 8.267 0.000 0.165 0.268
NEIGHBORHOOD_punta gorda 0.7233 0.037 19.539 0.000 0.651 0.796
NEIGHBORHOOD_reducto 0.1311 0.036 3.647 0.000 0.061 0.202
NEIGHBORHOOD_san jose de carrasco 0.2917 0.045 6.449 0.000 0.203 0.380
NEIGHBORHOOD_sayago 0.0187 0.035 0.542 0.588 -0.049 0.086
NEIGHBORHOOD_solymar 0.2313 0.030 7.621 0.000 0.172 0.291
NEIGHBORHOOD_tres cruces 0.2499 0.036 6.944 0.000 0.179 0.320
NEIGHBORHOOD_union -0.0117 0.031 -0.376 0.707 -0.073 0.049
LISTING_TYPE_ID_gold 0.0080 0.020 0.395 0.693 -0.032 0.048
LISTING_TYPE_ID_gold_premium 0.0208 0.011 1.896 0.058 -0.001 0.042
LISTING_TYPE_ID_silver 0.0020 0.019 0.105 0.916 -0.036 0.040
ADDRESS_STATE_Maldonado 0.3674 0.020 18.168 0.000 0.328 0.407
ADDRESS_STATE_Montevideo -0.1475 0.018 -8.206 0.000 -0.183 -0.112
ADDRESS_STATE_Otros -0.2261 0.022 -10.468 0.000 -0.268 -0.184
CONDITION_nuevo -0.0193 0.037 -0.518 0.605 -0.092 0.054
CONDITION_sin_especificar -0.1050 0.037 -2.860 0.004 -0.177 -0.033
CONDITION_usado -0.0895 0.036 -2.471 0.013 -0.160 -0.018
WITH_VIRTUAL_TOUR_Sí 0.0158 0.040 0.398 0.691 -0.062 0.094
HAS_AIR_CONDITIONING_Sí 0.1296 0.014 9.159 0.000 0.102 0.157
LOG_COVERED_AREA 0.3700 0.010 36.094 0.000 0.350 0.390
==============================================================================
Omnibus: 2075.402 Durbin-Watson: 2.017
Prob(Omnibus): 0.000 Jarque-Bera (JB): 14786.868
Skew: -0.682 Prob(JB): 0.00
Kurtosis: 8.393 Cond. No. 118.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[EVALUACIÓN OLS (TEST - Escala Original)]
R² = 0.5216
RMSE = 92,933.08
MAE = 57,197.59
MAPE = 39.90%
--- REGRESIÓN RIDGE (L2) ---
Alpha óptimo Ridge: 0.5857
[EVALUACIÓN RIDGE (TEST - Escala Original)]
R² = 0.5212
RMSE = 92,974.29
MAE = 57,208.72
MAPE = 39.93%
--- REGRESIÓN LASSO (L1) ---
Alpha óptimo Lasso: 0.0003
[EVALUACIÓN LASSO (TEST - Escala Original)]
R² = 0.5149
RMSE = 93,584.11
MAE = 57,403.03
MAPE = 40.22%
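A quick way to read the log-log specification above: since both the target and the covered area enter in logarithms, the LOG_COVERED_AREA coefficient (about 0.37 in the summary) is approximately an elasticity. A back-of-the-envelope check, using that coefficient as an illustrative value:
beta_area = 0.37  # LOG_COVERED_AREA coefficient taken from the OLS summary above
print(np.exp(beta_area * np.log(1.10)))  # ≈ 1.036 → a 10% larger covered area ≈ +3.6% in price, all else equal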
# === Importancia de variables (coeficientes estandarizados) ===
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
scaler = StandardScaler()
# 1) Estandarizar y reconstruir DataFrame conservando índice y columnas
X_tr_std = scaler.fit_transform(X_tr)
X_tr_std = pd.DataFrame(X_tr_std, index=X_tr.index, columns=X_tr.columns)
# 2) Reajustar OLS con variables estandarizadas
ols_std = sm.OLS(y_tr, sm.add_constant(X_tr_std)).fit()
# 3) Magnitud absoluta de coeficientes (sin la constante)
coef_abs = ols_std.params.drop('const').abs().sort_values(ascending=False)
# 4) Gráfico (top 15)
plt.figure(figsize=(8, 10))
coef_abs.head(15).iloc[::-1].plot(kind='barh')
plt.title('Variables más explicativas (coeficientes estandarizados)')
plt.xlabel('Magnitud absoluta del coeficiente')
plt.ylabel('Variable')
plt.tight_layout()
plt.show()
# coeficientes estandarizados del modelo
coef = ols_std.params.drop('const') # conserva el signo
top_n = 20 # cantidad a mostrar
order = coef.abs().sort_values(ascending=False).index[:top_n]
coef_top = coef.loc[order]
# colores según signo
colors = ['tab:green' if v > 0 else 'tab:red' for v in coef_top]
plt.figure(figsize=(9, max(6, int(top_n * 0.45))))
# invertimos para que el más importante quede arriba
coef_top.iloc[::-1].plot(kind='barh', color=colors[::-1], edgecolor='black')
plt.axvline(0, color='black', linewidth=0.8)
plt.title('Variables más explicativas (coeficientes estandarizados)\nVerde = ↑ precio, Rojo = ↓ precio')
plt.xlabel('Coeficiente estandarizado')
plt.ylabel('Variable')
plt.tight_layout()
plt.show()
Since 6 of the 10 explanatory variables are categorical, we now move to a CatBoost model, which handles categorical features natively.
X = df_sin_outliers_para_catboost[x_vars].copy()
y = df_sin_outliers_para_catboost[y_var].copy()
# Identificar categóricas
categoricas = ['NEIGHBORHOOD', 'LISTING_TYPE_ID', 'ADDRESS_STATE', 'CONDITION', 'WITH_VIRTUAL_TOUR', 'HAS_AIR_CONDITIONING']
# for i, c in enumerate(X.columns):
# print(i, c, X[c].dtype, "CATEG?" , c in categoricas)
# # en esta versión de la base de datos CONDITION tiene datos faltantes los asignamos a la categoría not_specified
# # --- Limpieza específica de la variable CONDITION ---
# # 1) Unificar categorías equivalentes
# X['CONDITION'] = X['CONDITION'].replace({
# 'Estrenar': 'Nuevo',
# 'new': 'Nuevo',
# 'Impecable': 'Impecable',
# 'Buen estado': 'Buen estado',
# 'Reciclado': 'Reciclado',
# 'Para reciclar': 'Para reciclar',
# 'En construcción': 'En construcción',
# 'En pozo': 'En pozo',
# 'not_specified': 'No informado' # opcional, podés mantenerlo separado si querés
# })
# # 2) Convertir NaN en una categoría válida (con sentido económico)
# X['CONDITION'] = X['CONDITION'].fillna('No informado')
# # 3) Asegurar el tipo correcto para CatBoost
# X['CONDITION'] = X['CONDITION'].astype('string')
# # --- Debug opcional para confirmar que quedó limpio ---
# print(X['CONDITION'].value_counts(dropna=False))
# print("Dtype:", X['CONDITION'].dtype)
# print("NaNs:", X['CONDITION'].isna().sum())
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.20, random_state=42
)
# Define model parameters
model_params = {
"iterations": 1000,
"learning_rate": 0.05,
"depth": 6,
"l2_leaf_reg": 3,
"loss_function": 'RMSE',
"eval_metric": 'MAPE',
"random_seed": 42,
"verbose": False, # Set to False for cleaner output during CV
"early_stopping_rounds": 100 # Early stopping for each fold
}
# Initialize KFold cross-validator
kf = KFold(n_splits=8, shuffle=True, random_state=42)
# Lists to store metrics for each fold
fold_rmses = []
fold_maes = []
fold_r2s = []
fold_mape = []
print("Starting CatBoost Cross-Validation...")
# Ensure categorical columns are explicitly of 'category' dtype in X for robustness
for col in categoricas:
    if col in X.columns:
        X[col] = X[col].astype('category')
# Note: the K-Fold below runs on the full X/y, so the X_train/X_test hold-out created earlier is not used in this cross-validation.
for fold, (train_index, val_index) in enumerate(kf.split(X, y)):
    X_train_fold, X_val_fold = X.iloc[train_index], X.iloc[val_index]
    y_train_fold, y_val_fold = y.iloc[train_index], y.iloc[val_index]
    fold_model = CatBoostRegressor(**model_params)
    fold_model.fit(
        X_train_fold, y_train_fold,
        cat_features=categoricas,
        eval_set=(X_val_fold, y_val_fold),
        use_best_model=True,
        early_stopping_rounds=100,
        verbose=False  # Keep verbose off for cleaner CV output
    )
    y_pred_fold = fold_model.predict(X_val_fold)
    rmse = np.sqrt(mean_squared_error(y_val_fold, y_pred_fold))
    mae = mean_absolute_error(y_val_fold, y_pred_fold)
    r2 = r2_score(y_val_fold, y_pred_fold)
    mape = mean_absolute_percentage_error(y_val_fold, y_pred_fold)
    fold_rmses.append(rmse)
    fold_maes.append(mae)
    fold_mape.append(mape)
    fold_r2s.append(r2)
    print(f"Fold {fold+1} - RMSE: {rmse:,.2f}, MAE: {mae:,.2f}, R²: {r2:.3f}, MAPE: {mape:.2%}")
print("\n--- Cross-Validation Results ---")
print(f"Average RMSE: {np.mean(fold_rmses):,.2f} \u00b1 {np.std(fold_rmses):,.2f}")
print(f"Average MAE: {np.mean(fold_maes):,.2f} \u00b1 {np.std(fold_maes):,.2f}")
print(f"Average R\u00b2: {np.mean(fold_r2s):.3f} \u00b1 {np.std(fold_r2s):.3f}")
print(f"Average MAPE: {np.mean(fold_mape):.2%}")
Starting CatBoost Cross-Validation...
Fold 1 - RMSE: 78,627.92, MAE: 49,047.67, R²: 0.653, MAPE: 34.59%
Fold 2 - RMSE: 73,774.59, MAE: 49,183.64, R²: 0.702, MAPE: 40.99%
Fold 3 - RMSE: 72,372.11, MAE: 48,771.51, R²: 0.697, MAPE: 30.88%
Fold 4 - RMSE: 67,871.67, MAE: 46,583.43, R²: 0.710, MAPE: 27.94%
Fold 5 - RMSE: 74,020.09, MAE: 49,587.11, R²: 0.684, MAPE: 31.66%
Fold 6 - RMSE: 73,365.28, MAE: 49,297.60, R²: 0.689, MAPE: 30.47%
Fold 7 - RMSE: 71,464.20, MAE: 48,136.16, R²: 0.684, MAPE: 33.89%
Fold 8 - RMSE: 88,084.09, MAE: 54,432.44, R²: 0.634, MAPE: 40.11%

--- Cross-Validation Results ---
Average RMSE: 74,947.49 ± 5,697.46
Average MAE: 49,379.94 ± 2,105.47
Average R²: 0.682 ± 0.024
Average MAPE: 33.82%
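Because the cross-validation above runs on the full dataset, the 20% hold-out created earlier is never scored. A minimal sketch of a final evaluation on that hold-out (assuming the model_params, categoricas and the X_train/X_test split defined above; the inner validation split here is only for early stopping):
X_tr_final, X_val_final, y_tr_final, y_val_final = train_test_split(X_train, y_train, test_size=0.2, random_state=42)
final_model = CatBoostRegressor(**model_params)
final_model.fit(X_tr_final, y_tr_final, cat_features=categoricas,
                eval_set=(X_val_final, y_val_final), use_best_model=True, verbose=False)
y_pred_test = final_model.predict(X_test)
print(f"Hold-out test -> R²: {r2_score(y_test, y_pred_test):.3f}, "
      f"MAE: {mean_absolute_error(y_test, y_pred_test):,.2f}, "
      f"MAPE: {mean_absolute_percentage_error(y_test, y_pred_test):.2%}")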
# def evaluar(y_true, y_pred, etiqueta='Set'):
# mae = mean_absolute_error(y_true, y_pred)
# rmse = np.sqrt(mean_squared_error(y_true, y_pred))
# r2 = r2_score(y_true, y_pred)
# print(f"{etiqueta} -> MAE: {mae:,.2f} | RMSE: {rmse:,.2f} | R²: {r2:.3f} | MAPE: {mean_absolute_percentage_error(y_true, y_pred):.2%}")
# return mae, rmse, r2
# print("\nResultados:")
# _ = evaluar(y_valid, fold_model.predict(X_valid), etiqueta='Validación')
# _ = evaluar(y_test, fold_model.predict(X_test), etiqueta='Test')
importancias = fold_model.get_feature_importance(prettified=True)
# CatBoost's prettified output already pairs each importance with its feature name ('Feature Id');
# below the same ranking is rebuilt manually from X.columns.
imp_df = pd.DataFrame({
'Variable': X.columns,
'Importancia': fold_model.get_feature_importance()
}).sort_values('Importancia', ascending=False)
print("\nTop 10 variables por importancia:")
print(imp_df.head(10))
# Gráfico
plt.figure(figsize=(8, 5))
plt.barh(imp_df['Variable'][:15][::-1], imp_df['Importancia'][:15][::-1])
plt.title('Importancia de variables (CatBoost)')
plt.xlabel('Importancia')
plt.ylabel('Variable')
plt.tight_layout()
plt.show()
Top 10 variables por importancia:
Variable Importancia
0 NEIGHBORHOOD 36.15
1 COVERED_AREA 18.01
7 FULL_BATHROOMS 17.16
5 ADDRESS_STATE 9.78
8 BEDROOMS 6.73
6 CONDITION 4.13
2 LISTING_TYPE_ID 3.85
9 GARAGE 3.18
4 HAS_AIR_CONDITIONING 0.92
3 WITH_VIRTUAL_TOUR 0.08
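Alternatively, the `importancias` frame computed above can be used directly instead of rebuilding the ranking by hand. A short sketch (the column names 'Feature Id' and 'Importances' are those returned by CatBoost's prettified output in recent versions):
imp_sorted = importancias.sort_values('Importances', ascending=False)
print(imp_sorted.head(10))
plt.figure(figsize=(8, 5))
plt.barh(imp_sorted['Feature Id'][:15][::-1], imp_sorted['Importances'][:15][::-1])
plt.title('Importancia de variables (CatBoost, prettified)')
plt.tight_layout()
plt.show()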
# Optuna hyper-parameter search, cross-validating only within the training split.
kf = KFold(n_splits=5, shuffle=True, random_state=42)
def objective(trial):
    """
    Objective function for CatBoost hyper-parameter optimization with Optuna,
    now evaluated with K-Fold cross-validation.
    """
    params = {
        "iterations": trial.suggest_int("iterations", 600, 2000),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.15, log=True),
        "depth": trial.suggest_int("depth", 4, 10),
        "l2_leaf_reg": trial.suggest_float("l2_leaf_reg", 1.0, 10.0, log=True),
        "bagging_temperature": trial.suggest_float("bagging_temperature", 0.0, 1.0),
        "subsample": trial.suggest_float("subsample", 0.6, 1.0),
        "min_data_in_leaf": trial.suggest_int("min_data_in_leaf", 1, 64),
        "loss_function": "RMSE",
        "eval_metric": "MAPE",  # metric CatBoost uses for early stopping
        "random_seed": 42,
        "verbose": False,
        "allow_writing_files": False
    }
    # MAPE of each fold
    mape_folds = []
    # Optuna pruning callback. Note: it is created once per trial and shared by all folds;
    # when Optuna decides to prune it only halts the CatBoost fit (check_pruned() is never called),
    # so the trial is still scored with an undertrained model. This may explain the sub-second
    # trials with very high MAPE in the log below; see the sketch after the log.
    pruning_cb = CatBoostPruningCallback(trial, "MAPE")
    # Iterate over the folds (inner-train / inner-validation indices)
    for fold, (train_index, valid_index) in enumerate(kf.split(X_train, y_train)):
        X_train_fold, X_valid_fold = X_train.iloc[train_index], X_train.iloc[valid_index]
        y_train_fold, y_valid_fold = y_train.iloc[train_index], y_train.iloc[valid_index]
        # Re-instantiate the model for each fold with the same parameters
        model = CatBoostRegressor(**params)
        model.fit(
            X_train_fold, y_train_fold,
            cat_features=categoricas,
            eval_set=(X_valid_fold, y_valid_fold),
            use_best_model=True,
            early_stopping_rounds=200,
            callbacks=[pruning_cb]
        )
        # Fold MAPE: use CatBoost's best score on the eval set when available
        best_scores = model.get_best_score()
        key = "validation" if "validation" in best_scores else ("eval" if "eval" in best_scores else None)
        if key is not None and "MAPE" in best_scores[key]:
            mape_folds.append(float(best_scores[key]["MAPE"]))
        else:
            # Fallback: compute MAPE manually
            y_valid_pred = model.predict(X_valid_fold)
            mape_folds.append(float(mean_absolute_percentage_error(y_valid_fold, y_valid_pred)))
    # Optuna objective: return the average MAPE across folds
    avg_mape = np.mean(mape_folds)
    return avg_mape
# --- El resto del código de Optuna permanece igual ---
# ====== Crear y correr el estudio de Optuna ======
study = optuna.create_study(
direction="minimize",
sampler=TPESampler(seed=42),
pruner=MedianPruner(n_warmup_steps=10)
)
print("Iniciando optimización de hiperparámetros con K-Fold...")
study.optimize(objective, n_trials=30, n_jobs=1, show_progress_bar=True)
# Imprimir el mejor resultado (MAPE)
print("\n=== Mejor Resultado de Optuna ===")
print(f"Mejor MAPE Promedio (K-Fold): {study.best_value:.2%}")
print(f"Mejores Parámetros: {study.best_params}")
[I 2025-12-13 17:26:19,497] A new study created in memory with name: no-name-1a75f907-b39e-4d83-bc43-f6662093a410
Iniciando optimización de hiperparámetros con K-Fold...
[I 2025-12-13 17:28:22,915] Trial 0 finished with value: 0.3200905560844422 and parameters: {'iterations': 1124, 'learning_rate': 0.13125830316209655, 'depth': 9, 'l2_leaf_reg': 3.968793330444372, 'bagging_temperature': 0.15601864044243652, 'subsample': 0.662397808134481, 'min_data_in_leaf': 4}. Best is trial 0 with value: 0.3200905560844422.
[I 2025-12-13 17:31:15,131] Trial 1 finished with value: 0.3222038195777087 and parameters: {'iterations': 1813, 'learning_rate': 0.05092911283433821, 'depth': 8, 'l2_leaf_reg': 1.0485387725194617, 'bagging_temperature': 0.9699098521619943, 'subsample': 0.9329770563201687, 'min_data_in_leaf': 14}. Best is trial 0 with value: 0.3200905560844422.
[I 2025-12-13 17:32:10,871] Trial 2 finished with value: 0.34438686306398847 and parameters: {'iterations': 854, 'learning_rate': 0.016432378919707624, 'depth': 6, 'l2_leaf_reg': 3.347776308515933, 'bagging_temperature': 0.43194501864211576, 'subsample': 0.7164916560792167, 'min_data_in_leaf': 40}. Best is trial 0 with value: 0.3200905560844422.
[I 2025-12-13 17:33:03,539] Trial 3 finished with value: 0.3411325462433037 and parameters: {'iterations': 795, 'learning_rate': 0.022059149678071027, 'depth': 6, 'l2_leaf_reg': 2.858051065806936, 'bagging_temperature': 0.7851759613930136, 'subsample': 0.6798695128633439, 'min_data_in_leaf': 33}. Best is trial 0 with value: 0.3200905560844422.
[I 2025-12-13 17:35:43,994] Trial 4 finished with value: 0.33396754964391695 and parameters: {'iterations': 1429, 'learning_rate': 0.011340440501807348, 'depth': 8, 'l2_leaf_reg': 1.4808945119975185, 'bagging_temperature': 0.06505159298527952, 'subsample': 0.9795542149013333, 'min_data_in_leaf': 62}. Best is trial 0 with value: 0.3200905560844422.
[I 2025-12-13 17:35:44,317] Trial 5 finished with value: 0.7787625950740062 and parameters: {'iterations': 1732, 'learning_rate': 0.022816739880816207, 'depth': 4, 'l2_leaf_reg': 4.833180632488465, 'bagging_temperature': 0.4401524937396013, 'subsample': 0.6488152939379115, 'min_data_in_leaf': 32}. Best is trial 0 with value: 0.3200905560844422.
[I 2025-12-13 17:36:21,731] Trial 6 finished with value: 0.3314184138787878 and parameters: {'iterations': 648, 'learning_rate': 0.1173393765991262, 'depth': 5, 'l2_leaf_reg': 4.597505784732165, 'bagging_temperature': 0.31171107608941095, 'subsample': 0.8080272084711243, 'min_data_in_leaf': 35}. Best is trial 0 with value: 0.3200905560844422.
[I 2025-12-13 17:36:46,319] Trial 7 finished with value: 0.6647583437063739 and parameters: {'iterations': 858, 'learning_rate': 0.13814017620274424, 'depth': 9, 'l2_leaf_reg': 8.699593128513321, 'bagging_temperature': 0.8948273504276488, 'subsample': 0.8391599915244341, 'min_data_in_leaf': 59}. Best is trial 0 with value: 0.3200905560844422.
[I 2025-12-13 17:36:47,008] Trial 8 finished with value: 0.7837394562358646 and parameters: {'iterations': 723, 'learning_rate': 0.017001754132211097, 'depth': 4, 'l2_leaf_reg': 2.1150972021685583, 'bagging_temperature': 0.388677289689482, 'subsample': 0.7085396127095583, 'min_data_in_leaf': 54}. Best is trial 0 with value: 0.3200905560844422.
[I 2025-12-13 17:36:47,600] Trial 9 finished with value: 0.7759386163983182 and parameters: {'iterations': 1099, 'learning_rate': 0.02139954903017622, 'depth': 7, 'l2_leaf_reg': 1.383324997521996, 'bagging_temperature': 0.8021969807540397, 'subsample': 0.6298202574719083, 'min_data_in_leaf': 64}. Best is trial 0 with value: 0.3200905560844422.
[I 2025-12-13 17:37:19,355] Trial 10 finished with value: 0.6880157547348174 and parameters: {'iterations': 1319, 'learning_rate': 0.06768269073143272, 'depth': 10, 'l2_leaf_reg': 8.556385477088206, 'bagging_temperature': 0.015367737328104342, 'subsample': 0.7592173368880707, 'min_data_in_leaf': 1}. Best is trial 0 with value: 0.3200905560844422.
[I 2025-12-13 17:37:19,865] Trial 11 finished with value: 0.736369191752449 and parameters: {'iterations': 1950, 'learning_rate': 0.054610024983365375, 'depth': 9, 'l2_leaf_reg': 1.036910215048229, 'bagging_temperature': 0.6174544729930211, 'subsample': 0.9193086824158574, 'min_data_in_leaf': 4}. Best is trial 0 with value: 0.3200905560844422.
[I 2025-12-13 17:37:34,208] Trial 12 finished with value: 0.6831143551940407 and parameters: {'iterations': 1653, 'learning_rate': 0.08281655519018158, 'depth': 8, 'l2_leaf_reg': 5.296674788500096, 'bagging_temperature': 0.20216151916664463, 'subsample': 0.8715792259594299, 'min_data_in_leaf': 14}. Best is trial 0 with value: 0.3200905560844422.
[I 2025-12-13 17:37:34,806] Trial 13 finished with value: 0.7534264446185135 and parameters: {'iterations': 1110, 'learning_rate': 0.0390450080691491, 'depth': 10, 'l2_leaf_reg': 2.3874310439811692, 'bagging_temperature': 0.9904360623477237, 'subsample': 0.9575376190082351, 'min_data_in_leaf': 15}. Best is trial 0 with value: 0.3200905560844422.
[I 2025-12-13 17:37:35,316] Trial 14 finished with value: 0.7546526535936451 and parameters: {'iterations': 1531, 'learning_rate': 0.040332226977607015, 'depth': 8, 'l2_leaf_reg': 3.871467818257122, 'bagging_temperature': 0.6191678774040926, 'subsample': 0.8902937844021914, 'min_data_in_leaf': 16}. Best is trial 0 with value: 0.3200905560844422.
[I 2025-12-13 17:37:55,332] Trial 15 finished with value: 0.6853564998929738 and parameters: {'iterations': 1998, 'learning_rate': 0.08559149542508514, 'depth': 9, 'l2_leaf_reg': 6.427854453713886, 'bagging_temperature': 0.19592898855184532, 'subsample': 0.7680228521304249, 'min_data_in_leaf': 24}. Best is trial 0 with value: 0.3200905560844422.
[I 2025-12-13 17:37:56,152] Trial 16 finished with value: 0.7401823714615796 and parameters: {'iterations': 1098, 'learning_rate': 0.054853609404047106, 'depth': 7, 'l2_leaf_reg': 1.4360242559303533, 'bagging_temperature': 0.5895985018457592, 'subsample': 0.601402329355137, 'min_data_in_leaf': 9}. Best is trial 0 with value: 0.3200905560844422.
[I 2025-12-13 17:38:19,156] Trial 17 finished with value: 0.6754179034295401 and parameters: {'iterations': 1756, 'learning_rate': 0.10593051997205254, 'depth': 9, 'l2_leaf_reg': 2.002492183845991, 'bagging_temperature': 0.21147450087095934, 'subsample': 0.9217095739850046, 'min_data_in_leaf': 25}. Best is trial 0 with value: 0.3200905560844422.
[I 2025-12-13 17:38:19,611] Trial 18 finished with value: 0.7621310124014122 and parameters: {'iterations': 1243, 'learning_rate': 0.03252002518412039, 'depth': 8, 'l2_leaf_reg': 1.0507541827528841, 'bagging_temperature': 0.7343107569994993, 'subsample': 0.8238463572387135, 'min_data_in_leaf': 7}. Best is trial 0 with value: 0.3200905560844422.
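The study log above appears truncated, and the pattern of sub-second trials with MAPE around 0.7 is consistent with the pruning caveat noted in the objective: when Optuna prunes, the callback only stops the CatBoost fit, and the trial is then scored with an undertrained model. A minimal sketch of an alternative K-Fold body, assuming the same X_train, y_train, kf, categoricas names used above and a params dict suggested from the trial, that creates one callback per fit and raises TrialPruned explicitly:
def kfold_mape_with_pruning(trial, params):
    # params: hyper-parameter dict built from `trial`, as in the objective above
    mape_folds = []
    for tr_idx, va_idx in kf.split(X_train, y_train):
        pruning_cb = CatBoostPruningCallback(trial, "MAPE")  # one callback per fit
        model = CatBoostRegressor(**params)
        model.fit(
            X_train.iloc[tr_idx], y_train.iloc[tr_idx],
            cat_features=categoricas,
            eval_set=(X_train.iloc[va_idx], y_train.iloc[va_idx]),
            use_best_model=True,
            early_stopping_rounds=200,
            callbacks=[pruning_cb]
        )
        pruning_cb.check_pruned()  # raise optuna.TrialPruned if the pruner cut this fit short
        y_pred = model.predict(X_train.iloc[va_idx])
        mape_folds.append(mean_absolute_percentage_error(y_train.iloc[va_idx], y_pred))
    return np.mean(mape_folds)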
# ==========================
# 1. Filtrar Maldonado
# ==========================
df_filtrado = df_sin_outliers_para_catboost[
df_sin_outliers_para_catboost["ADDRESS_STATE"] == "Maldonado"
].copy()
X = df_filtrado[x_vars].copy()
y = df_filtrado[y_var].copy()
# ==========================
# 3. Log-transformaciones
# ==========================
# Objetivo: log(PRICE)
y = np.log1p(y)
# Feature: log(COVERED_AREA) si existe
if "COVERED_AREA" in X.columns:
X["COVERED_AREA"] = np.log1p(X["COVERED_AREA"])
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.20, random_state=42
)
# ==========================
# 7. K-Fold para Optuna
# ==========================
kf = KFold(n_splits=5, shuffle=True, random_state=42)
def objective(trial):
    """
    Optimize CatBoost hyper-parameters training on log(PRICE),
    while minimizing MAPE on the actual price scale.
    """
    params = {
        "iterations": trial.suggest_int("iterations", 600, 2000),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.15, log=True),
        "depth": trial.suggest_int("depth", 4, 10),
        "l2_leaf_reg": trial.suggest_float("l2_leaf_reg", 1.0, 10.0, log=True),
        "bagging_temperature": trial.suggest_float("bagging_temperature", 0.0, 1.0),
        "subsample": trial.suggest_float("subsample", 0.6, 1.0),
        "min_data_in_leaf": trial.suggest_int("min_data_in_leaf", 1, 64),
        "loss_function": "RMSE",  # RMSE on log(PRICE)
        "eval_metric": "RMSE",    # used for early stopping / pruning
        "random_seed": 42,
        "verbose": False,
        "allow_writing_files": False
    }
    mape_folds = []
    # Pruning callback based on RMSE (in log space)
    pruning_cb = CatBoostPruningCallback(trial, "RMSE")
    for fold, (train_index, valid_index) in enumerate(kf.split(X_train, y_train)):
        X_train_fold, X_valid_fold = X_train.iloc[train_index], X_train.iloc[valid_index]
        y_train_fold, y_valid_fold = y_train.iloc[train_index], y_train.iloc[valid_index]  # log(PRICE)
        model = CatBoostRegressor(**params)
        model.fit(
            X_train_fold, y_train_fold,
            cat_features=categoricas,
            eval_set=(X_valid_fold, y_valid_fold),
            use_best_model=True,
            early_stopping_rounds=200,
            callbacks=[pruning_cb]
        )
        # Predictions in log-price space
        y_valid_pred_log = model.predict(X_valid_fold)
        # Back to the actual price scale
        y_valid_price = np.expm1(y_valid_fold)
        y_valid_pred_price = np.expm1(y_valid_pred_log)
        # MAPE on the actual price
        mape_fold = mean_absolute_percentage_error(y_valid_price, y_valid_pred_price)
        mape_folds.append(float(mape_fold))
    return np.mean(mape_folds)
# ==========================
# 8. Ejecutar Optuna
# ==========================
study = optuna.create_study(
direction="minimize",
sampler=TPESampler(seed=42),
pruner=MedianPruner(n_warmup_steps=10)
)
print("Iniciando optimización de hiperparámetros con K-Fold...")
study.optimize(objective, n_trials=30, n_jobs=1, show_progress_bar=True)
print("\n=== Mejor Resultado de Optuna ===")
print(f"Mejor MAPE (K-Fold, PRECIO REAL): {study.best_value:.2%}")
print(f"Mejores Parámetros: {study.best_params}")
# ==========================
# 1. Filter Montevideo
# ==========================
df_filtrado = df_sin_outliers_para_catboost[
    df_sin_outliers_para_catboost["ADDRESS_STATE"] == "Montevideo"
].copy()

# ==========================
# 2. Build X and y (price in levels for now)
# ==========================
X = df_filtrado[x_vars].copy()
y = df_filtrado[y_var].copy()  # PRICE in levels

# ==========================
# 3. Log transformations
# ==========================
# Target: log(PRICE)
y = np.log1p(y)  # log(1 + PRICE)

# Feature: log(COVERED_AREA), if present
if "COVERED_AREA" in X.columns:
    X["COVERED_AREA"] = np.log1p(X["COVERED_AREA"])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)

# ==========================
# 4. K-Fold for Optuna
# ==========================
kf = KFold(n_splits=5, shuffle=True, random_state=42)

def objective(trial):
    """
    Tune CatBoost hyperparameters by training on log(PRICE),
    while minimizing MAPE on the REAL PRICE.
    """
    params = {
        "iterations": trial.suggest_int("iterations", 600, 2000),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.15, log=True),
        "depth": trial.suggest_int("depth", 4, 10),
        "l2_leaf_reg": trial.suggest_float("l2_leaf_reg", 1.0, 10.0, log=True),
        "bagging_temperature": trial.suggest_float("bagging_temperature", 0.0, 1.0),
        "subsample": trial.suggest_float("subsample", 0.6, 1.0),
        "min_data_in_leaf": trial.suggest_int("min_data_in_leaf", 1, 64),
        "loss_function": "RMSE",   # RMSE on log(PRICE)
        "eval_metric": "RMSE",     # RMSE drives early stopping/pruning
        "random_seed": 42,
        "verbose": False,
        "allow_writing_files": False
    }

    mape_folds = []

    for fold, (train_index, valid_index) in enumerate(kf.split(X_train, y_train)):
        X_train_fold, X_valid_fold = X_train.iloc[train_index], X_train.iloc[valid_index]
        y_train_fold, y_valid_fold = y_train.iloc[train_index], y_train.iloc[valid_index]  # log(PRICE)

        # Fresh pruning callback for each fold, based on RMSE in log space
        pruning_cb = CatBoostPruningCallback(trial, "RMSE")

        model = CatBoostRegressor(**params)
        model.fit(
            X_train_fold, y_train_fold,
            cat_features=categoricas,
            eval_set=(X_valid_fold, y_valid_fold),
            use_best_model=True,
            early_stopping_rounds=200,
            callbacks=[pruning_cb]
        )

        # Predict in LOG PRICE
        y_valid_pred_log = model.predict(X_valid_fold)

        # Back to REAL PRICE
        y_valid_price = np.expm1(y_valid_fold)
        y_valid_pred_price = np.expm1(y_valid_pred_log)

        # MAPE on REAL PRICE
        mape_fold = mean_absolute_percentage_error(y_valid_price, y_valid_pred_price)
        mape_folds.append(float(mape_fold))

    return np.mean(mape_folds)

# ==========================
# 5. Run Optuna
# ==========================
study = optuna.create_study(
    direction="minimize",
    sampler=TPESampler(seed=42),
    pruner=MedianPruner(n_warmup_steps=10)
)

print("Starting hyperparameter optimization with K-Fold...")
study.optimize(objective, n_trials=30, n_jobs=1, show_progress_bar=True)

print("\n=== Best Optuna Result ===")
print(f"Best MAPE (K-Fold, REAL PRICE): {study.best_value:.2%}")
print(f"Best Parameters: {study.best_params}")
# ==========================
# 1. Filter Canelones
# ==========================
df_filtrado = df_sin_outliers_para_catboost[
    df_sin_outliers_para_catboost["ADDRESS_STATE"] == "Canelones"
].copy()

# ==========================
# 2. Build X and y (price in levels for now)
# ==========================
X = df_filtrado[x_vars].copy()
y = df_filtrado[y_var].copy()  # PRICE in levels

# ==========================
# 3. Log transformations
# ==========================
# Target: log(PRICE)
y = np.log1p(y)  # log(1 + PRICE)

# Feature: log(COVERED_AREA), if present
if "COVERED_AREA" in X.columns:
    X["COVERED_AREA"] = np.log1p(X["COVERED_AREA"])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)

# ==========================
# 4. K-Fold for Optuna
# ==========================
kf = KFold(n_splits=5, shuffle=True, random_state=42)

def objective(trial):
    """
    Tune CatBoost hyperparameters by training on log(PRICE),
    while minimizing MAPE on the REAL PRICE.
    """
    params = {
        "iterations": trial.suggest_int("iterations", 600, 2000),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.15, log=True),
        "depth": trial.suggest_int("depth", 4, 10),
        "l2_leaf_reg": trial.suggest_float("l2_leaf_reg", 1.0, 10.0, log=True),
        "bagging_temperature": trial.suggest_float("bagging_temperature", 0.0, 1.0),
        "subsample": trial.suggest_float("subsample", 0.6, 1.0),
        "min_data_in_leaf": trial.suggest_int("min_data_in_leaf", 1, 64),
        "loss_function": "RMSE",   # RMSE on log(PRICE)
        "eval_metric": "RMSE",     # RMSE drives early stopping/pruning
        "random_seed": 42,
        "verbose": False,
        "allow_writing_files": False
    }

    mape_folds = []

    for fold, (train_index, valid_index) in enumerate(kf.split(X_train, y_train)):
        X_train_fold, X_valid_fold = X_train.iloc[train_index], X_train.iloc[valid_index]
        y_train_fold, y_valid_fold = y_train.iloc[train_index], y_train.iloc[valid_index]  # log(PRICE)

        # Fresh pruning callback for each fold, based on RMSE in log space
        pruning_cb = CatBoostPruningCallback(trial, "RMSE")

        model = CatBoostRegressor(**params)
        model.fit(
            X_train_fold, y_train_fold,
            cat_features=categoricas,
            eval_set=(X_valid_fold, y_valid_fold),
            use_best_model=True,
            early_stopping_rounds=200,
            callbacks=[pruning_cb]
        )

        # Predict in LOG PRICE
        y_valid_pred_log = model.predict(X_valid_fold)

        # Back to REAL PRICE
        y_valid_price = np.expm1(y_valid_fold)
        y_valid_pred_price = np.expm1(y_valid_pred_log)

        # MAPE on REAL PRICE
        mape_fold = mean_absolute_percentage_error(y_valid_price, y_valid_pred_price)
        mape_folds.append(float(mape_fold))

    return np.mean(mape_folds)

# ==========================
# 5. Run Optuna
# ==========================
study = optuna.create_study(
    direction="minimize",
    sampler=TPESampler(seed=42),
    pruner=MedianPruner(n_warmup_steps=10)
)

print("Starting hyperparameter optimization with K-Fold...")
study.optimize(objective, n_trials=30, n_jobs=1, show_progress_bar=True)

print("\n=== Best Optuna Result ===")
print(f"Best MAPE (K-Fold, REAL PRICE): {study.best_value:.2%}")
print(f"Best Parameters: {study.best_params}")
The best model is the Optuna-tuned CatBoost, which achieves the lowest average error.
With an average price of around USD 200,000, a 25% MAPE implies a typical error of roughly ± USD 50,000 per home in Montevideo.
This makes it a usable tool for estimating property values, although the remaining error is still substantial when pricing an individual listing.
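As a quick check of that back-of-the-envelope figure (a minimal sketch; the USD 200,000 average price is taken from the sentence above, not recomputed from the data):

avg_price_usd = 200_000   # approximate average price quoted above (assumption)
best_mape_mvd = 0.25      # best MAPE obtained for Montevideo
typical_error_usd = best_mape_mvd * avg_price_usd
print(f"Typical absolute error: ± USD {typical_error_usd:,.0f}")  # ± USD 50,000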
These were the final results:

| Model | MAPE |
|---|---|
| Linear Regression (LR) | 39.9% |
| CatBoost | 33.8% |
| CatBoost + Optuna | 32.0% |
| CatBoost + Optuna (Montevideo) | 25.0% |
| CatBoost + Optuna (Maldonado) | 37.0% |