Decision Trees


Decision trees are a non-parametric supervised learning method used for classification and regression. The model is relatively insensitive to missing values (NA) and to outliers.

Decision trees learn from the data to approximate the target function with a set of if/else decision rules; see the sketch after the list below.

A decision tree has the following elements:

  • root node
  • branches
  • decision nodes
  • leaf nodes
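
The promised if/else sketch: a minimal, hand-written version of how a fitted tree classifies one iris flower. The thresholds mirror the splits that the tree fitted later in this notebook happens to learn; they are illustrative, not a general rule.

In [0]:
# hand-written decision rules for one iris flower (illustrative thresholds)
def classify_iris(petal_length, petal_width):
    if petal_length <= 2.6:          # root node: the first test
        return "setosa"              # leaf node: a final answer
    else:                            # branch leading to a decision node
        if petal_width <= 1.75:      # decision node: a further test
            return "versicolor"      # leaf node
        else:
            return "virginica"       # leaf node

classify_iris(1.4, 0.2)  # -> 'setosa'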

Classification Trees


The Iris data set

In [0]:
# Loading libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
In [0]:
# reading the data set from the unalytics repository
data = pd.read_csv("https://raw.githubusercontent.com/unalyticsteam/databases/master/iris.csv")
In [29]:
data.head(3)
Out[29]:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa

Preparing the data


The first step is to identify which variables will serve as predictors and which one is the target variable.

In [0]:
# putting the variable names into a list
colnames = data.columns.values.tolist()
In [0]:
# identifying the predictors and the response variable
predictors = colnames[:4]
target = colnames[4]

Let's take a random sample of approximately 75% of the observations as training data and use the remaining 25% as test data.

In [0]:
# creating a column that flags which observations will be used for training (the rest become test data)
data["is_train"] = np.random.uniform(low=0, high=1, size= len(data)) <= 0.75
In [37]:
data.head(5)
Out[37]:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species is_train
0 5.1 3.5 1.4 0.2 setosa True
1 4.9 3.0 1.4 0.2 setosa True
2 4.7 3.2 1.3 0.2 setosa True
3 4.6 3.1 1.5 0.2 setosa False
4 5.0 3.6 1.4 0.2 setosa True
In [0]:
# splitting into training and test sets
train, test = data[data["is_train"]], data[~data["is_train"]]
In [40]:
# look at the first few rows
train.head(3)
Out[40]:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species is_train
0 5.1 3.5 1.4 0.2 setosa True
1 4.9 3.0 1.4 0.2 setosa True
2 4.7 3.2 1.3 0.2 setosa True
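
As an aside, scikit-learn ships a helper that produces the same kind of split in one line; a minimal sketch (the exact rows selected will differ from the mask above, and the random_state value here is arbitrary):

In [0]:
# alternative: scikit-learn's built-in helper (not used below)
from sklearn.model_selection import train_test_split
train_alt, test_alt = train_test_split(data, test_size=0.25, random_state=99)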
In [0]:
# Importing libraries
# sklearn: a free-software machine learning library (it includes several classification, regression, and clustering algorithms)
from sklearn.tree import DecisionTreeClassifier
In [42]:
# building the tree.
tree = DecisionTreeClassifier(criterion="entropy", min_samples_split=20, random_state= 99)
tree.fit(train[predictors], train[target])
Out[42]:
DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=20,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=99, splitter='best')

Prediction

In [0]:
# predictions on the test set
preds = tree.predict(test[predictors])
In [45]:
# Confusion matrix.
pd.crosstab(test[target], preds, rownames=["actual"], colnames=["predicted"])
Out[45]:
predicted setosa versicolor virginica
actual
setosa 12 0 0
versicolor 0 12 0
virginica 0 3 9
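
To condense the confusion matrix into a single number, the test-set accuracy can be computed with sklearn.metrics (an optional check; with the counts above it works out to 33/36 ≈ 0.92):

In [0]:
# overall accuracy on the test set
from sklearn.metrics import accuracy_score
accuracy_score(test[target], preds)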

Visualization

In [46]:
# Installation in Colab.
! pip install graphviz
Requirement already satisfied: graphviz in /usr/local/lib/python3.6/dist-packages (0.10.1)
In [0]:
# importing the visualization export function.
from sklearn.tree import export_graphviz
In [0]:
# export the tree as a .dot (graph) file; the with-block closes it automatically
with open("dtree.dot", "w") as dotfile:
    export_graphviz(tree, out_file=dotfile, feature_names=predictors)
In [50]:
# reading the .dot file and rendering the graph.
from graphviz import Source
file = open("dtree.dot", "r")
text = file.read()
Source(text)
Out[50]:
[Rendered classification tree: the root splits on Petal.Length <= 2.6 (entropy = 1.573, samples = 114, value = [38, 32, 44]); the left child is a pure setosa leaf, and the right subtree splits on Petal.Width <= 1.75 and then on Petal.Length and Sepal.Length to separate versicolor from virginica.]
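
As a sanity check, the entropy reported at the root node can be reproduced from its class counts [38, 32, 44]:

In [0]:
# entropy H = -sum(p_i * log2(p_i)) over the class proportions
counts = np.array([38, 32, 44])
p = counts / counts.sum()
-(p * np.log2(p)).sum()  # ~1.573, the value shown at the root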

Cross-validation


In [0]:
X = data[predictors]
Y = data[target]
In [62]:
tree = DecisionTreeClassifier(criterion="entropy", max_depth= 5,
                              min_samples_split=20, random_state=99)
tree.fit(X, Y)
Out[62]:
DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=5,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=20,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=99, splitter='best')
In [0]:
from sklearn.model_selection import KFold
In [68]:
# one fold per observation, i.e. leave-one-out cross-validation
cv = KFold(X.shape[0], shuffle=True, random_state=1)
In [0]:
from sklearn.model_selection import  cross_val_score
In [77]:
scores = cross_val_score(tree, X, Y, scoring="accuracy", cv = cv, n_jobs=1)
scores
Out[77]:
array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 0., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 0., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 0., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 0., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       0., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 1., 1., 1., 1.,
       0., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
In [78]:
score = np.mean(scores)
score
Out[78]:
0.9466666666666667
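
Since the number of folds above equals the number of observations, this is leave-one-out cross-validation; scikit-learn also provides it directly. An equivalent formulation:

In [0]:
# equivalent leave-one-out formulation
from sklearn.model_selection import LeaveOneOut
loo_scores = cross_val_score(tree, X, Y, scoring="accuracy", cv=LeaveOneOut(), n_jobs=1)
np.mean(loo_scores)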

Regression Trees


In [0]:
# Importing libraries
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

Reading the Boston housing data (1978) from the unalytics repository.

In [0]:
data = pd.read_csv("https://raw.githubusercontent.com/unalyticsteam/databases/master/Boston.csv")

crim: per capita crime rate by town.

zn: proportion of residential land zoned for large lots (over 25,000 sq. ft.).

indus: proportion of non-retail business acres per town.

chas: Charles River dummy variable (1 if the tract bounds the river, 0 otherwise).

rm: average number of rooms per dwelling.

...

Quick Analysis


In [0]:
# Look at the first 3 rows
data.head(3)
Out[0]:
crim zn indus chas nox rm age dis rad tax ptratio black lstat medv
0 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03 34.7

Let's look at the variables, their data types, and the memory usage.

In [0]:
# dimensions of the data set.
data.shape
Out[0]:
(506, 14)
In [0]:
# summary information about the data set.
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
crim       506 non-null float64
zn         506 non-null float64
indus      506 non-null float64
chas       506 non-null int64
nox        506 non-null float64
rm         506 non-null float64
age        506 non-null float64
dis        506 non-null float64
rad        506 non-null int64
tax        506 non-null int64
ptratio    506 non-null float64
black      506 non-null float64
lstat      506 non-null float64
medv       506 non-null float64
dtypes: float64(11), int64(3)
memory usage: 55.4 KB

Pre-processing the data set


In [0]:
colnames = data.columns.values.tolist()
colnames
Out[0]:
['crim',
 'zn',
 'indus',
 'chas',
 'nox',
 'rm',
 'age',
 'dis',
 'rad',
 'tax',
 'ptratio',
 'black',
 'lstat',
 'medv']
In [0]:
# preprocessing: separating predictors from the target
predictors = colnames[0:13]
targets = colnames[13]
x = data[predictors]
y = data[targets]
In [0]:
regtree = DecisionTreeRegressor(min_samples_split=30, min_samples_leaf=10, random_state=0)
In [0]:
regtree.fit(x, y)  # fit on predictors x and target y
Out[0]:
DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=10,
                      min_samples_split=30, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=0, splitter='best')
In [0]:
# in-sample predictions (the same rows used for fitting)
preds = regtree.predict(data[predictors])
In [0]:
data["preds"] = preds
In [0]:
data[["preds", "medv"]].head()
Out[0]:
preds medv
0 22.840000 24.0
1 22.840000 21.6
2 35.247826 34.7
3 35.247826 33.4
4 35.247826 36.2
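
To put a number on the fit, the in-sample mean squared error can be computed with sklearn.metrics. Note that this is an optimistic estimate, since the tree was fit on these same rows:

In [0]:
# in-sample mean squared error (optimistic: evaluated on the training rows)
from sklearn.metrics import mean_squared_error
mean_squared_error(data["medv"], data["preds"])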

Visualization

In [0]:
from sklearn.tree import export_graphviz
In [0]:
# export the fitted regression tree as a .dot file
with open("boston_rtree.dot", "w") as dotfile:
    export_graphviz(regtree, out_file=dotfile, feature_names=predictors)

# read the .dot file and render the graph
from graphviz import Source
file = open("boston_rtree.dot", "r")
text = file.read()
Source(text)
Out[0]:
[Rendered regression tree: the root splits on rm <= 6.941 (mse = 84.42, samples = 506, value = 22.533); the lower-rm subtree splits further on lstat, crim, nox, age and related variables, while the higher-rm subtree splits on rm <= 7.437, lstat, and ptratio, reaching leaves with predicted medv values ranging from about 8.1 to 48.0.]
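
In scikit-learn 0.21 and later, sklearn.tree.plot_tree renders the same diagram with matplotlib and avoids the graphviz dependency; a minimal sketch:

In [0]:
# graphviz-free rendering (requires scikit-learn >= 0.21)
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree
fig, ax = plt.subplots(figsize=(20, 10))
plot_tree(regtree, feature_names=predictors, filled=True, ax=ax)
plt.show()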