Kaggle Pipeline Hitchhike

The more work I do, the more hikes I take. While doing some background machine-learning research for my Sentiment Analysis hitchhike, I fell into the Kaggle rabbit hole. I started with the Titanic competition, moved on to the Housing Prices competition, and before I knew it I was concatenating so many detours that I figured I’d better document the process so I could find my way back.

The Kaggle website has a surplus of ML resources, with step-by-step tutorials and live notebooks that let you test code in real time. Spend a few hours digging around the site and then try telling me you don’t understand the basics of artificial intelligence. You’d be lying. Anyway, the first couple of tutorials I meandered through gave me plenty of instruction on preprocessing data and creating models – so much instruction that I felt a bit overwhelmed. Replacing missing values, breaking apart category columns, and fitting a model to a data set that matches the shape of the test set was tricky and tedious. It was easy to lose track of which data set I was updating, and certain preprocessing steps would butcher the structure I was working with, requiring repair steps to restore a missing index or re-append dropped columns. Nonetheless, it was clear that these steps would be useful, and therefore necessary, every time I created a model. If only there were a way to streamline this…

Behold the pipeline! The scikit-learn library explains that a pipeline can be used to chain estimators, reducing your code by letting you call “fit” and “predict” just once on your entire data set. So with this newfound utility, I set out to create the dankest pipeline known to the AI community…or, well, a pipeline that could fill in missing values and make category values useful. It will get danker with time.
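To make the idea concrete, here is a minimal sketch (made-up numeric data, not the competition set) of how chaining an imputer and a model into one Pipeline lets a single fit call run every step:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

# Tiny toy data with one missing value
X = np.array([[1.0], [2.0], [np.nan], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),  # fills the NaN first
    ('model', LinearRegression())                 # then fits the model
])

pipe.fit(X, y)            # one call fits every step in order
preds = pipe.predict(X)   # one call transforms, then predicts
print(len(preds))         # 4
```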

The Optimized Journey (~5 min)

This is the path I would have taken if I knew what I know now…

  • Load the Data

    import pandas as pd

    # Read the data; drop rows missing the target (not rows missing any value)
    X_full = pd.read_csv('..train.csv', index_col='Id')
    X_full.dropna(axis=0, subset=['Target'], inplace=True)
    y = X_full.Target
    X_full.drop(['Target'], axis=1, inplace=True)
  • Find the Columns

    # Select categorical columns with low cardinality
    low_cat_cols = [cname for cname in X_full.columns
                    if X_full[cname].nunique() < 10
                    and X_full[cname].dtype == "object"]

    # Select categorical columns with high cardinality
    high_cat_cols = [cname for cname in X_full.columns
                     if X_full[cname].nunique() >= 10
                     and X_full[cname].dtype == "object"]

    # Select numerical columns
    numerical_cols = [cname for cname in X_full.columns
                      if X_full[cname].dtype in ['int64', 'float64']]

    # Keep only the selected columns for training
    my_cols = low_cat_cols + high_cat_cols + numerical_cols
    X_train = X_full[my_cols].copy()
  • Create Transformers for each Set

    from sklearn.pipeline import Pipeline
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import OneHotEncoder

    # Preprocessing for categorical data
    low_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ])

    high_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ])

    # Preprocessing for numerical data
    numerical_transformer = SimpleImputer(strategy='constant')
  • Preprocessor, Model, Pipeline

    from sklearn.compose import ColumnTransformer
    from sklearn.ensemble import RandomForestRegressor

    # Bundle preprocessing for num and cat data
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numerical_transformer, numerical_cols),
            ('high', high_transformer, high_cat_cols),
            ('cat', low_transformer, low_cat_cols)
        ])

    # Define model
    model = RandomForestRegressor(n_estimators=100, random_state=0)

    # Bundle preprocessing and modeling code in a pipeline
    clf = Pipeline(steps=[('preprocessor', preprocessor),
                          ('model', model)])
  • Fit the Pipeline

    # Preprocessing of training data, fit model
    clf.fit(X_train, y)
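The steps above can be exercised end to end on a tiny synthetic frame. This is a self-contained sketch – the column names and values here are invented, not from the competition data – showing the same impute/encode/model structure fitting in one call:

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor

# Toy stand-in for the competition data (hypothetical columns)
X_train = pd.DataFrame({
    'Rooms': [3, 4, np.nan, 5, 2, 4],
    'Color': ['red', 'blue', np.nan, 'red', 'blue', 'red'],
})
y_train = pd.Series([100, 150, 120, 180, 90, 160])

numerical_cols = ['Rooms']
low_cat_cols = ['Color']

# Same structure as the pipeline above: impute numbers,
# impute-then-one-hot the categories, then model
preprocessor = ColumnTransformer(transformers=[
    ('num', SimpleImputer(strategy='constant'), numerical_cols),
    ('cat', Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ]), low_cat_cols),
])

clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('model', RandomForestRegressor(n_estimators=10, random_state=0))])

clf.fit(X_train, y_train)         # one fit runs every step
preds = clf.predict(X_train)
print(preds.shape)                # (6,)
```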

The Hitchhiker’s Journey (1 hour)

This is the path I actually took to create a functional Pipeline template…

  • Red – Obstacles I couldn’t have avoided
  • Yellow – Obstacles I could have avoided if I had read carefully
  • Green – Wow. Actual progress
  • Trek Through the Pipeline Tutorial

    I will say it again: Kaggle has the greatest online tutorials I have ever stumbled my way through. The age of online learning is upon us. Begin your new education today, padawan.

    Here’s the pipeline tutorial.

  • Copy the Tutorial Code and Feel Knowledgeable

    It’s funny how often I take the code that someone else has written, paste it into one of my test projects, and feel like I’ve accomplished something. But really, I think I have a short-term memory issue, and I’ve found it incredibly helpful to keep a “toolbox” of all the valuable code I find on the inter-web. When I need it, I grab it, tweak it, test it, and save myself the time of having to go search for it again. The main part of the code is here:

    # Preprocessing for numerical data
    numerical_transformer = SimpleImputer(strategy='constant')

    # Preprocessing for categorical data
    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ])

    # Bundle preprocessing for num and cat data
    preprocessor = ColumnTransformer(transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

    # Define model
    model = RandomForestRegressor(n_estimators=100, random_state=0)

    # Bundle preprocessing and modeling code in a pipeline
    clf = Pipeline(steps=[('preprocessor', preprocessor),
                          ('model', model)])

    # Preprocessing of training data, fit model
    clf.fit(X_train, y_train)

  • Edit the Tutorial code

    The code seemed simple enough, so I got the bright idea to add a third transformer to the preprocessor. The code wasn’t yet handling columns with high cardinality, so my goal was to handle them and handle them good.

    I found the high cardinality columns and then created a LabelEncoder() transformer to presumably convert any category values to numbers…idk, seemed like an okay thing to do.

    high_transformer = LabelEncoder()
    
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numerical_transformer, numerical_cols),
            ('high',high_transformer,high_cats),
            ('cat',categorical_transformer,categorical_cols)
        ])

    All is good in the hood.

  • Run the Code into a Wall

    I ran the code and saw the black box of grief and misery.

    Error: TypeError: fit_transform() takes 2 positional arguments but 3 were given

    I wish I knew what this meant. Before doing too much research, I wanted to test another type of transformer to see if something was off about the LabelEncoder().

    high_transformer = SimpleImputer(strategy='median')

    The SimpleImputer simply imputes – I mean, it fills in missing values.
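For instance, on a toy column (made-up numbers, just to show what the imputer does):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One column with a single missing value
X = np.array([[1.0], [np.nan], [3.0]])

imputer = SimpleImputer(strategy='median')
filled = imputer.fit_transform(X).ravel()
print(filled)  # [1. 2. 3.] -- the NaN becomes the median of 1 and 3
```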

  • Run the Code into Another Wall

    The black wall of errors filled the screen again, but this time the root error was different.

    Error: AttributeError: 'DataFrame' object has no attribute 'dtype'

    Maybe it’s my strategy?

    high_transformer = SimpleImputer(strategy='most_frequent')

    Error: ValueError: could not convert string to float: 'NridgHt'

    BAH.

  • Read Up on Imputers

    The scikit-learn website has a User Guide that goes over the fundamentals behind each feature the library offers, and I found a section on imputation. Apparently, the SimpleImputer CAN be used on categorical data if strategy='constant' or strategy='most_frequent', so I wasn’t wrong.

    Looking back at the error message, I could see that it was thrown when the pipeline was being fit to the data, and that’s when I realized that ML models can only deal with numerical values, which is why OneHotEncoding and LabelEncoding are a thing. Duh.

    I added another line to my high_transformer to convert the imputed string column to numerical values.

    high_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ])
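A quick sanity check of that impute-then-encode chain on a made-up neighborhood column (the 'NridgHt' value is borrowed from the error message above):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# String column with one missing value
X = np.array([['NridgHt'], [np.nan], ['OldTown'], ['NridgHt']], dtype=object)

high_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),  # NaN -> 'NridgHt'
    ('onehot', OneHotEncoder(handle_unknown='ignore'))     # strings -> 0/1 columns
])

out = high_transformer.fit_transform(X)
print(out.shape)  # (4, 2): one column per neighborhood category
```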
  • Run the Code Away from the Wall

    The code executed this time without error and I now had explicit transformers for numerical columns, low cardinality columns, and high cardinality columns. Perfect, now we’re getting somewhere.

    Also, the LabelEncoder didn’t work the first time because this transformer is meant to be used on the target values, y, and not the training data, X.

    In case you didn’t believe me ^^
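If you want to see the mismatch for yourself: LabelEncoder.fit_transform accepts a single 1-D array (a target), while ColumnTransformer hands each transformer both X and y – hence the “takes 2 positional arguments but 3 were given” error. OrdinalEncoder is the feature-side counterpart that does work inside a ColumnTransformer:

```python
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# LabelEncoder: one 1-D array only -- meant for the target y
le = LabelEncoder()
encoded_y = le.fit_transform(['b', 'a', 'b'])
print(list(encoded_y))  # [1, 0, 1]

# OrdinalEncoder: takes a 2-D X (and an optional y), so it fits on features
oe = OrdinalEncoder()
encoded_X = oe.fit_transform([['b'], ['a'], ['b']])
print(encoded_X.ravel())  # [1. 0. 1.]
```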

  • Create Pipeline Template

    Since the code finally started working, I saved the important pieces off to a notebook for quick reference. Voilà.

Put that in your pipeline and smoke it.
