A Guide for Feature Engineering and Feature Selection, with implementations and examples in Python.

Overview

Feature Engineering & Feature Selection

A comprehensive guide [pdf] [markdown] for Feature Engineering and Feature Selection, with implementations and examples in Python.

Motivation

Feature Engineering & Selection is the most essential part of building a useable machine learning project, even though hundreds of cutting-edge machine learning algorithms coming in these days like deep learning and transfer learning. Indeed, like what Prof Domingos, the author of  'The Master Algorithm' says:

“At the end of the day, some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used.”

— Prof. Pedro Domingos

001

Data and feature has the most impact on a ML project and sets the limit of how well we can do, while models and algorithms are just approaching that limit. However, few materials could be found that systematically introduce the art of feature engineering, and even fewer could explain the rationale behind. This repo is my personal notes from learning ML and serves as a reference for Feature Engineering & Selection.

Download

Download the PDF here:

Same, but in markdown:

PDF has a much readable format, while Markdown has auto-generated anchor link to navigate from outer source. GitHub sucks at displaying markdown with complex grammar, so I would suggest read the PDF or download the repo and read markdown with Typora.

What You'll Learn

Not only a collection of hands-on functions, but also explanation on Why, How and When to adopt Which techniques of feature engineering in data mining.

  • the nature and risk of data problem we often encounter
  • explanation of the various feature engineering & selection techniques
  • rationale to use it
  • pros & cons of each method
  • code & example

Getting Started

This repo is mainly used as a reference for anyone who are doing feature engineering, and most of the modules are implemented through scikit-learn or its communities.

To run the demos or use the customized function, please download the ZIP file from the repo or just copy-paste any part of the code you find helpful. They should all be very easy to understand.

Required Dependencies:

  • Python 3.5, 3.6 or 3.7
  • numpy>=1.15
  • pandas>=0.23
  • scipy>=1.1.0
  • scikit_learn>=0.20.1
  • seaborn>=0.9.0

Table of Contents and Code Examples

Below is a list of methods currently implemented in the repo.

1. Data Exploration

2. Feature Cleaning

3. Feature Engineering

4. Feature Selection

Key Links and Resources

  • Udemy's Feature Engineering online course

https://www.udemy.com/feature-engineering-for-machine-learning/

  • Udemy's Feature Selection online course

https://www.udemy.com/feature-selection-for-machine-learning

  • JMLR Special Issue on Variable and Feature Selection

http://jmlr.org/papers/special/feature03.html

  • Data Analysis Using Regression and Multilevel/Hierarchical Models, Chapter 25: Missing data

http://www.stat.columbia.edu/~gelman/arm/missing.pdf

  • Data mining and the impact of missing data

http://core.ecu.edu/omgt/krosj/IMDSDataMining2003.pdf

  • PyOD: A Python Toolkit for Scalable Outlier Detection

https://github.com/yzhao062/pyod

  • Weight of Evidence (WoE) Introductory Overview

http://documentation.statsoft.com/StatisticaHelp.aspx?path=WeightofEvidence/WeightofEvidenceWoEIntroductoryOverview

  • About Feature Scaling and Normalization

http://sebastianraschka.com/Articles/2014_about_feature_scaling.html

  • Feature Generation with RF, GBDT and Xgboost

https://blog.csdn.net/anshuai_aw1/article/details/82983997

  • A review of feature selection methods with applications

https://ieeexplore.ieee.org/iel7/7153596/7160221/07160458.pdf

Owner
Yimeng.Zhang
I'm a lovely machine learning learner~
Yimeng.Zhang
Number calculator application.

Number calculator application.

Michael J Bailey 3 Oct 08, 2021
Addons like multipages for streamlit webapp

streamlit_pages Installation $ pip install streamlit-pages Features Adding multiple pages to streamlit Sharing specific pages Usage import streamlit

36 Dec 25, 2022
200 LeetCode problems

LeetCode I classify 200 leetcode problems into some categories and upload my code to who concern WEEK 1 # Title Difficulty Array 15 3Sum Medium 1324 P

Hoang Cao Bao 108 Dec 08, 2022
Basic Hspice runner with Python

HSpicePy Bilgisayarınıza PATH değişkenlerine eklediğiniz HSPICE programını python ile çalıştırmanızı sağlayan basit bir araç. A simple tool that allow

1 Nov 16, 2021
Chicks get hostloc points regularly

hostloc_getPoints 小鸡定时获取hostloc积分 github action大规模失效,mjj平均一人10鸡,以下可以部署到自己的小鸡上

59 Dec 28, 2022
The tool helps to find hidden parameters that can be vulnerable or can reveal interesting functionality that other hunters miss.

The tool helps to find hidden parameters that can be vulnerable or can reveal interesting functionality that other hunters miss. Greater accuracy is achieved thanks to the line-by-line comparison of

197 Nov 14, 2022
Find out where all films you want to watch are streaming

Just Watch Letterboxd Find out where all films you want to watch are streaming Ever wonder what films you want to watch are already on the streaming p

Jordan Oslislo 2 Feb 04, 2022
A collection of repositories used to realise various end-to-end high-level synthesis (HLS) flows centering around the CIRCT project.

circt-hls What is this?: A collection of repositories used to realise various end-to-end high-level synthesis (HLS) flows centering around the CIRCT p

29 Dec 14, 2022
Un script en python qui permet d'automatique bumpée (disboard.org) tout les 2h

auto-bumper Un script en python qui permet d'automatique bumpée (disboard.org) tout les 2h Pour la première utilisation, 1.Lancer Install.bat 2.(faire

!! 1 Jan 09, 2022
🌌A Python library to exhaustively enumerate a combinatorial space represented by a function

exhaust A Python library to exhaustively enumerate a combinatorial space represented by a function. The API is modelled after Python's random module a

Maik Riechert 1 Dec 05, 2021
WinBoost: Boost your windows system.

Winboost runs a complete checkup of your entire system locating junk files, speed-reducing issues and causes of any system or application glitches or crashes. Through a lot of research and testing, w

Smit Parmar 4 Oct 01, 2021
Python library to decode the EU Covid-19 vaccine certificate

DCC Utils Python library to decode the EU Covid-19 vaccine certificate, as specified by the EU. Setup pip install dcc-utils Make sure zbar is installe

Developers Italia 13 Mar 11, 2022
Script de monitoramento de telemetria para missões espaciais, cansat e foguetemodelismo.

Aeroespace_GroundStation Script de monitoramento de telemetria para missões espaciais, cansat e foguetemodelismo. Imagem 1 - Dashboard realizando moni

Vinícius Azevedo 5 Nov 27, 2022
Personal Finance Forecaster - An AI tool for forecasting personal expenses

Personal Finance Forecaster - An AI tool for forecasting personal expenses

2 Mar 09, 2022
This is a simple SV calling package for diploid assemblies.

dipdiff This is a simple SV calling package for diploid assemblies. It uses a modified version of svim-asm. The package includes its own version minim

Mikhail Kolmogorov 11 Jan 05, 2023
Snek-test - An operating system kernel made in python and assembly

pythonOS An operating system kernel made in python and assembly Wait what? It us

TechStudent10 2 Jan 25, 2022
Ant Colony Optimization for Traveling Salesman Problem

tsp-aco Ant Colony Optimization for Traveling Salesman Problem Dependencies Python 3.8 tqdm numpy matplotlib To run the solver run main.py from the p

Baha Eren YALDIZ 4 Feb 03, 2022
Inspect the resources of your android projects and understand which ones are not being used and could potentially be removed.

Android Resources Checker What This program will inspect the resources of your app and help you understand which ones are not being used and could pot

Fábio Carballo 39 Feb 08, 2022
WordlistPasswordGenerator - Shuhfab Basheer

WordlistPasswordGenerator - Shuhfab Basheer Python wordlist generator MAINTAINER

1 Dec 31, 2021
Rename and categorize your DMOJ solutions

DMOJ Downloader What is this for? DMOJ lets you download the code for all your solutions, however the files are just named as numbers

Evan Wild 1 Dec 04, 2022