This repo includes some graph-based CTR prediction models and other representative baselines.

Overview

Graph-based CTR prediction

This is a repository designed for graph-based CTR prediction methods, it includes our graph-based CTR prediction methods:

  • Fi-GNN: Modeling Feature Interactions via Graph Neural Networks for CTR Prediction paper
  • GraphFM: Graph Factorization Machines for Feature Interaction Modeling paper

and some other representative baselines:

  • HoAFM: A High-order Attentive Factorization Machine for CTR Prediction paper
  • AutoInt: AutoInt: Automatic Feature Interaction Learning via Self-Attentive Neural Networks paper
  • InterHAt: Interpretable Click-Through Rate Prediction through Hierarchical Attention paper

Requirements:

  • Tensorflow 1.5.0
  • Python 3.6
  • CUDA 9.0+ (For GPU)

Usage

Our code is based on AutoInt.

Input Format

The required input data is in the following format:

  • train_x: matrix with shape (num_sample, num_field). train_x[s][t] is the feature value of feature field t of sample s in the dataset. The default value for categorical feature is 1.
  • train_i: matrix with shape (num_sample, num_field). train_i[s][t] is the feature index of feature field t of sample s in the dataset. The maximal value of train_i is the feature size.
  • train_y: label of each sample in the dataset.

If you want to know how to preprocess the data, please refer to data/Dataprocess/Criteo/preprocess.py

Example

There are four public real-world datasets(Avazu, Criteo, KDD12, MovieLens-1M) that you can use. You can run the code on MovieLens-1M dataset directly in /movielens. The other three datasets are super huge, and they can not be fit into the memory as a whole. Therefore, we split the whole dataset into 10 parts and we use the first file as test set and the second file as valid set. We provide the codes for preprocessing these three datasets in data/Dataprocess. If you want to reuse these codes, you should first run preprocess.py to generate train_x.txt, train_i.txt, train_y.txt as described in Input Format. Then you should run data/Dataprocesss/Kfold_split/StratifiedKfold.py to split the whole dataset into ten folds. Finally you can run scale.py to scale the numerical value(optional).

To help test the correctness of the code and familarize yourself with the code, we upload the first 10000 samples of Criteo dataset in train_examples.txt. And we provide the scripts for preprocessing and training.(Please refer to data/sample_preprocess.sh and run_criteo.sh, you may need to modify the path in config.py and run_criteo.sh).

After you run the data/sample_preprocess.sh, you should get a folder named Criteo which contains part*, feature_size.npy, fold_index.npy, train_*.txt. feature_size.npy contains the number of total features which will be used to initialize the model. train_*.txt is the whole dataset.

Here's how to run the preprocessing.

cd data
mkdir Criteo
python ./Dataprocess/Criteo/preprocess.py
python ./Dataprocess/Kfold_split/stratifiedKfold.py
python ./Dataprocess/Criteo/scale.py

Here's how to train GraphFM on Criteo dataset.

CUDA_VISIBLE_DEVICES=$GPU python -m code.train \
--model_type GraphFM \
                        --data_path $YOUR_DATA_PATH --data Criteo \
                        --blocks 3 --heads 2 --block_shape "[64, 64, 64]" \
                        --ks "[39, 20, 5]" \
                        --is_save --has_residual \
                        --save_path ./models/GraphFM/Criteo/b3h2_64x64x64/ \
                        --field_size 39  --run_times 1 \
                        --epoch 2 --batch_size 1024 \

Here's how to train GraphFM on Avazu dataset.

CUDA_VISIBLE_DEVICES=$GPU python -m code.train \
--model_type GraphFM \
                        --data_path $YOUR_DATA_PATH --data Avazu \
                        --blocks 3 --heads 2 --block_shape "[64, 64, 64]" \
                        --ks "[23, 10, 2]" \
                        --is_save --has_residual \
                        --save_path ./models/GraphFM/Avazu/b3h2_64x64x64/ \
                        --field_size 23  --run_times 1 \
                        --epoch 2 --batch_size 1024 \

You can run the training on the relatively small MovieLens dataset in /movielens.

You should see the output like this:

...
train logs
...
start testing!...
restored from ./models/Criteo/b3h2_64x64x64/1/
test-result = 0.8088, test-logloss = 0.4430
test_auc [0.8088305055534442]
test_log_loss [0.44297631300399626]
avg_auc 0.8088305055534442
avg_log_loss 0.44297631300399626

Citation

If you find this repo useful for your research, please consider citing the following paper:

@inproceedings{li2019fi,
  title={Fi-gnn: Modeling feature interactions via graph neural networks for ctr prediction},
  author={Li, Zekun and Cui, Zeyu and Wu, Shu and Zhang, Xiaoyu and Wang, Liang},
  booktitle={Proceedings of the 28th ACM International Conference on Information and Knowledge Management},
  pages={539--548},
  year={2019}
}

@article{li2021graphfm,
  title={GraphFM: Graph Factorization Machines for Feature Interaction Modeling},
  author={Li, Zekun and Wu, Shu and Cui, Zeyu and Zhang, Xiaoyu},
  journal={arXiv preprint arXiv:2105.11866},
  year={2021}
}

Contact information

You can contact Zekun Li ([email protected]), if there are questions related to the code.

Acknowledgement

This implementation is based on Weiping Song and Chence Shi's AutoInt. Thanks for their sharing and contribution.

Owner
Big Data and Multi-modal Computing Group, CRIPAC
Big Data and Multi-modal Computing Group, Center for Research on Intelligent Perception and Computing
Big Data and Multi-modal Computing Group, CRIPAC
A collection of Machine Learning Models To Web Api which are built on open source technologies/frameworks like Django, Flask.

Author Ibrahim Koné From-Machine-Learning-Models-To-WebAPI A collection of Machine Learning Models To Web Api which are built on open source technolog

Ibrahim Koné 2 May 24, 2022
MLOps pipeline project using Amazon SageMaker Pipelines

This project shows steps to build an end to end MLOps architecture that covers data prep, model training, realtime and batch inference, build model registry, track lineage of artifacts and model drif

AWS Samples 3 Sep 16, 2022
Applied Machine Learning for Graduate Program in Computer Science (PPGCC)

Applied Machine Learning for Graduate Program in Computer Science (PPGCC) - Federal University of Santa Catarina

Jônatas Negri Grandini 1 Dec 22, 2021
Causal Inference and Machine Learning in Practice with EconML and CausalML: Industrial Use Cases at Microsoft, TripAdvisor, Uber

Causal Inference and Machine Learning in Practice with EconML and CausalML: Industrial Use Cases at Microsoft, TripAdvisor, Uber

EconML/CausalML KDD 2021 Tutorial 124 Dec 28, 2022
A Python Module That Uses ANN To Predict A Stocks Price And Also Provides Accurate Technical Analysis With Many High Potential Implementations!

Stox A Module to predict the "close price" for the next day and give "technical analysis". It uses a Neural Network and the LSTM algorithm to predict

Stox 31 Dec 16, 2022
A Powerful Serverless Analysis Toolkit That Takes Trial And Error Out of Machine Learning Projects

KXY: A Seemless API to 10x The Productivity of Machine Learning Engineers Documentation https://www.kxy.ai/reference/ Installation From PyPi: pip inst

KXY Technologies, Inc. 35 Jan 02, 2023
Open source time series library for Python

PyFlux PyFlux is an open source time series library for Python. The library has a good array of modern time series models, as well as a flexible array

Ross Taylor 2k Jan 02, 2023
Pandas DataFrames and Series as Interactive Tables in Jupyter

Pandas DataFrames and Series as Interactive Tables in Jupyter Star Turn pandas DataFrames and Series into interactive datatables in both your notebook

Marc Wouts 364 Jan 04, 2023
Tangram makes it easy for programmers to train, deploy, and monitor machine learning models.

Tangram Website | Discord Tangram makes it easy for programmers to train, deploy, and monitor machine learning models. Run tangram train to train a mo

Tangram 1.4k Jan 05, 2023
Arquivos do curso online sobre a estatística voltada para ciência de dados e aprendizado de máquina.

Estatistica para Ciência de Dados e Machine Learning Arquivos do curso online sobre a estatística voltada para ciência de dados e aprendizado de máqui

Renan Barbosa 1 Jan 10, 2022
Using Logistic Regression and classifiers of the dataset to produce an accurate recall, f-1 and precision score

Using Logistic Regression and classifiers of the dataset to produce an accurate recall, f-1 and precision score

Thines Kumar 1 Jan 31, 2022
Intel(R) Extension for Scikit-learn is a seamless way to speed up your Scikit-learn application

Intel(R) Extension for Scikit-learn* Installation | Documentation | Examples | Support | FAQ With Intel(R) Extension for Scikit-learn you can accelera

Intel Corporation 858 Dec 25, 2022
Probabilistic time series modeling in Python

GluonTS - Probabilistic Time Series Modeling in Python GluonTS is a Python toolkit for probabilistic time series modeling, built around Apache MXNet (

Amazon Web Services - Labs 3.3k Jan 03, 2023
We have a dataset of user performances. The project is to develop a machine learning model that will predict the salaries of baseball players.

Salary-Prediction-with-Machine-Learning 1. Business Problem Can a machine learning project be implemented to estimate the salaries of baseball players

Ayşe Nur Türkaslan 9 Oct 14, 2022
Bayesian optimization in JAX

Bayesian optimization in JAX

Predictive Intelligence Lab 26 May 11, 2022
A basic Ray Tracer that exploits numpy arrays and functions to work fast.

Python-Fast-Raytracer A basic Ray Tracer that exploits numpy arrays and functions to work fast. The code is written keeping as much readability as pos

Rafael de la Fuente 393 Dec 27, 2022
BudouX is the successor to Budou, the machine learning powered line break organizer tool.

BudouX Standalone. Small. Language-neutral. BudouX is the successor to Budou, the machine learning powered line break organizer tool. It is standalone

Google 868 Jan 05, 2023
ml4ir: Machine Learning for Information Retrieval

ml4ir: Machine Learning for Information Retrieval | changelog Quickstart → ml4ir Read the Docs | ml4ir pypi | python ReadMe ml4ir is an open source li

Salesforce 77 Jan 06, 2023
This machine-learning algorithm takes in data from the last 60 days and tries to predict tomorrow's price of any crypto you ask it.

Crypto-Currency-Predictor This machine-learning algorithm takes in data from the last 60 days and tries to predict tomorrow's price of any crypto you

Hazim Arafa 6 Dec 04, 2022
Optimal Randomized Canonical Correlation Analysis

ORCCA Optimal Randomized Canonical Correlation Analysis This project is for the python version of ORCCA algorithm. It depends on Numpy for matrix calc

Yinsong Wang 1 Nov 21, 2021