Machine Learning Algorithms

Overview

Machine-Learning-Algorithms

In this project, the dataset was created through a survey opened on Google forms. The purpose of the form is to find the person's favorite shopping type based on the information provided. In this context, 13 questions were asked to the user. As a result of these questions, the estimation of the shopping type, which is a classification problem, will be carried out with 5 different algorithms.

These algorithms;

  • Logistic Regression
  • Random Forest Classifier
  • Support Vector Machine
  • K Neighbors
  • Decision Tree

algorithms will have a total of 12 parameters

A total of 219 people participated in the survey and the answers given to this form were used in the training of the algorithm.

Target variables to be estimated;

  • Clothing
  • Technology
  • Home/Life
  • Book/Magazine

The questions asked to make the estimation are as follows:

  • Gender
  • Age
  • Which store would you prefer to go to?
  • Which store would you prefer to go to?
  • Which store would you prefer to go to?
  • What is your favorite season?
  • What is the importance of the dollar exchange rate for your shopping?
  • What is your satisfaction level with your budget for shopping?
  • How would you rate your social life?
  • Which of the online shopping sites do you prefer?
  • How often do you go shopping?
  • What is your average sleep time per day?
  • What is your favorite type of shopping? // target

The dataset, which is in the form of a csv file, is read to the system as a dataframe. And the column of information in which hour and minute the user filled out the form, which does not make sense for our algorithm, is removed.

Since the numbers in some columns is way more different than the others before the PCA operation is performed, the standardization process is applied to the columns so that they do not have a greater effect than the combination of these columns during the PCA operation.

The features and target columns to be used during the export of the dataset to the algorithms are determined.

In order to fit the resulting algorithms, the initial state of the dataset, its normalized state and the pca applied states are kept separately. The generated data is divided into parts as train = 0.8 and test = 0.2. Cross Validation process will be applied on 0.8 train data.

Before giving the dataset to the 5 algorithms, the answers written in the text in the dataset and the text in the other questions are encoded and the dataset is converted into numbers.

The 5 algorithms are functions from the sklearn library. The Cross Validation process was performed using the GridSearchCV() function, excluding the Logistic Regression algorithm. In the Logistic regression algorithm, since it is possible to do Cross Validation with the logistic regression function it is not necessary to use GridSearchCV().

GridSearchCV() applies K-Fold Cross Validation by trying the parameters I gave for the function, the number of K for my project is 10. By dividing the cross validation process parameters and the train data we provide, it is determined at which values we can get the best result.

An algorithm is created using the determined parameters and the algorithm is tested with the test data to be fitted with the train data.

Detailed information about dataset can be found in the report.

Owner
Göktuğ Ayar
Computer Engineering student at Yildiz Technical University
Göktuğ Ayar
Open MLOps - A Production-focused Open-Source Machine Learning Framework

Open MLOps - A Production-focused Open-Source Machine Learning Framework Open MLOps is a set of open-source tools carefully chosen to ease user experi

Data Revenue 590 Dec 28, 2022
This repo includes some graph-based CTR prediction models and other representative baselines.

Graph-based CTR prediction This is a repository designed for graph-based CTR prediction methods, it includes our graph-based CTR prediction methods: F

Big Data and Multi-modal Computing Group, CRIPAC 47 Dec 30, 2022
PyHarmonize: Adding harmony lines to recorded melodies in Python

PyHarmonize: Adding harmony lines to recorded melodies in Python About To use this module, the user provides a wav file containing a melody, the key i

Julian Kappler 2 May 20, 2022
A flexible CTF contest platform for coming PKU GeekGame events

Project Guiding Star: the Backend A flexible CTF contest platform for coming PKU GeekGame events Still in early development Highlights Not configurabl

PKU GeekGame 14 Dec 15, 2022
A high performance and generic framework for distributed DNN training

BytePS BytePS is a high performance and general distributed training framework. It supports TensorFlow, Keras, PyTorch, and MXNet, and can run on eith

Bytedance Inc. 3.3k Dec 28, 2022
BigDL: Distributed Deep Learning Framework for Apache Spark

BigDL: Distributed Deep Learning on Apache Spark What is BigDL? BigDL is a distributed deep learning library for Apache Spark; with BigDL, users can w

4.1k Jan 09, 2023
End to End toy example of MLOps

churn_model MLOps Toy Example End to End You might find below links useful Connect VSCode to Git MLFlow Port Heroku App Project Organization ├── LICEN

Ashish Tele 6 Feb 06, 2022
Python bindings for MPI

MPI for Python Overview Welcome to MPI for Python. This package provides Python bindings for the Message Passing Interface (MPI) standard. It is imple

MPI for Python 604 Dec 29, 2022
50% faster, 50% less RAM Machine Learning. Numba rewritten Sklearn. SVD, NNMF, PCA, LinearReg, RidgeReg, Randomized, Truncated SVD/PCA, CSR Matrices all 50+% faster

[Due to the time taken @ uni, work + hell breaking loose in my life, since things have calmed down a bit, will continue commiting!!!] [By the way, I'm

Daniel Han-Chen 1.4k Jan 01, 2023
SIMD-accelerated bitwise hamming distance Python module for hexidecimal strings

hexhamming What does it do? This module performs a fast bitwise hamming distance of two hexadecimal strings. This looks like: DEADBEEF = 1101111010101

Michael Recachinas 12 Oct 14, 2022
moDel Agnostic Language for Exploration and eXplanation

moDel Agnostic Language for Exploration and eXplanation Overview Unverified black box model is the path to the failure. Opaqueness leads to distrust.

Model Oriented 1.2k Jan 04, 2023
PySurvival is an open source python package for Survival Analysis modeling

PySurvival What is Pysurvival ? PySurvival is an open source python package for Survival Analysis modeling - the modeling concept used to analyze or p

Square 265 Dec 27, 2022
hgboost - Hyperoptimized Gradient Boosting

hgboost is short for Hyperoptimized Gradient Boosting and is a python package for hyperparameter optimization for xgboost, catboost and lightboost using cross-validation, and evaluating the results o

Erdogan Taskesen 34 Jan 03, 2023
Mosec is a high-performance and flexible model serving framework for building ML model-enabled backend and microservices

Mosec is a high-performance and flexible model serving framework for building ML model-enabled backend and microservices. It bridges the gap between any machine learning models you just trained and t

164 Jan 04, 2023
ml4h is a toolkit for machine learning on clinical data of all kinds including genetics, labs, imaging, clinical notes, and more

ml4h is a toolkit for machine learning on clinical data of all kinds including genetics, labs, imaging, clinical notes, and more

Broad Institute 65 Dec 20, 2022
K-Means clusternig example with Python and Scikit-learn

Unsupervised-Machine-Learning Flat Clustering K-Means clusternig example with Python and Scikit-learn Flat clustering Clustering algorithms group a se

Emin 1 Dec 13, 2021
A library to generate synthetic time series data by easy-to-use factors and generator

timeseries-generator This repository consists of a python packages that generates synthetic time series dataset in a generic way (under /timeseries_ge

Nike Inc. 87 Dec 20, 2022
Katana project is a template for ASAP 🚀 ML application deployment

Katana project is a FastAPI template for ASAP 🚀 ML API deployment

Mohammad Shahebaz 100 Dec 26, 2022
LibTraffic is a unified, flexible and comprehensive traffic prediction library based on PyTorch

LibTraffic is a unified, flexible and comprehensive traffic prediction library, which provides researchers with a credibly experimental tool and a convenient development framework. Our library is imp

432 Jan 05, 2023