Mining the Stack Overflow Developer Survey

Overview

Mining the Stack Overflow Developer Survey

A prototype data mining application to compare the accuracy of decision tree and random forest regression models to predict annual compensation of tech workers in the US and Europe.

Objectives

Usage

To run, download the repository and execute the file main.py in the src directory with your python path variable. For example, python3 main.py.

Dependencies

  • python 3.8.1 and up
  • pandas 1.3.4 and up
  • matplotlib 3.4.3 and up
  • numpy 1.21.0 and up
  • sklearn 1.0.1 and up

Methodology

Preprocessing

The original data set provided by Stack Overflow contained 48 attribute columns and 83439 data records. Due to the large size of the data set, we wanted to narrow our focus to a certain subset of the data. In the preprocessing of the original data file, we decided to discard any records that were not employed full-time in the technology industry. Any record that did not contain country, converted annual salary, or yeared coded was also discarded, as this data is vital to our model. We also discarded some of the columns from the original data set that were open-ended. Out of the records that fit our requirements, we exported them to two output csv files. Records of United States data were put together in one output file, and records of European countries were put in the other. Data from any other countries were discarded. Once we have the two cleaned files, we applied additional preprocessing techniques. Any missing attributes that remained were replaced with 'NA' if the attributes were nominal. Two special cases existed in the columns for years coded and years coded professionally. Most contained a numerical value for the years, but some had a string for 'Less than one year' and 'More than 50 years'. These strings were replaced with 0 and 50, respectively, to keep these columns numerical. With these preprocessing steps complete, the data files are now ready to be processed to generate the models.

Models

We evaluated a variety of data mining models and algorithms to find the ones that would make the most sense for our data set and objectives. With our goal of predicting a numerical value for annual salary, we knew we needed to use a compatible regression model. We found regression models for decision trees and random forests and wanted to compare their accuracy. We wanted to see how the accuracy of a single decision tree compares to the accuracy of a random forest model, which is a number of trees together. The results are detailed in the results and analysis section. Below are the implementation details of each model.

Decision tree model

We selected the DecisionTreeRegressor model from the Scikit Learn machine learning package. In order to get the most accurate model, we trained several models with different parameters and selected the one with the highest accuracy to validate. The parameter we changed was the maximum depth level of each tree. Additional factors that affect the model are the testing split percentage and the cross validation folds. For our models, we used 20% of the data as testing and 80% as training and a cross validation value of 10. Out of every combination we tried, we found that a maximum depth of ADD RES HERE resulted in the most accurate decision tree model. The accuracy of the model was ADD RES HERE. This model will output the tree itself, several statistics of the model such as R-squared, mean absolute error, and mean squared error, and the ten attributes that have the largest weight in determining the result. With the best model selected, we then validated it against the testing data set. These steps of model generation were done for both the US data and the European data.

Random forest model

We selected the RandomForestRegressor model from the Scikit Learn machine learning package. In order to get the most accurate model, we trained several models with different parameters and selected the one with the highest accuracy to validate. The parameters we changed were the number of trees to estimate with and the maximum depth level of each tree. Additional factors that affect the model are the testing split percentage and the cross validation folds. For our models, we used 20% of the data as testing and 80% as training and a cross validation value of 10. Out of every combination we tried, we found that ADD RES HERE trees in the forest with a maximum depth of ADD RES HERE resulted in the most accurate random forest model. The accuracy of the model was ADD RES HERE. This model will output the tree itself, several statistics of the model such as R-squared, mean absolute error, and mean squared error, and the ten attributes that have the largest weight in determining the result. With the best model selected, we then validated it against the testing data set. These steps of model generation were done for both the US data and the European data.

Results and Analysis

Authors

Creating a statistical model to predict 10 year treasury yields

Predicting 10-Year Treasury Yields Intitially, I wanted to see if the volatility in the stock market, represented by the VIX index (data source), had

10 Oct 27, 2021
CRISP: Critical Path Analysis of Microservice Traces

CRISP: Critical Path Analysis of Microservice Traces This repo contains code to compute and present critical path summary from Jaeger microservice tra

Uber Research 110 Jan 06, 2023
apricot implements submodular optimization for the purpose of selecting subsets of massive data sets to train machine learning models quickly.

Please consider citing the manuscript if you use apricot in your academic work! You can find more thorough documentation here. apricot implements subm

Jacob Schreiber 457 Dec 20, 2022
Package for decomposing EMG signals into motor unit firings, as used in Formento et al 2021.

EMGDecomp Package for decomposing EMG signals into motor unit firings, created for Formento et al 2021. Based heavily on Negro et al, 2016. Supports G

13 Nov 01, 2022
An Integrated Experimental Platform for time series data anomaly detection.

Curve Sorry to tell contributors and users. We decided to archive the project temporarily due to the employee work plan of collaborators. There are no

Baidu 486 Dec 21, 2022
This repo is dedicated to the data extraction and manipulation of the World Bank's database called STEP.

Overview Welcome to the Step-X repository. This repo is dedicated to the data extraction and manipulation of the World Bank's database called STEP. Be

Keanu Pang 0 Jan 20, 2022
Gaussian processes in TensorFlow

Website | Documentation (release) | Documentation (develop) | Glossary Table of Contents What does GPflow do? Installation Getting Started with GPflow

GPflow 1.7k Jan 06, 2023
This tool parses log data and allows to define analysis pipelines for anomaly detection.

logdata-anomaly-miner This tool parses log data and allows to define analysis pipelines for anomaly detection. It was designed to run the analysis wit

AECID 32 Nov 27, 2022
Detecting Underwater Objects (DUO)

Underwater object detection for robot picking has attracted a lot of interest. However, it is still an unsolved problem due to several challenges. We take steps towards making it more realistic by ad

27 Dec 12, 2022
Advanced Pandas Vault — Utilities, Functions and Snippets (by @firmai).

PandasVault ⁠— Advanced Pandas Functions and Code Snippets The only Pandas utility package you would ever need. It has no exotic external dependencies

Derek Snow 374 Jan 07, 2023
Analysis scripts for QG equations

qg-edgeofchaos Analysis scripts for QG equations FIle/Folder Structure eigensolvers.py - Spectral and finite-difference solvers for Rossby wave eigenf

Norman Cao 2 Sep 27, 2022
Analyse the limit order book in seconds. Zoom to tick level or get yourself an overview of the trading day.

Analyse the limit order book in seconds. Zoom to tick level or get yourself an overview of the trading day. Correlate the market activity with the Apple Keynote presentations.

2 Jan 04, 2022
A forecasting system dedicated to smart city data

smart-city-predictions System prognostyczny dedykowany dla danych inteligentnych miast Praca inżynierska realizowana przez Michała Stawikowskiego and

Kevin Lai 1 Nov 08, 2021
A project consists in a set of assignements corresponding to a BI process: data integration, construction of an OLAP cube, qurying of a OPLAP cube and reporting.

TennisBusinessIntelligenceProject - A project consists in a set of assignements corresponding to a BI process: data integration, construction of an OLAP cube, qurying of a OPLAP cube and reporting.

carlo paladino 1 Jan 02, 2022
Data imputations library to preprocess datasets with missing data

Impyute is a library of missing data imputation algorithms. This library was designed to be super lightweight, here's a sneak peak at what impyute can do.

Elton Law 329 Dec 05, 2022
Karate Club: An API Oriented Open-source Python Framework for Unsupervised Learning on Graphs (CIKM 2020)

Karate Club is an unsupervised machine learning extension library for NetworkX. Please look at the Documentation, relevant Paper, Promo Video, and Ext

Benedek Rozemberczki 1.8k Jan 09, 2023
A Python Tools to imaging the shallow seismic structure

ShallowSeismicImaging Tools to imaging the shallow seismic structure, above 10 km, based on the ZH ratio measured from the ambient seismic noise, and

Xiao Xiao 9 Aug 09, 2022
A computer algebra system written in pure Python

SymPy See the AUTHORS file for the list of authors. And many more people helped on the SymPy mailing list, reported bugs, helped organize SymPy's part

SymPy 9.9k Dec 31, 2022
ASOUL直播间弹幕抓取&&数据分析

ASOUL直播间弹幕抓取&&数据分析(更新中) 这些文件用于爬取ASOUL直播间的弹幕(其他直播间也可以)和其他信息,以及简单的数据分析生成。

159 Dec 10, 2022
Meltano: ELT for the DataOps era. Meltano is open source, self-hosted, CLI-first, debuggable, and extensible.

Meltano is open source, self-hosted, CLI-first, debuggable, and extensible. Pipelines are code, ready to be version c

Meltano 625 Jan 02, 2023