某学校选课系统GIF验证码数据集 + Baseline模型 + 上下游相关工具

Overview

elective-dataset-2021spring

某学校2021春季选课系统GIF验证码数据集(29338张) + 准确率98.4%的Baseline模型 + 上下游相关工具。

数据集采用 知识共享署名-非商业性使用 4.0 国际许可协议 进行许可。

Baseline模型和上下游相关工具采用 MIT License 进行许可。

数据集

dataset/ 目录包含了收集到的所有带标签验证码数据,共29338张。

  • dataset/manual: 人工标注的带标签验证码GIF数据集,标签经过了elective验证因此都是正确的。共5471张。
  • dataset/auto-corrdataset/auto-fail-tagged: 模型自动标注的带标签验证码GIF数据集,其中 auto-corr 是识别正确(通过了elective验证)的部分,auto-fail-tagged 是识别错误然后手工重新标注的部分(此部分不保证正确性)。共22931(正确)+936(错误)张。

使用时请注意,由于 GitHub 的限制

  • auto-fail-tagged 在仓库中存储为7-zip压缩包;
  • manual 在仓库中存储为7个不超过48MB的7-zip分卷;
  • auto-corr 没有存储在仓库中,而是压缩为14个不超过95MB的7-zip分卷放在了 Release页面

Baseline 模型

baseline/ 目录包含一个简易的验证码识别模型。

此模型进行了提取关键帧、基于OpenCV的图像增强以及基于CNN的分类器等一系列工作以完成识别。

将训练集和测试集图片分别放入 set-trainset-test 后运行 train.py 进行训练,用一块TITAN RTX训练需要几分钟的时间。

用大约一万张图片训练好的 checkpoints/model_29.pth 能达到 98.4% 的整体精确度。

predict_bootstrap.py 在elective系统上测试当前模型,将检验正确的带标签图片放入 bootstrap_img_succ 目录,错误的图片放入 bootstrap_img_fail 目录。

上下游相关工具

  • crawl/: 验证码众包标注平台,可以从elective爬取验证码、辅助多名用户同时标注、检验正确性后将正确的数据放入 img_correct 目录。检验错误的验证码将被抛弃,这是初期的一个设计失误,这样将使得数据集的分布与真实分布有偏差。
  • retag/: 手工标注模型识别错误数据的工具。从 bootstrap_img_fail 读取标注错误图片,人工输入正确标注后移动到 bootstrap_img_fail_tagged
  • serve/: 提供在线验证码识别服务的 HTTP RPC 服务器。POST /fire 并传入base64编码的验证码GIF来进行识别。

数据处理过程

首先,我们设立了众包标注平台,多名志愿者累计标注了超过五千张验证码。

有了这些数据后,我们利用OpenCV进行了简单的图片增强、二值化、分字、裁切,然后随手糊了一个简单的CNN网络来识别。在随意调参之后,模型的整体(四个字)准确率接近95%。

然后,我们利用此模型来对数据集进行自举:爬取验证码后调用模型识别然后检验正确性,其中识别错误的部分手工标注。这样我们可以轻易地扩大数据集的规模,从而提升模型效果。

经过了更多的随意调参,模型的整体准确率可以达到98.4%。因为继续提升准确率意义不大,就没有继续优化。考虑到 PyTorch 安装比较麻烦,模型不易于部署到用户的设备上,我们实现了一个 HTTP API 可以用于云端识别。

相关工作

by Elector Quartet (按字典序的倒序 @xmcp, @SpiritedAwayCN, @Rabbit, @gzz)

You might also like...
Owner
xmcp
叶氏筛法第 NaN 代传人
xmcp
An implementation of Deep Forest 2021.2.1.

Deep Forest (DF) 21 DF21 is an implementation of Deep Forest 2021.2.1. It is designed to have the following advantages: Powerful: Better accuracy than

LAMDA Group, Nanjing University 795 Jan 03, 2023
Put blind watermark into a text with python

text_blind_watermark Put blind watermark into a text. Can be used in Wechat dingding ... How to Use install pip install text_blind_watermark Alice Pu

郭飞 164 Dec 30, 2022
Bare bones use-case for deploying a containerized web app (built in streamlit) on AWS.

Containerized Streamlit web app This repository is featured in a 3-part series on Deploying web apps with Streamlit, Docker, and AWS. Checkout the blo

Collin Prather 62 Jan 02, 2023
Official repository for MixFaceNets: Extremely Efficient Face Recognition Networks

MixFaceNets This is the official repository of the paper: MixFaceNets: Extremely Efficient Face Recognition Networks. (Accepted in IJCB2021) https://i

Fadi Boutros 51 Dec 13, 2022
🔎 Super-scale your images and run experiments with Residual Dense and Adversarial Networks.

Image Super-Resolution (ISR) The goal of this project is to upscale and improve the quality of low resolution images. This project contains Keras impl

idealo 4k Jan 08, 2023
ArtEmis: Affective Language for Art

ArtEmis: Affective Language for Art Created by Panos Achlioptas, Maks Ovsjanikov, Kilichbek Haydarov, Mohamed Elhoseiny, Leonidas J. Guibas Introducti

Panos 268 Dec 12, 2022
The official implementation of EIGNN: Efficient Infinite-Depth Graph Neural Networks (NeurIPS 2021)

EIGNN: Efficient Infinite-Depth Graph Neural Networks The official implementation of EIGNN: Efficient Infinite-Depth Graph Neural Networks (NeurIPS 20

Juncheng Liu 14 Nov 22, 2022
Implementation for our ICCV2021 paper: Internal Video Inpainting by Implicit Long-range Propagation

Implicit Internal Video Inpainting Implementation for our ICCV2021 paper: Internal Video Inpainting by Implicit Long-range Propagation paper | project

202 Dec 30, 2022
Fuzzer for Linux Kernel Drivers

difuze: Fuzzer for Linux Kernel Drivers This repo contains all the sources (including setup scripts), you need to get difuze up and running. Tested on

seclab 344 Dec 27, 2022
This dlib-based facial login system

Facial-Login-System This dlib-based facial login system is a technology capable of matching a human face from a digital webcam frame capture against a

Mushahid Ali 3 Apr 23, 2022
TagLab: an image segmentation tool oriented to marine data analysis

TagLab: an image segmentation tool oriented to marine data analysis TagLab was created to support the activity of annotation and extraction of statist

Visual Computing Lab - ISTI - CNR 49 Dec 29, 2022
A set of tools for converting a darknet dataset to COCO format working with YOLOX

darknet格式数据→COCO darknet训练数据目录结构(详情参见dataset/darknet): darknet ├── class.names ├── gen_config.data ├── gen_train.txt ├── gen_valid.txt └── images

RapidAI-NG 148 Jan 03, 2023
Algorithmic encoding of protected characteristics and its implications on disparities across subgroups

Algorithmic encoding of protected characteristics and its implications on disparities across subgroups This repository contains the code for the paper

Team MIRA - BioMedIA 15 Oct 24, 2022
Official repository of the paper Privacy-friendly Synthetic Data for the Development of Face Morphing Attack Detectors

SMDD-Synthetic-Face-Morphing-Attack-Detection-Development-dataset Official repository of the paper Privacy-friendly Synthetic Data for the Development

10 Dec 12, 2022
MATLAB codes of the book "Digital Image Processing Fourth Edition" converted to Python

Digital Image Processing Python MATLAB codes of the book "Digital Image Processing Fourth Edition" converted to Python TO-DO: Refactor scripts, curren

Merve Noyan 24 Oct 16, 2022
optimization routines for hyperparameter tuning

Hyperopt: Distributed Hyperparameter Optimization Hyperopt is a Python library for serial and parallel optimization over awkward search spaces, which

Marc Claesen 398 Nov 09, 2022
Implementation of Squeezenet in pytorch, pretrained models on Cifar 10 data to come

Pytorch Squeeznet Pytorch implementation of Squeezenet model as described in https://arxiv.org/abs/1602.07360 on cifar-10 Data. The definition of Sque

gaurav pathak 86 Oct 28, 2022
Compare outputs between layers written in Tensorflow and layers written in Pytorch

Compare outputs of Wasserstein GANs between TensorFlow vs Pytorch This is our testing module for the implementation of improved WGAN in Pytorch Prereq

Hung Nguyen 72 Dec 20, 2022
TensorFlow Implementation of Unsupervised Cross-Domain Image Generation

Domain Transfer Network (DTN) TensorFlow implementation of Unsupervised Cross-Domain Image Generation. Requirements Python 2.7 TensorFlow 0.12 Pickle

Yunjey Choi 864 Dec 30, 2022
[TIP 2020] Multi-Temporal Scene Classification and Scene Change Detection with Correlation based Fusion

Multi-Temporal Scene Classification and Scene Change Detection with Correlation based Fusion Code for Multi-Temporal Scene Classification and Scene Ch

Lixiang Ru 33 Dec 12, 2022