PatrickStar enables Larger, Faster, Greener Pretrained Models for NLP. Democratize AI for everyone.

Overview

PatrickStar: Parallel Training of Large Language Models via a Chunk-based Memory Management


Meeting PatrickStar

Pre-trained models (PTMs) are becoming the hotspot of both NLP research and industry applications. They are trained on massive data and have learned generic features of the language. In practice, they are fine-tuned for downstream tasks with task-specific datasets. In this way, PTMs have achieved great performance on almost every task. However, training PTMs requires enormous hardware resources, making it accessible to only a small portion of the AI community. Now, PatrickStar will make PTM training available to everyone!

The out-of-memory (OOM) error is the nightmare of every engineer training PTMs. To prevent such errors, we often have to introduce more GPUs to store the model parameters. PatrickStar brings a better solution to this problem. With heterogeneous training (which DeepSpeed ZeRO Stage 3 also uses), PatrickStar makes full use of both CPU and GPU memory, so that you can use fewer GPUs to train larger models.

We noticed that GPU memory usage varies during training, whereas current heterogeneous training solutions statically split the model and optimizer states between CPU and GPU. To make better use of the GPU, PatrickStar proposes dynamic memory scheduling with the help of a chunk-based memory management module. The memory manager can offload everything except the currently computing part of the model to CPU, which lets you train a much larger model on the same hardware. In terms of performance, the chunk-based memory management takes advantage of the linear structure of transformer-based PTMs, so it inherently prefetches the upcoming layers to the GPU, resulting in great performance.
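To make the idea concrete, here is a minimal sketch of chunk-based scheduling (illustrative only; the class and method names are invented and this is not PatrickStar's actual implementation):

import torch


class Chunk:
    """A fixed-size flat buffer holding the payloads of several parameters."""

    def __init__(self, num_elems, dtype=torch.half):
        self.payload = torch.empty(num_elems, dtype=dtype, device="cpu")

    def move_to(self, device):
        # Moving a whole chunk amortizes the transfer cost over many parameters.
        if self.payload.device.type != torch.device(device).type:
            self.payload = self.payload.to(device, non_blocking=True)


class ChunkManager:
    """Keeps only the chunks needed right now (plus one prefetched) on the GPU."""

    def __init__(self, chunks, gpu_budget):
        self.chunks = chunks
        self.gpu_budget = gpu_budget  # max number of chunks resident on GPU

    def access(self, idx):
        # Evict chunks back to CPU when the GPU budget is exceeded, then fetch
        # the requested chunk. Because transformer layers run in a fixed order,
        # the upcoming chunk can be prefetched naively.
        resident = [c for c in self.chunks if c.payload.device.type == "cuda"]
        while len(resident) >= self.gpu_budget:
            resident.pop(0).move_to("cpu")
        self.chunks[idx].move_to("cuda")
        if idx + 1 < len(self.chunks):
            self.chunks[idx + 1].move_to("cuda")  # prefetch the next layer's chunk
        return self.chunks[idx].payload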

In experiments, PatrickStar is able to train a 12B-parameter model with 8 Tesla V100 GPUs and 240GB of GPU memory, which is twice as large as the state of the art. The performance of PatrickStar is also better for models of the same size. The label deeps indicates the performance of DeepSpeed v0.4.3 using the official DeepSpeed ZeRO Stage 3 example, with activation optimizations enabled by default.

[Figure: performance comparison between PatrickStar and DeepSpeed]

We've also trained the CLUE-GPT2 model with PatrickStar; the loss and accuracy curves are shown below:

[Figure: CLUE-GPT2 loss and accuracy curves]

Installation

pip install .

Note that PatrickStar requires gcc version 7 or higher. You can also use NVIDIA NGC images; the following image has been tested:

docker pull nvcr.io/nvidia/pytorch:21.06-py3
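For example, you can start a container from that image like this (the mount path is just an illustration; adjust it to your setup):

docker run --gpus all -it --rm -v $PWD:/workspace/patrickstar nvcr.io/nvidia/pytorch:21.06-py3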

Usage

PatrickStar is based on PyTorch, which makes it easy to migrate an existing PyTorch project. Here is an example of using PatrickStar:

from patrickstar.runtime import initialize_engine

config = {
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": 0.001,
            "betas": (0.9, 0.999),
            "eps": 1e-6,
            "weight_decay": 0,
            "use_hybrid_adam": True,
        },
    },
    "fp16": {  # loss scaler params
        "enabled": True,
        "loss_scale": 0,
        "initial_scale_power": 2 ** 3,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1,
    },
    "default_chunk_size": 64 * 1024 * 1024,
    "release_after_init": True,
    "use_cpu_embedding": False,
}

def model_func():
    # MyModel is a class derived from torch.nn.Module
    return MyModel(...)

model, optimizer = initialize_engine(model_func=model_func, local_rank=0, config=config)

...

for data in dataloader:
    optimizer.zero_grad()

    loss = model(data)
    model.backward(loss)
    optimizer.step()

The config uses the same format as the DeepSpeed configuration JSON; it mainly includes parameters for the optimizer and the loss scaler, plus some PatrickStar-specific options.

For a detailed explanation of the above example, please check the guide here.

For more examples, please check here.
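The example above hard-codes local_rank=0, i.e. a single GPU. For multi-GPU data-parallel training, a common pattern (a sketch assuming the script is saved as train.py; see the linked examples for the exact flags used by PatrickStar) is to start it with PyTorch's distributed launcher:

python -m torch.distributed.launch --nproc_per_node=8 train.py

where train.py reads the --local_rank argument added by the launcher and passes it to initialize_engine instead of the hard-coded 0.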

Inside PatrickStar

See this doc for the idea behind PatrickStar.

License

BSD 3-Clause License

Cite Us

@article{fang2021patrickstar,
  title={PatrickStar: Parallel Training of Pre-trained Models via a Chunk-based Memory Management},
  author={Fang, Jiarui and Yu, Yang and Zhu, Zilin and Li, Shenggui and You, Yang and Zhou, Jie},
  journal={arXiv preprint arXiv:2108.05818},
  year={2021}
}

Contact Us

{jiaruifang, zilinzhu, josephyu}@tencent.com

Powered by WeChat AI Team, Tencent NLP Oteam.

Comments
  • CPU Embedding

    Because the embedding parameters are much larger than those of other layers, we do not hand the embedding parameters to chunk management, and we pin their computation on the CPU. Before each computation, a hook copies the input from GPU to CPU; after the CPU embedding layers finish, the output activations are copied back to the GPU for the rest of the computation. However, some PyTorch versions do not support CPU embedding computation on torch.half (for example, torch 1.4.0+cu100 does not, while 1.7.1+cu110 does). Currently the CPU embedding also keeps two copies of parameters, param fp16 and param fp32, but param fp16 is actually stored as torch.float and used for the FWD and BWD computation, while param fp32 is used for the ADAM computation. Moreover, every process stores the full set of parameters. This wastes a huge amount of memory: in fact only one torch.float copy of the params is needed, and it could be distributed across multiple processes in a model-parallel fashion.
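    A minimal sketch of this device-switching idea (illustrative only, not the actual PatrickStar module): the embedding weight stays on the CPU in torch.float, and activations are moved across devices around it.

      import torch
      import torch.nn as nn

      class CPUEmbedding(nn.Module):
          """Runs the embedding lookup on CPU while the rest of the model is on GPU."""
          def __init__(self, num_embeddings, embedding_dim):
              super().__init__()
              # Keep the weight as float32 on CPU; some torch versions cannot do
              # half-precision embedding lookups on CPU.
              self.embedding = nn.Embedding(num_embeddings, embedding_dim).float()

          def forward(self, input_ids):
              out_device = input_ids.device
              # Equivalent of the hook: copy the input from GPU to CPU ...
              cpu_out = self.embedding(input_ids.cpu())
              # ... and copy the output activations back to the GPU as fp16.
              return cpu_out.to(out_device, dtype=torch.half)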

    help wanted 
    opened by feifeibear 6
  • C++ Adam speed

    Performance results from Aug 10: log.GPT2small_gpu_1_cs_64_bs_128_cpueb_1_margin_0.8_warmup_0.2_gpu_0.8_adamcvt_1

    2021-08-10:14:34:53,509 INFO [memory_monitor.py:65] CPU Virtual Memory: used = 15.08 GB, percent = 96.6%
    2021-08-10:14:34:53,509 INFO [test_bert.py:223] ckp True fp16 True ps True: step elapse 5.177955627441406 sec/iter, 18.463766371092152 Tflops
    2021-08-10:14:34:53,509 INFO [test_bert.py:225] model 0.72940493
    2021-08-10:14:34:53,509 INFO [global_timer.py:45] *********** PROFILE RESULTS *************
    2021-08-10:14:34:53,509 INFO [global_timer.py:50] CHUNK_LIST_prepare_device, 0, 0.0 %
    2021-08-10:14:34:53,509 INFO [global_timer.py:50] CHUNK_allocate_payload, 0, 0.0 %
    2021-08-10:14:34:53,509 INFO [global_timer.py:50] CLIENT_access, 0.019408226013183594, 0.338427821424322 %
    2021-08-10:14:34:53,509 INFO [global_timer.py:50] CLIENT_release, 0.014924049377441406, 0.2602357121256555 %
    2021-08-10:14:34:53,509 INFO [global_timer.py:50] chunk_cpu_gpu_move, 0, 0.0 %
    2021-08-10:14:34:53,509 INFO [global_timer.py:50] CLIENT_access_dist, 0.03873419761657715, 0.6754213447995139 %
    2021-08-10:14:34:53,509 INFO [global_timer.py:50] CLIENT_release_dist, 0.3606679439544678, 6.289089298897653 %
    2021-08-10:14:34:53,509 INFO [global_timer.py:50] chunk_gpu_cpu_move, 0, 0.0 %
    2021-08-10:14:34:53,509 INFO [global_timer.py:50] CHUNK_LIST_chunk_move, 0, 0.0 %
    2021-08-10:14:34:53,509 INFO [global_timer.py:50] FWD, 0.28232502937316895, 4.9229973187357 %
    2021-08-10:14:34:53,509 INFO [global_timer.py:50] BWD, 2.9886157512664795, 52.1135067722565 %
    2021-08-10:14:34:53,509 INFO [global_timer.py:50] ADAM_prepare_data_fp16_grad_to_fp32_grad_copy, 0.2039637565612793, 3.5565852198787224 %
    2021-08-10:14:34:53,509 INFO [global_timer.py:50] ADAM_prepare_data, 0.22702884674072266, 3.958779022397416 %
    2021-08-10:14:34:53,509 INFO [global_timer.py:50] ADAM_compute, 0.013135433197021484, 0.2290470049819615 %
    2021-08-10:14:34:53,509 INFO [global_timer.py:50] ADAM_param_fp32_to_fp16, 0.5844182968139648, 10.190700111226695 %
    2021-08-10:14:34:53,509 INFO [global_timer.py:50] ADAM_release_data, 0.016661882400512695, 0.29053889612597344 %
    2021-08-10:14:34:53,509 INFO [global_timer.py:50] ADAM, 0.9849364757537842, 17.174671477149886 %
    2021-08-10:14:34:53,509 INFO [global_timer.py:76] *********** DATA MOVE RESULTS *************
    2021-08-10:14:34:53,509 INFO [global_timer.py:86] chunk_cpu_gpu_move: 0.0 MB
    2021-08-10:14:34:53,509 INFO [global_timer.py:86] chunk_gpu_cpu_move: 0.0 MB
    2021-08-10:14:34:53,509 INFO [global_timer.py:83] ADAM_prepare_data_fp16_grad_to_fp32_grad_copy: 2782.4589920043945 MB, 393 times, 13641.92854120348 MB/s
    2021-08-10:14:34:53,509 INFO [global_timer.py:83] ADAM_param_fp32_to_fp16: 2782.4589920043945 MB, 393 times, 4761.0744002597885 MB/s

    opened by feifeibear 5
  • [idea] Further reduce memory consumption by fusing FWD+BWD+ADAM

    Here is an idea: we could shrink the memory footprint even further. We can keep only param fp32. During FWD, the param fp16 needed by a submodule (e.g. Linear) is allocated temporarily and its data is copied from param fp32; it is released as soon as the computation finishes. During BWD, a submodule converts from param fp32 again when needed; as soon as grad fp16 is produced, the Adam computation starts immediately and updates param fp32, so grad fp16 can also be thrown away. The total memory consumption drops from 14M to 12M, i.e. it equals the size of the optimizer states (M is the number of parameters). In other words, fuse FWD, BWD and ADAM. A paper supports this idea: OPTIMIZER FUSION: EFFICIENT TRAINING WITH BETTER LOCALITY AND PARALLELISM https://arxiv.org/pdf/2104.00237.pdf
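    A rough sketch of the "update as soon as the gradient is ready" part of this idea, using per-parameter gradient hooks and a plain fp32 Adam step (illustrative only; the temporary fp16 parameter allocation is omitted):

      import torch

      def attach_fused_adam(model, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
          states = {}

          def make_hook(p):
              def hook(grad):
                  # Run Adam for this parameter immediately during backward, so the
                  # gradient no longer has to be kept around until optimizer.step().
                  if p not in states:
                      states[p] = {"step": 0,
                                   "m": torch.zeros_like(p, dtype=torch.float32),
                                   "v": torch.zeros_like(p, dtype=torch.float32)}
                  st = states[p]
                  st["step"] += 1
                  g = grad.float()
                  st["m"].mul_(betas[0]).add_(g, alpha=1 - betas[0])
                  st["v"].mul_(betas[1]).addcmul_(g, g, value=1 - betas[1])
                  m_hat = st["m"] / (1 - betas[0] ** st["step"])
                  v_hat = st["v"] / (1 - betas[1] ** st["step"])
                  with torch.no_grad():
                      # Assumes this parameter is not needed by any later backward op.
                      p.add_((-lr * m_hat / (v_hat.sqrt() + eps)).to(p.dtype))
                  # In a full implementation the grad fp16 chunk could be released here.
              return hook

          for p in model.parameters():
              p.register_hook(make_hook(p))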

    opened by feifeibear 5
  • Support partial chunk management

    As we hope to support MoE in #187, and MoE mainly has a model-parallel structure instead of a data-parallel one, we need to support managing only part of the model with chunks.

    There are several design choices to make, including but not limited to:

    • Shall we use mixed precision training in the unmanaged part of the model?
    • How could we connect the backward of the unmanaged parts with the managed parts, i.e. if there are 3 parts in the model:
      class Net(nn.Module):
        def __init__(self, ...):
          super().__init__()
          self.A = SubNetA(...)  # managed by chunk
          self.B = SubNetB(...)  # not managed by chunk
          self.C = SubNetC(...)  # managed by chunk
      

      Then self.A and self.C need model.backward(loss) while self.B only needs loss.backward().

    cc @feifeibear

    enhancement 
    opened by zhuzilin 4
  • RuntimeError: chunk move failed.

    While training a GPT3_6B model on 4x V100, the program stopped because of a runtime error at step 47. The exception looks like this:

    RuntimeError: chunk move failed. cpu has not 385.875968 MB memory space. Free space is 320.948224 MB. The reason may be that the overall memory of CPU and GPU is not enough for the model.

    But the training process only used about 60% of the CPU memory, and the overall_cpu_mem_ratio is 0.9.

    opened by ouyangliqi 3
  • support using PatrickStar on MegatronDeepSpeed?

    PatrickStar is awesome, it helps reduce memory used by model state!

    Current trends show MegatronDeepSpeed being used as the framework to train transformer-based NLP models, for both pretraining and fine-tuning. So will you guys support PatrickStar on MegatronDeepSpeed?

    opened by Jack47 3
  • Improve Memory Saving Communication.

    Memory-saving communication (MSC): use one-to-all communication to replace the original collective communication. More specifically, reduce-scatter is replaced with N reduce operations, and all-gather is replaced with N bcast operations. In this way, we do not need to keep an Nx chunk buffer for distributed training, thereby saving GPU memory. I have implemented MSC, but it may not be optimal: I triggered communication at the granularity of a chunk group (N chunks), which may cause chunks to move frequently between CPU and GPU.

    This MR further optimizes MSC: communication is triggered at the granularity of individual chunks.
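    A sketch of the communication pattern described above (assuming torch.distributed is initialized; the round-robin chunk ownership is an illustrative choice, not necessarily how PatrickStar assigns owners):

      import torch.distributed as dist

      def reduce_chunks(chunks):
          # Instead of one reduce_scatter over an N-times-larger buffer,
          # issue one reduce per chunk, targeting the rank that owns it.
          for i, chunk in enumerate(chunks):
              dist.reduce(chunk, dst=i % dist.get_world_size())

      def gather_chunks(chunks):
          # Instead of one all_gather into an N-times-larger buffer,
          # broadcast each chunk from its owner rank.
          for i, chunk in enumerate(chunks):
              dist.broadcast(chunk, src=i % dist.get_world_size())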

    opened by feifeibear 3
  • Add cuda event to enable async move

    This MR introduces a compute_finished_event for each chunk, to both enable async moves and prevent the error solved by #243.

    This MR reduces the per-step running time of the GPT_DS_20B model from 28s to 25s.
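    A minimal sketch of the event pattern (the member name compute_finished_event comes from this MR; the rest is an assumption about how it is used): the compute stream records the event after the last kernel touching the chunk, and a dedicated copy stream waits on it before moving the chunk, so the move can overlap with later compute without racing with earlier compute.

      import torch

      copy_stream = torch.cuda.Stream()

      class Chunk:
          def __init__(self, payload):
              self.payload = payload  # a CUDA tensor
              self.compute_finished_event = torch.cuda.Event()

          def mark_compute_done(self):
              # Record on the current (compute) stream right after the last kernel
              # that reads or writes this chunk has been enqueued.
              self.compute_finished_event.record()

          def async_move_to_cpu(self):
              # The copy stream waits only for this chunk's compute, not for the
              # whole device, so the D2H copy overlaps with unrelated kernels.
              with torch.cuda.stream(copy_stream):
                  copy_stream.wait_event(self.compute_finished_event)
                  cpu_buf = torch.empty(self.payload.shape, dtype=self.payload.dtype,
                                        device="cpu", pin_memory=True)
                  cpu_buf.copy_(self.payload, non_blocking=True)
                  # A real implementation must keep the GPU buffer alive until the
                  # copy has actually finished before reusing or freeing it.
                  self.payload = cpu_buf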

    Before:

    LOSS of step 4: 36.15625
    After step 4. using patrickstar, gradient checkpoint: True, fp16 True
    MA 24654.34 MB         Max_MA 26702.34 MB         CA 35922.0 MB         Max_CA 35922 MB 
    CPU Virtual Memory: used = 383.02 GB, percent = 38.0%
    Step 4 elaspe 28.63487219810486 s, 93.65687834254476 Tflops
    CLIENT_access_dist ........... 7.611969232559204, 26.592323953642254 %
    CLIENT_release ............... 0.05824542045593262, 0.20347968341163614 %
    CHUNK_LIST_prepare_device .... 2.1636338233947754, 7.558629021077558 %
    chunk_cpu_gpu_move ........... 5.4131646156311035, 18.910826183786217 %
    chunk_gpu_cpu_move ........... 4.49863076210022, 15.715913046770075 %
    CHUNK_LIST_chunk_move ........ 4.499696254730225, 15.719635332596733 %
    FWD .......................... 5.740366220474243, 20.05390109755791 %
    CLIENT_release_dist .......... 0.013424396514892578, 0.04689796951348792 %
    CHUNK_LIST_make_room ......... 2.359687566757202, 8.243540441044644 %
    BWD .......................... 12.1508309841156, 42.44878348344568 %
    ADAM_prepare_data_grad_copy .. 2.031972646713257, 7.098672266725892 %
    ADAM_prepare_data ............ 2.0699024200439453, 7.231179478602624 %
    ADAM_compute ................. 5.176488876342773, 18.084002304335637 %
    ADAM_param_fp32_to_fp16 ...... 3.381661891937256, 11.813795587537914 %
    ADAM_release_data ............ 0.041625261306762695, 0.14541735515558452 %
    CLIENT_access ................ 0.02957916259765625, 0.10333445262887372 %
    ADAM ......................... 10.73348879814148, 37.49731541899641 %
    TOTAL ........................ 28.624686002731323
    ------------- DATA MOVE RESULTS --------------
    chunk_cpu_gpu_move: 456704.0 MB, 446 times, 84369.13200112506 MB/s
    chunk_gpu_cpu_move: 434176.0 MB, 424 times, 96512.9220334815 MB/s
    ADAM_prepare_data_grad_copy: 195130.4687690735 MB, 2045 times, 96030.06668652732 MB/s
    ADAM_param_fp32_to_fp16: 390260.937538147 MB, 2045 times, 115405.07301118085 MB/s
    ******************** LOSS ********************
    [1.125, 51.21875, 107.125, 74.75, 36.15625]
    

    After:

    CPU Virtual Memory: used = 383.02 GB, percent = 38.0%
    Step 4 elaspe 25.567237615585327 s, 104.89411418374799 Tflops
    CLIENT_access_dist ........... 4.2600257396698, 16.6691598791813 %
    CLIENT_release ............... 0.057218313217163086, 0.22389095027108755 %
    CHUNK_LIST_prepare_device .... 1.330047607421875, 5.204376116458931 %
    chunk_cpu_gpu_move ........... 2.8953936100006104, 11.329457131871948 %
    chunk_gpu_cpu_move ........... 2.739708662033081, 10.720273655751937 %
    CHUNK_LIST_chunk_move ........ 2.74080228805542, 10.724552932011854 %
    FWD .......................... 3.507220506668091, 13.723489699319316 %
    CLIENT_release_dist .......... 0.013592243194580078, 0.05318542393237539 %
    CHUNK_LIST_make_room ......... 1.4351551532745361, 5.61565402729667 %
    BWD .......................... 11.206431150436401, 43.84992108900761 %
    ADAM_prepare_data_grad_copy .. 2.0489084720611572, 8.017224539409424 %
    ADAM_prepare_data ............ 2.0858242511749268, 8.161673202801694 %
    ADAM_compute ................. 5.232354164123535, 20.47380777399615 %
    ADAM_param_fp32_to_fp16 ...... 3.4178943634033203, 13.373963228245337 %
    ADAM_release_data ............ 0.0403749942779541, 0.1579843118019314 %
    CLIENT_access ................ 0.02886509895324707, 0.1129469582541441 %
    ADAM ......................... 10.842679738998413, 42.42658921167307 %
    TOTAL ........................ 25.556331396102905
    ------------- DATA MOVE RESULTS --------------
    chunk_cpu_gpu_move: 456704.0 MB, 446 times, 157734.68533692858 MB/s
    chunk_gpu_cpu_move: 434176.0 MB, 424 times, 158475.2444728218 MB/s
    ADAM_prepare_data_grad_copy: 195130.4687690735 MB, 2045 times, 95236.30334388558 MB/s
    ADAM_param_fp32_to_fp16: 390260.937538147 MB, 2045 times, 114181.68499202828 MB/s
    ******************** LOSS ********************
    [1.125, 51.21875, 107.125, 74.75, 36.15625]
    
    opened by zhuzilin 3
  • Mem cache to avoid too much allocation and free.

    The improvement to payload allocation is quite significant.

    Comparison of two runs of the 40B model on 8 GPUs; the only difference is whether mem_cache is used:
    log.GPT_DS_40B_gpu_8_cs_384_bs_8_cpueb_0_lightseq_0_offload_0_SP_0_AMM_1_MSC_1_CACHE_1 CHUNK_allocate_payload_cuda ........... 4.351799488067627, 4.512294773814017 %
    log.GPT_DS_40B_gpu_8_cs_384_bs_8_cpueb_0_lightseq_0_offload_0_SP_0_AMM_1_MSC_1_CACHE_0 CHUNK_allocate_payload_cuda ........... 6.351553440093994, 6.1092681342726864 %

    The same comparison for the 40B model on 4 GPUs; again the only difference is whether mem_cache is used:
    log.GPT_DS_40B_gpu_4_cs_384_bs_8_cpueb_0_lightseq_0_offload_0_SP_0_AMM_1_MSC_1_CACHE_1 CHUNK_allocate_payload_cuda ........... 1.5783910751342773, 2.8674014523329285 %
    log.GPT_DS_40B_gpu_4_cs_384_bs_8_cpueb_0_lightseq_0_offload_0_SP_0_AMM_1_MSC_1_CACHE_0 CHUNK_allocate_payload_cuda ........... 6.586869716644287, 7.966604944847982 %

    And for the 20B model on 4 GPUs:
    log.GPT_DS_20B_gpu_4_cs_384_bs_4_cpueb_0_lightseq_0_offload_0_SP_0_AMM_1_MSC_1_CACHE_1 CHUNK_allocate_payload_cuda ........... 1.8855564594268799, 8.558600729826443 %
    log.GPT_DS_20B_gpu_4_cs_384_bs_4_cpueb_0_lightseq_0_offload_0_SP_0_AMM_1_MSC_1_CACHE_0 CHUNK_allocate_payload_cuda ........... 7.258604526519775, 52.21233450104307 %
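    The idea is essentially a free list keyed by size, dtype and device; a rough sketch (class and method names are invented for illustration, not PatrickStar's actual cache):

      import torch
      from collections import defaultdict

      class PayloadCache:
          """Reuse freed chunk payloads instead of asking the allocator again."""
          def __init__(self):
              self.pool = defaultdict(list)  # (numel, dtype, device) -> payloads

          def allocate(self, numel, dtype, device):
              key = (numel, dtype, str(device))
              if self.pool[key]:
                  return self.pool[key].pop()  # hit: reuse a cached payload
              return torch.empty(numel, dtype=dtype, device=device)  # miss: allocate

          def release(self, payload):
              key = (payload.numel(), payload.dtype, str(payload.device))
              self.pool[key].append(payload)  # keep it for the next allocation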

    opened by feifeibear 3
  • Move the gradients of torch based params to CPU before Adam

    CPUAdam requires the torch-based params to be on CPU before the update; otherwise we need to move the momentum and variance of the tensor to GPU. There are 2 designs I could think of:

    1. Load the torch-based params to GPU for the entire forward and backward computation and offload them back in the Adam loop.
    2. Load them to GPU before modules and offload after modules for both forward and backward.

    This PR implements the latter design, which may introduce more offloading overhead, but is more consistent with other parts of the current design.

    opened by zhuzilin 3
  • Current limitations of chunk reuse

    The current chunk reuse scheme reduces the overall memory footprint from DeepSpeed's 18M to 14M (M is the number of parameters). However, the current PatrickStar implementation has a limitation: it designs the reuse scheme statically. Before training starts, it mandates that the chunk memory holding param fp16 will be reused by grad fp16. This assumes that a parameter is never updated twice during BWD. That causes no problem for BERT and GPT, but it does not work for LSTMs or seq2seq transformers.

    For the latter, we could switch to a dynamic reuse scheme: during BWD, monitor the gaps in chunks in real time (when a param fp16 is no longer needed, it can be released, leaving a gap in its chunk) and place the grad fp16 in those gaps. However, there is currently little demand for very large LSTM or seq2seq models, so we can shelve this requirement for now and implement it when it becomes necessary.

    enhancement 
    opened by feifeibear 3
  • Support communication config before training

    Currently, training will start whether or not the configs on the 2 nodes are the same. This may cause weird results during benchmarking. We should consider communicating the config among nodes to make sure they are running the same program...
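    One simple way to do this (a sketch assuming torch.distributed is already initialized and the config is JSON-serializable):

      import json
      import torch.distributed as dist

      def check_config_consistency(config):
          # Rank 0 broadcasts its config; every rank compares it with its own copy
          # and refuses to start training if they differ.
          local = json.dumps(config, sort_keys=True)
          payload = [local]
          dist.broadcast_object_list(payload, src=0)
          if payload[0] != local:
              raise ValueError("config differs from rank 0, aborting before training")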

    opened by zhuzilin 0
  • PatrickStar's Performance in Models Like GANs

    Hi! I am a newbie in this field. DeepSpeed provides a tutorial on GAN (https://www.deepspeed.ai/tutorials/gan/). I am curious about PatrickStar's performance in models like GANs or other CV models. I really hope that PatrickStar can make my poor GPU accommodate a large-scale GAN.

    opened by openRiemann 2
  • Support TencentPretrain

    TencentPretrain is a repo from the TEG Data Security Center; we can make use of their model structures and data: https://git.woa.com/TencentNLP/TencentPretrain/merge_requests/61 TencentPretrain also has a community open-source counterpart: https://github.com/dbiir/UER-py

    documentation 
    opened by feifeibear 5
  • Add CI

    We would like to have CI that runs unit tests each time an MR is proposed to the develop and master branches. However, we currently have no idea how to find a GPU to run the unit tests. Does anyone have ideas?

    help wanted 
    opened by feifeibear 1
  • Do we really need model parallelism (MP)?

    The MP trend was brought into PTM training by Megatron-LM, which partitions the model by inserting customized collective communication operations into the transformer implementation. Model parallelism has many drawbacks:

    1. Both FWD and BWD involve a large amount of global communication of activations, and the communication volume is proportional to the batch size. Not only is the communication volume larger than in DP, it also limits the batch size and therefore the computational scale of MP training, hurting compute performance (larger batches compute more efficiently).
    2. MP requires customized changes to the model definition code. That is why the DeepSpeed examples are also built on top of Megatron-LM. Some works try to simplify these changes, e.g. Mesh-TensorFlow and Alibaba's Whale; PyTorch seems to have no comparable work. From a benchmark-chasing perspective this does not matter much, but from a usability perspective algorithm engineers will not accept it, because on the inference side the custom parallel operators still have to be converted back to serial PyTorch.
    3. Under combinations of HP (heterogeneous parallelism), MP, PP and DP, the use of MP has become very limited and it is on the way to being replaced. DeepSpeed places MP within a node, with PP and DP across nodes. With the introduction of HP+DP, the GPU memory wall has been pushed further back, and MP's main advantage is being taken over by HP and ZeRO-DP; it is not even certain that MP will still be used within a node in the future.

    MP and PatrickStar

    In PatrickStar, the maximum GPU memory consumption is related to the chunk size. Even without using heterogeneous memory, i.e. placing all chunks on the GPU, the size of the model data is already 1/N of the original, similar to the consumption under MP. It is enough for PatrickStar to be compatible with PP; it does not need to be compatible with MP. Zero-Offload chose to be compatible with MP, which is strange. Reading the code, I think the reason is that ZeRO-3's communication uses a very poor design: it has to temporarily allocate a buffer of size world_size * tensor_numel on the GPU, and because of prefetching, several such buffers may be allocated at the same time. For layers with large parameters such as the embedding layer, this can blow up the memory, so MP is needed to reduce the tensor_numel of a single process.

    documentation 
    opened by feifeibear 5